public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH] libgcc: Thumb-1 Floating-Point Library for Cortex M0
@ 2020-11-12 23:04 Daniel Engel
  2020-11-26  9:14 ` Christophe Lyon
  0 siblings, 1 reply; 26+ messages in thread
From: Daniel Engel @ 2020-11-12 23:04 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 7165 bytes --]

Hi, 

This patch adds an efficient assembly-language implementation of IEEE-754 compliant floating point routines for Cortex M0 EABI (v6m, thumb-1).  This is the libgcc portion of a larger library originally described in 2018:

    https://gcc.gnu.org/legacy-ml/gcc/2018-11/msg00043.html

Since that time, I've separated the libm functions for submission to newlib.  The remaining libgcc functions in the attached patch have the following characteristics:

    Function(s)                     Size (bytes)        Cycles          Stack   Accuracy
    __clzsi2                        42                  23              0       exact
    __clzsi2 (OPTIMIZE_SIZE)        22                  55              0       exact
    __clzdi2                        8+__clzsi2          4+__clzsi2      0       exact
   
    __umulsidi3                     44                  24              0       exact
    __mulsidi3                      30+__umulsidi3      24+__umulsidi3  8       exact
    __muldi3 (__aeabi_lmul)         10+__umulsidi3      6+__umulsidi3   0       exact
    __ashldi3 (__aeabi_llsl)        22                  13              0       exact
    __lshrdi3 (__aeabi_llsr)        22                  13              0       exact
    __ashrdi3 (__aeabi_lasr)        22                  13              0       exact
   
    __aeabi_lcmp                    20                   13             0       exact
    __aeabi_ulcmp                   16                  10              0       exact
   
    __udivsi3 (__aeabi_uidiv)       56                  72 – 385        0       < 1 lsb
    __divsi3 (__aeabi_idiv)         38+__udivsi3        26+__udivsi3    8       < 1 lsb
    __udivdi3 (__aeabi_uldiv)       164                 103 – 1394      16      < 1 lsb
    __udivdi3 (OPTIMIZE_SIZE)       142                 120 – 1392      16      < 1 lsb
    __divdi3 (__aeabi_ldiv)         54+__udivdi3        36+__udivdi3    32      < 1 lsb
   
    __shared_float                  178        
    __shared_float (OPTIMIZE_SIZE)  154   
        
    __addsf3 (__aeabi_fadd)         116+__shared_float  31 – 76         8       <= 0.5 ulp
    __addsf3 (OPTIMIZE_SIZE)        112+__shared_float  74              8       <= 0.5 ulp
    __subsf3 (__aeabi_fsub)         8+__addsf3          6+__addsf3      8       <= 0.5 ulp
    __aeabi_frsub                   8+__addsf3          6+__addsf3      8       <= 0.5 ulp
    __mulsf3 (__aeabi_fmul)         112+__shared_float  73 – 97         8       <= 0.5 ulp
    __mulsf3 (OPTIMIZE_SIZE)        96+__shared_float   93              8       <= 0.5 ulp
    __divsf3 (__aeabi_fdiv)         132+__shared_float  83 – 361        8       <= 0.5 ulp
    __divsf3 (OPTIMIZE_SIZE)        120+__shared_float  263 – 359       8       <= 0.5 ulp
   
    __cmpsf2/__lesf2/__ltsf2        72                  33              0       exact
    __eqsf2/__nesf2                 4+__cmpsf2          3+__cmpsf2      0       exact
    __gesf2/__gesf2                 4+__cmpsf2          3+__cmpsf2      0       exact
    __unordsf2 (__aeabi_fcmpun)     4+__cmpsf2          3+__cmpsf2      0       exact
    __aeabi_fcmpeq                  4+__cmpsf2          3+__cmpsf2      0       exact
    __aeabi_fcmpne                  4+__cmpsf2          3+__cmpsf2      0       exact
    __aeabi_fcmplt                  4+__cmpsf2          3+__cmpsf2      0       exact
    __aeabi_fcmple                  4+__cmpsf2          3+__cmpsf2      0       exact
    __aeabi_fcmpge                  4+__cmpsf2          3+__cmpsf2      0       exact
   
    __floatundisf (__aeabi_ul2f)    14+__shared_float   40 – 81         8       <= 0.5 ulp
    __floatundisf (OPTIMIZE_SIZE)   14+__shared_float   40 – 237        8       <= 0.5 ulp
    __floatunsisf (__aeabi_ui2f)    0+__floatundisf     1+__floatundisf 8       <= 0.5 ulp
    __floatdisf (__aeabi_l2f)       14+__floatundisf    7+__floatundisf 8       <= 0.5 ulp
    __floatsisf (__aeabi_i2f)       0+__floatdisf       1+__floatdisf   8       <= 0.5 ulp
   
    __fixsfdi (__aeabi_f2lz)        74                  27 – 33         0       exact
    __fixunssfdi (__aeabi_f2ulz)    4+__fixsfdi         3+__fixsfdi     0       exact
    __fixsfsi (__aeabi_f2iz)        52                  19              0       exact
    __fixsfsi (OPTIMIZE_SIZE)       4+__fixsfdi         3+__fixsfdi     0       exact
    __fixunssfsi (__aeabi_f2uiz)    4+__fixsfsi         3+__fixsfsi     0       exact
     
    __extendsfdf2 (__aeabi_f2d)     42+__shared_float 38             8     exact
    __aeabi_d2f                     56+__shared_float 54 – 58     8     <= 0.5 ulp
    __aeabi_h2f                     34+__shared_float 34             8     exact
    __aeabi_f2h                     84                 23 – 34         0     <= 0.5 ulp

Copyright assignment is on file with the FSF.  

I've built the gcc-arm-none-eabi cross-compiler using the 20201108 snapshot of GCC plus this patch, and successfully compiled a test program:

    extern int main (void)
    {
        volatile int x = 1;
        volatile unsigned long long int y = 10;
        volatile long long int z = x / y; // 64-bit division
      
        volatile float a = x; // 32-bit casting
        volatile float b = y; // 64 bit casting
        volatile float c = z / b; // float division
        volatile float d = a + c; // float addition
        volatile float e = c * b; // float multiplication
        volatile float f = d - e - c; // float subtraction
      
        if (f != c) // float comparison
            y -= (long long int)d; // float casting
    }

As one point of comparison, the test program links to 876 bytes of libgcc code from the patched toolchain, vs 10276 bytes from the latest released gcc-arm-none-eabi-9-2020-q2 toolchain.    That's a 90% size reduction.  

I have extensive test vectors, and have passed these tests on an STM32F051.  These vectors were derived from UCB [1], Testfloat [2], and IEEECC754 [3] sources, plus some of my own creation.  Unfortunately, I'm not sure how "make check" should work for a cross compiler run time library.  

Although I believe this patch can be incorporated as-is, there are at least two points that might bear discussion: 

* I'm not sure where or how they would be integrated, but I would be happy to provide sources for my test vectors.  

* The library is currently built for the ARM v6m architecture only.  It is likely that some of the other Cortex variants would benefit from these routines.  However, I would need some guidance on this to proceed without introducing regressions.  I do not currently have a test strategy for architectures beyond Cortex M0, and I have NOT profiled the existing thumb-2 implementations (ieee754-sf.S) for comparison.

I'm naturally hoping for some action on this patch before the Nov 16th deadline for GCC-11 stage 3.  Please review and advise.  

Thanks,
Daniel Engel

[1] http://www.netlib.org/fp/ucbtest.tgz
[2] http://www.jhauser.us/arithmetic/TestFloat.html
[3] http://win-www.uia.ac.be/u/cant/ieeecc754.html

[-- Attachment #2: cortex-m0-fplib-20201112.patch --]
[-- Type: application/octet-stream, Size: 133513 bytes --]

diff -ruN libgcc/config/arm/bpabi-v6m.S libgcc/config/arm/bpabi-v6m.S
--- libgcc/config/arm/bpabi-v6m.S	2020-11-08 14:32:11.000000000 -0800
+++ libgcc/config/arm/bpabi-v6m.S	2020-11-12 09:06:46.383424089 -0800
@@ -33,212 +33,6 @@
 	.eabi_attribute 25, 1
 #endif /* __ARM_EABI__ */
 
-#ifdef L_aeabi_lcmp
-
-FUNC_START aeabi_lcmp
-	cmp	xxh, yyh
-	beq	1f
-	bgt	2f
-	movs	r0, #1
-	negs	r0, r0
-	RET
-2:
-	movs	r0, #1
-	RET
-1:
-	subs	r0, xxl, yyl
-	beq	1f
-	bhi	2f
-	movs	r0, #1
-	negs	r0, r0
-	RET
-2:
-	movs	r0, #1
-1:
-	RET
-	FUNC_END aeabi_lcmp
-
-#endif /* L_aeabi_lcmp */
-	
-#ifdef L_aeabi_ulcmp
-
-FUNC_START aeabi_ulcmp
-	cmp	xxh, yyh
-	bne	1f
-	subs	r0, xxl, yyl
-	beq	2f
-1:
-	bcs	1f
-	movs	r0, #1
-	negs	r0, r0
-	RET
-1:
-	movs	r0, #1
-2:
-	RET
-	FUNC_END aeabi_ulcmp
-
-#endif /* L_aeabi_ulcmp */
-
-.macro test_div_by_zero signed
-	cmp	yyh, #0
-	bne	7f
-	cmp	yyl, #0
-	bne	7f
-	cmp	xxh, #0
-	.ifc	\signed, unsigned
-	bne	2f
-	cmp	xxl, #0
-2:
-	beq	3f
-	movs	xxh, #0
-	mvns	xxh, xxh		@ 0xffffffff
-	movs	xxl, xxh
-3:
-	.else
-	blt	6f
-	bgt	4f
-	cmp	xxl, #0
-	beq	5f
-4:	movs	xxl, #0
-	mvns	xxl, xxl		@ 0xffffffff
-	lsrs	xxh, xxl, #1		@ 0x7fffffff
-	b	5f
-6:	movs	xxh, #0x80
-	lsls	xxh, xxh, #24		@ 0x80000000
-	movs	xxl, #0
-5:
-	.endif
-	@ tailcalls are tricky on v6-m.
-	push	{r0, r1, r2}
-	ldr	r0, 1f
-	adr	r1, 1f
-	adds	r0, r1
-	str	r0, [sp, #8]
-	@ We know we are not on armv4t, so pop pc is safe.
-	pop	{r0, r1, pc}
-	.align	2
-1:
-	.word	__aeabi_ldiv0 - 1b
-7:
-.endm
-
-#ifdef L_aeabi_ldivmod
-
-FUNC_START aeabi_ldivmod
-	test_div_by_zero signed
-
-	push	{r0, r1}
-	mov	r0, sp
-	push	{r0, lr}
-	ldr	r0, [sp, #8]
-	bl	SYM(__gnu_ldivmod_helper)
-	ldr	r3, [sp, #4]
-	mov	lr, r3
-	add	sp, sp, #8
-	pop	{r2, r3}
-	RET
-	FUNC_END aeabi_ldivmod
-
-#endif /* L_aeabi_ldivmod */
-
-#ifdef L_aeabi_uldivmod
-
-FUNC_START aeabi_uldivmod
-	test_div_by_zero unsigned
-
-	push	{r0, r1}
-	mov	r0, sp
-	push	{r0, lr}
-	ldr	r0, [sp, #8]
-	bl	SYM(__udivmoddi4)
-	ldr	r3, [sp, #4]
-	mov	lr, r3
-	add	sp, sp, #8
-	pop	{r2, r3}
-	RET
-	FUNC_END aeabi_uldivmod
-	
-#endif /* L_aeabi_uldivmod */
-
-#ifdef L_arm_addsubsf3
-
-FUNC_START aeabi_frsub
-
-      push	{r4, lr}
-      movs	r4, #1
-      lsls	r4, #31
-      eors	r0, r0, r4
-      bl	__aeabi_fadd
-      pop	{r4, pc}
-
-      FUNC_END aeabi_frsub
-
-#endif /* L_arm_addsubsf3 */
-
-#ifdef L_arm_cmpsf2
-
-FUNC_START aeabi_cfrcmple
-
-	mov	ip, r0
-	movs	r0, r1
-	mov	r1, ip
-	b	6f
-
-FUNC_START aeabi_cfcmpeq
-FUNC_ALIAS aeabi_cfcmple aeabi_cfcmpeq
-
-	@ The status-returning routines are required to preserve all
-	@ registers except ip, lr, and cpsr.
-6:	push	{r0, r1, r2, r3, r4, lr}
-	bl	__lesf2
-	@ Set the Z flag correctly, and the C flag unconditionally.
-	cmp	r0, #0
-	@ Clear the C flag if the return value was -1, indicating
-	@ that the first operand was smaller than the second.
-	bmi	1f
-	movs	r1, #0
-	cmn	r0, r1
-1:
-	pop	{r0, r1, r2, r3, r4, pc}
-
-	FUNC_END aeabi_cfcmple
-	FUNC_END aeabi_cfcmpeq
-	FUNC_END aeabi_cfrcmple
-
-FUNC_START	aeabi_fcmpeq
-
-	push	{r4, lr}
-	bl	__eqsf2
-	negs	r0, r0
-	adds	r0, r0, #1
-	pop	{r4, pc}
-
-	FUNC_END aeabi_fcmpeq
-
-.macro COMPARISON cond, helper, mode=sf2
-FUNC_START	aeabi_fcmp\cond
-
-	push	{r4, lr}
-	bl	__\helper\mode
-	cmp	r0, #0
-	b\cond	1f
-	movs	r0, #0
-	pop	{r4, pc}
-1:
-	movs	r0, #1
-	pop	{r4, pc}
-
-	FUNC_END aeabi_fcmp\cond
-.endm
-
-COMPARISON lt, le
-COMPARISON le, le
-COMPARISON gt, ge
-COMPARISON ge, ge
-
-#endif /* L_arm_cmpsf2 */
-
 #ifdef L_arm_addsubdf3
 
 FUNC_START aeabi_drsub
diff -ruN libgcc/config/arm/cm0/clz2.S libgcc/config/arm/cm0/clz2.S
--- libgcc/config/arm/cm0/clz2.S	1969-12-31 16:00:00.000000000 -0800
+++ libgcc/config/arm/cm0/clz2.S	2020-11-12 09:46:26.943906976 -0800
@@ -0,0 +1,122 @@
+/* clz2.S: Cortex M0 optimized 'clz' functions 
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#ifdef __BUILD_CM0_FPLIB // whole file
+
+
+// int __clzdi2(long long)
+// Counts leading zeros in a 64 bit double word.
+// Expects the argument  in $r1:$r0.
+// Returns the result in $r0.
+// Uses $r2 and $r3 as scratch space.
+.section .text.libgcc.clz2,"x"
+CM0_FUNC_START clzdi2
+    CFI_START_FUNCTION
+
+        // Assume all the bits in the argument are zero.
+        movs    r2,     #64
+
+        // If the upper word is ZERO, calculate 32 + __clzsi2(lower).
+        cmp     r1,     #0
+        beq     LSYM(__clz16)
+
+        // The upper word is non-zero, so calculate __clzsi2(upper).
+        movs    r0,     r1
+
+        // Fall through.
+
+
+// int __clzsi2(int)
+// Counts leading zeros in a 32 bit word.
+// Expects the argument in $r0.
+// Returns the result in $r0.
+// Uses $r2 and $r3 as scratch space.
+CM0_FUNC_START clzsi2
+        // Assume all the bits in the argument are zero
+        movs    r2,     #32
+
+    LSYM(__clz16):
+        // Size optimized: 22 bytes, 51 clocks
+        // Speed optimized: 42 bytes, 23 clocks
+
+  #if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+        // Binary search starts at half the word width.
+        movs    r3,     #16
+
+    LSYM(__clz_loop):
+        // Test the upper 'n' bits of the operand for ZERO.
+        movs    r1,     r0
+        lsrs    r1,     r3
+        beq     LSYM(__clz_skip)
+
+        // When the test fails, discard the lower bits of the register,
+        //  and deduct the count of discarded bits from the result.
+        movs    r0,     r1
+        subs    r2,     r3
+
+    LSYM(__clz_skip):
+        // Decrease the shift distance for the next test.
+        lsrs    r3,     #1
+        bne     LSYM(__clz_loop)
+  #else
+        // Unrolled binary search.
+        lsrs    r1,     r0,     #16
+        beq     LSYM(__clz8)
+        movs    r0,     r1
+        subs    r2,     #16
+
+    LSYM(__clz8):
+        lsrs    r1,     r0,     #8
+        beq     LSYM(__clz4)
+        movs    r0,     r1
+        subs    r2,     #8
+
+    LSYM(__clz4):
+        lsrs    r1,     r0,     #4
+        beq     LSYM(__clz2)
+        movs    r0,     r1
+        subs    r2,     #4
+
+    LSYM(__clz2):
+        lsrs    r1,     r0,     #2
+        beq     LSYM(__clz1)
+        movs    r0,     r1
+        subs    r2,     #2
+
+    LSYM(__clz1):
+        // Convert remainder {0,1,2,3} to {0,1,2,2} (no 'ldr' cache hit).
+        lsrs    r1,     r0,     #1
+        bics    r0,     r1
+  #endif
+
+        // Account for the remainder.
+        subs    r0,     r2,     r0
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END clzsi2
+CM0_FUNC_END clzdi2
+
+#endif // __BUILD_CM0_FPLIB
diff -ruN libgcc/config/arm/cm0/div.S libgcc/config/arm/cm0/div.S
--- libgcc/config/arm/cm0/div.S	1969-12-31 16:00:00.000000000 -0800
+++ libgcc/config/arm/cm0/div.S	2020-11-10 21:33:20.981886999 -0800
@@ -0,0 +1,180 @@
+/* div.S: Cortex M0 optimized 32-bit integer division
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#ifdef __BUILD_CM0_FPLIB // whole file
+
+
+// int __aeabi_idiv0(int)
+// Helper function for division by 0.
+.section .text.libgcc.idiv0,"x"
+CM0_FUNC_START aeabi_idiv0
+    CFI_START_FUNCTION
+
+      #if defined(TRAP_EXCEPTIONS) && TRAP_EXCEPTIONS
+        svc     #(SVC_DIVISION_BY_ZERO)
+      #endif
+
+        // Return {0, numerator}.
+        movs    r1,     r0
+        eors    r0,     r0
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_idiv0
+
+
+// int __aeabi_idiv(int, int)
+// idiv_return __aeabi_idivmod(int, int)
+// Returns signed $r0 after division by $r1.
+// Also returns the signed remainder in $r1.
+.section .text.libgcc.idiv,"x"
+CM0_FUNC_START aeabi_idivmod
+FUNC_ALIAS aeabi_idiv aeabi_idivmod
+FUNC_ALIAS divsi3 aeabi_idivmod
+    CFI_START_FUNCTION
+
+        // Extend the sign of the denominator.
+        asrs    r3,     r1,     #31
+
+        // Absolute value of the denominator, abort on division by zero.
+        eors    r1,     r3
+        subs    r1,     r3
+        beq     SYM(__aeabi_idiv0)
+
+        // Absolute value of the numerator.
+        asrs    r2,     r0,     #31
+        eors    r0,     r2
+        subs    r0,     r2
+
+        // Keep the sign of the numerator in bit[31] (for the remainder).
+        // Save the XOR of the signs in bits[15:0] (for the quotient).
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        lsrs    rT,     r3,     #16
+        eors    rT,     r2
+
+        // Handle division as unsigned.
+        bl      LSYM(__internal_uidivmod)
+
+        // Set the sign of the remainder.
+        asrs    r2,     rT,     #31
+        eors    r1,     r2
+        subs    r1,     r2
+
+        // Set the sign of the quotient.
+        sxth    r3,     rT
+        eors    r0,     r3
+        subs    r0,     r3
+
+    LSYM(__idivmod_return):
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END divsi3
+CM0_FUNC_END aeabi_idiv
+CM0_FUNC_END aeabi_idivmod
+
+
+// int __aeabi_uidiv(unsigned int, unsigned int)
+// idiv_return __aeabi_uidivmod(unsigned int, unsigned int)
+// Returns unsigned $r0 after division by $r1.
+// Also returns the remainder in $r1.
+.section .text.libgcc.uidiv,"x"
+CM0_FUNC_START aeabi_uidivmod
+FUNC_ALIAS aeabi_uidiv aeabi_uidivmod
+FUNC_ALIAS udivsi3 aeabi_uidivmod
+    CFI_START_FUNCTION
+
+        // Abort on division by zero.
+        tst     r1,     r1
+        beq     SYM(__aeabi_idiv0)
+
+  #if defined(OPTIMIZE_SPEED) && OPTIMIZE_SPEED
+        // MAYBE: Optimize division by a power of 2
+  #endif
+
+    LSYM(__internal_uidivmod):
+        // Pre division: Shift the denominator as far as possible left
+        //  without making it larger than the numerator.
+        // The loop is destructive, save a copy of the numerator.
+        mov     ip,     r0
+
+        // Set up binary search.
+        movs    r3,     #16
+        movs    r2,     #1
+
+    LSYM(__uidivmod_align):
+        // Prefer dividing the numerator to multipying the denominator
+        //  (multiplying the denominator may result in overflow).
+        lsrs    r0,     r3
+        cmp     r0,     r1
+        blo     LSYM(__uidivmod_skip)
+
+        // Multiply the denominator and the result together.
+        lsls    r1,     r3
+        lsls    r2,     r3
+
+    LSYM(__uidivmod_skip):
+        // Restore the numerator, and iterate until search goes to 0.
+        mov     r0,     ip
+        lsrs    r3,     #1
+        bne     LSYM(__uidivmod_align)
+
+        // In The result $r3 has been conveniently initialized to 0.
+        b       LSYM(__uidivmod_entry)
+
+    LSYM(__uidivmod_loop):
+        // Scale the denominator and the quotient together.
+        lsrs    r1,     #1
+        lsrs    r2,     #1
+        beq     LSYM(__uidivmod_return)
+
+    LSYM(__uidivmod_entry):
+        // Test if the denominator is smaller than the numerator.
+        cmp     r0,     r1
+        blo     LSYM(__uidivmod_loop)
+
+        // If the denominator is smaller, the next bit of the result is '1'.
+        // If the new remainder goes to 0, exit early.
+        adds    r3,     r2
+        subs    r0,     r1
+        bne     LSYM(__uidivmod_loop)
+
+    LSYM(__uidivmod_return):
+        mov     r1,     r0
+        mov     r0,     r3
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END udivsi3
+CM0_FUNC_END aeabi_uidiv
+CM0_FUNC_END aeabi_uidivmod
+
+#endif // __BUILD_CM0_FPLIB
diff -ruN libgcc/config/arm/cm0/fadd.S libgcc/config/arm/cm0/fadd.S
--- libgcc/config/arm/cm0/fadd.S	1969-12-31 16:00:00.000000000 -0800
+++ libgcc/config/arm/cm0/fadd.S	2020-11-12 09:46:26.943906976 -0800
@@ -0,0 +1,301 @@
+/* fadd.S: Cortex M0 optimized 32-bit float addition
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef __BUILD_CM0_FPLIB // whole file
+
+
+// float __aeabi_frsub(float, float)
+// Returns the floating point difference of $r1 - $r0 in $r0.
+.section .text.libgcc.frsub,"x"
+CM0_FUNC_START aeabi_frsub
+    CFI_START_FUNCTION
+
+      #if defined(STRICT_NANS) && STRICT_NANS
+        // Check if $r0 is NAN before modifying.
+        lsls    r2,     r0,     #1
+        movs    r3,     #255
+        lsls    r3,     #24
+
+        // Let fadd() find the NAN in the normal course of operation,
+        //  moving it to $r0 and checking the quiet/signaling bit.
+        cmp     r2,     r3
+        bhi     LSYM(__internal_fadd)
+      #endif
+
+        // Flip sign and run through fadd().
+        movs    r2,     #1
+        lsls    r2,     #31
+        adds    r0,     r2
+        b       LSYM(__internal_fadd)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_frsub
+
+
+// float __aeabi_fsub(float, float)
+// Returns the floating point difference of $r0 - $r1 in $r0.
+.section .text.libgcc.fsub,"x"
+CM0_FUNC_START aeabi_fsub
+FUNC_ALIAS subsf3 aeabi_fsub
+    CFI_START_FUNCTION
+
+      #if defined(STRICT_NANS) && STRICT_NANS
+        // Check if $r1 is NAN before modifying.
+        lsls    r2,     r1,     #1
+        movs    r3,     #255
+        lsls    r3,     #24
+
+        // Let fadd() find the NAN in the normal course of operation,
+        //  moving it to $r0 and checking the quiet/signaling bit.
+        cmp     r2,     r3
+        bhi     LSYM(__internal_fadd)
+      #endif
+
+        // Flip sign and run through fadd().
+        movs    r2,     #1
+        lsls    r2,     #31
+        adds    r1,     r2
+        b       LSYM(__internal_fadd)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END subsf3
+CM0_FUNC_END aeabi_fsub
+
+
+// float __aeabi_fadd(float, float)
+// Returns the floating point sum of $r0 + $r1 in $r0.
+.section .text.libgcc.fadd,"x"
+CM0_FUNC_START aeabi_fadd
+FUNC_ALIAS addsf3 aeabi_fadd
+    CFI_START_FUNCTION
+
+    LSYM(__internal_fadd):
+        // Standard registers, compatible with exception handling.
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        // Drop the sign bit to compare absolute value.
+        lsls    r2,     r0,     #1
+        lsls    r3,     r1,     #1
+
+        // Save the logical difference of original values.
+        // This actually makes the following swap slightly faster.
+        eors    r1,     r0
+
+        // Compare exponents+mantissa.
+        // MAYBE: Speedup for equal values?  This would have to separately
+        //  check for NAN/INF and then either:
+        // * Increase the exponent by '1' (for multiply by 2), or
+        // * Return +0
+        cmp     r2,     r3
+        bhs     LSYM(__fadd_ordered)
+
+        // Reorder operands so the larger absolute value is in r2,
+        //  the corresponding original operand is in $r0,
+        //  and the smaller absolute value is in $r3.
+        movs    r3,     r2
+        eors    r0,     r1
+        lsls    r2,     r0,     #1
+
+    LSYM(__fadd_ordered):
+        // Extract the exponent of the larger operand.
+        // If INF/NAN, then it becomes an automatic result.
+        lsrs    r2,     #24
+        cmp     r2,     #255
+        beq     LSYM(__fadd_special)
+
+        // Save the sign of the result.
+        lsrs    rT,     r0,     #31
+        lsls    rT,     #31
+        mov     ip,     rT
+
+        // If the original value of $r1 was to +/-0,
+        //  $r0 becomes the automatic result.
+        // Because $r0 is known to be a finite value, return directly.
+        // It's actually important that +/-0 not go through the normal
+        //  process, to keep "-0 +/- 0"  from being turned into +0.
+        cmp     r3,     #0
+        beq     LSYM(__fadd_zero)
+
+        // Extract the second exponent.
+        lsrs    r3,     #24
+
+        // Calculate the difference of exponents (always positive).
+        subs    r3,     r2,     r3
+
+      #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+        // If the smaller operand is more than 25 bits less significant
+        //  than the larger, the larger operand is an automatic result.
+        // The smaller operand can't affect the result, even after rounding.
+        cmp     r3,     #25
+        bhi     LSYM(__fadd_return)
+      #endif
+
+        // Isolate both mantissas, recovering the smaller.
+        lsls    rT,     r0,     #9
+        lsls    r0,     r1,     #9
+        eors    r0,     rT // 26
+
+        // If the larger operand is normal, restore the implicit '1'.
+        // If subnormal, the second operand will also be subnormal.
+        cmp     r2,     #0
+        beq     LSYM(__fadd_normal)
+        adds    rT,     #1
+        rors    rT,     rT
+
+        // If the smaller operand is also normal, restore the implicit '1'.
+        // If subnormal, the smaller operand effectively remains multiplied
+        //  by 2 w.r.t the first.  This compensates for subnormal exponents,
+        //  which are technically still -126, not -127.
+        cmp     r2,     r3
+        beq     LSYM(__fadd_normal)
+        adds    r0,     #1
+        rors    r0,     r0
+
+    LSYM(__fadd_normal):
+        // Provide a spare bit for overflow.
+        // Normal values will be aligned in bits [30:7]
+        // Subnormal values will be aligned in bits [30:8]
+        lsrs    rT,     #1
+        lsrs    r0,     #1
+
+        // If signs weren't matched, negate the smaller operand (branchless).
+        asrs    r1,     #31
+        eors    r0,     r1
+        subs    r0,     r1
+
+        // Keep a copy of the small mantissa for the remainder.
+        movs    r1,     r0
+
+        // Align the small mantissa for addition.
+        asrs    r1,     r3
+
+        // Isolate the remainder.
+        // NOTE: Given the various cases above, the remainder will only
+        //  be used as a boolean for rounding ties to even.  It is not
+        //  necessary to negate the remainder for subtraction operations.
+        rsbs    r3,     #0
+        adds    r3,     #32
+        lsls    r0,     r3
+
+        // Because operands are ordered, the result will never be negative.
+        // If the result of subtraction is 0, the overall result must be +0.
+        // If the overall result in $r1 is 0, then the remainder in $r0
+        //  must also be 0, so no register copy is necessary on return.
+        adds    r1,     rT
+        beq     LSYM(__fadd_return)
+
+        // The large operand was aligned in bits [29:7]...
+        // If the larger operand was normal, the implicit '1' went in bit [30].
+        //
+        // After addition, the MSB of the result may be in bit:
+        //    31,  if the result overflowed.
+        //    30,  the usual case.
+        //    29,  if there was a subtraction of operands with exponents
+        //          differing by more than 1.
+        //  < 28, if there was a subtraction of operands with exponents +/-1,
+        //  < 28, if both operands were subnormal.
+
+        // In the last case (both subnormal), the alignment shift will be 8,
+        //  the exponent will be 0, and no rounding is necessary.
+        cmp     r2,     #0
+        bne     SYM(__fp_assemble) // 46
+
+        // Subnormal overflow automatically forms the correct exponent.
+        lsrs    r0,     r1,     #8
+        add     r0,     ip
+
+    LSYM(__fadd_return):
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    LSYM(__fadd_special):
+      #if defined(TRAP_NANS) && TRAP_NANS
+        // If $r1 is (also) NAN, force it in place of $r0.
+        // As the smaller NAN, it is more likely to be signaling.
+        movs    rT,     #255
+        lsls    rT,     #24
+        cmp     r3,     rT
+        bls     LSYM(__fadd_ordered2)
+
+        eors    r0,     r1
+      #endif
+
+    LSYM(__fadd_ordered2):
+        // There are several possible cases to consider here:
+        //  1. Any NAN/NAN combination
+        //  2. Any NAN/INF combination
+        //  3. Any NAN/value combination
+        //  4. INF/INF with matching signs
+        //  5. INF/INF with mismatched signs.
+        //  6. Any INF/value combination.
+        // In all cases but the case 5, it is safe to return $r0.
+        // In the special case, a new NAN must be constructed.
+        // First, check the mantissa to see if $r0 is NAN.
+        lsls    r2,     r0,     #9
+
+      #if defined(TRAP_NANS) && TRAP_NANS
+        bne     SYM(__fp_check_nan)
+      #else
+        bne     LSYM(__fadd_return)
+      #endif
+
+    LSYM(__fadd_zero):
+        // Next, check for an INF/value combination.
+        lsls    r2,     r1,     #1
+        bne     LSYM(__fadd_return)
+
+        // Finally, check for matching sign on INF/INF.
+        // Also accepts matching signs when +/-0 are added.
+        bcc     LSYM(__fadd_return)
+
+      #if defined(EXCEPTION_CODES) && EXCEPTION_CODES
+        movs    r3,     #(SUBTRACTED_INFINITY)
+      #endif
+
+      #if defined(TRAP_EXCEPTIONS) && TRAP_EXCEPTIONS
+        // Restore original operands.
+        eors    r1,     r0
+      #endif
+
+        // Identify mismatched 0.
+        lsls    r2,     r0,     #1
+        bne     SYM(__fp_exception)
+
+        // Force mismatched 0 to +0.
+        eors    r0,     r0
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END addsf3
+CM0_FUNC_END aeabi_fadd
+
+
+#endif // __BUILD_CM0_FPLIB
diff -ruN libgcc/config/arm/cm0/fcmp.S libgcc/config/arm/cm0/fcmp.S
--- libgcc/config/arm/cm0/fcmp.S	1969-12-31 16:00:00.000000000 -0800
+++ libgcc/config/arm/cm0/fcmp.S	2020-11-10 21:33:20.981886999 -0800
@@ -0,0 +1,555 @@
+/* fcmp.S: Cortex M0 optimized 32-bit float comparison
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef __BUILD_CM0_FPLIB // whole file
+
+
+// int __cmpsf2(float, float)
+// <https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html>
+// Returns the three-way comparison result of $r0 with $r1:
+//  * +1 if ($r0 > $r1), or either argument is NAN
+//  *  0 if ($r0 == $r1)
+//  * -1 if ($r0 < $r1)
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libgcc.cmpsf2,"x"
+CM0_FUNC_START cmpsf2
+FUNC_ALIAS lesf2 cmpsf2
+FUNC_ALIAS ltsf2 cmpsf2
+    CFI_START_FUNCTION
+
+        // Assumption: The 'libgcc' functions should raise exceptions.
+        movs    r2,     #(FCMP_UN_POSITIVE + FCMP_RAISE_EXCEPTIONS + FCMP_3WAY)
+
+// int,int __internal_cmpsf2(float, float, int)
+// Internal function expects a set of control flags in $r2.
+// If ordered, returns a comparison type { 0, 1, 2 } in $r3
+CM0_FUNC_START internal_cmpsf2
+
+        // When operand signs are considered, the comparison result falls
+        //  within one of the following quadrants:
+        //
+        // $r0  $r1  $r0-$r1* flags  result
+        //  +    +      >      C=0     GT
+        //  +    +      =      Z=1     EQ
+        //  +    +      <      C=1     LT
+        //  +    -      >      C=1     GT
+        //  +    -      =      C=1     GT
+        //  +    -      <      C=1     GT
+        //  -    +      >      C=0     LT
+        //  -    +      =      C=0     LT
+        //  -    +      <      C=0     LT
+        //  -    -      >      C=0     LT
+        //  -    -      =      Z=1     EQ
+        //  -    -      <      C=1     GT
+        //
+        // *When interpeted as a subtraction of unsigned integers
+        //
+        // From the table, it is clear that in the presence of any negative
+        //  operand, the natural result simply needs to be reversed.
+        // Save the 'N' flag for later use.
+        movs    r3,     r0
+        orrs    r3,     r1
+        mov     ip,     r3
+
+        // Keep the absolute value of the second argument for NAN testing.
+        lsls    r3,     r1,     #1
+
+        // With the absolute value of the second argument safely stored,
+        //  recycle $r1 to calculate the difference of the arguments.
+        subs    r1,     r0,     r1
+
+        // Save the 'C' flag for use later.
+        // Effectively shifts all the flags 1 bit left.
+        adcs    r2,     r2
+
+        // Absolute value of the first argument.
+        lsls    r0,     #1
+
+        // Identify the largest absolute value between the two arguments.
+        cmp     r0,     r3
+        bhs     LSYM(__fcmp_sorted)
+
+        // Keep the larger absolute value for NAN testing.
+        // NOTE: When the arguments are respectively a signaling NAN and a
+        //  quiet NAN, the quiet NAN has precedence.  This has consequences
+        //  if TRAP_NANS is enabled, but the flags indicate that exceptions
+        //  for quiet NANs should be suppressed.  After the signaling NAN is
+        //  discarded, no exception is raised, although it should have been.
+        // This could be avoided by using a fifth register to save both
+        //  arguments until the signaling bit can be tested, but that seems
+        //  like an excessive amount of ugly code for an ambiguous case.
+        movs    r0,     r3
+
+    LSYM(__fcmp_sorted):
+        // If $r3 is NAN, the result is unordered.
+        movs    r3,     #255
+        lsls    r3,     #24
+        cmp     r0,     r3
+        bhi     LSYM(__fcmp_unordered)
+
+        // Positive and negative zero must be considered equal.
+        // If the larger absolute value is +/-0, both must have been +/-0.
+        subs    r3,     r0,     #0
+        beq     LSYM(__fcmp_zero)
+
+        // Test for regular equality.
+        subs    r3,     r1,     #0
+        beq     LSYM(__fcmp_zero)
+
+        // Isolate the saved 'C', and invert if either argument was negative.
+        // Remembering that the original subtraction was $r1 - $r0,
+        //  the result will be 1 if 'C' was set (gt), or 0 for not 'C' (lt).
+        lsls    r3,     r2,     #31
+        add     r3,     ip
+        lsrs    r3,     #31
+
+        // HACK: Clear the 'C' bit
+        adds    r3,     #0
+
+    LSYM(__fcmp_zero):
+        // After everything is combined, the temp result will be
+        //  2 (gt), 1 (eq), or 0 (lt).
+        adcs    r3,     r3
+
+        // Return directly if the 3-way comparison flag is set.
+        // Also shifts the condition mask into bits[2:0].
+        lsrs    r2,     #2 // 26
+        bcs     LSYM(__fcmp_return)
+
+        // If the bit corresponding to the comparison result is set in the
+        //  accepance mask, a '1' will fall out into the result.
+        movs    r0,     #1
+        lsrs    r2,     r3
+        ands    r0,     r2
+        RETx    lr // 33
+
+    LSYM(__fcmp_unordered):
+        // Set up the requested UNORDERED result.
+        // Remember the shift in the flags (above).
+        lsrs    r2,     #6
+
+  #if defined(TRAP_EXCEPTIONS) && TRAP_EXCEPTIONS
+        // TODO: ... The
+
+
+  #endif
+
+  #if defined(TRAP_NANS) && TRAP_NANS
+        // Always raise an exception if FCMP_RAISE_EXCEPTIONS was specified.
+        bcs     LSYM(__fcmp_trap)
+
+        // If FCMP_NO_EXCEPTIONS was specified, no exceptions on quiet NANs.
+        // The comparison flags are moot, so $r1 can serve as scratch space.
+        lsrs    r1,     r0,     #24
+        bcs     LSYM(__fcmp_return2)
+
+    LSYM(__fcmp_trap):
+        // Restore the NAN (sans sign) for an argument to the exception.
+        // As an IRQ, the handler restores all registers, including $r3.
+        // NOTE: The service handler may not return.
+        lsrs    r0,     #1
+        movs    r3,     #(UNORDERED_COMPARISON)
+        svc     #(SVC_TRAP_NAN)
+  #endif
+
+     LSYM(__fcmp_return2):
+        // HACK: Work around result register mapping.
+        // This could probably be eliminated by remapping the flags register.
+        movs    r3,     r2
+
+    LSYM(__fcmp_return):
+        // Finish setting up the result.
+        // The subtraction allows a negative result from an 8 bit set of flags.
+        //  (See the variations on the FCMP_UN parameter, above).
+        subs    r0,     r3,     #1
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END ltsf2
+CM0_FUNC_END lesf2
+CM0_FUNC_END cmpsf2
+
+
+// int __eqsf2(float, float)
+// <https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html>
+// Returns the three-way comparison result of $r0 with $r1:
+//  * -1 if ($r0 < $r1)
+//  *  0 if ($r0 == $r1)
+//  * +1 if ($r0 > $r1), or either argument is NAN
+// Uses $r2, $r3, and $ip as scratch space.
+CM0_FUNC_START eqsf2
+FUNC_ALIAS nesf2 eqsf2
+    CFI_START_FUNCTION
+
+        // Assumption: The 'libgcc' functions should raise exceptions.
+        movs    r2,     #(FCMP_UN_POSITIVE + FCMP_NO_EXCEPTIONS + FCMP_3WAY)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END nesf2
+CM0_FUNC_END eqsf2
+
+
+// int __gesf2(float, float)
+// <https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html>
+// Returns the three-way comparison result of $r0 with $r1:
+//  * -1 if ($r0 < $r1), or either argument is NAN
+//  *  0 if ($r0 == $r1)
+//  * +1 if ($r0 > $r1)
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libgcc.gesf2,"x"
+CM0_FUNC_START gesf2
+FUNC_ALIAS gtsf2 gesf2
+    CFI_START_FUNCTION
+
+        // Assumption: The 'libgcc' functions should raise exceptions.
+        movs    r2,     #(FCMP_UN_NEGATIVE + FCMP_RAISE_EXCEPTIONS + FCMP_3WAY)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END gtsf2
+CM0_FUNC_END gesf2
+
+
+// int __aeabi_fcmpeq(float, float)
+// Returns '1' in $r1 if ($r0 == $r1) (ordered).
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libgcc.fcmpeq,"x"
+CM0_FUNC_START aeabi_fcmpeq
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_EQ)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_fcmpeq
+
+
+// int __aeabi_fcmpne(float, float) [non-standard]
+// Returns '1' in $r1 if ($r0 != $r1) (ordered).
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libgcc.fcmpne,"x"
+CM0_FUNC_START aeabi_fcmpne
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_NE)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_fcmpne
+
+
+// int __aeabi_fcmplt(float, float)
+// Returns '1' in $r1 if ($r0 < $r1) (ordered).
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libgcc.fcmplt,"x"
+CM0_FUNC_START aeabi_fcmplt
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_RAISE_EXCEPTIONS + FCMP_LT)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_fcmplt
+
+
+// int __aeabi_fcmple(float, float)
+// Returns '1' in $r1 if ($r0 <= $r1) (ordered).
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libgcc.fcmple,"x"
+CM0_FUNC_START aeabi_fcmple
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_RAISE_EXCEPTIONS + FCMP_LE)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_fcmple
+
+
+// int __aeabi_fcmpge(float, float)
+// Returns '1' in $r1 if ($r0 >= $r1) (ordered).
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libgcc.fcmpge,"x"
+CM0_FUNC_START aeabi_fcmpge
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_RAISE_EXCEPTIONS + FCMP_GE)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_fcmpge
+
+
+// int __aeabi_fcmpgt(float, float)
+// Returns '1' in $r1 if ($r0 > $r1) (ordered).
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libgcc.fcmpgt,"x"
+CM0_FUNC_START aeabi_fcmpgt
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_RAISE_EXCEPTIONS + FCMP_GT)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_fcmpgt
+
+
+// int __aeabi_fcmpun(float, float)
+// Returns '1' in $r1 if $r0 and $r1 are unordered.
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libgcc.fcmpun,"x"
+CM0_FUNC_START aeabi_fcmpun
+FUNC_ALIAS unordsf2 aeabi_fcmpun
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_POSITIVE + FCMP_NO_EXCEPTIONS)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END unordsf2
+CM0_FUNC_END aeabi_fcmpun
+
+#if 0
+
+
+// void __aeabi_cfrcmple(float, float)
+// Reverse three-way compare of $r1 ? $r1, with result in the status flags:
+//  * 'Z' is set only when the operands are ordered and equal.
+//  * 'C' is clear only when the operands are ordered and $r0 > $r1.
+// Preserves all core registers except $ip, $lr, and the CPSR.
+.section .text.libgcc.cfrcmple,"x"
+CM0_FUNC_START aeabi_cfrcmple
+    CFI_START_FUNCTION
+
+        push    { r0-r3, lr }
+
+        // Save the current CFI state
+        .cfi_adjust_cfa_offset 20
+        .cfi_rel_offset r0, 0
+        .cfi_rel_offset r1, 4
+        .cfi_rel_offset r2, 8
+        .cfi_rel_offset r3, 12
+        .cfi_rel_offset lr, 16
+
+        // Reverse the order of the arguments.
+        ldr     r0,     [sp, #4]
+        ldr     r1,     [sp, #0]
+
+        // Don't just fall through into cfcmple(), else registers will get pushed twice.
+        b       SYM(__real_cfrcmple)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_cfrcmple
+
+
+// void __aeabi_cfcmpeq(float, float)
+// NOTE: This function only applies if __aeabi_cfcmple() can raise exceptions.
+// Three-way compare of $r0 ? $r1, with result in the status flags:
+//  * 'Z' is set only when the operands are ordered and equal.
+//  * 'C' is clear only when the operands are ordered and $r0 < $r1.
+// Preserves all core registers except $ip, $lr, and the CPSR.
+#if defined(TRAP_NANS) && TRAP_NANS
+  .section .text.libgcc.cfcmpeq,"x"
+  CM0_FUNC_START aeabi_cfcmpeq
+    CFI_START_FUNCTION
+
+        push    { r0-r3, lr }
+
+        // Save the current CFI state
+        .cfi_adjust_cfa_offset 20
+        .cfi_rel_offset r0, 0
+        .cfi_rel_offset r1, 4
+        .cfi_rel_offset r2, 8
+        .cfi_rel_offset r3, 12
+        .cfi_rel_offset lr, 16
+
+        // No exceptions on quiet NAN.
+        // On an unordered result, 'C' should be '1' and 'Z' should be '0'.
+        // A subtraction giving -1 sets these flags correctly.
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS)
+        b       LSYM(__real_cfcmpeq)
+
+    CFI_END_FUNCTION
+  CM0_FUNC_END aeabi_cfcmpeq
+#endif
+
+// void __aeabi_cfcmple(float, float)
+// Three-way compare of $r0 ? $r1, with result in the status flags:
+//  * 'Z' is set only when the operands are ordered and equal.
+//  * 'C' is clear only when the operands are ordered and $r0 < $r1.
+// Preserves all core registers except $ip, $lr, and the CPSR.
+.section .text.libgcc.cfcmple,"x"
+CM0_FUNC_START aeabi_cfcmple
+
+  // __aeabi_cfcmpeq() is defined separately when TRAP_NANS is enabled.
+  #if !defined(TRAP_NANS) || !TRAP_NANS
+    FUNC_ALIAS aeabi_cfcmpeq aeabi_cfcmple
+  #endif
+
+    CFI_START_FUNCTION
+
+        push    { r0-r3, lr }
+
+        // Save the current CFI state
+        .cfi_remember_state
+        .cfi_adjust_cfa_offset 20
+        .cfi_rel_offset r0, 0
+        .cfi_rel_offset r1, 4
+        .cfi_rel_offset r2, 8
+        .cfi_rel_offset r3, 12
+        .cfi_rel_offset lr, 16
+
+    LSYM(__real_cfrcmple):
+  #if defined(TRAP_NANS) && TRAP_NANS
+        // The result in $r0 will be ignored, but do raise exceptions.
+        // On an unordered result, 'C' should be '1' and 'Z' should be '0'.
+        // A subtraction giving -1 sets these flags correctly.
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_RAISE_EXCEPTIONS)
+  #endif
+
+    LSYM(__real_cfcmpeq):
+        // __internal_cmpsf2() always sets the APSR flags on return.
+        bl      LSYM(__internal_cmpsf2)
+
+        // Because __aeabi_cfcmpeq() wants the 'C' flag set on equal values,
+        //  magic is required.   For the possible intermediate values in $r3:
+        //  * 0b01 gives C = 0 and Z = 0 for $r0 < $r1
+        //  * 0b10 gives C = 1 and Z = 1 for $r0 == $r1
+        //  * 0b11 gives C = 1 and Z = 0 for $r0 > $r1 (or unordered)
+        cmp    r1,     #0
+
+        // Cleanup.
+        pop    { r0-r3, pc }
+        .cfi_restore_state
+
+    CFI_END_FUNCTION
+
+  #if !defined(TRAP_NANS) || !TRAP_NANS
+    CM0_FUNC_END aeabi_cfcmpeq
+  #endif
+
+CM0_FUNC_END aeabi_cfcmple
+
+
+// int isgreaterf(float, float)
+// Returns '1' in $r0 if ($r0 > $r1) and both $r0 and $r1 are ordered.
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libm.isgreaterf,"x"
+CM0_FUNC_START isgreaterf
+MATH_ALIAS isgreaterf
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_GT)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+MATH_END isgreaterf
+CM0_FUNC_END isgreaterf
+
+
+// int isgreaterequalf(float, float)
+// Returns '1' in $r0 if ($r0 >= $r1) and both $r0 and $r1 are ordered.
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libm.isgreaterequalf,"x"
+CM0_FUNC_START isgreaterequalf
+MATH_ALIAS isgreaterequalf
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_GT + FCMP_EQ)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+MATH_END isgreaterequalf
+CM0_FUNC_END isgreaterequalf
+
+
+// int islessf(float, float)
+// Returns '1' in $r0 if ($r0 < $r1) and both $r0 and $r1 are ordered.
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libm.islessf,"x"
+CM0_FUNC_START islessf
+MATH_ALIAS islessf
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_GT + FCMP_EQ)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+MATH_END islessf
+CM0_FUNC_END islessf
+
+
+// int islessequalf(float, float)
+// Returns '1' in $r0 if ($r0 <= $r1) and both $r0 and $r1 are ordered.
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libm.islessequalf,"x"
+CM0_FUNC_START islessequalf
+MATH_ALIAS islessequalf
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_GT + FCMP_EQ)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+MATH_END islessequalf
+CM0_FUNC_END islessequalf
+
+
+// int islessgreaterf(float, float)
+// Returns '1' in $r0 if ($r0 != $r1) and both $r0 and $r1 are ordered.
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libm.islessgreaterf,"x"
+CM0_FUNC_START islessgreaterf
+MATH_ALIAS islessgreaterf
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_GT + FCMP_EQ)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+MATH_END islessgreaterf
+CM0_FUNC_END islessgreaterf
+
+
+// int isunorderedf(float, float)
+// Returns '1' in $r0 if either $r0 or $r1 are ordered.
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libm.isunorderedf,"x"
+CM0_FUNC_START isunorderedf
+MATH_ALIAS isunorderedf
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_GT + FCMP_EQ)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+MATH_END isunorderedf
+CM0_FUNC_END isunorderedf
+
+
+#endif
+
+#endif // __BUILD_CM0_FPLIB
diff -ruN libgcc/config/arm/cm0/fconv.S libgcc/config/arm/cm0/fconv.S
--- libgcc/config/arm/cm0/fconv.S	1969-12-31 16:00:00.000000000 -0800
+++ libgcc/config/arm/cm0/fconv.S	2020-11-10 21:33:20.981886999 -0800
@@ -0,0 +1,346 @@
+/* fconv.S: Cortex M0 optimized 32- and 64-bit float conversions
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef __BUILD_CM0_FPLIB // whole file
+
+// Reference: <libgcc/config/arm/fp16.c>
+
+// double __aeabi_f2d(float)
+// Converts a single-precision float in $r0 to double-precision in $r1:$r0.
+// Rounding, overflow, and underflow are impossible.
+// INF and ZERO are returned unmodified.
+.section .text.libgcc.f2d,"x"
+CM0_FUNC_START aeabi_f2d
+FUNC_ALIAS extendsfdf2 aeabi_f2d
+    CFI_START_FUNCTION
+
+        // Save the sign.
+        lsrs    r1,     r0,     #31
+        lsls    r1,     #31
+
+        // Set up registers for __fp_normalize2().
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        // Test for zero.
+        lsls    r0,     #1
+        beq     LSYM(__f2d_return) // 7
+
+        // Split the exponent and mantissa into separate registers.
+        // This is the most efficient way to convert subnormals in the
+        //  half-precision form into normals in single-precision.
+        // This does add a leading implicit '1' to INF and NAN,
+        //  but that will be absorbed when the value is re-assembled.
+        movs    r2,     r0
+        bl      SYM(__fp_normalize2) __PLT__ // +4+8
+
+        // Set up the exponent bias.  For INF/NAN values, the bias
+        //  is 1791 (2047 - 255 - 1), where the last '1' accounts
+        //  for the implicit '1' in the mantissa.
+        movs    r0,     #3
+        lsls    r0,     #9
+        adds    r0,     #255
+
+        // Test for INF/NAN, promote exponent if necessary
+        cmp     r2,     #255
+        beq     LSYM(__f2d_indefinite)
+
+        // For normal values, the exponent bias is 895 (1023 - 127 - 1),
+        //  which is half of the prepared INF/NAN bias.
+        lsrs    r0,     #1
+
+    LSYM(__f2d_indefinite):
+        // Assemble exponent with bias correction.
+        adds    r2,     r0
+        lsls    r2,     #20
+        adds    r1,     r2
+
+        // Assemble the high word of the mantissa.
+        lsrs    r0,     r3,     #11
+        add     r1,     r0
+
+        // Remainder of the mantissa in the low word of the result.
+        lsls    r0,     r3,     #21
+
+    LSYM(__f2d_return):
+        pop     { rT, pc } // 38
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END extendsfdf2
+CM0_FUNC_END aeabi_f2d
+
+
+// float __aeabi_d2f(double)
+// Converts a double-precision float in $r1:$r0 to single-precision in $r0.
+// Values out of range become ZERO or INF; returns the upper 23 bits of NAN.
+// Rounds to nearest, ties to even.  The ARM ABI does not appear to specify a
+//  rounding mode, so no problems here.  Unfortunately, GCC specifies rounding
+//  towards zero, which makes this implementation incompatible.
+// (It would be easy enough to truncate normal values, but single-precision
+//  subnormals would require a significantly more complex approach.)
+.section .text.libgcc.d2f,"x"
+CM0_FUNC_START aeabi_d2f
+// FUNC_ALIAS truncdfsf2 aeabi_d2f // incompatible
+    CFI_START_FUNCTION
+
+        // Save the sign.
+        lsrs    r2,     r1,     #31
+        lsls    r2,     #31
+        mov     ip,     r2
+
+        // Isolate the exponent (11 bits).
+        lsls    r2,     r1,     #1
+        lsrs    r2,     #21
+
+        // Isolate the mantissa.  It's safe to always add the implicit '1' --
+        //  even for subnormals -- since they will underflow in every case.
+        lsls    r1,     #12
+        adds    r1,     #1
+        rors    r1,     r1
+        lsrs    r3,     r0,     #21
+        adds    r1,     r3
+        lsls    r0,     #11 // 11
+
+        // Test for INF/NAN (r3 = 2047)
+        mvns    r3,     r2
+        lsrs    r3,     #21
+        cmp     r3,     r2
+        beq     LSYM(__d2f_indefinite)
+
+        // Adjust exponent bias.  Offset is 127 - 1023, less 1 more since
+        //  __fp_assemble() expects the exponent relative to bit[30].
+        lsrs    r3,     #1
+        subs    r2,     r3
+        adds    r2,     #126
+
+    LSYM(__d2f_assemble):
+        // Use the standard formatting for overflow and underflow.
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        b       SYM(__fp_assemble) // 24-28 + 30
+                .cfi_restore_state
+
+    LSYM(__d2f_indefinite):
+        // Test for INF.  If the mantissa, exclusive of the implicit '1',
+        //  is equal to '0', the result will be INF.
+        lsls    r3,     r1,     #1
+        orrs    r3,     r0
+        beq     LSYM(__d2f_assemble) // 20
+
+        // Construct NAN with the upper 22 bits of the mantissa, setting bit[21]
+        //  to ensure a valid NAN without changing bit[22] (quiet)
+        subs    r2,     #0xD
+        lsls    r0,     r2,     #20
+        lsrs    r1,     #8
+        orrs    r0,     r1
+
+      #if defined(STRICT_NANS) && STRICT_NANS
+        add     r0,     ip
+      #endif
+
+        RETx    lr // 27
+
+    CFI_END_FUNCTION
+// CM0_FUNC_END truncdfsf2
+CM0_FUNC_END aeabi_d2f
+
+
+// float __aeabi_h2f(short hf)
+// Converts a half-precision float in $r0 to single-precision.
+// Rounding, overflow, and underflow conditions are impossible.
+// INF and ZERO are returned unmodified.
+.section .text.libgcc.h2f,"x"
+CM0_FUNC_START aeabi_h2f
+    CFI_START_FUNCTION
+
+        // Set up registers for __fp_normalize2().
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        // Save the mantissa and exponent.
+        lsls    r2,     r0,     #17
+
+        // Isolate the sign.
+        lsrs    r0,     #15
+        lsls    r0,     #31
+
+        // Align the exponent at bit[24] for normalization.
+        // If zero, return the original sign.
+        lsrs    r2,     #3
+        beq     LSYM(__h2f_return) // 8
+
+        // Split the exponent and mantissa into separate registers.
+        // This is the most efficient way to convert subnormals in the
+        //  half-precision form into normals in single-precision.
+        // This does add a leading implicit '1' to INF and NAN,
+        //  but that will be absorbed when the value is re-assembled.
+        bl      SYM(__fp_normalize2) __PLT__ // +4+8
+
+        // Set up the exponent bias.  For INF/NAN values, the bias is 223,
+        //  where the last '1' accounts for the implicit '1' in the mantissa.
+        adds    r2,     #(255 - 31 - 1)
+
+        // Test for INF/NAN.
+        cmp     r2,     #254
+        beq     LSYM(__h2f_assemble)
+
+        // For normal values, the bias should have been 111.
+        // However, this adjustment now is faster than branching.
+        subs    r2,     #((255 - 31 - 1) - (127 - 15 - 1))
+
+    LSYM(__h2f_assemble):
+        // Combine exponent and sign.
+        lsls    r2,     #23
+        adds    r0,     r2
+
+        // Combine mantissa.
+        lsrs    r3,     #8
+        add     r0,     r3
+
+    LSYM(__h2f_return):
+        pop     { rT, pc } // 34
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_h2f
+
+
+// short __aeabi_f2h(float f)
+// Converts a single-precision float in $r1 to half-precision,
+//  rounding to nearest, ties to even.
+// Values out of range become ZERO or INF; returns the upper 12 bits of NAN.
+// Values out of range are forced to either ZERO or INF.
+.section .text.libgcc.f2h,"x"
+CM0_FUNC_START aeabi_f2h
+    CFI_START_FUNCTION
+
+        // Set up the sign.
+        lsrs    r2,     r0,     #31
+        lsls    r2,     #15
+
+        // Save the exponent and mantissa.
+        // If ZERO, return the original sign.
+        lsls    r0,     #1
+        beq     LSYM(__f2h_return)
+
+        // Isolate the exponent, check for NAN.
+        lsrs    r1,     r0,     #24
+        cmp     r1,     #255
+        beq     LSYM(__f2h_indefinite)
+
+        // Check for overflow.
+        cmp     r1,     #(127 + 15)
+        bhi     LSYM(__f2h_overflow)
+
+        // Isolate the mantissa, adding back the implicit '1'.
+        lsls    r0,     #8
+        adds    r0,     #1
+        rors    r0,     r0 // 12
+
+        // Adjust exponent bias for half-precision, including '1' to
+        //  account for the mantissa's implicit '1'.
+        subs    r1,     #(127 - 15 + 1)
+        bmi     LSYM(__f2h_underflow)
+
+        // Combine the exponent and sign.
+        lsls    r1,     #10
+        adds    r2,     r1
+
+        // Split the mantissa (11 bits) and remainder (13 bits).
+        lsls    r3,     r0,     #12
+        lsrs    r0,     #21
+
+     LSYM(__f2h_round):
+        // If the carry bit is '0', always round down.
+        bcc     LSYM(__f2h_return)
+
+        // Carry was set.  If a tie (no remainder) and the
+        //  LSB of the result are '0', round down (to even).
+        lsls    r1,     r0,     #31
+        orrs    r1,     r3
+        beq     LSYM(__f2h_return)
+
+        // Round up, ties to even.
+        adds    r0,     #1
+
+     LSYM(__f2h_return):
+        // Combine mantissa and exponent.
+        adds    r0,     r2
+        RETx    lr // 25 - 34
+
+    LSYM(__f2h_underflow):
+        // Align the remainder. The remainder consists of the last 12 bits
+        //  of the mantissa plus the magnitude of underflow.
+        movs    r3,     r0
+        adds    r1,     #12
+        lsls    r3,     r1
+
+        // Align the mantissa.  The MSB of the remainder must be
+        // shifted out into last the 'C' flag for rounding.
+        subs    r1,     #33
+        rsbs    r1,     #0
+        lsrs    r0,     r1
+        b       LSYM(__f2h_round) // 25
+
+    LSYM(__f2h_overflow):
+        // Create single-precision INF from which to construct half-precision.
+        movs    r0,     #255
+        lsls    r0,     #24 // 13
+
+    LSYM(__f2h_indefinite):
+        // Check for INF.
+        lsls    r3,     r0,     #8
+        beq     LSYM(__f2h_infinite)
+
+        // Set bit[8] to ensure a valid NAN without changing bit[9] (quiet).
+        adds    r2,     #128
+        adds    r2,     #128
+
+    LSYM(__f2h_infinite):
+        // Construct the result from the upper 22 bits of the mantissa
+        //  and the lower 5 bits of the exponent.
+        lsls    r0,     #3
+        lsrs    r0,     #17
+
+        // Combine with the sign (and possibly NAN flag).
+        orrs    r0,     r2
+        RETx    lr // 23
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_f2h
+
+
+#endif // __BUILD_CM0_FPLIB
diff -ruN libgcc/config/arm/cm0/fdiv.S libgcc/config/arm/cm0/fdiv.S
--- libgcc/config/arm/cm0/fdiv.S	1969-12-31 16:00:00.000000000 -0800
+++ libgcc/config/arm/cm0/fdiv.S	2020-11-12 09:46:26.939907002 -0800
@@ -0,0 +1,258 @@
+/* fdiv.S: Cortex M0 optimized 32-bit float division
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef __BUILD_CM0_FPLIB // whole file
+
+
+// float __aeabi_fdiv(float, float)
+// Returns $r0 after division by $r1.
+.section .text.libgcc.fdiv,"x"
+CM0_FUNC_START aeabi_fdiv
+FUNC_ALIAS divsf3 aeabi_fdiv
+    CFI_START_FUNCTION
+
+        // Standard registers, compatible with exception handling.
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        // Save for the sign of the result.
+        movs    r3,     r1
+        eors    r3,     r0
+        lsrs    rT,     r3,     #31
+        lsls    rT,     #31
+        mov     ip,     rT
+
+        // Set up INF for comparison.
+        movs    rT,     #255
+        lsls    rT,     #24
+
+        // Check for divide by 0.  Automatically catches 0/0.
+        lsls    r2,     r1,     #1
+        beq     LSYM(__fdiv_by_zero)
+
+        // Check for INF/INF, or a number divided by itself.
+        lsls    r3,     #1
+        beq     LSYM(__fdiv_equal)
+
+        // Check the numerator for INF/NAN.
+        eors    r3,     r2
+        cmp     r3,     rT
+        bhs     LSYM(__fdiv_special1)
+
+        // Check the denominator for INF/NAN.
+        cmp     r2,     rT
+        bhs     LSYM(__fdiv_special2)
+
+        // Check the numerator for zero.
+        cmp     r3,     #0
+        beq     SYM(__fp_zero)
+
+        // No action if the numerator is subnormal.
+        //  The mantissa will normalize naturally in the division loop.
+        lsls    r0,     #9
+        lsrs    r1,     r3,     #24
+        beq     LSYM(__fdiv_denominator)
+
+        // Restore the numerator's implicit '1'.
+        adds    r0,     #1
+        rors    r0,     r0 // 26
+
+    LSYM(__fdiv_denominator):
+        // The denominator must be normalized and left aligned.
+        bl      SYM(__fp_normalize2) // +4+8
+
+        // 25 bits of precision will be sufficient.
+        movs    rT,     #64
+
+        // Run division.
+        bl      SYM(__internal_fdiv) // 41
+        b       SYM(__fp_assemble)
+
+    LSYM(__fdiv_equal):
+      #if defined(EXCEPTION_CODES) && EXCEPTION_CODES
+        movs    r3,     #(DIVISION_INF_BY_INF)
+      #endif
+
+        // The absolute value of both operands are equal, but not 0.
+        // If both operands are INF, create a new NAN.
+        cmp     r2,     rT
+        beq     SYM(__fp_exception)
+
+      #if defined(TRAP_NANS) && TRAP_NANS
+        // If both operands are NAN, return the NAN in $r0.
+        bhi     SYM(__fp_check_nan)
+      #else
+        bhi     LSYM(__fdiv_return)
+      #endif
+
+        // Return 1.0f, with appropriate sign.
+        movs    r0,     #127
+        lsls    r0,     #23
+        add     r0,     ip
+
+    LSYM(__fdiv_return):
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    LSYM(__fdiv_special2):
+        // The denominator is either INF or NAN, numerator is neither.
+        // Also, the denominator is not equal to 0.
+        // If the denominator is INF, the result goes to 0.
+        beq     SYM(__fp_zero)
+
+        // The only other option is NAN, fall through to branch.
+        mov     r0,     r1
+
+    LSYM(__fdiv_special1):
+      #if defined(TRAP_NANS) && TRAP_NANS
+        // The numerator is INF or NAN.  If NAN, return it directly.
+        bne     SYM(__fp_check_nan)
+      #else
+        bne     LSYM(__fdiv_return)
+      #endif
+
+        // If INF, the result will be INF if the denominator is finite.
+        // The denominator won't be either INF or 0,
+        //  so fall through the exception trap to check for NAN.
+        movs    r0,     r1
+
+    LSYM(__fdiv_by_zero):
+      #if defined(EXCEPTION_CODES) && EXCEPTION_CODES
+        movs    r3,     #(DIVISION_0_BY_0)
+      #endif
+
+        // The denominator is 0.
+        // If the numerator is also 0, the result will be a new NAN.
+        // Otherwise the result will be INF, with the correct sign.
+        lsls    r2,     r0,     #1
+        beq     SYM(__fp_exception)
+
+        // The result should be NAN if the numerator is NAN.  Otherwise,
+        //  the result is INF regardless of the numerator value.
+        cmp     r2,     rT
+
+      #if defined(TRAP_NANS) && TRAP_NANS
+        bhi     SYM(__fp_check_nan)
+      #else
+        bhi     LSYM(__fdiv_return)
+      #endif
+
+        // Recreate INF with the correct sign.
+        b       SYM(__fp_infinity)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END divsf3
+CM0_FUNC_END aeabi_fdiv
+
+
+// Division helper, possibly to be shared with atan2.
+// Expects the numerator mantissa in $r0, exponent in $r1,
+//  plus the denominator mantissa in $r3, exponent in $r2, and
+//  a bit pattern in $rT that controls the result precision.
+// Returns quotient in $r1, exponent in $r2, pseudo remainder in $r0.
+.section .text.libgcc.fdiv2,"x"
+CM0_FUNC_START internal_fdiv
+    CFI_START_FUNCTION
+
+        // Initialize the exponent, relative to bit[30].
+        subs    r2,     r1,     r2
+
+    SYM(__internal_fdiv2):
+        // The exponent should be (expN - 127) - (expD - 127) + 127.
+        // An additional offset of 25 is required to account for the
+        //  minimum number of bits in the result (before rounding).
+        // However, drop '1' because the offset is relative to bit[30],
+        //  while the result is calculated relative to bit[31].
+        adds    r2,     #(127 + 25 - 1)
+
+      #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+        // Dividing by a power of 2?
+        lsls    r1,     r3,     #1
+        beq     LSYM(__fdiv_simple) // 47
+      #endif
+
+        // Initialize the result.
+        eors    r1,     r1
+
+        // Clear the MSB, so that when the numerator is smaller than
+        //  the denominator, there is one bit free for a left shift.
+        // After a single shift, the numerator is guaranteed to be larger.
+        // The denominator ends up in r3, and the numerator ends up in r0,
+        //  so that the numerator serves as a psuedo-remainder in rounding.
+        // Shift the numerator one additional bit to compensate for the
+        //  pre-incrementing loop.
+        lsrs    r0,     #2
+        lsrs    r3,     #1 // 49
+
+    LSYM(__fdiv_loop):
+        // Once the MSB of the output reaches the MSB of the register,
+        //  the result has been calculated to the required precision.
+        lsls    r1,     #1
+        bmi     LSYM(__fdiv_break)
+
+        // Shift the numerator/remainder left to set up the next bit.
+        subs    r2,     #1
+        lsls    r0,     #1
+
+        // Test if the numerator/remainder is smaller than the denominator,
+        //  do nothing if it is.
+        cmp     r0,     r3
+        blo     LSYM(__fdiv_loop)
+
+        // If the numerator/remainder is greater or equal, set the next bit,
+        //  and subtract the denominator.
+        adds    r1,     rT
+        subs    r0,     r3
+
+        // Short-circuit if the remainder goes to 0.
+        // Even with the overhead of "subnormal" alignment,
+        //  this is usually much faster than continuing.
+        bne     LSYM(__fdiv_loop) // 11*25
+
+        // Compensate the alignment of the result.
+        // The remainder does not need compensation, it's already 0.
+        lsls    r1,     #1 // 61 + 202 (underflow)
+
+    LSYM(__fdiv_break):
+        RETx    lr  // 331 + 30,
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+    LSYM(__fdiv_simple):
+        // The numerator becomes the result, with a remainder of 0.
+        movs    r1,     r0
+        eors    r0,     r0
+        subs    r2,     #25
+        RETx    lr   // 53 + 30
+  #endif
+
+    CFI_END_FUNCTION
+CM0_FUNC_END internal_fdiv
+
+
+#endif // __BUILD_CM0_FPLIB
diff -ruN libgcc/config/arm/cm0/ffixed.S libgcc/config/arm/cm0/ffixed.S
--- libgcc/config/arm/cm0/ffixed.S	1969-12-31 16:00:00.000000000 -0800
+++ libgcc/config/arm/cm0/ffixed.S	2020-11-12 09:46:26.943906976 -0800
@@ -0,0 +1,340 @@
+/* ffixed.S: Cortex M0 optimized float->int conversion
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef __BUILD_CM0_FPLIB // whole file
+
+// int __aeabi_f2iz(float)
+// Converts a float in $r0 to signed integer, rounding toward 0.
+// Values out of range are forced to either INT_MAX or INT_MIN.
+// NAN becomes zero.
+.section .text.libgcc.f2iz,"x"
+CM0_FUNC_START aeabi_f2iz
+FUNC_ALIAS fixsfsi aeabi_f2iz
+    CFI_START_FUNCTION
+
+  #if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+        // Flag for unsigned conversion.
+        movs    r1,     #33
+        b       LSYM(__real_f2lz)
+  #else
+        // Flag for signed conversion.
+        movs    r3,     #1
+
+
+    LSYM(__real_f2iz):
+        // Isolate the sign of the result.
+        asrs    r1,     r0,     #31
+        lsls    r0,     #1
+
+  #if defined(FP_EXCEPTION) && FP_EXCEPTION
+        // Check for zero to avoid spurious underflow exception on -0.
+        beq     LSYM(__f2iz_return)
+  #endif
+
+        // Isolate the exponent.
+        lsrs    r2,     r0,     #24
+
+  #if defined(TRAP_NANS) && TRAP_NANS
+        // Test for NAN.
+        // Otherwise, NAN will be converted like +/-INF.
+        cmp     r2,     #255
+        beq     LSYM(__f2iz_nan)
+  #endif
+
+        // Extract the mantissa and restore the implicit '1'. Technically,
+        //  this is wrong for subnormals, but they flush to zero regardless.
+        lsls    r0,     #8
+        adds    r0,     #1
+        rors    r0,     r0
+
+        // Calculate mantissa alignment. Given the implicit '1' in bit[31]:
+        //  * An exponent less than 127 will automatically flush to 0.
+        //  * An exponent of 127 will result in a shift of 31.
+        //  * An exponent of 128 will result in a shift of 30.
+        //  *  ...
+        //  * An exponent of 157 will result in a shift of 1.
+        //  * An exponent of 158 will result in no shift at all.
+        //  * An exponent larger than 158 will result in overflow.
+        rsbs    r2,     #0
+        adds    r2,     #158
+
+        // When the shift is less than minimum, the result will overflow.
+        // The only signed value to fail this test is INT_MIN (0x80000000),
+        //  but it will be returned correctly from the overflow branch.
+        cmp     r2,     r3
+        blt     LSYM(__f2iz_overflow)
+
+        // If unsigned conversion of a negative value, also overflow.
+        // Would also catch -0.0f if not handled earlier.
+        cmn     r3,     r1
+        blt     LSYM(__f2iz_overflow)
+
+  #if defined(FP_EXCEPTION) && FP_EXCEPTION
+        // Save a copy for remainder testing
+        movs    r3,     r0
+  #endif
+
+        // Truncate the fraction.
+        lsrs    r0,     r2
+
+        // Two's complement negation, if applicable.
+        // Bonus: the sign in $r1 provides a suitable long long result.
+        eors    r0,     r1
+        subs    r0,     r1
+
+  #if defined(FP_EXCEPTION) && FP_EXCEPTION
+        // If any bits set in the remainder, raise FE_INEXACT
+        rsbs    r2,     #0
+        adds    r2,     #32
+        lsls    r3,     r2
+        bne     LSYM(__f2iz_inexact)
+  #endif
+
+    LSYM(__f2iz_return):
+        RETx    lr
+
+    LSYM(__f2iz_overflow):
+        // Positive unsigned integers (r1 == 0, r3 == 0), return 0xFFFFFFFF.
+        // Negative unsigned integers (r1 == -1, r3 == 0), return 0x00000000.
+        // Positive signed integers (r1 == 0, r3 == 1), return 0x7FFFFFFF.
+        // Negative signed integers (r1 == -1, r3 == 1), return 0x80000000.
+        // TODO: FE_INVALID exception, (but not for -2^31).
+        mvns    r0,     r1
+        lsls    r3,     #31
+        eors    r0,     r3
+        RETx    lr
+
+  #if defined(FP_EXCEPTION) && FP_EXCEPTION
+    LSYM(__f2iz_inexact):
+        // TODO: Another class of exceptions that doesn't overwrite $r0.
+        bkpt    #0
+
+      #if defined(EXCEPTION_CODES) && EXCEPTION_CODES
+        movs    r3,     #(CAST_INEXACT)
+      #endif
+
+        b       SYM(__fp_exception)
+  #endif
+
+    LSYM(__f2iz_nan):
+        // Check for INF
+        lsls    r2,     r0,     #9
+        beq     LSYM(__f2iz_overflow)
+
+  #if defined(FP_EXCEPTION) && FP_EXCEPTION
+      #if defined(EXCEPTION_CODES) && EXCEPTION_CODES
+        movs    r3,     #(CAST_UNDEFINED)
+      #endif
+
+        b       SYM(__fp_exception)
+  #else
+
+  #endif
+
+      #if defined(TRAP_NANS) && TRAP_NANS
+
+        // TODO: Extend to long long
+
+        // TODO: bl  fp_check_nan
+      #endif
+
+        // Return long long 0 on NAN.
+        eors    r0,     r0
+        eors    r1,     r1
+        RETx    lr
+
+  #endif // !__OPTIMIZE_SIZE__
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fixsfsi
+CM0_FUNC_END aeabi_f2iz
+
+
+// unsigned int __aeabi_f2uiz(float)
+// Converts a float in $r0 to unsigned integer, rounding toward 0.
+// Values out of range are forced to UINT_MAX.
+// Negative values and NAN all become zero.
+.section .text.libgcc.f2uiz,"x"
+CM0_FUNC_START aeabi_f2uiz
+FUNC_ALIAS fixunssfsi aeabi_f2uiz
+    CFI_START_FUNCTION
+
+  #if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+        // Flag for unsigned conversion.
+        movs    r1,     #32
+        b       LSYM(__real_f2lz)
+  #else
+        // Flag for unsigned conversion.
+        movs    r3,     #0
+        b       LSYM(__real_f2iz)
+  #endif // !__OPTIMIZE_SIZE__
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fixunssfsi
+CM0_FUNC_END aeabi_f2uiz
+
+
+// long long aeabi_f2lz(float)
+// Converts a float in $r0 to a 64 bit integer in $r1:$r0, rounding toward 0.
+// Values out of range are forced to either INT64_MAX or INT64_MIN.
+// NAN becomes zero.
+.section .text.libgcc.f2lz,"x"
+CM0_FUNC_START aeabi_f2lz
+FUNC_ALIAS fixsfdi aeabi_f2lz
+    CFI_START_FUNCTION
+
+        movs    r1,     #1
+
+    LSYM(__real_f2lz):
+        // Split the sign of the result from the mantissa/exponent field.
+        // Handle +/-0 specially to avoid spurious exceptions.
+        asrs    r3,     r0,     #31
+        lsls    r0,     #1
+        beq     LSYM(__f2lz_zero)
+
+        // If unsigned conversion of a negative value, also overflow.
+        // Specifically, is the LSB of $r1 clear when $r3 is equal to '-1'?
+        //
+        // $r3 (sign)   >=     $r2 (flag)
+        // 0xFFFFFFFF   false   0x00000000
+        // 0x00000000   true    0x00000000
+        // 0xFFFFFFFF   true    0x80000000
+        // 0x00000000   true    0x80000000
+        //
+        // (NOTE: This test will also trap -0.0f, unless handled earlier.)
+/****/  lsls    r2,     r1,     #31
+        cmp     r3,     r2
+        blt     LSYM(__f2lz_overflow)
+
+        // Isolate the exponent.
+        lsrs    r2,     r0,     #24
+
+//   #if defined(TRAP_NANS) && TRAP_NANS
+//         // Test for NAN.
+//         // Otherwise, NAN will be converted like +/-INF.
+//         cmp     r2,     #255
+//         beq     LSYM(__f2lz_nan)
+//   #endif
+
+        // Calculate mantissa alignment. Given the implicit '1' in bit[31]:
+        //  * An exponent less than 127 will automatically flush to 0.
+        //  * An exponent of 127 will result in a shift of 63.
+        //  * An exponent of 128 will result in a shift of 62.
+        //  *  ...
+        //  * An exponent of 189 will result in a shift of 1.
+        //  * An exponent of 190 will result in no shift at all.
+        //  * An exponent larger than 190 will result in overflow
+        //     (189 in the case of signed integers).
+        rsbs    r2,     #0
+        adds    r2,     #190
+        // When the shift is less than minimum, the result will overflow.
+        // The only signed value to fail this test is INT_MIN (0x80000000),
+        //  but it will be returned correctly from the overflow branch.
+        cmp     r2,     r1
+        blt     LSYM(__f2lz_overflow)
+
+        // Extract the mantissa and restore the implicit '1'. Technically,
+        //  this is wrong for subnormals, but they flush to zero regardless.
+        lsls    r0,     #8
+        adds    r0,     #1
+        rors    r0,     r0
+
+        // Calculate the upper word.
+        // If the shift is greater than 32, gives an automatic '0'.
+/**/    movs    r1,     r0
+/**/    lsrs    r1,     r2
+
+        // Reduce the shift for the lower word.
+        // If the original shift was less than 32, the result may be split
+        //  between the upper and lower words.
+/**/    subs    r2,     #32 // 18
+/**/    blt     LSYM(__f2lz_split)
+
+        // Shift is still positive, keep moving right.
+        lsrs    r0,     r2
+
+        // TODO: Remainder test.
+        // $r1 is technically free, as long as it's zero by the time
+        //  this is over.
+
+    LSYM(__f2lz_return):
+        // Two's complement negation, if the original was negative.
+        eors    r0,     r3
+/**/    eors    r1,     r3
+        subs    r0,     r3
+/**/    sbcs    r1,     r3
+        RETx    lr // 27 - 33
+
+    LSYM(__f2lz_split):
+        // Shift was negative, calculate the remainder
+        rsbs    r2,     #0
+        lsls    r0,     r2
+        b       LSYM(__f2lz_return)
+
+    LSYM(__f2lz_zero):
+        eors    r1,     r1
+        RETx    lr
+
+    LSYM(__f2lz_overflow):
+        // Positive unsigned integers (r3 == 0, r1 == 0), return 0xFFFFFFFF.
+        // Negative unsigned integers (r3 == -1, r1 == 0), return 0x00000000.
+        // Positive signed integers (r3 == 0, r1 == 1), return 0x7FFFFFFF.
+        // Negative signed integers (r3 == -1, r1 == 1), return 0x80000000.
+        // TODO: FE_INVALID exception, (but not for -2^63).
+        mvns    r0,     r3
+
+        // For 32-bit results
+/***/   lsls    r2,     r1,     #26
+        lsls    r1,     #31
+/***/   ands    r2,     r1
+/***/   eors    r0,     r2
+
+//    LSYM(__f2lz_zero):
+        eors    r1,     r0
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fixsfdi
+CM0_FUNC_END aeabi_f2lz
+
+
+// unsigned long long __aeabi_f2ulz(float)
+// Converts a float in $r0 to a 64 bit integer in $r1:$r0, rounding toward 0.
+// Values out of range are forced to UINT64_MAX.
+// Negative values and NAN all become zero.
+.section .text.libgcc.f2ulz,"x"
+CM0_FUNC_START aeabi_f2ulz
+FUNC_ALIAS fixunssfdi aeabi_f2ulz
+    CFI_START_FUNCTION
+
+        eors    r1,     r1
+        b       LSYM(__real_f2lz)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fixunssfdi
+CM0_FUNC_END aeabi_f2ulz
+
+
+#endif // __BUILD_CM0_FPLIB
diff -ruN libgcc/config/arm/cm0/ffloat.S libgcc/config/arm/cm0/ffloat.S
--- libgcc/config/arm/cm0/ffloat.S	1969-12-31 16:00:00.000000000 -0800
+++ libgcc/config/arm/cm0/ffloat.S	2020-11-10 21:33:20.981886999 -0800
@@ -0,0 +1,96 @@
+/* ffixed.S: Cortex M0 optimized int->float conversion
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef __BUILD_CM0_FPLIB // whole file
+
+// float __aeabi_i2f(int)
+// Converts a signed integer in $r0 to float.
+.section .text.libgcc.il2f,"x"
+CM0_FUNC_START aeabi_i2f
+FUNC_ALIAS floatsisf aeabi_i2f
+    CFI_START_FUNCTION
+
+        // Sign extension to long long.
+        asrs    r1,     r0,     #31
+
+// float __aeabi_l2f(long long)
+// Converts a signed 64-bit integer in $r1:$r0 to a float in $r0.
+CM0_FUNC_START aeabi_l2f
+FUNC_ALIAS floatdisf aeabi_l2f
+
+        // Save the sign.
+        asrs    r3,     r1,     #31
+
+        // Absolute value of the input.
+        eors    r0,     r3
+        eors    r1,     r3
+        subs    r0,     r3
+        sbcs    r1,     r3
+
+        b       LSYM(__internal_uil2f) // 8, 9
+
+    CFI_END_FUNCTION
+CM0_FUNC_END floatdisf
+CM0_FUNC_END aeabi_l2f
+CM0_FUNC_END floatsisf
+CM0_FUNC_END aeabi_i2f
+
+
+// float __aeabi_ui2f(unsigned)
+// Converts a unsigned integer in $r0 to float.
+.section .text.libgcc.uil2f,"x"
+CM0_FUNC_START aeabi_ui2f
+FUNC_ALIAS floatunsisf aeabi_ui2f
+    CFI_START_FUNCTION
+
+        // Convert to unsigned long long with upper bits of 0.
+        eors    r1,     r1
+
+// float __aeabi_ul2f(unsigned long long)
+// Converts a unsigned 64-bit integer in $r1:$r0 to a float in $r0.
+CM0_FUNC_START aeabi_ul2f
+FUNC_ALIAS floatundisf aeabi_ul2f
+
+        // Sign is always positive.
+        eors    r3,     r3
+
+    LSYM(__internal_uil2f):
+        // Default exponent, relative to bit[30] of $r1.
+        movs    r2,     #(189)
+
+        // Format the sign.
+        lsls    r3,     #31
+        mov     ip,     r3
+
+        push    { rT, lr }
+        b       SYM(__fp_assemble) // { 10, 11, 18, 19 } + 30-227
+
+    CFI_END_FUNCTION
+CM0_FUNC_END floatundisf
+CM0_FUNC_END aeabi_ul2f
+CM0_FUNC_END floatunsisf
+CM0_FUNC_END aeabi_ui2f
+
+#endif // __BUILD_CM0_FPLIB
diff -ruN libgcc/config/arm/cm0/fmul.S libgcc/config/arm/cm0/fmul.S
--- libgcc/config/arm/cm0/fmul.S	1969-12-31 16:00:00.000000000 -0800
+++ libgcc/config/arm/cm0/fmul.S	2020-11-12 09:46:26.943906976 -0800
@@ -0,0 +1,214 @@
+/* fmul.S: Cortex M0 optimized 32-bit float multiplication
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef __BUILD_CM0_FPLIB // whole file
+
+// float __aeabi_fmul(float, float)
+// Returns $r0 after multiplication by $r1.
+.section .text.libgcc.fmul,"x"
+CM0_FUNC_START aeabi_fmul
+FUNC_ALIAS mulsf3 aeabi_fmul
+    CFI_START_FUNCTION
+
+        // Standard registers, compatible with exception handling.
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        // Save the sign of the result.
+        movs    rT,     r1
+        eors    rT,     r0
+        lsrs    rT,     #31
+        lsls    rT,     #31
+        mov     ip,     rT
+
+        // Set up INF for comparison.
+        movs    rT,     #255
+        lsls    rT,     #24
+
+        // Check for multiplication by zero.
+        lsls    r2,     r0,     #1
+        beq     LSYM(__fmul_zero1)
+
+        lsls    r3,     r1,     #1
+        beq     LSYM(__fmul_zero2)
+
+        // Check for INF/NAN.
+        cmp     r3,     rT
+        bhs     LSYM(__fmul_special2)
+
+        cmp     r2,     rT
+        bhs     LSYM(__fmul_special1)
+
+        // Because neither operand is INF/NAN, the result will be finite.
+        // It is now safe to modify the original operand registers.
+        lsls    r0,     #9
+
+        // Isolate the first exponent.  When normal, add back the implicit '1'.
+        // The result is always aligned with the MSB in bit [31].
+        // Subnormal mantissas remain effectively multiplied by 2x relative to
+        //  normals, but this works because the weight of a subnormal is -126.
+        lsrs    r2,     #24
+        beq     LSYM(__fmul_normalize2)
+        adds    r0,     #1
+        rors    r0,     r0
+
+    LSYM(__fmul_normalize2):
+        // IMPORTANT: exp10i() jumps in here!
+        // Repeat for the mantissa of the second operand.
+        // Short-circuit when the mantissa is 1.0, as the
+        //  first mantissa is already prepared in $r0
+        lsls    r1,     #9
+
+        // When normal, add back the implicit '1'.
+        lsrs    r3,     #24
+        beq     LSYM(__fmul_go)
+        adds    r1,     #1
+        rors    r1,     r1
+
+    LSYM(__fmul_go):
+        // Calculate the final exponent, relative to bit [30].
+        adds    rT,     r2,     r3
+        subs    rT,     #127 // 30
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+        // Short-circuit on multiplication by powers of 2.
+        lsls    r3,     r0,     #1
+        beq     LSYM(__fmul_simple1)
+
+        lsls    r3,     r1,     #1
+        beq     LSYM(__fmul_simple2)
+  #endif
+
+        // Save $ip across the call.
+        // (Alternatively, could push/pop a separate register,
+        //  but the four instructions here are equivally fast)
+        //  without imposing on the stack.
+        add     rT,     ip
+
+        // 32x32 unsigned multiplication, 64 bit result.
+        bl      SYM(__umulsidi3) __PLT__ // +22
+
+        // Separate the saved exponent and sign.
+        sxth    r2,     rT
+        subs    rT,     r2
+        mov     ip,     rT
+
+        b       SYM(__fp_assemble) // 62
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+    LSYM(__fmul_simple2):
+        // Move the high bits of the result to $r1.
+        movs    r1,     r0
+
+    LSYM(__fmul_simple1):
+        // Clear the remainder.
+        eors    r0,     r0
+
+        // Adjust mantissa to match the exponent, relative to bit[30].
+        subs    r2,     rT,     #1
+        b       SYM(__fp_assemble) // 42
+  #endif
+
+    LSYM(__fmul_zero1):
+        // $r0 was equal to 0, set up to check $r1 for INF/NAN.
+        lsls    r2,     r1,     #1
+
+    LSYM(__fmul_zero2):
+      #if defined(EXCEPTION_CODES) && EXCEPTION_CODES
+        movs    r3,     #(INFINITY_TIMES_ZERO)
+      #endif
+
+        // Check the non-zero operand for INF/NAN.
+        // If NAN, it should be returned.
+        // If INF, the result should be NAN.
+        // Otherwise, the result will be +/-0.
+        cmp     r2,     rT
+        beq     SYM(__fp_exception)
+
+        // If the second operand is finite, the result is 0.
+        blo     SYM(__fp_zero)
+
+      #if defined(STRICT_NANS) && STRICT_NANS
+        // Restore values that got mixed in zero testing, then go back
+        //  to sort out which one is the NAN.
+        lsls    r3,     r1,     #1
+        lsls    r2,     r0,     #1
+      #elif defined(TRAP_NANS) && TRAP_NANS
+        // Return NAN with the sign bit cleared.
+        lsrs    r0,     r2,     #1
+        b       SYM(__fp_check_nan)
+      #else
+        lsrs    r0,     r2,     #1
+        // Return NAN with the sign bit cleared.
+        pop     { rT, pc }
+                .cfi_restore_state
+      #endif
+
+    LSYM(__fmul_special2):
+        // $r1 is INF/NAN.  In case of INF, check $r0 for NAN.
+        cmp     r2,     rT
+
+      #if defined(TRAP_NANS) && TRAP_NANS
+        // Force swap if $r0 is not NAN.
+        bls     LSYM(__fmul_swap)
+
+        // $r0 is NAN, keep if $r1 is INF
+        cmp     r3,     rT
+        beq     LSYM(__fmul_special1)
+
+        // Both are NAN, keep the smaller value (more likely to signal).
+        cmp     r2,     r3
+      #endif
+
+        // Prefer the NAN already in $r0.
+        //  (If TRAP_NANS, this is the smaller NAN).
+        bhi     LSYM(__fmul_special1)
+
+    LSYM(__fmul_swap):
+        movs    r0,     r1
+
+    LSYM(__fmul_special1):
+        // $r0 is either INF or NAN.  $r1 has already been examined.
+        // Flags are already set correctly.
+        lsls    r2,     r0,     #1
+        cmp     r2,     rT
+        beq     SYM(__fp_infinity)
+
+      #if defined(TRAP_NANS) && TRAP_NANS
+        b       SYM(__fp_check_nan)
+      #else
+        pop     { rT, pc }
+                .cfi_restore_state
+      #endif
+
+    CFI_END_FUNCTION
+CM0_FUNC_END mulsf3
+CM0_FUNC_END aeabi_fmul
+
+#endif // __BUILD_CM0_FPLIB
diff -ruN libgcc/config/arm/cm0/fneg.S libgcc/config/arm/cm0/fneg.S
--- libgcc/config/arm/cm0/fneg.S	1969-12-31 16:00:00.000000000 -0800
+++ libgcc/config/arm/cm0/fneg.S	2020-11-10 21:33:20.985886867 -0800
@@ -0,0 +1,75 @@
+/* fneg.S: Cortex M0 optimized 32-bit float negation
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef __BUILD_CM0_FPLIB // whole file
+
+// float __aeabi_fneg(float) [obsolete]
+// The argument and result are in $r0.
+// Uses $r1 and $r2 as scratch registers.
+.section .text.libgcc.fneg,"x"
+CM0_FUNC_START aeabi_fneg
+FUNC_ALIAS negsf2 aeabi_fneg
+    CFI_START_FUNCTION
+
+  #if (defined(STRICT_NANS) && STRICT_NANS) || \
+      (defined(TRAP_NANS) && TRAP_NANS)
+        // Check for NAN.
+        lsls    r1,     r0,     #1
+        movs    r2,     #255
+        lsls    r2,     #24
+        cmp     r1,     r2
+
+      #if defined(TRAP_NANS) && TRAP_NANS
+        blo     SYM(__fneg_nan)
+      #else
+        blo     LSYM(__fneg_return)
+      #endif
+  #endif
+
+        // Flip the sign.
+        movs    r1,     #1
+        lsls    r1,     #31
+        eors    r0,     r1
+
+    LSYM(__fneg_return):
+        RETx    lr
+
+  #if defined(TRAP_NANS) && TRAP_NANS
+    LSYM(__fneg_nan):
+        // Set up registers for exception handling.
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        b       SYM(fp_check_nan)
+  #endif
+
+    CFI_END_FUNCTION
+CM0_FUNC_END negsf2
+CM0_FUNC_END aeabi_fneg
+
+#endif // __BUILD_CM0_FPLIB
diff -ruN libgcc/config/arm/cm0/fplib.h libgcc/config/arm/cm0/fplib.h
--- libgcc/config/arm/cm0/fplib.h	1969-12-31 16:00:00.000000000 -0800
+++ libgcc/config/arm/cm0/fplib.h	2020-11-12 09:45:36.032217491 -0800
@@ -0,0 +1,80 @@
+/* fplib.h: Cortex M0 optimized 64-bit header definitions 
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#ifndef __CM0_FPLIB_H
+#define __CM0_FPLIB_H 
+
+/* Enable exception interrupt handler.  
+   Exception implementation is opportunistic, and not fully tested.  */
+#define TRAP_EXCEPTIONS (0)
+#define EXCEPTION_CODES (0)
+
+/* Perform extra checks to avoid modifying the sign bit of NANs */
+#define STRICT_NANS (0)
+
+/* Trap signaling NANs regardless of context. */
+#define TRAP_NANS (0)
+
+/* TODO: Define service numbers according to the handler requirements */ 
+#define SVC_TRAP_NAN (0)
+#define SVC_FP_EXCEPTION (0)
+#define SVC_DIVISION_BY_ZERO (0)
+
+/* Push extra registers when required for 64-bit stack alignment */
+#define DOUBLE_ALIGN_STACK (0)
+
+/* Define various exception codes.  These don't map to anything in particular */
+#define SUBTRACTED_INFINITY (20)
+#define INFINITY_TIMES_ZERO (21)
+#define DIVISION_0_BY_0 (22)
+#define DIVISION_INF_BY_INF (23)
+#define UNORDERED_COMPARISON (24)
+#define CAST_OVERFLOW (25)
+#define CAST_INEXACT (26)
+#define CAST_UNDEFINED (27)
+
+/* Exception control for quiet NANs.
+   If TRAP_NAN support is enabled, signaling NANs always raise exceptions. */
+.equ FCMP_RAISE_EXCEPTIONS, 16
+.equ FCMP_NO_EXCEPTIONS,    0
+
+/* These assignments are significant.  See implementation.
+   They must be shared for use in libm functions.  */
+.equ FCMP_3WAY, 1
+.equ FCMP_LT, 2
+.equ FCMP_EQ, 4
+.equ FCMP_GT, 8
+
+.equ FCMP_GE, (FCMP_EQ | FCMP_GT)
+.equ FCMP_LE, (FCMP_LT | FCMP_EQ)
+.equ FCMP_NE, (FCMP_LT | FCMP_GT)
+
+/* These flags affect the result of unordered comparisons.  See implementation.  */
+.equ FCMP_UN_THREE,     128
+.equ FCMP_UN_POSITIVE,  64
+.equ FCMP_UN_ZERO,      32
+.equ FCMP_UN_NEGATIVE,  0
+
+#endif /* __CM0_FPLIB_H */
diff -ruN libgcc/config/arm/cm0/futil.S libgcc/config/arm/cm0/futil.S
--- libgcc/config/arm/cm0/futil.S	1969-12-31 16:00:00.000000000 -0800
+++ libgcc/config/arm/cm0/futil.S	2020-11-12 09:46:26.943906976 -0800
@@ -0,0 +1,407 @@
+/* futil.S: Cortex M0 optimized 32-bit common routines
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef __BUILD_CM0_FPLIB // whole file
+
+
+// Internal function, decomposes the unsigned float in $r2.
+// The exponent will be returned in $r2, the mantissa in $r3.
+// If subnormal, the mantissa will be normalized, so that
+//  the MSB of the mantissa (if any) will be aligned at bit[31].
+// Preserves $r0 and $r1, uses $rT as scratch space.
+.section .text.libgcc.normf,"x"
+CM0_FUNC_START fp_normalize2
+    CFI_START_FUNCTION
+
+        // Extract the mantissa.
+        lsls    r3,     r2,     #8
+
+        // Extract the exponent.
+        lsrs    r2,     #24
+        beq     SYM(__fp_lalign2)
+
+        // Restore the mantissa's implicit '1'.
+        adds    r3,     #1
+        rors    r3,     r3
+
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fp_normalize2
+
+
+// Internal function, aligns $r3 so the MSB is aligned in bit[31].
+// Simultaneously, subtracts the shift from the exponent in $r2
+.section .text.libgcc.alignf,"x"
+CM0_FUNC_START fp_lalign2
+    CFI_START_FUNCTION
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+        // Unroll the loop, similar to __clzsi2().
+        lsrs    rT,     r3,     #16
+        bne     LSYM(__align8)
+        subs    r2,     #16
+        lsls    r3,     #16
+
+    LSYM(__align8):
+        lsrs    rT,     r3,     #24
+        bne     LSYM(__align4)
+        subs    r2,     #8
+        lsls    r3,     #8
+
+    LSYM(__align4):
+        lsrs    rT,     r3,     #28
+        bne     LSYM(__align2)
+        subs    r2,     #4
+        lsls    r3,     #4 // 12
+  #endif
+
+    LSYM(__align2):
+        // Refresh the state of the N flag before entering the loop.
+        tst     r3,     r3
+
+    LSYM(__align_loop):
+        // Test before subtracting to compensate for the natural exponent.
+        // The largest subnormal should have an exponent of 0, not -1.
+        bmi     LSYM(__align_return)
+        subs    r2,     #1
+        lsls    r3,     #1
+        bne     LSYM(__align_loop) // 6 * 31
+
+        // Not just a subnormal... 0!  By design, this should never happen.
+        // All callers of this internal function filter 0 as a special case.
+        // Was there an uncontrolled jump from somewhere else?  Cosmic ray?
+        eors    r2,     r2
+
+      #ifdef DEBUG
+        bkpt    #0
+      #endif
+
+    LSYM(__align_return):
+        RETx    lr // 24 - 192 (size), 19 - 36
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fp_lalign2
+
+
+// Internal function to combine mantissa, exponent, and sign. No return.
+// Expects the unsigned result in $r1.  To avoid underflow (slower),
+//  the MSB should be in bits [31:29].
+// Expects any remainder bits of the unrounded result in $r0.
+// Expects the exponent in $r2.  The exponent must be relative to bit[30].
+// Expects the sign of the result (and only the sign) in $ip.
+// Returns a correctly rounded floating value in $r0.
+.section .text.libgcc.assemblef,"x"
+CM0_FUNC_START fp_assemble
+    CFI_START_FUNCTION
+
+        // Work around CFI branching limitations.
+        .cfi_remember_state
+        .cfi_adjust_cfa_offset 8
+        .cfi_rel_offset rT, 0
+        .cfi_rel_offset lr, 4
+
+        // Examine the upper three bits [31:29] for underflow.
+        lsrs    r3,     r1,     #29
+        beq     LSYM(__fp_underflow)
+
+        // Convert bits [31:29] into an offset in the range of { 0, -1, -2 }.
+        // Right rotation aligns the MSB in bit [31], filling any LSBs with '0'.
+        lsrs    r3,     r1,     #1
+        mvns    r3,     r3
+        ands    r3,     r1
+        lsrs    r3,     #30
+        subs    r3,     #2
+        rors    r1,     r3
+
+        // Update the exponent, assuming the final result will be normal.
+        // The new exponent is 1 less than actual, to compensate for the
+        //  eventual addition of the implicit '1' in the result.
+        // If the final exponent becomes negative, proceed directly to gradual
+        //  underflow, without bothering to search for the MSB.
+        adds    r2,     r3
+
+CM0_FUNC_START fp_assemble2
+        bmi     LSYM(__fp_subnormal)
+
+    LSYM(__fp_normal):
+        // Check for overflow (remember the implicit '1' to be added later).
+        cmp     r2,     #254
+        bge     SYM(__fp_overflow) // +13 underflow
+
+        // Save LSBs for the remainder. Position doesn't matter any more,
+        //  these are just tiebreakers for round-to-even.
+        lsls    rT,     r1,     #25
+
+        // Align the final result.
+        lsrs    r1,     #8
+
+    LSYM(__fp_round):
+        // If carry bit is '0', always round down.
+        bcc     LSYM(__fp_return)
+
+        // The carry bit is '1'.  Round to nearest, ties to even.
+        // If either the saved remainder bits [6:0], the additional remainder
+        //  bits in $r1, or the final LSB is '1', round up.
+        lsls    r3,     r1,     #31
+        orrs    r3,     rT
+        orrs    r3,     r0
+        beq     LSYM(__fp_return)
+
+        // If rounding up overflows the result to 2.0, the result
+        //  is still correct, up to and including INF.
+        adds    r1,     #1
+
+    LSYM(__fp_return):
+        // Combine the mantissa and the exponent.
+        lsls    r2,     #23
+        adds    r0,     r1,     r2
+
+        // Combine with the saved sign.
+        // End of library call, return to user.
+        add     r0,     ip
+
+  #if defined(FP_EXCEPTIONS) && FP_EXCEPTIONS
+        // TODO: Underflow/inexact reporting IFF remainder
+  #endif
+
+        pop     { rT, pc } // +30 (typical)
+                .cfi_restore_state
+
+    LSYM(__fp_underflow):
+        // Set up to align the mantissa.
+        movs    r3,     r1 // 5
+        bne     LSYM(__fp_underflow2)
+
+        // MSB wasn't in the upper 32 bits, check the remainder.
+        // If the remainder is also zero, the result is +/-0.
+        movs    r3,     r0
+        beq     SYM(__fp_zero)
+
+        eors    r0,     r0
+        subs    r2,     #32
+
+    LSYM(__fp_underflow2):
+        // Save the pre-alignment exponent to align the remainder later.
+        movs    r1,     r2 // 9 - 11
+
+        // Align the mantissa with the MSB in bit[31].
+        bl      SYM(__fp_lalign2) // 37 - 207 (size), 32 - 51
+
+        // Calculate the actual remainder shift.
+        subs    rT,     r1,     r2
+
+        // Align the lower bits of the remainder.
+        movs    r1,     r0
+        lsls    r0,     rT
+
+        // Combine the upper bits of the remainder with the aligned value.
+        rsbs    rT,     #0
+        adds    rT,     #32
+        lsrs    r1,     rT
+        adds    r1,     r3
+
+        // The MSB is now aligned at bit[31] of $r1.
+        // If the net exponent is still positive, the result will be normal.
+        // Because this function is used by fmul(), there is a possibility
+        //  that the value is still wider than 24 bits; always round.
+        tst     r2,     r2
+        bpl     LSYM(__fp_normal)
+
+    LSYM(__fp_subnormal):
+        // The MSB is aligned at bit[31], with a net negative exponent.
+        // The mantissa will need to be shifted right by the absolute value of
+        //  the exponent, plus the normal shift of 8.
+
+        // If the negative shift is smaller than -25, there is no result,
+        //  no rounding, no anything.  Return signed zero.
+        // (Otherwise, the shift for result and remainder may wrap.)
+        adds    r2,     #25
+        bmi     SYM(__fp_inexact_zero)
+
+        // Save the extra bits for the remainder.
+        movs    rT,     r1
+        lsls    rT,     r2
+
+        // Shift the mantissa to create a subnormal.
+        // Just like normal, round to nearest, ties to even.
+        movs    r3,     #33
+        subs    r3,     r2
+        eors    r2,     r2
+
+        // This shift must be last, leaving the shifted LSB in the C flag.
+        lsrs    r1,     r3
+        b       LSYM(__fp_round)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fp_assemble2
+CM0_FUNC_END fp_assemble
+
+
+// Recreate INF with the appropriate sign.  No return.
+// Expects the sign of the result in $ip.
+.section .text.libgcc.infinityf,"x"
+CM0_FUNC_START fp_overflow
+    CFI_START_FUNCTION
+
+  #if defined(FP_EXCEPTIONS) && FP_EXCEPTIONS
+        // TODO: inexact/overflow exception
+  #endif
+
+CM0_FUNC_START fp_infinity
+
+        // Work around CFI branching limitations.
+        .cfi_remember_state
+        .cfi_adjust_cfa_offset 8
+        .cfi_rel_offset rT, 0
+        .cfi_rel_offset lr, 4
+
+        movs    r0,     #255
+        lsls    r0,     #23
+        add     r0,     ip
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fp_infinity
+CM0_FUNC_END fp_overflow
+
+
+// Recreate 0 with the appropriate sign.  No return.
+// Expects the sign of the result in $ip.
+.section .text.libgcc.zerof,"x"
+CM0_FUNC_START fp_inexact_zero
+
+  #if defined(FP_EXCEPTIONS) && FP_EXCEPTIONS
+        // TODO: inexact/underflow exception
+  #endif
+
+CM0_FUNC_START fp_zero
+    CFI_START_FUNCTION
+
+        // Work around CFI branching limitations.
+        .cfi_remember_state
+        .cfi_adjust_cfa_offset 8
+        .cfi_rel_offset rT, 0
+        .cfi_rel_offset lr, 4
+
+        // Return 0 with the correct sign.
+        mov     r0,     ip
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fp_zero
+CM0_FUNC_END fp_inexact_zero
+
+
+// Internal function to detect signaling NANs.  No return.
+// Uses $r2 as scratch space.
+.section .text.libgcc.checkf,"x"
+CM0_FUNC_START fp_check_nan2
+    CFI_START_FUNCTION
+
+        // Work around CFI branching limitations.
+        .cfi_remember_state
+        .cfi_adjust_cfa_offset 8
+        .cfi_rel_offset rT, 0
+        .cfi_rel_offset lr, 4
+
+
+CM0_FUNC_START fp_check_nan
+
+        // Check for quiet NAN.
+        lsrs    r2,     r0,     #23
+        bcs     LSYM(__quiet_nan)
+
+        // Raise exception.  Preserves both $r0 and $r1.
+        svc     #(SVC_TRAP_NAN)
+
+        // Quiet the resulting NAN.
+        movs    r2,     #1
+        lsls    r2,     #22
+        orrs    r0,     r2
+
+    LSYM(__quiet_nan):
+        // End of library call, return to user.
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fp_check_nan
+CM0_FUNC_END fp_check_nan2
+
+
+// Internal function to report floating point exceptions.  No return.
+// Expects the original argument(s) in $r0 (possibly also $r1).
+// Expects a code that describes the exception in $r3.
+.section .text.libgcc.exceptf,"x"
+CM0_FUNC_START fp_exception
+    CFI_START_FUNCTION
+
+        // Work around CFI branching limitations.
+        .cfi_remember_state
+        .cfi_adjust_cfa_offset 8
+        .cfi_rel_offset rT, 0
+        .cfi_rel_offset lr, 4
+
+        // Create a quiet NAN.
+        movs    r2,     #255
+        lsls    r2,     #1
+        adds    r2,     #1
+        lsls    r2,     #22
+
+      #if defined(EXCEPTION_CODES) && EXCEPTION_CODES
+        // Annotate the exception type in the NAN field.
+        // Make sure that the exception is in the valid region 
+        lsls    rT,     r3,     #13
+        orrs    r2,     rT
+      #endif
+
+// Exception handler that expects the result already in $r2,
+//  typically when the result is not going to be NAN.
+CM0_FUNC_START fp_exception2
+
+      #if defined(TRAP_EXCEPTIONS) && TRAP_EXCEPTIONS
+        svc     #(SVC_FP_EXCEPTION)
+      #endif
+
+        // TODO: Save exception flags in a static variable.
+
+        // Set up the result, now that the argument isn't required any more.
+        movs    r0,     r2
+
+        // HACK: for sincosf(), with 2 parameters to return.
+        movs    r1,     r2
+
+        // End of library call, return to user.
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fp_exception2
+CM0_FUNC_END fp_exception
+
+#endif // __BUILD_CM0_FPLIB
diff -ruN libgcc/config/arm/cm0/lcmp.S libgcc/config/arm/cm0/lcmp.S
--- libgcc/config/arm/cm0/lcmp.S	1969-12-31 16:00:00.000000000 -0800
+++ libgcc/config/arm/cm0/lcmp.S	2020-11-10 21:33:20.985886867 -0800
@@ -0,0 +1,96 @@
+/* lcmp.S: Cortex M0 optimized 64-bit integer comparison
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef __BUILD_CM0_FPLIB // whole file
+
+
+// int __aeabi_lcmp(long long, long long)
+// Compares the 64 bit signed values in $r1:$r0 and $r3:$r2.
+// Returns { -1, 0, +1 } in $r0 for ordering { <, ==, > }, respectively.
+.section .text.libgcc.lcmp,"x"
+CM0_FUNC_START aeabi_lcmp
+    CFI_START_FUNCTION
+
+        // Calculate the difference $r1:$r0 - $r3:$r2.
+        subs    r0,     r2
+        sbcs    r1,     r3
+
+        // With $r2 free, create a reference value without affecting flags.
+        mov     r2,     r3
+
+        // Finish the comparison.
+        blt     LSYM(__lcmp_lt)
+
+        // The reference difference ($r2 - $r3) will be +2 iff the first
+        //  argument is larger, otherwise $r2 remains equal to $r3.
+        adds    r2,     #2
+
+    LSYM(__lcmp_lt):
+        // Check for equality (all 64 bits).
+        orrs    r0,     r1
+        beq     LSYM(__lcmp_return)
+
+        // Convert the relative difference to an absolute value +/-1.
+        subs    r0,     r2,     r3
+        subs    r0,     #1
+
+    LSYM(__lcmp_return):
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_lcmp
+
+
+// int __aeabi_ulcmp(unsigned long long, unsigned long long)
+// Compares the 64 bit unsigned values in $r1:$r0 and $r3:$r2.
+// Returns { -1, 0, +1 } in $r0 for ordering { <, ==, > }, respectively.
+.section .text.libgcc.ulcmp,"x"
+CM0_FUNC_START aeabi_ulcmp
+    CFI_START_FUNCTION
+
+        // Calculate the 'C' flag.
+        subs    r0,     r2
+        sbcs    r1,     r3
+
+        // $r2 will contain -1 if the first value is smaller,
+        //  0 if the first value is larger or equal.
+        sbcs    r2,     r2
+
+        // Check for equality (all 64 bits).
+        orrs    r0,     r1
+        beq     LSYM(__ulcmp_return)
+
+        // $r0 should contain +1 or -1
+        movs    r0,     #1
+        orrs    r0,     r2
+
+    LSYM(__ulcmp_return):
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_ulcmp
+
+
+#endif // __BUILD_CM0_FPLIB
diff -ruN libgcc/config/arm/cm0/ldiv.S libgcc/config/arm/cm0/ldiv.S
--- libgcc/config/arm/cm0/ldiv.S	1969-12-31 16:00:00.000000000 -0800
+++ libgcc/config/arm/cm0/ldiv.S	2020-11-12 09:46:26.943906976 -0800
@@ -0,0 +1,413 @@
+/* ldiv.S: Cortex M0 optimized 64-bit integer division
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#ifdef __BUILD_CM0_FPLIB // whole file
+
+
+// long long __aeabi_ldiv0(long long)
+// Helper function for division by 0.
+.section .text.libgcc.ldiv0,"x"
+CM0_FUNC_START aeabi_ldiv0
+    CFI_START_FUNCTION
+
+      #if defined(TRAP_EXCEPTIONS) && TRAP_EXCEPTIONS
+        svc     #(SVC_DIVISION_BY_ZERO)
+      #endif
+
+        // Return { 0, numerator } for quotient and remainder.
+        movs    r2,     r0
+        movs    r3,     r1
+        eors    r0,     r0
+        eors    r1,     r1
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_ldiv0
+
+
+// long long __aeabi_ldiv(long long, long long)
+// lldiv_return __aeabi_ldivmod(long long, long long)
+// Returns signed $r1:$r0 after division by $r3:$r2.
+// Also returns the signed remainder in $r3:$r2.
+.section .text.libgcc.ldiv,"x"
+CM0_FUNC_START aeabi_ldivmod
+FUNC_ALIAS aeabi_ldiv aeabi_ldivmod
+FUNC_ALIAS divdi3 aeabi_ldivmod
+    CFI_START_FUNCTION
+
+        // Test the denominator for zero before pushing registers.
+        cmp     r2,     #0
+        bne     LSYM(__ldivmod_valid)
+
+        cmp     r3,     #0
+        beq     SYM(__aeabi_ldiv0)
+
+    LSYM(__ldivmod_valid):
+      #if defined(DOUBLE_ALIGN_STACK) && DOUBLE_ALIGN_STACK
+        push    { rP, rQ, rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 16
+                .cfi_rel_offset rP, 0
+                .cfi_rel_offset rQ, 4
+                .cfi_rel_offset rT, 8
+                .cfi_rel_offset lr, 12
+      #else
+        push    { rP, rQ, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 12
+                .cfi_rel_offset rP, 0
+                .cfi_rel_offset rQ, 4
+                .cfi_rel_offset lr, 8
+      #endif
+
+        // Absolute value of the numerator.
+        asrs    rP,     r1,     #31
+        eors    r0,     rP
+        eors    r1,     rP
+        subs    r0,     rP
+        sbcs    r1,     rP
+
+        // Absolute value of the denominator.
+        asrs    rQ,     r3,     #31
+        eors    r2,     rQ
+        eors    r3,     rQ
+        subs    r2,     rQ
+        sbcs    r3,     rQ
+
+        // Keep the XOR of signs for the quotient.
+        eors    rQ,     rP
+
+        // Handle division as unsigned.
+        bl      LSYM(__internal_uldivmod)
+
+        // Set the sign of the quotient.
+        eors    r0,     rQ
+        eors    r1,     rQ
+        subs    r0,     rQ
+        sbcs    r1,     rQ
+
+        // Set the sign of the remainder.
+        eors    r2,     rP
+        eors    r3,     rP
+        subs    r2,     rP
+        sbcs    r3,     rP
+
+    LSYM(__ldivmod_return):
+      #if defined(DOUBLE_ALIGN_STACK) && DOUBLE_ALIGN_STACK
+        pop     { rP, rQ, rT, pc }
+                .cfi_restore_state
+      #else
+        pop     { rP, rQ, pc }
+                .cfi_restore_state
+      #endif
+
+    CFI_END_FUNCTION
+CM0_FUNC_END divdi3
+CM0_FUNC_END aeabi_ldiv
+CM0_FUNC_END aeabi_ldivmod
+
+
+// unsigned long long __aeabi_uldiv(unsigned long long, unsigned long long)
+// ulldiv_return __aeabi_uldivmod(unsigned long long, unsigned long long)
+// Returns unsigned $r1:$r0 after division by $r3:$r2.
+// Also returns the remainder in $r3:$r2.
+.section .text.libgcc.uldiv,"x"
+CM0_FUNC_START aeabi_uldivmod
+FUNC_ALIAS aeabi_uldiv aeabi_uldivmod
+FUNC_ALIAS udivdi3 aeabi_uldivmod
+    CFI_START_FUNCTION
+
+        // Test the denominator for zero before changing the stack.
+        cmp     r3,     #0
+        bne     LSYM(__internal_uldivmod)
+
+        cmp     r2,     #0
+        beq     SYM(__aeabi_ldiv0)
+
+  #if defined(OPTIMIZE_SPEED) && OPTIMIZE_SPEED
+        // MAYBE: Optimize division by a power of 2
+  #endif
+
+    LSYM(__internal_uldivmod):
+        push    { rP, rQ, rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 16
+                .cfi_rel_offset rP, 0
+                .cfi_rel_offset rQ, 4
+                .cfi_rel_offset rT, 8
+                .cfi_rel_offset lr, 12
+
+        // Set up denominator shift, assuming a single width result.
+        movs    rP,     #32
+
+        // If the upper word of the denominator is 0 ...
+        tst     r3,     r3
+        bne     LSYM(__uldivmod_setup) // 12
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+        // ... and the upper word of the numerator is also 0,
+        //  single width division will be at least twice as fast.
+        tst     r1,     r1
+        beq     LSYM(__uldivmod_small)
+  #endif
+
+        // ... and the lower word of the denominator is less than or equal
+        //     to the upper word of the numerator ...
+        cmp     r1,     r2
+        blo     LSYM(__uldivmod_setup)
+
+        //  ... then the result will be double width, at least 33 bits.
+        // Set up a flag in $rP to seed the shift for the second word.
+        movs    r3,     r2
+        eors    r2,     r2
+        adds    rP,     #64
+
+    LSYM(__uldivmod_setup):
+        // Pre division: Shift the denominator as far as possible left
+        //  without making it larger than the numerator.
+        // Since search is destructive, first save a copy of the numerator.
+        mov     ip,     r0
+        mov     lr,     r1
+
+        // Set up binary search.
+        movs    rQ,     #16
+        eors    rT,     rT // 21
+
+    LSYM(__uldivmod_align):
+        // Maintain a secondary shift $rT = 32 - $rQ, making the overlapping
+        //  shifts between low and high words easier to construct.
+        adds    rT,     rQ
+
+        // Prefer dividing the numerator to multipying the denominator
+        //  (multiplying the denominator may result in overflow).
+        lsrs    r1,     rQ
+
+        // Measure the high bits of denominator against the numerator.
+        cmp     r1,     r3
+        blo     LSYM(__uldivmod_skip)
+        bhi     LSYM(__uldivmod_shift)
+
+        // If the high bits are equal, construct the low bits for checking.
+        mov     r1,     lr
+        lsls    r1,     rT
+
+        lsrs    r0,     rQ
+        orrs    r1,     r0
+
+        cmp     r1,     r2
+        blo     LSYM(__uldivmod_skip)
+
+    LSYM(__uldivmod_shift):
+        // Scale the denominator and the result together.
+        subs    rP,     rQ
+
+        // If the reduced numerator is still larger than or equal to the
+        //  denominator, it is safe to shift the denominator left.
+        movs    r1,     r2
+        lsrs    r1,     rT
+        lsls    r3,     rQ
+
+        lsls    r2,     rQ
+        orrs    r3,     r1
+
+    LSYM(__uldivmod_skip):
+        // Restore the numerator.
+        mov     r0,     ip
+        mov     r1,     lr
+
+        // Iterate until the shift goes to 0.
+        lsrs    rQ,     #1
+        bne     LSYM(__uldivmod_align) // (12 to 23) * 5
+
+        // Initialize the result (zero).
+        mov     ip,     rQ
+
+        // HACK: Compensate for the first word test.
+        lsls    rP,     #6 // 2, 140
+
+    LSYM(__uldivmod_word2):
+        // Is there another word?
+        lsrs    rP,     #6
+        beq     LSYM(__uldivmod_return) // +4
+
+        // Shift the calculated result by 1 word.
+        mov     lr,     ip
+        mov     ip,     rQ
+
+        // Set up the MSB of the next word of the quotient
+        movs    rQ,     #1
+        rors    rQ,     rP
+        b     LSYM(__uldivmod_entry) // 9 * 2, 149
+
+    LSYM(__uldivmod_loop):
+        // Divide the denominator by 2.
+        // It could be slightly faster to multiply the numerator,
+        //  but that would require shifting the remainder at the end.
+        lsls    rT,     r3,     #31
+        lsrs    r3,     #1
+        lsrs    r2,     #1
+        adds    r2,     rT
+
+        // Step to the next bit of the result.
+        lsrs    rQ,     #1
+        beq     LSYM(__uldivmod_word2) // (19 * 32 + 2) * 2, 140+9+610+9+610+4+12
+
+    LSYM(__uldivmod_entry):
+        // Test if the denominator is smaller, high byte first.
+        cmp     r1,     r3
+        blo     LSYM(__uldivmod_loop)
+        bhi     LSYM(__uldivmod_quotient)
+
+        cmp     r0,     r2
+        blo     LSYM(__uldivmod_loop)
+
+    LSYM(__uldivmod_quotient):
+        // Smaller denominator: the next bit of the quotient will be set.
+        add     ip,     rQ
+
+        // Subtract the denominator from the remainder.
+        // If the new remainder goes to 0, exit early.
+        subs    r0,     r2
+        sbcs    r1,     r3
+        bne     LSYM(__uldivmod_loop)
+
+        tst     r0,     r0
+        bne     LSYM(__uldivmod_loop)
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+        // Check whether there's still a second word to calculate.
+        lsrs    rP,     #6
+        beq     LSYM(__uldivmod_return)
+
+        // If so, shift the result left by a full word.
+        mov     lr,     ip
+        mov     ip,     r1 // zero
+  #else
+        eors    rQ,     rQ
+        b       LSYM(__uldivmod_word2)
+  #endif
+
+    LSYM(__uldivmod_return):
+        // Move the remainder to the second half of the result.
+        movs    r2,     r0
+        movs    r3,     r1
+
+        // Move the quotient to the first half of the result.
+        mov     r0,     ip
+        mov     r1,     lr
+
+        pop     { rP, rQ, rT, pc } // + 12
+                .cfi_restore_state
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+    LSYM(__uldivmod_small):
+        // Arrange arguments for 32-bit division.
+        movs    r1,     r2
+        bl      LSYM(__internal_uidivmod) // 20
+
+        // Extend quotient and remainder to 64 bits, unsigned.
+        movs    r2,     r1
+        eors    r1,     r1
+        eors    r3,     r3
+        pop     { rP, rQ, rT, pc } // 31
+  #endif
+
+    CFI_END_FUNCTION
+CM0_FUNC_END udivdi3
+CM0_FUNC_END aeabi_uldiv
+CM0_FUNC_END aeabi_uldivmod
+
+
+#if 0
+
+    LSYM(__internal_uldivmod):
+        push    { r0 - rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 32
+                .cfi_rel_offset r0, 0
+                .cfi_rel_offset r1, 4
+                .cfi_rel_offset r2, 8
+                .cfi_rel_offset r3, 12
+                .cfi_rel_offset rP, 16
+                .cfi_rel_offset rQ, 20
+                .cfi_rel_offset rT, 24
+                .cfi_rel_offset lr, 28
+
+        // Count leading zeros of the numerator
+        bl      SYM(__clzdi2) // 55
+        mov     rP,     r0
+
+        // Load denominator
+        add     r0,     sp,     #8
+        ldm     r0,     { r0, r1 }
+
+        // Count leading zeros of the denominator.
+        bl      SYM(__clzdi2) // 55
+
+        // If the numerator has more zeros than the denominator,
+        //  the result is { 0, numerator }
+        subs    rP,     r0,     rP
+        bhi     LSYM(__uldivmod_simple)
+
+        // Reload the denominator
+        add     r0,     sp,     #8
+        ldm     r0,     { r0, r1 }
+
+        // Shift the denominator
+        movs    r2,     rP
+        bl      SYM(__aeabi_llsl) // 14
+
+        // Reload the numerator as remainder.
+        pop     { r2, r3 }
+
+        // Discard the copy of the denominator on the stack.
+        add     sp,     #8
+
+        // Shift the first quotient bit into place
+
+        // Initialize the result.
+
+        // Main division loop.
+
+
+        // Copy the quotient to the result.
+        mov     r0,     ip
+        mov     r1,     lr
+
+        pop     { rP, rQ, rT, pc }
+                .cfi_restore_state
+
+
+
+    LSYM(__uldivmod_simple):
+        movs    r2,     r0
+        movs    r3,     r1
+        eors    r0,     r0
+        eors    r1,     r1
+
+#endif
+
+
+#endif // __BUILD_CM0_FPLIB
diff -ruN libgcc/config/arm/cm0/lmul.S libgcc/config/arm/cm0/lmul.S
--- libgcc/config/arm/cm0/lmul.S	1969-12-31 16:00:00.000000000 -0800
+++ libgcc/config/arm/cm0/lmul.S	2020-11-10 21:33:20.985886867 -0800
@@ -0,0 +1,294 @@
+/* lmul.S: Cortex M0 optimized 64-bit integer multiplication 
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef __BUILD_CM0_FPLIB // whole file
+
+
+// long long __aeabi_lmul(long long, long long)
+// Returns the least significant 64 bits of a 64 bit multiplication.
+// Expects the two multiplicands in $r1:$r0 and $r3:$r2.
+// Returns the product in $r1:$r0 (does not distinguish signed types).
+// Uses $r4 and $r5 as scratch space.
+.section .text.libgcc.lmul,"x"
+CM0_FUNC_START aeabi_lmul
+FUNC_ALIAS muldi3 aeabi_lmul
+    CFI_START_FUNCTION
+
+        // $r1:$r0 = 0xDDDDCCCCBBBBAAAA
+        // $r3:$r2 = 0xZZZZYYYYXXXXWWWW
+
+        // The following operations that only affect the upper 64 bits
+        //  can be safely discarded:
+        //   DDDD * ZZZZ
+        //   DDDD * YYYY
+        //   DDDD * XXXX
+        //   CCCC * ZZZZ
+        //   CCCC * YYYY
+        //   BBBB * ZZZZ
+
+        // MAYBE: Test for multiply by ZERO on implementations with a 32-cycle
+        //  'muls' instruction, and skip over the operation in that case.
+
+    LSYM(__safe_muldi3):
+        // (0xDDDDCCCC * 0xXXXXWWWW), free $r1
+        muls    r1,     r2
+
+        // (0xZZZZYYYY * 0xBBBBAAAA), free $r3
+        muls    r3,     r0
+        add     r3,     r1
+
+        // Put the parameters in the correct form for umulsidi3().
+        movs    r1,     r2
+        b       LSYM(__internal_umulsidi3) // 7
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_lmul
+CM0_FUNC_END muldi3
+
+// unsigned long long __aeabi_lmul(unsigned int, unsigned int)
+// Returns all 64 bits of a 32 bit multiplication.
+// Expects the two multiplicands in $r0 and $r1.
+// Returns the product in $r1:$r0.
+// Uses $r3, $r4 and $ip as scratch space.
+.section .text.libgcc.umulsidi3,"x"
+CM0_FUNC_START umulsidi3
+    CFI_START_FUNCTION
+
+        // 32x32 multiply with 64 bit result.
+        // Expand the multiply into 4 parts, since muls only returns 32 bits.
+        //         (a16h * b16h / 2^32)
+        //       + (a16h * b16l / 2^48) + (a16l * b16h / 2^48)
+        //       + (a16l * b16l / 2^64)
+
+        // MAYBE: Test for multiply by 0 on implementations with a 32-cycle
+        //  'muls' instruction, and skip over the operation in that case.
+
+    LSYM(__safe_umulsidi3):
+        eors    r3,     r3
+
+    LSYM(__internal_umulsidi3):
+        mov     ip,     r3
+
+        // a16h * b16h
+        lsrs    r2,     r0,     #16
+        lsrs    r3,     r1,     #16
+        muls    r2,     r3
+        add     ip,     r2
+
+        // a16l * b16h; save a16h first!
+        lsrs    r2,     r0,     #16
+        uxth    r0,     r0
+        muls    r3,     r0
+
+        // a16l * b16l
+        uxth    r1,     r1
+        muls    r0,     r1
+
+        // a16h * b16l
+        muls    r1,     r2
+
+        // Distribute intermediate results.
+        eors    r2,     r2
+        adds    r1,     r3
+        adcs    r2,     r2
+        lsls    r3,     r1,     #16
+        lsrs    r1,     #16
+        lsls    r2,     #16
+        adds    r0,     r3
+        adcs    r1,     r2
+
+        // Add in the remaining high bits.
+        add     r1,     ip
+        RETx    lr // 24
+
+    CFI_END_FUNCTION
+CM0_FUNC_END umulsidi3
+
+
+// long long __aeabi_lmul(int, int)
+// Returns all 64 bits of a 32 bit signed multiplication.
+// Expects the two multiplicands in $r0 and $r1.
+// Returns the product in $r1:$r0.
+// Uses $r3, $r4 and $rT as scratch space.
+.section .text.libgcc.mulsidi3,"x"
+CM0_FUNC_START mulsidi3
+    CFI_START_FUNCTION
+
+        // Push registers for function call .
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        // Save signs of the arguments.
+        asrs    r3,     r0,     #31
+        asrs    rT,     r1,     #31
+
+        // Absolute value of the arguments.
+        eors    r0,     r3
+        eors    r1,     rT
+        subs    r0,     r3
+        subs    r1,     rT
+
+        // Save sign of the result.
+        eors    rT,     r3
+
+        bl      SYM(__umulsidi3) __PLT__ // 14+24
+
+        // Apply sign of the result.
+        eors    r0,     rT
+        eors    r1,     rT
+        subs    r0,     rT
+        sbcs    r1,     rT
+
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END mulsidi3
+
+
+// long long __aeabi_llsl(long long, int)
+// Logical shift left the 64 bit value in $r1:$r0 by the count in $r2.
+// The result is only guaranteed for shifts in the range of '0' to '63'.
+// Uses $r3 as scratch space.
+.section .text.libgcc.llsl,"x"
+CM0_FUNC_START aeabi_llsl
+FUNC_ALIAS ashldi3 aeabi_llsl
+    CFI_START_FUNCTION
+
+        // Save a copy for the remainder.
+        movs    r3,     r0
+
+        // Assume a simple shift.
+        lsls    r0,     r2
+        lsls    r1,     r2
+
+        // Test if the shift distance is larger than 1 word.
+        subs    r2,     #32
+        bhs     LSYM(__llsl_large)
+
+        // The remainder is opposite the main shift, (32 - x) bits.
+        rsbs    r2,     #0
+        lsrs    r3,     r2
+
+        // Cancel any remaining shift.
+        eors    r2,     r2
+
+    LSYM(__llsl_large):
+        // Apply any remaining shift
+        lsls    r3,     r2
+
+        // Merge remainder and result.
+        adds    r1,     r3
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END ashldi3
+CM0_FUNC_END aeabi_llsl
+
+
+// long long __aeabi_llsr(long long, int)
+// Logical shift right the 64 bit value in $r1:$r0 by the count in $r2.
+// The result is only guaranteed for shifts in the range of '0' to '63'.
+// Uses $r3 as scratch space.
+.section .text.libgcc.llsr,"x"
+CM0_FUNC_START aeabi_llsr
+FUNC_ALIAS lshrdi3 aeabi_llsr
+    CFI_START_FUNCTION
+
+        // Save a copy for the remainder.
+        movs    r3,     r1
+
+        // Assume a simple shift.
+        lsrs    r0,     r2
+        lsrs    r1,     r2
+
+        // Test if the shift distance is larger than 1 word.
+        subs    r2,     #32
+        bhs     LSYM(__llsr_large)
+
+        // The remainder is opposite the main shift, (32 - x) bits.
+        rsbs    r2,     #0
+        lsls    r3,     r2
+
+        // Cancel any remaining shift.
+        eors    r2,     r2
+
+    LSYM(__llsr_large):
+        // Apply any remaining shift
+        lsrs    r3,     r2
+
+        // Merge remainder and result.
+        adds    r0,     r3
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END lshrdi3
+CM0_FUNC_END aeabi_llsr
+
+
+// long long __aeabi_lasr(long long, int)
+// Arithmetic shift right the 64 bit value in $r1:$r0 by the count in $r2.
+// The result is only guaranteed for shifts in the range of '0' to '63'.
+// Uses $r3 as scratch space.
+.section .text.libgcc.lasr,"x"
+CM0_FUNC_START aeabi_lasr
+FUNC_ALIAS ashrdi3 aeabi_lasr
+    CFI_START_FUNCTION
+
+        // Save a copy for the remainder.
+        movs    r3,     r1
+
+        // Assume a simple shift.
+        lsrs    r0,     r2
+        asrs    r1,     r2
+
+        // Test if the shift distance is larger than 1 word.
+        subs    r2,     #32
+        bhs     LSYM(__lasr_large)
+
+        // The remainder is opposite the main shift, (32 - x) bits.
+        rsbs    r2,     #0
+        lsls    r3,     r2
+
+        // Cancel any remaining shift.
+        eors    r2,     r2
+
+    LSYM(__lasr_large):
+        // Apply any remaining shift
+        asrs    r3,     r2
+
+        // Merge remainder and result.
+        adds    r0,     r3
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END ashrdi3
+CM0_FUNC_END aeabi_lasr
+
+
+#endif // __BUILD_CM0_FPLIB
diff -ruN libgcc/config/arm/lib1funcs.S libgcc/config/arm/lib1funcs.S
--- libgcc/config/arm/lib1funcs.S	2020-11-08 14:32:11.000000000 -0800
+++ libgcc/config/arm/lib1funcs.S	2020-11-12 10:13:44.383982884 -0800
@@ -1050,6 +1050,10 @@
 /* ------------------------------------------------------------------------ */
 /*		Start of the Real Functions				    */
 /* ------------------------------------------------------------------------ */
+
+/* Disable these functions for v6m in favor of the versions below */
+#ifndef NOT_ISA_TARGET_32BIT
+
 #ifdef L_udivsi3
 
 #if defined(__prefer_thumb__)
@@ -1507,6 +1511,8 @@
 	cfi_end	LSYM(Lend_div0)
 	FUNC_END div0
 #endif
+
+#endif /* NOT_ISA_TARGET_32BIT */
 	
 #endif /* L_dvmd_lnx */
 #ifdef L_clear_cache
@@ -1583,6 +1589,9 @@
    so for Reg value in (32...63) and (-1...-31) we will get zero (in the
    case of logical shifts) or the sign (for asr).  */
 
+/* Disable these functions for v6m in favor of the versions below */
+#ifndef NOT_ISA_TARGET_32BIT
+
 #ifdef __ARMEB__
 #define al	r1
 #define ah	r0
@@ -1820,6 +1829,8 @@
 #endif
 #endif /* L_clzdi2 */
 
+#endif /* NOT_ISA_TARGET_32BIT */
+
 #ifdef L_ctzsi2
 #ifdef NOT_ISA_TARGET_32BIT
 FUNC_START ctzsi2
@@ -2189,5 +2200,54 @@
 #include "bpabi.S"
 #else /* NOT_ISA_TARGET_32BIT */
 #include "bpabi-v6m.S"
+
+
+#include "cm0/fplib.h"
+
+/* Temp registers. */
+#define rP r4
+#define rQ r5
+#define rS r6
+#define rT r7
+
+.macro CM0_FUNC_START name
+.global SYM(__\name)
+.type SYM(__\name),function
+.thumb_func
+.align 1
+    SYM(__\name):
+.endm
+
+.macro CM0_FUNC_END name
+.size SYM(__\name), . - SYM(__\name)
+.endm
+
+.macro RETx x
+        bx      \x
+.endm
+
+/* Order files to maximize +/- 2k jump offset of 'b' */
+#define __BUILD_CM0_FPLIB
+
+#include "cm0/clz2.S"
+#include "cm0/lmul.S"
+#include "cm0/lcmp.S"
+#include "cm0/div.S"
+#include "cm0/ldiv.S"
+
+#include "cm0/fcmp.S"
+#include "cm0/fconv.S"
+#include "cm0/fneg.S"
+
+#include "cm0/fadd.S"
+#include "cm0/futil.S"
+#include "cm0/fmul.S"
+#include "cm0/fdiv.S"
+
+#include "cm0/ffloat.S"
+#include "cm0/ffixed.S"
+
+#undef __BUILD_CM0_FPLIB
+
 #endif /* NOT_ISA_TARGET_32BIT */
 #endif /* !__symbian__ */

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2020-11-12 23:04 [PATCH] libgcc: Thumb-1 Floating-Point Library for Cortex M0 Daniel Engel
@ 2020-11-26  9:14 ` Christophe Lyon
  2020-12-02  3:32   ` Daniel Engel
  0 siblings, 1 reply; 26+ messages in thread
From: Christophe Lyon @ 2020-11-26  9:14 UTC (permalink / raw)
  To: Daniel Engel; +Cc: gcc Patches

Hi,

On Fri, 13 Nov 2020 at 00:03, Daniel Engel <libgcc@danielengel.com> wrote:
>
> Hi,
>
> This patch adds an efficient assembly-language implementation of IEEE-754 compliant floating point routines for Cortex M0 EABI (v6m, thumb-1).  This is the libgcc portion of a larger library originally described in 2018:
>
>     https://gcc.gnu.org/legacy-ml/gcc/2018-11/msg00043.html
>
> Since that time, I've separated the libm functions for submission to newlib.  The remaining libgcc functions in the attached patch have the following characteristics:
>
>     Function(s)                     Size (bytes)        Cycles          Stack   Accuracy
>     __clzsi2                        42                  23              0       exact
>     __clzsi2 (OPTIMIZE_SIZE)        22                  55              0       exact
>     __clzdi2                        8+__clzsi2          4+__clzsi2      0       exact
>
>     __umulsidi3                     44                  24              0       exact
>     __mulsidi3                      30+__umulsidi3      24+__umulsidi3  8       exact
>     __muldi3 (__aeabi_lmul)         10+__umulsidi3      6+__umulsidi3   0       exact
>     __ashldi3 (__aeabi_llsl)        22                  13              0       exact
>     __lshrdi3 (__aeabi_llsr)        22                  13              0       exact
>     __ashrdi3 (__aeabi_lasr)        22                  13              0       exact
>
>     __aeabi_lcmp                    20                   13             0       exact
>     __aeabi_ulcmp                   16                  10              0       exact
>
>     __udivsi3 (__aeabi_uidiv)       56                  72 – 385        0       < 1 lsb
>     __divsi3 (__aeabi_idiv)         38+__udivsi3        26+__udivsi3    8       < 1 lsb
>     __udivdi3 (__aeabi_uldiv)       164                 103 – 1394      16      < 1 lsb
>     __udivdi3 (OPTIMIZE_SIZE)       142                 120 – 1392      16      < 1 lsb
>     __divdi3 (__aeabi_ldiv)         54+__udivdi3        36+__udivdi3    32      < 1 lsb
>
>     __shared_float                  178
>     __shared_float (OPTIMIZE_SIZE)  154
>
>     __addsf3 (__aeabi_fadd)         116+__shared_float  31 – 76         8       <= 0.5 ulp
>     __addsf3 (OPTIMIZE_SIZE)        112+__shared_float  74              8       <= 0.5 ulp
>     __subsf3 (__aeabi_fsub)         8+__addsf3          6+__addsf3      8       <= 0.5 ulp
>     __aeabi_frsub                   8+__addsf3          6+__addsf3      8       <= 0.5 ulp
>     __mulsf3 (__aeabi_fmul)         112+__shared_float  73 – 97         8       <= 0.5 ulp
>     __mulsf3 (OPTIMIZE_SIZE)        96+__shared_float   93              8       <= 0.5 ulp
>     __divsf3 (__aeabi_fdiv)         132+__shared_float  83 – 361        8       <= 0.5 ulp
>     __divsf3 (OPTIMIZE_SIZE)        120+__shared_float  263 – 359       8       <= 0.5 ulp
>
>     __cmpsf2/__lesf2/__ltsf2        72                  33              0       exact
>     __eqsf2/__nesf2                 4+__cmpsf2          3+__cmpsf2      0       exact
>     __gesf2/__gesf2                 4+__cmpsf2          3+__cmpsf2      0       exact
>     __unordsf2 (__aeabi_fcmpun)     4+__cmpsf2          3+__cmpsf2      0       exact
>     __aeabi_fcmpeq                  4+__cmpsf2          3+__cmpsf2      0       exact
>     __aeabi_fcmpne                  4+__cmpsf2          3+__cmpsf2      0       exact
>     __aeabi_fcmplt                  4+__cmpsf2          3+__cmpsf2      0       exact
>     __aeabi_fcmple                  4+__cmpsf2          3+__cmpsf2      0       exact
>     __aeabi_fcmpge                  4+__cmpsf2          3+__cmpsf2      0       exact
>
>     __floatundisf (__aeabi_ul2f)    14+__shared_float   40 – 81         8       <= 0.5 ulp
>     __floatundisf (OPTIMIZE_SIZE)   14+__shared_float   40 – 237        8       <= 0.5 ulp
>     __floatunsisf (__aeabi_ui2f)    0+__floatundisf     1+__floatundisf 8       <= 0.5 ulp
>     __floatdisf (__aeabi_l2f)       14+__floatundisf    7+__floatundisf 8       <= 0.5 ulp
>     __floatsisf (__aeabi_i2f)       0+__floatdisf       1+__floatdisf   8       <= 0.5 ulp
>
>     __fixsfdi (__aeabi_f2lz)        74                  27 – 33         0       exact
>     __fixunssfdi (__aeabi_f2ulz)    4+__fixsfdi         3+__fixsfdi     0       exact
>     __fixsfsi (__aeabi_f2iz)        52                  19              0       exact
>     __fixsfsi (OPTIMIZE_SIZE)       4+__fixsfdi         3+__fixsfdi     0       exact
>     __fixunssfsi (__aeabi_f2uiz)    4+__fixsfsi         3+__fixsfsi     0       exact
>
>     __extendsfdf2 (__aeabi_f2d)     42+__shared_float 38             8     exact
>     __aeabi_d2f                     56+__shared_float 54 – 58     8     <= 0.5 ulp
>     __aeabi_h2f                     34+__shared_float 34             8     exact
>     __aeabi_f2h                     84                 23 – 34         0     <= 0.5 ulp
>
> Copyright assignment is on file with the FSF.
>
> I've built the gcc-arm-none-eabi cross-compiler using the 20201108 snapshot of GCC plus this patch, and successfully compiled a test program:
>
>     extern int main (void)
>     {
>         volatile int x = 1;
>         volatile unsigned long long int y = 10;
>         volatile long long int z = x / y; // 64-bit division
>
>         volatile float a = x; // 32-bit casting
>         volatile float b = y; // 64 bit casting
>         volatile float c = z / b; // float division
>         volatile float d = a + c; // float addition
>         volatile float e = c * b; // float multiplication
>         volatile float f = d - e - c; // float subtraction
>
>         if (f != c) // float comparison
>             y -= (long long int)d; // float casting
>     }
>
> As one point of comparison, the test program links to 876 bytes of libgcc code from the patched toolchain, vs 10276 bytes from the latest released gcc-arm-none-eabi-9-2020-q2 toolchain.    That's a 90% size reduction.

This looks awesome!

>
> I have extensive test vectors, and have passed these tests on an STM32F051.  These vectors were derived from UCB [1], Testfloat [2], and IEEECC754 [3] sources, plus some of my own creation.  Unfortunately, I'm not sure how "make check" should work for a cross compiler run time library.
>
> Although I believe this patch can be incorporated as-is, there are at least two points that might bear discussion:
>
> * I'm not sure where or how they would be integrated, but I would be happy to provide sources for my test vectors.
>
> * The library is currently built for the ARM v6m architecture only.  It is likely that some of the other Cortex variants would benefit from these routines.  However, I would need some guidance on this to proceed without introducing regressions.  I do not currently have a test strategy for architectures beyond Cortex M0, and I have NOT profiled the existing thumb-2 implementations (ieee754-sf.S) for comparison.

I tried your patch, and I see many regressions in the GCC testsuite
because many tests fail to link with errors like:
ld: /gcc/thumb/v6-m/nofp/libgcc.a(_arm_cmpdf2.o): in function `__clzdi2':
/libgcc/config/arm/cm0/clz2.S:39: multiple definition of
`__clzdi2';/gcc/thumb/v6-m/nofp/libgcc.a(_thumb1_case_sqi.o):/libgcc/config/arm/cm0/clz2.S:39:
first defined here

This happens with a toolchain configured with --target arm-none-eabi,
default cpu/fpu/mode,
--enable-multilib --with-multilib-list=rmprofile and running the tests with
-mthumb/-mcpu=cortex-m0/-mfloat-abi=soft/-march=armv6s-m

Does it work for you?

Thanks,

Christophe

>
> I'm naturally hoping for some action on this patch before the Nov 16th deadline for GCC-11 stage 3.  Please review and advise.
>
> Thanks,
> Daniel Engel
>
> [1] http://www.netlib.org/fp/ucbtest.tgz
> [2] http://www.jhauser.us/arithmetic/TestFloat.html
> [3] http://win-www.uia.ac.be/u/cant/ieeecc754.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2020-11-26  9:14 ` Christophe Lyon
@ 2020-12-02  3:32   ` Daniel Engel
  2020-12-16 17:15     ` Christophe Lyon
  0 siblings, 1 reply; 26+ messages in thread
From: Daniel Engel @ 2020-12-02  3:32 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: gcc Patches

[-- Attachment #1: Type: text/plain, Size: 12380 bytes --]

Hi Christophe,

On Thu, Nov 26, 2020, at 1:14 AM, Christophe Lyon wrote:
> Hi,
> 
> On Fri, 13 Nov 2020 at 00:03, Daniel Engel <libgcc@danielengel.com> wrote:
> >
> > Hi,
> >
> > This patch adds an efficient assembly-language implementation of IEEE-
> > 754 compliant floating point routines for Cortex M0 EABI (v6m, thumb-
> > 1).  This is the libgcc portion of a larger library originally
> > described in 2018:
> >
> >     https://gcc.gnu.org/legacy-ml/gcc/2018-11/msg00043.html
> >
> > Since that time, I've separated the libm functions for submission to
> > newlib.  The remaining libgcc functions in the attached patch have
> > the following characteristics:
> >
> >     Function(s)                     Size (bytes)        Cycles          Stack   Accuracy
> >     __clzsi2                        42                  23              0       exact
> >     __clzsi2 (OPTIMIZE_SIZE)        22                  55              0       exact
> >     __clzdi2                        8+__clzsi2          4+__clzsi2      0       exact
> >
> >     __umulsidi3                     44                  24              0       exact
> >     __mulsidi3                      30+__umulsidi3      24+__umulsidi3  8       exact
> >     __muldi3 (__aeabi_lmul)         10+__umulsidi3      6+__umulsidi3   0       exact
> >     __ashldi3 (__aeabi_llsl)        22                  13              0       exact
> >     __lshrdi3 (__aeabi_llsr)        22                  13              0       exact
> >     __ashrdi3 (__aeabi_lasr)        22                  13              0       exact
> >
> >     __aeabi_lcmp                    20                   13             0       exact
> >     __aeabi_ulcmp                   16                  10              0       exact
> >
> >     __udivsi3 (__aeabi_uidiv)       56                  72 – 385        0       < 1 lsb
> >     __divsi3 (__aeabi_idiv)         38+__udivsi3        26+__udivsi3    8       < 1 lsb
> >     __udivdi3 (__aeabi_uldiv)       164                 103 – 1394      16      < 1 lsb
> >     __udivdi3 (OPTIMIZE_SIZE)       142                 120 – 1392      16      < 1 lsb
> >     __divdi3 (__aeabi_ldiv)         54+__udivdi3        36+__udivdi3    32      < 1 lsb
> >
> >     __shared_float                  178
> >     __shared_float (OPTIMIZE_SIZE)  154
> >
> >     __addsf3 (__aeabi_fadd)         116+__shared_float  31 – 76         8       <= 0.5 ulp
> >     __addsf3 (OPTIMIZE_SIZE)        112+__shared_float  74              8       <= 0.5 ulp
> >     __subsf3 (__aeabi_fsub)         8+__addsf3          6+__addsf3      8       <= 0.5 ulp
> >     __aeabi_frsub                   8+__addsf3          6+__addsf3      8       <= 0.5 ulp
> >     __mulsf3 (__aeabi_fmul)         112+__shared_float  73 – 97         8       <= 0.5 ulp
> >     __mulsf3 (OPTIMIZE_SIZE)        96+__shared_float   93              8       <= 0.5 ulp
> >     __divsf3 (__aeabi_fdiv)         132+__shared_float  83 – 361        8       <= 0.5 ulp
> >     __divsf3 (OPTIMIZE_SIZE)        120+__shared_float  263 – 359       8       <= 0.5 ulp
> >
> >     __cmpsf2/__lesf2/__ltsf2        72                  33              0       exact
> >     __eqsf2/__nesf2                 4+__cmpsf2          3+__cmpsf2      0       exact
> >     __gesf2/__gesf2                 4+__cmpsf2          3+__cmpsf2      0       exact
> >     __unordsf2 (__aeabi_fcmpun)     4+__cmpsf2          3+__cmpsf2      0       exact
> >     __aeabi_fcmpeq                  4+__cmpsf2          3+__cmpsf2      0       exact
> >     __aeabi_fcmpne                  4+__cmpsf2          3+__cmpsf2      0       exact
> >     __aeabi_fcmplt                  4+__cmpsf2          3+__cmpsf2      0       exact
> >     __aeabi_fcmple                  4+__cmpsf2          3+__cmpsf2      0       exact
> >     __aeabi_fcmpge                  4+__cmpsf2          3+__cmpsf2      0       exact
> >
> >     __floatundisf (__aeabi_ul2f)    14+__shared_float   40 – 81         8       <= 0.5 ulp
> >     __floatundisf (OPTIMIZE_SIZE)   14+__shared_float   40 – 237        8       <= 0.5 ulp
> >     __floatunsisf (__aeabi_ui2f)    0+__floatundisf     1+__floatundisf 8       <= 0.5 ulp
> >     __floatdisf (__aeabi_l2f)       14+__floatundisf    7+__floatundisf 8       <= 0.5 ulp
> >     __floatsisf (__aeabi_i2f)       0+__floatdisf       1+__floatdisf   8       <= 0.5 ulp
> >
> >     __fixsfdi (__aeabi_f2lz)        74                  27 – 33         0       exact
> >     __fixunssfdi (__aeabi_f2ulz)    4+__fixsfdi         3+__fixsfdi     0       exact
> >     __fixsfsi (__aeabi_f2iz)        52                  19              0       exact
> >     __fixsfsi (OPTIMIZE_SIZE)       4+__fixsfdi         3+__fixsfdi     0       exact
> >     __fixunssfsi (__aeabi_f2uiz)    4+__fixsfsi         3+__fixsfsi     0       exact
> >
> >     __extendsfdf2 (__aeabi_f2d)     42+__shared_float 38             8     exact
> >     __aeabi_d2f                     56+__shared_float 54 – 58     8     <= 0.5 ulp
> >     __aeabi_h2f                     34+__shared_float 34             8     exact
> >     __aeabi_f2h                     84                 23 – 34         0     <= 0.5 ulp
> >
> > Copyright assignment is on file with the FSF.
> >
> > I've built the gcc-arm-none-eabi cross-compiler using the 20201108
> > snapshot of GCC plus this patch, and successfully compiled a test
> > program:
> >
> >     extern int main (void)
> >     {
> >         volatile int x = 1;
> >         volatile unsigned long long int y = 10;
> >         volatile long long int z = x / y; // 64-bit division
> >
> >         volatile float a = x; // 32-bit casting
> >         volatile float b = y; // 64 bit casting
> >         volatile float c = z / b; // float division
> >         volatile float d = a + c; // float addition
> >         volatile float e = c * b; // float multiplication
> >         volatile float f = d - e - c; // float subtraction
> >
> >         if (f != c) // float comparison
> >             y -= (long long int)d; // float casting
> >     }
> >
> > As one point of comparison, the test program links to 876 bytes of
> > libgcc code from the patched toolchain, vs 10276 bytes from the
> > latest released gcc-arm-none-eabi-9-2020-q2 toolchain.    That's a
> > 90% size reduction.
>
> This looks awesome!
>
> >
> > I have extensive test vectors, and have passed these tests on an
> > STM32F051.  These vectors were derived from UCB [1], Testfloat [2],
> > and IEEECC754 [3] sources, plus some of my own creation.
> > Unfortunately, I'm not sure how "make check" should work for a cross
> > compiler run time library.
> >
> > Although I believe this patch can be incorporated as-is, there are
> > at least two points that might bear discussion:
> >
> > * I'm not sure where or how they would be integrated, but I would be
> >   happy to provide sources for my test vectors.
> >
> > * The library is currently built for the ARM v6m architecture only.
> >   It is likely that some of the other Cortex variants would benefit
> >   from these routines.  However, I would need some guidance on this
> >   to proceed without introducing regressions.  I do not currently
> >   have a test strategy for architectures beyond Cortex M0, and I
> >   have NOT profiled the existing thumb-2 implementations (ieee754-
> >   sf.S) for comparison.
> 
> I tried your patch, and I see many regressions in the GCC testsuite
> because many tests fail to link with errors like:
> ld: /gcc/thumb/v6-m/nofp/libgcc.a(_arm_cmpdf2.o): in function 
> `__clzdi2':
> /libgcc/config/arm/cm0/clz2.S:39: multiple definition of
> `__clzdi2';/gcc/thumb/v6-m/nofp/libgcc.a(_thumb1_case_sqi.o):/libgcc/config/arm/cm0/clz2.S:39:
> first defined here
> 
> This happens with a toolchain configured with --target arm-none-eabi,
> default cpu/fpu/mode,
> --enable-multilib --with-multilib-list=rmprofile and running the tests with
> -mthumb/-mcpu=cortex-m0/-mfloat-abi=soft/-march=armv6s-m
> 
> Does it work for you?

Thanks for the feedback.

I'm afraid I'm quite ignorant as to the gcc test suite infrastructure,
so I don't know how to use the options you've shared above.  I'm cross-
compiling the Windows toolchain on Ubuntu.  Would you mind sharing a
full command line you would use for testing?  The toolchain is built
with the default options, which includes "--target arm-none-eabi".

I did see similar errors once before.  It turned out then that I omitted
one of the ".S" files from the build.  My interpretation at that point
was that gcc had been searching multiple versions of "libgcc.a" and
unable to merge the symbols.  In hindsight, that was a really bad
interpretation.   I was able to reproduce the error above by simply
adding a line like "volatile double m = 1.0; m += 2;".

After reviewing the existing asm implementations more closely, I
believe that I have not been using the function guard macros (L_arm_*)
as intended.  The make script appears to compile "lib1funcs.S" dozens of
times -- once for each function guard macro listed in LIB1ASMFUNCS --
with the intent of generating a separate ".o" file for each function.
Because they were unguarded, my new library functions were duplicated
into every ".o" file, which caused the link errors you saw.

I have attached an updated patch that implements the macros.

However, I'm not sure whether my usage is really consistent with the
spirit of the make script.  If there's a README or HOWTO, I haven't
found it yet.  The following points summarize my concerns as I was
making these updates:

1.  While some of the new functions (e.g. __cmpsf2) are standalone,
    there is a common core in the new library shared by several related
    functions.  That keeps the library small.  For now, I've elected to
    group all of these related functions together in a single object
    file "_arm_addsubsf3.o" to protect the short branches (+/-2KB)
    within this unit.  Notice that I manually assigned section names in
    the code, so there still shouldn't be any unnecessary code linked in
    the final build.  Does the multiple-".o" files strategy predate "-gc-
    sections", or should I be trying harder to break these related
    functions into separate compilation units?

2.  I introduced a few new macro keywords for functions/groups (e.g.
    "_arm_f2h" and '_arm_f2h'.  My assumption is that some empty ".o"
    files compiled for the non-v6m architectures will be benign.

3.  The "t-elf" make script implies that __mulsf3() should not be
    compiled in thumb mode (it's inside a conditional), but this is one
    of the new functions.  Moot for now, since my __mulsf3() is grouped
    with the common core functions (see point 1) and is thus currently
    guarded by the "_arm_addsubsf3.o" macro.

4.  The advice (in "ieee754-sf.S") regarding WEAK symbols does not seem
    to be working.  I have defined __clzsi2() as a weak symbol to be
    overridden by the combined function __clzdi2().  I can also see
    (with "nm") that "clzsi2.o" is compiled before "clzdi2.o" in
    "libgcc.a".  Yet, the full __clzdi2() function (8 bytes larger) is
    always linked, even in programs that only call __clzsi2(),  A minor
    annoyance at this point.

5.  Is there a permutation of the makefile that compiles libgcc with
    __OPTIMIZE_SIZE__?  There are a few sections in the patch that can
    optimize either way, yet the final product only seems to have the
    "fast" code.  At this optimization level, the sample program above
    pulls in 1012 bytes of library code instead of 836. Perhaps this is
    meant to be controlled by the toolchain configuration step, but it
    doesn't follow that the optimization for the cross-compiler would
    automatically translate to the target runtime libraries.

Thanks again, 
Daniel

> 
> Thanks,
> 
> Christophe
> 
> >
> > I'm naturally hoping for some action on this patch before the Nov 16th deadline for GCC-11 stage 3.  Please review and advise.
> >
> > Thanks,
> > Daniel Engel
> >
> > [1] http://www.netlib.org/fp/ucbtest.tgz
> > [2] http://www.jhauser.us/arithmetic/TestFloat.html
> > [3] http://win-www.uia.ac.be/u/cant/ieeecc754.html
>

[-- Attachment #2: cortex-m0-fplib-20201130.patch --]
[-- Type: application/octet-stream, Size: 144670 bytes --]

diff -ruN gcc-11-20201108-clean/libgcc/config/arm/bpabi-v6m.S gcc-11-20201108/libgcc/config/arm/bpabi-v6m.S
--- gcc-11-20201108-clean/libgcc/config/arm/bpabi-v6m.S	2020-11-08 14:32:11.000000000 -0800
+++ gcc-11-20201108/libgcc/config/arm/bpabi-v6m.S	2020-11-30 15:08:39.332813447 -0800
@@ -33,212 +33,6 @@
 	.eabi_attribute 25, 1
 #endif /* __ARM_EABI__ */
 
-#ifdef L_aeabi_lcmp
-
-FUNC_START aeabi_lcmp
-	cmp	xxh, yyh
-	beq	1f
-	bgt	2f
-	movs	r0, #1
-	negs	r0, r0
-	RET
-2:
-	movs	r0, #1
-	RET
-1:
-	subs	r0, xxl, yyl
-	beq	1f
-	bhi	2f
-	movs	r0, #1
-	negs	r0, r0
-	RET
-2:
-	movs	r0, #1
-1:
-	RET
-	FUNC_END aeabi_lcmp
-
-#endif /* L_aeabi_lcmp */
-	
-#ifdef L_aeabi_ulcmp
-
-FUNC_START aeabi_ulcmp
-	cmp	xxh, yyh
-	bne	1f
-	subs	r0, xxl, yyl
-	beq	2f
-1:
-	bcs	1f
-	movs	r0, #1
-	negs	r0, r0
-	RET
-1:
-	movs	r0, #1
-2:
-	RET
-	FUNC_END aeabi_ulcmp
-
-#endif /* L_aeabi_ulcmp */
-
-.macro test_div_by_zero signed
-	cmp	yyh, #0
-	bne	7f
-	cmp	yyl, #0
-	bne	7f
-	cmp	xxh, #0
-	.ifc	\signed, unsigned
-	bne	2f
-	cmp	xxl, #0
-2:
-	beq	3f
-	movs	xxh, #0
-	mvns	xxh, xxh		@ 0xffffffff
-	movs	xxl, xxh
-3:
-	.else
-	blt	6f
-	bgt	4f
-	cmp	xxl, #0
-	beq	5f
-4:	movs	xxl, #0
-	mvns	xxl, xxl		@ 0xffffffff
-	lsrs	xxh, xxl, #1		@ 0x7fffffff
-	b	5f
-6:	movs	xxh, #0x80
-	lsls	xxh, xxh, #24		@ 0x80000000
-	movs	xxl, #0
-5:
-	.endif
-	@ tailcalls are tricky on v6-m.
-	push	{r0, r1, r2}
-	ldr	r0, 1f
-	adr	r1, 1f
-	adds	r0, r1
-	str	r0, [sp, #8]
-	@ We know we are not on armv4t, so pop pc is safe.
-	pop	{r0, r1, pc}
-	.align	2
-1:
-	.word	__aeabi_ldiv0 - 1b
-7:
-.endm
-
-#ifdef L_aeabi_ldivmod
-
-FUNC_START aeabi_ldivmod
-	test_div_by_zero signed
-
-	push	{r0, r1}
-	mov	r0, sp
-	push	{r0, lr}
-	ldr	r0, [sp, #8]
-	bl	SYM(__gnu_ldivmod_helper)
-	ldr	r3, [sp, #4]
-	mov	lr, r3
-	add	sp, sp, #8
-	pop	{r2, r3}
-	RET
-	FUNC_END aeabi_ldivmod
-
-#endif /* L_aeabi_ldivmod */
-
-#ifdef L_aeabi_uldivmod
-
-FUNC_START aeabi_uldivmod
-	test_div_by_zero unsigned
-
-	push	{r0, r1}
-	mov	r0, sp
-	push	{r0, lr}
-	ldr	r0, [sp, #8]
-	bl	SYM(__udivmoddi4)
-	ldr	r3, [sp, #4]
-	mov	lr, r3
-	add	sp, sp, #8
-	pop	{r2, r3}
-	RET
-	FUNC_END aeabi_uldivmod
-	
-#endif /* L_aeabi_uldivmod */
-
-#ifdef L_arm_addsubsf3
-
-FUNC_START aeabi_frsub
-
-      push	{r4, lr}
-      movs	r4, #1
-      lsls	r4, #31
-      eors	r0, r0, r4
-      bl	__aeabi_fadd
-      pop	{r4, pc}
-
-      FUNC_END aeabi_frsub
-
-#endif /* L_arm_addsubsf3 */
-
-#ifdef L_arm_cmpsf2
-
-FUNC_START aeabi_cfrcmple
-
-	mov	ip, r0
-	movs	r0, r1
-	mov	r1, ip
-	b	6f
-
-FUNC_START aeabi_cfcmpeq
-FUNC_ALIAS aeabi_cfcmple aeabi_cfcmpeq
-
-	@ The status-returning routines are required to preserve all
-	@ registers except ip, lr, and cpsr.
-6:	push	{r0, r1, r2, r3, r4, lr}
-	bl	__lesf2
-	@ Set the Z flag correctly, and the C flag unconditionally.
-	cmp	r0, #0
-	@ Clear the C flag if the return value was -1, indicating
-	@ that the first operand was smaller than the second.
-	bmi	1f
-	movs	r1, #0
-	cmn	r0, r1
-1:
-	pop	{r0, r1, r2, r3, r4, pc}
-
-	FUNC_END aeabi_cfcmple
-	FUNC_END aeabi_cfcmpeq
-	FUNC_END aeabi_cfrcmple
-
-FUNC_START	aeabi_fcmpeq
-
-	push	{r4, lr}
-	bl	__eqsf2
-	negs	r0, r0
-	adds	r0, r0, #1
-	pop	{r4, pc}
-
-	FUNC_END aeabi_fcmpeq
-
-.macro COMPARISON cond, helper, mode=sf2
-FUNC_START	aeabi_fcmp\cond
-
-	push	{r4, lr}
-	bl	__\helper\mode
-	cmp	r0, #0
-	b\cond	1f
-	movs	r0, #0
-	pop	{r4, pc}
-1:
-	movs	r0, #1
-	pop	{r4, pc}
-
-	FUNC_END aeabi_fcmp\cond
-.endm
-
-COMPARISON lt, le
-COMPARISON le, le
-COMPARISON gt, ge
-COMPARISON ge, ge
-
-#endif /* L_arm_cmpsf2 */
-
 #ifdef L_arm_addsubdf3
 
 FUNC_START aeabi_drsub
diff -ruN gcc-11-20201108-clean/libgcc/config/arm/cm0/clz2.S gcc-11-20201108/libgcc/config/arm/cm0/clz2.S
--- gcc-11-20201108-clean/libgcc/config/arm/cm0/clz2.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201108/libgcc/config/arm/cm0/clz2.S	2020-11-30 19:14:58.341497924 -0800
@@ -0,0 +1,143 @@
+/* clz2.S: Cortex M0 optimized 'clz' functions 
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#if defined(L_clzdi2) || defined(L_clzsi2)
+.section .text.libgcc.clz2,"x"
+
+#ifdef L_clzdi2
+
+// int __clzdi2(long long)
+// Counts leading zeros in a 64 bit double word.
+// Expects the argument  in $r1:$r0.
+// Returns the result in $r0.
+// Uses $r2 and $r3 as scratch space.
+CM0_FUNC_START clzdi2
+    CFI_START_FUNCTION
+
+        // Assume all the bits in the argument are zero.
+        movs    r2,     #64
+
+        // If the upper word is ZERO, calculate 32 + __clzsi2(lower).
+        cmp     r1,     #0
+        beq     LSYM(__clz16)
+
+        // The upper word is non-zero, so calculate __clzsi2(upper).
+        movs    r0,     r1
+
+        // Fall through.
+
+
+// int __clzsi2(int)
+// Counts leading zeros in a 32 bit word.
+// Expects the argument in $r0.
+// Returns the result in $r0.
+// Uses $r2 and $r3 as scratch space.
+CM0_FUNC_START clzsi2
+
+#else
+ 
+// Allow a standalone implementation of clzsi2() to be superceded by a 
+//  combined implementation.  This allows use of the slightly smaller 
+//  unit in programs that do not need clzdi2(). Requires '_clzsi2' to 
+//  appear before '_clzdi2' in LIB1ASMFUNCS. 
+CM0_WEAK_START clzsi2
+    CFI_START_FUNCTION
+
+#endif /* L_clzdi2 */
+
+        // Assume all the bits in the argument are zero
+        movs    r2,     #32
+
+    LSYM(__clz16):
+        // Size optimized: 22 bytes, 51 cycles 
+        // Speed optimized: 50 bytes, 20 cycles
+
+  #if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+
+        // Binary search starts at half the word width.
+        movs    r3,     #16
+
+    LSYM(__clz_loop):
+        // Test the upper 'n' bits of the operand for ZERO.
+        movs    r1,     r0
+        lsrs    r1,     r3
+        beq     LSYM(__clz_skip)
+
+        // When the test fails, discard the lower bits of the register,
+        //  and deduct the count of discarded bits from the result.
+        movs    r0,     r1
+        subs    r2,     r3
+
+    LSYM(__clz_skip):
+        // Decrease the shift distance for the next test.
+        lsrs    r3,     #1
+        bne     LSYM(__clz_loop)
+
+  #else // !__OPTIMIZE_SIZE__
+
+        // Unrolled binary search.
+        lsrs    r1,     r0,     #16
+        beq     LSYM(__clz8)
+        movs    r0,     r1
+        subs    r2,     #16
+
+    LSYM(__clz8):
+        lsrs    r1,     r0,     #8
+        beq     LSYM(__clz4)
+        movs    r0,     r1
+        subs    r2,     #8
+
+    LSYM(__clz4):
+        lsrs    r1,     r0,     #4
+        beq     LSYM(__clz2)
+        movs    r0,     r1
+        subs    r2,     #4
+
+    LSYM(__clz2):
+        // Load the remainder by index
+	adr     r3,     LSYM(__clz_remainder)
+        ldrb    r0,     [r3, r0]
+
+  #endif // !__OPTIMIZE_SIZE__
+
+        // Account for the remainder.
+        subs    r0,     r2,     r0
+        RETx    lr
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+        .align 2
+    LSYM(__clz_remainder):
+        .byte 0,1,2,2,3,3,3,3,4,4,4,4,4,4,4,4
+  #endif
+
+    CFI_END_FUNCTION
+CM0_FUNC_END clzsi2
+
+  #ifdef L_clzdi2
+    CM0_FUNC_END clzdi2
+  #endif
+
+#endif /* L_clzdi2 || L_clzsi2 */
+
diff -ruN gcc-11-20201108-clean/libgcc/config/arm/cm0/ctz2.S gcc-11-20201108/libgcc/config/arm/cm0/ctz2.S
--- gcc-11-20201108-clean/libgcc/config/arm/cm0/ctz2.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201108/libgcc/config/arm/cm0/ctz2.S	2020-11-30 15:08:39.332813447 -0800
@@ -0,0 +1,183 @@
+/* ctz2.S: Cortex M0 optimized 'ctz' functions
+
+   Copyright (C) 2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#ifdef L_ctzsi2
+
+// int __ffsdi2(int)
+// Return the index of the least significant 1-bit in $r1:r0, 
+//  or zero if $r1:r0 is zero.  The least significant bit is index 1.
+// Returns the result in $r0.
+// Uses $r2 and $r3 as scratch space.
+.section .text.libgcc.ffsdi2,"x"
+CM0_FUNC_START ffsdi2
+    CFI_START_FUNCTION
+        
+        // Except for zero, ffsdi2(x) == ctzdi2(x) + 1.  Assume non-zero and 
+        //  set up the result before the test (to simplify branching). 
+        // Initial set up assumes the least-significant bit in the lower word.
+        movs    r2,     #33
+        
+        // Test the lower word.  
+        cmp     r0,     #0 
+        bne     LSYM(__ctz16)
+        
+        // Test the upper word and fall through to return zero.  
+        movs    r2,     #65 
+        movs    r0,     r1 
+        bne     LSYM(__ctz16)
+        RETx    lr
+        
+    CFI_END_FUNCTION
+CM0_FUNC_END ffsdi2
+    
+    
+// int __ffssi2(int)
+// Return the index of the least significant 1-bit in $r0, 
+//  or zero if $r0 is zero.  The least significant bit is index 1.
+// Returns the result in $r0.
+// Uses $r2 and $r3 as scratch space.
+.section .text.libgcc.ffssi2,"x"
+CM0_FUNC_START ffssi2
+    CFI_START_FUNCTION
+        
+        // Except for zero, ffssi2(x) == ctzsi2(x) + 1.  Assume non-zero and 
+        //  set up the result before the test (to simplify branching). 
+        movs    r2,     #33
+        
+        // Now test for the exception and fall through to return zero. 
+        cmp     r0,     #0 
+        bne     LSYM(__ctz16)
+        RETx    lr
+        
+    CFI_END_FUNCTION
+CM0_FUNC_END ffssi2
+
+
+// int __ctzdi2(long long)
+// Counts trailing zeros in a 64 bit double word.
+// Expects the argument  in $r1:$r0.
+// Returns the result in $r0.
+// Uses $r2 and $r3 as scratch space.
+.section .text.libgcc.ctzdi2,"x"
+CM0_FUNC_START ctzdi2
+    CFI_START_FUNCTION
+
+        // If the lower word is non-zero, result is just __ctzsi2(lower).
+        cmp     r0,     #0
+        bne     SYM(__ctzsi2)
+
+        // The lower word is zero, so calculate 32 + __ctzsi2(upper).
+        movs    r0,     r1
+        movs    r2,     #64
+        b       LSYM(__ctz16)
+        
+    CFI_END_FUNCTION
+CM0_FUNC_END ctzdi2
+
+
+// int __ctzsi2(int)
+// Counts trailing zeros in a 32 bit word.
+// Expects the argument in $r0.
+// Returns the result in $r0.
+// Uses $r2 and $r3 as scratch space.
+.section .text.libgcc.ctzsi2,"x"
+CM0_FUNC_START ctzsi2
+    CFI_START_FUNCTION
+
+        // Assume all the bits in the argument are zero
+        movs    r2,     #32
+
+    LSYM(__ctz16):
+        // Size optimized: 24 bytes, 52 cycles
+        // Speed optimized: 52 bytes, 21 cycles
+
+  #if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+
+        // Binary search starts at half the word width.
+        movs    r3,     #16
+
+    LSYM(__ctz_loop):
+        // Test the upper 'n' bits of the operand for ZERO.
+        movs    r1,     r0
+        
+        lsls    r1,     r3
+        beq     LSYM(__ctz_skip)
+
+        // When the test fails, discard the lower bits of the register,
+        //  and deduct the count of discarded bits from the result.
+        movs    r0,     r1
+        subs    r2,     r3
+
+    LSYM(__ctz_skip):
+        // Decrease the shift distance for the next test.
+        lsrs    r3,     #1
+        bne     LSYM(__ctz_loop)
+       
+        // Prepare the remainder.
+        lsrs    r0,     #31
+ 
+  #else // !__OPTIMIZE_SIZE__
+ 
+        // Unrolled binary search.
+        lsls    r1,     r0,     #16
+        beq     LSYM(__ctz8)
+        movs    r0,     r1
+        subs    r2,     #16
+
+    LSYM(__ctz8):
+        lsls    r1,     r0,     #8
+        beq     LSYM(__ctz4)
+        movs    r0,     r1
+        subs    r2,     #8
+
+    LSYM(__ctz4):
+        lsls    r1,     r0,     #4
+        beq     LSYM(__ctz2)
+        movs    r0,     r1
+        subs    r2,     #4
+
+    LSYM(__ctz2):
+        // Load the remainder by index
+        lsrs    r0,     #28 
+        adr     r3,     LSYM(__ctz_remainder)
+        ldrb    r0,     [r3, r0]
+  
+  #endif // !__OPTIMIZE_SIZE__ 
+
+        // Apply the remainder.
+        subs    r0,     r2,     r0
+        RETx    lr
+       
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__ 
+        .align 2
+    LSYM(__ctz_remainder):
+        .byte 0,4,3,4,2,4,3,4,1,4,3,4,2,4,3,4
+  #endif  
+      
+    CFI_END_FUNCTION
+CM0_FUNC_END ctzsi2
+
+#endif /* L_ctzsi2 */
+
diff -ruN gcc-11-20201108-clean/libgcc/config/arm/cm0/fadd.S gcc-11-20201108/libgcc/config/arm/cm0/fadd.S
--- gcc-11-20201108-clean/libgcc/config/arm/cm0/fadd.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201108/libgcc/config/arm/cm0/fadd.S	2020-11-30 15:08:39.336813386 -0800
@@ -0,0 +1,300 @@
+/* fadd.S: Cortex M0 optimized 32-bit float addition
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef L_arm_addsubsf3
+
+// float __aeabi_frsub(float, float)
+// Returns the floating point difference of $r1 - $r0 in $r0.
+.section .text.libgcc.frsub,"x"
+CM0_FUNC_START aeabi_frsub
+    CFI_START_FUNCTION
+
+      #if defined(STRICT_NANS) && STRICT_NANS
+        // Check if $r0 is NAN before modifying.
+        lsls    r2,     r0,     #1
+        movs    r3,     #255
+        lsls    r3,     #24
+
+        // Let fadd() find the NAN in the normal course of operation,
+        //  moving it to $r0 and checking the quiet/signaling bit.
+        cmp     r2,     r3
+        bhi     LSYM(__internal_fadd)
+      #endif
+
+        // Flip sign and run through fadd().
+        movs    r2,     #1
+        lsls    r2,     #31
+        adds    r0,     r2
+        b       LSYM(__internal_fadd)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_frsub
+
+
+// float __aeabi_fsub(float, float)
+// Returns the floating point difference of $r0 - $r1 in $r0.
+.section .text.libgcc.fsub,"x"
+CM0_FUNC_START aeabi_fsub
+FUNC_ALIAS subsf3 aeabi_fsub
+    CFI_START_FUNCTION
+
+      #if defined(STRICT_NANS) && STRICT_NANS
+        // Check if $r1 is NAN before modifying.
+        lsls    r2,     r1,     #1
+        movs    r3,     #255
+        lsls    r3,     #24
+
+        // Let fadd() find the NAN in the normal course of operation,
+        //  moving it to $r0 and checking the quiet/signaling bit.
+        cmp     r2,     r3
+        bhi     LSYM(__internal_fadd)
+      #endif
+
+        // Flip sign and run through fadd().
+        movs    r2,     #1
+        lsls    r2,     #31
+        adds    r1,     r2
+        b       LSYM(__internal_fadd)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END subsf3
+CM0_FUNC_END aeabi_fsub
+
+
+// float __aeabi_fadd(float, float)
+// Returns the floating point sum of $r0 + $r1 in $r0.
+.section .text.libgcc.fadd,"x"
+CM0_FUNC_START aeabi_fadd
+FUNC_ALIAS addsf3 aeabi_fadd
+    CFI_START_FUNCTION
+
+    LSYM(__internal_fadd):
+        // Standard registers, compatible with exception handling.
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        // Drop the sign bit to compare absolute value.
+        lsls    r2,     r0,     #1
+        lsls    r3,     r1,     #1
+
+        // Save the logical difference of original values.
+        // This actually makes the following swap slightly faster.
+        eors    r1,     r0
+
+        // Compare exponents+mantissa.
+        // MAYBE: Speedup for equal values?  This would have to separately
+        //  check for NAN/INF and then either:
+        // * Increase the exponent by '1' (for multiply by 2), or
+        // * Return +0
+        cmp     r2,     r3
+        bhs     LSYM(__fadd_ordered)
+
+        // Reorder operands so the larger absolute value is in r2,
+        //  the corresponding original operand is in $r0,
+        //  and the smaller absolute value is in $r3.
+        movs    r3,     r2
+        eors    r0,     r1
+        lsls    r2,     r0,     #1
+
+    LSYM(__fadd_ordered):
+        // Extract the exponent of the larger operand.
+        // If INF/NAN, then it becomes an automatic result.
+        lsrs    r2,     #24
+        cmp     r2,     #255
+        beq     LSYM(__fadd_special)
+
+        // Save the sign of the result.
+        lsrs    rT,     r0,     #31
+        lsls    rT,     #31
+        mov     ip,     rT
+
+        // If the original value of $r1 was to +/-0,
+        //  $r0 becomes the automatic result.
+        // Because $r0 is known to be a finite value, return directly.
+        // It's actually important that +/-0 not go through the normal
+        //  process, to keep "-0 +/- 0"  from being turned into +0.
+        cmp     r3,     #0
+        beq     LSYM(__fadd_zero)
+
+        // Extract the second exponent.
+        lsrs    r3,     #24
+
+        // Calculate the difference of exponents (always positive).
+        subs    r3,     r2,     r3
+
+      #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+        // If the smaller operand is more than 25 bits less significant
+        //  than the larger, the larger operand is an automatic result.
+        // The smaller operand can't affect the result, even after rounding.
+        cmp     r3,     #25
+        bhi     LSYM(__fadd_return)
+      #endif
+
+        // Isolate both mantissas, recovering the smaller.
+        lsls    rT,     r0,     #9
+        lsls    r0,     r1,     #9
+        eors    r0,     rT
+
+        // If the larger operand is normal, restore the implicit '1'.
+        // If subnormal, the second operand will also be subnormal.
+        cmp     r2,     #0
+        beq     LSYM(__fadd_normal)
+        adds    rT,     #1
+        rors    rT,     rT
+
+        // If the smaller operand is also normal, restore the implicit '1'.
+        // If subnormal, the smaller operand effectively remains multiplied
+        //  by 2 w.r.t the first.  This compensates for subnormal exponents,
+        //  which are technically still -126, not -127.
+        cmp     r2,     r3
+        beq     LSYM(__fadd_normal)
+        adds    r0,     #1
+        rors    r0,     r0
+
+    LSYM(__fadd_normal):
+        // Provide a spare bit for overflow.
+        // Normal values will be aligned in bits [30:7]
+        // Subnormal values will be aligned in bits [30:8]
+        lsrs    rT,     #1
+        lsrs    r0,     #1
+
+        // If signs weren't matched, negate the smaller operand (branchless).
+        asrs    r1,     #31
+        eors    r0,     r1
+        subs    r0,     r1
+
+        // Keep a copy of the small mantissa for the remainder.
+        movs    r1,     r0
+
+        // Align the small mantissa for addition.
+        asrs    r1,     r3
+
+        // Isolate the remainder.
+        // NOTE: Given the various cases above, the remainder will only
+        //  be used as a boolean for rounding ties to even.  It is not
+        //  necessary to negate the remainder for subtraction operations.
+        rsbs    r3,     #0
+        adds    r3,     #32
+        lsls    r0,     r3
+
+        // Because operands are ordered, the result will never be negative.
+        // If the result of subtraction is 0, the overall result must be +0.
+        // If the overall result in $r1 is 0, then the remainder in $r0
+        //  must also be 0, so no register copy is necessary on return.
+        adds    r1,     rT
+        beq     LSYM(__fadd_return)
+
+        // The large operand was aligned in bits [29:7]...
+        // If the larger operand was normal, the implicit '1' went in bit [30].
+        //
+        // After addition, the MSB of the result may be in bit:
+        //    31,  if the result overflowed.
+        //    30,  the usual case.
+        //    29,  if there was a subtraction of operands with exponents
+        //          differing by more than 1.
+        //  < 28, if there was a subtraction of operands with exponents +/-1,
+        //  < 28, if both operands were subnormal.
+
+        // In the last case (both subnormal), the alignment shift will be 8,
+        //  the exponent will be 0, and no rounding is necessary.
+        cmp     r2,     #0
+        bne     SYM(__fp_assemble)
+
+        // Subnormal overflow automatically forms the correct exponent.
+        lsrs    r0,     r1,     #8
+        add     r0,     ip
+
+    LSYM(__fadd_return):
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    LSYM(__fadd_special):
+      #if defined(TRAP_NANS) && TRAP_NANS
+        // If $r1 is (also) NAN, force it in place of $r0.
+        // As the smaller NAN, it is more likely to be signaling.
+        movs    rT,     #255
+        lsls    rT,     #24
+        cmp     r3,     rT
+        bls     LSYM(__fadd_ordered2)
+
+        eors    r0,     r1
+      #endif
+
+    LSYM(__fadd_ordered2):
+        // There are several possible cases to consider here:
+        //  1. Any NAN/NAN combination
+        //  2. Any NAN/INF combination
+        //  3. Any NAN/value combination
+        //  4. INF/INF with matching signs
+        //  5. INF/INF with mismatched signs.
+        //  6. Any INF/value combination.
+        // In all cases but the case 5, it is safe to return $r0.
+        // In the special case, a new NAN must be constructed.
+        // First, check the mantissa to see if $r0 is NAN.
+        lsls    r2,     r0,     #9
+
+      #if defined(TRAP_NANS) && TRAP_NANS
+        bne     SYM(__fp_check_nan)
+      #else
+        bne     LSYM(__fadd_return)
+      #endif
+
+    LSYM(__fadd_zero):
+        // Next, check for an INF/value combination.
+        lsls    r2,     r1,     #1
+        bne     LSYM(__fadd_return)
+
+        // Finally, check for matching sign on INF/INF.
+        // Also accepts matching signs when +/-0 are added.
+        bcc     LSYM(__fadd_return)
+
+      #if defined(EXCEPTION_CODES) && EXCEPTION_CODES
+        movs    r3,     #(SUBTRACTED_INFINITY)
+      #endif
+
+      #if defined(TRAP_EXCEPTIONS) && TRAP_EXCEPTIONS
+        // Restore original operands.
+        eors    r1,     r0
+      #endif
+
+        // Identify mismatched 0.
+        lsls    r2,     r0,     #1
+        bne     SYM(__fp_exception)
+
+        // Force mismatched 0 to +0.
+        eors    r0,     r0
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END addsf3
+CM0_FUNC_END aeabi_fadd
+
+#endif /* L_arm_addsubsf3 */
+
diff -ruN gcc-11-20201108-clean/libgcc/config/arm/cm0/fcmp.S gcc-11-20201108/libgcc/config/arm/cm0/fcmp.S
--- gcc-11-20201108-clean/libgcc/config/arm/cm0/fcmp.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201108/libgcc/config/arm/cm0/fcmp.S	2020-11-30 15:08:39.336813386 -0800
@@ -0,0 +1,558 @@
+/* fcmp.S: Cortex M0 optimized 32-bit float comparison
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef L_arm_cmpsf2
+
+// int __cmpsf2(float, float)
+// <https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html>
+// Returns the three-way comparison result of $r0 with $r1:
+//  * +1 if ($r0 > $r1), or either argument is NAN
+//  *  0 if ($r0 == $r1)
+//  * -1 if ($r0 < $r1)
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libgcc.cmpsf2,"x"
+CM0_FUNC_START cmpsf2
+FUNC_ALIAS lesf2 cmpsf2
+FUNC_ALIAS ltsf2 cmpsf2
+    CFI_START_FUNCTION
+
+        // Assumption: The 'libgcc' functions should raise exceptions.
+        movs    r2,     #(FCMP_UN_POSITIVE + FCMP_RAISE_EXCEPTIONS + FCMP_3WAY)
+
+// int,int __internal_cmpsf2(float, float, int)
+// Internal function expects a set of control flags in $r2.
+// If ordered, returns a comparison type { 0, 1, 2 } in $r3
+CM0_FUNC_START internal_cmpsf2
+
+        // When operand signs are considered, the comparison result falls
+        //  within one of the following quadrants:
+        //
+        // $r0  $r1  $r0-$r1* flags  result
+        //  +    +      >      C=0     GT
+        //  +    +      =      Z=1     EQ
+        //  +    +      <      C=1     LT
+        //  +    -      >      C=1     GT
+        //  +    -      =      C=1     GT
+        //  +    -      <      C=1     GT
+        //  -    +      >      C=0     LT
+        //  -    +      =      C=0     LT
+        //  -    +      <      C=0     LT
+        //  -    -      >      C=0     LT
+        //  -    -      =      Z=1     EQ
+        //  -    -      <      C=1     GT
+        //
+        // *When interpeted as a subtraction of unsigned integers
+        //
+        // From the table, it is clear that in the presence of any negative
+        //  operand, the natural result simply needs to be reversed.
+        // Save the 'N' flag for later use.
+        movs    r3,     r0
+        orrs    r3,     r1
+        mov     ip,     r3
+
+        // Keep the absolute value of the second argument for NAN testing.
+        lsls    r3,     r1,     #1
+
+        // With the absolute value of the second argument safely stored,
+        //  recycle $r1 to calculate the difference of the arguments.
+        subs    r1,     r0,     r1
+
+        // Save the 'C' flag for use later.
+        // Effectively shifts all the flags 1 bit left.
+        adcs    r2,     r2
+
+        // Absolute value of the first argument.
+        lsls    r0,     #1
+
+        // Identify the largest absolute value between the two arguments.
+        cmp     r0,     r3
+        bhs     LSYM(__fcmp_sorted)
+
+        // Keep the larger absolute value for NAN testing.
+        // NOTE: When the arguments are respectively a signaling NAN and a
+        //  quiet NAN, the quiet NAN has precedence.  This has consequences
+        //  if TRAP_NANS is enabled, but the flags indicate that exceptions
+        //  for quiet NANs should be suppressed.  After the signaling NAN is
+        //  discarded, no exception is raised, although it should have been.
+        // This could be avoided by using a fifth register to save both
+        //  arguments until the signaling bit can be tested, but that seems
+        //  like an excessive amount of ugly code for an ambiguous case.
+        movs    r0,     r3
+
+    LSYM(__fcmp_sorted):
+        // If $r3 is NAN, the result is unordered.
+        movs    r3,     #255
+        lsls    r3,     #24
+        cmp     r0,     r3
+        bhi     LSYM(__fcmp_unordered)
+
+        // Positive and negative zero must be considered equal.
+        // If the larger absolute value is +/-0, both must have been +/-0.
+        subs    r3,     r0,     #0
+        beq     LSYM(__fcmp_zero)
+
+        // Test for regular equality.
+        subs    r3,     r1,     #0
+        beq     LSYM(__fcmp_zero)
+
+        // Isolate the saved 'C', and invert if either argument was negative.
+        // Remembering that the original subtraction was $r1 - $r0,
+        //  the result will be 1 if 'C' was set (gt), or 0 for not 'C' (lt).
+        lsls    r3,     r2,     #31
+        add     r3,     ip
+        lsrs    r3,     #31
+
+        // HACK: Clear the 'C' bit
+        adds    r3,     #0
+
+    LSYM(__fcmp_zero):
+        // After everything is combined, the temp result will be
+        //  2 (gt), 1 (eq), or 0 (lt).
+        adcs    r3,     r3
+
+        // Return directly if the 3-way comparison flag is set.
+        // Also shifts the condition mask into bits[2:0].
+        lsrs    r2,     #2
+        bcs     LSYM(__fcmp_return)
+
+        // If the bit corresponding to the comparison result is set in the
+        //  accepance mask, a '1' will fall out into the result.
+        movs    r0,     #1
+        lsrs    r2,     r3
+        ands    r0,     r2
+        RETx    lr
+
+    LSYM(__fcmp_unordered):
+        // Set up the requested UNORDERED result.
+        // Remember the shift in the flags (above).
+        lsrs    r2,     #6
+
+  #if defined(TRAP_EXCEPTIONS) && TRAP_EXCEPTIONS
+        // TODO: ... The
+
+
+  #endif
+
+  #if defined(TRAP_NANS) && TRAP_NANS
+        // Always raise an exception if FCMP_RAISE_EXCEPTIONS was specified.
+        bcs     LSYM(__fcmp_trap)
+
+        // If FCMP_NO_EXCEPTIONS was specified, no exceptions on quiet NANs.
+        // The comparison flags are moot, so $r1 can serve as scratch space.
+        lsrs    r1,     r0,     #24
+        bcs     LSYM(__fcmp_return2)
+
+    LSYM(__fcmp_trap):
+        // Restore the NAN (sans sign) for an argument to the exception.
+        // As an IRQ, the handler restores all registers, including $r3.
+        // NOTE: The service handler may not return.
+        lsrs    r0,     #1
+        movs    r3,     #(UNORDERED_COMPARISON)
+        svc     #(SVC_TRAP_NAN)
+  #endif
+
+     LSYM(__fcmp_return2):
+        // HACK: Work around result register mapping.
+        // This could probably be eliminated by remapping the flags register.
+        movs    r3,     r2
+
+    LSYM(__fcmp_return):
+        // Finish setting up the result.
+        // The subtraction allows a negative result from an 8 bit set of flags.
+        //  (See the variations on the FCMP_UN parameter, above).
+        subs    r0,     r3,     #1
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END ltsf2
+CM0_FUNC_END lesf2
+CM0_FUNC_END cmpsf2
+
+
+// int __eqsf2(float, float)
+// <https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html>
+// Returns the three-way comparison result of $r0 with $r1:
+//  * -1 if ($r0 < $r1)
+//  *  0 if ($r0 == $r1)
+//  * +1 if ($r0 > $r1), or either argument is NAN
+// Uses $r2, $r3, and $ip as scratch space.
+CM0_FUNC_START eqsf2
+FUNC_ALIAS nesf2 eqsf2
+    CFI_START_FUNCTION
+
+        // Assumption: The 'libgcc' functions should raise exceptions.
+        movs    r2,     #(FCMP_UN_POSITIVE + FCMP_NO_EXCEPTIONS + FCMP_3WAY)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END nesf2
+CM0_FUNC_END eqsf2
+
+
+// int __gesf2(float, float)
+// <https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html>
+// Returns the three-way comparison result of $r0 with $r1:
+//  * -1 if ($r0 < $r1), or either argument is NAN
+//  *  0 if ($r0 == $r1)
+//  * +1 if ($r0 > $r1)
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libgcc.gesf2,"x"
+CM0_FUNC_START gesf2
+FUNC_ALIAS gtsf2 gesf2
+    CFI_START_FUNCTION
+
+        // Assumption: The 'libgcc' functions should raise exceptions.
+        movs    r2,     #(FCMP_UN_NEGATIVE + FCMP_RAISE_EXCEPTIONS + FCMP_3WAY)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END gtsf2
+CM0_FUNC_END gesf2
+
+
+// int __aeabi_fcmpeq(float, float)
+// Returns '1' in $r1 if ($r0 == $r1) (ordered).
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libgcc.fcmpeq,"x"
+CM0_FUNC_START aeabi_fcmpeq
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_EQ)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_fcmpeq
+
+
+// int __aeabi_fcmpne(float, float) [non-standard]
+// Returns '1' in $r1 if ($r0 != $r1) (ordered).
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libgcc.fcmpne,"x"
+CM0_FUNC_START aeabi_fcmpne
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_NE)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_fcmpne
+
+
+// int __aeabi_fcmplt(float, float)
+// Returns '1' in $r1 if ($r0 < $r1) (ordered).
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libgcc.fcmplt,"x"
+CM0_FUNC_START aeabi_fcmplt
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_RAISE_EXCEPTIONS + FCMP_LT)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_fcmplt
+
+
+// int __aeabi_fcmple(float, float)
+// Returns '1' in $r1 if ($r0 <= $r1) (ordered).
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libgcc.fcmple,"x"
+CM0_FUNC_START aeabi_fcmple
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_RAISE_EXCEPTIONS + FCMP_LE)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_fcmple
+
+
+// int __aeabi_fcmpge(float, float)
+// Returns '1' in $r1 if ($r0 >= $r1) (ordered).
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libgcc.fcmpge,"x"
+CM0_FUNC_START aeabi_fcmpge
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_RAISE_EXCEPTIONS + FCMP_GE)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_fcmpge
+
+
+// int __aeabi_fcmpgt(float, float)
+// Returns '1' in $r1 if ($r0 > $r1) (ordered).
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libgcc.fcmpgt,"x"
+CM0_FUNC_START aeabi_fcmpgt
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_RAISE_EXCEPTIONS + FCMP_GT)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_fcmpgt
+
+#endif /* L_arm_cmpsf2 */
+
+
+#ifdef L_arm_unordsf2
+
+// int __aeabi_fcmpun(float, float)
+// Returns '1' in $r1 if $r0 and $r1 are unordered.
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libgcc.fcmpun,"x"
+CM0_FUNC_START aeabi_fcmpun
+FUNC_ALIAS unordsf2 aeabi_fcmpun
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_POSITIVE + FCMP_NO_EXCEPTIONS)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END unordsf2
+CM0_FUNC_END aeabi_fcmpun
+
+#endif /* L_unordsf2 */
+
+
+#if 0
+
+// void __aeabi_cfrcmple(float, float)
+// Reverse three-way compare of $r1 ? $r1, with result in the status flags:
+//  * 'Z' is set only when the operands are ordered and equal.
+//  * 'C' is clear only when the operands are ordered and $r0 > $r1.
+// Preserves all core registers except $ip, $lr, and the CPSR.
+.section .text.libgcc.cfrcmple,"x"
+CM0_FUNC_START aeabi_cfrcmple
+    CFI_START_FUNCTION
+
+        push    { r0-r3, lr }
+
+        // Save the current CFI state
+        .cfi_adjust_cfa_offset 20
+        .cfi_rel_offset r0, 0
+        .cfi_rel_offset r1, 4
+        .cfi_rel_offset r2, 8
+        .cfi_rel_offset r3, 12
+        .cfi_rel_offset lr, 16
+
+        // Reverse the order of the arguments.
+        ldr     r0,     [sp, #4]
+        ldr     r1,     [sp, #0]
+
+        // Don't just fall through into cfcmple(), else registers will get pushed twice.
+        b       SYM(__real_cfrcmple)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_cfrcmple
+
+
+// void __aeabi_cfcmpeq(float, float)
+// NOTE: This function only applies if __aeabi_cfcmple() can raise exceptions.
+// Three-way compare of $r0 ? $r1, with result in the status flags:
+//  * 'Z' is set only when the operands are ordered and equal.
+//  * 'C' is clear only when the operands are ordered and $r0 < $r1.
+// Preserves all core registers except $ip, $lr, and the CPSR.
+#if defined(TRAP_NANS) && TRAP_NANS
+  .section .text.libgcc.cfcmpeq,"x"
+  CM0_FUNC_START aeabi_cfcmpeq
+    CFI_START_FUNCTION
+
+        push    { r0-r3, lr }
+
+        // Save the current CFI state
+        .cfi_adjust_cfa_offset 20
+        .cfi_rel_offset r0, 0
+        .cfi_rel_offset r1, 4
+        .cfi_rel_offset r2, 8
+        .cfi_rel_offset r3, 12
+        .cfi_rel_offset lr, 16
+
+        // No exceptions on quiet NAN.
+        // On an unordered result, 'C' should be '1' and 'Z' should be '0'.
+        // A subtraction giving -1 sets these flags correctly.
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS)
+        b       LSYM(__real_cfcmpeq)
+
+    CFI_END_FUNCTION
+  CM0_FUNC_END aeabi_cfcmpeq
+#endif
+
+// void __aeabi_cfcmple(float, float)
+// Three-way compare of $r0 ? $r1, with result in the status flags:
+//  * 'Z' is set only when the operands are ordered and equal.
+//  * 'C' is clear only when the operands are ordered and $r0 < $r1.
+// Preserves all core registers except $ip, $lr, and the CPSR.
+.section .text.libgcc.cfcmple,"x"
+CM0_FUNC_START aeabi_cfcmple
+
+  // __aeabi_cfcmpeq() is defined separately when TRAP_NANS is enabled.
+  #if !defined(TRAP_NANS) || !TRAP_NANS
+    FUNC_ALIAS aeabi_cfcmpeq aeabi_cfcmple
+  #endif
+
+    CFI_START_FUNCTION
+
+        push    { r0-r3, lr }
+
+        // Save the current CFI state
+        .cfi_remember_state
+        .cfi_adjust_cfa_offset 20
+        .cfi_rel_offset r0, 0
+        .cfi_rel_offset r1, 4
+        .cfi_rel_offset r2, 8
+        .cfi_rel_offset r3, 12
+        .cfi_rel_offset lr, 16
+
+    LSYM(__real_cfrcmple):
+  #if defined(TRAP_NANS) && TRAP_NANS
+        // The result in $r0 will be ignored, but do raise exceptions.
+        // On an unordered result, 'C' should be '1' and 'Z' should be '0'.
+        // A subtraction giving -1 sets these flags correctly.
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_RAISE_EXCEPTIONS)
+  #endif
+
+    LSYM(__real_cfcmpeq):
+        // __internal_cmpsf2() always sets the APSR flags on return.
+        bl      LSYM(__internal_cmpsf2)
+
+        // Because __aeabi_cfcmpeq() wants the 'C' flag set on equal values,
+        //  magic is required.   For the possible intermediate values in $r3:
+        //  * 0b01 gives C = 0 and Z = 0 for $r0 < $r1
+        //  * 0b10 gives C = 1 and Z = 1 for $r0 == $r1
+        //  * 0b11 gives C = 1 and Z = 0 for $r0 > $r1 (or unordered)
+        cmp    r1,     #0
+
+        // Cleanup.
+        pop    { r0-r3, pc }
+        .cfi_restore_state
+
+    CFI_END_FUNCTION
+
+  #if !defined(TRAP_NANS) || !TRAP_NANS
+    CM0_FUNC_END aeabi_cfcmpeq
+  #endif
+
+CM0_FUNC_END aeabi_cfcmple
+
+
+// int isgreaterf(float, float)
+// Returns '1' in $r0 if ($r0 > $r1) and both $r0 and $r1 are ordered.
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libm.isgreaterf,"x"
+CM0_FUNC_START isgreaterf
+MATH_ALIAS isgreaterf
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_GT)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+MATH_END isgreaterf
+CM0_FUNC_END isgreaterf
+
+
+// int isgreaterequalf(float, float)
+// Returns '1' in $r0 if ($r0 >= $r1) and both $r0 and $r1 are ordered.
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libm.isgreaterequalf,"x"
+CM0_FUNC_START isgreaterequalf
+MATH_ALIAS isgreaterequalf
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_GT + FCMP_EQ)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+MATH_END isgreaterequalf
+CM0_FUNC_END isgreaterequalf
+
+
+// int islessf(float, float)
+// Returns '1' in $r0 if ($r0 < $r1) and both $r0 and $r1 are ordered.
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libm.islessf,"x"
+CM0_FUNC_START islessf
+MATH_ALIAS islessf
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_GT + FCMP_EQ)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+MATH_END islessf
+CM0_FUNC_END islessf
+
+
+// int islessequalf(float, float)
+// Returns '1' in $r0 if ($r0 <= $r1) and both $r0 and $r1 are ordered.
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libm.islessequalf,"x"
+CM0_FUNC_START islessequalf
+MATH_ALIAS islessequalf
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_GT + FCMP_EQ)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+MATH_END islessequalf
+CM0_FUNC_END islessequalf
+
+
+// int islessgreaterf(float, float)
+// Returns '1' in $r0 if ($r0 != $r1) and both $r0 and $r1 are ordered.
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libm.islessgreaterf,"x"
+CM0_FUNC_START islessgreaterf
+MATH_ALIAS islessgreaterf
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_GT + FCMP_EQ)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+MATH_END islessgreaterf
+CM0_FUNC_END islessgreaterf
+
+
+// int isunorderedf(float, float)
+// Returns '1' in $r0 if either $r0 or $r1 are ordered.
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.libm.isunorderedf,"x"
+CM0_FUNC_START isunorderedf
+MATH_ALIAS isunorderedf
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_GT + FCMP_EQ)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+MATH_END isunorderedf
+CM0_FUNC_END isunorderedf
+
+#endif /* 0 */
+
diff -ruN gcc-11-20201108-clean/libgcc/config/arm/cm0/fconv.S gcc-11-20201108/libgcc/config/arm/cm0/fconv.S
--- gcc-11-20201108-clean/libgcc/config/arm/cm0/fconv.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201108/libgcc/config/arm/cm0/fconv.S	2020-11-30 15:08:39.336813386 -0800
@@ -0,0 +1,356 @@
+/* fconv.S: Cortex M0 optimized 32- and 64-bit float conversions
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef L_arm_addsubdf3 /* like ieee754-df.S */
+
+// double __aeabi_f2d(float)
+// Converts a single-precision float in $r0 to double-precision in $r1:$r0.
+// Rounding, overflow, and underflow are impossible.
+// INF and ZERO are returned unmodified.
+.section .text.libgcc.f2d,"x"
+CM0_FUNC_START aeabi_f2d
+FUNC_ALIAS extendsfdf2 aeabi_f2d
+    CFI_START_FUNCTION
+
+        // Save the sign.
+        lsrs    r1,     r0,     #31
+        lsls    r1,     #31
+
+        // Set up registers for __fp_normalize2().
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        // Test for zero.
+        lsls    r0,     #1
+        beq     LSYM(__f2d_return)
+
+        // Split the exponent and mantissa into separate registers.
+        // This is the most efficient way to convert subnormals in the
+        //  half-precision form into normals in single-precision.
+        // This does add a leading implicit '1' to INF and NAN,
+        //  but that will be absorbed when the value is re-assembled.
+        movs    r2,     r0
+        bl      SYM(__fp_normalize2) __PLT__
+
+        // Set up the exponent bias.  For INF/NAN values, the bias
+        //  is 1791 (2047 - 255 - 1), where the last '1' accounts
+        //  for the implicit '1' in the mantissa.
+        movs    r0,     #3
+        lsls    r0,     #9
+        adds    r0,     #255
+
+        // Test for INF/NAN, promote exponent if necessary
+        cmp     r2,     #255
+        beq     LSYM(__f2d_indefinite)
+
+        // For normal values, the exponent bias is 895 (1023 - 127 - 1),
+        //  which is half of the prepared INF/NAN bias.
+        lsrs    r0,     #1
+
+    LSYM(__f2d_indefinite):
+        // Assemble exponent with bias correction.
+        adds    r2,     r0
+        lsls    r2,     #20
+        adds    r1,     r2
+
+        // Assemble the high word of the mantissa.
+        lsrs    r0,     r3,     #11
+        add     r1,     r0
+
+        // Remainder of the mantissa in the low word of the result.
+        lsls    r0,     r3,     #21
+
+    LSYM(__f2d_return):
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END extendsfdf2
+CM0_FUNC_END aeabi_f2d
+
+#endif /* L_addsubdf3 */
+
+
+#ifdef L_arm_truncdfsf2
+
+// float __aeabi_d2f(double)
+// Converts a double-precision float in $r1:$r0 to single-precision in $r0.
+// Values out of range become ZERO or INF; returns the upper 23 bits of NAN.
+// Rounds to nearest, ties to even.  The ARM ABI does not appear to specify a
+//  rounding mode, so no problems here.  Unfortunately, GCC specifies rounding
+//  towards zero, which makes this implementation incompatible.
+// (It would be easy enough to truncate normal values, but single-precision
+//  subnormals would require a significantly more complex approach.)
+.section .text.libgcc.d2f,"x"
+CM0_FUNC_START aeabi_d2f
+// FUNC_ALIAS truncdfsf2 aeabi_d2f // incompatible rounding
+    CFI_START_FUNCTION
+
+        // Save the sign.
+        lsrs    r2,     r1,     #31
+        lsls    r2,     #31
+        mov     ip,     r2
+
+        // Isolate the exponent (11 bits).
+        lsls    r2,     r1,     #1
+        lsrs    r2,     #21
+
+        // Isolate the mantissa.  It's safe to always add the implicit '1' --
+        //  even for subnormals -- since they will underflow in every case.
+        lsls    r1,     #12
+        adds    r1,     #1
+        rors    r1,     r1
+        lsrs    r3,     r0,     #21
+        adds    r1,     r3
+        lsls    r0,     #11
+
+        // Test for INF/NAN (r3 = 2047)
+        mvns    r3,     r2
+        lsrs    r3,     #21
+        cmp     r3,     r2
+        beq     LSYM(__d2f_indefinite)
+
+        // Adjust exponent bias.  Offset is 127 - 1023, less 1 more since
+        //  __fp_assemble() expects the exponent relative to bit[30].
+        lsrs    r3,     #1
+        subs    r2,     r3
+        adds    r2,     #126
+
+    LSYM(__d2f_assemble):
+        // Use the standard formatting for overflow and underflow.
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        b       SYM(__fp_assemble)
+                .cfi_restore_state
+
+    LSYM(__d2f_indefinite):
+        // Test for INF.  If the mantissa, exclusive of the implicit '1',
+        //  is equal to '0', the result will be INF.
+        lsls    r3,     r1,     #1
+        orrs    r3,     r0
+        beq     LSYM(__d2f_assemble)
+
+        // Construct NAN with the upper 22 bits of the mantissa, setting bit[21]
+        //  to ensure a valid NAN without changing bit[22] (quiet)
+        subs    r2,     #0xD
+        lsls    r0,     r2,     #20
+        lsrs    r1,     #8
+        orrs    r0,     r1
+
+      #if defined(STRICT_NANS) && STRICT_NANS
+        add     r0,     ip
+      #endif
+
+        RETx    lr
+
+    CFI_END_FUNCTION
+// CM0_FUNC_END truncdfsf2
+CM0_FUNC_END aeabi_d2f
+
+#endif /* L_arm_truncdfsf2 */
+
+
+#ifdef L_arm_h2f 
+
+// float __aeabi_h2f(short hf)
+// Converts a half-precision float in $r0 to single-precision.
+// Rounding, overflow, and underflow conditions are impossible.
+// INF and ZERO are returned unmodified.
+.section .text.libgcc.h2f,"x"
+CM0_FUNC_START aeabi_h2f
+    CFI_START_FUNCTION
+
+        // Set up registers for __fp_normalize2().
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        // Save the mantissa and exponent.
+        lsls    r2,     r0,     #17
+
+        // Isolate the sign.
+        lsrs    r0,     #15
+        lsls    r0,     #31
+
+        // Align the exponent at bit[24] for normalization.
+        // If zero, return the original sign.
+        lsrs    r2,     #3
+        beq     LSYM(__h2f_return)
+
+        // Split the exponent and mantissa into separate registers.
+        // This is the most efficient way to convert subnormals in the
+        //  half-precision form into normals in single-precision.
+        // This does add a leading implicit '1' to INF and NAN,
+        //  but that will be absorbed when the value is re-assembled.
+        bl      SYM(__fp_normalize2) __PLT__
+
+        // Set up the exponent bias.  For INF/NAN values, the bias is 223,
+        //  where the last '1' accounts for the implicit '1' in the mantissa.
+        adds    r2,     #(255 - 31 - 1)
+
+        // Test for INF/NAN.
+        cmp     r2,     #254
+        beq     LSYM(__h2f_assemble)
+
+        // For normal values, the bias should have been 111.
+        // However, this adjustment now is faster than branching.
+        subs    r2,     #((255 - 31 - 1) - (127 - 15 - 1))
+
+    LSYM(__h2f_assemble):
+        // Combine exponent and sign.
+        lsls    r2,     #23
+        adds    r0,     r2
+
+        // Combine mantissa.
+        lsrs    r3,     #8
+        add     r0,     r3
+
+    LSYM(__h2f_return):
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_h2f
+
+#endif /* L_arm_h2f */
+
+
+#ifdef L_arm_f2h
+
+// short __aeabi_f2h(float f)
+// Converts a single-precision float in $r1 to half-precision,
+//  rounding to nearest, ties to even.
+// Values out of range become ZERO or INF; returns the upper 12 bits of NAN.
+// Values out of range are forced to either ZERO or INF.
+.section .text.libgcc.f2h,"x"
+CM0_FUNC_START aeabi_f2h
+    CFI_START_FUNCTION
+
+        // Set up the sign.
+        lsrs    r2,     r0,     #31
+        lsls    r2,     #15
+
+        // Save the exponent and mantissa.
+        // If ZERO, return the original sign.
+        lsls    r0,     #1
+        beq     LSYM(__f2h_return)
+
+        // Isolate the exponent, check for NAN.
+        lsrs    r1,     r0,     #24
+        cmp     r1,     #255
+        beq     LSYM(__f2h_indefinite)
+
+        // Check for overflow.
+        cmp     r1,     #(127 + 15)
+        bhi     LSYM(__f2h_overflow)
+
+        // Isolate the mantissa, adding back the implicit '1'.
+        lsls    r0,     #8
+        adds    r0,     #1
+        rors    r0,     r0
+
+        // Adjust exponent bias for half-precision, including '1' to
+        //  account for the mantissa's implicit '1'.
+        subs    r1,     #(127 - 15 + 1)
+        bmi     LSYM(__f2h_underflow)
+
+        // Combine the exponent and sign.
+        lsls    r1,     #10
+        adds    r2,     r1
+
+        // Split the mantissa (11 bits) and remainder (13 bits).
+        lsls    r3,     r0,     #12
+        lsrs    r0,     #21
+
+     LSYM(__f2h_round):
+        // If the carry bit is '0', always round down.
+        bcc     LSYM(__f2h_return)
+
+        // Carry was set.  If a tie (no remainder) and the
+        //  LSB of the result are '0', round down (to even).
+        lsls    r1,     r0,     #31
+        orrs    r1,     r3
+        beq     LSYM(__f2h_return)
+
+        // Round up, ties to even.
+        adds    r0,     #1
+
+     LSYM(__f2h_return):
+        // Combine mantissa and exponent.
+        adds    r0,     r2
+        RETx    lr
+
+    LSYM(__f2h_underflow):
+        // Align the remainder. The remainder consists of the last 12 bits
+        //  of the mantissa plus the magnitude of underflow.
+        movs    r3,     r0
+        adds    r1,     #12
+        lsls    r3,     r1
+
+        // Align the mantissa.  The MSB of the remainder must be
+        // shifted out into last the 'C' flag for rounding.
+        subs    r1,     #33
+        rsbs    r1,     #0
+        lsrs    r0,     r1
+        b       LSYM(__f2h_round)
+
+    LSYM(__f2h_overflow):
+        // Create single-precision INF from which to construct half-precision.
+        movs    r0,     #255
+        lsls    r0,     #24
+
+    LSYM(__f2h_indefinite):
+        // Check for INF.
+        lsls    r3,     r0,     #8
+        beq     LSYM(__f2h_infinite)
+
+        // Set bit[8] to ensure a valid NAN without changing bit[9] (quiet).
+        adds    r2,     #128
+        adds    r2,     #128
+
+    LSYM(__f2h_infinite):
+        // Construct the result from the upper 22 bits of the mantissa
+        //  and the lower 5 bits of the exponent.
+        lsls    r0,     #3
+        lsrs    r0,     #17
+
+        // Combine with the sign (and possibly NAN flag).
+        orrs    r0,     r2
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_f2h
+
+#endif  /* L_arm_f2h */
+
diff -ruN gcc-11-20201108-clean/libgcc/config/arm/cm0/fdiv.S gcc-11-20201108/libgcc/config/arm/cm0/fdiv.S
--- gcc-11-20201108-clean/libgcc/config/arm/cm0/fdiv.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201108/libgcc/config/arm/cm0/fdiv.S	2020-11-30 15:08:39.336813386 -0800
@@ -0,0 +1,257 @@
+/* fdiv.S: Cortex M0 optimized 32-bit float division
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef L_arm_addsubsf3
+
+// float __aeabi_fdiv(float, float)
+// Returns $r0 after division by $r1.
+.section .text.libgcc.fdiv,"x"
+CM0_FUNC_START aeabi_fdiv
+FUNC_ALIAS divsf3 aeabi_fdiv
+    CFI_START_FUNCTION
+
+        // Standard registers, compatible with exception handling.
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        // Save for the sign of the result.
+        movs    r3,     r1
+        eors    r3,     r0
+        lsrs    rT,     r3,     #31
+        lsls    rT,     #31
+        mov     ip,     rT
+
+        // Set up INF for comparison.
+        movs    rT,     #255
+        lsls    rT,     #24
+
+        // Check for divide by 0.  Automatically catches 0/0.
+        lsls    r2,     r1,     #1
+        beq     LSYM(__fdiv_by_zero)
+
+        // Check for INF/INF, or a number divided by itself.
+        lsls    r3,     #1
+        beq     LSYM(__fdiv_equal)
+
+        // Check the numerator for INF/NAN.
+        eors    r3,     r2
+        cmp     r3,     rT
+        bhs     LSYM(__fdiv_special1)
+
+        // Check the denominator for INF/NAN.
+        cmp     r2,     rT
+        bhs     LSYM(__fdiv_special2)
+
+        // Check the numerator for zero.
+        cmp     r3,     #0
+        beq     SYM(__fp_zero)
+
+        // No action if the numerator is subnormal.
+        //  The mantissa will normalize naturally in the division loop.
+        lsls    r0,     #9
+        lsrs    r1,     r3,     #24
+        beq     LSYM(__fdiv_denominator)
+
+        // Restore the numerator's implicit '1'.
+        adds    r0,     #1
+        rors    r0,     r0
+
+    LSYM(__fdiv_denominator):
+        // The denominator must be normalized and left aligned.
+        bl      SYM(__fp_normalize2)
+
+        // 25 bits of precision will be sufficient.
+        movs    rT,     #64
+
+        // Run division.
+        bl      SYM(__internal_fdiv_loop)
+        b       SYM(__fp_assemble)
+
+    LSYM(__fdiv_equal):
+      #if defined(EXCEPTION_CODES) && EXCEPTION_CODES
+        movs    r3,     #(DIVISION_INF_BY_INF)
+      #endif
+
+        // The absolute value of both operands are equal, but not 0.
+        // If both operands are INF, create a new NAN.
+        cmp     r2,     rT
+        beq     SYM(__fp_exception)
+
+      #if defined(TRAP_NANS) && TRAP_NANS
+        // If both operands are NAN, return the NAN in $r0.
+        bhi     SYM(__fp_check_nan)
+      #else
+        bhi     LSYM(__fdiv_return)
+      #endif
+
+        // Return 1.0f, with appropriate sign.
+        movs    r0,     #127
+        lsls    r0,     #23
+        add     r0,     ip
+
+    LSYM(__fdiv_return):
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    LSYM(__fdiv_special2):
+        // The denominator is either INF or NAN, numerator is neither.
+        // Also, the denominator is not equal to 0.
+        // If the denominator is INF, the result goes to 0.
+        beq     SYM(__fp_zero)
+
+        // The only other option is NAN, fall through to branch.
+        mov     r0,     r1
+
+    LSYM(__fdiv_special1):
+      #if defined(TRAP_NANS) && TRAP_NANS
+        // The numerator is INF or NAN.  If NAN, return it directly.
+        bne     SYM(__fp_check_nan)
+      #else
+        bne     LSYM(__fdiv_return)
+      #endif
+
+        // If INF, the result will be INF if the denominator is finite.
+        // The denominator won't be either INF or 0,
+        //  so fall through the exception trap to check for NAN.
+        movs    r0,     r1
+
+    LSYM(__fdiv_by_zero):
+      #if defined(EXCEPTION_CODES) && EXCEPTION_CODES
+        movs    r3,     #(DIVISION_0_BY_0)
+      #endif
+
+        // The denominator is 0.
+        // If the numerator is also 0, the result will be a new NAN.
+        // Otherwise the result will be INF, with the correct sign.
+        lsls    r2,     r0,     #1
+        beq     SYM(__fp_exception)
+
+        // The result should be NAN if the numerator is NAN.  Otherwise,
+        //  the result is INF regardless of the numerator value.
+        cmp     r2,     rT
+
+      #if defined(TRAP_NANS) && TRAP_NANS
+        bhi     SYM(__fp_check_nan)
+      #else
+        bhi     LSYM(__fdiv_return)
+      #endif
+
+        // Recreate INF with the correct sign.
+        b       SYM(__fp_infinity)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END divsf3
+CM0_FUNC_END aeabi_fdiv
+
+
+// Division helper, possibly to be shared with atan2.
+// Expects the numerator mantissa in $r0, exponent in $r1,
+//  plus the denominator mantissa in $r3, exponent in $r2, and
+//  a bit pattern in $rT that controls the result precision.
+// Returns quotient in $r1, exponent in $r2, pseudo remainder in $r0.
+.section .text.libgcc.fdiv2,"x"
+CM0_FUNC_START internal_fdiv_loop
+    CFI_START_FUNCTION
+
+        // Initialize the exponent, relative to bit[30].
+        subs    r2,     r1,     r2
+
+    SYM(__internal_fdiv_loop2):
+        // The exponent should be (expN - 127) - (expD - 127) + 127.
+        // An additional offset of 25 is required to account for the
+        //  minimum number of bits in the result (before rounding).
+        // However, drop '1' because the offset is relative to bit[30],
+        //  while the result is calculated relative to bit[31].
+        adds    r2,     #(127 + 25 - 1)
+
+      #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+        // Dividing by a power of 2?
+        lsls    r1,     r3,     #1
+        beq     LSYM(__fdiv_simple)
+      #endif
+
+        // Initialize the result.
+        eors    r1,     r1
+
+        // Clear the MSB, so that when the numerator is smaller than
+        //  the denominator, there is one bit free for a left shift.
+        // After a single shift, the numerator is guaranteed to be larger.
+        // The denominator ends up in r3, and the numerator ends up in r0,
+        //  so that the numerator serves as a psuedo-remainder in rounding.
+        // Shift the numerator one additional bit to compensate for the
+        //  pre-incrementing loop.
+        lsrs    r0,     #2
+        lsrs    r3,     #1
+
+    LSYM(__fdiv_loop):
+        // Once the MSB of the output reaches the MSB of the register,
+        //  the result has been calculated to the required precision.
+        lsls    r1,     #1
+        bmi     LSYM(__fdiv_break)
+
+        // Shift the numerator/remainder left to set up the next bit.
+        subs    r2,     #1
+        lsls    r0,     #1
+
+        // Test if the numerator/remainder is smaller than the denominator,
+        //  do nothing if it is.
+        cmp     r0,     r3
+        blo     LSYM(__fdiv_loop)
+
+        // If the numerator/remainder is greater or equal, set the next bit,
+        //  and subtract the denominator.
+        adds    r1,     rT
+        subs    r0,     r3
+
+        // Short-circuit if the remainder goes to 0.
+        // Even with the overhead of "subnormal" alignment,
+        //  this is usually much faster than continuing.
+        bne     LSYM(__fdiv_loop)
+
+        // Compensate the alignment of the result.
+        // The remainder does not need compensation, it's already 0.
+        lsls    r1,     #1
+
+    LSYM(__fdiv_break):
+        RETx    lr 
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+    LSYM(__fdiv_simple):
+        // The numerator becomes the result, with a remainder of 0.
+        movs    r1,     r0
+        eors    r0,     r0
+        subs    r2,     #25
+        RETx    lr 
+  #endif
+
+    CFI_END_FUNCTION
+CM0_FUNC_END internal_fdiv_loop
+
+#endif /* L_arm_addsubsf3 */
+
diff -ruN gcc-11-20201108-clean/libgcc/config/arm/cm0/ffixed.S gcc-11-20201108/libgcc/config/arm/cm0/ffixed.S
--- gcc-11-20201108-clean/libgcc/config/arm/cm0/ffixed.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201108/libgcc/config/arm/cm0/ffixed.S	2020-11-30 15:08:39.336813386 -0800
@@ -0,0 +1,339 @@
+/* ffixed.S: Cortex M0 optimized float->int conversion
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef L_arm_addsubsf3
+
+// int __aeabi_f2iz(float)
+// Converts a float in $r0 to signed integer, rounding toward 0.
+// Values out of range are forced to either INT_MAX or INT_MIN.
+// NAN becomes zero.
+.section .text.libgcc.f2iz,"x"
+CM0_FUNC_START aeabi_f2iz
+FUNC_ALIAS fixsfsi aeabi_f2iz
+    CFI_START_FUNCTION
+
+  #if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+        // Flag for unsigned conversion.
+        movs    r1,     #33
+        b       LSYM(__real_f2lz)
+  #else
+        // Flag for signed conversion.
+        movs    r3,     #1
+
+    LSYM(__real_f2iz):
+        // Isolate the sign of the result.
+        asrs    r1,     r0,     #31
+        lsls    r0,     #1
+
+  #if defined(FP_EXCEPTION) && FP_EXCEPTION
+        // Check for zero to avoid spurious underflow exception on -0.
+        beq     LSYM(__f2iz_return)
+  #endif
+
+        // Isolate the exponent.
+        lsrs    r2,     r0,     #24
+
+  #if defined(TRAP_NANS) && TRAP_NANS
+        // Test for NAN.
+        // Otherwise, NAN will be converted like +/-INF.
+        cmp     r2,     #255
+        beq     LSYM(__f2iz_nan)
+  #endif
+
+        // Extract the mantissa and restore the implicit '1'. Technically,
+        //  this is wrong for subnormals, but they flush to zero regardless.
+        lsls    r0,     #8
+        adds    r0,     #1
+        rors    r0,     r0
+
+        // Calculate mantissa alignment. Given the implicit '1' in bit[31]:
+        //  * An exponent less than 127 will automatically flush to 0.
+        //  * An exponent of 127 will result in a shift of 31.
+        //  * An exponent of 128 will result in a shift of 30.
+        //  *  ...
+        //  * An exponent of 157 will result in a shift of 1.
+        //  * An exponent of 158 will result in no shift at all.
+        //  * An exponent larger than 158 will result in overflow.
+        rsbs    r2,     #0
+        adds    r2,     #158
+
+        // When the shift is less than minimum, the result will overflow.
+        // The only signed value to fail this test is INT_MIN (0x80000000),
+        //  but it will be returned correctly from the overflow branch.
+        cmp     r2,     r3
+        blt     LSYM(__f2iz_overflow)
+
+        // If unsigned conversion of a negative value, also overflow.
+        // Would also catch -0.0f if not handled earlier.
+        cmn     r3,     r1
+        blt     LSYM(__f2iz_overflow)
+
+  #if defined(FP_EXCEPTION) && FP_EXCEPTION
+        // Save a copy for remainder testing
+        movs    r3,     r0
+  #endif
+
+        // Truncate the fraction.
+        lsrs    r0,     r2
+
+        // Two's complement negation, if applicable.
+        // Bonus: the sign in $r1 provides a suitable long long result.
+        eors    r0,     r1
+        subs    r0,     r1
+
+  #if defined(FP_EXCEPTION) && FP_EXCEPTION
+        // If any bits set in the remainder, raise FE_INEXACT
+        rsbs    r2,     #0
+        adds    r2,     #32
+        lsls    r3,     r2
+        bne     LSYM(__f2iz_inexact)
+  #endif
+
+    LSYM(__f2iz_return):
+        RETx    lr
+
+    LSYM(__f2iz_overflow):
+        // Positive unsigned integers (r1 == 0, r3 == 0), return 0xFFFFFFFF.
+        // Negative unsigned integers (r1 == -1, r3 == 0), return 0x00000000.
+        // Positive signed integers (r1 == 0, r3 == 1), return 0x7FFFFFFF.
+        // Negative signed integers (r1 == -1, r3 == 1), return 0x80000000.
+        // TODO: FE_INVALID exception, (but not for -2^31).
+        mvns    r0,     r1
+        lsls    r3,     #31
+        eors    r0,     r3
+        RETx    lr
+
+  #if defined(FP_EXCEPTION) && FP_EXCEPTION
+    LSYM(__f2iz_inexact):
+        // TODO: Another class of exceptions that doesn't overwrite $r0.
+        bkpt    #0
+
+      #if defined(EXCEPTION_CODES) && EXCEPTION_CODES
+        movs    r3,     #(CAST_INEXACT)
+      #endif
+
+        b       SYM(__fp_exception)
+  #endif
+
+    LSYM(__f2iz_nan):
+        // Check for INF
+        lsls    r2,     r0,     #9
+        beq     LSYM(__f2iz_overflow)
+
+  #if defined(FP_EXCEPTION) && FP_EXCEPTION
+      #if defined(EXCEPTION_CODES) && EXCEPTION_CODES
+        movs    r3,     #(CAST_UNDEFINED)
+      #endif
+
+        b       SYM(__fp_exception)
+  #else
+
+  #endif
+
+      #if defined(TRAP_NANS) && TRAP_NANS
+
+        // TODO: Extend to long long
+
+        // TODO: bl  fp_check_nan
+      #endif
+
+        // Return long long 0 on NAN.
+        eors    r0,     r0
+        eors    r1,     r1
+        RETx    lr
+
+  #endif // !__OPTIMIZE_SIZE__
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fixsfsi
+CM0_FUNC_END aeabi_f2iz
+
+
+// unsigned int __aeabi_f2uiz(float)
+// Converts a float in $r0 to unsigned integer, rounding toward 0.
+// Values out of range are forced to UINT_MAX.
+// Negative values and NAN all become zero.
+.section .text.libgcc.f2uiz,"x"
+CM0_FUNC_START aeabi_f2uiz
+FUNC_ALIAS fixunssfsi aeabi_f2uiz
+    CFI_START_FUNCTION
+
+  #if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+        // Flag for unsigned conversion.
+        movs    r1,     #32
+        b       LSYM(__real_f2lz)
+  #else
+        // Flag for unsigned conversion.
+        movs    r3,     #0
+        b       LSYM(__real_f2iz)
+  #endif // !__OPTIMIZE_SIZE__
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fixunssfsi
+CM0_FUNC_END aeabi_f2uiz
+
+
+// long long aeabi_f2lz(float)
+// Converts a float in $r0 to a 64 bit integer in $r1:$r0, rounding toward 0.
+// Values out of range are forced to either INT64_MAX or INT64_MIN.
+// NAN becomes zero.
+.section .text.libgcc.f2lz,"x"
+CM0_FUNC_START aeabi_f2lz
+FUNC_ALIAS fixsfdi aeabi_f2lz
+    CFI_START_FUNCTION
+
+        movs    r1,     #1
+
+    LSYM(__real_f2lz):
+        // Split the sign of the result from the mantissa/exponent field.
+        // Handle +/-0 specially to avoid spurious exceptions.
+        asrs    r3,     r0,     #31
+        lsls    r0,     #1
+        beq     LSYM(__f2lz_zero)
+
+        // If unsigned conversion of a negative value, also overflow.
+        // Specifically, is the LSB of $r1 clear when $r3 is equal to '-1'?
+        //
+        // $r3 (sign)   >=     $r2 (flag)
+        // 0xFFFFFFFF   false   0x00000000
+        // 0x00000000   true    0x00000000
+        // 0xFFFFFFFF   true    0x80000000
+        // 0x00000000   true    0x80000000
+        //
+        // (NOTE: This test will also trap -0.0f, unless handled earlier.)
+        lsls    r2,     r1,     #31
+        cmp     r3,     r2
+        blt     LSYM(__f2lz_overflow)
+
+        // Isolate the exponent.
+        lsrs    r2,     r0,     #24
+
+//   #if defined(TRAP_NANS) && TRAP_NANS
+//         // Test for NAN.
+//         // Otherwise, NAN will be converted like +/-INF.
+//         cmp     r2,     #255
+//         beq     LSYM(__f2lz_nan)
+//   #endif
+
+        // Calculate mantissa alignment. Given the implicit '1' in bit[31]:
+        //  * An exponent less than 127 will automatically flush to 0.
+        //  * An exponent of 127 will result in a shift of 63.
+        //  * An exponent of 128 will result in a shift of 62.
+        //  *  ...
+        //  * An exponent of 189 will result in a shift of 1.
+        //  * An exponent of 190 will result in no shift at all.
+        //  * An exponent larger than 190 will result in overflow
+        //     (189 in the case of signed integers).
+        rsbs    r2,     #0
+        adds    r2,     #190
+        // When the shift is less than minimum, the result will overflow.
+        // The only signed value to fail this test is INT_MIN (0x80000000),
+        //  but it will be returned correctly from the overflow branch.
+        cmp     r2,     r1
+        blt     LSYM(__f2lz_overflow)
+
+        // Extract the mantissa and restore the implicit '1'. Technically,
+        //  this is wrong for subnormals, but they flush to zero regardless.
+        lsls    r0,     #8
+        adds    r0,     #1
+        rors    r0,     r0
+
+        // Calculate the upper word.
+        // If the shift is greater than 32, gives an automatic '0'.
+        movs    r1,     r0
+        lsrs    r1,     r2
+
+        // Reduce the shift for the lower word.
+        // If the original shift was less than 32, the result may be split
+        //  between the upper and lower words.
+        subs    r2,     #32
+        blt     LSYM(__f2lz_split)
+
+        // Shift is still positive, keep moving right.
+        lsrs    r0,     r2
+
+        // TODO: Remainder test.
+        // $r1 is technically free, as long as it's zero by the time
+        //  this is over.
+
+    LSYM(__f2lz_return):
+        // Two's complement negation, if the original was negative.
+        eors    r0,     r3
+        eors    r1,     r3
+        subs    r0,     r3
+        sbcs    r1,     r3
+        RETx    lr
+
+    LSYM(__f2lz_split):
+        // Shift was negative, calculate the remainder
+        rsbs    r2,     #0
+        lsls    r0,     r2
+        b       LSYM(__f2lz_return)
+
+    LSYM(__f2lz_zero):
+        eors    r1,     r1
+        RETx    lr
+
+    LSYM(__f2lz_overflow):
+        // Positive unsigned integers (r3 == 0, r1 == 0), return 0xFFFFFFFF.
+        // Negative unsigned integers (r3 == -1, r1 == 0), return 0x00000000.
+        // Positive signed integers (r3 == 0, r1 == 1), return 0x7FFFFFFF.
+        // Negative signed integers (r3 == -1, r1 == 1), return 0x80000000.
+        // TODO: FE_INVALID exception, (but not for -2^63).
+        mvns    r0,     r3
+
+        // For 32-bit results
+        lsls    r2,     r1,     #26
+        lsls    r1,     #31
+        ands    r2,     r1
+        eors    r0,     r2
+
+//    LSYM(__f2lz_zero):
+        eors    r1,     r0
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fixsfdi
+CM0_FUNC_END aeabi_f2lz
+
+
+// unsigned long long __aeabi_f2ulz(float)
+// Converts a float in $r0 to a 64 bit integer in $r1:$r0, rounding toward 0.
+// Values out of range are forced to UINT64_MAX.
+// Negative values and NAN all become zero.
+.section .text.libgcc.f2ulz,"x"
+CM0_FUNC_START aeabi_f2ulz
+FUNC_ALIAS fixunssfdi aeabi_f2ulz
+    CFI_START_FUNCTION
+
+        eors    r1,     r1
+        b       LSYM(__real_f2lz)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fixunssfdi
+CM0_FUNC_END aeabi_f2ulz
+
+#endif /* L_arm_addsubsf3 */ 
+
diff -ruN gcc-11-20201108-clean/libgcc/config/arm/cm0/ffloat.S gcc-11-20201108/libgcc/config/arm/cm0/ffloat.S
--- gcc-11-20201108-clean/libgcc/config/arm/cm0/ffloat.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201108/libgcc/config/arm/cm0/ffloat.S	2020-11-30 15:08:39.336813386 -0800
@@ -0,0 +1,97 @@
+/* ffixed.S: Cortex M0 optimized int->float conversion
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef L_arm_addsubsf3
+
+// float __aeabi_i2f(int)
+// Converts a signed integer in $r0 to float.
+.section .text.libgcc.il2f,"x"
+CM0_FUNC_START aeabi_i2f
+FUNC_ALIAS floatsisf aeabi_i2f
+    CFI_START_FUNCTION
+
+        // Sign extension to long long.
+        asrs    r1,     r0,     #31
+
+// float __aeabi_l2f(long long)
+// Converts a signed 64-bit integer in $r1:$r0 to a float in $r0.
+CM0_FUNC_START aeabi_l2f
+FUNC_ALIAS floatdisf aeabi_l2f
+
+        // Save the sign.
+        asrs    r3,     r1,     #31
+
+        // Absolute value of the input.
+        eors    r0,     r3
+        eors    r1,     r3
+        subs    r0,     r3
+        sbcs    r1,     r3
+
+        b       LSYM(__internal_uil2f)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END floatdisf
+CM0_FUNC_END aeabi_l2f
+CM0_FUNC_END floatsisf
+CM0_FUNC_END aeabi_i2f
+
+
+// float __aeabi_ui2f(unsigned)
+// Converts a unsigned integer in $r0 to float.
+.section .text.libgcc.uil2f,"x"
+CM0_FUNC_START aeabi_ui2f
+FUNC_ALIAS floatunsisf aeabi_ui2f
+    CFI_START_FUNCTION
+
+        // Convert to unsigned long long with upper bits of 0.
+        eors    r1,     r1
+
+// float __aeabi_ul2f(unsigned long long)
+// Converts a unsigned 64-bit integer in $r1:$r0 to a float in $r0.
+CM0_FUNC_START aeabi_ul2f
+FUNC_ALIAS floatundisf aeabi_ul2f
+
+        // Sign is always positive.
+        eors    r3,     r3
+
+    LSYM(__internal_uil2f):
+        // Default exponent, relative to bit[30] of $r1.
+        movs    r2,     #(189)
+
+        // Format the sign.
+        lsls    r3,     #31
+        mov     ip,     r3
+
+        push    { rT, lr }
+        b       SYM(__fp_assemble)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END floatundisf
+CM0_FUNC_END aeabi_ul2f
+CM0_FUNC_END floatunsisf
+CM0_FUNC_END aeabi_ui2f
+
+#endif /* L_arm_addsubsf3 */
+
diff -ruN gcc-11-20201108-clean/libgcc/config/arm/cm0/fmul.S gcc-11-20201108/libgcc/config/arm/cm0/fmul.S
--- gcc-11-20201108-clean/libgcc/config/arm/cm0/fmul.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201108/libgcc/config/arm/cm0/fmul.S	2020-11-30 15:08:39.336813386 -0800
@@ -0,0 +1,215 @@
+/* fmul.S: Cortex M0 optimized 32-bit float multiplication
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef L_arm_addsubsf3
+
+// float __aeabi_fmul(float, float)
+// Returns $r0 after multiplication by $r1.
+.section .text.libgcc.fmul,"x"
+CM0_FUNC_START aeabi_fmul
+FUNC_ALIAS mulsf3 aeabi_fmul
+    CFI_START_FUNCTION
+
+        // Standard registers, compatible with exception handling.
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        // Save the sign of the result.
+        movs    rT,     r1
+        eors    rT,     r0
+        lsrs    rT,     #31
+        lsls    rT,     #31
+        mov     ip,     rT
+
+        // Set up INF for comparison.
+        movs    rT,     #255
+        lsls    rT,     #24
+
+        // Check for multiplication by zero.
+        lsls    r2,     r0,     #1
+        beq     LSYM(__fmul_zero1)
+
+        lsls    r3,     r1,     #1
+        beq     LSYM(__fmul_zero2)
+
+        // Check for INF/NAN.
+        cmp     r3,     rT
+        bhs     LSYM(__fmul_special2)
+
+        cmp     r2,     rT
+        bhs     LSYM(__fmul_special1)
+
+        // Because neither operand is INF/NAN, the result will be finite.
+        // It is now safe to modify the original operand registers.
+        lsls    r0,     #9
+
+        // Isolate the first exponent.  When normal, add back the implicit '1'.
+        // The result is always aligned with the MSB in bit [31].
+        // Subnormal mantissas remain effectively multiplied by 2x relative to
+        //  normals, but this works because the weight of a subnormal is -126.
+        lsrs    r2,     #24
+        beq     LSYM(__fmul_normalize2)
+        adds    r0,     #1
+        rors    r0,     r0
+
+    LSYM(__fmul_normalize2):
+        // IMPORTANT: exp10i() jumps in here!
+        // Repeat for the mantissa of the second operand.
+        // Short-circuit when the mantissa is 1.0, as the
+        //  first mantissa is already prepared in $r0
+        lsls    r1,     #9
+
+        // When normal, add back the implicit '1'.
+        lsrs    r3,     #24
+        beq     LSYM(__fmul_go)
+        adds    r1,     #1
+        rors    r1,     r1
+
+    LSYM(__fmul_go):
+        // Calculate the final exponent, relative to bit [30].
+        adds    rT,     r2,     r3
+        subs    rT,     #127
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+        // Short-circuit on multiplication by powers of 2.
+        lsls    r3,     r0,     #1
+        beq     LSYM(__fmul_simple1)
+
+        lsls    r3,     r1,     #1
+        beq     LSYM(__fmul_simple2)
+  #endif
+
+        // Save $ip across the call.
+        // (Alternatively, could push/pop a separate register,
+        //  but the four instructions here are equivally fast)
+        //  without imposing on the stack.
+        add     rT,     ip
+
+        // 32x32 unsigned multiplication, 64 bit result.
+        bl      SYM(__umulsidi3) __PLT__
+
+        // Separate the saved exponent and sign.
+        sxth    r2,     rT
+        subs    rT,     r2
+        mov     ip,     rT
+
+        b       SYM(__fp_assemble)
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+    LSYM(__fmul_simple2):
+        // Move the high bits of the result to $r1.
+        movs    r1,     r0
+
+    LSYM(__fmul_simple1):
+        // Clear the remainder.
+        eors    r0,     r0
+
+        // Adjust mantissa to match the exponent, relative to bit[30].
+        subs    r2,     rT,     #1
+        b       SYM(__fp_assemble)
+  #endif
+
+    LSYM(__fmul_zero1):
+        // $r0 was equal to 0, set up to check $r1 for INF/NAN.
+        lsls    r2,     r1,     #1
+
+    LSYM(__fmul_zero2):
+      #if defined(EXCEPTION_CODES) && EXCEPTION_CODES
+        movs    r3,     #(INFINITY_TIMES_ZERO)
+      #endif
+
+        // Check the non-zero operand for INF/NAN.
+        // If NAN, it should be returned.
+        // If INF, the result should be NAN.
+        // Otherwise, the result will be +/-0.
+        cmp     r2,     rT
+        beq     SYM(__fp_exception)
+
+        // If the second operand is finite, the result is 0.
+        blo     SYM(__fp_zero)
+
+      #if defined(STRICT_NANS) && STRICT_NANS
+        // Restore values that got mixed in zero testing, then go back
+        //  to sort out which one is the NAN.
+        lsls    r3,     r1,     #1
+        lsls    r2,     r0,     #1
+      #elif defined(TRAP_NANS) && TRAP_NANS
+        // Return NAN with the sign bit cleared.
+        lsrs    r0,     r2,     #1
+        b       SYM(__fp_check_nan)
+      #else
+        lsrs    r0,     r2,     #1
+        // Return NAN with the sign bit cleared.
+        pop     { rT, pc }
+                .cfi_restore_state
+      #endif
+
+    LSYM(__fmul_special2):
+        // $r1 is INF/NAN.  In case of INF, check $r0 for NAN.
+        cmp     r2,     rT
+
+      #if defined(TRAP_NANS) && TRAP_NANS
+        // Force swap if $r0 is not NAN.
+        bls     LSYM(__fmul_swap)
+
+        // $r0 is NAN, keep if $r1 is INF
+        cmp     r3,     rT
+        beq     LSYM(__fmul_special1)
+
+        // Both are NAN, keep the smaller value (more likely to signal).
+        cmp     r2,     r3
+      #endif
+
+        // Prefer the NAN already in $r0.
+        //  (If TRAP_NANS, this is the smaller NAN).
+        bhi     LSYM(__fmul_special1)
+
+    LSYM(__fmul_swap):
+        movs    r0,     r1
+
+    LSYM(__fmul_special1):
+        // $r0 is either INF or NAN.  $r1 has already been examined.
+        // Flags are already set correctly.
+        lsls    r2,     r0,     #1
+        cmp     r2,     rT
+        beq     SYM(__fp_infinity)
+
+      #if defined(TRAP_NANS) && TRAP_NANS
+        b       SYM(__fp_check_nan)
+      #else
+        pop     { rT, pc }
+                .cfi_restore_state
+      #endif
+
+    CFI_END_FUNCTION
+CM0_FUNC_END mulsf3
+CM0_FUNC_END aeabi_fmul
+
+#endif /* L_arm_addsubsf3 */
+
diff -ruN gcc-11-20201108-clean/libgcc/config/arm/cm0/fneg.S gcc-11-20201108/libgcc/config/arm/cm0/fneg.S
--- gcc-11-20201108-clean/libgcc/config/arm/cm0/fneg.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201108/libgcc/config/arm/cm0/fneg.S	2020-11-30 15:08:39.336813386 -0800
@@ -0,0 +1,76 @@
+/* fneg.S: Cortex M0 optimized 32-bit float negation
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef L_arm_addsubsf3
+
+// float __aeabi_fneg(float) [obsolete]
+// The argument and result are in $r0.
+// Uses $r1 and $r2 as scratch registers.
+.section .text.libgcc.fneg,"x"
+CM0_FUNC_START aeabi_fneg
+FUNC_ALIAS negsf2 aeabi_fneg
+    CFI_START_FUNCTION
+
+  #if (defined(STRICT_NANS) && STRICT_NANS) || \
+      (defined(TRAP_NANS) && TRAP_NANS)
+        // Check for NAN.
+        lsls    r1,     r0,     #1
+        movs    r2,     #255
+        lsls    r2,     #24
+        cmp     r1,     r2
+
+      #if defined(TRAP_NANS) && TRAP_NANS
+        blo     SYM(__fneg_nan)
+      #else
+        blo     LSYM(__fneg_return)
+      #endif
+  #endif
+
+        // Flip the sign.
+        movs    r1,     #1
+        lsls    r1,     #31
+        eors    r0,     r1
+
+    LSYM(__fneg_return):
+        RETx    lr
+
+  #if defined(TRAP_NANS) && TRAP_NANS
+    LSYM(__fneg_nan):
+        // Set up registers for exception handling.
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        b       SYM(fp_check_nan)
+  #endif
+
+    CFI_END_FUNCTION
+CM0_FUNC_END negsf2
+CM0_FUNC_END aeabi_fneg
+
+#endif /* L_arm_addsubsf3 */
+
diff -ruN gcc-11-20201108-clean/libgcc/config/arm/cm0/fplib.h gcc-11-20201108/libgcc/config/arm/cm0/fplib.h
--- gcc-11-20201108-clean/libgcc/config/arm/cm0/fplib.h	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201108/libgcc/config/arm/cm0/fplib.h	2020-11-30 15:08:39.336813386 -0800
@@ -0,0 +1,80 @@
+/* fplib.h: Cortex M0 optimized 64-bit header definitions 
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#ifndef __CM0_FPLIB_H
+#define __CM0_FPLIB_H 
+
+/* Enable exception interrupt handler.  
+   Exception implementation is opportunistic, and not fully tested.  */
+#define TRAP_EXCEPTIONS (0)
+#define EXCEPTION_CODES (0)
+
+/* Perform extra checks to avoid modifying the sign bit of NANs */
+#define STRICT_NANS (0)
+
+/* Trap signaling NANs regardless of context. */
+#define TRAP_NANS (0)
+
+/* TODO: Define service numbers according to the handler requirements */ 
+#define SVC_TRAP_NAN (0)
+#define SVC_FP_EXCEPTION (0)
+#define SVC_DIVISION_BY_ZERO (0)
+
+/* Push extra registers when required for 64-bit stack alignment */
+#define DOUBLE_ALIGN_STACK (0)
+
+/* Define various exception codes.  These don't map to anything in particular */
+#define SUBTRACTED_INFINITY (20)
+#define INFINITY_TIMES_ZERO (21)
+#define DIVISION_0_BY_0 (22)
+#define DIVISION_INF_BY_INF (23)
+#define UNORDERED_COMPARISON (24)
+#define CAST_OVERFLOW (25)
+#define CAST_INEXACT (26)
+#define CAST_UNDEFINED (27)
+
+/* Exception control for quiet NANs.
+   If TRAP_NAN support is enabled, signaling NANs always raise exceptions. */
+.equ FCMP_RAISE_EXCEPTIONS, 16
+.equ FCMP_NO_EXCEPTIONS,    0
+
+/* These assignments are significant.  See implementation.
+   They must be shared for use in libm functions.  */
+.equ FCMP_3WAY, 1
+.equ FCMP_LT, 2
+.equ FCMP_EQ, 4
+.equ FCMP_GT, 8
+
+.equ FCMP_GE, (FCMP_EQ | FCMP_GT)
+.equ FCMP_LE, (FCMP_LT | FCMP_EQ)
+.equ FCMP_NE, (FCMP_LT | FCMP_GT)
+
+/* These flags affect the result of unordered comparisons.  See implementation.  */
+.equ FCMP_UN_THREE,     128
+.equ FCMP_UN_POSITIVE,  64
+.equ FCMP_UN_ZERO,      32
+.equ FCMP_UN_NEGATIVE,  0
+
+#endif /* __CM0_FPLIB_H */
diff -ruN gcc-11-20201108-clean/libgcc/config/arm/cm0/futil.S gcc-11-20201108/libgcc/config/arm/cm0/futil.S
--- gcc-11-20201108-clean/libgcc/config/arm/cm0/futil.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201108/libgcc/config/arm/cm0/futil.S	2020-11-30 15:08:39.336813386 -0800
@@ -0,0 +1,407 @@
+/* futil.S: Cortex M0 optimized 32-bit common routines
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+   
+#ifdef L_arm_addsubsf3
+ 
+// Internal function, decomposes the unsigned float in $r2.
+// The exponent will be returned in $r2, the mantissa in $r3.
+// If subnormal, the mantissa will be normalized, so that
+//  the MSB of the mantissa (if any) will be aligned at bit[31].
+// Preserves $r0 and $r1, uses $rT as scratch space.
+.section .text.libgcc.normf,"x"
+CM0_FUNC_START fp_normalize2
+    CFI_START_FUNCTION
+
+        // Extract the mantissa.
+        lsls    r3,     r2,     #8
+
+        // Extract the exponent.
+        lsrs    r2,     #24
+        beq     SYM(__fp_lalign2)
+
+        // Restore the mantissa's implicit '1'.
+        adds    r3,     #1
+        rors    r3,     r3
+
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fp_normalize2
+
+
+// Internal function, aligns $r3 so the MSB is aligned in bit[31].
+// Simultaneously, subtracts the shift from the exponent in $r2
+.section .text.libgcc.alignf,"x"
+CM0_FUNC_START fp_lalign2
+    CFI_START_FUNCTION
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+        // Unroll the loop, similar to __clzsi2().
+        lsrs    rT,     r3,     #16
+        bne     LSYM(__align8)
+        subs    r2,     #16
+        lsls    r3,     #16
+
+    LSYM(__align8):
+        lsrs    rT,     r3,     #24
+        bne     LSYM(__align4)
+        subs    r2,     #8
+        lsls    r3,     #8
+
+    LSYM(__align4):
+        lsrs    rT,     r3,     #28
+        bne     LSYM(__align2)
+        subs    r2,     #4
+        lsls    r3,     #4
+  #endif
+
+    LSYM(__align2):
+        // Refresh the state of the N flag before entering the loop.
+        tst     r3,     r3
+
+    LSYM(__align_loop):
+        // Test before subtracting to compensate for the natural exponent.
+        // The largest subnormal should have an exponent of 0, not -1.
+        bmi     LSYM(__align_return)
+        subs    r2,     #1
+        lsls    r3,     #1
+        bne     LSYM(__align_loop)
+
+        // Not just a subnormal... 0!  By design, this should never happen.
+        // All callers of this internal function filter 0 as a special case.
+        // Was there an uncontrolled jump from somewhere else?  Cosmic ray?
+        eors    r2,     r2
+
+      #ifdef DEBUG
+        bkpt    #0
+      #endif
+
+    LSYM(__align_return):
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fp_lalign2
+
+
+// Internal function to combine mantissa, exponent, and sign. No return.
+// Expects the unsigned result in $r1.  To avoid underflow (slower),
+//  the MSB should be in bits [31:29].
+// Expects any remainder bits of the unrounded result in $r0.
+// Expects the exponent in $r2.  The exponent must be relative to bit[30].
+// Expects the sign of the result (and only the sign) in $ip.
+// Returns a correctly rounded floating value in $r0.
+.section .text.libgcc.assemblef,"x"
+CM0_FUNC_START fp_assemble
+    CFI_START_FUNCTION
+
+        // Work around CFI branching limitations.
+        .cfi_remember_state
+        .cfi_adjust_cfa_offset 8
+        .cfi_rel_offset rT, 0
+        .cfi_rel_offset lr, 4
+
+        // Examine the upper three bits [31:29] for underflow.
+        lsrs    r3,     r1,     #29
+        beq     LSYM(__fp_underflow)
+
+        // Convert bits [31:29] into an offset in the range of { 0, -1, -2 }.
+        // Right rotation aligns the MSB in bit [31], filling any LSBs with '0'.
+        lsrs    r3,     r1,     #1
+        mvns    r3,     r3
+        ands    r3,     r1
+        lsrs    r3,     #30
+        subs    r3,     #2
+        rors    r1,     r3
+
+        // Update the exponent, assuming the final result will be normal.
+        // The new exponent is 1 less than actual, to compensate for the
+        //  eventual addition of the implicit '1' in the result.
+        // If the final exponent becomes negative, proceed directly to gradual
+        //  underflow, without bothering to search for the MSB.
+        adds    r2,     r3
+
+CM0_FUNC_START fp_assemble2
+        bmi     LSYM(__fp_subnormal)
+
+    LSYM(__fp_normal):
+        // Check for overflow (remember the implicit '1' to be added later).
+        cmp     r2,     #254
+        bge     SYM(__fp_overflow)
+
+        // Save LSBs for the remainder. Position doesn't matter any more,
+        //  these are just tiebreakers for round-to-even.
+        lsls    rT,     r1,     #25
+
+        // Align the final result.
+        lsrs    r1,     #8
+
+    LSYM(__fp_round):
+        // If carry bit is '0', always round down.
+        bcc     LSYM(__fp_return)
+
+        // The carry bit is '1'.  Round to nearest, ties to even.
+        // If either the saved remainder bits [6:0], the additional remainder
+        //  bits in $r1, or the final LSB is '1', round up.
+        lsls    r3,     r1,     #31
+        orrs    r3,     rT
+        orrs    r3,     r0
+        beq     LSYM(__fp_return)
+
+        // If rounding up overflows the result to 2.0, the result
+        //  is still correct, up to and including INF.
+        adds    r1,     #1
+
+    LSYM(__fp_return):
+        // Combine the mantissa and the exponent.
+        lsls    r2,     #23
+        adds    r0,     r1,     r2
+
+        // Combine with the saved sign.
+        // End of library call, return to user.
+        add     r0,     ip
+
+  #if defined(FP_EXCEPTIONS) && FP_EXCEPTIONS
+        // TODO: Underflow/inexact reporting IFF remainder
+  #endif
+
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    LSYM(__fp_underflow):
+        // Set up to align the mantissa.
+        movs    r3,     r1
+        bne     LSYM(__fp_underflow2)
+
+        // MSB wasn't in the upper 32 bits, check the remainder.
+        // If the remainder is also zero, the result is +/-0.
+        movs    r3,     r0
+        beq     SYM(__fp_zero)
+
+        eors    r0,     r0
+        subs    r2,     #32
+
+    LSYM(__fp_underflow2):
+        // Save the pre-alignment exponent to align the remainder later.
+        movs    r1,     r2
+
+        // Align the mantissa with the MSB in bit[31].
+        bl      SYM(__fp_lalign2)
+
+        // Calculate the actual remainder shift.
+        subs    rT,     r1,     r2
+
+        // Align the lower bits of the remainder.
+        movs    r1,     r0
+        lsls    r0,     rT
+
+        // Combine the upper bits of the remainder with the aligned value.
+        rsbs    rT,     #0
+        adds    rT,     #32
+        lsrs    r1,     rT
+        adds    r1,     r3
+
+        // The MSB is now aligned at bit[31] of $r1.
+        // If the net exponent is still positive, the result will be normal.
+        // Because this function is used by fmul(), there is a possibility
+        //  that the value is still wider than 24 bits; always round.
+        tst     r2,     r2
+        bpl     LSYM(__fp_normal)
+
+    LSYM(__fp_subnormal):
+        // The MSB is aligned at bit[31], with a net negative exponent.
+        // The mantissa will need to be shifted right by the absolute value of
+        //  the exponent, plus the normal shift of 8.
+
+        // If the negative shift is smaller than -25, there is no result,
+        //  no rounding, no anything.  Return signed zero.
+        // (Otherwise, the shift for result and remainder may wrap.)
+        adds    r2,     #25
+        bmi     SYM(__fp_inexact_zero)
+
+        // Save the extra bits for the remainder.
+        movs    rT,     r1
+        lsls    rT,     r2
+
+        // Shift the mantissa to create a subnormal.
+        // Just like normal, round to nearest, ties to even.
+        movs    r3,     #33
+        subs    r3,     r2
+        eors    r2,     r2
+
+        // This shift must be last, leaving the shifted LSB in the C flag.
+        lsrs    r1,     r3
+        b       LSYM(__fp_round)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fp_assemble2
+CM0_FUNC_END fp_assemble
+
+
+// Recreate INF with the appropriate sign.  No return.
+// Expects the sign of the result in $ip.
+.section .text.libgcc.infinityf,"x"
+CM0_FUNC_START fp_overflow
+    CFI_START_FUNCTION
+
+  #if defined(FP_EXCEPTIONS) && FP_EXCEPTIONS
+        // TODO: inexact/overflow exception
+  #endif
+
+CM0_FUNC_START fp_infinity
+
+        // Work around CFI branching limitations.
+        .cfi_remember_state
+        .cfi_adjust_cfa_offset 8
+        .cfi_rel_offset rT, 0
+        .cfi_rel_offset lr, 4
+
+        movs    r0,     #255
+        lsls    r0,     #23
+        add     r0,     ip
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fp_infinity
+CM0_FUNC_END fp_overflow
+
+
+// Recreate 0 with the appropriate sign.  No return.
+// Expects the sign of the result in $ip.
+.section .text.libgcc.zerof,"x"
+CM0_FUNC_START fp_inexact_zero
+
+  #if defined(FP_EXCEPTIONS) && FP_EXCEPTIONS
+        // TODO: inexact/underflow exception
+  #endif
+
+CM0_FUNC_START fp_zero
+    CFI_START_FUNCTION
+
+        // Work around CFI branching limitations.
+        .cfi_remember_state
+        .cfi_adjust_cfa_offset 8
+        .cfi_rel_offset rT, 0
+        .cfi_rel_offset lr, 4
+
+        // Return 0 with the correct sign.
+        mov     r0,     ip
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fp_zero
+CM0_FUNC_END fp_inexact_zero
+
+
+// Internal function to detect signaling NANs.  No return.
+// Uses $r2 as scratch space.
+.section .text.libgcc.checkf,"x"
+CM0_FUNC_START fp_check_nan2
+    CFI_START_FUNCTION
+
+        // Work around CFI branching limitations.
+        .cfi_remember_state
+        .cfi_adjust_cfa_offset 8
+        .cfi_rel_offset rT, 0
+        .cfi_rel_offset lr, 4
+
+
+CM0_FUNC_START fp_check_nan
+
+        // Check for quiet NAN.
+        lsrs    r2,     r0,     #23
+        bcs     LSYM(__quiet_nan)
+
+        // Raise exception.  Preserves both $r0 and $r1.
+        svc     #(SVC_TRAP_NAN)
+
+        // Quiet the resulting NAN.
+        movs    r2,     #1
+        lsls    r2,     #22
+        orrs    r0,     r2
+
+    LSYM(__quiet_nan):
+        // End of library call, return to user.
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fp_check_nan
+CM0_FUNC_END fp_check_nan2
+
+
+// Internal function to report floating point exceptions.  No return.
+// Expects the original argument(s) in $r0 (possibly also $r1).
+// Expects a code that describes the exception in $r3.
+.section .text.libgcc.exceptf,"x"
+CM0_FUNC_START fp_exception
+    CFI_START_FUNCTION
+
+        // Work around CFI branching limitations.
+        .cfi_remember_state
+        .cfi_adjust_cfa_offset 8
+        .cfi_rel_offset rT, 0
+        .cfi_rel_offset lr, 4
+
+        // Create a quiet NAN.
+        movs    r2,     #255
+        lsls    r2,     #1
+        adds    r2,     #1
+        lsls    r2,     #22
+
+      #if defined(EXCEPTION_CODES) && EXCEPTION_CODES
+        // Annotate the exception type in the NAN field.
+        // Make sure that the exception is in the valid region 
+        lsls    rT,     r3,     #13
+        orrs    r2,     rT
+      #endif
+
+// Exception handler that expects the result already in $r2,
+//  typically when the result is not going to be NAN.
+CM0_FUNC_START fp_exception2
+
+      #if defined(TRAP_EXCEPTIONS) && TRAP_EXCEPTIONS
+        svc     #(SVC_FP_EXCEPTION)
+      #endif
+
+        // TODO: Save exception flags in a static variable.
+
+        // Set up the result, now that the argument isn't required any more.
+        movs    r0,     r2
+
+        // HACK: for sincosf(), with 2 parameters to return.
+        movs    r1,     r2
+
+        // End of library call, return to user.
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fp_exception2
+CM0_FUNC_END fp_exception
+
+#endif /* L_arm_addsubsf3 */
+
diff -ruN gcc-11-20201108-clean/libgcc/config/arm/cm0/idiv.S gcc-11-20201108/libgcc/config/arm/cm0/idiv.S
--- gcc-11-20201108-clean/libgcc/config/arm/cm0/idiv.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201108/libgcc/config/arm/cm0/idiv.S	2020-11-30 15:08:39.336813386 -0800
@@ -0,0 +1,182 @@
+/* div.S: Cortex M0 optimized 32-bit integer division
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#ifdef L_udivsi3
+ 
+// int __aeabi_idiv0(int)
+// Helper function for division by 0.
+.section .text.libgcc.idiv0,"x"
+WEAK aeabi_idiv0
+CM0_FUNC_START aeabi_idiv0
+    CFI_START_FUNCTION
+
+      #if defined(TRAP_EXCEPTIONS) && TRAP_EXCEPTIONS
+        svc     #(SVC_DIVISION_BY_ZERO)
+      #endif
+
+        // Return {0, numerator}.
+        movs    r1,     r0
+        eors    r0,     r0
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_idiv0
+
+
+// int __aeabi_idiv(int, int)
+// idiv_return __aeabi_idivmod(int, int)
+// Returns signed $r0 after division by $r1.
+// Also returns the signed remainder in $r1.
+.section .text.libgcc.idiv,"x"
+CM0_FUNC_START aeabi_idivmod
+FUNC_ALIAS aeabi_idiv aeabi_idivmod
+FUNC_ALIAS divsi3 aeabi_idivmod
+    CFI_START_FUNCTION
+
+        // Extend the sign of the denominator.
+        asrs    r3,     r1,     #31
+
+        // Absolute value of the denominator, abort on division by zero.
+        eors    r1,     r3
+        subs    r1,     r3
+        beq     SYM(__aeabi_idiv0)
+
+        // Absolute value of the numerator.
+        asrs    r2,     r0,     #31
+        eors    r0,     r2
+        subs    r0,     r2
+
+        // Keep the sign of the numerator in bit[31] (for the remainder).
+        // Save the XOR of the signs in bits[15:0] (for the quotient).
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        lsrs    rT,     r3,     #16
+        eors    rT,     r2
+
+        // Handle division as unsigned.
+        bl      SYM(__aeabi_uidivmod_nonzero)
+
+        // Set the sign of the remainder.
+        asrs    r2,     rT,     #31
+        eors    r1,     r2
+        subs    r1,     r2
+
+        // Set the sign of the quotient.
+        sxth    r3,     rT
+        eors    r0,     r3
+        subs    r0,     r3
+
+    LSYM(__idivmod_return):
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END divsi3
+CM0_FUNC_END aeabi_idiv
+CM0_FUNC_END aeabi_idivmod
+
+
+// int __aeabi_uidiv(unsigned int, unsigned int)
+// idiv_return __aeabi_uidivmod(unsigned int, unsigned int)
+// Returns unsigned $r0 after division by $r1.
+// Also returns the remainder in $r1.
+.section .text.libgcc.uidiv,"x"
+CM0_FUNC_START aeabi_uidivmod
+FUNC_ALIAS aeabi_uidiv aeabi_uidivmod
+FUNC_ALIAS udivsi3 aeabi_uidivmod
+    CFI_START_FUNCTION
+
+        // Abort on division by zero.
+        tst     r1,     r1
+        beq     SYM(__aeabi_idiv0)
+
+  #if defined(OPTIMIZE_SPEED) && OPTIMIZE_SPEED
+        // MAYBE: Optimize division by a power of 2
+  #endif
+
+CM0_FUNC_START aeabi_uidivmod_nonzero 
+        // Pre division: Shift the denominator as far as possible left
+        //  without making it larger than the numerator.
+        // The loop is destructive, save a copy of the numerator.
+        mov     ip,     r0
+
+        // Set up binary search.
+        movs    r3,     #16
+        movs    r2,     #1
+
+    LSYM(__uidivmod_align):
+        // Prefer dividing the numerator to multipying the denominator
+        //  (multiplying the denominator may result in overflow).
+        lsrs    r0,     r3
+        cmp     r0,     r1
+        blo     LSYM(__uidivmod_skip)
+
+        // Multiply the denominator and the result together.
+        lsls    r1,     r3
+        lsls    r2,     r3
+
+    LSYM(__uidivmod_skip):
+        // Restore the numerator, and iterate until search goes to 0.
+        mov     r0,     ip
+        lsrs    r3,     #1
+        bne     LSYM(__uidivmod_align)
+
+        // In The result $r3 has been conveniently initialized to 0.
+        b       LSYM(__uidivmod_entry)
+
+    LSYM(__uidivmod_loop):
+        // Scale the denominator and the quotient together.
+        lsrs    r1,     #1
+        lsrs    r2,     #1
+        beq     LSYM(__uidivmod_return)
+
+    LSYM(__uidivmod_entry):
+        // Test if the denominator is smaller than the numerator.
+        cmp     r0,     r1
+        blo     LSYM(__uidivmod_loop)
+
+        // If the denominator is smaller, the next bit of the result is '1'.
+        // If the new remainder goes to 0, exit early.
+        adds    r3,     r2
+        subs    r0,     r1
+        bne     LSYM(__uidivmod_loop)
+
+    LSYM(__uidivmod_return):
+        mov     r1,     r0
+        mov     r0,     r3
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_uidivmod_nonzero
+CM0_FUNC_END udivsi3
+CM0_FUNC_END aeabi_uidiv
+CM0_FUNC_END aeabi_uidivmod
+
+#endif /* L_udivsi3 */
+
diff -ruN gcc-11-20201108-clean/libgcc/config/arm/cm0/lcmp.S gcc-11-20201108/libgcc/config/arm/cm0/lcmp.S
--- gcc-11-20201108-clean/libgcc/config/arm/cm0/lcmp.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201108/libgcc/config/arm/cm0/lcmp.S	2020-11-30 15:08:39.336813386 -0800
@@ -0,0 +1,98 @@
+/* lcmp.S: Cortex M0 optimized 64-bit integer comparison
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+   
+#ifdef L_aeabi_lcmp   
+
+// int __aeabi_lcmp(long long, long long)
+// Compares the 64 bit signed values in $r1:$r0 and $r3:$r2.
+// Returns { -1, 0, +1 } in $r0 for ordering { <, ==, > }, respectively.
+.section .text.libgcc.lcmp,"x"
+CM0_FUNC_START aeabi_lcmp
+    CFI_START_FUNCTION
+
+        // Calculate the difference $r1:$r0 - $r3:$r2.
+        subs    r0,     r2
+        sbcs    r1,     r3
+
+        // With $r2 free, create a reference offset without affecting flags.
+        mov     r2,     r3
+
+        // Finish the comparison.
+        blt     LSYM(__lcmp_lt)
+
+        // The reference offset ($r2 - $r3) will be +2 iff the first
+        //  argument is larger, otherwise the reference offset remains 0.
+        adds    r2,     #2
+
+    LSYM(__lcmp_lt):
+        // Check for equality (all 64 bits).
+        orrs    r0,     r1
+        beq     LSYM(__lcmp_return)
+
+        // Convert the relative offset to an absolute value +/-1.
+        subs    r0,     r2,     r3
+        subs    r0,     #1
+
+    LSYM(__lcmp_return):
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_lcmp
+
+#endif /* L_aeabi_lcmp */
+
+
+#ifdef L_aeabi_ulcmp
+
+// int __aeabi_ulcmp(unsigned long long, unsigned long long)
+// Compares the 64 bit unsigned values in $r1:$r0 and $r3:$r2.
+// Returns { -1, 0, +1 } in $r0 for ordering { <, ==, > }, respectively.
+.section .text.libgcc.ulcmp,"x"
+CM0_FUNC_START aeabi_ulcmp
+    CFI_START_FUNCTION
+
+        // Calculate the 'C' flag.
+        subs    r0,     r2
+        sbcs    r1,     r3
+
+        // $r2 will contain -1 if the first value is smaller,
+        //  0 if the first value is larger or equal.
+        sbcs    r2,     r2
+
+        // Check for equality (all 64 bits).
+        orrs    r0,     r1
+        beq     LSYM(__ulcmp_return)
+
+        // $r0 should contain +1 or -1
+        movs    r0,     #1
+        orrs    r0,     r2
+
+    LSYM(__ulcmp_return):
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_ulcmp
+
+#endif /* L_aeabi_ulcmp */
diff -ruN gcc-11-20201108-clean/libgcc/config/arm/cm0/ldiv.S gcc-11-20201108/libgcc/config/arm/cm0/ldiv.S
--- gcc-11-20201108-clean/libgcc/config/arm/cm0/ldiv.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201108/libgcc/config/arm/cm0/ldiv.S	2020-11-30 15:08:39.336813386 -0800
@@ -0,0 +1,343 @@
+/* ldiv.S: Cortex M0 optimized 64-bit integer division
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#ifdef L_aeabi_uldivmod
+
+// long long __aeabi_ldiv0(long long)
+// Helper function for division by 0.
+.section .text.libgcc.ldiv0,"x"
+WEAK aeabi_ldiv0
+CM0_FUNC_START aeabi_ldiv0
+    CFI_START_FUNCTION
+
+      #if defined(TRAP_EXCEPTIONS) && TRAP_EXCEPTIONS
+        svc     #(SVC_DIVISION_BY_ZERO)
+      #endif
+
+        // Return { 0, numerator } for quotient and remainder.
+        movs    r2,     r0
+        movs    r3,     r1
+        eors    r0,     r0
+        eors    r1,     r1
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_ldiv0
+
+
+// long long __aeabi_ldiv(long long, long long)
+// lldiv_return __aeabi_ldivmod(long long, long long)
+// Returns signed $r1:$r0 after division by $r3:$r2.
+// Also returns the signed remainder in $r3:$r2.
+.section .text.libgcc.ldiv,"x"
+CM0_FUNC_START aeabi_ldivmod
+FUNC_ALIAS aeabi_ldiv aeabi_ldivmod
+FUNC_ALIAS divdi3 aeabi_ldivmod
+    CFI_START_FUNCTION
+
+        // Test the denominator for zero before pushing registers.
+        cmp     r2,     #0
+        bne     LSYM(__ldivmod_valid)
+
+        cmp     r3,     #0
+        beq     SYM(__aeabi_ldiv0)
+
+    LSYM(__ldivmod_valid):
+      #if defined(DOUBLE_ALIGN_STACK) && DOUBLE_ALIGN_STACK
+        push    { rP, rQ, rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 16
+                .cfi_rel_offset rP, 0
+                .cfi_rel_offset rQ, 4
+                .cfi_rel_offset rT, 8
+                .cfi_rel_offset lr, 12
+      #else
+        push    { rP, rQ, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 12
+                .cfi_rel_offset rP, 0
+                .cfi_rel_offset rQ, 4
+                .cfi_rel_offset lr, 8
+      #endif
+
+        // Absolute value of the numerator.
+        asrs    rP,     r1,     #31
+        eors    r0,     rP
+        eors    r1,     rP
+        subs    r0,     rP
+        sbcs    r1,     rP
+
+        // Absolute value of the denominator.
+        asrs    rQ,     r3,     #31
+        eors    r2,     rQ
+        eors    r3,     rQ
+        subs    r2,     rQ
+        sbcs    r3,     rQ
+
+        // Keep the XOR of signs for the quotient.
+        eors    rQ,     rP
+
+        // Handle division as unsigned.
+        bl      LSYM(__internal_uldivmod)
+
+        // Set the sign of the quotient.
+        eors    r0,     rQ
+        eors    r1,     rQ
+        subs    r0,     rQ
+        sbcs    r1,     rQ
+
+        // Set the sign of the remainder.
+        eors    r2,     rP
+        eors    r3,     rP
+        subs    r2,     rP
+        sbcs    r3,     rP
+
+    LSYM(__ldivmod_return):
+      #if defined(DOUBLE_ALIGN_STACK) && DOUBLE_ALIGN_STACK
+        pop     { rP, rQ, rT, pc }
+                .cfi_restore_state
+      #else
+        pop     { rP, rQ, pc }
+                .cfi_restore_state
+      #endif
+
+    CFI_END_FUNCTION
+CM0_FUNC_END divdi3
+CM0_FUNC_END aeabi_ldiv
+CM0_FUNC_END aeabi_ldivmod
+
+
+// unsigned long long __aeabi_uldiv(unsigned long long, unsigned long long)
+// ulldiv_return __aeabi_uldivmod(unsigned long long, unsigned long long)
+// Returns unsigned $r1:$r0 after division by $r3:$r2.
+// Also returns the remainder in $r3:$r2.
+.section .text.libgcc.uldiv,"x"
+CM0_FUNC_START aeabi_uldivmod
+FUNC_ALIAS aeabi_uldiv aeabi_uldivmod
+FUNC_ALIAS udivdi3 aeabi_uldivmod
+    CFI_START_FUNCTION
+
+        // Test the denominator for zero before changing the stack.
+        cmp     r3,     #0
+        bne     LSYM(__internal_uldivmod)
+
+        cmp     r2,     #0
+        beq     SYM(__aeabi_ldiv0)
+
+  #if defined(OPTIMIZE_SPEED) && OPTIMIZE_SPEED
+        // MAYBE: Optimize division by a power of 2
+  #endif
+
+    LSYM(__internal_uldivmod):
+        push    { rP, rQ, rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 16
+                .cfi_rel_offset rP, 0
+                .cfi_rel_offset rQ, 4
+                .cfi_rel_offset rT, 8
+                .cfi_rel_offset lr, 12
+
+        // Set up denominator shift, assuming a single width result.
+        movs    rP,     #32
+
+        // If the upper word of the denominator is 0 ...
+        tst     r3,     r3
+        bne     LSYM(__uldivmod_setup)
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+        // ... and the upper word of the numerator is also 0,
+        //  single width division will be at least twice as fast.
+        tst     r1,     r1
+        beq     LSYM(__uldivmod_small)
+  #endif
+
+        // ... and the lower word of the denominator is less than or equal
+        //     to the upper word of the numerator ...
+        cmp     r1,     r2
+        blo     LSYM(__uldivmod_setup)
+
+        //  ... then the result will be double width, at least 33 bits.
+        // Set up a flag in $rP to seed the shift for the second word.
+        movs    r3,     r2
+        eors    r2,     r2
+        adds    rP,     #64
+
+    LSYM(__uldivmod_setup):
+        // Pre division: Shift the denominator as far as possible left
+        //  without making it larger than the numerator.
+        // Since search is destructive, first save a copy of the numerator.
+        mov     ip,     r0
+        mov     lr,     r1
+
+        // Set up binary search.
+        movs    rQ,     #16
+        eors    rT,     rT
+
+    LSYM(__uldivmod_align):
+        // Maintain a secondary shift $rT = 32 - $rQ, making the overlapping
+        //  shifts between low and high words easier to construct.
+        adds    rT,     rQ
+
+        // Prefer dividing the numerator to multipying the denominator
+        //  (multiplying the denominator may result in overflow).
+        lsrs    r1,     rQ
+
+        // Measure the high bits of denominator against the numerator.
+        cmp     r1,     r3
+        blo     LSYM(__uldivmod_skip)
+        bhi     LSYM(__uldivmod_shift)
+
+        // If the high bits are equal, construct the low bits for checking.
+        mov     r1,     lr
+        lsls    r1,     rT
+
+        lsrs    r0,     rQ
+        orrs    r1,     r0
+
+        cmp     r1,     r2
+        blo     LSYM(__uldivmod_skip)
+
+    LSYM(__uldivmod_shift):
+        // Scale the denominator and the result together.
+        subs    rP,     rQ
+
+        // If the reduced numerator is still larger than or equal to the
+        //  denominator, it is safe to shift the denominator left.
+        movs    r1,     r2
+        lsrs    r1,     rT
+        lsls    r3,     rQ
+
+        lsls    r2,     rQ
+        orrs    r3,     r1
+
+    LSYM(__uldivmod_skip):
+        // Restore the numerator.
+        mov     r0,     ip
+        mov     r1,     lr
+
+        // Iterate until the shift goes to 0.
+        lsrs    rQ,     #1
+        bne     LSYM(__uldivmod_align)
+
+        // Initialize the result (zero).
+        mov     ip,     rQ
+
+        // HACK: Compensate for the first word test.
+        lsls    rP,     #6
+
+    LSYM(__uldivmod_word2):
+        // Is there another word?
+        lsrs    rP,     #6
+        beq     LSYM(__uldivmod_return)
+
+        // Shift the calculated result by 1 word.
+        mov     lr,     ip
+        mov     ip,     rQ
+
+        // Set up the MSB of the next word of the quotient
+        movs    rQ,     #1
+        rors    rQ,     rP
+        b     LSYM(__uldivmod_entry)
+
+    LSYM(__uldivmod_loop):
+        // Divide the denominator by 2.
+        // It could be slightly faster to multiply the numerator,
+        //  but that would require shifting the remainder at the end.
+        lsls    rT,     r3,     #31
+        lsrs    r3,     #1
+        lsrs    r2,     #1
+        adds    r2,     rT
+
+        // Step to the next bit of the result.
+        lsrs    rQ,     #1
+        beq     LSYM(__uldivmod_word2)
+
+    LSYM(__uldivmod_entry):
+        // Test if the denominator is smaller, high byte first.
+        cmp     r1,     r3
+        blo     LSYM(__uldivmod_loop)
+        bhi     LSYM(__uldivmod_quotient)
+
+        cmp     r0,     r2
+        blo     LSYM(__uldivmod_loop)
+
+    LSYM(__uldivmod_quotient):
+        // Smaller denominator: the next bit of the quotient will be set.
+        add     ip,     rQ
+
+        // Subtract the denominator from the remainder.
+        // If the new remainder goes to 0, exit early.
+        subs    r0,     r2
+        sbcs    r1,     r3
+        bne     LSYM(__uldivmod_loop)
+
+        tst     r0,     r0
+        bne     LSYM(__uldivmod_loop)
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+        // Check whether there's still a second word to calculate.
+        lsrs    rP,     #6
+        beq     LSYM(__uldivmod_return)
+
+        // If so, shift the result left by a full word.
+        mov     lr,     ip
+        mov     ip,     r1 // zero
+  #else
+        eors    rQ,     rQ
+        b       LSYM(__uldivmod_word2)
+  #endif
+
+    LSYM(__uldivmod_return):
+        // Move the remainder to the second half of the result.
+        movs    r2,     r0
+        movs    r3,     r1
+
+        // Move the quotient to the first half of the result.
+        mov     r0,     ip
+        mov     r1,     lr
+
+        pop     { rP, rQ, rT, pc }
+                .cfi_restore_state
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+    LSYM(__uldivmod_small):
+        // Arrange arguments for 32-bit division.
+        movs    r1,     r2
+        bl      SYM(__aeabi_uidivmod_nonzero)
+
+        // Extend quotient and remainder to 64 bits, unsigned.
+        movs    r2,     r1
+        eors    r1,     r1
+        eors    r3,     r3
+        pop     { rP, rQ, rT, pc }
+  #endif
+
+    CFI_END_FUNCTION
+CM0_FUNC_END udivdi3
+CM0_FUNC_END aeabi_uldiv
+CM0_FUNC_END aeabi_uldivmod
+
+#endif /* L_aeabi_uldivmod */
+
diff -ruN gcc-11-20201108-clean/libgcc/config/arm/cm0/lmul.S gcc-11-20201108/libgcc/config/arm/cm0/lmul.S
--- gcc-11-20201108-clean/libgcc/config/arm/cm0/lmul.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201108/libgcc/config/arm/cm0/lmul.S	2020-11-30 19:57:33.837468472 -0800
@@ -0,0 +1,321 @@
+/* lmul.S: Cortex M0 optimized 64-bit integer multiplication 
+
+   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#if defined(L_muldi3) || defined(L_umulsidi3)
+
+#ifdef L_muldi3 
+
+// long long __aeabi_lmul(long long, long long)
+// Returns the least significant 64 bits of a 64 bit multiplication.
+// Expects the two multiplicands in $r1:$r0 and $r3:$r2.
+// Returns the product in $r1:$r0 (does not distinguish signed types).
+// Uses $r4 and $r5 as scratch space.
+.section .text.libgcc.lmul,"x"
+CM0_FUNC_START aeabi_lmul
+FUNC_ALIAS muldi3 aeabi_lmul
+    CFI_START_FUNCTION
+
+        // $r1:$r0 = 0xDDDDCCCCBBBBAAAA
+        // $r3:$r2 = 0xZZZZYYYYXXXXWWWW
+
+        // The following operations that only affect the upper 64 bits
+        //  can be safely discarded:
+        //   DDDD * ZZZZ
+        //   DDDD * YYYY
+        //   DDDD * XXXX
+        //   CCCC * ZZZZ
+        //   CCCC * YYYY
+        //   BBBB * ZZZZ
+
+        // MAYBE: Test for multiply by ZERO on implementations with a 32-cycle
+        //  'muls' instruction, and skip over the operation in that case.
+
+        // (0xDDDDCCCC * 0xXXXXWWWW), free $r1
+        muls    r1,     r2
+
+        // (0xZZZZYYYY * 0xBBBBAAAA), free $r3
+        muls    r3,     r0
+        add     r3,     r1
+
+        // Put the parameters in the correct form for umulsidi3().
+        movs    r1,     r2
+        b       LSYM(__mul_remainder)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_lmul
+CM0_FUNC_END muldi3
+
+
+// unsigned long long __umulsidi3(unsigned int, unsigned int)
+// Returns all 64 bits of a 32 bit multiplication.
+// Expects the two multiplicands in $r0 and $r1.
+// Returns the product in $r1:$r0.
+// Uses $r3, $r4 and $ip as scratch space.
+CM0_FUNC_START umulsidi3
+    CFI_START_FUNCTION
+
+#else /* !L_muldi3 */
+
+// Allow a standalone implementation of umulsidi3() to be superceded by a
+//  combined implementation with muldi3().  This allows use of the smaller
+//  unit in programs that do not need muldi3(), while keeping the functions 
+//  linked together in the same section when both are needed.
+.section .text.libgcc.umulsidi3,"x"
+CM0_WEAK_START umulsidi3
+    CFI_START_FUNCTION
+
+#endif /* !L_muldi3 */ 
+
+        // 32x32 multiply with 64 bit result.
+        // Expand the multiply into 4 parts, since muls only returns 32 bits.
+        //         (a16h * b16h / 2^32)
+        //       + (a16h * b16l / 2^48) + (a16l * b16h / 2^48)
+        //       + (a16l * b16l / 2^64)
+
+        // MAYBE: Test for multiply by 0 on implementations with a 32-cycle
+        //  'muls' instruction, and skip over the operation in that case.
+
+        eors    r3,     r3
+
+    LSYM(__mul_remainder):
+        mov     ip,     r3
+
+        // a16h * b16h
+        lsrs    r2,     r0,     #16
+        lsrs    r3,     r1,     #16
+        muls    r2,     r3
+        add     ip,     r2
+
+        // a16l * b16h; save a16h first!
+        lsrs    r2,     r0,     #16
+        uxth    r0,     r0
+        muls    r3,     r0
+
+        // a16l * b16l
+        uxth    r1,     r1
+        muls    r0,     r1
+
+        // a16h * b16l
+        muls    r1,     r2
+
+        // Distribute intermediate results.
+        eors    r2,     r2
+        adds    r1,     r3
+        adcs    r2,     r2
+        lsls    r3,     r1,     #16
+        lsrs    r1,     #16
+        lsls    r2,     #16
+        adds    r0,     r3
+        adcs    r1,     r2
+
+        // Add in the remaining high bits.
+        add     r1,     ip
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END umulsidi3
+
+#endif /* L_muldi3 || L_umulsidi3 */
+
+
+#ifdef L_mulsidi3
+
+// long long mulsidi3(int, int)
+// Returns all 64 bits of a 32 bit signed multiplication.
+// Expects the two multiplicands in $r0 and $r1.
+// Returns the product in $r1:$r0.
+// Uses $r3, $r4 and $rT as scratch space.
+.section .text.libgcc.mulsidi3,"x"
+CM0_FUNC_START mulsidi3
+    CFI_START_FUNCTION
+
+        // Push registers for function call.
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        // Save signs of the arguments.
+        asrs    r3,     r0,     #31
+        asrs    rT,     r1,     #31
+
+        // Absolute value of the arguments.
+        eors    r0,     r3
+        eors    r1,     rT
+        subs    r0,     r3
+        subs    r1,     rT
+
+        // Save sign of the result.
+        eors    rT,     r3
+
+        bl      SYM(__umulsidi3) __PLT__
+
+        // Apply sign of the result.
+        eors    r0,     rT
+        eors    r1,     rT
+        subs    r0,     rT
+        sbcs    r1,     rT
+
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END mulsidi3
+
+#endif /* L_mulsidi3 */
+
+
+#ifdef L_ashldi3
+
+// long long __aeabi_llsl(long long, int)
+// Logical shift left the 64 bit value in $r1:$r0 by the count in $r2.
+// The result is only guaranteed for shifts in the range of '0' to '63'.
+// Uses $r3 as scratch space.
+.section .text.libgcc.llsl,"x"
+CM0_FUNC_START aeabi_llsl
+FUNC_ALIAS ashldi3 aeabi_llsl
+    CFI_START_FUNCTION
+
+        // Save a copy for the remainder.
+        movs    r3,     r0
+
+        // Assume a simple shift.
+        lsls    r0,     r2
+        lsls    r1,     r2
+
+        // Test if the shift distance is larger than 1 word.
+        subs    r2,     #32
+        bhs     LSYM(__llsl_large)
+
+        // The remainder is opposite the main shift, (32 - x) bits.
+        rsbs    r2,     #0
+        lsrs    r3,     r2
+
+        // Cancel any remaining shift.
+        eors    r2,     r2
+
+    LSYM(__llsl_large):
+        // Apply any remaining shift
+        lsls    r3,     r2
+
+        // Merge remainder and result.
+        adds    r1,     r3
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END ashldi3
+CM0_FUNC_END aeabi_llsl
+
+#endif /* L_ashldi3 */
+
+
+#ifdef L_lshrdi3
+
+// long long __aeabi_llsr(long long, int)
+// Logical shift right the 64 bit value in $r1:$r0 by the count in $r2.
+// The result is only guaranteed for shifts in the range of '0' to '63'.
+// Uses $r3 as scratch space.
+.section .text.libgcc.llsr,"x"
+CM0_FUNC_START aeabi_llsr
+FUNC_ALIAS lshrdi3 aeabi_llsr
+    CFI_START_FUNCTION
+
+        // Save a copy for the remainder.
+        movs    r3,     r1
+
+        // Assume a simple shift.
+        lsrs    r0,     r2
+        lsrs    r1,     r2
+
+        // Test if the shift distance is larger than 1 word.
+        subs    r2,     #32
+        bhs     LSYM(__llsr_large)
+
+        // The remainder is opposite the main shift, (32 - x) bits.
+        rsbs    r2,     #0
+        lsls    r3,     r2
+
+        // Cancel any remaining shift.
+        eors    r2,     r2
+
+    LSYM(__llsr_large):
+        // Apply any remaining shift
+        lsrs    r3,     r2
+
+        // Merge remainder and result.
+        adds    r0,     r3
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END lshrdi3
+CM0_FUNC_END aeabi_llsr
+
+#endif /* L_lshrdi3 */
+
+
+#ifdef L_ashrdi3
+
+// long long __aeabi_lasr(long long, int)
+// Arithmetic shift right the 64 bit value in $r1:$r0 by the count in $r2.
+// The result is only guaranteed for shifts in the range of '0' to '63'.
+// Uses $r3 as scratch space.
+.section .text.libgcc.lasr,"x"
+CM0_FUNC_START aeabi_lasr
+FUNC_ALIAS ashrdi3 aeabi_lasr
+    CFI_START_FUNCTION
+
+        // Save a copy for the remainder.
+        movs    r3,     r1
+
+        // Assume a simple shift.
+        lsrs    r0,     r2
+        asrs    r1,     r2
+
+        // Test if the shift distance is larger than 1 word.
+        subs    r2,     #32
+        bhs     LSYM(__lasr_large)
+
+        // The remainder is opposite the main shift, (32 - x) bits.
+        rsbs    r2,     #0
+        lsls    r3,     r2
+
+        // Cancel any remaining shift.
+        eors    r2,     r2
+
+    LSYM(__lasr_large):
+        // Apply any remaining shift
+        asrs    r3,     r2
+
+        // Merge remainder and result.
+        adds    r0,     r3
+        RETx    lr
+
+    CFI_END_FUNCTION
+CM0_FUNC_END ashrdi3
+CM0_FUNC_END aeabi_lasr
+
+#endif /* L_ashrdi3 */
+
diff -ruN gcc-11-20201108-clean/libgcc/config/arm/cm0/parity.S gcc-11-20201108/libgcc/config/arm/cm0/parity.S
--- gcc-11-20201108-clean/libgcc/config/arm/cm0/parity.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201108/libgcc/config/arm/cm0/parity.S	2020-11-30 15:08:39.336813386 -0800
@@ -0,0 +1,102 @@
+/* ctz2.S: Cortex M0 optimized parity functions
+
+   Copyright (C) 2020 Free Software Foundation, Inc.
+   Contributed by Daniel Engel (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#if defined(L_paritysi2) || defined(L_paritydi2)
+.section .text.libgcc.parity,"x"
+   
+#ifdef L_paritydi2
+   
+// int __paritydi2(int)
+// Returns '0' if the number of bits set in $r1:r0 is even, and '1' otherwise.
+// Returns the result in $r0.
+CM0_FUNC_START paritydi2
+    CFI_START_FUNCTION
+    
+        // Combine the upper and lower words, then fall through. 
+        eors    r0,     r1
+            
+            
+// int __paritysi2(int)
+// Returns '0' if the number of bits set in $r0 is even, and '1' otherwise.
+// Returns the result in $r0.
+// Uses $r2 as scratch space.
+CM0_FUNC_START paritysi2
+
+#else /* L_paritysi2 */
+
+// Allow a standalone implementation of ctzsi2() to be superceded by a
+//  combined implementation.  This allows use of the slightly smaller
+//  unit in programs that do not need ctzdi2(). Requires '_ctzsi2' to
+//  appear before '_ctzdi2' in LIB1ASMFUNCS.
+CM0_WEAK_START paritysi2
+    CFI_START_FUNCTION
+
+#endif /* L_paritysi2 */
+    
+  #if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+
+        // Size optimized: 16 bytes, 40 cycles
+        // Speed optimized: 24 bytes, 14 cycles
+        movs    r2,     #16 
+        
+    LSYM(__parity_loop):
+        // Calculate the parity of successively smaller half-words into the MSB.  
+        movs    r1,     r0 
+        lsls    r1,     r2 
+        eors    r0,     r1 
+        lsrs    r2,     #1 
+        bne     LSYM(__parity_loop)
+   
+  #else // !__OPTIMIZE_SIZE__
+        
+        // Unroll the loop.  The 'libgcc' reference C implementation replaces 
+        //  the x2 and the x1 shifts with a constant.  However, since it takes 
+        //  4 cycles to load, index, and mask the constant result, it doesn't 
+        //  cost anything to keep shifting (and saves a few bytes).  
+        lsls    r1,     r0,     #16 
+        eors    r0,     r1 
+        lsls    r1,     r0,     #8 
+        eors    r0,     r1 
+        lsls    r1,     r0,     #4 
+        eors    r0,     r1 
+        lsls    r1,     r0,     #2 
+        eors    r0,     r1 
+        lsls    r1,     r0,     #1 
+        eors    r0,     r1 
+        
+  #endif // !__OPTIMIZE_SIZE__
+  
+        lsrs    r0,     #31 
+        RETx    lr
+        
+    CFI_END_FUNCTION
+CM0_FUNC_END paritysi2
+
+  #ifdef L_paritydi2
+    CM0_FUNC_END paritydi2
+  #endif 
+
+#endif /* L_paritysi2 || L_paritydi2 */
+
diff -ruN gcc-11-20201108-clean/libgcc/config/arm/lib1funcs.S gcc-11-20201108/libgcc/config/arm/lib1funcs.S
--- gcc-11-20201108-clean/libgcc/config/arm/lib1funcs.S	2020-11-08 14:32:11.000000000 -0800
+++ gcc-11-20201108/libgcc/config/arm/lib1funcs.S	2020-11-30 19:08:38.501266154 -0800
@@ -1050,6 +1050,10 @@
 /* ------------------------------------------------------------------------ */
 /*		Start of the Real Functions				    */
 /* ------------------------------------------------------------------------ */
+
+/* Disable these functions for v6m in favor of the versions below */
+#ifndef NOT_ISA_TARGET_32BIT
+
 #ifdef L_udivsi3
 
 #if defined(__prefer_thumb__)
@@ -1507,6 +1511,8 @@
 	cfi_end	LSYM(Lend_div0)
 	FUNC_END div0
 #endif
+
+#endif /* NOT_ISA_TARGET_32BIT */
 	
 #endif /* L_dvmd_lnx */
 #ifdef L_clear_cache
@@ -1583,6 +1589,9 @@
    so for Reg value in (32...63) and (-1...-31) we will get zero (in the
    case of logical shifts) or the sign (for asr).  */
 
+/* Disable these functions for v6m in favor of the versions below */
+#ifndef NOT_ISA_TARGET_32BIT
+
 #ifdef __ARMEB__
 #define al	r1
 #define ah	r0
@@ -1884,6 +1893,8 @@
 #endif
 #endif /* L_clzsi2 */
 
+#endif /* NOT_ISA_TARGET_32BIT */
+
 /* ------------------------------------------------------------------------ */
 /* These next two sections are here despite the fact that they contain Thumb 
    assembler because their presence allows interworked code to be linked even
@@ -2189,5 +2200,58 @@
 #include "bpabi.S"
 #else /* NOT_ISA_TARGET_32BIT */
 #include "bpabi-v6m.S"
+
+
+/* Temp registers. */
+#define rP r4
+#define rQ r5
+#define rS r6
+#define rT r7
+
+.macro CM0_FUNC_START name
+.global SYM(__\name)
+.type SYM(__\name),function
+.thumb_func
+.align 1
+    SYM(__\name):
+.endm
+
+.macro CM0_WEAK_START name 
+.weak SYM(__\name)
+CM0_FUNC_START \name 
+.endm
+
+.macro CM0_FUNC_END name
+.size SYM(__\name), . - SYM(__\name)
+.endm
+
+.macro RETx x
+        bx      \x
+.endm
+
+/* Order files to maximize +/- 2k jump offset of 'b' */
+#include "cm0/fplib.h"
+
+#include "cm0/lmul.S"
+#include "cm0/lcmp.S"
+#include "cm0/idiv.S"
+#include "cm0/ldiv.S"
+
+#include "cm0/ctz2.S"
+#include "cm0/clz2.S"
+#include "cm0/parity.S"
+
+#include "cm0/fcmp.S"
+#include "cm0/fconv.S"
+#include "cm0/fneg.S"
+
+#include "cm0/fadd.S"
+#include "cm0/futil.S"
+#include "cm0/fmul.S"
+#include "cm0/fdiv.S"
+
+#include "cm0/ffloat.S"
+#include "cm0/ffixed.S"
+
 #endif /* NOT_ISA_TARGET_32BIT */
 #endif /* !__symbian__ */
diff -ruN gcc-11-20201108-clean/libgcc/config/arm/t-elf gcc-11-20201108/libgcc/config/arm/t-elf
--- gcc-11-20201108-clean/libgcc/config/arm/t-elf	2020-11-08 14:32:11.000000000 -0800
+++ gcc-11-20201108/libgcc/config/arm/t-elf	2020-11-30 19:59:35.741693290 -0800
@@ -20,13 +20,13 @@
 # in the asm implementation for other CPUs.
 LIB1ASMFUNCS += _udivsi3 _divsi3 _umodsi3 _modsi3 _dvmd_tls _bb_init_func \
 	_call_via_rX _interwork_call_via_rX \
-	_lshrdi3 _ashrdi3 _ashldi3 \
+	_lshrdi3 _ashrdi3 _ashldi3 _mulsidi3 _umulsidi3 _muldi3 \
 	_arm_negdf2 _arm_addsubdf3 _arm_muldivdf3 _arm_cmpdf2 _arm_unorddf2 \
-	_arm_fixdfsi _arm_fixunsdfsi \
+	_arm_fixdfsi _arm_fixunsdfsi _arm_f2h _arm_h2f \
 	_arm_truncdfsf2 _arm_negsf2 _arm_addsubsf3 _arm_muldivsf3 \
 	_arm_cmpsf2 _arm_unordsf2 _arm_fixsfsi _arm_fixunssfsi \
 	_arm_floatdidf _arm_floatdisf _arm_floatundidf _arm_floatundisf \
-	_clzsi2 _clzdi2 _ctzsi2
+	_clzsi2 _clzdi2 _ctzsi2 _paritysi2 _paritydi2 
 
 # Currently there is a bug somewhere in GCC's alias analysis
 # or scheduling code that is breaking _fpmul_parts in fp-bit.c.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2020-12-02  3:32   ` Daniel Engel
@ 2020-12-16 17:15     ` Christophe Lyon
  2021-01-06 11:20       ` [PATCH v3] " Daniel Engel
  0 siblings, 1 reply; 26+ messages in thread
From: Christophe Lyon @ 2020-12-16 17:15 UTC (permalink / raw)
  To: Daniel Engel; +Cc: gcc Patches

On Wed, 2 Dec 2020 at 04:31, Daniel Engel <libgcc@danielengel.com> wrote:
>
> Hi Christophe,
>
> On Thu, Nov 26, 2020, at 1:14 AM, Christophe Lyon wrote:
> > Hi,
> >
> > On Fri, 13 Nov 2020 at 00:03, Daniel Engel <libgcc@danielengel.com> wrote:
> > >
> > > Hi,
> > >
> > > This patch adds an efficient assembly-language implementation of IEEE-
> > > 754 compliant floating point routines for Cortex M0 EABI (v6m, thumb-
> > > 1).  This is the libgcc portion of a larger library originally
> > > described in 2018:
> > >
> > >     https://gcc.gnu.org/legacy-ml/gcc/2018-11/msg00043.html
> > >
> > > Since that time, I've separated the libm functions for submission to
> > > newlib.  The remaining libgcc functions in the attached patch have
> > > the following characteristics:
> > >
> > >     Function(s)                     Size (bytes)        Cycles          Stack   Accuracy
> > >     __clzsi2                        42                  23              0       exact
> > >     __clzsi2 (OPTIMIZE_SIZE)        22                  55              0       exact
> > >     __clzdi2                        8+__clzsi2          4+__clzsi2      0       exact
> > >
> > >     __umulsidi3                     44                  24              0       exact
> > >     __mulsidi3                      30+__umulsidi3      24+__umulsidi3  8       exact
> > >     __muldi3 (__aeabi_lmul)         10+__umulsidi3      6+__umulsidi3   0       exact
> > >     __ashldi3 (__aeabi_llsl)        22                  13              0       exact
> > >     __lshrdi3 (__aeabi_llsr)        22                  13              0       exact
> > >     __ashrdi3 (__aeabi_lasr)        22                  13              0       exact
> > >
> > >     __aeabi_lcmp                    20                   13             0       exact
> > >     __aeabi_ulcmp                   16                  10              0       exact
> > >
> > >     __udivsi3 (__aeabi_uidiv)       56                  72 – 385        0       < 1 lsb
> > >     __divsi3 (__aeabi_idiv)         38+__udivsi3        26+__udivsi3    8       < 1 lsb
> > >     __udivdi3 (__aeabi_uldiv)       164                 103 – 1394      16      < 1 lsb
> > >     __udivdi3 (OPTIMIZE_SIZE)       142                 120 – 1392      16      < 1 lsb
> > >     __divdi3 (__aeabi_ldiv)         54+__udivdi3        36+__udivdi3    32      < 1 lsb
> > >
> > >     __shared_float                  178
> > >     __shared_float (OPTIMIZE_SIZE)  154
> > >
> > >     __addsf3 (__aeabi_fadd)         116+__shared_float  31 – 76         8       <= 0.5 ulp
> > >     __addsf3 (OPTIMIZE_SIZE)        112+__shared_float  74              8       <= 0.5 ulp
> > >     __subsf3 (__aeabi_fsub)         8+__addsf3          6+__addsf3      8       <= 0.5 ulp
> > >     __aeabi_frsub                   8+__addsf3          6+__addsf3      8       <= 0.5 ulp
> > >     __mulsf3 (__aeabi_fmul)         112+__shared_float  73 – 97         8       <= 0.5 ulp
> > >     __mulsf3 (OPTIMIZE_SIZE)        96+__shared_float   93              8       <= 0.5 ulp
> > >     __divsf3 (__aeabi_fdiv)         132+__shared_float  83 – 361        8       <= 0.5 ulp
> > >     __divsf3 (OPTIMIZE_SIZE)        120+__shared_float  263 – 359       8       <= 0.5 ulp
> > >
> > >     __cmpsf2/__lesf2/__ltsf2        72                  33              0       exact
> > >     __eqsf2/__nesf2                 4+__cmpsf2          3+__cmpsf2      0       exact
> > >     __gesf2/__gesf2                 4+__cmpsf2          3+__cmpsf2      0       exact
> > >     __unordsf2 (__aeabi_fcmpun)     4+__cmpsf2          3+__cmpsf2      0       exact
> > >     __aeabi_fcmpeq                  4+__cmpsf2          3+__cmpsf2      0       exact
> > >     __aeabi_fcmpne                  4+__cmpsf2          3+__cmpsf2      0       exact
> > >     __aeabi_fcmplt                  4+__cmpsf2          3+__cmpsf2      0       exact
> > >     __aeabi_fcmple                  4+__cmpsf2          3+__cmpsf2      0       exact
> > >     __aeabi_fcmpge                  4+__cmpsf2          3+__cmpsf2      0       exact
> > >
> > >     __floatundisf (__aeabi_ul2f)    14+__shared_float   40 – 81         8       <= 0.5 ulp
> > >     __floatundisf (OPTIMIZE_SIZE)   14+__shared_float   40 – 237        8       <= 0.5 ulp
> > >     __floatunsisf (__aeabi_ui2f)    0+__floatundisf     1+__floatundisf 8       <= 0.5 ulp
> > >     __floatdisf (__aeabi_l2f)       14+__floatundisf    7+__floatundisf 8       <= 0.5 ulp
> > >     __floatsisf (__aeabi_i2f)       0+__floatdisf       1+__floatdisf   8       <= 0.5 ulp
> > >
> > >     __fixsfdi (__aeabi_f2lz)        74                  27 – 33         0       exact
> > >     __fixunssfdi (__aeabi_f2ulz)    4+__fixsfdi         3+__fixsfdi     0       exact
> > >     __fixsfsi (__aeabi_f2iz)        52                  19              0       exact
> > >     __fixsfsi (OPTIMIZE_SIZE)       4+__fixsfdi         3+__fixsfdi     0       exact
> > >     __fixunssfsi (__aeabi_f2uiz)    4+__fixsfsi         3+__fixsfsi     0       exact
> > >
> > >     __extendsfdf2 (__aeabi_f2d)     42+__shared_float 38             8     exact
> > >     __aeabi_d2f                     56+__shared_float 54 – 58     8     <= 0.5 ulp
> > >     __aeabi_h2f                     34+__shared_float 34             8     exact
> > >     __aeabi_f2h                     84                 23 – 34         0     <= 0.5 ulp
> > >
> > > Copyright assignment is on file with the FSF.
> > >
> > > I've built the gcc-arm-none-eabi cross-compiler using the 20201108
> > > snapshot of GCC plus this patch, and successfully compiled a test
> > > program:
> > >
> > >     extern int main (void)
> > >     {
> > >         volatile int x = 1;
> > >         volatile unsigned long long int y = 10;
> > >         volatile long long int z = x / y; // 64-bit division
> > >
> > >         volatile float a = x; // 32-bit casting
> > >         volatile float b = y; // 64 bit casting
> > >         volatile float c = z / b; // float division
> > >         volatile float d = a + c; // float addition
> > >         volatile float e = c * b; // float multiplication
> > >         volatile float f = d - e - c; // float subtraction
> > >
> > >         if (f != c) // float comparison
> > >             y -= (long long int)d; // float casting
> > >     }
> > >
> > > As one point of comparison, the test program links to 876 bytes of
> > > libgcc code from the patched toolchain, vs 10276 bytes from the
> > > latest released gcc-arm-none-eabi-9-2020-q2 toolchain.    That's a
> > > 90% size reduction.
> >
> > This looks awesome!
> >
> > >
> > > I have extensive test vectors, and have passed these tests on an
> > > STM32F051.  These vectors were derived from UCB [1], Testfloat [2],
> > > and IEEECC754 [3] sources, plus some of my own creation.
> > > Unfortunately, I'm not sure how "make check" should work for a cross
> > > compiler run time library.
> > >
> > > Although I believe this patch can be incorporated as-is, there are
> > > at least two points that might bear discussion:
> > >
> > > * I'm not sure where or how they would be integrated, but I would be
> > >   happy to provide sources for my test vectors.
> > >
> > > * The library is currently built for the ARM v6m architecture only.
> > >   It is likely that some of the other Cortex variants would benefit
> > >   from these routines.  However, I would need some guidance on this
> > >   to proceed without introducing regressions.  I do not currently
> > >   have a test strategy for architectures beyond Cortex M0, and I
> > >   have NOT profiled the existing thumb-2 implementations (ieee754-
> > >   sf.S) for comparison.
> >
> > I tried your patch, and I see many regressions in the GCC testsuite
> > because many tests fail to link with errors like:
> > ld: /gcc/thumb/v6-m/nofp/libgcc.a(_arm_cmpdf2.o): in function
> > `__clzdi2':
> > /libgcc/config/arm/cm0/clz2.S:39: multiple definition of
> > `__clzdi2';/gcc/thumb/v6-m/nofp/libgcc.a(_thumb1_case_sqi.o):/libgcc/config/arm/cm0/clz2.S:39:
> > first defined here
> >
> > This happens with a toolchain configured with --target arm-none-eabi,
> > default cpu/fpu/mode,
> > --enable-multilib --with-multilib-list=rmprofile and running the tests with
> > -mthumb/-mcpu=cortex-m0/-mfloat-abi=soft/-march=armv6s-m
> >
> > Does it work for you?
>
> Thanks for the feedback.
>
> I'm afraid I'm quite ignorant as to the gcc test suite infrastructure,
> so I don't know how to use the options you've shared above.  I'm cross-
> compiling the Windows toolchain on Ubuntu.  Would you mind sharing a
> full command line you would use for testing?  The toolchain is built
> with the default options, which includes "--target arm-none-eabi".
>

Why put Windows in the picture? This seems unnecessarily complicated...
I suggest you build your cross-toolchain on x86_64 ubuntu and run it
on x86_64 ubuntu (of course targetting arm)

The above options where GCC configure options, except for the last
one which I used when running the tests.

There is some documentation about how to run the GCC testsuite there:
https://gcc.gnu.org/install/test.html

Basically 'make check' should mostly work except for execution tests
for which you'll need to teach DejaGnu how to run the generated programs
on a real board or on a simulator.

I didn't analyze your patch, I just submitted it to my validation system:
https://people.linaro.org/~christophe.lyon/cross-validation/gcc-test-patches/r11-5993-g159b0bd9ce263dfb791eff5133b0ca0207201c84-cortex-m0-fplib-20201130.patch2/report-build-info.html
- the red "regressed" items indicate regressions in the testsuite. You
can click on "log" to download the corresponding gcc.log
- the dark-red "build broken" items indicate that the toolchain build failed
- the orange "interrupted" items indicate an infrastructure problem,
so you can ignore such cases
- similarly the dark red "ref build failed" indicate that the
reference build failed for some infrastructure reason

for the arm-none-eabi target, several toolchain versions fail to
build, some succeed.
This is because I use different multilib configuration flags, it looks like the
ones involving --with-multilib=rmprofile are broken with your patch.

These ones should be reasonably easy to fix: no 'make check' involved.

For instance if you configure GCC with:
--target arm-none-eabi --enable-multilib --with-multilib-list=rmprofile
you should see the build failure.

HTH

Christophe

> I did see similar errors once before.  It turned out then that I omitted
> one of the ".S" files from the build.  My interpretation at that point
> was that gcc had been searching multiple versions of "libgcc.a" and
> unable to merge the symbols.  In hindsight, that was a really bad
> interpretation.   I was able to reproduce the error above by simply
> adding a line like "volatile double m = 1.0; m += 2;".
>
> After reviewing the existing asm implementations more closely, I
> believe that I have not been using the function guard macros (L_arm_*)
> as intended.  The make script appears to compile "lib1funcs.S" dozens of
> times -- once for each function guard macro listed in LIB1ASMFUNCS --
> with the intent of generating a separate ".o" file for each function.
> Because they were unguarded, my new library functions were duplicated
> into every ".o" file, which caused the link errors you saw.
>
> I have attached an updated patch that implements the macros.
>
> However, I'm not sure whether my usage is really consistent with the
> spirit of the make script.  If there's a README or HOWTO, I haven't
> found it yet.  The following points summarize my concerns as I was
> making these updates:
>
> 1.  While some of the new functions (e.g. __cmpsf2) are standalone,
>     there is a common core in the new library shared by several related
>     functions.  That keeps the library small.  For now, I've elected to
>     group all of these related functions together in a single object
>     file "_arm_addsubsf3.o" to protect the short branches (+/-2KB)
>     within this unit.  Notice that I manually assigned section names in
>     the code, so there still shouldn't be any unnecessary code linked in
>     the final build.  Does the multiple-".o" files strategy predate "-gc-
>     sections", or should I be trying harder to break these related
>     functions into separate compilation units?
>
> 2.  I introduced a few new macro keywords for functions/groups (e.g.
>     "_arm_f2h" and '_arm_f2h'.  My assumption is that some empty ".o"
>     files compiled for the non-v6m architectures will be benign.
>
> 3.  The "t-elf" make script implies that __mulsf3() should not be
>     compiled in thumb mode (it's inside a conditional), but this is one
>     of the new functions.  Moot for now, since my __mulsf3() is grouped
>     with the common core functions (see point 1) and is thus currently
>     guarded by the "_arm_addsubsf3.o" macro.
>
> 4.  The advice (in "ieee754-sf.S") regarding WEAK symbols does not seem
>     to be working.  I have defined __clzsi2() as a weak symbol to be
>     overridden by the combined function __clzdi2().  I can also see
>     (with "nm") that "clzsi2.o" is compiled before "clzdi2.o" in
>     "libgcc.a".  Yet, the full __clzdi2() function (8 bytes larger) is
>     always linked, even in programs that only call __clzsi2(),  A minor
>     annoyance at this point.
>
> 5.  Is there a permutation of the makefile that compiles libgcc with
>     __OPTIMIZE_SIZE__?  There are a few sections in the patch that can
>     optimize either way, yet the final product only seems to have the
>     "fast" code.  At this optimization level, the sample program above
>     pulls in 1012 bytes of library code instead of 836. Perhaps this is
>     meant to be controlled by the toolchain configuration step, but it
>     doesn't follow that the optimization for the cross-compiler would
>     automatically translate to the target runtime libraries.
>
> Thanks again,
> Daniel
>
> >
> > Thanks,
> >
> > Christophe
> >
> > >
> > > I'm naturally hoping for some action on this patch before the Nov 16th deadline for GCC-11 stage 3.  Please review and advise.
> > >
> > > Thanks,
> > > Daniel Engel
> > >
> > > [1] http://www.netlib.org/fp/ucbtest.tgz
> > > [2] http://www.jhauser.us/arithmetic/TestFloat.html
> > > [3] http://win-www.uia.ac.be/u/cant/ieeecc754.html
> >

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2020-12-16 17:15     ` Christophe Lyon
@ 2021-01-06 11:20       ` Daniel Engel
  2021-01-06 17:05         ` Richard Earnshaw
  0 siblings, 1 reply; 26+ messages in thread
From: Daniel Engel @ 2021-01-06 11:20 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: gcc Patches

[-- Attachment #1: Type: text/plain, Size: 17488 bytes --]

Hi Christophe, 

On Wed, Dec 16, 2020, at 9:15 AM, Christophe Lyon wrote:
> On Wed, 2 Dec 2020 at 04:31, Daniel Engel <libgcc@danielengel.com> wrote:
> >
> > Hi Christophe,
> >
> > On Thu, Nov 26, 2020, at 1:14 AM, Christophe Lyon wrote:
> > > Hi,
> > >
> > > On Fri, 13 Nov 2020 at 00:03, Daniel Engel <libgcc@danielengel.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > This patch adds an efficient assembly-language implementation of IEEE-
> > > > 754 compliant floating point routines for Cortex M0 EABI (v6m, thumb-
> > > > 1).  This is the libgcc portion of a larger library originally
> > > > described in 2018:
> > > >
> > > >     https://gcc.gnu.org/legacy-ml/gcc/2018-11/msg00043.html
> > > >
> > > > Since that time, I've separated the libm functions for submission to
> > > > newlib.  The remaining libgcc functions in the attached patch have
> > > > the following characteristics:
> > > >
> > > >     Function(s)                     Size (bytes)        Cycles          Stack   Accuracy
> > > >     __clzsi2                        42                  23              0       exact
> > > >     __clzsi2 (OPTIMIZE_SIZE)        22                  55              0       exact
> > > >     __clzdi2                        8+__clzsi2          4+__clzsi2      0       exact
> > > >
> > > >     __umulsidi3                     44                  24              0       exact
> > > >     __mulsidi3                      30+__umulsidi3      24+__umulsidi3  8       exact
> > > >     __muldi3 (__aeabi_lmul)         10+__umulsidi3      6+__umulsidi3   0       exact
> > > >     __ashldi3 (__aeabi_llsl)        22                  13              0       exact
> > > >     __lshrdi3 (__aeabi_llsr)        22                  13              0       exact
> > > >     __ashrdi3 (__aeabi_lasr)        22                  13              0       exact
> > > >
> > > >     __aeabi_lcmp                    20                   13             0       exact
> > > >     __aeabi_ulcmp                   16                  10              0       exact
> > > >
> > > >     __udivsi3 (__aeabi_uidiv)       56                  72 – 385        0       < 1 lsb
> > > >     __divsi3 (__aeabi_idiv)         38+__udivsi3        26+__udivsi3    8       < 1 lsb
> > > >     __udivdi3 (__aeabi_uldiv)       164                 103 – 1394      16      < 1 lsb
> > > >     __udivdi3 (OPTIMIZE_SIZE)       142                 120 – 1392      16      < 1 lsb
> > > >     __divdi3 (__aeabi_ldiv)         54+__udivdi3        36+__udivdi3    32      < 1 lsb
> > > >
> > > >     __shared_float                  178
> > > >     __shared_float (OPTIMIZE_SIZE)  154
> > > >
> > > >     __addsf3 (__aeabi_fadd)         116+__shared_float  31 – 76         8       <= 0.5 ulp
> > > >     __addsf3 (OPTIMIZE_SIZE)        112+__shared_float  74              8       <= 0.5 ulp
> > > >     __subsf3 (__aeabi_fsub)         8+__addsf3          6+__addsf3      8       <= 0.5 ulp
> > > >     __aeabi_frsub                   8+__addsf3          6+__addsf3      8       <= 0.5 ulp
> > > >     __mulsf3 (__aeabi_fmul)         112+__shared_float  73 – 97         8       <= 0.5 ulp
> > > >     __mulsf3 (OPTIMIZE_SIZE)        96+__shared_float   93              8       <= 0.5 ulp
> > > >     __divsf3 (__aeabi_fdiv)         132+__shared_float  83 – 361        8       <= 0.5 ulp
> > > >     __divsf3 (OPTIMIZE_SIZE)        120+__shared_float  263 – 359       8       <= 0.5 ulp
> > > >
> > > >     __cmpsf2/__lesf2/__ltsf2        72                  33              0       exact
> > > >     __eqsf2/__nesf2                 4+__cmpsf2          3+__cmpsf2      0       exact
> > > >     __gesf2/__gesf2                 4+__cmpsf2          3+__cmpsf2      0       exact
> > > >     __unordsf2 (__aeabi_fcmpun)     4+__cmpsf2          3+__cmpsf2      0       exact
> > > >     __aeabi_fcmpeq                  4+__cmpsf2          3+__cmpsf2      0       exact
> > > >     __aeabi_fcmpne                  4+__cmpsf2          3+__cmpsf2      0       exact
> > > >     __aeabi_fcmplt                  4+__cmpsf2          3+__cmpsf2      0       exact
> > > >     __aeabi_fcmple                  4+__cmpsf2          3+__cmpsf2      0       exact
> > > >     __aeabi_fcmpge                  4+__cmpsf2          3+__cmpsf2      0       exact
> > > >
> > > >     __floatundisf (__aeabi_ul2f)    14+__shared_float   40 – 81         8       <= 0.5 ulp
> > > >     __floatundisf (OPTIMIZE_SIZE)   14+__shared_float   40 – 237        8       <= 0.5 ulp
> > > >     __floatunsisf (__aeabi_ui2f)    0+__floatundisf     1+__floatundisf 8       <= 0.5 ulp
> > > >     __floatdisf (__aeabi_l2f)       14+__floatundisf    7+__floatundisf 8       <= 0.5 ulp
> > > >     __floatsisf (__aeabi_i2f)       0+__floatdisf       1+__floatdisf   8       <= 0.5 ulp
> > > >
> > > >     __fixsfdi (__aeabi_f2lz)        74                  27 – 33         0       exact
> > > >     __fixunssfdi (__aeabi_f2ulz)    4+__fixsfdi         3+__fixsfdi     0       exact
> > > >     __fixsfsi (__aeabi_f2iz)        52                  19              0       exact
> > > >     __fixsfsi (OPTIMIZE_SIZE)       4+__fixsfdi         3+__fixsfdi     0       exact
> > > >     __fixunssfsi (__aeabi_f2uiz)    4+__fixsfsi         3+__fixsfsi     0       exact
> > > >
> > > >     __extendsfdf2 (__aeabi_f2d)     42+__shared_float 38             8     exact
> > > >     __aeabi_d2f                     56+__shared_float 54 – 58     8     <= 0.5 ulp
> > > >     __aeabi_h2f                     34+__shared_float 34             8     exact
> > > >     __aeabi_f2h                     84                 23 – 34         0     <= 0.5 ulp
> > > >
> > > > Copyright assignment is on file with the FSF.
> > > >
> > > > I've built the gcc-arm-none-eabi cross-compiler using the 20201108
> > > > snapshot of GCC plus this patch, and successfully compiled a test
> > > > program:
> > > >
> > > >     extern int main (void)
> > > >     {
> > > >         volatile int x = 1;
> > > >         volatile unsigned long long int y = 10;
> > > >         volatile long long int z = x / y; // 64-bit division
> > > >
> > > >         volatile float a = x; // 32-bit casting
> > > >         volatile float b = y; // 64 bit casting
> > > >         volatile float c = z / b; // float division
> > > >         volatile float d = a + c; // float addition
> > > >         volatile float e = c * b; // float multiplication
> > > >         volatile float f = d - e - c; // float subtraction
> > > >
> > > >         if (f != c) // float comparison
> > > >             y -= (long long int)d; // float casting
> > > >     }
> > > >
> > > > As one point of comparison, the test program links to 876 bytes of
> > > > libgcc code from the patched toolchain, vs 10276 bytes from the
> > > > latest released gcc-arm-none-eabi-9-2020-q2 toolchain.    That's a
> > > > 90% size reduction.
> > >
> > > This looks awesome!
> > >
> > > >
> > > > I have extensive test vectors, and have passed these tests on an
> > > > STM32F051.  These vectors were derived from UCB [1], Testfloat [2],
> > > > and IEEECC754 [3] sources, plus some of my own creation.
> > > > Unfortunately, I'm not sure how "make check" should work for a cross
> > > > compiler run time library.
> > > >
> > > > Although I believe this patch can be incorporated as-is, there are
> > > > at least two points that might bear discussion:
> > > >
> > > > * I'm not sure where or how they would be integrated, but I would be
> > > >   happy to provide sources for my test vectors.
> > > >
> > > > * The library is currently built for the ARM v6m architecture only.
> > > >   It is likely that some of the other Cortex variants would benefit
> > > >   from these routines.  However, I would need some guidance on this
> > > >   to proceed without introducing regressions.  I do not currently
> > > >   have a test strategy for architectures beyond Cortex M0, and I
> > > >   have NOT profiled the existing thumb-2 implementations (ieee754-
> > > >   sf.S) for comparison.
> > >
> > > I tried your patch, and I see many regressions in the GCC testsuite
> > > because many tests fail to link with errors like:
> > > ld: /gcc/thumb/v6-m/nofp/libgcc.a(_arm_cmpdf2.o): in function
> > > `__clzdi2':
> > > /libgcc/config/arm/cm0/clz2.S:39: multiple definition of
> > > `__clzdi2';/gcc/thumb/v6-m/nofp/libgcc.a(_thumb1_case_sqi.o):/libgcc/config/arm/cm0/clz2.S:39:
> > > first defined here
> > >
> > > This happens with a toolchain configured with --target arm-none-eabi,
> > > default cpu/fpu/mode,
> > > --enable-multilib --with-multilib-list=rmprofile and running the tests with
> > > -mthumb/-mcpu=cortex-m0/-mfloat-abi=soft/-march=armv6s-m
> > >
> > > Does it work for you?
> >
> > Thanks for the feedback.
> >
> > I'm afraid I'm quite ignorant as to the gcc test suite
> > infrastructure, so I don't know how to use the options you've shared
> > above.  I'm cross- compiling the Windows toolchain on Ubuntu.  Would
> > you mind sharing a full command line you would use for testing?  The
> > toolchain is built with the default options, which includes "--
> > target arm-none-eabi".
> >
>
> Why put Windows in the picture? This seems unnecessarily
> complicated... I suggest you build your cross-toolchain on x86_64
> ubuntu and run it on x86_64 ubuntu (of course targetting arm)

Mostly because I had not previously committed the time to understand the
GCC regression test environment.  My company and personal computers both
run Windows.  I created an Ubuntu virtual machine for this project, and
I'd been trying to get by with the build scripts provided by the ARM
toolchain.  Clearly that was insufficient.

> The above options where GCC configure options, except for the last one
> which I used when running the tests.
>
> There is some documentation about how to run the GCC testsuite there:
> https://gcc.gnu.org/install/test.html

Thanks.  I was able to take this document, plus some additional pages
about constructing a combined tree with newlib, and put together a
working regression test.  GDB didn't want to build cleanly at first, so
eventually I gave up and disabled that part.

> Basically 'make check' should mostly work except for execution tests
> for which you'll need to teach DejaGnu how to run the generated
> programs on a real board or on a simulator.
>
> I didn't analyze your patch, I just submitted it to my validation
> system:
> https://people.linaro.org/~christophe.lyon/cross-validation/gcc-test-patches/r11-5993-g159b0bd9ce263dfb791eff5133b0ca0207201c84-cortex-m0-fplib-20201130.patch2/report-build-info.html
> - the red "regressed" items indicate regressions in the testsuite. You
>   can click on "log" to download the corresponding gcc.log
> - the dark-red "build broken" items indicate that the toolchain build
>   failed
> - the orange "interrupted" items indicate an infrastructure problem,
>   so you can ignore such cases
> - similarly the dark red "ref build failed" indicate that the
>   reference build failed for some infrastructure reason
>
> for the arm-none-eabi target, several toolchain versions fail to
> build, some succeed. This is because I use different multilib
> configuration flags, it looks like the ones involving --with-
> multilib=rmprofile are broken with your patch.
>
> These ones should be reasonably easy to fix: no 'make check' involved.
> 
> For instance if you configure GCC with:
> --target arm-none-eabi --enable-multilib --with-multilib-list=rmprofile
> you should see the build failure.

So far, I have not found a cause for the build failures you are seeing.
The ARM toolchain script I was using before did build with the
'rmprofile' option.  With my current configure options, gcc builds
'rmprofile', 'aprofile', and even 'armeb'.  I did find a number of link
issues with 'make check' due to incorrect usage of the 'L_'  defines in
LIB1ASMFUNCS.  These are fixed in the new version attached.

Returning to the build failures you logged, I do consistently see this
message in the logs [1]: "fatal error: cm0/fplib.h: No such file or
directory".  I recognize the file, since it's one of the new files in
my patch (the full sub-directory is libgcc/config/arm/cm0/fplib.h).
Do I have to format patches in some different way so that new files
get created?

Regression testing also showed that the previous patch was failing the
"arm/divzero" test because I wasn't providing the same arguments to
div0() as the existing implementation.  Having made that change, I think
the patch is clean.  (I don't think there is a strict specification for
div0(), and the changes add a non-trivial number of instructions, but
I'll hold that discussion for another time).

Do you have time to re-check this patch on your build system?

Thanks,
Daniel

[1] Line 36054: <https://people.linaro.org/~christophe.lyon/cross-validation/gcc-test-patches/r11-5993-g159b0bd9ce263dfb791eff5133b0ca0207201c84-cortex-m0-fplib-20201130.patch2/arm-none-eabi/build-rh70-arm-none-eabi-default-default-default-mthumb.-mcpu=cortex-m0.-mfloat-abi=soft.-march=armv6s-m.log.xz>

> 
> HTH
> 
> Christophe
> 
> > I did see similar errors once before.  It turned out then that I omitted
> > one of the ".S" files from the build.  My interpretation at that point
> > was that gcc had been searching multiple versions of "libgcc.a" and
> > unable to merge the symbols.  In hindsight, that was a really bad
> > interpretation.   I was able to reproduce the error above by simply
> > adding a line like "volatile double m = 1.0; m += 2;".
> >
> > After reviewing the existing asm implementations more closely, I
> > believe that I have not been using the function guard macros (L_arm_*)
> > as intended.  The make script appears to compile "lib1funcs.S" dozens of
> > times -- once for each function guard macro listed in LIB1ASMFUNCS --
> > with the intent of generating a separate ".o" file for each function.
> > Because they were unguarded, my new library functions were duplicated
> > into every ".o" file, which caused the link errors you saw.
> >
> > I have attached an updated patch that implements the macros.
> >
> > However, I'm not sure whether my usage is really consistent with the
> > spirit of the make script.  If there's a README or HOWTO, I haven't
> > found it yet.  The following points summarize my concerns as I was
> > making these updates:
> >
> > 1.  While some of the new functions (e.g. __cmpsf2) are standalone,
> >     there is a common core in the new library shared by several related
> >     functions.  That keeps the library small.  For now, I've elected to
> >     group all of these related functions together in a single object
> >     file "_arm_addsubsf3.o" to protect the short branches (+/-2KB)
> >     within this unit.  Notice that I manually assigned section names in
> >     the code, so there still shouldn't be any unnecessary code linked in
> >     the final build.  Does the multiple-".o" files strategy predate "-gc-
> >     sections", or should I be trying harder to break these related
> >     functions into separate compilation units?
> >
> > 2.  I introduced a few new macro keywords for functions/groups (e.g.
> >     "_arm_f2h" and '_arm_f2h'.  My assumption is that some empty ".o"
> >     files compiled for the non-v6m architectures will be benign.
> >
> > 3.  The "t-elf" make script implies that __mulsf3() should not be
> >     compiled in thumb mode (it's inside a conditional), but this is one
> >     of the new functions.  Moot for now, since my __mulsf3() is grouped
> >     with the common core functions (see point 1) and is thus currently
> >     guarded by the "_arm_addsubsf3.o" macro.
> >
> > 4.  The advice (in "ieee754-sf.S") regarding WEAK symbols does not seem
> >     to be working.  I have defined __clzsi2() as a weak symbol to be
> >     overridden by the combined function __clzdi2().  I can also see
> >     (with "nm") that "clzsi2.o" is compiled before "clzdi2.o" in
> >     "libgcc.a".  Yet, the full __clzdi2() function (8 bytes larger) is
> >     always linked, even in programs that only call __clzsi2(),  A minor
> >     annoyance at this point.
> >
> > 5.  Is there a permutation of the makefile that compiles libgcc with
> >     __OPTIMIZE_SIZE__?  There are a few sections in the patch that can
> >     optimize either way, yet the final product only seems to have the
> >     "fast" code.  At this optimization level, the sample program above
> >     pulls in 1012 bytes of library code instead of 836. Perhaps this is
> >     meant to be controlled by the toolchain configuration step, but it
> >     doesn't follow that the optimization for the cross-compiler would
> >     automatically translate to the target runtime libraries.
> >
> > Thanks again,
> > Daniel
> >
> > >
> > > Thanks,
> > >
> > > Christophe
> > >
> > > >
> > > > I'm naturally hoping for some action on this patch before the Nov 16th deadline for GCC-11 stage 3.  Please review and advise.
> > > >
> > > > Thanks,
> > > > Daniel Engel
> > > >
> > > > [1] http://www.netlib.org/fp/ucbtest.tgz
> > > > [2] http://www.jhauser.us/arithmetic/TestFloat.html
> > > > [3] http://win-www.uia.ac.be/u/cant/ieeecc754.html
> > >
>

[-- Attachment #2: cortex-m0-fplib-20210105.patch --]
[-- Type: application/octet-stream, Size: 195994 bytes --]

diff -ruN gcc-11-20201220-clean/libgcc/config/arm/bpabi.S gcc-11-20201220/libgcc/config/arm/bpabi.S
--- gcc-11-20201220-clean/libgcc/config/arm/bpabi.S	2020-12-20 14:32:15.000000000 -0800
+++ gcc-11-20201220/libgcc/config/arm/bpabi.S	2021-01-06 02:45:47.416262493 -0800
@@ -34,48 +34,6 @@
 	.eabi_attribute 25, 1
 #endif /* __ARM_EABI__ */
 
-#ifdef L_aeabi_lcmp
-
-ARM_FUNC_START aeabi_lcmp
-	cmp	xxh, yyh
-	do_it	lt
-	movlt	r0, #-1
-	do_it	gt
-	movgt	r0, #1
-	do_it	ne
-	RETc(ne)
-	subs	r0, xxl, yyl
-	do_it	lo
-	movlo	r0, #-1
-	do_it	hi
-	movhi	r0, #1
-	RET
-	FUNC_END aeabi_lcmp
-
-#endif /* L_aeabi_lcmp */
-	
-#ifdef L_aeabi_ulcmp
-
-ARM_FUNC_START aeabi_ulcmp
-	cmp	xxh, yyh
-	do_it	lo
-	movlo	r0, #-1
-	do_it	hi
-	movhi	r0, #1
-	do_it	ne
-	RETc(ne)
-	cmp	xxl, yyl
-	do_it	lo
-	movlo	r0, #-1
-	do_it	hi
-	movhi	r0, #1
-	do_it	eq
-	moveq	r0, #0
-	RET
-	FUNC_END aeabi_ulcmp
-
-#endif /* L_aeabi_ulcmp */
-
 .macro test_div_by_zero signed
 /* Tail-call to divide-by-zero handlers which may be overridden by the user,
    so unwinding works properly.  */
diff -ruN gcc-11-20201220-clean/libgcc/config/arm/bpabi-v6m.S gcc-11-20201220/libgcc/config/arm/bpabi-v6m.S
--- gcc-11-20201220-clean/libgcc/config/arm/bpabi-v6m.S	2020-12-20 14:32:15.000000000 -0800
+++ gcc-11-20201220/libgcc/config/arm/bpabi-v6m.S	2021-01-06 02:45:47.428262284 -0800
@@ -33,212 +33,6 @@
 	.eabi_attribute 25, 1
 #endif /* __ARM_EABI__ */
 
-#ifdef L_aeabi_lcmp
-
-FUNC_START aeabi_lcmp
-	cmp	xxh, yyh
-	beq	1f
-	bgt	2f
-	movs	r0, #1
-	negs	r0, r0
-	RET
-2:
-	movs	r0, #1
-	RET
-1:
-	subs	r0, xxl, yyl
-	beq	1f
-	bhi	2f
-	movs	r0, #1
-	negs	r0, r0
-	RET
-2:
-	movs	r0, #1
-1:
-	RET
-	FUNC_END aeabi_lcmp
-
-#endif /* L_aeabi_lcmp */
-	
-#ifdef L_aeabi_ulcmp
-
-FUNC_START aeabi_ulcmp
-	cmp	xxh, yyh
-	bne	1f
-	subs	r0, xxl, yyl
-	beq	2f
-1:
-	bcs	1f
-	movs	r0, #1
-	negs	r0, r0
-	RET
-1:
-	movs	r0, #1
-2:
-	RET
-	FUNC_END aeabi_ulcmp
-
-#endif /* L_aeabi_ulcmp */
-
-.macro test_div_by_zero signed
-	cmp	yyh, #0
-	bne	7f
-	cmp	yyl, #0
-	bne	7f
-	cmp	xxh, #0
-	.ifc	\signed, unsigned
-	bne	2f
-	cmp	xxl, #0
-2:
-	beq	3f
-	movs	xxh, #0
-	mvns	xxh, xxh		@ 0xffffffff
-	movs	xxl, xxh
-3:
-	.else
-	blt	6f
-	bgt	4f
-	cmp	xxl, #0
-	beq	5f
-4:	movs	xxl, #0
-	mvns	xxl, xxl		@ 0xffffffff
-	lsrs	xxh, xxl, #1		@ 0x7fffffff
-	b	5f
-6:	movs	xxh, #0x80
-	lsls	xxh, xxh, #24		@ 0x80000000
-	movs	xxl, #0
-5:
-	.endif
-	@ tailcalls are tricky on v6-m.
-	push	{r0, r1, r2}
-	ldr	r0, 1f
-	adr	r1, 1f
-	adds	r0, r1
-	str	r0, [sp, #8]
-	@ We know we are not on armv4t, so pop pc is safe.
-	pop	{r0, r1, pc}
-	.align	2
-1:
-	.word	__aeabi_ldiv0 - 1b
-7:
-.endm
-
-#ifdef L_aeabi_ldivmod
-
-FUNC_START aeabi_ldivmod
-	test_div_by_zero signed
-
-	push	{r0, r1}
-	mov	r0, sp
-	push	{r0, lr}
-	ldr	r0, [sp, #8]
-	bl	SYM(__gnu_ldivmod_helper)
-	ldr	r3, [sp, #4]
-	mov	lr, r3
-	add	sp, sp, #8
-	pop	{r2, r3}
-	RET
-	FUNC_END aeabi_ldivmod
-
-#endif /* L_aeabi_ldivmod */
-
-#ifdef L_aeabi_uldivmod
-
-FUNC_START aeabi_uldivmod
-	test_div_by_zero unsigned
-
-	push	{r0, r1}
-	mov	r0, sp
-	push	{r0, lr}
-	ldr	r0, [sp, #8]
-	bl	SYM(__udivmoddi4)
-	ldr	r3, [sp, #4]
-	mov	lr, r3
-	add	sp, sp, #8
-	pop	{r2, r3}
-	RET
-	FUNC_END aeabi_uldivmod
-	
-#endif /* L_aeabi_uldivmod */
-
-#ifdef L_arm_addsubsf3
-
-FUNC_START aeabi_frsub
-
-      push	{r4, lr}
-      movs	r4, #1
-      lsls	r4, #31
-      eors	r0, r0, r4
-      bl	__aeabi_fadd
-      pop	{r4, pc}
-
-      FUNC_END aeabi_frsub
-
-#endif /* L_arm_addsubsf3 */
-
-#ifdef L_arm_cmpsf2
-
-FUNC_START aeabi_cfrcmple
-
-	mov	ip, r0
-	movs	r0, r1
-	mov	r1, ip
-	b	6f
-
-FUNC_START aeabi_cfcmpeq
-FUNC_ALIAS aeabi_cfcmple aeabi_cfcmpeq
-
-	@ The status-returning routines are required to preserve all
-	@ registers except ip, lr, and cpsr.
-6:	push	{r0, r1, r2, r3, r4, lr}
-	bl	__lesf2
-	@ Set the Z flag correctly, and the C flag unconditionally.
-	cmp	r0, #0
-	@ Clear the C flag if the return value was -1, indicating
-	@ that the first operand was smaller than the second.
-	bmi	1f
-	movs	r1, #0
-	cmn	r0, r1
-1:
-	pop	{r0, r1, r2, r3, r4, pc}
-
-	FUNC_END aeabi_cfcmple
-	FUNC_END aeabi_cfcmpeq
-	FUNC_END aeabi_cfrcmple
-
-FUNC_START	aeabi_fcmpeq
-
-	push	{r4, lr}
-	bl	__eqsf2
-	negs	r0, r0
-	adds	r0, r0, #1
-	pop	{r4, pc}
-
-	FUNC_END aeabi_fcmpeq
-
-.macro COMPARISON cond, helper, mode=sf2
-FUNC_START	aeabi_fcmp\cond
-
-	push	{r4, lr}
-	bl	__\helper\mode
-	cmp	r0, #0
-	b\cond	1f
-	movs	r0, #0
-	pop	{r4, pc}
-1:
-	movs	r0, #1
-	pop	{r4, pc}
-
-	FUNC_END aeabi_fcmp\cond
-.endm
-
-COMPARISON lt, le
-COMPARISON le, le
-COMPARISON gt, ge
-COMPARISON ge, ge
-
-#endif /* L_arm_cmpsf2 */
-
 #ifdef L_arm_addsubdf3
 
 FUNC_START aeabi_drsub
diff -ruN gcc-11-20201220-clean/libgcc/config/arm/cm0/clz2.S gcc-11-20201220/libgcc/config/arm/cm0/clz2.S
--- gcc-11-20201220-clean/libgcc/config/arm/cm0/clz2.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201220/libgcc/config/arm/cm0/clz2.S	2021-01-06 02:45:47.428262284 -0800
@@ -0,0 +1,324 @@
+/* clz2.S: Cortex M0 optimized 'clz' functions 
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#if defined(__ARM_FEATURE_CLZ) && __ARM_FEATURE_CLZ
+
+#ifdef L_clzdi2
+
+// int __clzdi2(long long)
+// Counts leading zero bits in $r1:$r0.
+// Returns the result in $r0.
+.section .text.sorted.libgcc.clz2.clzdi2,"x"
+CM0_FUNC_START clzdi2
+    CFI_START_FUNCTION
+
+        // Moved here from lib1funcs.S
+        cmp     xxh,    #0
+        do_it   eq,     et
+        clzeq   r0,     xxl
+        clzne   r0,     xxh
+        addeq   r0,     #32
+        RET
+
+    CFI_END_FUNCTION
+CM0_FUNC_END clzdi2
+
+#endif /* L_clzdi2 */
+
+
+#ifdef L_clzsi2
+
+// int __clzsi2(int)
+// Counts leading zero bits in $r0.
+// Returns the result in $r0.
+.section .text.sorted.libgcc.clz2.clzsi2,"x"
+CM0_FUNC_START clzsi2
+    CFI_START_FUNCTION
+
+        // Moved here from lib1funcs.S
+        clz     r0,     r0
+        RET
+
+    CFI_END_FUNCTION
+CM0_FUNC_END clzsi2
+
+#endif /* L_clzsi2 */
+
+#else /* !__ARM_FEATURE_CLZ */
+
+#ifdef L_clzdi2
+
+// int __clzdi2(long long)
+// Counts leading zero bits in $r1:$r0.
+// Returns the result in $r0.
+// Uses $r2 and possibly $r3 as scratch space.
+.section .text.sorted.libgcc.clz2.clzdi2,"x"
+CM0_FUNC_START clzdi2
+    CFI_START_FUNCTION
+
+  #if defined(__ARMEB__) && __ARMEB__
+        // Check if the upper word is zero.
+        cmp     r0,     #0
+
+        // The upper word is non-zero, so calculate __clzsi2(upper).
+        bne     SYM(__clzsi2)
+
+        // The upper word is zero, so calculate 32 + __clzsi2(lower).
+        movs    r2,     #64
+        movs    r0,     r1
+        b       SYM(__internal_clzsi2)
+        
+  #else /* !__ARMEB__ */
+        // Assume all the bits in the argument are zero.
+        movs    r2,     #64
+
+        // Check if the upper word is zero.
+        cmp     r1,     #0
+
+        // The upper word is zero, so calculate 32 + __clzsi2(lower).
+        beq     SYM(__internal_clzsi2)
+
+        // The upper word is non-zero, so set up __clzsi2(upper).
+        // Then fall through.
+        movs    r0,     r1
+        
+  #endif /* !__ARMEB__ */
+
+#endif /* L_clzdi2 */
+
+
+// The bitwise implementation of __clzdi2() tightly couples with __clzsi2(), 
+//  such that instructions must appear consecutively in the same memory 
+//  section for proper flow control.  However, this construction inhibits 
+//  the ability to discard __clzdi2() when only using __clzsi2().
+// Therefore, this block configures __clzsi2() for compilation twice.  
+// The first version is a minimal standalone implementation, and the second
+//  version is the continuation of __clzdi2().  The standalone version must
+//  be declared WEAK, so that the combined version can supersede it and  
+//  provide both symbols when required. 
+// '_clzsi2' should appear before '_clzdi2' in LIB1ASMFUNCS.
+#if defined(L_clzsi2) || defined(L_clzdi2)
+
+#ifdef L_clzsi2
+// int __clzsi2(int)
+// Counts leading zero bits in $r0.
+// Returns the result in $r0.
+// Uses $r2 and possibly $r3 as scratch space.
+.section .text.sorted.libgcc.clz2.clzsi2,"x"
+CM0_WEAK_START clzsi2
+    CFI_START_FUNCTION
+
+#else /* L_clzdi2 */
+CM0_FUNC_START clzsi2
+
+#endif
+
+        // Assume all the bits in the argument are zero
+        movs    r2,     #32
+
+#ifdef L_clzsi2
+    CM0_WEAK_START internal_clzsi2
+#else /* L_clzdi2 */
+    CM0_FUNC_START internal_clzsi2
+#endif
+
+        // Size optimized: 22 bytes, 51 cycles 
+        // Speed optimized: 50 bytes, 20 cycles
+
+  #if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+
+        // Binary search starts at half the word width.
+        movs    r3,     #16
+
+    LSYM(__clz_loop):
+        // Test the upper 'n' bits of the operand for ZERO.
+        movs    r1,     r0
+        lsrs    r1,     r3
+        beq     LSYM(__clz_skip)
+
+        // When the test fails, discard the lower bits of the register,
+        //  and deduct the count of discarded bits from the result.
+        movs    r0,     r1
+        subs    r2,     r3
+
+    LSYM(__clz_skip):
+        // Decrease the shift distance for the next test.
+        lsrs    r3,     #1
+        bne     LSYM(__clz_loop)
+
+  #else /* __OPTIMIZE_SIZE__ */
+
+        // Unrolled binary search.
+        lsrs    r1,     r0,     #16
+        beq     LSYM(__clz8)
+        movs    r0,     r1
+        subs    r2,     #16
+
+    LSYM(__clz8):
+        lsrs    r1,     r0,     #8
+        beq     LSYM(__clz4)
+        movs    r0,     r1
+        subs    r2,     #8
+
+    LSYM(__clz4):
+        lsrs    r1,     r0,     #4
+        beq     LSYM(__clz2)
+        movs    r0,     r1
+        subs    r2,     #4
+
+    LSYM(__clz2):
+        // Load the remainder by index
+	adr     r1,     LSYM(__clz_remainder)
+        ldrb    r0,     [r1, r0]
+
+  #endif /* !__OPTIMIZE_SIZE__ */
+
+        // Account for the remainder.
+        subs    r0,     r2,     r0
+        RET
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+        .align 2
+    LSYM(__clz_remainder):
+        .byte 0,1,2,2,3,3,3,3,4,4,4,4,4,4,4,4
+  #endif
+
+    CFI_END_FUNCTION
+CM0_FUNC_END clzsi2
+
+#ifdef L_clzdi2
+CM0_FUNC_END clzdi2
+#endif
+
+#endif /* L_clzsi2 || L_clzdi2 */
+
+#endif /* !__ARM_FEATURE_CLZ */
+
+
+#ifdef L_clrsbdi2
+
+// int __clrsbdi2(int)
+// Counts the number of "redundant sign bits" in $r1:$r0.
+// Returns the result in $r0.
+// Uses $r2 and $r3 as scratch space.
+.section .text.sorted.libgcc.clz2.clrsbdi2,"x"
+CM0_FUNC_START clrsbdi2
+    CFI_START_FUNCTION
+
+  #if defined(__ARM_FEATURE_CLZ) && __ARM_FEATURE_CLZ
+        // Invert negative signs to keep counting zeros.
+        asrs    r3,     xxh,    #31
+        eors    xxl,    r3 
+        eors    xxh,    r3 
+
+        // Same as __clzdi2(), except that the 'C' flag is pre-calculated.
+        // Also, the trailing 'subs', since the last bit is not redundant.  
+        do_it   eq,     et
+        clzeq   r0,     xxl
+        clzne   r0,     xxh
+        addeq   r0,     #32
+        subs    r0,     #1
+        RET
+
+  #else  /* !__ARM_FEATURE_CLZ */ 
+        // Result if all the bits in the argument are zero.
+        // Set it here to keep the flags clean after 'eors' below.  
+        movs    r2,     #31         
+
+        // Invert negative signs to keep counting zeros.
+        asrs    r3,     xxh,    #31
+        eors    xxh,    r3 
+
+    #if defined(__ARMEB__) && __ARMEB__
+        // If the upper word is non-zero, return '__clzsi2(upper) - 1'.
+        bne     SYM(__internal_clzsi2) 
+
+        // The upper word is zero, prepare the lower word.
+        movs    r0,     r1
+        eors    r0,     r3 
+
+    #else /* !__ARMEB__ */
+        // Save the lower word temporarily. 
+        // This somewhat awkward construction adds one cycle when the  
+        //  branch is not taken, but prevents a double-branch.   
+        eors    r3,     r0
+
+        // If the upper word is non-zero, return '__clzsi2(upper) - 1'.
+        movs    r0,     r1
+        bne    SYM(__internal_clzsi2)
+
+        // Restore the lower word. 
+        movs    r0,     r3 
+
+    #endif /* !__ARMEB__ */
+
+        // The upper word is zero, return '31 + __clzsi2(lower)'.
+        adds    r2,     #32
+        b       SYM(__internal_clzsi2)
+
+  #endif /* !__ARM_FEATURE_CLZ */ 
+
+    CFI_END_FUNCTION
+CM0_FUNC_END clrsbdi2
+
+#endif /* L_clrsbdi2 */
+
+
+#ifdef L_clrsbsi2
+
+// int __clrsbsi2(int)
+// Counts the number of "redundant sign bits" in $r0.  
+// Returns the result in $r0.
+// Uses $r2 and possibly $r3 as scratch space.
+.section .text.sorted.libgcc.clz2.clrsbsi2,"x"
+CM0_FUNC_START clrsbsi2 
+    CFI_START_FUNCTION
+
+        // Invert negative signs to keep counting zeros.
+        asrs    r2,     r0,    #31
+        eors    r0,     r2
+
+      #if defined(__ARM_FEATURE_CLZ) && __ARM_FEATURE_CLZ
+        // Count.  
+        clz     r0,     r0
+
+        // The result for a positive value will always be >= 1.  
+        // By definition, the last bit is not redundant. 
+        subs    r0,     #1
+        RET  
+
+      #else /* !__ARM_FEATURE_CLZ */
+        // Result if all the bits in the argument are zero.
+        // By definition, the last bit is not redundant. 
+        movs    r2,     #31
+        b       SYM(__internal_clzsi2)
+
+      #endif  /* !__ARM_FEATURE_CLZ */
+
+    CFI_END_FUNCTION
+CM0_FUNC_END clrsbsi2 
+
+#endif /* L_clrsbsi2 */
+
diff -ruN gcc-11-20201220-clean/libgcc/config/arm/cm0/ctz2.S gcc-11-20201220/libgcc/config/arm/cm0/ctz2.S
--- gcc-11-20201220-clean/libgcc/config/arm/cm0/ctz2.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201220/libgcc/config/arm/cm0/ctz2.S	2021-01-06 02:45:47.428262284 -0800
@@ -0,0 +1,285 @@
+/* ctz2.S: Cortex M0 optimized 'ctz' functions
+
+   Copyright (C) 2020-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+// When the hardware 'clz' function is available, an efficient version 
+//  of __ctzsi2(x) can be created by calculating '31 - __clzsi2(lsb(x))', 
+//  where lsb(x) is 'x' with only the least-significant '1' bit set.  
+// The following offset applies to all of the functions in this file.   
+#if defined(__ARM_FEATURE_CLZ) && __ARM_FEATURE_CLZ
+  #define CTZ_RESULT_OFFSET 1
+#else 
+  #define CTZ_RESULT_OFFSET 0
+#endif 
+
+
+#ifdef L_ctzdi2
+
+// int __ctzdi2(long long)
+// Counts trailing zeros in a 64 bit double word.
+// Expects the argument  in $r1:$r0.
+// Returns the result in $r0.
+// Uses $r2 and possibly $r3 as scratch space.
+.section .text.sorted.libgcc.ctz2.ctzdi2,"x"
+CM0_FUNC_START ctzdi2
+    CFI_START_FUNCTION
+
+      #if defined(__ARMEB__) && __ARMEB__
+        // Assume all the bits in the argument are zero.
+        movs    r2,    #(64 - CTZ_RESULT_OFFSET)
+        
+        // Check if the lower word is zero.
+        cmp     r1,     #0
+        
+        // The lower word is zero, so calculate 32 + __ctzsi2(upper).
+        beq     SYM(__internal_ctzsi2)
+
+        // The lower word is non-zero, so set up __ctzsi2(lower).
+        // Then fall through.
+        movs    r0,     r1
+        
+      #else /* !__ARMEB__ */
+        // Check if the lower word is zero.
+        cmp     r0,     #0
+        
+        // If the lower word is non-zero, result is just __ctzsi2(lower).
+        bne     SYM(__ctzsi2)
+
+        // The lower word is zero, so calculate 32 + __ctzsi2(upper).
+        movs    r2,    #(64 - CTZ_RESULT_OFFSET)
+        movs    r0,     r1
+        b       SYM(__internal_ctzsi2)
+        
+      #endif /* !__ARMEB__ */
+
+#endif /* L_ctzdi2 */
+
+
+// The bitwise implementation of __clzdi2() tightly couples with __ctzsi2(),
+//  such that instructions must appear consecutively in the same memory
+//  section for proper flow control.  However, this construction inhibits
+//  the ability to discard __clzdi2() when only using __ctzsi2().
+// Therefore, this block configures __ctzsi2() for compilation twice.
+// The first version is a minimal standalone implementation, and the second
+//  version is the continuation of __clzdi2().  The standalone version must
+//  be declared WEAK, so that the combined version can supersede it and
+//  provide both symbols when required.
+// '_ctzsi2' should appear before '_clzdi2' in LIB1ASMFUNCS.
+#if defined(L_ctzsi2) || defined(L_ctzdi2)
+
+#ifdef L_ctzsi2
+// int __ctzsi2(int)
+// Counts trailing zeros in a 32 bit word.
+// Expects the argument in $r0.
+// Returns the result in $r0.
+// Uses $r2 and possibly $r3 as scratch space.
+.section .text.sorted.libgcc.ctz2.ctzsi2,"x"
+CM0_WEAK_START ctzsi2
+    CFI_START_FUNCTION
+
+#else /* L_ctzdi2 */
+CM0_FUNC_START ctzsi2
+
+#endif
+
+        // Assume all the bits in the argument are zero
+        movs    r2,     #(32 - CTZ_RESULT_OFFSET)
+
+#ifdef L_ctzsi2
+    CM0_WEAK_START internal_ctzsi2
+#else /* L_ctzdi2 */
+    CM0_FUNC_START internal_ctzsi2
+#endif
+
+  #if defined(__ARM_FEATURE_CLZ) && __ARM_FEATURE_CLZ
+
+        // Find the least-significant '1' bit of the argument. 
+        rsbs    r1,     r0,     #0
+        ands    r1,     r0
+        
+        // Maintain result compatibility with the software implementation.
+        // Technically, __ctzsi2(0) is undefined, but 32 seems better than -1.
+        //  (or possibly 31 if this is an intermediate result for __ctzdi2(0)).   
+        // The carry flag from 'rsbs' gives '-1' iff the argument was 'zero'.  
+        //  (NOTE: 'ands' with 0 shift bits does not change the carry flag.)
+        // After the jump, the final result will be '31 - (-1)'.   
+        sbcs    r0,     r0
+        beq     LSYM(__ctz_zero)
+
+        // Gives the number of '0' bits left of the least-significant '1'.  
+        clz     r0,     r1
+
+  #elif defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+        // Size optimized: 24 bytes, 52 cycles
+        // Speed optimized: 52 bytes, 21 cycles
+
+        // Binary search starts at half the word width.
+        movs    r3,     #16
+
+    LSYM(__ctz_loop):
+        // Test the upper 'n' bits of the operand for ZERO.
+        movs    r1,     r0
+        
+        lsls    r1,     r3
+        beq     LSYM(__ctz_skip)
+
+        // When the test fails, discard the lower bits of the register,
+        //  and deduct the count of discarded bits from the result.
+        movs    r0,     r1
+        subs    r2,     r3
+
+    LSYM(__ctz_skip):
+        // Decrease the shift distance for the next test.
+        lsrs    r3,     #1
+        bne     LSYM(__ctz_loop)
+       
+        // Prepare the remainder.
+        lsrs    r0,     #31
+ 
+  #else /* !__OPTIMIZE_SIZE__ */
+ 
+        // Unrolled binary search.
+        lsls    r1,     r0,     #16
+        beq     LSYM(__ctz8)
+        movs    r0,     r1
+        subs    r2,     #16
+
+    LSYM(__ctz8):
+        lsls    r1,     r0,     #8
+        beq     LSYM(__ctz4)
+        movs    r0,     r1
+        subs    r2,     #8
+
+    LSYM(__ctz4):
+        lsls    r1,     r0,     #4
+        beq     LSYM(__ctz2)
+        movs    r0,     r1
+        subs    r2,     #4
+
+    LSYM(__ctz2):
+        // Load the remainder by index
+        lsrs    r0,     #28 
+        adr     r3,     LSYM(__ctz_remainder)
+        ldrb    r0,     [r3, r0]
+  
+  #endif /* !__OPTIMIZE_SIZE__ */ 
+
+    LSYM(__ctz_zero):
+        // Apply the remainder.
+        subs    r0,     r2,     r0
+        RET
+       
+  #if (!defined(__ARM_FEATURE_CLZ) || !__ARM_FEATURE_CLZ) && \
+      (!defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__)
+        .align 2
+    LSYM(__ctz_remainder):
+        .byte 0,4,3,4,2,4,3,4,1,4,3,4,2,4,3,4
+  #endif  
+ 
+    CFI_END_FUNCTION
+CM0_FUNC_END ctzsi2
+
+#ifdef L_ctzdi2
+CM0_FUNC_END ctzdi2
+#endif
+
+#endif /* L_ctzsi2 || L_ctzdi2 */
+
+ 
+#ifdef L_ffsdi2
+
+// int __ffsdi2(int)
+// Return the index of the least significant 1-bit in $r1:r0, 
+//  or zero if $r1:r0 is zero.  The least significant bit is index 1.
+// Returns the result in $r0.
+// Uses $r2 and possibly $r3 as scratch space.
+// Same section as __ctzsi2() for sake of the tail call branches.
+.section .text.sorted.libgcc.ctz2.ffsdi2,"x"
+CM0_FUNC_START ffsdi2
+    CFI_START_FUNCTION
+       
+        // Simplify branching by assuming a non-zero lower word.  
+        // For all such, ffssi2(x) == ctzsi2(x) + 1.  
+        movs    r2,    #(33 - CTZ_RESULT_OFFSET)
+        
+      #if defined(__ARMEB__) && __ARMEB__
+        // HACK: Save the upper word in a scratch register. 
+        movs    r3,     r0
+      
+        // Test the lower word.
+        movs    r0,     r1
+        bne     SYM(__internal_ctzsi2)
+
+        // Test the upper word.
+        movs    r2,    #(65 - CTZ_RESULT_OFFSET)
+        movs    r0,     r3
+        bne     SYM(__internal_ctzsi2)
+        
+      #else /* !__ARMEB__ */
+        // Test the lower word.
+        cmp     r0,     #0
+        bne     SYM(__internal_ctzsi2)
+
+        // Test the upper word.
+        movs    r2,    #(65 - CTZ_RESULT_OFFSET)
+        movs    r0,     r1
+        bne     SYM(__internal_ctzsi2)
+        
+      #endif /* !__ARMEB__ */
+
+        // Upper and lower words are both zero. 
+        RET
+        
+    CFI_END_FUNCTION
+CM0_FUNC_END ffsdi2
+   
+#endif /* L_ffsdi2 */
+
+
+#ifdef L_ffssi2 
+    
+// int __ffssi2(int)
+// Return the index of the least significant 1-bit in $r0, 
+//  or zero if $r0 is zero.  The least significant bit is index 1.
+// Returns the result in $r0.
+// Uses $r2 and possibly $r3 as scratch space.
+// Same section as __ctzsi2() for sake of the tail call branches.
+.section .text.sorted.libgcc.ctz2.ffssi2,"x"
+CM0_FUNC_START ffssi2
+    CFI_START_FUNCTION
+
+        // Simplify branching by assuming a non-zero argument.  
+        // For all such, ffssi2(x) == ctzsi2(x) + 1.  
+        movs    r2,    #(33 - CTZ_RESULT_OFFSET)
+ 
+        // Test for zero, return unmodified.  
+        cmp     r0,     #0 
+        bne     SYM(__internal_ctzsi2)
+        RET
+ 
+    CFI_END_FUNCTION
+CM0_FUNC_END ffssi2
+
+#endif /* L_ffssi2 */
+
diff -ruN gcc-11-20201220-clean/libgcc/config/arm/cm0/fadd.S gcc-11-20201220/libgcc/config/arm/cm0/fadd.S
--- gcc-11-20201220-clean/libgcc/config/arm/cm0/fadd.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201220/libgcc/config/arm/cm0/fadd.S	2021-01-06 02:45:47.428262284 -0800
@@ -0,0 +1,324 @@
+/* fadd.S: Cortex M0 optimized 32-bit float addition
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef L_arm_frsubsf3
+
+// float __aeabi_frsub(float, float)
+// Returns the floating point difference of $r1 - $r0 in $r0.
+.section .text.sorted.libgcc.fpcore.b.frsub,"x"
+CM0_FUNC_START aeabi_frsub
+    CFI_START_FUNCTION
+
+      #if defined(STRICT_NANS) && STRICT_NANS
+        // Check if $r0 is NAN before modifying.
+        lsls    r2,     r0,     #1
+        movs    r3,     #255
+        lsls    r3,     #24
+
+        // Let fadd() find the NAN in the normal course of operation,
+        //  moving it to $r0 and checking the quiet/signaling bit.
+        cmp     r2,     r3
+        bhi     SYM(__aeabi_fadd)
+      #endif
+
+        // Flip sign and run through fadd().
+        movs    r2,     #1
+        lsls    r2,     #31
+        adds    r0,     r2
+        b       SYM(__aeabi_fadd)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_frsub
+
+#endif /* L_arm_frsubsf3 */
+
+
+#ifdef L_arm_addsubsf3 
+
+// float __aeabi_fsub(float, float)
+// Returns the floating point difference of $r0 - $r1 in $r0.
+.section .text.sorted.libgcc.fpcore.c.faddsub,"x"
+CM0_FUNC_START aeabi_fsub
+CM0_FUNC_ALIAS subsf3 aeabi_fsub
+    CFI_START_FUNCTION
+
+      #if defined(STRICT_NANS) && STRICT_NANS
+        // Check if $r1 is NAN before modifying.
+        lsls    r2,     r1,     #1
+        movs    r3,     #255
+        lsls    r3,     #24
+
+        // Let fadd() find the NAN in the normal course of operation,
+        //  moving it to $r0 and checking the quiet/signaling bit.
+        cmp     r2,     r3
+        bhi     SYM(__aeabi_fadd)
+      #endif
+
+        // Flip sign and fall into fadd().
+        movs    r2,     #1
+        lsls    r2,     #31
+        adds    r1,     r2
+
+#endif /* L_arm_addsubsf3 */
+
+
+// The execution of __subsf3() flows directly into __addsf3(), such that
+//  instructions must appear consecutively in the same memory section.
+//  However, this construction inhibits the ability to discard __subsf3()
+//  when only using __addsf3().
+// Therefore, this block configures __addsf3() for compilation twice.
+// The first version is a minimal standalone implementation, and the second
+//  version is the continuation of __subsf3().  The standalone version must
+//  be declared WEAK, so that the combined version can supersede it and
+//  provide both symbols when required.
+// '_arm_addsf3' should appear before '_arm_addsubsf3' in LIB1ASMFUNCS.
+#if defined(L_arm_addsf3) || defined(L_arm_addsubsf3) 
+
+#ifdef L_arm_addsf3
+// float __aeabi_fadd(float, float)
+// Returns the floating point sum of $r0 + $r1 in $r0.
+.section .text.sorted.libgcc.fpcore.c.fadd,"x"
+CM0_WEAK_START aeabi_fadd
+CM0_WEAK_ALIAS addsf3 aeabi_fadd
+    CFI_START_FUNCTION
+
+#else /* L_arm_addsubsf3 */
+CM0_FUNC_START aeabi_fadd
+CM0_FUNC_ALIAS addsf3 aeabi_fadd
+
+#endif
+ 
+        // Standard registers, compatible with exception handling.
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        // Drop the sign bit to compare absolute value.
+        lsls    r2,     r0,     #1
+        lsls    r3,     r1,     #1
+
+        // Save the logical difference of original values.
+        // This actually makes the following swap slightly faster.
+        eors    r1,     r0
+
+        // Compare exponents+mantissa.
+        // MAYBE: Speedup for equal values?  This would have to separately
+        //  check for NAN/INF and then either:
+        // * Increase the exponent by '1' (for multiply by 2), or
+        // * Return +0
+        cmp     r2,     r3
+        bhs     LSYM(__fadd_ordered)
+
+        // Reorder operands so the larger absolute value is in r2,
+        //  the corresponding original operand is in $r0,
+        //  and the smaller absolute value is in $r3.
+        movs    r3,     r2
+        eors    r0,     r1
+        lsls    r2,     r0,     #1
+
+    LSYM(__fadd_ordered):
+        // Extract the exponent of the larger operand.
+        // If INF/NAN, then it becomes an automatic result.
+        lsrs    r2,     #24
+        cmp     r2,     #255
+        beq     LSYM(__fadd_special)
+
+        // Save the sign of the result.
+        lsrs    rT,     r0,     #31
+        lsls    rT,     #31
+        mov     ip,     rT
+
+        // If the original value of $r1 was to +/-0,
+        //  $r0 becomes the automatic result.
+        // Because $r0 is known to be a finite value, return directly.
+        // It's actually important that +/-0 not go through the normal
+        //  process, to keep "-0 +/- 0"  from being turned into +0.
+        cmp     r3,     #0
+        beq     LSYM(__fadd_zero)
+
+        // Extract the second exponent.
+        lsrs    r3,     #24
+
+        // Calculate the difference of exponents (always positive).
+        subs    r3,     r2,     r3
+
+      #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+        // If the smaller operand is more than 25 bits less significant
+        //  than the larger, the larger operand is an automatic result.
+        // The smaller operand can't affect the result, even after rounding.
+        cmp     r3,     #25
+        bhi     LSYM(__fadd_return)
+      #endif
+
+        // Isolate both mantissas, recovering the smaller.
+        lsls    rT,     r0,     #9
+        lsls    r0,     r1,     #9
+        eors    r0,     rT
+
+        // If the larger operand is normal, restore the implicit '1'.
+        // If subnormal, the second operand will also be subnormal.
+        cmp     r2,     #0
+        beq     LSYM(__fadd_normal)
+        adds    rT,     #1
+        rors    rT,     rT
+
+        // If the smaller operand is also normal, restore the implicit '1'.
+        // If subnormal, the smaller operand effectively remains multiplied
+        //  by 2 w.r.t the first.  This compensates for subnormal exponents,
+        //  which are technically still -126, not -127.
+        cmp     r2,     r3
+        beq     LSYM(__fadd_normal)
+        adds    r0,     #1
+        rors    r0,     r0
+
+    LSYM(__fadd_normal):
+        // Provide a spare bit for overflow.
+        // Normal values will be aligned in bits [30:7]
+        // Subnormal values will be aligned in bits [30:8]
+        lsrs    rT,     #1
+        lsrs    r0,     #1
+
+        // If signs weren't matched, negate the smaller operand (branchless).
+        asrs    r1,     #31
+        eors    r0,     r1
+        subs    r0,     r1
+
+        // Keep a copy of the small mantissa for the remainder.
+        movs    r1,     r0
+
+        // Align the small mantissa for addition.
+        asrs    r1,     r3
+
+        // Isolate the remainder.
+        // NOTE: Given the various cases above, the remainder will only
+        //  be used as a boolean for rounding ties to even.  It is not
+        //  necessary to negate the remainder for subtraction operations.
+        rsbs    r3,     #0
+        adds    r3,     #32
+        lsls    r0,     r3
+
+        // Because operands are ordered, the result will never be negative.
+        // If the result of subtraction is 0, the overall result must be +0.
+        // If the overall result in $r1 is 0, then the remainder in $r0
+        //  must also be 0, so no register copy is necessary on return.
+        adds    r1,     rT
+        beq     LSYM(__fadd_return)
+
+        // The large operand was aligned in bits [29:7]...
+        // If the larger operand was normal, the implicit '1' went in bit [30].
+        //
+        // After addition, the MSB of the result may be in bit:
+        //    31,  if the result overflowed.
+        //    30,  the usual case.
+        //    29,  if there was a subtraction of operands with exponents
+        //          differing by more than 1.
+        //  < 28, if there was a subtraction of operands with exponents +/-1,
+        //  < 28, if both operands were subnormal.
+
+        // In the last case (both subnormal), the alignment shift will be 8,
+        //  the exponent will be 0, and no rounding is necessary.
+        cmp     r2,     #0
+        bne     SYM(__fp_assemble)
+
+        // Subnormal overflow automatically forms the correct exponent.
+        lsrs    r0,     r1,     #8
+        add     r0,     ip
+
+    LSYM(__fadd_return):
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    LSYM(__fadd_special):
+      #if defined(TRAP_NANS) && TRAP_NANS
+        // If $r1 is (also) NAN, force it in place of $r0.
+        // As the smaller NAN, it is more likely to be signaling.
+        movs    rT,     #255
+        lsls    rT,     #24
+        cmp     r3,     rT
+        bls     LSYM(__fadd_ordered2)
+
+        eors    r0,     r1
+      #endif
+
+    LSYM(__fadd_ordered2):
+        // There are several possible cases to consider here:
+        //  1. Any NAN/NAN combination
+        //  2. Any NAN/INF combination
+        //  3. Any NAN/value combination
+        //  4. INF/INF with matching signs
+        //  5. INF/INF with mismatched signs.
+        //  6. Any INF/value combination.
+        // In all cases but the case 5, it is safe to return $r0.
+        // In the special case, a new NAN must be constructed.
+        // First, check the mantissa to see if $r0 is NAN.
+        lsls    r2,     r0,     #9
+
+      #if defined(TRAP_NANS) && TRAP_NANS
+        bne     SYM(__fp_check_nan)
+      #else
+        bne     LSYM(__fadd_return)
+      #endif
+
+    LSYM(__fadd_zero):
+        // Next, check for an INF/value combination.
+        lsls    r2,     r1,     #1
+        bne     LSYM(__fadd_return)
+
+        // Finally, check for matching sign on INF/INF.
+        // Also accepts matching signs when +/-0 are added.
+        bcc     LSYM(__fadd_return)
+
+      #if defined(EXCEPTION_CODES) && EXCEPTION_CODES
+        movs    r3,     #(SUBTRACTED_INFINITY)
+      #endif
+
+      #if defined(TRAP_EXCEPTIONS) && TRAP_EXCEPTIONS
+        // Restore original operands.
+        eors    r1,     r0
+      #endif
+
+        // Identify mismatched 0.
+        lsls    r2,     r0,     #1
+        bne     SYM(__fp_exception)
+
+        // Force mismatched 0 to +0.
+        eors    r0,     r0
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END addsf3
+CM0_FUNC_END aeabi_fadd
+
+#ifdef L_arm_addsubsf3
+CM0_FUNC_END subsf3
+CM0_FUNC_END aeabi_fsub
+#endif
+
+#endif /* L_arm_addsf3 */
+
diff -ruN gcc-11-20201220-clean/libgcc/config/arm/cm0/fcmp.S gcc-11-20201220/libgcc/config/arm/cm0/fcmp.S
--- gcc-11-20201220-clean/libgcc/config/arm/cm0/fcmp.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201220/libgcc/config/arm/cm0/fcmp.S	2021-01-06 02:45:47.428262284 -0800
@@ -0,0 +1,634 @@
+/* fcmp.S: Cortex M0 optimized 32-bit float comparison
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef L_arm_cmpsf2
+
+// int __cmpsf2(float, float)
+// <https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html>
+// Returns the three-way comparison result of $r0 with $r1:
+//  * +1 if ($r0 > $r1), or either argument is NAN
+//  *  0 if ($r0 == $r1)
+//  * -1 if ($r0 < $r1)
+// Uses $r2, $r3, and $ip as scratch space.
+.section .text.sorted.libgcc.fcmp.cmpsf2,"x"
+CM0_FUNC_START cmpsf2
+CM0_FUNC_ALIAS lesf2 cmpsf2
+CM0_FUNC_ALIAS ltsf2 cmpsf2
+    CFI_START_FUNCTION
+
+        // Assumption: The 'libgcc' functions should raise exceptions.
+        movs    r2,     #(FCMP_UN_POSITIVE + FCMP_RAISE_EXCEPTIONS + FCMP_3WAY)
+
+// int,int __internal_cmpsf2(float, float, int)
+// Internal function expects a set of control flags in $r2.
+// If ordered, returns a comparison type { 0, 1, 2 } in $r3
+CM0_FUNC_START internal_cmpsf2
+
+        // When operand signs are considered, the comparison result falls
+        //  within one of the following quadrants:
+        //
+        // $r0  $r1  $r0-$r1* flags  result
+        //  +    +      >      C=0     GT
+        //  +    +      =      Z=1     EQ
+        //  +    +      <      C=1     LT
+        //  +    -      >      C=1     GT
+        //  +    -      =      C=1     GT
+        //  +    -      <      C=1     GT
+        //  -    +      >      C=0     LT
+        //  -    +      =      C=0     LT
+        //  -    +      <      C=0     LT
+        //  -    -      >      C=0     LT
+        //  -    -      =      Z=1     EQ
+        //  -    -      <      C=1     GT
+        //
+        // *When interpeted as a subtraction of unsigned integers
+        //
+        // From the table, it is clear that in the presence of any negative
+        //  operand, the natural result simply needs to be reversed.
+        // Save the 'N' flag for later use.
+        movs    r3,     r0
+        orrs    r3,     r1
+        mov     ip,     r3
+
+        // Keep the absolute value of the second argument for NAN testing.
+        lsls    r3,     r1,     #1
+
+        // With the absolute value of the second argument safely stored,
+        //  recycle $r1 to calculate the difference of the arguments.
+        subs    r1,     r0,     r1
+
+        // Save the 'C' flag for use later.
+        // Effectively shifts all the flags 1 bit left.
+        adcs    r2,     r2
+
+        // Absolute value of the first argument.
+        lsls    r0,     #1
+
+        // Identify the largest absolute value between the two arguments.
+        cmp     r0,     r3
+        bhs     LSYM(__fcmp_sorted)
+
+        // Keep the larger absolute value for NAN testing.
+        // NOTE: When the arguments are respectively a signaling NAN and a
+        //  quiet NAN, the quiet NAN has precedence.  This has consequences
+        //  if TRAP_NANS is enabled, but the flags indicate that exceptions
+        //  for quiet NANs should be suppressed.  After the signaling NAN is
+        //  discarded, no exception is raised, although it should have been.
+        // This could be avoided by using a fifth register to save both
+        //  arguments until the signaling bit can be tested, but that seems
+        //  like an excessive amount of ugly code for an ambiguous case.
+        movs    r0,     r3
+
+    LSYM(__fcmp_sorted):
+        // If $r3 is NAN, the result is unordered.
+        movs    r3,     #255
+        lsls    r3,     #24
+        cmp     r0,     r3
+        bhi     LSYM(__fcmp_unordered)
+
+        // Positive and negative zero must be considered equal.
+        // If the larger absolute value is +/-0, both must have been +/-0.
+        subs    r3,     r0,     #0
+        beq     LSYM(__fcmp_zero)
+
+        // Test for regular equality.
+        subs    r3,     r1,     #0
+        beq     LSYM(__fcmp_zero)
+
+        // Isolate the saved 'C', and invert if either argument was negative.
+        // Remembering that the original subtraction was $r1 - $r0,
+        //  the result will be 1 if 'C' was set (gt), or 0 for not 'C' (lt).
+        lsls    r3,     r2,     #31
+        add     r3,     ip
+        lsrs    r3,     #31
+
+        // HACK: Force the 'C' bit clear, 
+        //  since bit[30] of $r3 may vary with the operands.
+        adds    r3,     #0
+
+    LSYM(__fcmp_zero):
+        // After everything is combined, the temp result will be
+        //  2 (gt), 1 (eq), or 0 (lt).
+        adcs    r3,     r3
+
+        // Short-circuit return if the 3-way comparison flag is set.
+        // Otherwise, shifts the condition mask into bits[2:0].
+        lsrs    r2,     #2
+        bcs     LSYM(__fcmp_return)
+
+        // If the bit corresponding to the comparison result is set in the
+        //  accepance mask, a '1' will fall out into the result.
+        movs    r0,     #1
+        lsrs    r2,     r3
+        ands    r0,     r2
+        RET
+
+    LSYM(__fcmp_unordered):
+        // Set up the requested UNORDERED result.
+        // Remember the shift in the flags (above).
+        lsrs    r2,     #6
+
+  #if defined(TRAP_EXCEPTIONS) && TRAP_EXCEPTIONS
+        // TODO: ... The
+
+
+  #endif
+
+  #if defined(TRAP_NANS) && TRAP_NANS
+        // Always raise an exception if FCMP_RAISE_EXCEPTIONS was specified.
+        bcs     LSYM(__fcmp_trap)
+
+        // If FCMP_NO_EXCEPTIONS was specified, no exceptions on quiet NANs.
+        // The comparison flags are moot, so $r1 can serve as scratch space.
+        lsrs    r1,     r0,     #24
+        bcs     LSYM(__fcmp_return2)
+
+    LSYM(__fcmp_trap):
+        // Restore the NAN (sans sign) for an argument to the exception.
+        // As an IRQ, the handler restores all registers, including $r3.
+        // NOTE: The service handler may not return.
+        lsrs    r0,     #1
+        movs    r3,     #(UNORDERED_COMPARISON)
+        svc     #(SVC_TRAP_NAN)
+  #endif
+
+     LSYM(__fcmp_return2):
+        // HACK: Work around result register mapping.
+        // This could probably be eliminated by remapping the flags register.
+        movs    r3,     r2
+
+    LSYM(__fcmp_return):
+        // Finish setting up the result.
+        // Constant subtraction allows a negative result while keeping the 
+        //  $r2 flag control word within 8 bits, particularly for FCMP_UN*.  
+        // This operation also happens to set the 'Z' and 'C' flags correctly
+        //  per the requirements of __aeabi_cfcmple() et al.
+        subs    r0,     r3,     #1
+        RET
+
+    CFI_END_FUNCTION
+CM0_FUNC_END ltsf2
+CM0_FUNC_END lesf2
+CM0_FUNC_END cmpsf2
+
+#endif /* L_arm_cmpsf2 */ 
+
+
+#ifdef L_arm_eqsf2
+
+// int __eqsf2(float, float)
+// <https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html>
+// Returns the three-way comparison result of $r0 with $r1:
+//  * -1 if ($r0 < $r1)
+//  *  0 if ($r0 == $r1)
+//  * +1 if ($r0 > $r1), or either argument is NAN
+// Uses $r2, $r3, and $ip as scratch space.
+// Same parent section as __cmpsf2() to keep tail call branch within range. 
+.section .text.sorted.libgcc.fcmp.eqsf2,"x"
+CM0_FUNC_START eqsf2
+CM0_FUNC_ALIAS nesf2 eqsf2
+    CFI_START_FUNCTION
+
+        // Assumption: The 'libgcc' functions should raise exceptions.
+        movs    r2,     #(FCMP_UN_POSITIVE + FCMP_NO_EXCEPTIONS + FCMP_3WAY)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END nesf2
+CM0_FUNC_END eqsf2
+
+#endif /* L_arm_eqsf2 */
+
+
+#ifdef L_arm_gesf2
+
+// int __gesf2(float, float)
+// <https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html>
+// Returns the three-way comparison result of $r0 with $r1:
+//  * -1 if ($r0 < $r1), or either argument is NAN
+//  *  0 if ($r0 == $r1)
+//  * +1 if ($r0 > $r1)
+// Uses $r2, $r3, and $ip as scratch space.
+// Same parent section as __cmpsf2() to keep tail call branch within range. 
+.section .text.sorted.libgcc.fcmp.gesf2,"x"
+CM0_FUNC_START gesf2
+CM0_FUNC_ALIAS gtsf2 gesf2
+    CFI_START_FUNCTION
+
+        // Assumption: The 'libgcc' functions should raise exceptions.
+        movs    r2,     #(FCMP_UN_NEGATIVE + FCMP_RAISE_EXCEPTIONS + FCMP_3WAY)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END gtsf2
+CM0_FUNC_END gesf2
+
+#endif /* L_arm_gesf2 */
+
+
+#ifdef L_arm_fcmpeq
+
+// int __aeabi_fcmpeq(float, float)
+// Returns '1' in $r1 if ($r0 == $r1) (ordered).
+// Uses $r2, $r3, and $ip as scratch space.
+// Same parent section as __cmpsf2() to keep tail call branch within range. 
+.section .text.sorted.libgcc.fcmp.fcmpeq,"x"
+CM0_FUNC_START aeabi_fcmpeq
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_EQ)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_fcmpeq
+
+#endif /* L_arm_fcmpeq */
+
+
+#ifdef L_arm_fcmpne
+
+// int __aeabi_fcmpne(float, float) [non-standard]
+// Returns '1' in $r1 if ($r0 != $r1) (ordered).
+// Uses $r2, $r3, and $ip as scratch space.
+// Same parent section as __cmpsf2() to keep tail call branch within range.
+.section .text.sorted.libgcc.fcmp.fcmpne,"x"
+CM0_FUNC_START aeabi_fcmpne
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_NE)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_fcmpne
+
+#endif /* L_arm_fcmpne */
+
+
+#ifdef L_arm_fcmplt
+
+// int __aeabi_fcmplt(float, float)
+// Returns '1' in $r1 if ($r0 < $r1) (ordered).
+// Uses $r2, $r3, and $ip as scratch space.
+// Same parent section as __cmpsf2() to keep tail call branch within range.
+.section .text.sorted.libgcc.fcmp.fcmplt,"x"
+CM0_FUNC_START aeabi_fcmplt
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_RAISE_EXCEPTIONS + FCMP_LT)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_fcmplt
+
+#endif /* L_arm_fcmplt */
+
+
+#ifdef L_arm_fcmple
+
+// int __aeabi_fcmple(float, float)
+// Returns '1' in $r1 if ($r0 <= $r1) (ordered).
+// Uses $r2, $r3, and $ip as scratch space.
+// Same parent section as __cmpsf2() to keep tail call branch within range.
+.section .text.sorted.libgcc.fcmp.fcmple,"x"
+CM0_FUNC_START aeabi_fcmple
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_RAISE_EXCEPTIONS + FCMP_LE)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_fcmple
+
+#endif /* L_arm_fcmple */
+
+
+#ifdef L_arm_fcmpge
+
+// int __aeabi_fcmpge(float, float)
+// Returns '1' in $r1 if ($r0 >= $r1) (ordered).
+// Uses $r2, $r3, and $ip as scratch space.
+// Same parent section as __cmpsf2() to keep tail call branch within range. 
+.section .text.sorted.libgcc.fcmp.fcmpge,"x"
+CM0_FUNC_START aeabi_fcmpge
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_RAISE_EXCEPTIONS + FCMP_GE)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_fcmpge
+
+#endif /* L_arm_fcmpge */
+
+
+#ifdef L_arm_fcmpgt
+
+// int __aeabi_fcmpgt(float, float)
+// Returns '1' in $r1 if ($r0 > $r1) (ordered).
+// Uses $r2, $r3, and $ip as scratch space.
+// Same parent section as __cmpsf2() to keep tail call branch within range. 
+.section .text.sorted.libgcc.fcmp.fcmpgt,"x"
+CM0_FUNC_START aeabi_fcmpgt
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_RAISE_EXCEPTIONS + FCMP_GT)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_fcmpgt
+
+#endif /* L_arm_cmpgt */ 
+
+
+#ifdef L_arm_unordsf2
+
+// int __aeabi_fcmpun(float, float)
+// Returns '1' in $r1 if $r0 and $r1 are unordered.
+// Uses $r2, $r3, and $ip as scratch space.
+// Same parent section as __cmpsf2() to keep tail call branch within range. 
+.section .text.sorted.libgcc.fcmp.unordsf2,"x"
+CM0_FUNC_START aeabi_fcmpun
+CM0_FUNC_ALIAS unordsf2 aeabi_fcmpun
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_POSITIVE + FCMP_NO_EXCEPTIONS)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END unordsf2
+CM0_FUNC_END aeabi_fcmpun
+
+#endif /* L_arm_unordsf2 */
+
+
+#ifdef L_arm_cfrcmple
+
+// void __aeabi_cfrcmple(float, float)
+// Reverse three-way compare of $r1 ? $r1, with result in the status flags:
+//  * 'Z' is set only when the operands are ordered and equal.
+//  * 'C' is clear only when the operands are ordered and $r0 > $r1.
+// Preserves all core registers except $ip, $lr, and the CPSR.
+// Same parent section as __cmpsf2() to keep tail call branch within range.
+.section .text.sorted.libgcc.fcmp.cfrcmple,"x"
+CM0_FUNC_START aeabi_cfrcmple
+    CFI_START_FUNCTION
+
+      #if defined(DOUBLE_ALIGN_STACK) && DOUBLE_ALIGN_STACK
+        push    { r0 - r3, rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 24
+                .cfi_rel_offset r0, 0
+                .cfi_rel_offset r1, 4
+                .cfi_rel_offset r2, 8
+                .cfi_rel_offset r3, 12
+                .cfi_rel_offset rT, 16
+                .cfi_rel_offset lr, 20
+      #else
+        push    { r0 - r3, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 20
+                .cfi_rel_offset r0, 0
+                .cfi_rel_offset r1, 4
+                .cfi_rel_offset r2, 8
+                .cfi_rel_offset r3, 12
+                .cfi_rel_offset lr, 16
+      #endif
+
+        // Reverse the operands.
+        movs    r0,     r1
+        ldr     r1,     [sp, #0]
+
+        // Don't just fall through, else registers will get pushed twice.
+        b       SYM(__internal_cfrcmple)
+
+        // MAYBE: 
+        // It might be better to pass original order arguments and swap 
+        //  the result instead.  Cleaner for STRICT_NAN trapping too.
+        //  Is 4 cycles worth 6 bytes?
+        // For example: 
+        //  $r2 = (FCMP_UN_NEGATIVE + FCMP_NO_EXCEPTIONS + FCMP_3WAY) 
+        //  movs    r1,    #1  
+        //  subs    r1,    r3
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_cfrcmple
+
+#endif /* L_arm_cfrcmple */
+
+
+#if defined(L_arm_cfcmple) || \ 
+   (defined(L_arm_cfcmpeq) && defined(TRAP_NANS) && TRAP_NANS)
+
+#ifdef L_arm_cfcmple
+.section .text.sorted.libgcc.fcmp.cfcmple,"x"
+  #define CFCMPLE_NAME aeabi_cfcmple
+#else
+.section .text.sorted.libgcc.fcmp.cfcmpeq,"x"
+  #define CFCMPLE_NAME aeabi_cfcmpeq 
+#endif
+
+// void __aeabi_cfcmple(float, float)
+// void __aeabi_cfcmpeq(float, float)
+// NOTE: These functions are only distinct if __aeabi_cfcmple() can raise exceptions.
+// Three-way compare of $r0 ? $r1, with result in the status flags:
+//  * 'Z' is set only when the operands are ordered and equal.
+//  * 'C' is clear only when the operands are ordered and $r0 < $r1.
+// Preserves all core registers except $ip, $lr, and the CPSR.
+// Same parent section as __cmpsf2() to keep tail call branch within range.
+CM0_FUNC_START CFCMPLE_NAME 
+
+  // __aeabi_cfcmpeq() is defined separately when TRAP_NANS is enabled.
+  #if !defined(TRAP_NANS) || !TRAP_NANS
+    CM0_FUNC_ALIAS aeabi_cfcmpeq aeabi_cfcmple
+  #endif
+
+    CFI_START_FUNCTION
+
+      #if defined(DOUBLE_ALIGN_STACK) && DOUBLE_ALIGN_STACK
+        push    { r0 - r3, rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 24
+                .cfi_rel_offset r0, 0
+                .cfi_rel_offset r1, 4
+                .cfi_rel_offset r2, 8
+                .cfi_rel_offset r3, 12
+                .cfi_rel_offset rT, 16
+                .cfi_rel_offset lr, 20
+      #else
+        push    { r0 - r3, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 20
+                .cfi_rel_offset r0, 0
+                .cfi_rel_offset r1, 4
+                .cfi_rel_offset r2, 8
+                .cfi_rel_offset r3, 12
+                .cfi_rel_offset lr, 16
+      #endif
+
+  #ifdef L_arm_cfcmple 
+    CM0_FUNC_START internal_cfrcmple
+        // Even though the result in $r0 will be discarded, the 3-way 
+        //  subtraction of '-1' that generates this result happens to 
+        //  set 'C' and 'Z' perfectly.  Unordered results group with '>'.
+        // This happens to be the same control word as __cmpsf2(), meaning 
+        //  that __cmpsf2() is a potential branch target.  However, 
+        //  the choice to set a redundant control word and branch to
+        //  __internal_cmpsf2() makes this compiled object more robust
+        //  against linking with 'foreign' __cmpsf2() implementations.
+        movs    r2,     #(FCMP_UN_POSITIVE + FCMP_RAISE_EXCEPTIONS + FCMP_3WAY)
+  #else /* L_arm_cfcmpeq */ 
+    CM0_FUNC_START internal_cfrcmpeq
+        // No exceptions on quiet NAN.
+        movs    r2,     #(FCMP_UN_POSITIVE + FCMP_NO_EXCEPTIONS + FCMP_3WAY)
+  #endif 
+
+        bl      SYM(__internal_cmpsf2)
+
+        // Clean up all working registers.
+      #if defined(DOUBLE_ALIGN_STACK) && DOUBLE_ALIGN_STACK
+        pop     { r0 - r3, rT, pc }
+                .cfi_restore_state
+      #else
+        pop     { r0 - r3, pc }
+                .cfi_restore_state
+      #endif
+
+    CFI_END_FUNCTION
+
+  #if !defined(TRAP_NANS) || !TRAP_NANS
+    CM0_FUNC_END aeabi_cfcmpeq
+  #endif
+
+CM0_FUNC_END CFCMPLE_NAME 
+
+#endif /* L_arm_cfcmple || L_arm_cfcmpeq */
+
+
+// C99 libm functions
+#if 0
+
+// int isgreaterf(float, float)
+// Returns '1' in $r0 if ($r0 > $r1) and both $r0 and $r1 are ordered.
+// Uses $r2, $r3, and $ip as scratch space.
+// Same parent section as __cmpsf2() to keep tail call branch within range. 
+.section .text.sorted.libgcc.fcmp.isgtf,"x"
+CM0_FUNC_START isgreaterf
+MATH_ALIAS isgreaterf
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_GT)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+MATH_END isgreaterf
+CM0_FUNC_END isgreaterf
+
+
+// int isgreaterequalf(float, float)
+// Returns '1' in $r0 if ($r0 >= $r1) and both $r0 and $r1 are ordered.
+// Uses $r2, $r3, and $ip as scratch space.
+// Same parent section as __cmpsf2() to keep tail call branch within range. 
+.section .text.sorted.libgcc.fcmp.isgef,"x"
+CM0_FUNC_START isgreaterequalf
+MATH_ALIAS isgreaterequalf
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_GT + FCMP_EQ)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+MATH_END isgreaterequalf
+CM0_FUNC_END isgreaterequalf
+
+
+// int islessf(float, float)
+// Returns '1' in $r0 if ($r0 < $r1) and both $r0 and $r1 are ordered.
+// Uses $r2, $r3, and $ip as scratch space.
+// Same parent section as __cmpsf2() to keep tail call branch within range. 
+.section .text.sorted.libgcc.fcmp.isltf,"x"
+CM0_FUNC_START islessf
+MATH_ALIAS islessf
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_GT + FCMP_EQ)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+MATH_END islessf
+CM0_FUNC_END islessf
+
+
+// int islessequalf(float, float)
+// Returns '1' in $r0 if ($r0 <= $r1) and both $r0 and $r1 are ordered.
+// Uses $r2, $r3, and $ip as scratch space.
+// Same parent section as __cmpsf2() to keep tail call branch within range. 
+.section .text.sorted.libgcc.fcmp.islef,"x"
+CM0_FUNC_START islessequalf
+MATH_ALIAS islessequalf
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_GT + FCMP_EQ)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+MATH_END islessequalf
+CM0_FUNC_END islessequalf
+
+
+// int islessgreaterf(float, float)
+// Returns '1' in $r0 if ($r0 != $r1) and both $r0 and $r1 are ordered.
+// Uses $r2, $r3, and $ip as scratch space.
+// Same parent section as __cmpsf2() to keep tail call branch within range. 
+.section .text.sorted.libgcc.fcmp.isnef,"x"
+CM0_FUNC_START islessgreaterf
+MATH_ALIAS islessgreaterf
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_GT + FCMP_EQ)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+MATH_END islessgreaterf
+CM0_FUNC_END islessgreaterf
+
+
+// int isunorderedf(float, float)
+// Returns '1' in $r0 if either $r0 or $r1 are ordered.
+// Uses $r2, $r3, and $ip as scratch space.
+// Same parent section as __cmpsf2() to keep tail call branch within range. 
+.section .text.sorted.libgcc.fcmp.isunf,"x"
+CM0_FUNC_START isunorderedf
+MATH_ALIAS isunorderedf
+    CFI_START_FUNCTION
+
+        movs    r2,     #(FCMP_UN_ZERO + FCMP_NO_EXCEPTIONS + FCMP_GT + FCMP_EQ)
+        b       SYM(__internal_cmpsf2)
+
+    CFI_END_FUNCTION
+MATH_END isunorderedf
+CM0_FUNC_END isunorderedf
+
+#endif /* 0 */
+
diff -ruN gcc-11-20201220-clean/libgcc/config/arm/cm0/fconv.S gcc-11-20201220/libgcc/config/arm/cm0/fconv.S
--- gcc-11-20201220-clean/libgcc/config/arm/cm0/fconv.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201220/libgcc/config/arm/cm0/fconv.S	2021-01-06 02:45:47.428262284 -0800
@@ -0,0 +1,429 @@
+/* fconv.S: Cortex M0 optimized 32- and 64-bit float conversions
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef L_arm_f2d
+
+// double __aeabi_f2d(float)
+// Converts a single-precision float in $r0 to double-precision in $r1:$r0.
+// Rounding, overflow, and underflow are impossible.
+// INF and ZERO are returned unmodified.
+.section .text.sorted.libgcc.fpcore.v.extendsfdf2,"x"
+CM0_FUNC_START aeabi_f2d
+CM0_FUNC_ALIAS extendsfdf2 aeabi_f2d
+    CFI_START_FUNCTION
+
+        // Save the sign.
+        lsrs    r1,     r0,     #31
+        lsls    r1,     #31
+
+        // Set up registers for __fp_normalize2().
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        // Test for zero.
+        lsls    r0,     #1
+        beq     LSYM(__f2d_return)
+
+        // Split the exponent and mantissa into separate registers.
+        // This is the most efficient way to convert subnormals in the
+        //  half-precision form into normals in single-precision.
+        // This does add a leading implicit '1' to INF and NAN,
+        //  but that will be absorbed when the value is re-assembled.
+        movs    r2,     r0
+        bl      SYM(__fp_normalize2) __PLT__
+
+        // Set up the exponent bias.  For INF/NAN values, the bias
+        //  is 1791 (2047 - 255 - 1), where the last '1' accounts
+        //  for the implicit '1' in the mantissa.
+        movs    r0,     #3
+        lsls    r0,     #9
+        adds    r0,     #255
+
+        // Test for INF/NAN, promote exponent if necessary
+        cmp     r2,     #255
+        beq     LSYM(__f2d_indefinite)
+
+        // For normal values, the exponent bias is 895 (1023 - 127 - 1),
+        //  which is half of the prepared INF/NAN bias.
+        lsrs    r0,     #1
+
+    LSYM(__f2d_indefinite):
+        // Assemble exponent with bias correction.
+        adds    r2,     r0
+        lsls    r2,     #20
+        adds    r1,     r2
+
+        // Assemble the high word of the mantissa.
+        lsrs    r0,     r3,     #11
+        add     r1,     r0
+
+        // Remainder of the mantissa in the low word of the result.
+        lsls    r0,     r3,     #21
+
+    LSYM(__f2d_return):
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END extendsfdf2
+CM0_FUNC_END aeabi_f2d
+
+#endif /* L_arm_f2d */
+
+
+#if defined(L_arm_d2f)
+// TODO: not tested || defined(L_arm_truncdfsf2)
+
+// HACK: Build two separate implementations:
+//  * __aeabi_d2f() rounds to nearest per traditional IEEE-753 rules.
+//  * __truncdfsf2() rounds towards zero per GCC specification.  
+// Presumably, a program will consistently use one ABI or the other, 
+//  which means that this code will not be duplicated in practice.  
+// Merging the two versions with dynamic rounding would be rather hard. 
+#ifdef L_arm_truncdfsf2
+  #define D2F_NAME truncdfsf2 
+#else
+  #define D2F_NAME aeabi_d2f
+#endif
+
+// float __aeabi_d2f(double)
+// Converts a double-precision float in $r1:$r0 to single-precision in $r0.
+// Values out of range become ZERO or INF; returns the upper 23 bits of NAN.
+.section .text.sorted.libgcc.fpcore.w.truncdfsf2,"x"
+CM0_FUNC_START D2F_NAME
+    CFI_START_FUNCTION
+
+        // Save the sign.
+        lsrs    r2,     r1,     #31
+        lsls    r2,     #31
+        mov     ip,     r2
+
+        // Isolate the exponent (11 bits).
+        lsls    r2,     r1,     #1
+        lsrs    r2,     #21
+
+        // Isolate the mantissa.  It's safe to always add the implicit '1' --
+        //  even for subnormals -- since they will underflow in every case.
+        lsls    r1,     #12
+        adds    r1,     #1
+        rors    r1,     r1
+        lsrs    r3,     r0,     #21
+        adds    r1,     r3
+
+  #ifndef L_arm_truncdfsf2 
+        // Fix the remainder.  Even though the mantissa already has 32 bits
+        //  of significance, this value still influences rounding ties.  
+        lsls    r0,     #11
+  #endif 
+
+        // Test for INF/NAN (r3 = 2047)
+        mvns    r3,     r2
+        lsrs    r3,     #21
+        cmp     r3,     r2
+        beq     LSYM(__d2f_indefinite)
+
+        // Adjust exponent bias.  Offset is 127 - 1023, less 1 more since
+        //  __fp_assemble() expects the exponent relative to bit[30].
+        lsrs    r3,     #1
+        subs    r2,     r3
+        adds    r2,     #126
+
+  #ifndef L_arm_truncdfsf2 
+    LSYM(__d2f_overflow):
+        // Use the standard formatting for overflow and underflow.
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        b       SYM(__fp_assemble)
+                .cfi_restore_state
+
+  #else /* L_arm_truncdfsf2 */
+        // In theory, __truncdfsf2() could also push registers and branch to
+        //  __fp_assemble() after calculating the truncation shift and clearing
+        //  bits.  __fp_assemble() always rounds down if there is no remainder.  
+        // However, after doing all of that work, the incremental cost to  
+        //  finish assembling the return value is only 6 or 7 instructions
+        //  (depending on how __d2f_overflow() returns).
+        // This seems worthwhile to avoid linking in all of __fp_assemble(). 
+
+        // Test for INF. 
+        cmp     r2,     #254 
+        bge     LSYM(__d2f_overflow)
+
+        // HACK: Pre-empt the default round-to-nearest mode, 
+        //  since GCC specifies rounding towards zero. 
+        // Start by identifying subnormals by negative exponents. 
+        asrs    r3,     r2,     #31
+        ands    r3,     r2
+
+        // Clear the standard exponent field for subnormals. 
+        eors    r2,     r3
+
+        // Add the subnormal shift to the nominal 8 bits.
+        rsbs    r3,     #0
+        adds    r3,     #8
+
+        // Clamp the shift to a single word (branchless).  
+        // Anything larger would have flushed to zero anyway.
+        lsls    r3,     #27 
+        lsrs    r3,     #27
+
+      #if defined(FP_EXCEPTIONS) && FP_EXCEPTIONS
+        // Preserve inexact zero. 
+        orrs    r0,     r1
+      #endif
+
+        // Clear the insignificant bits.
+        lsrs    r1,     r3 
+
+        // Combine the mantissa and the exponent.
+        // TODO: Test for inexact zero after adding. 
+        lsls    r2,     #23
+        adds    r0,     r1,     r2
+
+        // Combine with the saved sign.
+        add     r0,     ip
+        RET 
+
+    LSYM(__d2f_overflow):
+        // Construct signed INF in $r0.
+        movs    r0,     #255 
+        lsls    r0,     #23
+        add     r0,     ip
+        RET 
+
+  #endif /* L_arm_truncdfsf2 */
+
+    LSYM(__d2f_indefinite):
+        // Test for INF.  If the mantissa, exclusive of the implicit '1',
+        //  is equal to '0', the result will be INF.
+        lsls    r3,     r1,     #1
+        orrs    r3,     r0
+        beq     LSYM(__d2f_overflow)
+
+        // TODO: Support for TRAP_NANS here. 
+        // This will be double precision, not compatible with the current handler. 
+
+        // Construct NAN with the upper 22 bits of the mantissa, setting bit[21]
+        //  to ensure a valid NAN without changing bit[22] (quiet)
+        subs    r2,     #0xD
+        lsls    r0,     r2,     #20
+        lsrs    r1,     #8
+        orrs    r0,     r1
+
+      #if defined(STRICT_NANS) && STRICT_NANS
+        // Yes, the NAN has already been altered, but at least keep the sign... 
+        add     r0,     ip
+      #endif
+
+        RET
+
+    CFI_END_FUNCTION
+CM0_FUNC_END D2F_NAME
+
+#endif /* L_arm_d2f || L_arm_truncdfsf2 */
+
+
+#ifdef L_arm_h2f 
+
+// float __aeabi_h2f(short hf)
+// Converts a half-precision float in $r0 to single-precision.
+// Rounding, overflow, and underflow conditions are impossible.
+// INF and ZERO are returned unmodified.
+.section .text.sorted.libgcc.h2f,"x"
+CM0_FUNC_START aeabi_h2f
+    CFI_START_FUNCTION
+
+        // Set up registers for __fp_normalize2().
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        // Save the mantissa and exponent.
+        lsls    r2,     r0,     #17
+
+        // Isolate the sign.
+        lsrs    r0,     #15
+        lsls    r0,     #31
+
+        // Align the exponent at bit[24] for normalization.
+        // If zero, return the original sign.
+        lsrs    r2,     #3
+        beq     LSYM(__h2f_return)
+
+        // Split the exponent and mantissa into separate registers.
+        // This is the most efficient way to convert subnormals in the
+        //  half-precision form into normals in single-precision.
+        // This does add a leading implicit '1' to INF and NAN,
+        //  but that will be absorbed when the value is re-assembled.
+        bl      SYM(__fp_normalize2) __PLT__
+
+        // Set up the exponent bias.  For INF/NAN values, the bias is 223,
+        //  where the last '1' accounts for the implicit '1' in the mantissa.
+        adds    r2,     #(255 - 31 - 1)
+
+        // Test for INF/NAN.
+        cmp     r2,     #254
+        beq     LSYM(__h2f_assemble)
+
+        // For normal values, the bias should have been 111.
+        // However, this adjustment now is faster than branching.
+        subs    r2,     #((255 - 31 - 1) - (127 - 15 - 1))
+
+    LSYM(__h2f_assemble):
+        // Combine exponent and sign.
+        lsls    r2,     #23
+        adds    r0,     r2
+
+        // Combine mantissa.
+        lsrs    r3,     #8
+        add     r0,     r3
+
+    LSYM(__h2f_return):
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_h2f
+
+#endif /* L_arm_h2f */
+
+
+#ifdef L_arm_f2h
+
+// short __aeabi_f2h(float f)
+// Converts a single-precision float in $r1 to half-precision,
+//  rounding to nearest, ties to even.
+// Values out of range become ZERO or INF; returns the upper 12 bits of NAN.
+// Values out of range are forced to either ZERO or INF.
+.section .text.sorted.libgcc.f2h,"x"
+CM0_FUNC_START aeabi_f2h
+    CFI_START_FUNCTION
+
+        // Set up the sign.
+        lsrs    r2,     r0,     #31
+        lsls    r2,     #15
+
+        // Save the exponent and mantissa.
+        // If ZERO, return the original sign.
+        lsls    r0,     #1
+        beq     LSYM(__f2h_return)
+
+        // Isolate the exponent, check for NAN.
+        lsrs    r1,     r0,     #24
+        cmp     r1,     #255
+        beq     LSYM(__f2h_indefinite)
+
+        // Check for overflow.
+        cmp     r1,     #(127 + 15)
+        bhi     LSYM(__f2h_overflow)
+
+        // Isolate the mantissa, adding back the implicit '1'.
+        lsls    r0,     #8
+        adds    r0,     #1
+        rors    r0,     r0
+
+        // Adjust exponent bias for half-precision, including '1' to
+        //  account for the mantissa's implicit '1'.
+        subs    r1,     #(127 - 15 + 1)
+        bmi     LSYM(__f2h_underflow)
+
+        // Combine the exponent and sign.
+        lsls    r1,     #10
+        adds    r2,     r1
+
+        // Split the mantissa (11 bits) and remainder (13 bits).
+        lsls    r3,     r0,     #12
+        lsrs    r0,     #21
+
+     LSYM(__f2h_round):
+        // If the carry bit is '0', always round down.
+        bcc     LSYM(__f2h_return)
+
+        // Carry was set.  If a tie (no remainder) and the
+        //  LSB of the result are '0', round down (to even).
+        lsls    r1,     r0,     #31
+        orrs    r1,     r3
+        beq     LSYM(__f2h_return)
+
+        // Round up, ties to even.
+        adds    r0,     #1
+
+     LSYM(__f2h_return):
+        // Combine mantissa and exponent.
+        adds    r0,     r2
+        RET
+
+    LSYM(__f2h_underflow):
+        // Align the remainder. The remainder consists of the last 12 bits
+        //  of the mantissa plus the magnitude of underflow.
+        movs    r3,     r0
+        adds    r1,     #12
+        lsls    r3,     r1
+
+        // Align the mantissa.  The MSB of the remainder must be
+        // shifted out into last the 'C' flag for rounding.
+        subs    r1,     #33
+        rsbs    r1,     #0
+        lsrs    r0,     r1
+        b       LSYM(__f2h_round)
+
+    LSYM(__f2h_overflow):
+        // Create single-precision INF from which to construct half-precision.
+        movs    r0,     #255
+        lsls    r0,     #24
+
+    LSYM(__f2h_indefinite):
+        // Check for INF.
+        lsls    r3,     r0,     #8
+        beq     LSYM(__f2h_infinite)
+
+        // Set bit[8] to ensure a valid NAN without changing bit[9] (quiet).
+        adds    r2,     #128
+        adds    r2,     #128
+
+    LSYM(__f2h_infinite):
+        // Construct the result from the upper 22 bits of the mantissa
+        //  and the lower 5 bits of the exponent.
+        lsls    r0,     #3
+        lsrs    r0,     #17
+
+        // Combine with the sign (and possibly NAN flag).
+        orrs    r0,     r2
+        RET
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_f2h
+
+#endif  /* L_arm_f2h */
+
diff -ruN gcc-11-20201220-clean/libgcc/config/arm/cm0/fdiv.S gcc-11-20201220/libgcc/config/arm/cm0/fdiv.S
--- gcc-11-20201220-clean/libgcc/config/arm/cm0/fdiv.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201220/libgcc/config/arm/cm0/fdiv.S	2021-01-06 02:45:47.428262284 -0800
@@ -0,0 +1,257 @@
+/* fdiv.S: Cortex M0 optimized 32-bit float division
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef L_arm_divsf3
+
+// float __aeabi_fdiv(float, float)
+// Returns $r0 after division by $r1.
+.section .text.sorted.libgcc.fpcore.n.fdiv,"x"
+CM0_FUNC_START aeabi_fdiv
+CM0_FUNC_ALIAS divsf3 aeabi_fdiv
+    CFI_START_FUNCTION
+
+        // Standard registers, compatible with exception handling.
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        // Save for the sign of the result.
+        movs    r3,     r1
+        eors    r3,     r0
+        lsrs    rT,     r3,     #31
+        lsls    rT,     #31
+        mov     ip,     rT
+
+        // Set up INF for comparison.
+        movs    rT,     #255
+        lsls    rT,     #24
+
+        // Check for divide by 0.  Automatically catches 0/0.
+        lsls    r2,     r1,     #1
+        beq     LSYM(__fdiv_by_zero)
+
+        // Check for INF/INF, or a number divided by itself.
+        lsls    r3,     #1
+        beq     LSYM(__fdiv_equal)
+
+        // Check the numerator for INF/NAN.
+        eors    r3,     r2
+        cmp     r3,     rT
+        bhs     LSYM(__fdiv_special1)
+
+        // Check the denominator for INF/NAN.
+        cmp     r2,     rT
+        bhs     LSYM(__fdiv_special2)
+
+        // Check the numerator for zero.
+        cmp     r3,     #0
+        beq     SYM(__fp_zero)
+
+        // No action if the numerator is subnormal.
+        //  The mantissa will normalize naturally in the division loop.
+        lsls    r0,     #9
+        lsrs    r1,     r3,     #24
+        beq     LSYM(__fdiv_denominator)
+
+        // Restore the numerator's implicit '1'.
+        adds    r0,     #1
+        rors    r0,     r0
+
+    LSYM(__fdiv_denominator):
+        // The denominator must be normalized and left aligned.
+        bl      SYM(__fp_normalize2)
+
+        // 25 bits of precision will be sufficient.
+        movs    rT,     #64
+
+        // Run division.
+        bl      SYM(__internal_fdiv_loop)
+        b       SYM(__fp_assemble)
+
+    LSYM(__fdiv_equal):
+      #if defined(EXCEPTION_CODES) && EXCEPTION_CODES
+        movs    r3,     #(DIVISION_INF_BY_INF)
+      #endif
+
+        // The absolute value of both operands are equal, but not 0.
+        // If both operands are INF, create a new NAN.
+        cmp     r2,     rT
+        beq     SYM(__fp_exception)
+
+      #if defined(TRAP_NANS) && TRAP_NANS
+        // If both operands are NAN, return the NAN in $r0.
+        bhi     SYM(__fp_check_nan)
+      #else
+        bhi     LSYM(__fdiv_return)
+      #endif
+
+        // Return 1.0f, with appropriate sign.
+        movs    r0,     #127
+        lsls    r0,     #23
+        add     r0,     ip
+
+    LSYM(__fdiv_return):
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    LSYM(__fdiv_special2):
+        // The denominator is either INF or NAN, numerator is neither.
+        // Also, the denominator is not equal to 0.
+        // If the denominator is INF, the result goes to 0.
+        beq     SYM(__fp_zero)
+
+        // The only other option is NAN, fall through to branch.
+        mov     r0,     r1
+
+    LSYM(__fdiv_special1):
+      #if defined(TRAP_NANS) && TRAP_NANS
+        // The numerator is INF or NAN.  If NAN, return it directly.
+        bne     SYM(__fp_check_nan)
+      #else
+        bne     LSYM(__fdiv_return)
+      #endif
+
+        // If INF, the result will be INF if the denominator is finite.
+        // The denominator won't be either INF or 0,
+        //  so fall through the exception trap to check for NAN.
+        movs    r0,     r1
+
+    LSYM(__fdiv_by_zero):
+      #if defined(EXCEPTION_CODES) && EXCEPTION_CODES
+        movs    r3,     #(DIVISION_0_BY_0)
+      #endif
+
+        // The denominator is 0.
+        // If the numerator is also 0, the result will be a new NAN.
+        // Otherwise the result will be INF, with the correct sign.
+        lsls    r2,     r0,     #1
+        beq     SYM(__fp_exception)
+
+        // The result should be NAN if the numerator is NAN.  Otherwise,
+        //  the result is INF regardless of the numerator value.
+        cmp     r2,     rT
+
+      #if defined(TRAP_NANS) && TRAP_NANS
+        bhi     SYM(__fp_check_nan)
+      #else
+        bhi     LSYM(__fdiv_return)
+      #endif
+
+        // Recreate INF with the correct sign.
+        b       SYM(__fp_infinity)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END divsf3
+CM0_FUNC_END aeabi_fdiv
+
+
+// Division helper, possibly to be shared with atan2.
+// Expects the numerator mantissa in $r0, exponent in $r1,
+//  plus the denominator mantissa in $r3, exponent in $r2, and
+//  a bit pattern in $rT that controls the result precision.
+// Returns quotient in $r1, exponent in $r2, pseudo remainder in $r0.
+.section .text.sorted.libgcc.fpcore.o.fdiv2,"x"
+CM0_FUNC_START internal_fdiv_loop
+    CFI_START_FUNCTION
+
+        // Initialize the exponent, relative to bit[30].
+        subs    r2,     r1,     r2
+
+    SYM(__internal_fdiv_loop2):
+        // The exponent should be (expN - 127) - (expD - 127) + 127.
+        // An additional offset of 25 is required to account for the
+        //  minimum number of bits in the result (before rounding).
+        // However, drop '1' because the offset is relative to bit[30],
+        //  while the result is calculated relative to bit[31].
+        adds    r2,     #(127 + 25 - 1)
+
+      #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+        // Dividing by a power of 2?
+        lsls    r1,     r3,     #1
+        beq     LSYM(__fdiv_simple)
+      #endif
+
+        // Initialize the result.
+        eors    r1,     r1
+
+        // Clear the MSB, so that when the numerator is smaller than
+        //  the denominator, there is one bit free for a left shift.
+        // After a single shift, the numerator is guaranteed to be larger.
+        // The denominator ends up in r3, and the numerator ends up in r0,
+        //  so that the numerator serves as a psuedo-remainder in rounding.
+        // Shift the numerator one additional bit to compensate for the
+        //  pre-incrementing loop.
+        lsrs    r0,     #2
+        lsrs    r3,     #1
+
+    LSYM(__fdiv_loop):
+        // Once the MSB of the output reaches the MSB of the register,
+        //  the result has been calculated to the required precision.
+        lsls    r1,     #1
+        bmi     LSYM(__fdiv_break)
+
+        // Shift the numerator/remainder left to set up the next bit.
+        subs    r2,     #1
+        lsls    r0,     #1
+
+        // Test if the numerator/remainder is smaller than the denominator,
+        //  do nothing if it is.
+        cmp     r0,     r3
+        blo     LSYM(__fdiv_loop)
+
+        // If the numerator/remainder is greater or equal, set the next bit,
+        //  and subtract the denominator.
+        adds    r1,     rT
+        subs    r0,     r3
+
+        // Short-circuit if the remainder goes to 0.
+        // Even with the overhead of "subnormal" alignment,
+        //  this is usually much faster than continuing.
+        bne     LSYM(__fdiv_loop)
+
+        // Compensate the alignment of the result.
+        // The remainder does not need compensation, it's already 0.
+        lsls    r1,     #1
+
+    LSYM(__fdiv_break):
+        RET
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+    LSYM(__fdiv_simple):
+        // The numerator becomes the result, with a remainder of 0.
+        movs    r1,     r0
+        eors    r0,     r0
+        subs    r2,     #25
+        RET
+  #endif
+
+    CFI_END_FUNCTION
+CM0_FUNC_END internal_fdiv_loop
+
+#endif /* L_arm_divsf3 */
+
diff -ruN gcc-11-20201220-clean/libgcc/config/arm/cm0/ffixed.S gcc-11-20201220/libgcc/config/arm/cm0/ffixed.S
--- gcc-11-20201220-clean/libgcc/config/arm/cm0/ffixed.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201220/libgcc/config/arm/cm0/ffixed.S	2021-01-06 02:45:47.428262284 -0800
@@ -0,0 +1,342 @@
+/* ffixed.S: Cortex M0 optimized float->int conversion
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef L_arm_fixsfsi
+
+// int __aeabi_f2iz(float)
+// Converts a float in $r0 to signed integer, rounding toward 0.
+// Values out of range are forced to either INT_MAX or INT_MIN.
+// NAN becomes zero.
+.section .text.sorted.libgcc.fpcore.r.fixsfsi,"x"
+CM0_FUNC_START aeabi_f2iz
+CM0_FUNC_ALIAS fixsfsi aeabi_f2iz
+    CFI_START_FUNCTION
+
+  #if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+        // Flag for unsigned conversion.
+        movs    r1,     #33
+        b       LSYM(__real_f2lz)
+
+  #else /* !__OPTIMIZE_SIZE__ */
+        // Flag for signed conversion.
+        movs    r3,     #1
+
+    LSYM(__real_f2iz):
+        // Isolate the sign of the result.
+        asrs    r1,     r0,     #31
+        lsls    r0,     #1
+
+  #if defined(FP_EXCEPTION) && FP_EXCEPTION
+        // Check for zero to avoid spurious underflow exception on -0.
+        beq     LSYM(__f2iz_return)
+  #endif
+
+        // Isolate the exponent.
+        lsrs    r2,     r0,     #24
+
+  #if defined(TRAP_NANS) && TRAP_NANS
+        // Test for NAN.
+        // Otherwise, NAN will be converted like +/-INF.
+        cmp     r2,     #255
+        beq     LSYM(__f2iz_nan)
+  #endif
+
+        // Extract the mantissa and restore the implicit '1'. Technically,
+        //  this is wrong for subnormals, but they flush to zero regardless.
+        lsls    r0,     #8
+        adds    r0,     #1
+        rors    r0,     r0
+
+        // Calculate mantissa alignment. Given the implicit '1' in bit[31]:
+        //  * An exponent less than 127 will automatically flush to 0.
+        //  * An exponent of 127 will result in a shift of 31.
+        //  * An exponent of 128 will result in a shift of 30.
+        //  *  ...
+        //  * An exponent of 157 will result in a shift of 1.
+        //  * An exponent of 158 will result in no shift at all.
+        //  * An exponent larger than 158 will result in overflow.
+        rsbs    r2,     #0
+        adds    r2,     #158
+
+        // When the shift is less than minimum, the result will overflow.
+        // The only signed value to fail this test is INT_MIN (0x80000000),
+        //  but it will be returned correctly from the overflow branch.
+        cmp     r2,     r3
+        blt     LSYM(__f2iz_overflow)
+
+        // If unsigned conversion of a negative value, also overflow.
+        // Would also catch -0.0f if not handled earlier.
+        cmn     r3,     r1
+        blt     LSYM(__f2iz_overflow)
+
+  #if defined(FP_EXCEPTION) && FP_EXCEPTION
+        // Save a copy for remainder testing
+        movs    r3,     r0
+  #endif
+
+        // Truncate the fraction.
+        lsrs    r0,     r2
+
+        // Two's complement negation, if applicable.
+        // Bonus: the sign in $r1 provides a suitable long long result.
+        eors    r0,     r1
+        subs    r0,     r1
+
+  #if defined(FP_EXCEPTION) && FP_EXCEPTION
+        // If any bits set in the remainder, raise FE_INEXACT
+        rsbs    r2,     #0
+        adds    r2,     #32
+        lsls    r3,     r2
+        bne     LSYM(__f2iz_inexact)
+  #endif
+
+    LSYM(__f2iz_return):
+        RET
+
+    LSYM(__f2iz_overflow):
+        // Positive unsigned integers (r1 == 0, r3 == 0), return 0xFFFFFFFF.
+        // Negative unsigned integers (r1 == -1, r3 == 0), return 0x00000000.
+        // Positive signed integers (r1 == 0, r3 == 1), return 0x7FFFFFFF.
+        // Negative signed integers (r1 == -1, r3 == 1), return 0x80000000.
+        // TODO: FE_INVALID exception, (but not for -2^31).
+        mvns    r0,     r1
+        lsls    r3,     #31
+        eors    r0,     r3
+        RET
+
+  #if defined(FP_EXCEPTION) && FP_EXCEPTION
+    LSYM(__f2iz_inexact):
+        // TODO: Another class of exceptions that doesn't overwrite $r0.
+        bkpt    #0
+
+      #if defined(EXCEPTION_CODES) && EXCEPTION_CODES
+        movs    r3,     #(CAST_INEXACT)
+      #endif
+
+        b       SYM(__fp_exception)
+  #endif
+
+    LSYM(__f2iz_nan):
+        // Check for INF
+        lsls    r2,     r0,     #9
+        beq     LSYM(__f2iz_overflow)
+
+  #if defined(FP_EXCEPTION) && FP_EXCEPTION
+      #if defined(EXCEPTION_CODES) && EXCEPTION_CODES
+        movs    r3,     #(CAST_UNDEFINED)
+      #endif
+
+        b       SYM(__fp_exception)
+  #else
+
+  #endif
+
+      #if defined(TRAP_NANS) && TRAP_NANS
+
+        // TODO: Extend to long long
+
+        // TODO: bl  fp_check_nan
+      #endif
+
+        // Return long long 0 on NAN.
+        eors    r0,     r0
+        eors    r1,     r1
+        RET
+
+  #endif /* !__OPTIMIZE_SIZE__ */
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fixsfsi
+CM0_FUNC_END aeabi_f2iz
+
+
+// unsigned int __aeabi_f2uiz(float)
+// Converts a float in $r0 to unsigned integer, rounding toward 0.
+// Values out of range are forced to UINT_MAX.
+// Negative values and NAN all become zero.
+.section .text.sorted.libgcc.fpcore.s.fixunssfsi,"x"
+CM0_FUNC_START aeabi_f2uiz
+CM0_FUNC_ALIAS fixunssfsi aeabi_f2uiz
+    CFI_START_FUNCTION
+
+  #if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+        // Flag for unsigned conversion.
+        movs    r1,     #32
+        b       LSYM(__real_f2lz)
+
+  #else /* !__OPTIMIZE_SIZE__ */
+        // Flag for unsigned conversion.
+        movs    r3,     #0
+        b       LSYM(__real_f2iz)
+
+  #endif /* !__OPTIMIZE_SIZE__ */
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fixunssfsi
+CM0_FUNC_END aeabi_f2uiz
+
+
+// long long aeabi_f2lz(float)
+// Converts a float in $r0 to a 64 bit integer in $r1:$r0, rounding toward 0.
+// Values out of range are forced to either INT64_MAX or INT64_MIN.
+// NAN becomes zero.
+.section .text.sorted.libgcc.fpcore.t.fixsfdi,"x"
+CM0_FUNC_START aeabi_f2lz
+CM0_FUNC_ALIAS fixsfdi aeabi_f2lz
+    CFI_START_FUNCTION
+
+        movs    r1,     #1
+
+    LSYM(__real_f2lz):
+        // Split the sign of the result from the mantissa/exponent field.
+        // Handle +/-0 specially to avoid spurious exceptions.
+        asrs    r3,     r0,     #31
+        lsls    r0,     #1
+        beq     LSYM(__f2lz_zero)
+
+        // If unsigned conversion of a negative value, also overflow.
+        // Specifically, is the LSB of $r1 clear when $r3 is equal to '-1'?
+        //
+        // $r3 (sign)   >=     $r2 (flag)
+        // 0xFFFFFFFF   false   0x00000000
+        // 0x00000000   true    0x00000000
+        // 0xFFFFFFFF   true    0x80000000
+        // 0x00000000   true    0x80000000
+        //
+        // (NOTE: This test will also trap -0.0f, unless handled earlier.)
+        lsls    r2,     r1,     #31
+        cmp     r3,     r2
+        blt     LSYM(__f2lz_overflow)
+
+        // Isolate the exponent.
+        lsrs    r2,     r0,     #24
+
+//   #if defined(TRAP_NANS) && TRAP_NANS
+//         // Test for NAN.
+//         // Otherwise, NAN will be converted like +/-INF.
+//         cmp     r2,     #255
+//         beq     LSYM(__f2lz_nan)
+//   #endif
+
+        // Calculate mantissa alignment. Given the implicit '1' in bit[31]:
+        //  * An exponent less than 127 will automatically flush to 0.
+        //  * An exponent of 127 will result in a shift of 63.
+        //  * An exponent of 128 will result in a shift of 62.
+        //  *  ...
+        //  * An exponent of 189 will result in a shift of 1.
+        //  * An exponent of 190 will result in no shift at all.
+        //  * An exponent larger than 190 will result in overflow
+        //     (189 in the case of signed integers).
+        rsbs    r2,     #0
+        adds    r2,     #190
+        // When the shift is less than minimum, the result will overflow.
+        // The only signed value to fail this test is INT_MIN (0x80000000),
+        //  but it will be returned correctly from the overflow branch.
+        cmp     r2,     r1
+        blt     LSYM(__f2lz_overflow)
+
+        // Extract the mantissa and restore the implicit '1'. Technically,
+        //  this is wrong for subnormals, but they flush to zero regardless.
+        lsls    r0,     #8
+        adds    r0,     #1
+        rors    r0,     r0
+
+        // Calculate the upper word.
+        // If the shift is greater than 32, gives an automatic '0'.
+        movs    r1,     r0
+        lsrs    r1,     r2
+
+        // Reduce the shift for the lower word.
+        // If the original shift was less than 32, the result may be split
+        //  between the upper and lower words.
+        subs    r2,     #32
+        blt     LSYM(__f2lz_split)
+
+        // Shift is still positive, keep moving right.
+        lsrs    r0,     r2
+
+        // TODO: Remainder test.
+        // $r1 is technically free, as long as it's zero by the time
+        //  this is over.
+
+    LSYM(__f2lz_return):
+        // Two's complement negation, if the original was negative.
+        eors    r0,     r3
+        eors    r1,     r3
+        subs    r0,     r3
+        sbcs    r1,     r3
+        RET
+
+    LSYM(__f2lz_split):
+        // Shift was negative, calculate the remainder
+        rsbs    r2,     #0
+        lsls    r0,     r2
+        b       LSYM(__f2lz_return)
+
+    LSYM(__f2lz_zero):
+        eors    r1,     r1
+        RET
+
+    LSYM(__f2lz_overflow):
+        // Positive unsigned integers (r3 == 0, r1 == 0), return 0xFFFFFFFF.
+        // Negative unsigned integers (r3 == -1, r1 == 0), return 0x00000000.
+        // Positive signed integers (r3 == 0, r1 == 1), return 0x7FFFFFFF.
+        // Negative signed integers (r3 == -1, r1 == 1), return 0x80000000.
+        // TODO: FE_INVALID exception, (but not for -2^63).
+        mvns    r0,     r3
+
+        // For 32-bit results
+        lsls    r2,     r1,     #26
+        lsls    r1,     #31
+        ands    r2,     r1
+        eors    r0,     r2
+
+//    LSYM(__f2lz_zero):
+        eors    r1,     r0
+        RET
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fixsfdi
+CM0_FUNC_END aeabi_f2lz
+
+
+// unsigned long long __aeabi_f2ulz(float)
+// Converts a float in $r0 to a 64 bit integer in $r1:$r0, rounding toward 0.
+// Values out of range are forced to UINT64_MAX.
+// Negative values and NAN all become zero.
+.section .text.sorted.libgcc.fpcore.u.fixunssfdi,"x"
+CM0_FUNC_START aeabi_f2ulz
+CM0_FUNC_ALIAS fixunssfdi aeabi_f2ulz
+    CFI_START_FUNCTION
+
+        eors    r1,     r1
+        b       LSYM(__real_f2lz)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fixunssfdi
+CM0_FUNC_END aeabi_f2ulz
+
+#endif /* L_arm_addsubsf3 */ 
+
diff -ruN gcc-11-20201220-clean/libgcc/config/arm/cm0/ffloat.S gcc-11-20201220/libgcc/config/arm/cm0/ffloat.S
--- gcc-11-20201220-clean/libgcc/config/arm/cm0/ffloat.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201220/libgcc/config/arm/cm0/ffloat.S	2021-01-06 02:45:47.428262284 -0800
@@ -0,0 +1,224 @@
+/* ffixed.S: Cortex M0 optimized int->float conversion
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+  
+#ifdef L_arm_floatsisf
+ 
+// float __aeabi_i2f(int)
+// Converts a signed integer in $r0 to float.
+.section .text.sorted.libgcc.fpcore.p.floatsisf,"x"
+
+// On little-endian cores (including all Cortex-M), __floatsisf() can be
+//  implemented as below in 5 instructions.  However, it can also be
+//  implemented by prefixing a single instruction to __floatdisf().
+// A memory savings of 4 instructions at a cost of only 2 execution cycles
+//  seems reasonable enough.  Plus, the trade-off only happens in programs
+//  that require both __floatsisf() and __floatdisf().  Programs only using
+//  __floatsisf() always get the smallest version.  
+// When the combined version will be provided, this standalone version
+//  must be declared WEAK, so that the combined version can supersede it.
+// '_arm_floatsisf' should appear before '_arm_floatdisf' in LIB1ASMFUNCS.
+#if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+CM0_WEAK_START aeabi_i2f
+CM0_WEAK_ALIAS floatsisf aeabi_i2f
+#else /* !__OPTIMIZE_SIZE__ */ 
+CM0_FUNC_START aeabi_i2f
+CM0_FUNC_ALIAS floatsisf aeabi_i2f
+#endif /* !__OPTIMIZE_SIZE__ */
+    CFI_START_FUNCTION
+
+        // Save the sign.
+        asrs    r3,     r0,     #31
+
+        // Absolute value of the input. 
+        eors    r0,     r3
+        subs    r0,     r3
+
+        // Sign extension to long long unsigned.
+        eors    r1,     r1
+        b       SYM(__internal_uil2f_noswap)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END floatsisf
+CM0_FUNC_END aeabi_i2f
+
+#endif /* L_arm_floatsisf */
+
+
+#ifdef L_arm_floatdisf
+
+// float __aeabi_l2f(long long)
+// Converts a signed 64-bit integer in $r1:$r0 to a float in $r0.
+.section .text.sorted.libgcc.fpcore.p.floatdisf,"x"
+
+// See comments for __floatsisf() above. 
+#if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+CM0_FUNC_START aeabi_i2f
+CM0_FUNC_ALIAS floatsisf aeabi_i2f
+    CFI_START_FUNCTION
+
+      #if defined(__ARMEB__) && __ARMEB__ 
+        // __floatdisf() expects a big-endian lower word in $r1.
+        movs    xxl,    r0
+      #endif  
+
+        // Sign extension to long long signed.
+        asrs    xxh,    xxl,    #31 
+
+#endif /* __OPTIMIZE_SIZE__ */
+
+CM0_FUNC_START aeabi_l2f
+CM0_FUNC_ALIAS floatdisf aeabi_l2f
+
+#if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+    CFI_START_FUNCTION
+#endif
+
+        // Save the sign.
+        asrs    r3,     xxh,     #31
+
+        // Absolute value of the input.  
+        // Could this be arranged in big-endian mode so that this block also
+        //  swapped the input words?  Maybe.  But, since neither 'eors' nor
+        //  'sbcs' allow a third destination register, it seems unlikely to
+        //  save more than one cycle.  Also, the size of __floatdisf() and 
+        //  __floatundisf() together would increase by two instructions. 
+        eors    xxl,    r3
+        eors    xxh,    r3
+        subs    xxl,    r3
+        sbcs    xxh,    r3
+
+        b       SYM(__internal_uil2f)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END floatdisf
+CM0_FUNC_END aeabi_l2f
+
+#if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+CM0_FUNC_END floatsisf
+CM0_FUNC_END aeabi_i2f
+#endif 
+
+#endif /* L_arm_floatsisf || L_arm_floatdisf */
+
+
+#ifdef L_arm_floatunsisf
+
+// float __aeabi_ui2f(unsigned)
+// Converts an unsigned integer in $r0 to float.
+.section .text.sorted.libgcc.fpcore.q.floatunsisf,"x"
+CM0_FUNC_START aeabi_ui2f
+CM0_FUNC_ALIAS floatunsisf aeabi_ui2f
+    CFI_START_FUNCTION
+
+      #if defined(__ARMEB__) && __ARMEB__ 
+        // In big-endian mode, function flow breaks down.  __floatundisf() 
+        //  wants to swap word order, but __floatunsisf() does not. The 
+        // The choice is between leaving these arguments un-swapped and 
+        //  branching, or canceling out the word swap in advance.
+        // The branching version would require one extra instruction to 
+        //  clear the sign ($r3) because of __floatdisf() dependencies.
+        // While the branching version is technically one cycle faster 
+        //  on the Cortex-M0 pipeline, branchless just feels better.
+
+        // Thus, __floatundisf() expects a big-endian lower word in $r1.
+        movs    xxl,    r0
+      #endif  
+
+        // Extend to unsigned long long and fall through.
+        eors    xxh,    xxh 
+
+#endif /* L_arm_floatunsisf */
+
+
+// The execution of __floatunsisf() flows directly into __floatundisf(), such
+//  that instructions must appear consecutively in the same memory section
+//  for proper flow control.  However, this construction inhibits the ability
+//  to discard __floatunsisf() when only using __floatundisf().
+// Therefore, this block configures __floatundisf() for compilation twice.
+// The first version is a minimal standalone implementation, and the second
+//  version is the continuation of __floatunsisf().  The standalone version
+//  must be declared WEAK, so that the combined version can supersede it
+//  and provide both symbols when required.
+// '_arm_floatundisf' should appear before '_arm_floatunsisf' in LIB1ASMFUNCS.
+#if defined(L_arm_floatunsisf) || defined(L_arm_floatundisf)
+
+#ifdef L_arm_floatundisf
+// float __aeabi_ul2f(unsigned long long)
+// Converts an unsigned 64-bit integer in $r1:$r0 to a float in $r0.
+.section .text.sorted.libgcc.fpcore.q.floatundisf,"x"
+CM0_WEAK_START aeabi_ul2f
+CM0_WEAK_ALIAS floatundisf aeabi_ul2f
+    CFI_START_FUNCTION
+
+#else 
+CM0_FUNC_START aeabi_ul2f
+CM0_FUNC_ALIAS floatundisf aeabi_ul2f
+
+#endif
+
+        // Sign is always positive.
+        eors    r3,     r3
+
+#ifdef L_arm_floatundisf 
+    CM0_WEAK_START internal_uil2f
+#else /* L_clzdi2 */
+    CM0_FUNC_START internal_uil2f
+#endif
+      #if defined(__ARMEB__) && __ARMEB__ 
+        // Swap word order for register compatibility with __fp_assemble().
+        // Could this be optimized by re-defining __fp_assemble()?  Maybe.  
+        // But the ramifications of dynamic register assignment on all 
+        //  the other callers of __fp_assemble() would be enormous.    
+        eors    r0,     r1  
+        eors    r1,     r0  
+        eors    r0,     r1  
+      #endif 
+
+#ifdef L_arm_floatundisf 
+    CM0_WEAK_START internal_uil2f_noswap
+#else /* L_clzdi2 */
+    CM0_FUNC_START internal_uil2f_noswap
+#endif
+        // Default exponent, relative to bit[30] of $r1.
+        movs    r2,     #(127 - 1 + 63)
+
+        // Format the sign.
+        lsls    r3,     #31
+        mov     ip,     r3
+
+        push    { rT, lr }
+        b       SYM(__fp_assemble)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END floatundisf
+CM0_FUNC_END aeabi_ul2f
+
+#ifdef L_arm_floatunsisf
+CM0_FUNC_END floatunsisf
+CM0_FUNC_END aeabi_ui2f
+#endif 
+
+#endif /* L_arm_floatunsisf || L_arm_floatundisf */
+
diff -ruN gcc-11-20201220-clean/libgcc/config/arm/cm0/fmul.S gcc-11-20201220/libgcc/config/arm/cm0/fmul.S
--- gcc-11-20201220-clean/libgcc/config/arm/cm0/fmul.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201220/libgcc/config/arm/cm0/fmul.S	2021-01-06 02:45:47.428262284 -0800
@@ -0,0 +1,215 @@
+/* fmul.S: Cortex M0 optimized 32-bit float multiplication
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef L_arm_mulsf3
+
+// float __aeabi_fmul(float, float)
+// Returns $r0 after multiplication by $r1.
+.section .text.sorted.libgcc.fpcore.m.fmul,"x"
+CM0_FUNC_START aeabi_fmul
+CM0_FUNC_ALIAS mulsf3 aeabi_fmul
+    CFI_START_FUNCTION
+
+        // Standard registers, compatible with exception handling.
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        // Save the sign of the result.
+        movs    rT,     r1
+        eors    rT,     r0
+        lsrs    rT,     #31
+        lsls    rT,     #31
+        mov     ip,     rT
+
+        // Set up INF for comparison.
+        movs    rT,     #255
+        lsls    rT,     #24
+
+        // Check for multiplication by zero.
+        lsls    r2,     r0,     #1
+        beq     LSYM(__fmul_zero1)
+
+        lsls    r3,     r1,     #1
+        beq     LSYM(__fmul_zero2)
+
+        // Check for INF/NAN.
+        cmp     r3,     rT
+        bhs     LSYM(__fmul_special2)
+
+        cmp     r2,     rT
+        bhs     LSYM(__fmul_special1)
+
+        // Because neither operand is INF/NAN, the result will be finite.
+        // It is now safe to modify the original operand registers.
+        lsls    r0,     #9
+
+        // Isolate the first exponent.  When normal, add back the implicit '1'.
+        // The result is always aligned with the MSB in bit [31].
+        // Subnormal mantissas remain effectively multiplied by 2x relative to
+        //  normals, but this works because the weight of a subnormal is -126.
+        lsrs    r2,     #24
+        beq     LSYM(__fmul_normalize2)
+        adds    r0,     #1
+        rors    r0,     r0
+
+    LSYM(__fmul_normalize2):
+        // IMPORTANT: exp10i() jumps in here!
+        // Repeat for the mantissa of the second operand.
+        // Short-circuit when the mantissa is 1.0, as the
+        //  first mantissa is already prepared in $r0
+        lsls    r1,     #9
+
+        // When normal, add back the implicit '1'.
+        lsrs    r3,     #24
+        beq     LSYM(__fmul_go)
+        adds    r1,     #1
+        rors    r1,     r1
+
+    LSYM(__fmul_go):
+        // Calculate the final exponent, relative to bit [30].
+        adds    rT,     r2,     r3
+        subs    rT,     #127
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+        // Short-circuit on multiplication by powers of 2.
+        lsls    r3,     r0,     #1
+        beq     LSYM(__fmul_simple1)
+
+        lsls    r3,     r1,     #1
+        beq     LSYM(__fmul_simple2)
+  #endif
+
+        // Save $ip across the call.
+        // (Alternatively, could push/pop a separate register,
+        //  but the four instructions here are equivally fast)
+        //  without imposing on the stack.
+        add     rT,     ip
+
+        // 32x32 unsigned multiplication, 64 bit result.
+        bl      SYM(__umulsidi3) __PLT__
+
+        // Separate the saved exponent and sign.
+        sxth    r2,     rT
+        subs    rT,     r2
+        mov     ip,     rT
+
+        b       SYM(__fp_assemble)
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+    LSYM(__fmul_simple2):
+        // Move the high bits of the result to $r1.
+        movs    r1,     r0
+
+    LSYM(__fmul_simple1):
+        // Clear the remainder.
+        eors    r0,     r0
+
+        // Adjust mantissa to match the exponent, relative to bit[30].
+        subs    r2,     rT,     #1
+        b       SYM(__fp_assemble)
+  #endif
+
+    LSYM(__fmul_zero1):
+        // $r0 was equal to 0, set up to check $r1 for INF/NAN.
+        lsls    r2,     r1,     #1
+
+    LSYM(__fmul_zero2):
+      #if defined(EXCEPTION_CODES) && EXCEPTION_CODES
+        movs    r3,     #(INFINITY_TIMES_ZERO)
+      #endif
+
+        // Check the non-zero operand for INF/NAN.
+        // If NAN, it should be returned.
+        // If INF, the result should be NAN.
+        // Otherwise, the result will be +/-0.
+        cmp     r2,     rT
+        beq     SYM(__fp_exception)
+
+        // If the second operand is finite, the result is 0.
+        blo     SYM(__fp_zero)
+
+      #if defined(STRICT_NANS) && STRICT_NANS
+        // Restore values that got mixed in zero testing, then go back
+        //  to sort out which one is the NAN.
+        lsls    r3,     r1,     #1
+        lsls    r2,     r0,     #1
+      #elif defined(TRAP_NANS) && TRAP_NANS
+        // Return NAN with the sign bit cleared.
+        lsrs    r0,     r2,     #1
+        b       SYM(__fp_check_nan)
+      #else
+        lsrs    r0,     r2,     #1
+        // Return NAN with the sign bit cleared.
+        pop     { rT, pc }
+                .cfi_restore_state
+      #endif
+
+    LSYM(__fmul_special2):
+        // $r1 is INF/NAN.  In case of INF, check $r0 for NAN.
+        cmp     r2,     rT
+
+      #if defined(TRAP_NANS) && TRAP_NANS
+        // Force swap if $r0 is not NAN.
+        bls     LSYM(__fmul_swap)
+
+        // $r0 is NAN, keep if $r1 is INF
+        cmp     r3,     rT
+        beq     LSYM(__fmul_special1)
+
+        // Both are NAN, keep the smaller value (more likely to signal).
+        cmp     r2,     r3
+      #endif
+
+        // Prefer the NAN already in $r0.
+        //  (If TRAP_NANS, this is the smaller NAN).
+        bhi     LSYM(__fmul_special1)
+
+    LSYM(__fmul_swap):
+        movs    r0,     r1
+
+    LSYM(__fmul_special1):
+        // $r0 is either INF or NAN.  $r1 has already been examined.
+        // Flags are already set correctly.
+        lsls    r2,     r0,     #1
+        cmp     r2,     rT
+        beq     SYM(__fp_infinity)
+
+      #if defined(TRAP_NANS) && TRAP_NANS
+        b       SYM(__fp_check_nan)
+      #else
+        pop     { rT, pc }
+                .cfi_restore_state
+      #endif
+
+    CFI_END_FUNCTION
+CM0_FUNC_END mulsf3
+CM0_FUNC_END aeabi_fmul
+
+#endif /* L_arm_mulsf3 */
+
diff -ruN gcc-11-20201220-clean/libgcc/config/arm/cm0/fneg.S gcc-11-20201220/libgcc/config/arm/cm0/fneg.S
--- gcc-11-20201220-clean/libgcc/config/arm/cm0/fneg.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201220/libgcc/config/arm/cm0/fneg.S	2021-01-06 02:45:47.428262284 -0800
@@ -0,0 +1,76 @@
+/* fneg.S: Cortex M0 optimized 32-bit float negation
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef L_arm_negsf2
+
+// float __aeabi_fneg(float) [obsolete]
+// The argument and result are in $r0.
+// Uses $r1 and $r2 as scratch registers.
+.section .text.sorted.libgcc.fpcore.a.fneg,"x"
+CM0_FUNC_START aeabi_fneg
+CM0_FUNC_ALIAS negsf2 aeabi_fneg
+    CFI_START_FUNCTION
+
+  #if (defined(STRICT_NANS) && STRICT_NANS) || \
+      (defined(TRAP_NANS) && TRAP_NANS)
+        // Check for NAN.
+        lsls    r1,     r0,     #1
+        movs    r2,     #255
+        lsls    r2,     #24
+        cmp     r1,     r2
+
+      #if defined(TRAP_NANS) && TRAP_NANS
+        blo     SYM(__fneg_nan)
+      #else
+        blo     LSYM(__fneg_return)
+      #endif
+  #endif
+
+        // Flip the sign.
+        movs    r1,     #1
+        lsls    r1,     #31
+        eors    r0,     r1
+
+    LSYM(__fneg_return):
+        RET
+
+  #if defined(TRAP_NANS) && TRAP_NANS
+    LSYM(__fneg_nan):
+        // Set up registers for exception handling.
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        b       SYM(fp_check_nan)
+  #endif
+
+    CFI_END_FUNCTION
+CM0_FUNC_END negsf2
+CM0_FUNC_END aeabi_fneg
+
+#endif /* L_arm_negsf2 */
+
diff -ruN gcc-11-20201220-clean/libgcc/config/arm/cm0/fplib.h gcc-11-20201220/libgcc/config/arm/cm0/fplib.h
--- gcc-11-20201220-clean/libgcc/config/arm/cm0/fplib.h	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201220/libgcc/config/arm/cm0/fplib.h	2021-01-06 02:45:47.432262214 -0800
@@ -0,0 +1,83 @@
+/* fplib.h: Cortex M0 optimized 64-bit header definitions 
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#ifndef __CM0_FPLIB_H
+#define __CM0_FPLIB_H 
+
+/* Enable exception interrupt handler.  
+   Exception implementation is opportunistic, and not fully tested.  */
+#define TRAP_EXCEPTIONS (0)
+#define EXCEPTION_CODES (0)
+
+/* Perform extra checks to avoid modifying the sign bit of NANs */
+#define STRICT_NANS (0)
+
+/* Trap signaling NANs regardless of context. */
+#define TRAP_NANS (0)
+
+/* TODO: Define service numbers according to the handler requirements */ 
+#define SVC_TRAP_NAN (0)
+#define SVC_FP_EXCEPTION (0)
+#define SVC_DIVISION_BY_ZERO (0)
+
+/* Push extra registers when required for 64-bit stack alignment */
+#define DOUBLE_ALIGN_STACK (1)
+
+/* Manipulate *div0() parameters to meet the ARM runtime ABI specification. */
+#define PEDANTIC_DIV0 (1)
+
+/* Define various exception codes.  These don't map to anything in particular */
+#define SUBTRACTED_INFINITY (20)
+#define INFINITY_TIMES_ZERO (21)
+#define DIVISION_0_BY_0 (22)
+#define DIVISION_INF_BY_INF (23)
+#define UNORDERED_COMPARISON (24)
+#define CAST_OVERFLOW (25)
+#define CAST_INEXACT (26)
+#define CAST_UNDEFINED (27)
+
+/* Exception control for quiet NANs.
+   If TRAP_NAN support is enabled, signaling NANs always raise exceptions. */
+#define FCMP_RAISE_EXCEPTIONS   16
+#define FCMP_NO_EXCEPTIONS      0
+
+/* The bit indexes in these assignments are significant.  See implementation.
+   They are shared publicly for eventual use by newlib.  */
+#define FCMP_3WAY           (1)
+#define FCMP_LT             (2)
+#define FCMP_EQ             (4)
+#define FCMP_GT             (8)
+
+#define FCMP_GE             (FCMP_EQ | FCMP_GT)
+#define FCMP_LE             (FCMP_LT | FCMP_EQ)
+#define FCMP_NE             (FCMP_LT | FCMP_GT)
+
+/* These flags affect the result of unordered comparisons.  See implementation.  */
+#define FCMP_UN_THREE       (128)
+#define FCMP_UN_POSITIVE    (64)
+#define FCMP_UN_ZERO        (32)
+#define FCMP_UN_NEGATIVE    (0)
+
+#endif /* __CM0_FPLIB_H */
diff -ruN gcc-11-20201220-clean/libgcc/config/arm/cm0/futil.S gcc-11-20201220/libgcc/config/arm/cm0/futil.S
--- gcc-11-20201220-clean/libgcc/config/arm/cm0/futil.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201220/libgcc/config/arm/cm0/futil.S	2021-01-06 02:45:47.432262214 -0800
@@ -0,0 +1,407 @@
+/* futil.S: Cortex M0 optimized 32-bit common routines
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+   
+#ifdef L_arm_addsubsf3
+ 
+// Internal function, decomposes the unsigned float in $r2.
+// The exponent will be returned in $r2, the mantissa in $r3.
+// If subnormal, the mantissa will be normalized, so that
+//  the MSB of the mantissa (if any) will be aligned at bit[31].
+// Preserves $r0 and $r1, uses $rT as scratch space.
+.section .text.sorted.libgcc.fpcore.y.normf,"x"
+CM0_FUNC_START fp_normalize2
+    CFI_START_FUNCTION
+
+        // Extract the mantissa.
+        lsls    r3,     r2,     #8
+
+        // Extract the exponent.
+        lsrs    r2,     #24
+        beq     SYM(__fp_lalign2)
+
+        // Restore the mantissa's implicit '1'.
+        adds    r3,     #1
+        rors    r3,     r3
+
+        RET
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fp_normalize2
+
+
+// Internal function, aligns $r3 so the MSB is aligned in bit[31].
+// Simultaneously, subtracts the shift from the exponent in $r2
+.section .text.sorted.libgcc.fpcore.z.alignf,"x"
+CM0_FUNC_START fp_lalign2
+    CFI_START_FUNCTION
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+        // Unroll the loop, similar to __clzsi2().
+        lsrs    rT,     r3,     #16
+        bne     LSYM(__align8)
+        subs    r2,     #16
+        lsls    r3,     #16
+
+    LSYM(__align8):
+        lsrs    rT,     r3,     #24
+        bne     LSYM(__align4)
+        subs    r2,     #8
+        lsls    r3,     #8
+
+    LSYM(__align4):
+        lsrs    rT,     r3,     #28
+        bne     LSYM(__align2)
+        subs    r2,     #4
+        lsls    r3,     #4
+  #endif
+
+    LSYM(__align2):
+        // Refresh the state of the N flag before entering the loop.
+        tst     r3,     r3
+
+    LSYM(__align_loop):
+        // Test before subtracting to compensate for the natural exponent.
+        // The largest subnormal should have an exponent of 0, not -1.
+        bmi     LSYM(__align_return)
+        subs    r2,     #1
+        lsls    r3,     #1
+        bne     LSYM(__align_loop)
+
+        // Not just a subnormal... 0!  By design, this should never happen.
+        // All callers of this internal function filter 0 as a special case.
+        // Was there an uncontrolled jump from somewhere else?  Cosmic ray?
+        eors    r2,     r2
+
+      #ifdef DEBUG
+        bkpt    #0
+      #endif
+
+    LSYM(__align_return):
+        RET
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fp_lalign2
+
+
+// Internal function to combine mantissa, exponent, and sign. No return.
+// Expects the unsigned result in $r1.  To avoid underflow (slower),
+//  the MSB should be in bits [31:29].
+// Expects any remainder bits of the unrounded result in $r0.
+// Expects the exponent in $r2.  The exponent must be relative to bit[30].
+// Expects the sign of the result (and only the sign) in $ip.
+// Returns a correctly rounded floating value in $r0.
+.section .text.sorted.libgcc.fpcore.g.assemblef,"x"
+CM0_FUNC_START fp_assemble
+    CFI_START_FUNCTION
+
+        // Work around CFI branching limitations.
+        .cfi_remember_state
+        .cfi_adjust_cfa_offset 8
+        .cfi_rel_offset rT, 0
+        .cfi_rel_offset lr, 4
+
+        // Examine the upper three bits [31:29] for underflow.
+        lsrs    r3,     r1,     #29
+        beq     LSYM(__fp_underflow)
+
+        // Convert bits [31:29] into an offset in the range of { 0, -1, -2 }.
+        // Right rotation aligns the MSB in bit [31], filling any LSBs with '0'.
+        lsrs    r3,     r1,     #1
+        mvns    r3,     r3
+        ands    r3,     r1
+        lsrs    r3,     #30
+        subs    r3,     #2
+        rors    r1,     r3
+
+        // Update the exponent, assuming the final result will be normal.
+        // The new exponent is 1 less than actual, to compensate for the
+        //  eventual addition of the implicit '1' in the result.
+        // If the final exponent becomes negative, proceed directly to gradual
+        //  underflow, without bothering to search for the MSB.
+        adds    r2,     r3
+
+CM0_FUNC_START fp_assemble2
+        bmi     LSYM(__fp_subnormal)
+
+    LSYM(__fp_normal):
+        // Check for overflow (remember the implicit '1' to be added later).
+        cmp     r2,     #254
+        bge     SYM(__fp_overflow)
+
+        // Save LSBs for the remainder. Position doesn't matter any more,
+        //  these are just tiebreakers for round-to-even.
+        lsls    rT,     r1,     #25
+
+        // Align the final result.
+        lsrs    r1,     #8
+
+    LSYM(__fp_round):
+        // If carry bit is '0', always round down.
+        bcc     LSYM(__fp_return)
+
+        // The carry bit is '1'.  Round to nearest, ties to even.
+        // If either the saved remainder bits [6:0], the additional remainder
+        //  bits in $r1, or the final LSB is '1', round up.
+        lsls    r3,     r1,     #31
+        orrs    r3,     rT
+        orrs    r3,     r0
+        beq     LSYM(__fp_return)
+
+        // If rounding up overflows, then the mantissa result becomes 2.0, 
+        //  which yields the correct return value up to and including INF. 
+        adds    r1,     #1
+
+    LSYM(__fp_return):
+        // Combine the mantissa and the exponent.
+        lsls    r2,     #23
+        adds    r0,     r1,     r2
+
+        // Combine with the saved sign.
+        // End of library call, return to user.
+        add     r0,     ip
+
+  #if defined(FP_EXCEPTIONS) && FP_EXCEPTIONS
+        // TODO: Underflow/inexact reporting IFF remainder
+  #endif
+
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    LSYM(__fp_underflow):
+        // Set up to align the mantissa.
+        movs    r3,     r1
+        bne     LSYM(__fp_underflow2)
+
+        // MSB wasn't in the upper 32 bits, check the remainder.
+        // If the remainder is also zero, the result is +/-0.
+        movs    r3,     r0
+        beq     SYM(__fp_zero)
+
+        eors    r0,     r0
+        subs    r2,     #32
+
+    LSYM(__fp_underflow2):
+        // Save the pre-alignment exponent to align the remainder later.
+        movs    r1,     r2
+
+        // Align the mantissa with the MSB in bit[31].
+        bl      SYM(__fp_lalign2)
+
+        // Calculate the actual remainder shift.
+        subs    rT,     r1,     r2
+
+        // Align the lower bits of the remainder.
+        movs    r1,     r0
+        lsls    r0,     rT
+
+        // Combine the upper bits of the remainder with the aligned value.
+        rsbs    rT,     #0
+        adds    rT,     #32
+        lsrs    r1,     rT
+        adds    r1,     r3
+
+        // The MSB is now aligned at bit[31] of $r1.
+        // If the net exponent is still positive, the result will be normal.
+        // Because this function is used by fmul(), there is a possibility
+        //  that the value is still wider than 24 bits; always round.
+        tst     r2,     r2
+        bpl     LSYM(__fp_normal)
+
+    LSYM(__fp_subnormal):
+        // The MSB is aligned at bit[31], with a net negative exponent.
+        // The mantissa will need to be shifted right by the absolute value of
+        //  the exponent, plus the normal shift of 8.
+
+        // If the negative shift is smaller than -25, there is no result,
+        //  no rounding, no anything.  Return signed zero.
+        // (Otherwise, the shift for result and remainder may wrap.)
+        adds    r2,     #25
+        bmi     SYM(__fp_inexact_zero)
+
+        // Save the extra bits for the remainder.
+        movs    rT,     r1
+        lsls    rT,     r2
+
+        // Shift the mantissa to create a subnormal.
+        // Just like normal, round to nearest, ties to even.
+        movs    r3,     #33
+        subs    r3,     r2
+        eors    r2,     r2
+
+        // This shift must be last, leaving the shifted LSB in the C flag.
+        lsrs    r1,     r3
+        b       LSYM(__fp_round)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fp_assemble2
+CM0_FUNC_END fp_assemble
+
+
+// Recreate INF with the appropriate sign.  No return.
+// Expects the sign of the result in $ip.
+.section .text.sorted.libgcc.fpcore.h.infinityf,"x"
+CM0_FUNC_START fp_overflow
+    CFI_START_FUNCTION
+
+  #if defined(FP_EXCEPTIONS) && FP_EXCEPTIONS
+        // TODO: inexact/overflow exception
+  #endif
+
+CM0_FUNC_START fp_infinity
+
+        // Work around CFI branching limitations.
+        .cfi_remember_state
+        .cfi_adjust_cfa_offset 8
+        .cfi_rel_offset rT, 0
+        .cfi_rel_offset lr, 4
+
+        movs    r0,     #255
+        lsls    r0,     #23
+        add     r0,     ip
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fp_infinity
+CM0_FUNC_END fp_overflow
+
+
+// Recreate 0 with the appropriate sign.  No return.
+// Expects the sign of the result in $ip.
+.section .text.sorted.libgcc.fpcore.i.zerof,"x"
+CM0_FUNC_START fp_inexact_zero
+
+  #if defined(FP_EXCEPTIONS) && FP_EXCEPTIONS
+        // TODO: inexact/underflow exception
+  #endif
+
+CM0_FUNC_START fp_zero
+    CFI_START_FUNCTION
+
+        // Work around CFI branching limitations.
+        .cfi_remember_state
+        .cfi_adjust_cfa_offset 8
+        .cfi_rel_offset rT, 0
+        .cfi_rel_offset lr, 4
+
+        // Return 0 with the correct sign.
+        mov     r0,     ip
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fp_zero
+CM0_FUNC_END fp_inexact_zero
+
+
+// Internal function to detect signaling NANs.  No return.
+// Uses $r2 as scratch space.
+.section .text.sorted.libgcc.fpcore.j.checkf,"x"
+CM0_FUNC_START fp_check_nan2
+    CFI_START_FUNCTION
+
+        // Work around CFI branching limitations.
+        .cfi_remember_state
+        .cfi_adjust_cfa_offset 8
+        .cfi_rel_offset rT, 0
+        .cfi_rel_offset lr, 4
+
+
+CM0_FUNC_START fp_check_nan
+
+        // Check for quiet NAN.
+        lsrs    r2,     r0,     #23
+        bcs     LSYM(__quiet_nan)
+
+        // Raise exception.  Preserves both $r0 and $r1.
+        svc     #(SVC_TRAP_NAN)
+
+        // Quiet the resulting NAN.
+        movs    r2,     #1
+        lsls    r2,     #22
+        orrs    r0,     r2
+
+    LSYM(__quiet_nan):
+        // End of library call, return to user.
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fp_check_nan
+CM0_FUNC_END fp_check_nan2
+
+
+// Internal function to report floating point exceptions.  No return.
+// Expects the original argument(s) in $r0 (possibly also $r1).
+// Expects a code that describes the exception in $r3.
+.section .text.sorted.libgcc.fpcore.k.exceptf,"x"
+CM0_FUNC_START fp_exception
+    CFI_START_FUNCTION
+
+        // Work around CFI branching limitations.
+        .cfi_remember_state
+        .cfi_adjust_cfa_offset 8
+        .cfi_rel_offset rT, 0
+        .cfi_rel_offset lr, 4
+
+        // Create a quiet NAN.
+        movs    r2,     #255
+        lsls    r2,     #1
+        adds    r2,     #1
+        lsls    r2,     #22
+
+      #if defined(EXCEPTION_CODES) && EXCEPTION_CODES
+        // Annotate the exception type in the NAN field.
+        // Make sure that the exception is in the valid region 
+        lsls    rT,     r3,     #13
+        orrs    r2,     rT
+      #endif
+
+// Exception handler that expects the result already in $r2,
+//  typically when the result is not going to be NAN.
+CM0_FUNC_START fp_exception2
+
+      #if defined(TRAP_EXCEPTIONS) && TRAP_EXCEPTIONS
+        svc     #(SVC_FP_EXCEPTION)
+      #endif
+
+        // TODO: Save exception flags in a static variable.
+
+        // Set up the result, now that the argument isn't required any more.
+        movs    r0,     r2
+
+        // HACK: for sincosf(), with 2 parameters to return.
+        movs    r1,     r2
+
+        // End of library call, return to user.
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END fp_exception2
+CM0_FUNC_END fp_exception
+
+#endif /* L_arm_addsubsf3 */
+
diff -ruN gcc-11-20201220-clean/libgcc/config/arm/cm0/idiv.S gcc-11-20201220/libgcc/config/arm/cm0/idiv.S
--- gcc-11-20201220-clean/libgcc/config/arm/cm0/idiv.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201220/libgcc/config/arm/cm0/idiv.S	2021-01-06 02:45:47.432262214 -0800
@@ -0,0 +1,288 @@
+/* div.S: Cortex M0 optimized 32-bit integer division
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#if 0
+ 
+// int __aeabi_idiv0(int)
+// Helper function for division by 0.
+.section .text.sorted.libgcc.idiv0,"x"
+CM0_WEAK_START aeabi_idiv0
+CM0_FUNC_ALIAS cm0_idiv0 aeabi_idiv0
+    CFI_START_FUNCTION
+
+      #if defined(TRAP_EXCEPTIONS) && TRAP_EXCEPTIONS
+        svc     #(SVC_DIVISION_BY_ZERO)
+      #endif
+
+        RET
+
+    CFI_END_FUNCTION
+CM0_FUNC_END cm0_idiv0
+CM0_FUNC_END aeabi_idiv0
+
+#endif /* L_dvmd_tls */
+
+
+#ifdef L_divsi3
+
+// int __aeabi_idiv(int, int)
+// idiv_return __aeabi_idivmod(int, int)
+// Returns signed $r0 after division by $r1.
+// Also returns the signed remainder in $r1.
+// Same parent section as __divsi3() to keep branches within range.
+.section .text.sorted.libgcc.idiv.divsi3,"x"
+CM0_FUNC_START aeabi_idivmod
+CM0_FUNC_ALIAS aeabi_idiv aeabi_idivmod
+CM0_FUNC_ALIAS divsi3 aeabi_idivmod
+    CFI_START_FUNCTION
+
+        // Extend signs.
+        asrs    r2,     r0,     #31
+        asrs    r3,     r1,     #31
+
+        // Absolute value of the denominator, abort on division by zero.
+        eors    r1,     r3
+        subs    r1,     r3
+      #if defined(PEDANTIC_DIV0) && PEDANTIC_DIV0
+        beq     LSYM(__idivmod_zero)
+      #else 
+        beq     SYM(__uidivmod_zero)
+      #endif
+
+        // Absolute value of the numerator.
+        eors    r0,     r2
+        subs    r0,     r2
+
+        // Keep the sign of the numerator in bit[31] (for the remainder).
+        // Save the XOR of the signs in bits[15:0] (for the quotient).
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        lsrs    rT,     r3,     #16
+        eors    rT,     r2
+
+        // Handle division as unsigned.
+        bl      SYM(__uidivmod_nonzero) __PLT__ 
+
+        // Set the sign of the remainder.
+        asrs    r2,     rT,     #31
+        eors    r1,     r2
+        subs    r1,     r2
+
+        // Set the sign of the quotient.
+        sxth    r3,     rT
+        eors    r0,     r3
+        subs    r0,     r3
+
+    LSYM(__idivmod_return):
+        pop     { rT, pc }
+                .cfi_restore_state
+
+  #if defined(PEDANTIC_DIV0) && PEDANTIC_DIV0
+    LSYM(__idivmod_zero):
+        // Set up the *div0() parameter specified in the ARM runtime ABI: 
+        //  * 0 if the numerator is 0,  
+        //  * Or, the largest value of the type manipulated by the calling 
+        //     division function if the numerator is positive,
+        //  * Or, the least value of the type manipulated by the calling
+        //     division function if the numerator is negative. 
+        subs    r1,     r0
+        orrs    r0,     r1
+        asrs    r0,     #31
+        lsrs    r0,     #1 
+        eors    r0,     r2 
+
+        // At least the __aeabi_idiv0() call is common.
+        b       SYM(__uidivmod_zero2)        
+  #endif /* PEDANTIC_DIV0 */
+
+    CFI_END_FUNCTION
+CM0_FUNC_END divsi3
+CM0_FUNC_END aeabi_idiv
+CM0_FUNC_END aeabi_idivmod
+
+#endif /* L_divsi3 */
+
+
+#ifdef L_udivsi3
+
+// int __aeabi_uidiv(unsigned int, unsigned int)
+// idiv_return __aeabi_uidivmod(unsigned int, unsigned int)
+// Returns unsigned $r0 after division by $r1.
+// Also returns the remainder in $r1.
+.section .text.sorted.libgcc.idiv.udivsi3,"x"
+CM0_FUNC_START aeabi_uidivmod
+CM0_FUNC_ALIAS aeabi_uidiv aeabi_uidivmod
+CM0_FUNC_ALIAS udivsi3 aeabi_uidivmod
+    CFI_START_FUNCTION
+
+        // Abort on division by zero.
+        tst     r1,     r1
+      #if defined(PEDANTIC_DIV0) && PEDANTIC_DIV0
+        beq     LSYM(__uidivmod_zero)
+      #else 
+        beq     SYM(__uidivmod_zero)
+      #endif
+
+  #if defined(OPTIMIZE_SPEED) && OPTIMIZE_SPEED
+        // MAYBE: Optimize division by a power of 2
+  #endif
+
+    // Public symbol for the sake of divsi3().  
+    CM0_FUNC_START uidivmod_nonzero 
+        // Pre division: Shift the denominator as far as possible left
+        //  without making it larger than the numerator.
+        // The loop is destructive, save a copy of the numerator.
+        mov     ip,     r0
+
+        // Set up binary search.
+        movs    r3,     #16
+        movs    r2,     #1
+
+    LSYM(__uidivmod_align):
+        // Prefer dividing the numerator to multipying the denominator
+        //  (multiplying the denominator may result in overflow).
+        lsrs    r0,     r3
+        cmp     r0,     r1
+        blo     LSYM(__uidivmod_skip)
+
+        // Multiply the denominator and the result together.
+        lsls    r1,     r3
+        lsls    r2,     r3
+
+    LSYM(__uidivmod_skip):
+        // Restore the numerator, and iterate until search goes to 0.
+        mov     r0,     ip
+        lsrs    r3,     #1
+        bne     LSYM(__uidivmod_align)
+
+        // In The result $r3 has been conveniently initialized to 0.
+        b       LSYM(__uidivmod_entry)
+
+    LSYM(__uidivmod_loop):
+        // Scale the denominator and the quotient together.
+        lsrs    r1,     #1
+        lsrs    r2,     #1
+        beq     LSYM(__uidivmod_return)
+
+    LSYM(__uidivmod_entry):
+        // Test if the denominator is smaller than the numerator.
+        cmp     r0,     r1
+        blo     LSYM(__uidivmod_loop)
+
+        // If the denominator is smaller, the next bit of the result is '1'.
+        // If the new remainder goes to 0, exit early.
+        adds    r3,     r2
+        subs    r0,     r1
+        bne     LSYM(__uidivmod_loop)
+
+    LSYM(__uidivmod_return):
+        mov     r1,     r0
+        mov     r0,     r3
+        RET
+
+  #if defined(PEDANTIC_DIV0) && PEDANTIC_DIV0
+    LSYM(__uidivmod_zero):
+        // Set up the *div0() parameter specified in the ARM runtime ABI: 
+        //  * 0 if the numerator is 0,  
+        //  * Or, the largest value of the type manipulated by the calling 
+        //     division function if the numerator is positive.
+        subs    r1,     r0
+        orrs    r0,     r1
+        asrs    r0,     #31
+
+    CM0_FUNC_START uidivmod_zero2
+      #if defined(DOUBLE_ALIGN_STACK) && DOUBLE_ALIGN_STACK
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+      #else 
+        push    { lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 4
+                .cfi_rel_offset lr, 0
+      #endif 
+
+        // Since GCC implements __aeabi_idiv0() as a weak overridable function,
+        //  this call must be prepared for a jump beyond +/- 2 KB.
+        // NOTE: __aeabi_idiv0() can't be implemented as a tail call, since any
+        //  non-trivial override will (likely) corrupt a remainder in $r1.
+        bl      SYM(__aeabi_idiv0) __PLT__
+                
+        // Since the input to __aeabi_idiv0() was INF, there really isn't any
+        //  choice in which of the recommended *divmod() patterns to follow.  
+        // Clear the remainder to complete {INF, 0}.
+        eors    r1,     r1
+
+      #if defined(DOUBLE_ALIGN_STACK) && DOUBLE_ALIGN_STACK
+        pop     { rT, pc }
+                .cfi_restore_state
+      #else 
+        pop     { pc }
+                .cfi_restore_state
+      #endif 
+        
+  #else /* !PEDANTIC_DIV0 */   
+    CM0_FUNC_START uidivmod_zero  
+        // NOTE: The following code sets up a return pair of {0, numerator},   
+        //  the second preference given by the ARM runtime ABI specification.   
+        // The pedantic version is 18 bytes larger between __aeabi_idiv() and
+        //  __aeabi_uidiv().  However, this version does not conform to the
+        //  out-of-line parameter requirements given for __aeabi_idiv0(), and
+        //  also does not pass 'gcc/testsuite/gcc.target/arm/divzero.c'.
+        
+        // Since the numerator may be overwritten by __aeabi_idiv0(), save now.
+        // Afterwards, it can be restored directly as the remainder.
+        push    { r0, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset r0, 0
+                .cfi_rel_offset lr, 4
+
+        // Set up the quotient (not ABI compliant).
+        eors    r0,     r0
+
+        // Since GCC implements div0() as a weak overridable function,
+        //  this call must be prepared for a jump beyond +/- 2 KB.
+        bl      SYM(__aeabi_idiv0) __PLT__
+
+        // Restore the remainder and return.
+        pop     { r1, pc }
+                .cfi_restore_state
+      
+  #endif /* !PEDANTIC_DIV0 */
+  
+    CFI_END_FUNCTION
+CM0_FUNC_END udivsi3
+CM0_FUNC_END aeabi_uidiv
+CM0_FUNC_END aeabi_uidivmod
+
+#endif /* L_udivsi3 */
+
diff -ruN gcc-11-20201220-clean/libgcc/config/arm/cm0/lcmp.S gcc-11-20201220/libgcc/config/arm/cm0/lcmp.S
--- gcc-11-20201220-clean/libgcc/config/arm/cm0/lcmp.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201220/libgcc/config/arm/cm0/lcmp.S	2021-01-06 02:45:47.432262214 -0800
@@ -0,0 +1,136 @@
+/* lcmp.S: Cortex M0 optimized 64-bit integer comparison
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+   
+#if defined(L_arm_lcmp) || defined(L_cmpdi2)   
+
+#ifdef L_arm_lcmp
+.section .text.sorted.libgcc.lcmp,"x"
+  #define LCMP_NAME aeabi_lcmp
+#else
+.section .text.sorted.libgcc.cmpdi2,"x"
+  #define LCMP_NAME cmpdi2 
+#endif
+
+// int __aeabi_lcmp(long long, long long)
+// int __cmpdi2(long long, long long)
+// Compares the 64 bit signed values in $r1:$r0 and $r3:$r2.
+// lcmp() returns $r0 = { -1, 0, +1 } for orderings { <, ==, > } respectively.
+// cmpdi2() returns $r0 = { 0, 1, 2 } for orderings { <, ==, > } respectively.
+// Object file duplication assumes typical programs follow one runtime ABI.
+CM0_FUNC_START LCMP_NAME
+    CFI_START_FUNCTION
+
+        // Calculate the difference $r1:$r0 - $r3:$r2.
+        subs    xxl,    yyl 
+        sbcs    xxh,    yyh 
+
+        // With $r2 free, create a reference offset without affecting flags.
+        // Originally implemented as 'mov r2, r3' for ARM architectures 6+
+        //  with unified syntex.  However, this resulted in a compiler error 
+        //  for thumb-1: "MOV Rd, Rs with two low registers not permitted".  
+        // Since unified syntax deprecates the "cpy" instruction, shouldn't
+        //  there be a backwards-compatible tranlation in the assembler?
+        cpy     r2,     r3
+
+        // Finish the comparison.
+        blt     LSYM(__lcmp_lt)
+
+        // The reference offset ($r2 - $r3) will be +2 iff the first
+        //  argument is larger, otherwise the reference offset remains 0.
+        adds    r2,     #2
+
+    LSYM(__lcmp_lt):
+        // Check for zero equality (all 64 bits).
+        // It doesn't matter which register was originally "hi". 
+        orrs    r0,     r1
+        beq     LSYM(__lcmp_return)
+
+        // Convert the relative offset to an absolute value +/-1.
+        subs    r0,     r2,     r3
+        subs    r0,     #1
+
+    LSYM(__lcmp_return):
+      #ifdef L_cmpdi2 
+        // Shift to the correct output specification.
+        adds    r0,     #1
+      #endif 
+
+        RET
+
+    CFI_END_FUNCTION
+CM0_FUNC_END LCMP_NAME 
+
+#endif /* L_arm_lcmp || L_cmpdi2 */
+
+
+#if defined(L_arm_ulcmp) || defined(L_ucmpdi2)
+
+#ifdef L_arm_ulcmp
+.section .text.sorted.libgcc.ulcmp,"x"
+  #define ULCMP_NAME aeabi_ulcmp
+#else
+.section .text.sorted.libgcc.ucmpdi2,"x"
+  #define ULCMP_NAME ucmpdi2 
+#endif
+
+// int __aeabi_ulcmp(unsigned long long, unsigned long long)
+// int __ucmpdi2(unsigned long long, unsigned long long)
+// Compares the 64 bit unsigned values in $r1:$r0 and $r3:$r2.
+// ulcmp() returns $r0 = { -1, 0, +1 } for orderings { <, ==, > } respectively.
+// ucmpdi2() returns $r0 = { 0, 1, 2 } for orderings { <, ==, > } respectively.
+// Object file duplication assumes typical programs follow one runtime ABI.
+CM0_FUNC_START ULCMP_NAME 
+    CFI_START_FUNCTION
+
+        // Calculate the 'C' flag.
+        subs    xxl,    yyl 
+        sbcs    xxh,    yyh 
+
+        // Capture the carry flg.  
+        // $r2 will contain -1 if the first value is smaller,
+        //  0 if the first value is larger or equal.
+        sbcs    r2,     r2
+
+        // Check for zero equality (all 64 bits).
+        // It doesn't matter which register was originally "hi". 
+        orrs    r0,     r1
+        beq     LSYM(__ulcmp_return)
+
+        // $r0 should contain +1 or -1
+        movs    r0,     #1
+        orrs    r0,     r2
+
+    LSYM(__ulcmp_return):
+      #ifdef L_ucmpdi2 
+        adds    r0,     #1
+      #endif 
+
+        RET
+
+    CFI_END_FUNCTION
+CM0_FUNC_END ULCMP_NAME
+
+#endif /* L_arm_ulcmp || L_ucmpdi2 */
+
diff -ruN gcc-11-20201220-clean/libgcc/config/arm/cm0/ldiv.S gcc-11-20201220/libgcc/config/arm/cm0/ldiv.S
--- gcc-11-20201220-clean/libgcc/config/arm/cm0/ldiv.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201220/libgcc/config/arm/cm0/ldiv.S	2021-01-06 02:45:47.432262214 -0800
@@ -0,0 +1,482 @@
+/* ldiv.S: Cortex M0 optimized 64-bit integer division
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#if 0
+
+// long long __aeabi_ldiv0(long long)
+// Helper function for division by 0.
+.section .text.sorted.libgcc.ldiv0,"x"
+CM0_WEAK_START aeabi_ldiv0
+    CFI_START_FUNCTION
+
+      #if defined(TRAP_EXCEPTIONS) && TRAP_EXCEPTIONS
+        svc     #(SVC_DIVISION_BY_ZERO)
+      #endif
+
+        RET
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_ldiv0
+
+#endif /* L_dvmd_tls */
+
+
+#ifdef L_divdi3 
+
+// long long __aeabi_ldiv(long long, long long)
+// lldiv_return __aeabi_ldivmod(long long, long long)
+// Returns signed $r1:$r0 after division by $r3:$r2.
+// Also returns the remainder in $r3:$r2.
+// Same parent section as __divsi3() to keep branches within range.
+.section .text.sorted.libgcc.ldiv.divdi3,"x"
+CM0_FUNC_START aeabi_ldivmod
+CM0_FUNC_ALIAS aeabi_ldiv aeabi_ldivmod
+CM0_FUNC_ALIAS divdi3 aeabi_ldivmod
+    CFI_START_FUNCTION
+
+        // Test the denominator for zero before pushing registers.
+        cmp     yyl,    #0
+        bne     LSYM(__ldivmod_valid)
+
+        cmp     yyh,    #0
+      #if defined(PEDANTIC_DIV0) && PEDANTIC_DIV0
+        beq     LSYM(__ldivmod_zero)
+      #else
+        beq     SYM(__uldivmod_zero)
+      #endif
+
+    LSYM(__ldivmod_valid):
+      #if defined(DOUBLE_ALIGN_STACK) && DOUBLE_ALIGN_STACK
+        push    { rP, rQ, rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 16
+                .cfi_rel_offset rP, 0
+                .cfi_rel_offset rQ, 4
+                .cfi_rel_offset rT, 8
+                .cfi_rel_offset lr, 12
+      #else
+        push    { rP, rQ, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 12
+                .cfi_rel_offset rP, 0
+                .cfi_rel_offset rQ, 4
+                .cfi_rel_offset lr, 8
+      #endif
+
+        // Absolute value of the numerator.
+        asrs    rP,     xxh,    #31
+        eors    xxl,    rP
+        eors    xxh,    rP
+        subs    xxl,    rP
+        sbcs    xxh,    rP
+
+        // Absolute value of the denominator.
+        asrs    rQ,     yyh,    #31
+        eors    yyl,    rQ
+        eors    yyh,    rQ
+        subs    yyl,    rQ
+        sbcs    yyh,    rQ
+
+        // Keep the XOR of signs for the quotient.
+        eors    rQ,     rP
+
+        // Handle division as unsigned.
+        bl      SYM(__uldivmod_nonzero) __PLT__
+
+        // Set the sign of the quotient.
+        eors    xxl,    rQ
+        eors    xxh,    rQ
+        subs    xxl,    rQ
+        sbcs    xxh,    rQ
+
+        // Set the sign of the remainder.
+        eors    yyl,    rP
+        eors    yyh,    rP
+        subs    yyl,    rP
+        sbcs    yyh,    rP
+
+    LSYM(__ldivmod_return):
+      #if defined(DOUBLE_ALIGN_STACK) && DOUBLE_ALIGN_STACK
+        pop     { rP, rQ, rT, pc }
+                .cfi_restore_state
+      #else
+        pop     { rP, rQ, pc }
+                .cfi_restore_state
+      #endif
+
+  #if defined(PEDANTIC_DIV0) && PEDANTIC_DIV0
+    LSYM(__ldivmod_zero):
+        // Save the sign of the numerator.
+        asrs    yyl,     xxh,    #31
+
+        // Set up the *div0() parameter specified in the ARM runtime ABI:
+        //  * 0 if the numerator is 0,
+        //  * Or, the largest value of the type manipulated by the calling
+        //     division function if the numerator is positive,
+        //  * Or, the least value of the type manipulated by the calling
+        //     division function if the numerator is negative.
+        rsbs    xxl,    #0
+        sbcs    yyh,    xxh
+        orrs    xxh,    yyh
+        asrs    xxl,    xxh,   #31
+        lsrs    xxh,    xxl,   #1
+        eors    xxh,    yyl
+        eors    xxl,    yyl 
+
+        // At least the __aeabi_ldiv0() call is common.
+        b       SYM(__uldivmod_zero2)
+  #endif /* PEDANTIC_DIV0 */
+
+    CFI_END_FUNCTION
+CM0_FUNC_END divdi3
+CM0_FUNC_END aeabi_ldiv
+CM0_FUNC_END aeabi_ldivmod
+
+#endif /* L_divdi3 */
+
+
+#ifdef L_udivdi3 
+
+// unsigned long long __aeabi_uldiv(unsigned long long, unsigned long long)
+// ulldiv_return __aeabi_uldivmod(unsigned long long, unsigned long long)
+// Returns unsigned $r1:$r0 after division by $r3:$r2.
+// Also returns the remainder in $r3:$r2.
+.section .text.sorted.libgcc.ldiv.udivdi3,"x"
+CM0_FUNC_START aeabi_uldivmod
+CM0_FUNC_ALIAS aeabi_uldiv aeabi_uldivmod
+CM0_FUNC_ALIAS udivdi3 aeabi_uldivmod
+    CFI_START_FUNCTION
+
+        // Test the denominator for zero before changing the stack.
+        cmp     yyh,    #0
+        bne     SYM(__uldivmod_nonzero)
+
+        cmp     yyl,    #0
+      #if defined(PEDANTIC_DIV0) && PEDANTIC_DIV0
+        beq     LSYM(__uldivmod_zero)
+      #else
+        beq     SYM(__uldivmod_zero)
+      #endif
+
+  #if defined(OPTIMIZE_SPEED) && OPTIMIZE_SPEED
+        // MAYBE: Optimize division by a power of 2
+  #endif
+
+    CM0_FUNC_START uldivmod_nonzero
+        push    { rP, rQ, rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 16
+                .cfi_rel_offset rP, 0
+                .cfi_rel_offset rQ, 4
+                .cfi_rel_offset rT, 8
+                .cfi_rel_offset lr, 12
+
+        // Set up denominator shift, assuming a single width result.
+        movs    rP,     #32
+
+        // If the upper word of the denominator is 0 ...
+        tst     yyh,    yyh
+        bne     LSYM(__uldivmod_setup)
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+        // ... and the upper word of the numerator is also 0,
+        //  single width division will be at least twice as fast.
+        tst     xxh,    xxh
+        beq     LSYM(__uldivmod_small)
+  #endif
+
+        // ... and the lower word of the denominator is less than or equal
+        //     to the upper word of the numerator ...
+        cmp     xxh,    yyl
+        blo     LSYM(__uldivmod_setup)
+
+        //  ... then the result will be double width, at least 33 bits.
+        // Set up a flag in $rP to seed the shift for the second word.
+        movs    yyh,    yyl
+        eors    yyl,    yyl
+        adds    rP,     #64
+
+    LSYM(__uldivmod_setup):
+        // Pre division: Shift the denominator as far as possible left
+        //  without making it larger than the numerator.
+        // Since search is destructive, first save a copy of the numerator.
+        mov     ip,     xxl
+        mov     lr,     xxh
+
+        // Set up binary search.
+        movs    rQ,     #16
+        eors    rT,     rT
+
+    LSYM(__uldivmod_align):
+        // Maintain a secondary shift $rT = 32 - $rQ, making the overlapping
+        //  shifts between low and high words easier to construct.
+        adds    rT,     rQ
+
+        // Prefer dividing the numerator to multipying the denominator
+        //  (multiplying the denominator may result in overflow).
+        lsrs    xxh,    rQ
+
+        // Measure the high bits of denominator against the numerator.
+        cmp     xxh,    yyh
+        blo     LSYM(__uldivmod_skip)
+        bhi     LSYM(__uldivmod_shift)
+
+        // If the high bits are equal, construct the low bits for checking.
+        mov     xxh,    lr
+        lsls    xxh,    rT
+
+        lsrs    xxl,    rQ
+        orrs    xxh,    xxl
+
+        cmp     xxh,    yyl
+        blo     LSYM(__uldivmod_skip)
+
+    LSYM(__uldivmod_shift):
+        // Scale the denominator and the result together.
+        subs    rP,     rQ
+
+        // If the reduced numerator is still larger than or equal to the
+        //  denominator, it is safe to shift the denominator left.
+        movs    xxh,    yyl
+        lsrs    xxh,    rT
+        lsls    yyh,    rQ
+
+        lsls    yyl,    rQ
+        orrs    yyh,    xxh
+
+    LSYM(__uldivmod_skip):
+        // Restore the numerator.
+        mov     xxl,    ip
+        mov     xxh,    lr
+
+        // Iterate until the shift goes to 0.
+        lsrs    rQ,     #1
+        bne     LSYM(__uldivmod_align)
+
+        // Initialize the result (zero).
+        mov     ip,     rQ
+
+        // HACK: Compensate for the first word test.
+        lsls    rP,     #6
+
+    LSYM(__uldivmod_word2):
+        // Is there another word?
+        lsrs    rP,     #6
+        beq     LSYM(__uldivmod_return)
+
+        // Shift the calculated result by 1 word.
+        mov     lr,     ip
+        mov     ip,     rQ
+
+        // Set up the MSB of the next word of the quotient
+        movs    rQ,     #1
+        rors    rQ,     rP
+        b     LSYM(__uldivmod_entry)
+
+    LSYM(__uldivmod_loop):
+        // Divide the denominator by 2.
+        // It could be slightly faster to multiply the numerator,
+        //  but that would require shifting the remainder at the end.
+        lsls    rT,     yyh,    #31
+        lsrs    yyh,    #1
+        lsrs    yyl,    #1
+        adds    yyl,    rT
+
+        // Step to the next bit of the result.
+        lsrs    rQ,     #1
+        beq     LSYM(__uldivmod_word2)
+
+    LSYM(__uldivmod_entry):
+        // Test if the denominator is smaller, high byte first.
+        cmp     xxh,    yyh
+        blo     LSYM(__uldivmod_loop)
+        bhi     LSYM(__uldivmod_quotient)
+
+        cmp     xxl,    yyl
+        blo     LSYM(__uldivmod_loop)
+
+    LSYM(__uldivmod_quotient):
+        // Smaller denominator: the next bit of the quotient will be set.
+        add     ip,     rQ
+
+        // Subtract the denominator from the remainder.
+        // If the new remainder goes to 0, exit early.
+        subs    xxl,    yyl
+        sbcs    xxh,    yyh
+        bne     LSYM(__uldivmod_loop)
+
+        tst     xxl,    xxl
+        bne     LSYM(__uldivmod_loop)
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+        // Check whether there's still a second word to calculate.
+        lsrs    rP,     #6
+        beq     LSYM(__uldivmod_return)
+
+        // If so, shift the result left by a full word.
+        mov     lr,     ip
+        mov     ip,     xxh // zero
+  #else
+        eors    rQ,     rQ
+        b       LSYM(__uldivmod_word2)
+  #endif
+
+    LSYM(__uldivmod_return):
+        // Move the remainder to the second half of the result.
+        movs    yyl,    xxl
+        movs    yyh,    xxh
+
+        // Move the quotient to the first half of the result.
+        mov     xxl,    ip
+        mov     xxh,    lr
+
+        pop     { rP, rQ, rT, pc }
+                .cfi_restore_state
+
+  #if defined(PEDANTIC_DIV0) && PEDANTIC_DIV0
+    LSYM(__uldivmod_zero):
+        // Set up the *div0() parameter specified in the ARM runtime ABI:
+        //  * 0 if the numerator is 0,
+        //  * Or, the largest value of the type manipulated by the calling
+        //     division function if the numerator is positive.
+        subs    yyl,    xxl
+        sbcs    yyh,    xxh
+        orrs    xxh,    yyh
+        asrs    xxh,    #31
+        movs    xxl,    xxh
+
+    CM0_FUNC_START uldivmod_zero2
+      #if defined(DOUBLE_ALIGN_STACK) && DOUBLE_ALIGN_STACK
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+      #else
+        push    { lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 4
+                .cfi_rel_offset lr, 0
+      #endif
+
+        // Since GCC implements __aeabi_ldiv0() as a weak overridable function,
+        //  this call must be prepared for a jump beyond +/- 2 KB.
+        // NOTE: __aeabi_ldiv0() can't be implemented as a tail call, since any
+        //  non-trivial override will (likely) corrupt a remainder in $r3:$r2.
+        bl      SYM(__aeabi_ldiv0) __PLT__
+
+        // Since the input to __aeabi_ldiv0() was INF, there really isn't any
+        //  choice in which of the recommended *divmod() patterns to follow.
+        // Clear the remainder to complete {INF, 0}.
+        eors    yyl,    yyl
+        eors    yyh,    yyh
+
+      #if defined(DOUBLE_ALIGN_STACK) && DOUBLE_ALIGN_STACK
+        pop     { rT, pc }
+                .cfi_restore_state
+      #else
+        pop     { pc }
+                .cfi_restore_state
+      #endif
+
+  #else /* !PEDANTIC_DIV0 */
+    CM0_FUNC_START uldivmod_zero
+        // NOTE: The following code sets up a return pair of {0, numerator},
+        //  the second preference given by the ARM runtime ABI specification.
+        // The pedantic version is 30 bytes larger between __aeabi_ldiv() and
+        //  __aeabi_uldiv().  However, this version does not conform to the
+        //  out-of-line parameter requirements given for __aeabi_ldiv0(), and
+        //  also does not pass 'gcc/testsuite/gcc.target/arm/divzero.c'.
+
+        // Since the numerator may be overwritten by __aeabi_ldiv0(), save now.
+        // Afterwards, they can be restored directly as the remainder.
+      #if defined(DOUBLE_ALIGN_STACK) && DOUBLE_ALIGN_STACK
+        push    { r0, r1, rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 16
+                .cfi_rel_offset xxl,0
+                .cfi_rel_offset xxh,4
+                .cfi_rel_offset rT, 8
+                .cfi_rel_offset lr, 12
+      #else
+        push    { r0, r1, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 12
+                .cfi_rel_offset xxl,0
+                .cfi_rel_offset xxh,4
+                .cfi_rel_offset lr, 8
+      #endif
+
+        // Set up the quotient.
+        eors    xxl,    xxl
+        eors    xxh,    xxh
+
+        // Since GCC implements div0() as a weak overridable function,
+        //  this call must be prepared for a jump beyond +/- 2 KB.
+        bl      SYM(__aeabi_ldiv0) __PLT__
+
+        // Restore the remainder and return.  
+      #if defined(DOUBLE_ALIGN_STACK) && DOUBLE_ALIGN_STACK
+        pop     { r2, r3, rT, pc }
+                .cfi_restore_state
+      #else
+        pop     { r2, r3, pc }
+                .cfi_restore_state
+      #endif
+  #endif /* !PEDANTIC_DIV0 */
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+    LSYM(__uldivmod_small):
+        // Arrange operands for (much faster) 32-bit division.
+      #if defined(__ARMEB__) && __ARMEB__
+        movs    r0,     r1
+        movs    r1,     r3
+      #else 
+        movs    r1,     r2 
+      #endif 
+
+        bl      SYM(__uidivmod_nonzero) __PLT__
+
+        // Arrange results back into 64-bit format. 
+      #if defined(__ARMEB__) && __ARMEB__
+        movs    r3,     r1
+        movs    r1,     r0
+      #else 
+        movs    r2,     r1
+      #endif
+ 
+        // Extend quotient and remainder to 64 bits, unsigned.
+        eors    xxh,    xxh
+        eors    yyh,    yyh
+        pop     { rP, rQ, rT, pc }
+  #endif
+
+    CFI_END_FUNCTION
+CM0_FUNC_END udivdi3
+CM0_FUNC_END aeabi_uldiv
+CM0_FUNC_END aeabi_uldivmod
+
+#endif /* L_udivdi3 */
+
diff -ruN gcc-11-20201220-clean/libgcc/config/arm/cm0/lmul.S gcc-11-20201220/libgcc/config/arm/cm0/lmul.S
--- gcc-11-20201220-clean/libgcc/config/arm/cm0/lmul.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201220/libgcc/config/arm/cm0/lmul.S	2021-01-06 02:45:47.432262214 -0800
@@ -0,0 +1,213 @@
+/* lmul.S: Cortex M0 optimized 64-bit integer multiplication 
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+   
+#ifdef L_muldi3 
+
+// long long __aeabi_lmul(long long, long long)
+// Returns the least significant 64 bits of a 64 bit multiplication.
+// Expects the two multiplicands in $r1:$r0 and $r3:$r2.
+// Returns the product in $r1:$r0 (does not distinguish signed types).
+// Uses $r4 and $r5 as scratch space.
+.section .text.sorted.libgcc.lmul.muldi3,"x"
+CM0_FUNC_START aeabi_lmul
+CM0_FUNC_ALIAS muldi3 aeabi_lmul
+    CFI_START_FUNCTION
+
+        // $r1:$r0 = 0xDDDDCCCCBBBBAAAA
+        // $r3:$r2 = 0xZZZZYYYYXXXXWWWW
+
+        // The following operations that only affect the upper 64 bits
+        //  can be safely discarded:
+        //   DDDD * ZZZZ
+        //   DDDD * YYYY
+        //   DDDD * XXXX
+        //   CCCC * ZZZZ
+        //   CCCC * YYYY
+        //   BBBB * ZZZZ
+
+        // MAYBE: Test for multiply by ZERO on implementations with a 32-cycle
+        //  'muls' instruction, and skip over the operation in that case.
+
+        // (0xDDDDCCCC * 0xXXXXWWWW), free $r1
+        muls    xxh,    yyl 
+
+        // (0xZZZZYYYY * 0xBBBBAAAA), free $r3
+        muls    yyh,    xxl 
+        adds    yyh,    xxh 
+
+        // Put the parameters in the correct form for umulsidi3().
+        movs    xxh,    yyl 
+        b       LSYM(__mul_overflow)
+
+    CFI_END_FUNCTION
+CM0_FUNC_END aeabi_lmul
+CM0_FUNC_END muldi3
+
+#endif /* L_muldi3 */
+
+
+// The following implementation of __umulsidi3() integrates with __muldi3()
+//  above to allow the fast tail call while still preserving the extra  
+//  hi-shifted bits of the result.  However, these extra bits add a few 
+//  instructions not otherwise required when using only __umulsidi3().
+// Therefore, this block configures __umulsidi3() for compilation twice.
+// The first version is a minimal standalone implementation, and the second
+//  version adds the hi bits of __muldi3().  The standalone version must
+//  be declared WEAK, so that the combined version can supersede it and
+//  provide both symbols in programs that multiply long doubles.
+// This means '_umulsidi3' should appear before '_muldi3' in LIB1ASMFUNCS.
+#if defined(L_muldi3) || defined(L_umulsidi3)
+
+#ifdef L_umulsidi3
+// unsigned long long __umulsidi3(unsigned int, unsigned int)
+// Returns all 64 bits of a 32 bit multiplication.
+// Expects the two multiplicands in $r0 and $r1.
+// Returns the product in $r1:$r0.
+// Uses $r3, $r4 and $ip as scratch space.
+.section .text.sorted.libgcc.lmul.umulsidi3,"x"
+CM0_WEAK_START umulsidi3
+    CFI_START_FUNCTION
+
+#else /* L_muldi3 */
+CM0_FUNC_START umulsidi3
+    CFI_START_FUNCTION
+
+        // 32x32 multiply with 64 bit result.
+        // Expand the multiply into 4 parts, since muls only returns 32 bits.
+        //         (a16h * b16h / 2^32)
+        //       + (a16h * b16l / 2^48) + (a16l * b16h / 2^48)
+        //       + (a16l * b16l / 2^64)
+
+        // MAYBE: Test for multiply by 0 on implementations with a 32-cycle
+        //  'muls' instruction, and skip over the operation in that case.
+
+        eors    yyh,    yyh 
+
+    LSYM(__mul_overflow):
+        mov     ip,     yyh 
+
+#endif /* !L_muldi3 */ 
+
+        // a16h * b16h
+        lsrs    r2,     xxl,    #16
+        lsrs    r3,     xxh,    #16
+        muls    r2,     r3
+
+      #ifdef L_muldi3 
+        add     ip,     r2
+      #else 
+        mov     ip,     r2
+      #endif 
+
+        // a16l * b16h; save a16h first!
+        lsrs    r2,     xxl,    #16
+    #if (__ARM_ARCH >= 6)    
+        uxth    xxl,    xxl
+    #else /* __ARM_ARCH < 6 */
+        lsls    xxl,    #16
+        lsrs    xxl,    #16 
+    #endif  
+        muls    r3,     xxl
+
+        // a16l * b16l
+    #if (__ARM_ARCH >= 6)    
+        uxth    xxh,    xxh 
+    #else /* __ARM_ARCH < 6 */
+        lsls    xxh,    #16
+        lsrs    xxh,    #16 
+    #endif  
+        muls    xxl,    xxh 
+
+        // a16h * b16l
+        muls    xxh,    r2
+
+        // Distribute intermediate results.
+        eors    r2,     r2
+        adds    xxh,    r3
+        adcs    r2,     r2
+        lsls    r3,     xxh,    #16
+        lsrs    xxh,    #16
+        lsls    r2,     #16
+        adds    xxl,    r3
+        adcs    xxh,    r2
+
+        // Add in the high bits.
+        add     xxh,     ip
+
+        RET
+
+    CFI_END_FUNCTION
+CM0_FUNC_END umulsidi3
+
+#endif /* L_muldi3 || L_umulsidi3 */
+
+
+#ifdef L_mulsidi3
+
+// long long mulsidi3(int, int)
+// Returns all 64 bits of a 32 bit signed multiplication.
+// Expects the two multiplicands in $r0 and $r1.
+// Returns the product in $r1:$r0.
+// Uses $r3, $r4 and $rT as scratch space.
+.section .text.sorted.libgcc.lmul.mulsidi3,"x"
+CM0_FUNC_START mulsidi3
+    CFI_START_FUNCTION
+
+        // Push registers for function call.
+        push    { rT, lr }
+                .cfi_remember_state
+                .cfi_adjust_cfa_offset 8
+                .cfi_rel_offset rT, 0
+                .cfi_rel_offset lr, 4
+
+        // Save signs of the arguments.
+        asrs    r3,     r0,     #31
+        asrs    rT,     r1,     #31
+
+        // Absolute value of the arguments.
+        eors    r0,     r3
+        eors    r1,     rT
+        subs    r0,     r3
+        subs    r1,     rT
+
+        // Save sign of the result.
+        eors    rT,     r3
+
+        bl      SYM(__umulsidi3) __PLT__
+
+        // Apply sign of the result.
+        eors    xxl,     rT
+        eors    xxh,     rT
+        subs    xxl,     rT
+        sbcs    xxh,     rT
+
+        pop     { rT, pc }
+                .cfi_restore_state
+
+    CFI_END_FUNCTION
+CM0_FUNC_END mulsidi3
+
+#endif /* L_mulsidi3 */
+
diff -ruN gcc-11-20201220-clean/libgcc/config/arm/cm0/lshift.S gcc-11-20201220/libgcc/config/arm/cm0/lshift.S
--- gcc-11-20201220-clean/libgcc/config/arm/cm0/lshift.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201220/libgcc/config/arm/cm0/lshift.S	2021-01-06 02:45:47.432262214 -0800
@@ -0,0 +1,202 @@
+/* lshift.S: Cortex M0 optimized 64-bit integer shift 
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+   
+
+#ifdef L_ashldi3
+
+// long long __aeabi_llsl(long long, int)
+// Logical shift left the 64 bit value in $r1:$r0 by the count in $r2.
+// The result is only guaranteed for shifts in the range of '0' to '63'.
+// Uses $r3 as scratch space.
+.section .text.sorted.libgcc.ashldi3,"x"
+CM0_FUNC_START aeabi_llsl
+CM0_FUNC_ALIAS ashldi3 aeabi_llsl
+    CFI_START_FUNCTION
+
+  #if defined(__thumb__) && __thumb__
+
+        // Save a copy for the remainder.
+        movs    r3,     xxl 
+
+        // Assume a simple shift.
+        lsls    xxl,    r2
+        lsls    xxh,    r2
+
+        // Test if the shift distance is larger than 1 word.
+        subs    r2,     #32
+        bhs     LSYM(__llsl_large)
+
+        // The remainder is opposite the main shift, (32 - x) bits.
+        rsbs    r2,     #0
+        lsrs    r3,     r2
+
+        // Cancel any remaining shift.
+        eors    r2,     r2
+
+    LSYM(__llsl_large):
+        // Apply any remaining shift
+        lsls    r3,     r2
+
+        // Merge remainder and result.
+        adds    xxh,    r3
+        RET
+
+  #else /* !__thumb__ */
+
+        // Moved here from lib1funcs.S
+        subs    r3,     r2,     #32
+        rsb     ip,     r2,     #32
+        movmi   xxh,    xxh,    lsl r2
+        movpl   xxh,    xxl,    lsl r3
+        orrmi   xxh,    xxh,    xxl,    lsr ip
+        mov     xxl,    xxl,    lsl r2
+        RET
+
+  #endif /* !__thumb__ */
+
+    CFI_END_FUNCTION
+CM0_FUNC_END ashldi3
+CM0_FUNC_END aeabi_llsl
+
+#endif /* L_ashldi3 */
+
+
+#ifdef L_lshrdi3
+
+// long long __aeabi_llsr(long long, int)
+// Logical shift right the 64 bit value in $r1:$r0 by the count in $r2.
+// The result is only guaranteed for shifts in the range of '0' to '63'.
+// Uses $r3 as scratch space.
+.section .text.sorted.libgcc.lshrdi3,"x"
+CM0_FUNC_START aeabi_llsr
+CM0_FUNC_ALIAS lshrdi3 aeabi_llsr
+    CFI_START_FUNCTION
+
+  #if defined(__thumb__) && __thumb__
+
+        // Save a copy for the remainder.
+        movs    r3,     xxh 
+
+        // Assume a simple shift.
+        lsrs    xxl,    r2
+        lsrs    xxh,    r2
+
+        // Test if the shift distance is larger than 1 word.
+        subs    r2,     #32
+        bhs     LSYM(__llsr_large)
+
+        // The remainder is opposite the main shift, (32 - x) bits.
+        rsbs    r2,     #0
+        lsls    r3,     r2
+
+        // Cancel any remaining shift.
+        eors    r2,     r2
+
+    LSYM(__llsr_large):
+        // Apply any remaining shift
+        lsrs    r3,     r2
+
+        // Merge remainder and result.
+        adds    xxl,    r3
+        RET
+
+  #else /* !__thumb__ */
+
+        // Moved here from lib1funcs.S
+        subs    r3,     r2,     #32
+        rsb     ip,     r2,     #32
+        movmi   xxl,    xxl,    lsr r2
+        movpl   xxl,    xxh,    lsr r3
+        orrmi   xxl,    xxl,    xxh,    lsl ip
+        mov     xxh,    xxh,    lsr r2
+        RET
+
+  #endif /* !__thumb__ */
+
+
+    CFI_END_FUNCTION
+CM0_FUNC_END lshrdi3
+CM0_FUNC_END aeabi_llsr
+
+#endif /* L_lshrdi3 */
+
+
+#ifdef L_ashrdi3
+
+// long long __aeabi_lasr(long long, int)
+// Arithmetic shift right the 64 bit value in $r1:$r0 by the count in $r2.
+// The result is only guaranteed for shifts in the range of '0' to '63'.
+// Uses $r3 as scratch space.
+.section .text.sorted.libgcc.ashrdi3,"x"
+CM0_FUNC_START aeabi_lasr
+CM0_FUNC_ALIAS ashrdi3 aeabi_lasr
+    CFI_START_FUNCTION
+
+  #if defined(__thumb__) && __thumb__
+
+        // Save a copy for the remainder.
+        movs    r3,     xxh 
+
+        // Assume a simple shift.
+        lsrs    xxl,    r2
+        asrs    xxh,    r2
+
+        // Test if the shift distance is larger than 1 word.
+        subs    r2,     #32
+        bhs     LSYM(__lasr_large)
+
+        // The remainder is opposite the main shift, (32 - x) bits.
+        rsbs    r2,     #0
+        lsls    r3,     r2
+
+        // Cancel any remaining shift.
+        eors    r2,     r2
+
+    LSYM(__lasr_large):
+        // Apply any remaining shift
+        asrs    r3,     r2
+
+        // Merge remainder and result.
+        adds    xxl,    r3
+        RET
+
+  #else /* !__thumb__ */
+
+        // Moved here from lib1funcs.S
+        subs    r3,     r2,     #32
+        rsb     ip,     r2,     #32
+        movmi   xxl,    xxl,    lsr r2
+        movpl   xxl,    xxh,    asr r3
+        orrmi   xxl,    xxl,    xxh,    lsl ip
+        mov     xxh,    xxh,    asr r2
+        RET
+
+  #endif /* !__thumb__ */  
+
+    CFI_END_FUNCTION
+CM0_FUNC_END ashrdi3
+CM0_FUNC_END aeabi_lasr
+
+#endif /* L_ashrdi3 */
+
diff -ruN gcc-11-20201220-clean/libgcc/config/arm/cm0/parity.S gcc-11-20201220/libgcc/config/arm/cm0/parity.S
--- gcc-11-20201220-clean/libgcc/config/arm/cm0/parity.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201220/libgcc/config/arm/cm0/parity.S	2021-01-06 02:45:47.432262214 -0800
@@ -0,0 +1,122 @@
+/* parity.S: Cortex M0 optimized parity functions
+
+   Copyright (C) 2020-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#ifdef L_paritydi2
+   
+// int __paritydi2(int)
+// Returns '0' if the number of bits set in $r1:r0 is even, and '1' otherwise.
+// Returns the result in $r0.
+.section .text.sorted.libgcc.paritydi2,"x"
+CM0_FUNC_START paritydi2
+    CFI_START_FUNCTION
+    
+        // Combine the upper and lower words, then fall through. 
+        // Byte-endianness does not matter for this function.  
+        eors    r0,     r1
+
+#endif /* L_paritydi2 */ 
+
+
+// The implementation of __paritydi2() tightly couples with __paritysi2(),
+//  such that instructions must appear consecutively in the same memory
+//  section for proper flow control.  However, this construction inhibits
+//  the ability to discard __paritydi2() when only using __paritysi2().
+// Therefore, this block configures __paritysi2() for compilation twice.
+// The first version is a minimal standalone implementation, and the second
+//  version is the continuation of __paritydi2().  The standalone version must
+//  be declared WEAK, so that the combined version can supersede it and
+//  provide both symbols when required.
+// '_paritysi2' should appear before '_paritydi2' in LIB1ASMFUNCS.
+#if defined(L_paritysi2) || defined(L_paritydi2) 
+
+#ifdef L_paritysi2            
+// int __paritysi2(int)
+// Returns '0' if the number of bits set in $r0 is even, and '1' otherwise.
+// Returns the result in $r0.
+// Uses $r2 as scratch space.
+.section .text.sorted.libgcc.paritysi2,"x"
+CM0_WEAK_START paritysi2
+    CFI_START_FUNCTION
+
+#else /* L_paritydi2 */
+CM0_FUNC_START paritysi2
+
+#endif
+
+  #if defined(__thumb__) && __thumb__
+    #if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+
+        // Size optimized: 16 bytes, 40 cycles
+        // Speed optimized: 24 bytes, 14 cycles
+        movs    r2,     #16 
+        
+    LSYM(__parity_loop):
+        // Calculate the parity of successively smaller half-words into the MSB.  
+        movs    r1,     r0 
+        lsls    r1,     r2 
+        eors    r0,     r1 
+        lsrs    r2,     #1 
+        bne     LSYM(__parity_loop)
+   
+    #else /* !__OPTIMIZE_SIZE__ */
+        
+        // Unroll the loop.  The 'libgcc' reference C implementation replaces 
+        //  the x2 and the x1 shifts with a constant.  However, since it takes 
+        //  4 cycles to load, index, and mask the constant result, it doesn't 
+        //  cost anything to keep shifting (and saves a few bytes).  
+        lsls    r1,     r0,     #16 
+        eors    r0,     r1 
+        lsls    r1,     r0,     #8 
+        eors    r0,     r1 
+        lsls    r1,     r0,     #4 
+        eors    r0,     r1 
+        lsls    r1,     r0,     #2 
+        eors    r0,     r1 
+        lsls    r1,     r0,     #1 
+        eors    r0,     r1 
+        
+    #endif /* !__OPTIMIZE_SIZE__ */
+  #else /* !__thumb__ */
+   
+        eors    r0,    r0,     r0,     lsl #16
+        eors    r0,    r0,     r0,     lsl #8
+        eors    r0,    r0,     r0,     lsl #4
+        eors    r0,    r0,     r0,     lsl #2
+        eors    r0,    r0,     r0,     lsl #1
+
+  #endif /* !__thumb__ */
+ 
+        lsrs    r0,     #31 
+        RET
+        
+    CFI_END_FUNCTION
+CM0_FUNC_END paritysi2
+
+#ifdef L_paritydi2
+CM0_FUNC_END paritydi2
+#endif 
+
+#endif /* L_paritysi2 || L_paritydi2 */
+
diff -ruN gcc-11-20201220-clean/libgcc/config/arm/cm0/popcnt.S gcc-11-20201220/libgcc/config/arm/cm0/popcnt.S
--- gcc-11-20201220-clean/libgcc/config/arm/cm0/popcnt.S	1969-12-31 16:00:00.000000000 -0800
+++ gcc-11-20201220/libgcc/config/arm/cm0/popcnt.S	2021-01-06 02:45:47.432262214 -0800
@@ -0,0 +1,199 @@
+/* popcnt.S: Cortex M0 optimized popcount functions
+
+   Copyright (C) 2020-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel (gnu@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#ifdef L_popcountdi2
+   
+// int __popcountdi2(int)
+// Returns the number of bits set in $r1:$r0.
+// Returns the result in $r0.
+.section .text.sorted.libgcc.popcountdi2,"x"
+CM0_FUNC_START popcountdi2
+    CFI_START_FUNCTION
+    
+  #if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+        // Initialize the result.
+        // Compensate for the two extra loop (one for each word)
+        //  required to detect zero arguments.  
+        movs    r2,     #2
+
+    LSYM(__popcountd_loop):
+        // Same as __popcounts_loop below, except for $r1.
+        subs    r2,     #1
+        subs    r3,     r1,     #1
+        ands    r1,     r3 
+        bcs     LSYM(__popcountd_loop)
+        
+        // Repeat the operation for the second word.  
+        b       LSYM(__popcounts_loop)
+
+  #else /* !__OPTIMIZE_SIZE__ */
+        // Load the one-bit alternating mask.
+        ldr     r3,     LSYM(__popcount_1b)
+
+        // Reduce the second word. 
+        lsrs    r2,     r1,     #1
+        ands    r2,     r3
+        subs    r1,     r2 
+
+        // Reduce the first word. 
+        lsrs    r2,     r0,     #1
+        ands    r2,     r3
+        subs    r0,     r2 
+
+        // Load the two-bit alternating mask. 
+        ldr     r3,     LSYM(__popcount_2b)
+
+        // Reduce the second word.
+        lsrs    r2,     r1,     #2
+        ands    r2,     r3
+        ands    r1,     r3
+        adds    r1,     r2
+
+        // Reduce the first word. 
+        lsrs    r2,     r0,     #2
+        ands    r2,     r3
+        ands    r0,     r3
+        adds    r0,     r2    
+
+        // There will be a maximum of 8 bits in each 4-bit field.   
+        // Jump into the single word flow to combine and complete.
+        b       LSYM(__popcounts_merge)
+
+  #endif /* !__OPTIMIZE_SIZE__ */
+#endif /* L_popcountdi2 */ 
+
+
+// The implementation of __popcountdi2() tightly couples with __popcountsi2(),
+//  such that instructions must appear consecutively in the same memory
+//  section for proper flow control.  However, this construction inhibits
+//  the ability to discard __popcountdi2() when only using __popcountsi2().
+// Therefore, this block configures __popcountsi2() for compilation twice.
+// The first version is a minimal standalone implementation, and the second
+//  version is the continuation of __popcountdi2().  The standalone version must
+//  be declared WEAK, so that the combined version can supersede it and
+//  provide both symbols when required.
+// '_popcountsi2' should appear before '_popcountdi2' in LIB1ASMFUNCS.
+#if defined(L_popcountsi2) || defined(L_popcountdi2) 
+
+#ifdef L_popcountsi2            
+// int __popcountsi2(int)
+// Returns '0' if the number of bits set in $r0 is even, and '1' otherwise.
+// Returns the result in $r0.
+// Uses $r2 as scratch space.
+.section .text.sorted.libgcc.popcountsi2,"x"
+CM0_WEAK_START popcountsi2
+    CFI_START_FUNCTION
+
+#else /* L_popcountdi2 */
+CM0_FUNC_START popcountsi2
+
+#endif
+
+  #if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+        // Initialize the result.
+        // Compensate for the extra loop required to detect zero.
+        movs    r2,     #1
+
+        // Kernighan's algorithm for __popcount(x): 
+        //     for (c = 0; x; c++)
+        //         x &= x - 1;
+
+    LSYM(__popcounts_loop):
+        // Every loop counts for a '1' set in the argument.  
+        // Count down since it's easier to initialize positive compensation, 
+        //  and the negation before function return is free.  
+        subs    r2,     #1
+
+        // Clear one bit per loop.  
+        subs    r3,     r0,     #1
+        ands    r0,     r3 
+
+        // If this is a test for zero, it will be impossible to distinguish
+        //  between zero and one bits set: both terminate after one loop.  
+        // Instead, subtraction underflow flags when zero entered the loop.
+        bcs     LSYM(__popcountd_loop)
+       
+        // Invert the result, since we have been counting negative.   
+        rsbs    r0,     r2,     #0 
+        RET
+
+  #else /* !__OPTIMIZE_SIZE__ */
+
+        // Load the one-bit alternating mask.
+        ldr     r3,     LSYM(__popcount_1b)
+
+        // Reduce the word. 
+        lsrs    r1,     r0,     #1
+        ands    r1,     r3
+        subs    r0,     r1 
+
+        // Load the two-bit alternating mask. 
+        ldr     r3,     LSYM(__popcount_2b)
+
+        // Reduce the word. 
+        lsrs    r1,     r0,     #2
+        ands    r0,     r3
+        ands    r1,     r3
+    LSYM(__popcounts_merge):
+        adds    r0,     r1
+
+        // Load the four-bit alternating mask.  
+        ldr     r3,     LSYM(__popcount_4b)
+
+        // Reduce the word. 
+        lsrs    r1,     r0,     #4
+        ands    r0,     r3
+        ands    r1,     r3
+        adds    r0,     r1
+
+        // Accumulate individual byte sums into the MSB.
+        lsls    r1,     r0,     #8
+        adds    r0,     r1 
+        lsls    r1,     r0,     #16
+        adds    r0,     r1
+
+        // Isolate the cumulative sum.
+        lsrs    r0,     #24
+        RET
+
+        .align 2
+    LSYM(__popcount_1b):
+        .word 0x55555555
+    LSYM(__popcount_2b):
+        .word 0x33333333
+    LSYM(__popcount_4b):
+        .word 0x0F0F0F0F
+        
+  #endif /* !__OPTIMIZE_SIZE__ */
+
+    CFI_END_FUNCTION
+CM0_FUNC_END popcountsi2
+
+#ifdef L_popcountdi2
+CM0_FUNC_END popcountdi2
+#endif
+
+#endif /* L_popcountsi2 || L_popcountdi2 */
+
diff -ruN gcc-11-20201220-clean/libgcc/config/arm/lib1funcs.S gcc-11-20201220/libgcc/config/arm/lib1funcs.S
--- gcc-11-20201220-clean/libgcc/config/arm/lib1funcs.S	2020-12-20 14:32:15.000000000 -0800
+++ gcc-11-20201220/libgcc/config/arm/lib1funcs.S	2021-01-06 02:45:47.436262144 -0800
@@ -1050,6 +1050,10 @@
 /* ------------------------------------------------------------------------ */
 /*		Start of the Real Functions				    */
 /* ------------------------------------------------------------------------ */
+
+/* Disable these functions for v6m in favor of the versions below */
+#ifndef NOT_ISA_TARGET_32BIT
+
 #ifdef L_udivsi3
 
 #if defined(__prefer_thumb__)
@@ -1455,6 +1459,8 @@
 	DIV_FUNC_END modsi3 signed
 
 #endif /* L_modsi3 */
+#endif /* NOT_ISA_TARGET_32BIT */
+
 /* ------------------------------------------------------------------------ */
 #ifdef L_dvmd_tls
 
@@ -1472,7 +1478,8 @@
 	FUNC_END div0
 #endif
 	
-#endif /* L_divmodsi_tools */
+#endif /* L_dvmd_tls */
+
 /* ------------------------------------------------------------------------ */
 #ifdef L_dvmd_lnx
 @ GNU/Linux division-by zero handler.  Used in place of L_dvmd_tls
@@ -1509,6 +1516,7 @@
 #endif
 	
 #endif /* L_dvmd_lnx */
+
 #ifdef L_clear_cache
 #if defined __ARM_EABI__ && defined __linux__
 @ EABI GNU/Linux call to cacheflush syscall.
@@ -1584,305 +1592,12 @@
    case of logical shifts) or the sign (for asr).  */
 
 #ifdef __ARMEB__
-#define al	r1
-#define ah	r0
-#else
-#define al	r0
-#define ah	r1
-#endif
-
-/* Prevent __aeabi double-word shifts from being produced on SymbianOS.  */
-#ifndef __symbian__
-
-#ifdef L_lshrdi3
-
-	FUNC_START lshrdi3
-	FUNC_ALIAS aeabi_llsr lshrdi3
-	
-#ifdef __thumb__
-	lsrs	al, r2
-	movs	r3, ah
-	lsrs	ah, r2
-	mov	ip, r3
-	subs	r2, #32
-	lsrs	r3, r2
-	orrs	al, r3
-	negs	r2, r2
-	mov	r3, ip
-	lsls	r3, r2
-	orrs	al, r3
-	RET
+#define al      r1
+#define ah      r0
 #else
-	subs	r3, r2, #32
-	rsb	ip, r2, #32
-	movmi	al, al, lsr r2
-	movpl	al, ah, lsr r3
-	orrmi	al, al, ah, lsl ip
-	mov	ah, ah, lsr r2
-	RET
-#endif
-	FUNC_END aeabi_llsr
-	FUNC_END lshrdi3
-
-#endif
-	
-#ifdef L_ashrdi3
-	
-	FUNC_START ashrdi3
-	FUNC_ALIAS aeabi_lasr ashrdi3
-	
-#ifdef __thumb__
-	lsrs	al, r2
-	movs	r3, ah
-	asrs	ah, r2
-	subs	r2, #32
-	@ If r2 is negative at this point the following step would OR
-	@ the sign bit into all of AL.  That's not what we want...
-	bmi	1f
-	mov	ip, r3
-	asrs	r3, r2
-	orrs	al, r3
-	mov	r3, ip
-1:
-	negs	r2, r2
-	lsls	r3, r2
-	orrs	al, r3
-	RET
-#else
-	subs	r3, r2, #32
-	rsb	ip, r2, #32
-	movmi	al, al, lsr r2
-	movpl	al, ah, asr r3
-	orrmi	al, al, ah, lsl ip
-	mov	ah, ah, asr r2
-	RET
-#endif
-
-	FUNC_END aeabi_lasr
-	FUNC_END ashrdi3
-
-#endif
-
-#ifdef L_ashldi3
-
-	FUNC_START ashldi3
-	FUNC_ALIAS aeabi_llsl ashldi3
-	
-#ifdef __thumb__
-	lsls	ah, r2
-	movs	r3, al
-	lsls	al, r2
-	mov	ip, r3
-	subs	r2, #32
-	lsls	r3, r2
-	orrs	ah, r3
-	negs	r2, r2
-	mov	r3, ip
-	lsrs	r3, r2
-	orrs	ah, r3
-	RET
-#else
-	subs	r3, r2, #32
-	rsb	ip, r2, #32
-	movmi	ah, ah, lsl r2
-	movpl	ah, al, lsl r3
-	orrmi	ah, ah, al, lsr ip
-	mov	al, al, lsl r2
-	RET
+#define al      r0
+#define ah      r1
 #endif
-	FUNC_END aeabi_llsl
-	FUNC_END ashldi3
-
-#endif
-
-#endif /* __symbian__ */
-
-#ifdef L_clzsi2
-#ifdef NOT_ISA_TARGET_32BIT
-FUNC_START clzsi2
-	movs	r1, #28
-	movs	r3, #1
-	lsls	r3, r3, #16
-	cmp	r0, r3 /* 0x10000 */
-	bcc	2f
-	lsrs	r0, r0, #16
-	subs	r1, r1, #16
-2:	lsrs	r3, r3, #8
-	cmp	r0, r3 /* #0x100 */
-	bcc	2f
-	lsrs	r0, r0, #8
-	subs	r1, r1, #8
-2:	lsrs	r3, r3, #4
-	cmp	r0, r3 /* #0x10 */
-	bcc	2f
-	lsrs	r0, r0, #4
-	subs	r1, r1, #4
-2:	adr	r2, 1f
-	ldrb	r0, [r2, r0]
-	adds	r0, r0, r1
-	bx lr
-.align 2
-1:
-.byte 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0
-	FUNC_END clzsi2
-#else
-ARM_FUNC_START clzsi2
-# if defined (__ARM_FEATURE_CLZ)
-	clz	r0, r0
-	RET
-# else
-	mov	r1, #28
-	cmp	r0, #0x10000
-	do_it	cs, t
-	movcs	r0, r0, lsr #16
-	subcs	r1, r1, #16
-	cmp	r0, #0x100
-	do_it	cs, t
-	movcs	r0, r0, lsr #8
-	subcs	r1, r1, #8
-	cmp	r0, #0x10
-	do_it	cs, t
-	movcs	r0, r0, lsr #4
-	subcs	r1, r1, #4
-	adr	r2, 1f
-	ldrb	r0, [r2, r0]
-	add	r0, r0, r1
-	RET
-.align 2
-1:
-.byte 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0
-# endif /* !defined (__ARM_FEATURE_CLZ) */
-	FUNC_END clzsi2
-#endif
-#endif /* L_clzsi2 */
-
-#ifdef L_clzdi2
-#if !defined (__ARM_FEATURE_CLZ)
-
-# ifdef NOT_ISA_TARGET_32BIT
-FUNC_START clzdi2
-	push	{r4, lr}
-	cmp	xxh, #0
-	bne	1f
-#  ifdef __ARMEB__
-	movs	r0, xxl
-	bl	__clzsi2
-	adds	r0, r0, #32
-	b 2f
-1:
-	bl	__clzsi2
-#  else
-	bl	__clzsi2
-	adds	r0, r0, #32
-	b 2f
-1:
-	movs	r0, xxh
-	bl	__clzsi2
-#  endif
-2:
-	pop	{r4, pc}
-# else /* NOT_ISA_TARGET_32BIT */
-ARM_FUNC_START clzdi2
-	do_push	{r4, lr}
-	cmp	xxh, #0
-	bne	1f
-#  ifdef __ARMEB__
-	mov	r0, xxl
-	bl	__clzsi2
-	add	r0, r0, #32
-	b 2f
-1:
-	bl	__clzsi2
-#  else
-	bl	__clzsi2
-	add	r0, r0, #32
-	b 2f
-1:
-	mov	r0, xxh
-	bl	__clzsi2
-#  endif
-2:
-	RETLDM	r4
-	FUNC_END clzdi2
-# endif /* NOT_ISA_TARGET_32BIT */
-
-#else /* defined (__ARM_FEATURE_CLZ) */
-
-ARM_FUNC_START clzdi2
-	cmp	xxh, #0
-	do_it	eq, et
-	clzeq	r0, xxl
-	clzne	r0, xxh
-	addeq	r0, r0, #32
-	RET
-	FUNC_END clzdi2
-
-#endif
-#endif /* L_clzdi2 */
-
-#ifdef L_ctzsi2
-#ifdef NOT_ISA_TARGET_32BIT
-FUNC_START ctzsi2
-	negs	r1, r0
-	ands	r0, r0, r1
-	movs	r1, #28
-	movs	r3, #1
-	lsls	r3, r3, #16
-	cmp	r0, r3 /* 0x10000 */
-	bcc	2f
-	lsrs	r0, r0, #16
-	subs	r1, r1, #16
-2:	lsrs	r3, r3, #8
-	cmp	r0, r3 /* #0x100 */
-	bcc	2f
-	lsrs	r0, r0, #8
-	subs	r1, r1, #8
-2:	lsrs	r3, r3, #4
-	cmp	r0, r3 /* #0x10 */
-	bcc	2f
-	lsrs	r0, r0, #4
-	subs	r1, r1, #4
-2:	adr	r2, 1f
-	ldrb	r0, [r2, r0]
-	subs	r0, r0, r1
-	bx lr
-.align 2
-1:
-.byte	27, 28, 29, 29, 30, 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 31
-	FUNC_END ctzsi2
-#else
-ARM_FUNC_START ctzsi2
-	rsb	r1, r0, #0
-	and	r0, r0, r1
-# if defined (__ARM_FEATURE_CLZ)
-	clz	r0, r0
-	rsb	r0, r0, #31
-	RET
-# else
-	mov	r1, #28
-	cmp	r0, #0x10000
-	do_it	cs, t
-	movcs	r0, r0, lsr #16
-	subcs	r1, r1, #16
-	cmp	r0, #0x100
-	do_it	cs, t
-	movcs	r0, r0, lsr #8
-	subcs	r1, r1, #8
-	cmp	r0, #0x10
-	do_it	cs, t
-	movcs	r0, r0, lsr #4
-	subcs	r1, r1, #4
-	adr	r2, 1f
-	ldrb	r0, [r2, r0]
-	sub	r0, r0, r1
-	RET
-.align 2
-1:
-.byte	27, 28, 29, 29, 30, 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 31
-# endif /* !defined (__ARM_FEATURE_CLZ) */
-	FUNC_END ctzsi2
-#endif
-#endif /* L_clzsi2 */
 
 /* ------------------------------------------------------------------------ */
 /* These next two sections are here despite the fact that they contain Thumb 
@@ -2190,4 +1905,77 @@
 #else /* NOT_ISA_TARGET_32BIT */
 #include "bpabi-v6m.S"
 #endif /* NOT_ISA_TARGET_32BIT */
+
+
+/* Temp registers. */
+#define rP r4
+#define rQ r5
+#define rS r6
+#define rT r7
+
+.macro CM0_FUNC_START name
+.global SYM(__\name)
+.type SYM(__\name),function
+THUMB_CODE
+THUMB_FUNC
+.align 1
+    SYM(__\name):
+.endm
+
+.macro CM0_WEAK_START name 
+.weak SYM(__\name)
+CM0_FUNC_START \name 
+.endm
+
+.macro CM0_FUNC_ALIAS new old
+.global	SYM (__\new)
+.thumb_set SYM (__\new), SYM (__\old)
+.endm
+
+.macro CM0_WEAK_ALIAS new old
+.weak SYM(__\new)
+CM0_FUNC_ALIAS \new \old 
+.endm
+
+.macro CM0_FUNC_END name
+.size SYM(__\name), . - SYM(__\name)
+.endm
+
+#include "cm0/fplib.h"
+
+/* These have no conflicts with existing ARM implementations, 
+    so these these files can be built for all architectures. */ 
+#include "cm0/ctz2.S"
+#include "cm0/clz2.S"
+#include "cm0/lcmp.S"
+#include "cm0/lmul.S"
+#include "cm0/lshift.S"
+#include "cm0/parity.S"
+#include "cm0/popcnt.S"
+
+#ifdef NOT_ISA_TARGET_32BIT 
+
+/* These have existing ARM implementations that may be preferred 
+    for non-v6m architectures.  For example, use of the hardware 
+    instructions for 'clz' and 'umull'/'smull'.  Comprehensive 
+    integration may be possible in the future. */
+#include "cm0/idiv.S"
+#include "cm0/ldiv.S"
+
+#include "cm0/fcmp.S"
+
+/* Section names in the following files are selected to maximize 
+    the utility of +/- 256 byte conditional branches. */
+#include "cm0/fneg.S"
+#include "cm0/fadd.S"
+#include "cm0/futil.S"
+#include "cm0/fmul.S"
+#include "cm0/fdiv.S"
+
+#include "cm0/ffloat.S"
+#include "cm0/ffixed.S"
+#include "cm0/fconv.S"
+
+#endif /* NOT_ISA_TARGET_32BIT */ 
+
 #endif /* !__symbian__ */
diff -ruN gcc-11-20201220-clean/libgcc/config/arm/t-elf gcc-11-20201220/libgcc/config/arm/t-elf
--- gcc-11-20201220-clean/libgcc/config/arm/t-elf	2020-12-20 14:32:15.000000000 -0800
+++ gcc-11-20201220/libgcc/config/arm/t-elf	2021-01-06 02:45:47.436262144 -0800
@@ -10,23 +10,31 @@
 # inclusion create when only multiplication is used, thus avoiding pulling in
 # useless division code.
 ifneq (__ARM_ARCH_ISA_THUMB 1,$(ARM_ISA)$(THUMB1_ISA))
-LIB1ASMFUNCS += _arm_muldf3 _arm_mulsf3
+LIB1ASMFUNCS += _arm_muldf3
 endif
 endif # !__symbian__
 
+
+# Preferred WEAK implementations should appear first.  See implementation notes.
+LIB1ASMFUNCS += _arm_mulsf3 _arm_addsf3 _umulsidi3 _arm_floatsisf _arm_floatundisf \
+	_clzsi2 _ctzsi2 _ffssi2 _clrsbsi2 _paritysi2 _popcountsi2 
+
+
 # For most CPUs we have an assembly soft-float implementations.
-# However this is not true for ARMv6M.  Here we want to use the soft-fp C
-# implementation.  The soft-fp code is only build for ARMv6M.  This pulls
-# in the asm implementation for other CPUs.
-LIB1ASMFUNCS += _udivsi3 _divsi3 _umodsi3 _modsi3 _dvmd_tls _bb_init_func \
-	_call_via_rX _interwork_call_via_rX \
-	_lshrdi3 _ashrdi3 _ashldi3 \
+LIB1ASMFUNCS += _udivsi3 _divsi3 _umodsi3 _modsi3 _udivdi3 _divdi3 \
+	_dvmd_tls _bb_init_func _call_via_rX _interwork_call_via_rX \
+	_lshrdi3 _ashrdi3 _ashldi3 _mulsidi3 _muldi3 \
+	_arm_lcmp _cmpdi2 _arm_ulcmp _ucmpdi2 \
 	_arm_negdf2 _arm_addsubdf3 _arm_muldivdf3 _arm_cmpdf2 _arm_unorddf2 \
-	_arm_fixdfsi _arm_fixunsdfsi \
-	_arm_truncdfsf2 _arm_negsf2 _arm_addsubsf3 _arm_muldivsf3 \
-	_arm_cmpsf2 _arm_unordsf2 _arm_fixsfsi _arm_fixunssfsi \
-	_arm_floatdidf _arm_floatdisf _arm_floatundidf _arm_floatundisf \
-	_clzsi2 _clzdi2 _ctzsi2
+	_arm_fixdfsi _arm_fixunsdfsi _arm_fixsfsi _arm_fixunssfsi \
+	_arm_f2h _arm_h2f _arm_d2f _arm_f2d _arm_truncdfsf2 \
+	_arm_negsf2 _arm_addsubsf3 _arm_frsubsf3 _arm_divsf3 _arm_muldivsf3 \
+	_arm_cmpsf2 _arm_unordsf2 _arm_eqsf2 _arm_gesf2 \
+ 	_arm_fcmpeq _arm_fcmpne _arm_fcmplt _arm_fcmple _arm_fcmpge _arm_fcmpgt \
+	_arm_cfcmpeq _arm_cfcmple _arm_cfrcmple \
+	_arm_floatdidf _arm_floatundidf _arm_floatdisf _arm_floatunsisf \
+	_clzdi2 _ctzdi2 _ffsdi2 _clrsbdi2 _paritydi2 _popcountdi2 
+
 
 # Currently there is a bug somewhere in GCC's alias analysis
 # or scheduling code that is breaking _fpmul_parts in fp-bit.c.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2021-01-06 11:20       ` [PATCH v3] " Daniel Engel
@ 2021-01-06 17:05         ` Richard Earnshaw
  2021-01-07  0:59           ` Daniel Engel
  0 siblings, 1 reply; 26+ messages in thread
From: Richard Earnshaw @ 2021-01-06 17:05 UTC (permalink / raw)
  To: Daniel Engel, Christophe Lyon; +Cc: gcc Patches

On 06/01/2021 11:20, Daniel Engel wrote:
> Hi Christophe, 
> 
> On Wed, Dec 16, 2020, at 9:15 AM, Christophe Lyon wrote:
>> On Wed, 2 Dec 2020 at 04:31, Daniel Engel <libgcc@danielengel.com> wrote:
>>>
>>> Hi Christophe,
>>>
>>> On Thu, Nov 26, 2020, at 1:14 AM, Christophe Lyon wrote:
>>>> Hi,
>>>>
>>>> On Fri, 13 Nov 2020 at 00:03, Daniel Engel <libgcc@danielengel.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> This patch adds an efficient assembly-language implementation of IEEE-
>>>>> 754 compliant floating point routines for Cortex M0 EABI (v6m, thumb-
>>>>> 1).  This is the libgcc portion of a larger library originally
>>>>> described in 2018:
>>>>>
>>>>>     https://gcc.gnu.org/legacy-ml/gcc/2018-11/msg00043.html
>>>>>
>>>>> Since that time, I've separated the libm functions for submission to
>>>>> newlib.  The remaining libgcc functions in the attached patch have
>>>>> the following characteristics:
>>>>>
>>>>>     Function(s)                     Size (bytes)        Cycles          Stack   Accuracy
>>>>>     __clzsi2                        42                  23              0       exact
>>>>>     __clzsi2 (OPTIMIZE_SIZE)        22                  55              0       exact
>>>>>     __clzdi2                        8+__clzsi2          4+__clzsi2      0       exact
>>>>>
>>>>>     __umulsidi3                     44                  24              0       exact
>>>>>     __mulsidi3                      30+__umulsidi3      24+__umulsidi3  8       exact
>>>>>     __muldi3 (__aeabi_lmul)         10+__umulsidi3      6+__umulsidi3   0       exact
>>>>>     __ashldi3 (__aeabi_llsl)        22                  13              0       exact
>>>>>     __lshrdi3 (__aeabi_llsr)        22                  13              0       exact
>>>>>     __ashrdi3 (__aeabi_lasr)        22                  13              0       exact
>>>>>
>>>>>     __aeabi_lcmp                    20                   13             0       exact
>>>>>     __aeabi_ulcmp                   16                  10              0       exact
>>>>>
>>>>>     __udivsi3 (__aeabi_uidiv)       56                  72 – 385        0       < 1 lsb
>>>>>     __divsi3 (__aeabi_idiv)         38+__udivsi3        26+__udivsi3    8       < 1 lsb
>>>>>     __udivdi3 (__aeabi_uldiv)       164                 103 – 1394      16      < 1 lsb
>>>>>     __udivdi3 (OPTIMIZE_SIZE)       142                 120 – 1392      16      < 1 lsb
>>>>>     __divdi3 (__aeabi_ldiv)         54+__udivdi3        36+__udivdi3    32      < 1 lsb
>>>>>
>>>>>     __shared_float                  178
>>>>>     __shared_float (OPTIMIZE_SIZE)  154
>>>>>
>>>>>     __addsf3 (__aeabi_fadd)         116+__shared_float  31 – 76         8       <= 0.5 ulp
>>>>>     __addsf3 (OPTIMIZE_SIZE)        112+__shared_float  74              8       <= 0.5 ulp
>>>>>     __subsf3 (__aeabi_fsub)         8+__addsf3          6+__addsf3      8       <= 0.5 ulp
>>>>>     __aeabi_frsub                   8+__addsf3          6+__addsf3      8       <= 0.5 ulp
>>>>>     __mulsf3 (__aeabi_fmul)         112+__shared_float  73 – 97         8       <= 0.5 ulp
>>>>>     __mulsf3 (OPTIMIZE_SIZE)        96+__shared_float   93              8       <= 0.5 ulp
>>>>>     __divsf3 (__aeabi_fdiv)         132+__shared_float  83 – 361        8       <= 0.5 ulp
>>>>>     __divsf3 (OPTIMIZE_SIZE)        120+__shared_float  263 – 359       8       <= 0.5 ulp
>>>>>
>>>>>     __cmpsf2/__lesf2/__ltsf2        72                  33              0       exact
>>>>>     __eqsf2/__nesf2                 4+__cmpsf2          3+__cmpsf2      0       exact
>>>>>     __gesf2/__gesf2                 4+__cmpsf2          3+__cmpsf2      0       exact
>>>>>     __unordsf2 (__aeabi_fcmpun)     4+__cmpsf2          3+__cmpsf2      0       exact
>>>>>     __aeabi_fcmpeq                  4+__cmpsf2          3+__cmpsf2      0       exact
>>>>>     __aeabi_fcmpne                  4+__cmpsf2          3+__cmpsf2      0       exact
>>>>>     __aeabi_fcmplt                  4+__cmpsf2          3+__cmpsf2      0       exact
>>>>>     __aeabi_fcmple                  4+__cmpsf2          3+__cmpsf2      0       exact
>>>>>     __aeabi_fcmpge                  4+__cmpsf2          3+__cmpsf2      0       exact
>>>>>
>>>>>     __floatundisf (__aeabi_ul2f)    14+__shared_float   40 – 81         8       <= 0.5 ulp
>>>>>     __floatundisf (OPTIMIZE_SIZE)   14+__shared_float   40 – 237        8       <= 0.5 ulp
>>>>>     __floatunsisf (__aeabi_ui2f)    0+__floatundisf     1+__floatundisf 8       <= 0.5 ulp
>>>>>     __floatdisf (__aeabi_l2f)       14+__floatundisf    7+__floatundisf 8       <= 0.5 ulp
>>>>>     __floatsisf (__aeabi_i2f)       0+__floatdisf       1+__floatdisf   8       <= 0.5 ulp
>>>>>
>>>>>     __fixsfdi (__aeabi_f2lz)        74                  27 – 33         0       exact
>>>>>     __fixunssfdi (__aeabi_f2ulz)    4+__fixsfdi         3+__fixsfdi     0       exact
>>>>>     __fixsfsi (__aeabi_f2iz)        52                  19              0       exact
>>>>>     __fixsfsi (OPTIMIZE_SIZE)       4+__fixsfdi         3+__fixsfdi     0       exact
>>>>>     __fixunssfsi (__aeabi_f2uiz)    4+__fixsfsi         3+__fixsfsi     0       exact
>>>>>
>>>>>     __extendsfdf2 (__aeabi_f2d)     42+__shared_float 38             8     exact
>>>>>     __aeabi_d2f                     56+__shared_float 54 – 58     8     <= 0.5 ulp
>>>>>     __aeabi_h2f                     34+__shared_float 34             8     exact
>>>>>     __aeabi_f2h                     84                 23 – 34         0     <= 0.5 ulp
>>>>>
>>>>> Copyright assignment is on file with the FSF.
>>>>>
>>>>> I've built the gcc-arm-none-eabi cross-compiler using the 20201108
>>>>> snapshot of GCC plus this patch, and successfully compiled a test
>>>>> program:
>>>>>
>>>>>     extern int main (void)
>>>>>     {
>>>>>         volatile int x = 1;
>>>>>         volatile unsigned long long int y = 10;
>>>>>         volatile long long int z = x / y; // 64-bit division
>>>>>
>>>>>         volatile float a = x; // 32-bit casting
>>>>>         volatile float b = y; // 64 bit casting
>>>>>         volatile float c = z / b; // float division
>>>>>         volatile float d = a + c; // float addition
>>>>>         volatile float e = c * b; // float multiplication
>>>>>         volatile float f = d - e - c; // float subtraction
>>>>>
>>>>>         if (f != c) // float comparison
>>>>>             y -= (long long int)d; // float casting
>>>>>     }
>>>>>
>>>>> As one point of comparison, the test program links to 876 bytes of
>>>>> libgcc code from the patched toolchain, vs 10276 bytes from the
>>>>> latest released gcc-arm-none-eabi-9-2020-q2 toolchain.    That's a
>>>>> 90% size reduction.
>>>>
>>>> This looks awesome!
>>>>
>>>>>
>>>>> I have extensive test vectors, and have passed these tests on an
>>>>> STM32F051.  These vectors were derived from UCB [1], Testfloat [2],
>>>>> and IEEECC754 [3] sources, plus some of my own creation.
>>>>> Unfortunately, I'm not sure how "make check" should work for a cross
>>>>> compiler run time library.
>>>>>
>>>>> Although I believe this patch can be incorporated as-is, there are
>>>>> at least two points that might bear discussion:
>>>>>
>>>>> * I'm not sure where or how they would be integrated, but I would be
>>>>>   happy to provide sources for my test vectors.
>>>>>
>>>>> * The library is currently built for the ARM v6m architecture only.
>>>>>   It is likely that some of the other Cortex variants would benefit
>>>>>   from these routines.  However, I would need some guidance on this
>>>>>   to proceed without introducing regressions.  I do not currently
>>>>>   have a test strategy for architectures beyond Cortex M0, and I
>>>>>   have NOT profiled the existing thumb-2 implementations (ieee754-
>>>>>   sf.S) for comparison.
>>>>
>>>> I tried your patch, and I see many regressions in the GCC testsuite
>>>> because many tests fail to link with errors like:
>>>> ld: /gcc/thumb/v6-m/nofp/libgcc.a(_arm_cmpdf2.o): in function
>>>> `__clzdi2':
>>>> /libgcc/config/arm/cm0/clz2.S:39: multiple definition of
>>>> `__clzdi2';/gcc/thumb/v6-m/nofp/libgcc.a(_thumb1_case_sqi.o):/libgcc/config/arm/cm0/clz2.S:39:
>>>> first defined here
>>>>
>>>> This happens with a toolchain configured with --target arm-none-eabi,
>>>> default cpu/fpu/mode,
>>>> --enable-multilib --with-multilib-list=rmprofile and running the tests with
>>>> -mthumb/-mcpu=cortex-m0/-mfloat-abi=soft/-march=armv6s-m
>>>>
>>>> Does it work for you?
>>>
>>> Thanks for the feedback.
>>>
>>> I'm afraid I'm quite ignorant as to the gcc test suite
>>> infrastructure, so I don't know how to use the options you've shared
>>> above.  I'm cross- compiling the Windows toolchain on Ubuntu.  Would
>>> you mind sharing a full command line you would use for testing?  The
>>> toolchain is built with the default options, which includes "--
>>> target arm-none-eabi".
>>>
>>
>> Why put Windows in the picture? This seems unnecessarily
>> complicated... I suggest you build your cross-toolchain on x86_64
>> ubuntu and run it on x86_64 ubuntu (of course targetting arm)
> 
> Mostly because I had not previously committed the time to understand the
> GCC regression test environment.  My company and personal computers both
> run Windows.  I created an Ubuntu virtual machine for this project, and
> I'd been trying to get by with the build scripts provided by the ARM
> toolchain.  Clearly that was insufficient.
> 
>> The above options where GCC configure options, except for the last one
>> which I used when running the tests.
>>
>> There is some documentation about how to run the GCC testsuite there:
>> https://gcc.gnu.org/install/test.html
> 
> Thanks.  I was able to take this document, plus some additional pages
> about constructing a combined tree with newlib, and put together a
> working regression test.  GDB didn't want to build cleanly at first, so
> eventually I gave up and disabled that part.
> 
>> Basically 'make check' should mostly work except for execution tests
>> for which you'll need to teach DejaGnu how to run the generated
>> programs on a real board or on a simulator.
>>
>> I didn't analyze your patch, I just submitted it to my validation
>> system:
>> https://people.linaro.org/~christophe.lyon/cross-validation/gcc-test-patches/r11-5993-g159b0bd9ce263dfb791eff5133b0ca0207201c84-cortex-m0-fplib-20201130.patch2/report-build-info.html
>> - the red "regressed" items indicate regressions in the testsuite. You
>>   can click on "log" to download the corresponding gcc.log
>> - the dark-red "build broken" items indicate that the toolchain build
>>   failed
>> - the orange "interrupted" items indicate an infrastructure problem,
>>   so you can ignore such cases
>> - similarly the dark red "ref build failed" indicate that the
>>   reference build failed for some infrastructure reason
>>
>> for the arm-none-eabi target, several toolchain versions fail to
>> build, some succeed. This is because I use different multilib
>> configuration flags, it looks like the ones involving --with-
>> multilib=rmprofile are broken with your patch.
>>
>> These ones should be reasonably easy to fix: no 'make check' involved.
>>
>> For instance if you configure GCC with:
>> --target arm-none-eabi --enable-multilib --with-multilib-list=rmprofile
>> you should see the build failure.
> 
> So far, I have not found a cause for the build failures you are seeing.
> The ARM toolchain script I was using before did build with the
> 'rmprofile' option.  With my current configure options, gcc builds
> 'rmprofile', 'aprofile', and even 'armeb'.  I did find a number of link
> issues with 'make check' due to incorrect usage of the 'L_'  defines in
> LIB1ASMFUNCS.  These are fixed in the new version attached.
> 
> Returning to the build failures you logged, I do consistently see this
> message in the logs [1]: "fatal error: cm0/fplib.h: No such file or
> directory".  I recognize the file, since it's one of the new files in
> my patch (the full sub-directory is libgcc/config/arm/cm0/fplib.h).
> Do I have to format patches in some different way so that new files
> get created?
> 
> Regression testing also showed that the previous patch was failing the
> "arm/divzero" test because I wasn't providing the same arguments to
> div0() as the existing implementation.  Having made that change, I think
> the patch is clean.  (I don't think there is a strict specification for
> div0(), and the changes add a non-trivial number of instructions, but
> I'll hold that discussion for another time).
> 
> Do you have time to re-check this patch on your build system?
> 
> Thanks,
> Daniel
> 
> [1] Line 36054: <https://people.linaro.org/~christophe.lyon/cross-validation/gcc-test-patches/r11-5993-g159b0bd9ce263dfb791eff5133b0ca0207201c84-cortex-m0-fplib-20201130.patch2/arm-none-eabi/build-rh70-arm-none-eabi-default-default-default-mthumb.-mcpu=cortex-m0.-mfloat-abi=soft.-march=armv6s-m.log.xz>
> 
>>
>> HTH
>>
>> Christophe
>>
>>> I did see similar errors once before.  It turned out then that I omitted
>>> one of the ".S" files from the build.  My interpretation at that point
>>> was that gcc had been searching multiple versions of "libgcc.a" and
>>> unable to merge the symbols.  In hindsight, that was a really bad
>>> interpretation.   I was able to reproduce the error above by simply
>>> adding a line like "volatile double m = 1.0; m += 2;".
>>>
>>> After reviewing the existing asm implementations more closely, I
>>> believe that I have not been using the function guard macros (L_arm_*)
>>> as intended.  The make script appears to compile "lib1funcs.S" dozens of
>>> times -- once for each function guard macro listed in LIB1ASMFUNCS --
>>> with the intent of generating a separate ".o" file for each function.
>>> Because they were unguarded, my new library functions were duplicated
>>> into every ".o" file, which caused the link errors you saw.
>>>
>>> I have attached an updated patch that implements the macros.
>>>
>>> However, I'm not sure whether my usage is really consistent with the
>>> spirit of the make script.  If there's a README or HOWTO, I haven't
>>> found it yet.  The following points summarize my concerns as I was
>>> making these updates:
>>>
>>> 1.  While some of the new functions (e.g. __cmpsf2) are standalone,
>>>     there is a common core in the new library shared by several related
>>>     functions.  That keeps the library small.  For now, I've elected to
>>>     group all of these related functions together in a single object
>>>     file "_arm_addsubsf3.o" to protect the short branches (+/-2KB)
>>>     within this unit.  Notice that I manually assigned section names in
>>>     the code, so there still shouldn't be any unnecessary code linked in
>>>     the final build.  Does the multiple-".o" files strategy predate "-gc-
>>>     sections", or should I be trying harder to break these related
>>>     functions into separate compilation units?
>>>
>>> 2.  I introduced a few new macro keywords for functions/groups (e.g.
>>>     "_arm_f2h" and '_arm_f2h'.  My assumption is that some empty ".o"
>>>     files compiled for the non-v6m architectures will be benign.
>>>
>>> 3.  The "t-elf" make script implies that __mulsf3() should not be
>>>     compiled in thumb mode (it's inside a conditional), but this is one
>>>     of the new functions.  Moot for now, since my __mulsf3() is grouped
>>>     with the common core functions (see point 1) and is thus currently
>>>     guarded by the "_arm_addsubsf3.o" macro.
>>>
>>> 4.  The advice (in "ieee754-sf.S") regarding WEAK symbols does not seem
>>>     to be working.  I have defined __clzsi2() as a weak symbol to be
>>>     overridden by the combined function __clzdi2().  I can also see
>>>     (with "nm") that "clzsi2.o" is compiled before "clzdi2.o" in
>>>     "libgcc.a".  Yet, the full __clzdi2() function (8 bytes larger) is
>>>     always linked, even in programs that only call __clzsi2(),  A minor
>>>     annoyance at this point.
>>>
>>> 5.  Is there a permutation of the makefile that compiles libgcc with
>>>     __OPTIMIZE_SIZE__?  There are a few sections in the patch that can
>>>     optimize either way, yet the final product only seems to have the
>>>     "fast" code.  At this optimization level, the sample program above
>>>     pulls in 1012 bytes of library code instead of 836. Perhaps this is
>>>     meant to be controlled by the toolchain configuration step, but it
>>>     doesn't follow that the optimization for the cross-compiler would
>>>     automatically translate to the target runtime libraries.
>>>
>>> Thanks again,
>>> Daniel
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Christophe
>>>>
>>>>>
>>>>> I'm naturally hoping for some action on this patch before the Nov 16th deadline for GCC-11 stage 3.  Please review and advise.
>>>>>
>>>>> Thanks,
>>>>> Daniel Engel
>>>>>
>>>>> [1] http://www.netlib.org/fp/ucbtest.tgz
>>>>> [2] http://www.jhauser.us/arithmetic/TestFloat.html
>>>>> [3] http://win-www.uia.ac.be/u/cant/ieeecc754.html
>>>>
>>
> 

Thanks for working on this, Daniel.

This is clearly stage1 material, so we've got time for a couple of
iterations to sort things out.

Firstly, the patch is very large, but contains a large number of
distinct changes, so it would really benefit from being broken down into
a number of distinct patches.  This will make reviewing the individual
changes much more straight-forward.  I'd suggest:

1) Some basic makefile cleanups to ease initial integration - in
particular where we have things like

LIB1FUNCS += <long list of functions>

that this be rewritten with one function per line (and sorted
alphabetically) - then we can see which functions are being changed in
subsequent patches.  It makes the Makefile fragments longer, but the
improvement in clarity for makes this worthwhile.

2) The changes for the existing integer functions - preferably one
function per patch.

3) The new integer functions that you're adding

4) The floating-point support.

Some more general observations:

- where functions are already in lib1funcs.asm, please leave them there.
- lets avoid having the cm0 subdirectory - in particular we should do
this when there is existing code for other targets in the same source
files.  It's OK to have any new files in the main 'arm' directory of the
source tree - just name the files appropriately if really needed.
- let's avoid the CM0 prefix - this is 'thumb1' code, for want of a
better term, and that is used widely elsewhere in the compiler.  So if
you really need a term just use THUMB1, or even T1.
- For the 64-bit shift functions, I expect the existing code to be
preferable whenever possible - I think it can even be tweaked to build
for thumb2 by inserting a suitable IT instruction.  So your code should
only be used when

 #if !__ARM_ARCH_ISA_ARM && __ARM_ARCH_ISA_THUMB == 1

- most, if not all, of your LSYM symbols should not be needed after
assembly, so should start with a captial 'L' (and no leading
underscores), the assembler will then automatically discard any that are
not needed for relocations.
- you'll need to write suitable commit messages for each patch, which
also contain a suitable ChangeLog style entry.
- finally, your popcount implementations have data in the code segment.
 That's going to cause problems when we have compilation options such as
-mpure-code.

I strongly suggest that, rather than using gcc snapshots (I'm assuming
this based on the diff style and directory naming in your patches), you
switch to using a git tree, then you'll be able to use tools such as
rebasing and the git posting tools to send the patch series for
subsequent review.

Richard.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2021-01-06 17:05         ` Richard Earnshaw
@ 2021-01-07  0:59           ` Daniel Engel
  2021-01-07 12:56             ` Richard Earnshaw
  0 siblings, 1 reply; 26+ messages in thread
From: Daniel Engel @ 2021-01-07  0:59 UTC (permalink / raw)
  To: Richard Earnshaw, Christophe Lyon; +Cc: gcc Patches

--snip--

On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote:

> 
> Thanks for working on this, Daniel.
> 
> This is clearly stage1 material, so we've got time for a couple of
> iterations to sort things out.

I appreciate your feedback.  I had been hoping that with no regressions
this might still be eligible for stage2.  Christophe never indicated
either way. but the fact that he was looking at it seemed positive.
I thought I would be a couple of weeks faster with this last
iteration, but holidays got in the way.

I actually think your comments below could all be addressable within a
couple of days.  But, I'm not accounting for the review process.
 
> Firstly, the patch is very large, but contains a large number of
> distinct changes, so it would really benefit from being broken down into
> a number of distinct patches.  This will make reviewing the individual
> changes much more straight-forward.  

I have no context for "large" or "small" with respect to gcc.  This
patch comprises about 30% of a previously-monolithic library that's
been shipping since ~2016 (the rest is libm material).  Other than
(1) the aforementioned change to div0(), (2) a nascent adaptation
for __truncdfsf2() (not enabled), and (3) the gratuitous addition of
the bitwise functions, the library remains pretty much as it was
originally released.

The driving force in the development of this library was small size,
which of course was never possible with the softfp routines.  It's not
half-slow, either, for the limitations of the M0 architecture.   And,
it's IEEE compliant.  But, that means that most of the functions are
highly interconnected.  So, some of it can be broken up as you outline
below, but that last patch is still worth more than half of the total.

I also have ~70k lines of test vectors that seem mostly redundant, but
not completely.  I haven't decided what to do here.  For example, I have
coverage for __aeabi_u/ldivmod, while GCC does not.  If I do anything
with this code it will be in a separate thread.

> I'd suggest:
> 
> 1) Some basic makefile cleanups to ease initial integration - in
> particular where we have things like
> 
> LIB1FUNCS += <long list of functions>
> 
> that this be rewritten with one function per line (and sorted
> alphabetically) - then we can see which functions are being changed in
> subsequent patches.  It makes the Makefile fragments longer, but the
> improvement in clarity for makes this worthwhile.

I know next to nothing about Makefiles, particularly ones as complex as
GCC's.  I was just trying to work with the existing style to avoid
breaking something.  However, I can certainly adopt this suggestion.
 
> 2) The changes for the existing integer functions - preferably one
> function per patch.
> 
> 3) The new integer functions that you're adding

These wouldn't be too hard to do, but what are the expectations for
testing?  A clean build of GCC takes about 6 hours in my VM, and
regression testing takes about 4 hours per architecture.  You would want
a full regression report for each incremental patch?  I have no idea how
to target regression tests that apply to particular runtime functions
without the risk of missing something.

> 4) The floating-point support.
> 
> Some more general observations:
> 
> - where functions are already in lib1funcs.asm, please leave them there.

I guess I have a different vision here.  I have had a really hard time
following all of the nested #ifdefs in lib1funcs, so I thought it would
be helpful to begin breaking it up into logical units.

The functions removed were all functions for which I had THUMB1
sequences faster/smaller than lib1funcs:  __clzsi2, __clzdi2, __ctzsi2,
__ashrdi3, __lshrdi3, __ashldi3.  In fact, the new THUMB1 of __clzsi2 is
the same number of instructions as the previous ARM/THUMB2 version.

You will find all of the previous ARM versions of these functions merged
into the new files (with attribution) and the same preprocessor
selection path.  So no architecture variant should be any worse off than
before this patch, and some beyond v6m should benefit.

In the future, I think that my versions of __divsi3 and __divdi3 will
prove faster/smaller than the existing THUMB2 versions.  I know that my
float routines are less than half the compiled size of THUMB2 versions
in 'ieee754-sf.S'.  However, I haven't profiled the exact performance
differences so I have left all this work for future patches. (It's also
quite likely that my version can be further-refined with a few judicious
uses of THUMB2 alternatives.)

My long-term vision would be use lib1funcs as an architectural wrapper
distinct from the implementation code.

> - lets avoid having the cm0 subdirectory - in particular we should do
> this when there is existing code for other targets in the same source
> files.  It's OK to have any new files in the main 'arm' directory of the
> source tree - just name the files appropriately if really needed.

Fair point on the name.  In v1 of this patch, all these files were all
preprocessor-selected for v6m only.  However, as I've stumbled through
the finer points of integration, that line has blurred.  Name aside,
the subdirectory does still represent a standalone library.   I think
I've managed to add enough integration hooks that it works well in
a libgcc context, but it still has a very distinct implementation style.

I don't have a strong opinion on this, just preference.  But, keeping
the subdirectory with a neutral name will probably make maintenance
easier in the short term.  I would suggest "lib0" (since it caters to
the lowest common denominator) or "eabi" (since that was the original
target).  There are precedents in other architectures (e.g. avr).

> - let's avoid the CM0 prefix - this is 'thumb1' code, for want of a
> better term, and that is used widely elsewhere in the compiler.  So if
> you really need a term just use THUMB1, or even T1.

Maybe.  The Cortex M0 includes a subset of THUMB2 instructions.  Most
of this is probably THUMB1 clean, but it wasn't a design requirement.

The CM0_FUNC_START exists so that I can specify subsections of ".text"
for each function.  This was a fairly fundamental design decision that
allowed me to make a number of branch optimizations between functions.
The other macros are just duplicates for naming symmetry.

The existing  FUNC_START macro inserts extra conflicting ".text"
directives that would break the build.  Of course, the prefix was
arbitrary; I just took CM0 from the library name.  But, there's nothing
architecturally significant about this macro at all, so THUMB1 and T1
seems just about as wrong.  Maybe define a FUNC_START_SECTION macro with
two parameters? For example:

    FUNC_START_SECTION clzdi2 .text.sorted.libgcc.clz2.clzdi2

Instead of: 

    .section .text.sorted.libgcc.clz2.clzdi2,"x"
    CM0_FUNC_START clzdi2

> - For the 64-bit shift functions, I expect the existing code to be
> preferable whenever possible - I think it can even be tweaked to build
> for thumb2 by inserting a suitable IT instruction.  So your code should
> only be used when
> 
>  #if !__ARM_ARCH_ISA_ARM && __ARM_ARCH_ISA_THUMB == 1

That is the definition of NOT_ISA_TARGET_32BIT, which I am respecting.
(The name doesn't seem quite right for Cortex M0, since it does support
some 32 bit instructions, but that's beside the point.)

The current lib1funcs ARM code path still exists, as described above. My
THUMB1 implementations were 1 - 3 instructions shorter than the current
versions, which is why I took the time to merge the files.

Unfortunately, the Cortex M0 THUMB2 subset does not provide IT.  I don't
see an advantage to eliminating the branch unless these functions were
written with cryptographic side channel attacks in mind.

> - most, if not all, of your LSYM symbols should not be needed after
> assembly, so should start with a captial 'L' (and no leading
> underscores), the assembler will then automatically discard any that are
> not needed for relocations.

You don't want debugging symbols for libgcc internals :) ?  I sort of
understand that, but those symbols have been useful to me in the past.
The "." by itself seems to keep visibility local, so the extra symbols
won't cause linker issuess. Would you object to a macro variant (e.g.
LLSYM) that prepends the "L" but is easier to disable?

> - you'll need to write suitable commit messages for each patch, which
> also contain a suitable ChangeLog style entry.

OK.

> - finally, your popcount implementations have data in the code segment.
>  That's going to cause problems when we have compilation options such as
> -mpure-code.

I am just following the precedent of existing lib1funcs (e.g. __clz2si).
If this matters, you'll need to point in the right direction for the
fix.  I'm not sure it does matter, since these functions are PIC anyway.

> I strongly suggest that, rather than using gcc snapshots (I'm assuming
> this based on the diff style and directory naming in your patches), you
> switch to using a git tree, then you'll be able to use tools such as
> rebasing and the git posting tools to send the patch series for
> subsequent review.

Your assumption is correct.  I didn't think that I would have to get so
deep into the gcc development process for this contribution.  I used
this library as a bare metal alternative for libgcc/libm in the product
for years, so I thought it would just drop in.  But, the libgcc compile
mechanics have proved much more 'interesting'. I'm assuming this
architecture was created years before the introduction of -ffunction-
sections...

>
> Richard.
>

Thanks again,
Daniel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2021-01-07  0:59           ` Daniel Engel
@ 2021-01-07 12:56             ` Richard Earnshaw
  2021-01-07 13:27               ` Christophe Lyon
  2021-01-09 12:28               ` Daniel Engel
  0 siblings, 2 replies; 26+ messages in thread
From: Richard Earnshaw @ 2021-01-07 12:56 UTC (permalink / raw)
  To: Daniel Engel, Christophe Lyon; +Cc: gcc Patches

On 07/01/2021 00:59, Daniel Engel wrote:
> --snip--
> 
> On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote:
> 
>>
>> Thanks for working on this, Daniel.
>>
>> This is clearly stage1 material, so we've got time for a couple of
>> iterations to sort things out.
> 
> I appreciate your feedback.  I had been hoping that with no regressions
> this might still be eligible for stage2.  Christophe never indicated
> either way. but the fact that he was looking at it seemed positive.
> I thought I would be a couple of weeks faster with this last
> iteration, but holidays got in the way.

GCC doesn't have a stage 2 any more (historical wart).  We were in
(late) stage3 when this was first posted, and because of the significant
impact this might have on not just CM0 but other targets as well, I
don't think it's something we should try to squeeze in at the last
minute.  We're now in stage 4, so that is doubly the case.

Christophe is a very valuable member of our community, but he's not a
port maintainer and thus cannot really rule on what can go into the
tools, or when.

> 
> I actually think your comments below could all be addressable within a
> couple of days.  But, I'm not accounting for the review process.
>  
>> Firstly, the patch is very large, but contains a large number of
>> distinct changes, so it would really benefit from being broken down into
>> a number of distinct patches.  This will make reviewing the individual
>> changes much more straight-forward.  
> 
> I have no context for "large" or "small" with respect to gcc.  This
> patch comprises about 30% of a previously-monolithic library that's
> been shipping since ~2016 (the rest is libm material).  Other than
> (1) the aforementioned change to div0(), (2) a nascent adaptation
> for __truncdfsf2() (not enabled), and (3) the gratuitous addition of
> the bitwise functions, the library remains pretty much as it was
> originally released.

Large, like many other terms is relative.  For assembler file changes,
which this is primarily, the overall size can be much smaller and still
be considered 'large'.

> 
> The driving force in the development of this library was small size,
> which of course was never possible with the softfp routines.  It's not
> half-slow, either, for the limitations of the M0 architecture.   And,
> it's IEEE compliant.  But, that means that most of the functions are
> highly interconnected.  So, some of it can be broken up as you outline
> below, but that last patch is still worth more than half of the total.

Nevertheless, having the floating-point code separated out will make
reviewing more straight forward.  I'll likely need to ask one of our FP
experts to have a specific look at that part and that will be easier if
it is disentangled from the other changes.

> 
> I also have ~70k lines of test vectors that seem mostly redundant, but
> not completely.  I haven't decided what to do here.  For example, I have
> coverage for __aeabi_u/ldivmod, while GCC does not.  If I do anything
> with this code it will be in a separate thread.

Publishing the test code, even if it isn't integrated into the GCC
testsuite would be useful.  Perhaps someone else could then help with that.

> 
>> I'd suggest:
>>
>> 1) Some basic makefile cleanups to ease initial integration - in
>> particular where we have things like
>>
>> LIB1FUNCS += <long list of functions>
>>
>> that this be rewritten with one function per line (and sorted
>> alphabetically) - then we can see which functions are being changed in
>> subsequent patches.  It makes the Makefile fragments longer, but the
>> improvement in clarity for makes this worthwhile.
> 
> I know next to nothing about Makefiles, particularly ones as complex as
> GCC's.  I was just trying to work with the existing style to avoid
> breaking something.  However, I can certainly adopt this suggestion.
>  
>> 2) The changes for the existing integer functions - preferably one
>> function per patch.
>>
>> 3) The new integer functions that you're adding
> 
> These wouldn't be too hard to do, but what are the expectations for
> testing?  A clean build of GCC takes about 6 hours in my VM, and
> regression testing takes about 4 hours per architecture.  You would want
> a full regression report for each incremental patch?  I have no idea how
> to target regression tests that apply to particular runtime functions
> without the risk of missing something.
> 

Most of this can be tested in a cross-compile environment using qemu as
a model.  A cross build shouldn't take that long (especially if you
restrict the compiler to just C and C++ - other languages are
vanishingly unlikely to pick up errors in the parts of the compiler
you're changing).  But build checks will be mostly sufficient for most
of the intermediate patches.

>> 4) The floating-point support.
>>
>> Some more general observations:
>>
>> - where functions are already in lib1funcs.asm, please leave them there.
> 
> I guess I have a different vision here.  I have had a really hard time
> following all of the nested #ifdefs in lib1funcs, so I thought it would
> be helpful to begin breaking it up into logical units.

Agreed, it's not easy.  But the restructuring, if any, should be done
separately from other changes, not mixed up with them.

> 
> The functions removed were all functions for which I had THUMB1
> sequences faster/smaller than lib1funcs:  __clzsi2, __clzdi2, __ctzsi2,
> __ashrdi3, __lshrdi3, __ashldi3.  In fact, the new THUMB1 of __clzsi2 is
> the same number of instructions as the previous ARM/THUMB2 version.
> 
> You will find all of the previous ARM versions of these functions merged
> into the new files (with attribution) and the same preprocessor
> selection path.  So no architecture variant should be any worse off than
> before this patch, and some beyond v6m should benefit.
> 
> In the future, I think that my versions of __divsi3 and __divdi3 will
> prove faster/smaller than the existing THUMB2 versions.  I know that my
> float routines are less than half the compiled size of THUMB2 versions
> in 'ieee754-sf.S'.  However, I haven't profiled the exact performance
> differences so I have left all this work for future patches. (It's also
> quite likely that my version can be further-refined with a few judicious
> uses of THUMB2 alternatives.)
> 
> My long-term vision would be use lib1funcs as an architectural wrapper
> distinct from the implementation code.
> 
>> - lets avoid having the cm0 subdirectory - in particular we should do
>> this when there is existing code for other targets in the same source
>> files.  It's OK to have any new files in the main 'arm' directory of the
>> source tree - just name the files appropriately if really needed.
> 
> Fair point on the name.  In v1 of this patch, all these files were all
> preprocessor-selected for v6m only.  However, as I've stumbled through
> the finer points of integration, that line has blurred.  Name aside,
> the subdirectory does still represent a standalone library.   I think
> I've managed to add enough integration hooks that it works well in
> a libgcc context, but it still has a very distinct implementation style.
> 
> I don't have a strong opinion on this, just preference.  But, keeping
> the subdirectory with a neutral name will probably make maintenance
> easier in the short term.  I would suggest "lib0" (since it caters to
> the lowest common denominator) or "eabi" (since that was the original
> target).  There are precedents in other architectures (e.g. avr).

The issue here is that the selection of code from the various
subdirectories is not consistent.  In some cases we might be pulling in
a thumb1 implementation into a thumb2 environment, so having the code in
a directory that doesn't reflect this makes maintaining the code harder.
 I don't mind too much if some new files are introduced and their names
reflect both their function and the architecture they support - eg
t1-di-shift.S would obviously contain code for di-mode shifts in thumb1.

> 
>> - let's avoid the CM0 prefix - this is 'thumb1' code, for want of a
>> better term, and that is used widely elsewhere in the compiler.  So if
>> you really need a term just use THUMB1, or even T1.
> 
> Maybe.  The Cortex M0 includes a subset of THUMB2 instructions.  Most
> of this is probably THUMB1 clean, but it wasn't a design requirement.

It's particularly the Thumb1 issue, just more the name is for a specific
CPU which might cause confusion later.  v6m would be preferable to that
if there really is a dependency on the instructions that are not in the
original Thumb1 ISA.

> 
> The CM0_FUNC_START exists so that I can specify subsections of ".text"
> for each function.  This was a fairly fundamental design decision that
> allowed me to make a number of branch optimizations between functions.
> The other macros are just duplicates for naming symmetry.

This is something we'll have to get to during the main review of the
code - we used to have support for PE-COFF object files.  That might now
be obsolete, wince support is certainly deprecated - but we can't assume
that ELF is the only object format we'll ever have to support.

> 
> The existing  FUNC_START macro inserts extra conflicting ".text"
> directives that would break the build.  Of course, the prefix was
> arbitrary; I just took CM0 from the library name.  But, there's nothing
> architecturally significant about this macro at all, so THUMB1 and T1
> seems just about as wrong.  Maybe define a FUNC_START_SECTION macro with
> two parameters? For example:
> 
>     FUNC_START_SECTION clzdi2 .text.sorted.libgcc.clz2.clzdi2
> 
> Instead of: 
> 
>     .section .text.sorted.libgcc.clz2.clzdi2,"x"
>     CM0_FUNC_START clzdi2
> 
>> - For the 64-bit shift functions, I expect the existing code to be
>> preferable whenever possible - I think it can even be tweaked to build
>> for thumb2 by inserting a suitable IT instruction.  So your code should
>> only be used when
>>
>>  #if !__ARM_ARCH_ISA_ARM && __ARM_ARCH_ISA_THUMB == 1
> 
> That is the definition of NOT_ISA_TARGET_32BIT, which I am respecting.
> (The name doesn't seem quite right for Cortex M0, since it does support
> some 32 bit instructions, but that's beside the point.)

The terms Thumb1 and Thumb2 predate the arm-v8m architecture
specifications - even then the term Thumb1 was interpreted as "mostly
16-bit instructions" and thumb2 as "a mix of 16- and 32-bit".  Yes, the
16/32-bit spilt has become more blurred and that will likely continue in
future since the 16-bit encoding space is pretty full.

> 
> The current lib1funcs ARM code path still exists, as described above. My
> THUMB1 implementations were 1 - 3 instructions shorter than the current
> versions, which is why I took the time to merge the files.
> 
> Unfortunately, the Cortex M0 THUMB2 subset does not provide IT.  I don't
> see an advantage to eliminating the branch unless these functions were
> written with cryptographic side channel attacks in mind.

On high performance cores branches are predicted - if the branch is
predictable then the common path will be taken and the unneeded
instructions will never be used.  But library functions like this tend
to have very unpredictable values used for calling them, so it's much
less likely that the hardware will predict the right path - at this
point conditional instructions tend to win (especially if there aren't
very many of them) because the cost (on average) of not executing the
unneeded instructions is much lower than the cost (on average) of
unwinding the processor state to execute the other arm of the
conditional branch.

> 
>> - most, if not all, of your LSYM symbols should not be needed after
>> assembly, so should start with a captial 'L' (and no leading
>> underscores), the assembler will then automatically discard any that are
>> not needed for relocations.
> 
> You don't want debugging symbols for libgcc internals :) ?  I sort of
> understand that, but those symbols have been useful to me in the past.
> The "." by itself seems to keep visibility local, so the extra symbols
> won't cause linker issuess. Would you object to a macro variant (e.g.
> LLSYM) that prepends the "L" but is easier to disable?

It is a matter of taste, but I really prefer the local symbols to
disappear entirely once the file is compiled - it makes things like
backtrace gdb show the proper call heirarchy.

> 
>> - you'll need to write suitable commit messages for each patch, which
>> also contain a suitable ChangeLog style entry.
> 
> OK.
> 
>> - finally, your popcount implementations have data in the code segment.
>>  That's going to cause problems when we have compilation options such as
>> -mpure-code.
> 
> I am just following the precedent of existing lib1funcs (e.g. __clz2si).
> If this matters, you'll need to point in the right direction for the
> fix.  I'm not sure it does matter, since these functions are PIC anyway.

That might be a bug in the clz implementations - Christophe: Any thoughts?

> 
>> I strongly suggest that, rather than using gcc snapshots (I'm assuming
>> this based on the diff style and directory naming in your patches), you
>> switch to using a git tree, then you'll be able to use tools such as
>> rebasing and the git posting tools to send the patch series for
>> subsequent review.
> 
> Your assumption is correct.  I didn't think that I would have to get so
> deep into the gcc development process for this contribution.  I used
> this library as a bare metal alternative for libgcc/libm in the product
> for years, so I thought it would just drop in.  But, the libgcc compile
> mechanics have proved much more 'interesting'. I'm assuming this
> architecture was created years before the introduction of -ffunction-
> sections...
> 

I don't think I've time to write a history lesson, even if you wanted
it.  Suffice to say, this does date back to the days of a.out format
object files (with 4 relocation types, STABS debugging, and one code,
one data and one bss section).

>>
>> Richard.
>>
> 
> Thanks again,
> Daniel
> 

R.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2021-01-07 12:56             ` Richard Earnshaw
@ 2021-01-07 13:27               ` Christophe Lyon
  2021-01-07 16:44                 ` Richard Earnshaw
  2021-01-09 12:28               ` Daniel Engel
  1 sibling, 1 reply; 26+ messages in thread
From: Christophe Lyon @ 2021-01-07 13:27 UTC (permalink / raw)
  To: Richard Earnshaw; +Cc: Daniel Engel, gcc Patches

On Thu, 7 Jan 2021 at 13:56, Richard Earnshaw
<Richard.Earnshaw@foss.arm.com> wrote:
>
> On 07/01/2021 00:59, Daniel Engel wrote:
> > --snip--
> >
> > On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote:
> >
> >>
> >> Thanks for working on this, Daniel.
> >>
> >> This is clearly stage1 material, so we've got time for a couple of
> >> iterations to sort things out.
> >
> > I appreciate your feedback.  I had been hoping that with no regressions
> > this might still be eligible for stage2.  Christophe never indicated
> > either way. but the fact that he was looking at it seemed positive.
> > I thought I would be a couple of weeks faster with this last
> > iteration, but holidays got in the way.
>
> GCC doesn't have a stage 2 any more (historical wart).  We were in
> (late) stage3 when this was first posted, and because of the significant
> impact this might have on not just CM0 but other targets as well, I
> don't think it's something we should try to squeeze in at the last
> minute.  We're now in stage 4, so that is doubly the case.
>
> Christophe is a very valuable member of our community, but he's not a
> port maintainer and thus cannot really rule on what can go into the
> tools, or when.
>
> >
> > I actually think your comments below could all be addressable within a
> > couple of days.  But, I'm not accounting for the review process.
> >
> >> Firstly, the patch is very large, but contains a large number of
> >> distinct changes, so it would really benefit from being broken down into
> >> a number of distinct patches.  This will make reviewing the individual
> >> changes much more straight-forward.
> >
> > I have no context for "large" or "small" with respect to gcc.  This
> > patch comprises about 30% of a previously-monolithic library that's
> > been shipping since ~2016 (the rest is libm material).  Other than
> > (1) the aforementioned change to div0(), (2) a nascent adaptation
> > for __truncdfsf2() (not enabled), and (3) the gratuitous addition of
> > the bitwise functions, the library remains pretty much as it was
> > originally released.
>
> Large, like many other terms is relative.  For assembler file changes,
> which this is primarily, the overall size can be much smaller and still
> be considered 'large'.
>
> >
> > The driving force in the development of this library was small size,
> > which of course was never possible with the softfp routines.  It's not
> > half-slow, either, for the limitations of the M0 architecture.   And,
> > it's IEEE compliant.  But, that means that most of the functions are
> > highly interconnected.  So, some of it can be broken up as you outline
> > below, but that last patch is still worth more than half of the total.
>
> Nevertheless, having the floating-point code separated out will make
> reviewing more straight forward.  I'll likely need to ask one of our FP
> experts to have a specific look at that part and that will be easier if
> it is disentangled from the other changes.
>
> >
> > I also have ~70k lines of test vectors that seem mostly redundant, but
> > not completely.  I haven't decided what to do here.  For example, I have
> > coverage for __aeabi_u/ldivmod, while GCC does not.  If I do anything
> > with this code it will be in a separate thread.
>
> Publishing the test code, even if it isn't integrated into the GCC
> testsuite would be useful.  Perhaps someone else could then help with that.
>
> >
> >> I'd suggest:
> >>
> >> 1) Some basic makefile cleanups to ease initial integration - in
> >> particular where we have things like
> >>
> >> LIB1FUNCS += <long list of functions>
> >>
> >> that this be rewritten with one function per line (and sorted
> >> alphabetically) - then we can see which functions are being changed in
> >> subsequent patches.  It makes the Makefile fragments longer, but the
> >> improvement in clarity for makes this worthwhile.
> >
> > I know next to nothing about Makefiles, particularly ones as complex as
> > GCC's.  I was just trying to work with the existing style to avoid
> > breaking something.  However, I can certainly adopt this suggestion.
> >
> >> 2) The changes for the existing integer functions - preferably one
> >> function per patch.
> >>
> >> 3) The new integer functions that you're adding
> >
> > These wouldn't be too hard to do, but what are the expectations for
> > testing?  A clean build of GCC takes about 6 hours in my VM, and
> > regression testing takes about 4 hours per architecture.  You would want
> > a full regression report for each incremental patch?  I have no idea how
> > to target regression tests that apply to particular runtime functions
> > without the risk of missing something.
> >
>
> Most of this can be tested in a cross-compile environment using qemu as
> a model.  A cross build shouldn't take that long (especially if you
> restrict the compiler to just C and C++ - other languages are
> vanishingly unlikely to pick up errors in the parts of the compiler
> you're changing).  But build checks will be mostly sufficient for most
> of the intermediate patches.
>
> >> 4) The floating-point support.
> >>
> >> Some more general observations:
> >>
> >> - where functions are already in lib1funcs.asm, please leave them there.
> >
> > I guess I have a different vision here.  I have had a really hard time
> > following all of the nested #ifdefs in lib1funcs, so I thought it would
> > be helpful to begin breaking it up into logical units.
>
> Agreed, it's not easy.  But the restructuring, if any, should be done
> separately from other changes, not mixed up with them.
>
> >
> > The functions removed were all functions for which I had THUMB1
> > sequences faster/smaller than lib1funcs:  __clzsi2, __clzdi2, __ctzsi2,
> > __ashrdi3, __lshrdi3, __ashldi3.  In fact, the new THUMB1 of __clzsi2 is
> > the same number of instructions as the previous ARM/THUMB2 version.
> >
> > You will find all of the previous ARM versions of these functions merged
> > into the new files (with attribution) and the same preprocessor
> > selection path.  So no architecture variant should be any worse off than
> > before this patch, and some beyond v6m should benefit.
> >
> > In the future, I think that my versions of __divsi3 and __divdi3 will
> > prove faster/smaller than the existing THUMB2 versions.  I know that my
> > float routines are less than half the compiled size of THUMB2 versions
> > in 'ieee754-sf.S'.  However, I haven't profiled the exact performance
> > differences so I have left all this work for future patches. (It's also
> > quite likely that my version can be further-refined with a few judicious
> > uses of THUMB2 alternatives.)
> >
> > My long-term vision would be use lib1funcs as an architectural wrapper
> > distinct from the implementation code.
> >
> >> - lets avoid having the cm0 subdirectory - in particular we should do
> >> this when there is existing code for other targets in the same source
> >> files.  It's OK to have any new files in the main 'arm' directory of the
> >> source tree - just name the files appropriately if really needed.
> >
> > Fair point on the name.  In v1 of this patch, all these files were all
> > preprocessor-selected for v6m only.  However, as I've stumbled through
> > the finer points of integration, that line has blurred.  Name aside,
> > the subdirectory does still represent a standalone library.   I think
> > I've managed to add enough integration hooks that it works well in
> > a libgcc context, but it still has a very distinct implementation style.
> >
> > I don't have a strong opinion on this, just preference.  But, keeping
> > the subdirectory with a neutral name will probably make maintenance
> > easier in the short term.  I would suggest "lib0" (since it caters to
> > the lowest common denominator) or "eabi" (since that was the original
> > target).  There are precedents in other architectures (e.g. avr).
>
> The issue here is that the selection of code from the various
> subdirectories is not consistent.  In some cases we might be pulling in
> a thumb1 implementation into a thumb2 environment, so having the code in
> a directory that doesn't reflect this makes maintaining the code harder.
>  I don't mind too much if some new files are introduced and their names
> reflect both their function and the architecture they support - eg
> t1-di-shift.S would obviously contain code for di-mode shifts in thumb1.
>
> >
> >> - let's avoid the CM0 prefix - this is 'thumb1' code, for want of a
> >> better term, and that is used widely elsewhere in the compiler.  So if
> >> you really need a term just use THUMB1, or even T1.
> >
> > Maybe.  The Cortex M0 includes a subset of THUMB2 instructions.  Most
> > of this is probably THUMB1 clean, but it wasn't a design requirement.
>
> It's particularly the Thumb1 issue, just more the name is for a specific
> CPU which might cause confusion later.  v6m would be preferable to that
> if there really is a dependency on the instructions that are not in the
> original Thumb1 ISA.
>
> >
> > The CM0_FUNC_START exists so that I can specify subsections of ".text"
> > for each function.  This was a fairly fundamental design decision that
> > allowed me to make a number of branch optimizations between functions.
> > The other macros are just duplicates for naming symmetry.
>
> This is something we'll have to get to during the main review of the
> code - we used to have support for PE-COFF object files.  That might now
> be obsolete, wince support is certainly deprecated - but we can't assume
> that ELF is the only object format we'll ever have to support.
>
> >
> > The existing  FUNC_START macro inserts extra conflicting ".text"
> > directives that would break the build.  Of course, the prefix was
> > arbitrary; I just took CM0 from the library name.  But, there's nothing
> > architecturally significant about this macro at all, so THUMB1 and T1
> > seems just about as wrong.  Maybe define a FUNC_START_SECTION macro with
> > two parameters? For example:
> >
> >     FUNC_START_SECTION clzdi2 .text.sorted.libgcc.clz2.clzdi2
> >
> > Instead of:
> >
> >     .section .text.sorted.libgcc.clz2.clzdi2,"x"
> >     CM0_FUNC_START clzdi2
> >
> >> - For the 64-bit shift functions, I expect the existing code to be
> >> preferable whenever possible - I think it can even be tweaked to build
> >> for thumb2 by inserting a suitable IT instruction.  So your code should
> >> only be used when
> >>
> >>  #if !__ARM_ARCH_ISA_ARM && __ARM_ARCH_ISA_THUMB == 1
> >
> > That is the definition of NOT_ISA_TARGET_32BIT, which I am respecting.
> > (The name doesn't seem quite right for Cortex M0, since it does support
> > some 32 bit instructions, but that's beside the point.)
>
> The terms Thumb1 and Thumb2 predate the arm-v8m architecture
> specifications - even then the term Thumb1 was interpreted as "mostly
> 16-bit instructions" and thumb2 as "a mix of 16- and 32-bit".  Yes, the
> 16/32-bit spilt has become more blurred and that will likely continue in
> future since the 16-bit encoding space is pretty full.
>
> >
> > The current lib1funcs ARM code path still exists, as described above. My
> > THUMB1 implementations were 1 - 3 instructions shorter than the current
> > versions, which is why I took the time to merge the files.
> >
> > Unfortunately, the Cortex M0 THUMB2 subset does not provide IT.  I don't
> > see an advantage to eliminating the branch unless these functions were
> > written with cryptographic side channel attacks in mind.
>
> On high performance cores branches are predicted - if the branch is
> predictable then the common path will be taken and the unneeded
> instructions will never be used.  But library functions like this tend
> to have very unpredictable values used for calling them, so it's much
> less likely that the hardware will predict the right path - at this
> point conditional instructions tend to win (especially if there aren't
> very many of them) because the cost (on average) of not executing the
> unneeded instructions is much lower than the cost (on average) of
> unwinding the processor state to execute the other arm of the
> conditional branch.
>
> >
> >> - most, if not all, of your LSYM symbols should not be needed after
> >> assembly, so should start with a captial 'L' (and no leading
> >> underscores), the assembler will then automatically discard any that are
> >> not needed for relocations.
> >
> > You don't want debugging symbols for libgcc internals :) ?  I sort of
> > understand that, but those symbols have been useful to me in the past.
> > The "." by itself seems to keep visibility local, so the extra symbols
> > won't cause linker issuess. Would you object to a macro variant (e.g.
> > LLSYM) that prepends the "L" but is easier to disable?
>
> It is a matter of taste, but I really prefer the local symbols to
> disappear entirely once the file is compiled - it makes things like
> backtrace gdb show the proper call heirarchy.
>
> >
> >> - you'll need to write suitable commit messages for each patch, which
> >> also contain a suitable ChangeLog style entry.
> >
> > OK.
> >
> >> - finally, your popcount implementations have data in the code segment.
> >>  That's going to cause problems when we have compilation options such as
> >> -mpure-code.
> >
> > I am just following the precedent of existing lib1funcs (e.g. __clz2si).
> > If this matters, you'll need to point in the right direction for the
> > fix.  I'm not sure it does matter, since these functions are PIC anyway.
>
> That might be a bug in the clz implementations - Christophe: Any thoughts?
>

Indeed that looks suspicious. I'm wondering why I saw no problem during testing.
Is it possible that __clzsi2 is not covered by GCC's 'make check' ?

> >
> >> I strongly suggest that, rather than using gcc snapshots (I'm assuming
> >> this based on the diff style and directory naming in your patches), you
> >> switch to using a git tree, then you'll be able to use tools such as
> >> rebasing and the git posting tools to send the patch series for
> >> subsequent review.
> >
> > Your assumption is correct.  I didn't think that I would have to get so
> > deep into the gcc development process for this contribution.  I used
> > this library as a bare metal alternative for libgcc/libm in the product
> > for years, so I thought it would just drop in.  But, the libgcc compile
> > mechanics have proved much more 'interesting'. I'm assuming this
> > architecture was created years before the introduction of -ffunction-
> > sections...
> >
>
> I don't think I've time to write a history lesson, even if you wanted
> it.  Suffice to say, this does date back to the days of a.out format
> object files (with 4 relocation types, STABS debugging, and one code,
> one data and one bss section).
>
> >>
> >> Richard.
> >>
> >
> > Thanks again,
> > Daniel
> >
>
> R.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2021-01-07 13:27               ` Christophe Lyon
@ 2021-01-07 16:44                 ` Richard Earnshaw
  0 siblings, 0 replies; 26+ messages in thread
From: Richard Earnshaw @ 2021-01-07 16:44 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: Daniel Engel, gcc Patches

On 07/01/2021 13:27, Christophe Lyon via Gcc-patches wrote:
> On Thu, 7 Jan 2021 at 13:56, Richard Earnshaw
> <Richard.Earnshaw@foss.arm.com> wrote:
>>
>> On 07/01/2021 00:59, Daniel Engel wrote:
>>> --snip--
>>>
>>> On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote:
>>>
>>>>
>>>> Thanks for working on this, Daniel.
>>>>
>>>> This is clearly stage1 material, so we've got time for a couple of
>>>> iterations to sort things out.
>>>
>>> I appreciate your feedback.  I had been hoping that with no regressions
>>> this might still be eligible for stage2.  Christophe never indicated
>>> either way. but the fact that he was looking at it seemed positive.
>>> I thought I would be a couple of weeks faster with this last
>>> iteration, but holidays got in the way.
>>
>> GCC doesn't have a stage 2 any more (historical wart).  We were in
>> (late) stage3 when this was first posted, and because of the significant
>> impact this might have on not just CM0 but other targets as well, I
>> don't think it's something we should try to squeeze in at the last
>> minute.  We're now in stage 4, so that is doubly the case.
>>
>> Christophe is a very valuable member of our community, but he's not a
>> port maintainer and thus cannot really rule on what can go into the
>> tools, or when.
>>
>>>
>>> I actually think your comments below could all be addressable within a
>>> couple of days.  But, I'm not accounting for the review process.
>>>
>>>> Firstly, the patch is very large, but contains a large number of
>>>> distinct changes, so it would really benefit from being broken down into
>>>> a number of distinct patches.  This will make reviewing the individual
>>>> changes much more straight-forward.
>>>
>>> I have no context for "large" or "small" with respect to gcc.  This
>>> patch comprises about 30% of a previously-monolithic library that's
>>> been shipping since ~2016 (the rest is libm material).  Other than
>>> (1) the aforementioned change to div0(), (2) a nascent adaptation
>>> for __truncdfsf2() (not enabled), and (3) the gratuitous addition of
>>> the bitwise functions, the library remains pretty much as it was
>>> originally released.
>>
>> Large, like many other terms is relative.  For assembler file changes,
>> which this is primarily, the overall size can be much smaller and still
>> be considered 'large'.
>>
>>>
>>> The driving force in the development of this library was small size,
>>> which of course was never possible with the softfp routines.  It's not
>>> half-slow, either, for the limitations of the M0 architecture.   And,
>>> it's IEEE compliant.  But, that means that most of the functions are
>>> highly interconnected.  So, some of it can be broken up as you outline
>>> below, but that last patch is still worth more than half of the total.
>>
>> Nevertheless, having the floating-point code separated out will make
>> reviewing more straight forward.  I'll likely need to ask one of our FP
>> experts to have a specific look at that part and that will be easier if
>> it is disentangled from the other changes.
>>
>>>
>>> I also have ~70k lines of test vectors that seem mostly redundant, but
>>> not completely.  I haven't decided what to do here.  For example, I have
>>> coverage for __aeabi_u/ldivmod, while GCC does not.  If I do anything
>>> with this code it will be in a separate thread.
>>
>> Publishing the test code, even if it isn't integrated into the GCC
>> testsuite would be useful.  Perhaps someone else could then help with that.
>>
>>>
>>>> I'd suggest:
>>>>
>>>> 1) Some basic makefile cleanups to ease initial integration - in
>>>> particular where we have things like
>>>>
>>>> LIB1FUNCS += <long list of functions>
>>>>
>>>> that this be rewritten with one function per line (and sorted
>>>> alphabetically) - then we can see which functions are being changed in
>>>> subsequent patches.  It makes the Makefile fragments longer, but the
>>>> improvement in clarity for makes this worthwhile.
>>>
>>> I know next to nothing about Makefiles, particularly ones as complex as
>>> GCC's.  I was just trying to work with the existing style to avoid
>>> breaking something.  However, I can certainly adopt this suggestion.
>>>
>>>> 2) The changes for the existing integer functions - preferably one
>>>> function per patch.
>>>>
>>>> 3) The new integer functions that you're adding
>>>
>>> These wouldn't be too hard to do, but what are the expectations for
>>> testing?  A clean build of GCC takes about 6 hours in my VM, and
>>> regression testing takes about 4 hours per architecture.  You would want
>>> a full regression report for each incremental patch?  I have no idea how
>>> to target regression tests that apply to particular runtime functions
>>> without the risk of missing something.
>>>
>>
>> Most of this can be tested in a cross-compile environment using qemu as
>> a model.  A cross build shouldn't take that long (especially if you
>> restrict the compiler to just C and C++ - other languages are
>> vanishingly unlikely to pick up errors in the parts of the compiler
>> you're changing).  But build checks will be mostly sufficient for most
>> of the intermediate patches.
>>
>>>> 4) The floating-point support.
>>>>
>>>> Some more general observations:
>>>>
>>>> - where functions are already in lib1funcs.asm, please leave them there.
>>>
>>> I guess I have a different vision here.  I have had a really hard time
>>> following all of the nested #ifdefs in lib1funcs, so I thought it would
>>> be helpful to begin breaking it up into logical units.
>>
>> Agreed, it's not easy.  But the restructuring, if any, should be done
>> separately from other changes, not mixed up with them.
>>
>>>
>>> The functions removed were all functions for which I had THUMB1
>>> sequences faster/smaller than lib1funcs:  __clzsi2, __clzdi2, __ctzsi2,
>>> __ashrdi3, __lshrdi3, __ashldi3.  In fact, the new THUMB1 of __clzsi2 is
>>> the same number of instructions as the previous ARM/THUMB2 version.
>>>
>>> You will find all of the previous ARM versions of these functions merged
>>> into the new files (with attribution) and the same preprocessor
>>> selection path.  So no architecture variant should be any worse off than
>>> before this patch, and some beyond v6m should benefit.
>>>
>>> In the future, I think that my versions of __divsi3 and __divdi3 will
>>> prove faster/smaller than the existing THUMB2 versions.  I know that my
>>> float routines are less than half the compiled size of THUMB2 versions
>>> in 'ieee754-sf.S'.  However, I haven't profiled the exact performance
>>> differences so I have left all this work for future patches. (It's also
>>> quite likely that my version can be further-refined with a few judicious
>>> uses of THUMB2 alternatives.)
>>>
>>> My long-term vision would be use lib1funcs as an architectural wrapper
>>> distinct from the implementation code.
>>>
>>>> - lets avoid having the cm0 subdirectory - in particular we should do
>>>> this when there is existing code for other targets in the same source
>>>> files.  It's OK to have any new files in the main 'arm' directory of the
>>>> source tree - just name the files appropriately if really needed.
>>>
>>> Fair point on the name.  In v1 of this patch, all these files were all
>>> preprocessor-selected for v6m only.  However, as I've stumbled through
>>> the finer points of integration, that line has blurred.  Name aside,
>>> the subdirectory does still represent a standalone library.   I think
>>> I've managed to add enough integration hooks that it works well in
>>> a libgcc context, but it still has a very distinct implementation style.
>>>
>>> I don't have a strong opinion on this, just preference.  But, keeping
>>> the subdirectory with a neutral name will probably make maintenance
>>> easier in the short term.  I would suggest "lib0" (since it caters to
>>> the lowest common denominator) or "eabi" (since that was the original
>>> target).  There are precedents in other architectures (e.g. avr).
>>
>> The issue here is that the selection of code from the various
>> subdirectories is not consistent.  In some cases we might be pulling in
>> a thumb1 implementation into a thumb2 environment, so having the code in
>> a directory that doesn't reflect this makes maintaining the code harder.
>>  I don't mind too much if some new files are introduced and their names
>> reflect both their function and the architecture they support - eg
>> t1-di-shift.S would obviously contain code for di-mode shifts in thumb1.
>>
>>>
>>>> - let's avoid the CM0 prefix - this is 'thumb1' code, for want of a
>>>> better term, and that is used widely elsewhere in the compiler.  So if
>>>> you really need a term just use THUMB1, or even T1.
>>>
>>> Maybe.  The Cortex M0 includes a subset of THUMB2 instructions.  Most
>>> of this is probably THUMB1 clean, but it wasn't a design requirement.
>>
>> It's particularly the Thumb1 issue, just more the name is for a specific
>> CPU which might cause confusion later.  v6m would be preferable to that
>> if there really is a dependency on the instructions that are not in the
>> original Thumb1 ISA.
>>
>>>
>>> The CM0_FUNC_START exists so that I can specify subsections of ".text"
>>> for each function.  This was a fairly fundamental design decision that
>>> allowed me to make a number of branch optimizations between functions.
>>> The other macros are just duplicates for naming symmetry.
>>
>> This is something we'll have to get to during the main review of the
>> code - we used to have support for PE-COFF object files.  That might now
>> be obsolete, wince support is certainly deprecated - but we can't assume
>> that ELF is the only object format we'll ever have to support.
>>
>>>
>>> The existing  FUNC_START macro inserts extra conflicting ".text"
>>> directives that would break the build.  Of course, the prefix was
>>> arbitrary; I just took CM0 from the library name.  But, there's nothing
>>> architecturally significant about this macro at all, so THUMB1 and T1
>>> seems just about as wrong.  Maybe define a FUNC_START_SECTION macro with
>>> two parameters? For example:
>>>
>>>     FUNC_START_SECTION clzdi2 .text.sorted.libgcc.clz2.clzdi2
>>>
>>> Instead of:
>>>
>>>     .section .text.sorted.libgcc.clz2.clzdi2,"x"
>>>     CM0_FUNC_START clzdi2
>>>
>>>> - For the 64-bit shift functions, I expect the existing code to be
>>>> preferable whenever possible - I think it can even be tweaked to build
>>>> for thumb2 by inserting a suitable IT instruction.  So your code should
>>>> only be used when
>>>>
>>>>  #if !__ARM_ARCH_ISA_ARM && __ARM_ARCH_ISA_THUMB == 1
>>>
>>> That is the definition of NOT_ISA_TARGET_32BIT, which I am respecting.
>>> (The name doesn't seem quite right for Cortex M0, since it does support
>>> some 32 bit instructions, but that's beside the point.)
>>
>> The terms Thumb1 and Thumb2 predate the arm-v8m architecture
>> specifications - even then the term Thumb1 was interpreted as "mostly
>> 16-bit instructions" and thumb2 as "a mix of 16- and 32-bit".  Yes, the
>> 16/32-bit spilt has become more blurred and that will likely continue in
>> future since the 16-bit encoding space is pretty full.
>>
>>>
>>> The current lib1funcs ARM code path still exists, as described above. My
>>> THUMB1 implementations were 1 - 3 instructions shorter than the current
>>> versions, which is why I took the time to merge the files.
>>>
>>> Unfortunately, the Cortex M0 THUMB2 subset does not provide IT.  I don't
>>> see an advantage to eliminating the branch unless these functions were
>>> written with cryptographic side channel attacks in mind.
>>
>> On high performance cores branches are predicted - if the branch is
>> predictable then the common path will be taken and the unneeded
>> instructions will never be used.  But library functions like this tend
>> to have very unpredictable values used for calling them, so it's much
>> less likely that the hardware will predict the right path - at this
>> point conditional instructions tend to win (especially if there aren't
>> very many of them) because the cost (on average) of not executing the
>> unneeded instructions is much lower than the cost (on average) of
>> unwinding the processor state to execute the other arm of the
>> conditional branch.
>>
>>>
>>>> - most, if not all, of your LSYM symbols should not be needed after
>>>> assembly, so should start with a captial 'L' (and no leading
>>>> underscores), the assembler will then automatically discard any that are
>>>> not needed for relocations.
>>>
>>> You don't want debugging symbols for libgcc internals :) ?  I sort of
>>> understand that, but those symbols have been useful to me in the past.
>>> The "." by itself seems to keep visibility local, so the extra symbols
>>> won't cause linker issuess. Would you object to a macro variant (e.g.
>>> LLSYM) that prepends the "L" but is easier to disable?
>>
>> It is a matter of taste, but I really prefer the local symbols to
>> disappear entirely once the file is compiled - it makes things like
>> backtrace gdb show the proper call heirarchy.
>>
>>>
>>>> - you'll need to write suitable commit messages for each patch, which
>>>> also contain a suitable ChangeLog style entry.
>>>
>>> OK.
>>>
>>>> - finally, your popcount implementations have data in the code segment.
>>>>  That's going to cause problems when we have compilation options such as
>>>> -mpure-code.
>>>
>>> I am just following the precedent of existing lib1funcs (e.g. __clz2si).
>>> If this matters, you'll need to point in the right direction for the
>>> fix.  I'm not sure it does matter, since these functions are PIC anyway.
>>
>> That might be a bug in the clz implementations - Christophe: Any thoughts?
>>
> 
> Indeed that looks suspicious. I'm wondering why I saw no problem during testing.
> Is it possible that __clzsi2 is not covered by GCC's 'make check' ?

It's possible, but don't forget that modern cores have a CLZ
instruction, so most tests will not end up using the library implementation.

> 
>>>
>>>> I strongly suggest that, rather than using gcc snapshots (I'm assuming
>>>> this based on the diff style and directory naming in your patches), you
>>>> switch to using a git tree, then you'll be able to use tools such as
>>>> rebasing and the git posting tools to send the patch series for
>>>> subsequent review.
>>>
>>> Your assumption is correct.  I didn't think that I would have to get so
>>> deep into the gcc development process for this contribution.  I used
>>> this library as a bare metal alternative for libgcc/libm in the product
>>> for years, so I thought it would just drop in.  But, the libgcc compile
>>> mechanics have proved much more 'interesting'. I'm assuming this
>>> architecture was created years before the introduction of -ffunction-
>>> sections...
>>>
>>
>> I don't think I've time to write a history lesson, even if you wanted
>> it.  Suffice to say, this does date back to the days of a.out format
>> object files (with 4 relocation types, STABS debugging, and one code,
>> one data and one bss section).
>>
>>>>
>>>> Richard.
>>>>
>>>
>>> Thanks again,
>>> Daniel
>>>
>>
>> R.

R.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2021-01-07 12:56             ` Richard Earnshaw
  2021-01-07 13:27               ` Christophe Lyon
@ 2021-01-09 12:28               ` Daniel Engel
  2021-01-09 13:09                 ` Christophe Lyon
  1 sibling, 1 reply; 26+ messages in thread
From: Daniel Engel @ 2021-01-09 12:28 UTC (permalink / raw)
  To: Richard Earnshaw, Christophe Lyon; +Cc: gcc Patches

On Thu, Jan 7, 2021, at 4:56 AM, Richard Earnshaw wrote:
> On 07/01/2021 00:59, Daniel Engel wrote:
> > --snip--
> > 
> > On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote:
> > 
> >>
> >> Thanks for working on this, Daniel.
> >>
> >> This is clearly stage1 material, so we've got time for a couple of
> >> iterations to sort things out.
> > 
> > I appreciate your feedback.  I had been hoping that with no regressions
> > this might still be eligible for stage2.  Christophe never indicated
> > either way. but the fact that he was looking at it seemed positive.
> > I thought I would be a couple of weeks faster with this last
> > iteration, but holidays got in the way.
> 
> GCC doesn't have a stage 2 any more (historical wart).  We were in
> (late) stage3 when this was first posted, and because of the significant
> impact this might have on not just CM0 but other targets as well, I
> don't think it's something we should try to squeeze in at the last
> minute.  We're now in stage 4, so that is doubly the case.

Of course I meant stage3.  Oops.  I actually thought stage 3 would
continue through next week based on the average historical dates.

It would have been nice to get this feedback when I emailed you a
preview version of this patch (2020-Nov-11).  Christophe's logs have
been very helpful on the technical integration, but it's proving a chore
to go back and re-create some of the intermediate chunks.

Regardless, I still have free time for at least a little while longer to
work on this, so I'll push forward with any further feedback you are
willing to give me.  I have failed to free up any time during the last 2
years to actually work on this during stage1, and I have no guarantee
this coming year will be different.

> 
> Christophe is a very valuable member of our community, but he's not a
> port maintainer and thus cannot really rule on what can go into the
> tools, or when.
> 
> > 
> > I actually think your comments below could all be addressable within a
> > couple of days.  But, I'm not accounting for the review process.
> >  
> >> Firstly, the patch is very large, but contains a large number of
> >> distinct changes, so it would really benefit from being broken down into
> >> a number of distinct patches.  This will make reviewing the individual
> >> changes much more straight-forward.  
> > 
> > I have no context for "large" or "small" with respect to gcc.  This
> > patch comprises about 30% of a previously-monolithic library that's
> > been shipping since ~2016 (the rest is libm material).  Other than
> > (1) the aforementioned change to div0(), (2) a nascent adaptation
> > for __truncdfsf2() (not enabled), and (3) the gratuitous addition of
> > the bitwise functions, the library remains pretty much as it was
> > originally released.
> 
> Large, like many other terms is relative.  For assembler file changes,
> which this is primarily, the overall size can be much smaller and still
> be considered 'large'.
> 
> > 
> > The driving force in the development of this library was small size,
> > which of course was never possible with the softfp routines.  It's not
> > half-slow, either, for the limitations of the M0 architecture.   And,
> > it's IEEE compliant.  But, that means that most of the functions are
> > highly interconnected.  So, some of it can be broken up as you outline
> > below, but that last patch is still worth more than half of the total.
> 
> Nevertheless, having the floating-point code separated out will make
> reviewing more straight forward.  I'll likely need to ask one of our FP
> experts to have a specific look at that part and that will be easier if
> it is disentangled from the other changes.
> > 
> > I also have ~70k lines of test vectors that seem mostly redundant, but
> > not completely.  I haven't decided what to do here.  For example, I have
> > coverage for __aeabi_u/ldivmod, while GCC does not.  If I do anything
> > with this code it will be in a separate thread.
> 
> Publishing the test code, even if it isn't integrated into the GCC
> testsuite would be useful.  Perhaps someone else could then help with that.

Very brute force stuff, not production quality:
    <http://danielengel.com/cm0_test_vectors.tgz> (160 kb)

> >> I'd suggest:
> >>
> >> 1) Some basic makefile cleanups to ease initial integration - in
> >> particular where we have things like
> >>
> >> LIB1FUNCS += <long list of functions>
> >>
> >> that this be rewritten with one function per line (and sorted
> >> alphabetically) - then we can see which functions are being changed in
> >> subsequent patches.  It makes the Makefile fragments longer, but the
> >> improvement in clarity for makes this worthwhile.
> > 
> > I know next to nothing about Makefiles, particularly ones as complex as
> > GCC's.  I was just trying to work with the existing style to avoid
> > breaking something.  However, I can certainly adopt this suggestion.
> >  
> >> 2) The changes for the existing integer functions - preferably one
> >> function per patch.
> >>
> >> 3) The new integer functions that you're adding
> > 
> > These wouldn't be too hard to do, but what are the expectations for
> > testing?  A clean build of GCC takes about 6 hours in my VM, and
> > regression testing takes about 4 hours per architecture.  You would want
> > a full regression report for each incremental patch?  I have no idea how
> > to target regression tests that apply to particular runtime functions
> > without the risk of missing something.
> > 
> 
> Most of this can be tested in a cross-compile environment using qemu as
> a model.  A cross build shouldn't take that long (especially if you
> restrict the compiler to just C and C++ - other languages are
> vanishingly unlikely to pick up errors in the parts of the compiler
> you're changing).  But build checks will be mostly sufficient for most
> of the intermediate patches.
> 
> >> 4) The floating-point support.
> >>
> >> Some more general observations:
> >>
> >> - where functions are already in lib1funcs.asm, please leave them there.
> > 
> > I guess I have a different vision here.  I have had a really hard time
> > following all of the nested #ifdefs in lib1funcs, so I thought it would
> > be helpful to begin breaking it up into logical units.
> 
> Agreed, it's not easy.  But the restructuring, if any, should be done
> separately from other changes, not mixed up with them.
> 
> > 
> > The functions removed were all functions for which I had THUMB1
> > sequences faster/smaller than lib1funcs:  __clzsi2, __clzdi2, __ctzsi2,
> > __ashrdi3, __lshrdi3, __ashldi3.  In fact, the new THUMB1 of __clzsi2 is
> > the same number of instructions as the previous ARM/THUMB2 version.
> > 
> > You will find all of the previous ARM versions of these functions merged
> > into the new files (with attribution) and the same preprocessor
> > selection path.  So no architecture variant should be any worse off than
> > before this patch, and some beyond v6m should benefit.
> > 
> > In the future, I think that my versions of __divsi3 and __divdi3 will
> > prove faster/smaller than the existing THUMB2 versions.  I know that my
> > float routines are less than half the compiled size of THUMB2 versions
> > in 'ieee754-sf.S'.  However, I haven't profiled the exact performance
> > differences so I have left all this work for future patches. (It's also
> > quite likely that my version can be further-refined with a few judicious
> > uses of THUMB2 alternatives.)
> > 
> > My long-term vision would be use lib1funcs as an architectural wrapper
> > distinct from the implementation code.
> > 
> >> - lets avoid having the cm0 subdirectory - in particular we should do
> >> this when there is existing code for other targets in the same source
> >> files.  It's OK to have any new files in the main 'arm' directory of the
> >> source tree - just name the files appropriately if really needed.
> > 
> > Fair point on the name.  In v1 of this patch, all these files were all
> > preprocessor-selected for v6m only.  However, as I've stumbled through
> > the finer points of integration, that line has blurred.  Name aside,
> > the subdirectory does still represent a standalone library.   I think
> > I've managed to add enough integration hooks that it works well in
> > a libgcc context, but it still has a very distinct implementation style.
> > 
> > I don't have a strong opinion on this, just preference.  But, keeping
> > the subdirectory with a neutral name will probably make maintenance
> > easier in the short term.  I would suggest "lib0" (since it caters to
> > the lowest common denominator) or "eabi" (since that was the original
> > target).  There are precedents in other architectures (e.g. avr).
> 
> The issue here is that the selection of code from the various
> subdirectories is not consistent.  In some cases we might be pulling in
> a thumb1 implementation into a thumb2 environment, so having the code in
> a directory that doesn't reflect this makes maintaining the code harder.
>  I don't mind too much if some new files are introduced and their names
> reflect both their function and the architecture they support - eg
> t1-di-shift.S would obviously contain code for di-mode shifts in thumb1.

You didn't say that a neutral directory name is off the table.  
I will propose something other than 'cm0'.  
 
> > 
> >> - let's avoid the CM0 prefix - this is 'thumb1' code, for want of a
> >> better term, and that is used widely elsewhere in the compiler.  So if
> >> you really need a term just use THUMB1, or even T1.
> > 
> > Maybe.  The Cortex M0 includes a subset of THUMB2 instructions.  Most
> > of this is probably THUMB1 clean, but it wasn't a design requirement.
> 
> It's particularly the Thumb1 issue, just more the name is for a specific
> CPU which might cause confusion later.  v6m would be preferable to that
> if there really is a dependency on the instructions that are not in the
> original Thumb1 ISA.

I will remove the CM0 prefix and use/extend the standard macro names. 

> 
> > 
> > The CM0_FUNC_START exists so that I can specify subsections of ".text"
> > for each function.  This was a fairly fundamental design decision that
> > allowed me to make a number of branch optimizations between functions.
> > The other macros are just duplicates for naming symmetry.
> 
> This is something we'll have to get to during the main review of the
> code - we used to have support for PE-COFF object files.  That might now
> be obsolete, wince support is certainly deprecated - but we can't assume
> that ELF is the only object format we'll ever have to support.
> 
> > 
> > The existing  FUNC_START macro inserts extra conflicting ".text"
> > directives that would break the build.  Of course, the prefix was
> > arbitrary; I just took CM0 from the library name.  But, there's nothing
> > architecturally significant about this macro at all, so THUMB1 and T1
> > seems just about as wrong.  Maybe define a FUNC_START_SECTION macro with
> > two parameters? For example:
> > 
> >     FUNC_START_SECTION clzdi2 .text.sorted.libgcc.clz2.clzdi2
> > 
> > Instead of: 
> > 
> >     .section .text.sorted.libgcc.clz2.clzdi2,"x"
> >     CM0_FUNC_START clzdi2
> > 
> >> - For the 64-bit shift functions, I expect the existing code to be
> >> preferable whenever possible - I think it can even be tweaked to build
> >> for thumb2 by inserting a suitable IT instruction.  So your code should
> >> only be used when
> >>
> >>  #if !__ARM_ARCH_ISA_ARM && __ARM_ARCH_ISA_THUMB == 1
> > 
> > That is the definition of NOT_ISA_TARGET_32BIT, which I am respecting.
> > (The name doesn't seem quite right for Cortex M0, since it does support
> > some 32 bit instructions, but that's beside the point.)
> 
> The terms Thumb1 and Thumb2 predate the arm-v8m architecture
> specifications - even then the term Thumb1 was interpreted as "mostly
> 16-bit instructions" and thumb2 as "a mix of 16- and 32-bit".  Yes, the
> 16/32-bit spilt has become more blurred and that will likely continue in
> future since the 16-bit encoding space is pretty full.
> 
> > 
> > The current lib1funcs ARM code path still exists, as described above. My
> > THUMB1 implementations were 1 - 3 instructions shorter than the current
> > versions, which is why I took the time to merge the files.
> > 
> > Unfortunately, the Cortex M0 THUMB2 subset does not provide IT.  I don't
> > see an advantage to eliminating the branch unless these functions were
> > written with cryptographic side channel attacks in mind.
> 
> On high performance cores branches are predicted - if the branch is
> predictable then the common path will be taken and the unneeded
> instructions will never be used.  But library functions like this tend
> to have very unpredictable values used for calling them, so it's much
> less likely that the hardware will predict the right path - at this
> point conditional instructions tend to win (especially if there aren't
> very many of them) because the cost (on average) of not executing the
> unneeded instructions is much lower than the cost (on average) of
> unwinding the processor state to execute the other arm of the
> conditional branch.

Got it.  I have been counting branches as 3 cycles of fixed cost, and
ignoring penalties if a branch skips at least 2 instructions.

Going forward, I will add 'IT<c>' compile options for any new code with
scope beyond v6m.

> >> - most, if not all, of your LSYM symbols should not be needed after
> >> assembly, so should start with a captial 'L' (and no leading
> >> underscores), the assembler will then automatically discard any that are
> >> not needed for relocations.
> > 
> > You don't want debugging symbols for libgcc internals :) ?  I sort of
> > understand that, but those symbols have been useful to me in the past.
> > The "." by itself seems to keep visibility local, so the extra symbols
> > won't cause linker issuess. Would you object to a macro variant (e.g.
> > LLSYM) that prepends the "L" but is easier to disable?
> 
> It is a matter of taste, but I really prefer the local symbols to
> disappear entirely once the file is compiled - it makes things like
> backtrace gdb show the proper call heirarchy.
 
Hearing no objection to LLSYM, I'll implement that for debugging.
The released version will have ".L" symbols stripped.

> >> - you'll need to write suitable commit messages for each patch, which
> >> also contain a suitable ChangeLog style entry.
> > 
> > OK.
> > 
> >> - finally, your popcount implementations have data in the code segment.
> >>  That's going to cause problems when we have compilation options such as
> >> -mpure-code.
> > 
> > I am just following the precedent of existing lib1funcs (e.g. __clz2si).
> > If this matters, you'll need to point in the right direction for the
> > fix.  I'm not sure it does matter, since these functions are PIC anyway.
> 
> That might be a bug in the clz implementations - Christophe: Any thoughts?

__clzsi2() has test coverage in "gcc.c-torture/execute/builtin-bitops-1.c"

The 'clzs' and 'ctz' functions should never have problems.   -mpure-code
appears to be valid only when the 'movt' instruction is available, which
means that the 'clz' instruction will also be available, so no array loads.

Is the -mpure-code state detectable as a preprocessor flag?  While
'movw'/'movt' appears to be the canonical solution, I'm not sure it
should be the default just because a processor supports Thumb-2.

Do users wanting to use -mpure-code recompile the toolchain to avoid
constant data in compiled C functions?  I don't think this is the
default for the typical toolchain scripts.

> >> I strongly suggest that, rather than using gcc snapshots (I'm assuming
> >> this based on the diff style and directory naming in your patches), you
> >> switch to using a git tree, then you'll be able to use tools such as
> >> rebasing and the git posting tools to send the patch series for
> >> subsequent review.
> > 
> > Your assumption is correct.  I didn't think that I would have to get so
> > deep into the gcc development process for this contribution.  I used
> > this library as a bare metal alternative for libgcc/libm in the product
> > for years, so I thought it would just drop in.  But, the libgcc compile
> > mechanics have proved much more 'interesting'. I'm assuming this
> > architecture was created years before the introduction of -ffunction-
> > sections...
> > 
> 
> I don't think I've time to write a history lesson, even if you wanted
> it.  Suffice to say, this does date back to the days of a.out format
> object files (with 4 relocation types, STABS debugging, and one code,
> one data and one bss section).
> 
> >>
> >> Richard.
> >>
> > 
> > Thanks again,
> > Daniel
> > 
> 
> R.
>

To reiterate what I said above, I intend to push forward and incorporate
your current recommendations plus any further feedback I may get.  I
expect you to say that this doesn't merit inclusion yet, but I'd rather
spend the time while I have it.

I'll post a patch series for review within the next day or so.

Thanks again,
Daniel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2021-01-09 12:28               ` Daniel Engel
@ 2021-01-09 13:09                 ` Christophe Lyon
  2021-01-09 18:04                   ` Daniel Engel
                                     ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Christophe Lyon @ 2021-01-09 13:09 UTC (permalink / raw)
  To: Daniel Engel; +Cc: Richard Earnshaw, gcc Patches

On Sat, 9 Jan 2021 at 13:27, Daniel Engel <libgcc@danielengel.com> wrote:
>
> On Thu, Jan 7, 2021, at 4:56 AM, Richard Earnshaw wrote:
> > On 07/01/2021 00:59, Daniel Engel wrote:
> > > --snip--
> > >
> > > On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote:
> > >
> > >>
> > >> Thanks for working on this, Daniel.
> > >>
> > >> This is clearly stage1 material, so we've got time for a couple of
> > >> iterations to sort things out.
> > >
> > > I appreciate your feedback.  I had been hoping that with no regressions
> > > this might still be eligible for stage2.  Christophe never indicated
> > > either way. but the fact that he was looking at it seemed positive.
> > > I thought I would be a couple of weeks faster with this last
> > > iteration, but holidays got in the way.
> >
> > GCC doesn't have a stage 2 any more (historical wart).  We were in
> > (late) stage3 when this was first posted, and because of the significant
> > impact this might have on not just CM0 but other targets as well, I
> > don't think it's something we should try to squeeze in at the last
> > minute.  We're now in stage 4, so that is doubly the case.
>
> Of course I meant stage3.  Oops.  I actually thought stage 3 would
> continue through next week based on the average historical dates.

I expected stage4 to start on Jan 1st :-)

> It would have been nice to get this feedback when I emailed you a
> preview version of this patch (2020-Nov-11).  Christophe's logs have
> been very helpful on the technical integration, but it's proving a chore
> to go back and re-create some of the intermediate chunks.
>
> Regardless, I still have free time for at least a little while longer to
> work on this, so I'll push forward with any further feedback you are
> willing to give me.  I have failed to free up any time during the last 2
> years to actually work on this during stage1, and I have no guarantee
> this coming year will be different.
>
> >
> > Christophe is a very valuable member of our community, but he's not a
> > port maintainer and thus cannot really rule on what can go into the
> > tools, or when.
> >
> > >
> > > I actually think your comments below could all be addressable within a
> > > couple of days.  But, I'm not accounting for the review process.
> > >
> > >> Firstly, the patch is very large, but contains a large number of
> > >> distinct changes, so it would really benefit from being broken down into
> > >> a number of distinct patches.  This will make reviewing the individual
> > >> changes much more straight-forward.

And if you can generate the patch with git, that will help: the reason for the
previous build errors was because I had to "reformat" your patch before
submitting it for validation, and I messed things up.

> > >
> > > I have no context for "large" or "small" with respect to gcc.  This
> > > patch comprises about 30% of a previously-monolithic library that's
> > > been shipping since ~2016 (the rest is libm material).  Other than
> > > (1) the aforementioned change to div0(), (2) a nascent adaptation
> > > for __truncdfsf2() (not enabled), and (3) the gratuitous addition of
> > > the bitwise functions, the library remains pretty much as it was
> > > originally released.
> >
> > Large, like many other terms is relative.  For assembler file changes,
> > which this is primarily, the overall size can be much smaller and still
> > be considered 'large'.
> >
> > >
> > > The driving force in the development of this library was small size,
> > > which of course was never possible with the softfp routines.  It's not
> > > half-slow, either, for the limitations of the M0 architecture.   And,
> > > it's IEEE compliant.  But, that means that most of the functions are
> > > highly interconnected.  So, some of it can be broken up as you outline
> > > below, but that last patch is still worth more than half of the total.
> >
> > Nevertheless, having the floating-point code separated out will make
> > reviewing more straight forward.  I'll likely need to ask one of our FP
> > experts to have a specific look at that part and that will be easier if
> > it is disentangled from the other changes.
> > >
> > > I also have ~70k lines of test vectors that seem mostly redundant, but
> > > not completely.  I haven't decided what to do here.  For example, I have
> > > coverage for __aeabi_u/ldivmod, while GCC does not.  If I do anything
> > > with this code it will be in a separate thread.
> >
> > Publishing the test code, even if it isn't integrated into the GCC
> > testsuite would be useful.  Perhaps someone else could then help with that.
>
> Very brute force stuff, not production quality:
>     <http://danielengel.com/cm0_test_vectors.tgz> (160 kb)
>
> > >> I'd suggest:
> > >>
> > >> 1) Some basic makefile cleanups to ease initial integration - in
> > >> particular where we have things like
> > >>
> > >> LIB1FUNCS += <long list of functions>
> > >>
> > >> that this be rewritten with one function per line (and sorted
> > >> alphabetically) - then we can see which functions are being changed in
> > >> subsequent patches.  It makes the Makefile fragments longer, but the
> > >> improvement in clarity for makes this worthwhile.
> > >
> > > I know next to nothing about Makefiles, particularly ones as complex as
> > > GCC's.  I was just trying to work with the existing style to avoid
> > > breaking something.  However, I can certainly adopt this suggestion.
> > >
> > >> 2) The changes for the existing integer functions - preferably one
> > >> function per patch.
> > >>
> > >> 3) The new integer functions that you're adding
> > >
> > > These wouldn't be too hard to do, but what are the expectations for
> > > testing?  A clean build of GCC takes about 6 hours in my VM, and
> > > regression testing takes about 4 hours per architecture.  You would want
> > > a full regression report for each incremental patch?  I have no idea how
> > > to target regression tests that apply to particular runtime functions
> > > without the risk of missing something.
> > >
> >
> > Most of this can be tested in a cross-compile environment using qemu as
> > a model.  A cross build shouldn't take that long (especially if you
> > restrict the compiler to just C and C++ - other languages are
> > vanishingly unlikely to pick up errors in the parts of the compiler
> > you're changing).  But build checks will be mostly sufficient for most
> > of the intermediate patches.
> >
> > >> 4) The floating-point support.
> > >>
> > >> Some more general observations:
> > >>
> > >> - where functions are already in lib1funcs.asm, please leave them there.
> > >
> > > I guess I have a different vision here.  I have had a really hard time
> > > following all of the nested #ifdefs in lib1funcs, so I thought it would
> > > be helpful to begin breaking it up into logical units.
> >
> > Agreed, it's not easy.  But the restructuring, if any, should be done
> > separately from other changes, not mixed up with them.
> >
> > >
> > > The functions removed were all functions for which I had THUMB1
> > > sequences faster/smaller than lib1funcs:  __clzsi2, __clzdi2, __ctzsi2,
> > > __ashrdi3, __lshrdi3, __ashldi3.  In fact, the new THUMB1 of __clzsi2 is
> > > the same number of instructions as the previous ARM/THUMB2 version.
> > >
> > > You will find all of the previous ARM versions of these functions merged
> > > into the new files (with attribution) and the same preprocessor
> > > selection path.  So no architecture variant should be any worse off than
> > > before this patch, and some beyond v6m should benefit.
> > >
> > > In the future, I think that my versions of __divsi3 and __divdi3 will
> > > prove faster/smaller than the existing THUMB2 versions.  I know that my
> > > float routines are less than half the compiled size of THUMB2 versions
> > > in 'ieee754-sf.S'.  However, I haven't profiled the exact performance
> > > differences so I have left all this work for future patches. (It's also
> > > quite likely that my version can be further-refined with a few judicious
> > > uses of THUMB2 alternatives.)
> > >
> > > My long-term vision would be use lib1funcs as an architectural wrapper
> > > distinct from the implementation code.
> > >
> > >> - lets avoid having the cm0 subdirectory - in particular we should do
> > >> this when there is existing code for other targets in the same source
> > >> files.  It's OK to have any new files in the main 'arm' directory of the
> > >> source tree - just name the files appropriately if really needed.
> > >
> > > Fair point on the name.  In v1 of this patch, all these files were all
> > > preprocessor-selected for v6m only.  However, as I've stumbled through
> > > the finer points of integration, that line has blurred.  Name aside,
> > > the subdirectory does still represent a standalone library.   I think
> > > I've managed to add enough integration hooks that it works well in
> > > a libgcc context, but it still has a very distinct implementation style.
> > >
> > > I don't have a strong opinion on this, just preference.  But, keeping
> > > the subdirectory with a neutral name will probably make maintenance
> > > easier in the short term.  I would suggest "lib0" (since it caters to
> > > the lowest common denominator) or "eabi" (since that was the original
> > > target).  There are precedents in other architectures (e.g. avr).
> >
> > The issue here is that the selection of code from the various
> > subdirectories is not consistent.  In some cases we might be pulling in
> > a thumb1 implementation into a thumb2 environment, so having the code in
> > a directory that doesn't reflect this makes maintaining the code harder.
> >  I don't mind too much if some new files are introduced and their names
> > reflect both their function and the architecture they support - eg
> > t1-di-shift.S would obviously contain code for di-mode shifts in thumb1.
>
> You didn't say that a neutral directory name is off the table.
> I will propose something other than 'cm0'.
>
> > >
> > >> - let's avoid the CM0 prefix - this is 'thumb1' code, for want of a
> > >> better term, and that is used widely elsewhere in the compiler.  So if
> > >> you really need a term just use THUMB1, or even T1.
> > >
> > > Maybe.  The Cortex M0 includes a subset of THUMB2 instructions.  Most
> > > of this is probably THUMB1 clean, but it wasn't a design requirement.
> >
> > It's particularly the Thumb1 issue, just more the name is for a specific
> > CPU which might cause confusion later.  v6m would be preferable to that
> > if there really is a dependency on the instructions that are not in the
> > original Thumb1 ISA.
>
> I will remove the CM0 prefix and use/extend the standard macro names.
>
> >
> > >
> > > The CM0_FUNC_START exists so that I can specify subsections of ".text"
> > > for each function.  This was a fairly fundamental design decision that
> > > allowed me to make a number of branch optimizations between functions.
> > > The other macros are just duplicates for naming symmetry.
> >
> > This is something we'll have to get to during the main review of the
> > code - we used to have support for PE-COFF object files.  That might now
> > be obsolete, wince support is certainly deprecated - but we can't assume
> > that ELF is the only object format we'll ever have to support.
> >
> > >
> > > The existing  FUNC_START macro inserts extra conflicting ".text"
> > > directives that would break the build.  Of course, the prefix was
> > > arbitrary; I just took CM0 from the library name.  But, there's nothing
> > > architecturally significant about this macro at all, so THUMB1 and T1
> > > seems just about as wrong.  Maybe define a FUNC_START_SECTION macro with
> > > two parameters? For example:
> > >
> > >     FUNC_START_SECTION clzdi2 .text.sorted.libgcc.clz2.clzdi2
> > >
> > > Instead of:
> > >
> > >     .section .text.sorted.libgcc.clz2.clzdi2,"x"
> > >     CM0_FUNC_START clzdi2
> > >
> > >> - For the 64-bit shift functions, I expect the existing code to be
> > >> preferable whenever possible - I think it can even be tweaked to build
> > >> for thumb2 by inserting a suitable IT instruction.  So your code should
> > >> only be used when
> > >>
> > >>  #if !__ARM_ARCH_ISA_ARM && __ARM_ARCH_ISA_THUMB == 1
> > >
> > > That is the definition of NOT_ISA_TARGET_32BIT, which I am respecting.
> > > (The name doesn't seem quite right for Cortex M0, since it does support
> > > some 32 bit instructions, but that's beside the point.)
> >
> > The terms Thumb1 and Thumb2 predate the arm-v8m architecture
> > specifications - even then the term Thumb1 was interpreted as "mostly
> > 16-bit instructions" and thumb2 as "a mix of 16- and 32-bit".  Yes, the
> > 16/32-bit spilt has become more blurred and that will likely continue in
> > future since the 16-bit encoding space is pretty full.
> >
> > >
> > > The current lib1funcs ARM code path still exists, as described above. My
> > > THUMB1 implementations were 1 - 3 instructions shorter than the current
> > > versions, which is why I took the time to merge the files.
> > >
> > > Unfortunately, the Cortex M0 THUMB2 subset does not provide IT.  I don't
> > > see an advantage to eliminating the branch unless these functions were
> > > written with cryptographic side channel attacks in mind.
> >
> > On high performance cores branches are predicted - if the branch is
> > predictable then the common path will be taken and the unneeded
> > instructions will never be used.  But library functions like this tend
> > to have very unpredictable values used for calling them, so it's much
> > less likely that the hardware will predict the right path - at this
> > point conditional instructions tend to win (especially if there aren't
> > very many of them) because the cost (on average) of not executing the
> > unneeded instructions is much lower than the cost (on average) of
> > unwinding the processor state to execute the other arm of the
> > conditional branch.
>
> Got it.  I have been counting branches as 3 cycles of fixed cost, and
> ignoring penalties if a branch skips at least 2 instructions.
>
> Going forward, I will add 'IT<c>' compile options for any new code with
> scope beyond v6m.
>
> > >> - most, if not all, of your LSYM symbols should not be needed after
> > >> assembly, so should start with a captial 'L' (and no leading
> > >> underscores), the assembler will then automatically discard any that are
> > >> not needed for relocations.
> > >
> > > You don't want debugging symbols for libgcc internals :) ?  I sort of
> > > understand that, but those symbols have been useful to me in the past.
> > > The "." by itself seems to keep visibility local, so the extra symbols
> > > won't cause linker issuess. Would you object to a macro variant (e.g.
> > > LLSYM) that prepends the "L" but is easier to disable?
> >
> > It is a matter of taste, but I really prefer the local symbols to
> > disappear entirely once the file is compiled - it makes things like
> > backtrace gdb show the proper call heirarchy.
>
> Hearing no objection to LLSYM, I'll implement that for debugging.
> The released version will have ".L" symbols stripped.
>
> > >> - you'll need to write suitable commit messages for each patch, which
> > >> also contain a suitable ChangeLog style entry.
> > >
> > > OK.
> > >
> > >> - finally, your popcount implementations have data in the code segment.
> > >>  That's going to cause problems when we have compilation options such as
> > >> -mpure-code.
> > >
> > > I am just following the precedent of existing lib1funcs (e.g. __clz2si).
> > > If this matters, you'll need to point in the right direction for the
> > > fix.  I'm not sure it does matter, since these functions are PIC anyway.
> >
> > That might be a bug in the clz implementations - Christophe: Any thoughts?
>
> __clzsi2() has test coverage in "gcc.c-torture/execute/builtin-bitops-1.c"
Thanks, I'll have a closer look at why I didn't see problems.

> The 'clzs' and 'ctz' functions should never have problems.   -mpure-code
> appears to be valid only when the 'movt' instruction is available, which
> means that the 'clz' instruction will also be available, so no array loads.
No, -mpure-code is also supported with v6m.

> Is the -mpure-code state detectable as a preprocessor flag?  While
No.

> 'movw'/'movt' appears to be the canonical solution, I'm not sure it
> should be the default just because a processor supports Thumb-2.
>
> Do users wanting to use -mpure-code recompile the toolchain to avoid
> constant data in compiled C functions?  I don't think this is the
> default for the typical toolchain scripts.
No, users of -mpure-code do not recompile the toolchain.

> > >> I strongly suggest that, rather than using gcc snapshots (I'm assuming
> > >> this based on the diff style and directory naming in your patches), you
> > >> switch to using a git tree, then you'll be able to use tools such as
> > >> rebasing and the git posting tools to send the patch series for
> > >> subsequent review.
> > >
> > > Your assumption is correct.  I didn't think that I would have to get so
> > > deep into the gcc development process for this contribution.  I used
> > > this library as a bare metal alternative for libgcc/libm in the product
> > > for years, so I thought it would just drop in.  But, the libgcc compile
> > > mechanics have proved much more 'interesting'. I'm assuming this
> > > architecture was created years before the introduction of -ffunction-
> > > sections...
> > >
> >
> > I don't think I've time to write a history lesson, even if you wanted
> > it.  Suffice to say, this does date back to the days of a.out format
> > object files (with 4 relocation types, STABS debugging, and one code,
> > one data and one bss section).
> >
> > >>
> > >> Richard.
> > >>
> > >
> > > Thanks again,
> > > Daniel
> > >
> >
> > R.
> >
>
> To reiterate what I said above, I intend to push forward and incorporate
> your current recommendations plus any further feedback I may get.  I
> expect you to say that this doesn't merit inclusion yet, but I'd rather
> spend the time while I have it.
>
> I'll post a patch series for review within the next day or so.

Here are the results of the validation of your latest version (20210105):
https://people.linaro.org/~christophe.lyon/cross-validation/gcc-test-patches/r11-5993-g159b0bd9ce263dfb791eff5133b0ca0207201c84-cortex-m0-fplib-20210105.patch/report-build-info.html

"BIG-REGR" just means the regression report is large enough that it's
provided in compressed form to avoid overloading the browser.

So it really seems your patch introduces regressions in arm*linux* configs.
For the 2 arm-none-eabi configs which show regressions (cortex-m0 and
cortex-m3), the logs seem to indicate some tests timed out, and it's
possible the server used was overloaded.
The same applies to the 3 aarch64*elf cases, where the regressions
seem only caused by timed out; there's no reason your patch would have
an impact on aarch64.
(there 5 configs were tested on the same machine, so overload is indeed likely).

I didn't check why all the ubsan tests now seem to fail, they are in
the "unstable" category because in the past some of them had some
randomness.
I do not see such noise in trunk validation though.

Thanks,

Christophe

>
> Thanks again,
> Daniel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2021-01-09 13:09                 ` Christophe Lyon
@ 2021-01-09 18:04                   ` Daniel Engel
  2021-01-11 14:49                     ` Richard Earnshaw
  2021-01-09 18:48                   ` Daniel Engel
  2021-01-11 16:07                   ` Christophe Lyon
  2 siblings, 1 reply; 26+ messages in thread
From: Daniel Engel @ 2021-01-09 18:04 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: Richard Earnshaw, gcc Patches

On Sat, Jan 9, 2021, at 5:09 AM, Christophe Lyon wrote:
> On Sat, 9 Jan 2021 at 13:27, Daniel Engel <libgcc@danielengel.com> wrote:
> >
> > On Thu, Jan 7, 2021, at 4:56 AM, Richard Earnshaw wrote:
> > > On 07/01/2021 00:59, Daniel Engel wrote:
> > > > --snip--
> > > >
> > > > On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote:
> > > >
> > > >> -- snip --
> > > >>
> > > >> - finally, your popcount implementations have data in the code segment.
> > > >>  That's going to cause problems when we have compilation options such as
> > > >> -mpure-code.
> > > >
> > > > I am just following the precedent of existing lib1funcs (e.g. __clz2si).
> > > > If this matters, you'll need to point in the right direction for the
> > > > fix.  I'm not sure it does matter, since these functions are PIC anyway.
> > >
> > > That might be a bug in the clz implementations - Christophe: Any thoughts?
> >
> > __clzsi2() has test coverage in "gcc.c-torture/execute/builtin-bitops-1.c"
> Thanks, I'll have a closer look at why I didn't see problems.
> 
> > The 'clzs' and 'ctz' functions should never have problems.   -mpure-code
> > appears to be valid only when the 'movt' instruction is available, which
> > means that the 'clz' instruction will also be available, so no array loads.
> No, -mpure-code is also supported with v6m.
> 
> > Is the -mpure-code state detectable as a preprocessor flag?  While
> No.
> 
> > 'movw'/'movt' appears to be the canonical solution, I'm not sure it
> > should be the default just because a processor supports Thumb-2.
> >
> > Do users wanting to use -mpure-code recompile the toolchain to avoid
> > constant data in compiled C functions?  I don't think this is the
> > default for the typical toolchain scripts.
> No, users of -mpure-code do not recompile the toolchain.

I won't claim that my use of inline constants is correct.  It was not
hard to find references to high security model processors that block
reading from executable sections.

However, if all of the above is true, I think libgcc as a whole
will have much bigger problems.  I count over 500 other instances
in the disassembled v6m *.a file where functions load pc-relative
data from '.text'.

For example:
* C version of popcount
* __powidf2 (0x3FF00000)
* __mulsc3 (0x7F7FFFFF)
* Most soft-float functions.

Still not seeing a clear resolution here.  Is it acceptable to use the 

    "ldr rD, =const" 

pattern?

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2021-01-09 13:09                 ` Christophe Lyon
  2021-01-09 18:04                   ` Daniel Engel
@ 2021-01-09 18:48                   ` Daniel Engel
  2021-01-11 16:07                   ` Christophe Lyon
  2 siblings, 0 replies; 26+ messages in thread
From: Daniel Engel @ 2021-01-09 18:48 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: Richard Earnshaw, gcc Patches

On Sat, Jan 9, 2021, at 5:09 AM, Christophe Lyon wrote:
> On Sat, 9 Jan 2021 at 13:27, Daniel Engel <libgcc@danielengel.com> wrote:
> >
> > -- snip --
> >
> > To reiterate what I said above, I intend to push forward and incorporate
> > your current recommendations plus any further feedback I may get.  I
> > expect you to say that this doesn't merit inclusion yet, but I'd rather
> > spend the time while I have it.
> >
> > I'll post a patch series for review within the next day or so.
> 
> Here are the results of the validation of your latest version 
> (20210105):
> https://people.linaro.org/~christophe.lyon/cross-validation/gcc-test-patches/r11-5993-g159b0bd9ce263dfb791eff5133b0ca0207201c84-cortex-m0-fplib-20210105.patch/report-build-info.html

Thanks for this.  
 
> "BIG-REGR" just means the regression report is large enough that it's
> provided in compressed form to avoid overloading the browser.
> 
> So it really seems your patch introduces regressions in arm*linux* configs.
> For the 2 arm-none-eabi configs which show regressions (cortex-m0 and
> cortex-m3), the logs seem to indicate some tests timed out, and it's
> possible the server used was overloaded.

Looks like I added _divdi3 in LIB1ASMFUNCS with too much scope.  So the
C implementation gets locked out of the build.  On EABI, _divdi3 is
renamed as __aeabi_ldivmod, so both symbols are always found.  On GNU
EABI, that doesn't happen.

It should be a trivial fix, and I think there are a couple more similar.
I'll integrate this change in the patch series.

> The same applies to the 3 aarch64*elf cases, where the regressions
> seem only caused by timed out; there's no reason your patch would have
> an impact on aarch64.
> (there 5 configs were tested on the same machine, so overload is indeed likely).
> 
> I didn't check why all the ubsan tests now seem to fail, they are in
> the "unstable" category because in the past some of them had some
> randomness.
> I do not see such noise in trunk validation though.

I tried looking up a few of them to analyze.  Couldn't find the names
in the logs (e.g. "pr95810").  Are you sure they actually failed, or just
didn't run?  Regression reports say "ignored".

> Thanks,
> 
> Christophe
> 

Thanks again,
Daniel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2021-01-09 18:04                   ` Daniel Engel
@ 2021-01-11 14:49                     ` Richard Earnshaw
  0 siblings, 0 replies; 26+ messages in thread
From: Richard Earnshaw @ 2021-01-11 14:49 UTC (permalink / raw)
  To: Daniel Engel, Christophe Lyon; +Cc: gcc Patches

On 09/01/2021 18:04, Daniel Engel wrote:
> On Sat, Jan 9, 2021, at 5:09 AM, Christophe Lyon wrote:
>> On Sat, 9 Jan 2021 at 13:27, Daniel Engel <libgcc@danielengel.com> wrote:
>>>
>>> On Thu, Jan 7, 2021, at 4:56 AM, Richard Earnshaw wrote:
>>>> On 07/01/2021 00:59, Daniel Engel wrote:
>>>>> --snip--
>>>>>
>>>>> On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote:
>>>>>
>>>>>> -- snip --
>>>>>>
>>>>>> - finally, your popcount implementations have data in the code segment.
>>>>>>  That's going to cause problems when we have compilation options such as
>>>>>> -mpure-code.
>>>>>
>>>>> I am just following the precedent of existing lib1funcs (e.g. __clz2si).
>>>>> If this matters, you'll need to point in the right direction for the
>>>>> fix.  I'm not sure it does matter, since these functions are PIC anyway.
>>>>
>>>> That might be a bug in the clz implementations - Christophe: Any thoughts?
>>>
>>> __clzsi2() has test coverage in "gcc.c-torture/execute/builtin-bitops-1.c"
>> Thanks, I'll have a closer look at why I didn't see problems.
>>
>>> The 'clzs' and 'ctz' functions should never have problems.   -mpure-code
>>> appears to be valid only when the 'movt' instruction is available, which
>>> means that the 'clz' instruction will also be available, so no array loads.
>> No, -mpure-code is also supported with v6m.
>>
>>> Is the -mpure-code state detectable as a preprocessor flag?  While
>> No.
>>
>>> 'movw'/'movt' appears to be the canonical solution, I'm not sure it
>>> should be the default just because a processor supports Thumb-2.
>>>
>>> Do users wanting to use -mpure-code recompile the toolchain to avoid
>>> constant data in compiled C functions?  I don't think this is the
>>> default for the typical toolchain scripts.
>> No, users of -mpure-code do not recompile the toolchain.
> 
> I won't claim that my use of inline constants is correct.  It was not
> hard to find references to high security model processors that block
> reading from executable sections.
> 
> However, if all of the above is true, I think libgcc as a whole
> will have much bigger problems.  I count over 500 other instances
> in the disassembled v6m *.a file where functions load pc-relative
> data from '.text'.

The difference is that when the data-in-text references come from C
code, they can be eliminated simply by rebuilding the library with
-mpure-code on.  That's difficult, if not impossible to fix when the
source for a function is written in assembler.

> 
> For example:
> * C version of popcount
> * __powidf2 (0x3FF00000)
> * __mulsc3 (0x7F7FFFFF)
> * Most soft-float functions.
> 
> Still not seeing a clear resolution here.  Is it acceptable to use the 
> 
>     "ldr rD, =const" 

No, that's just short-hand for an LDR from a literal pool that is
generated auto-magically by the assembler.  I also wouldn't trust that
when using any section other than .text for code, unless you add
explicit .ltorg directives to state where the currently pending literals
are to be dumped.

> 
> pattern?
> 
> Thanks,
> Daniel
> 

R.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2021-01-09 13:09                 ` Christophe Lyon
  2021-01-09 18:04                   ` Daniel Engel
  2021-01-09 18:48                   ` Daniel Engel
@ 2021-01-11 16:07                   ` Christophe Lyon
  2021-01-11 16:18                     ` Daniel Engel
  2 siblings, 1 reply; 26+ messages in thread
From: Christophe Lyon @ 2021-01-11 16:07 UTC (permalink / raw)
  To: Daniel Engel; +Cc: Richard Earnshaw, gcc Patches

On Sat, 9 Jan 2021 at 14:09, Christophe Lyon <christophe.lyon@linaro.org> wrote:
>
> On Sat, 9 Jan 2021 at 13:27, Daniel Engel <libgcc@danielengel.com> wrote:
> >
> > On Thu, Jan 7, 2021, at 4:56 AM, Richard Earnshaw wrote:
> > > On 07/01/2021 00:59, Daniel Engel wrote:
> > > > --snip--
> > > >
> > > > On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote:
> > > >
> > > >>
> > > >> Thanks for working on this, Daniel.
> > > >>
> > > >> This is clearly stage1 material, so we've got time for a couple of
> > > >> iterations to sort things out.
> > > >
> > > > I appreciate your feedback.  I had been hoping that with no regressions
> > > > this might still be eligible for stage2.  Christophe never indicated
> > > > either way. but the fact that he was looking at it seemed positive.
> > > > I thought I would be a couple of weeks faster with this last
> > > > iteration, but holidays got in the way.
> > >
> > > GCC doesn't have a stage 2 any more (historical wart).  We were in
> > > (late) stage3 when this was first posted, and because of the significant
> > > impact this might have on not just CM0 but other targets as well, I
> > > don't think it's something we should try to squeeze in at the last
> > > minute.  We're now in stage 4, so that is doubly the case.
> >
> > Of course I meant stage3.  Oops.  I actually thought stage 3 would
> > continue through next week based on the average historical dates.
>
> I expected stage4 to start on Jan 1st :-)
>
> > It would have been nice to get this feedback when I emailed you a
> > preview version of this patch (2020-Nov-11).  Christophe's logs have
> > been very helpful on the technical integration, but it's proving a chore
> > to go back and re-create some of the intermediate chunks.
> >
> > Regardless, I still have free time for at least a little while longer to
> > work on this, so I'll push forward with any further feedback you are
> > willing to give me.  I have failed to free up any time during the last 2
> > years to actually work on this during stage1, and I have no guarantee
> > this coming year will be different.
> >
> > >
> > > Christophe is a very valuable member of our community, but he's not a
> > > port maintainer and thus cannot really rule on what can go into the
> > > tools, or when.
> > >
> > > >
> > > > I actually think your comments below could all be addressable within a
> > > > couple of days.  But, I'm not accounting for the review process.
> > > >
> > > >> Firstly, the patch is very large, but contains a large number of
> > > >> distinct changes, so it would really benefit from being broken down into
> > > >> a number of distinct patches.  This will make reviewing the individual
> > > >> changes much more straight-forward.
>
> And if you can generate the patch with git, that will help: the reason for the
> previous build errors was because I had to "reformat" your patch before
> submitting it for validation, and I messed things up.
>
> > > >
> > > > I have no context for "large" or "small" with respect to gcc.  This
> > > > patch comprises about 30% of a previously-monolithic library that's
> > > > been shipping since ~2016 (the rest is libm material).  Other than
> > > > (1) the aforementioned change to div0(), (2) a nascent adaptation
> > > > for __truncdfsf2() (not enabled), and (3) the gratuitous addition of
> > > > the bitwise functions, the library remains pretty much as it was
> > > > originally released.
> > >
> > > Large, like many other terms is relative.  For assembler file changes,
> > > which this is primarily, the overall size can be much smaller and still
> > > be considered 'large'.
> > >
> > > >
> > > > The driving force in the development of this library was small size,
> > > > which of course was never possible with the softfp routines.  It's not
> > > > half-slow, either, for the limitations of the M0 architecture.   And,
> > > > it's IEEE compliant.  But, that means that most of the functions are
> > > > highly interconnected.  So, some of it can be broken up as you outline
> > > > below, but that last patch is still worth more than half of the total.
> > >
> > > Nevertheless, having the floating-point code separated out will make
> > > reviewing more straight forward.  I'll likely need to ask one of our FP
> > > experts to have a specific look at that part and that will be easier if
> > > it is disentangled from the other changes.
> > > >
> > > > I also have ~70k lines of test vectors that seem mostly redundant, but
> > > > not completely.  I haven't decided what to do here.  For example, I have
> > > > coverage for __aeabi_u/ldivmod, while GCC does not.  If I do anything
> > > > with this code it will be in a separate thread.
> > >
> > > Publishing the test code, even if it isn't integrated into the GCC
> > > testsuite would be useful.  Perhaps someone else could then help with that.
> >
> > Very brute force stuff, not production quality:
> >     <http://danielengel.com/cm0_test_vectors.tgz> (160 kb)
> >
> > > >> I'd suggest:
> > > >>
> > > >> 1) Some basic makefile cleanups to ease initial integration - in
> > > >> particular where we have things like
> > > >>
> > > >> LIB1FUNCS += <long list of functions>
> > > >>
> > > >> that this be rewritten with one function per line (and sorted
> > > >> alphabetically) - then we can see which functions are being changed in
> > > >> subsequent patches.  It makes the Makefile fragments longer, but the
> > > >> improvement in clarity for makes this worthwhile.
> > > >
> > > > I know next to nothing about Makefiles, particularly ones as complex as
> > > > GCC's.  I was just trying to work with the existing style to avoid
> > > > breaking something.  However, I can certainly adopt this suggestion.
> > > >
> > > >> 2) The changes for the existing integer functions - preferably one
> > > >> function per patch.
> > > >>
> > > >> 3) The new integer functions that you're adding
> > > >
> > > > These wouldn't be too hard to do, but what are the expectations for
> > > > testing?  A clean build of GCC takes about 6 hours in my VM, and
> > > > regression testing takes about 4 hours per architecture.  You would want
> > > > a full regression report for each incremental patch?  I have no idea how
> > > > to target regression tests that apply to particular runtime functions
> > > > without the risk of missing something.
> > > >
> > >
> > > Most of this can be tested in a cross-compile environment using qemu as
> > > a model.  A cross build shouldn't take that long (especially if you
> > > restrict the compiler to just C and C++ - other languages are
> > > vanishingly unlikely to pick up errors in the parts of the compiler
> > > you're changing).  But build checks will be mostly sufficient for most
> > > of the intermediate patches.
> > >
> > > >> 4) The floating-point support.
> > > >>
> > > >> Some more general observations:
> > > >>
> > > >> - where functions are already in lib1funcs.asm, please leave them there.
> > > >
> > > > I guess I have a different vision here.  I have had a really hard time
> > > > following all of the nested #ifdefs in lib1funcs, so I thought it would
> > > > be helpful to begin breaking it up into logical units.
> > >
> > > Agreed, it's not easy.  But the restructuring, if any, should be done
> > > separately from other changes, not mixed up with them.
> > >
> > > >
> > > > The functions removed were all functions for which I had THUMB1
> > > > sequences faster/smaller than lib1funcs:  __clzsi2, __clzdi2, __ctzsi2,
> > > > __ashrdi3, __lshrdi3, __ashldi3.  In fact, the new THUMB1 of __clzsi2 is
> > > > the same number of instructions as the previous ARM/THUMB2 version.
> > > >
> > > > You will find all of the previous ARM versions of these functions merged
> > > > into the new files (with attribution) and the same preprocessor
> > > > selection path.  So no architecture variant should be any worse off than
> > > > before this patch, and some beyond v6m should benefit.
> > > >
> > > > In the future, I think that my versions of __divsi3 and __divdi3 will
> > > > prove faster/smaller than the existing THUMB2 versions.  I know that my
> > > > float routines are less than half the compiled size of THUMB2 versions
> > > > in 'ieee754-sf.S'.  However, I haven't profiled the exact performance
> > > > differences so I have left all this work for future patches. (It's also
> > > > quite likely that my version can be further-refined with a few judicious
> > > > uses of THUMB2 alternatives.)
> > > >
> > > > My long-term vision would be use lib1funcs as an architectural wrapper
> > > > distinct from the implementation code.
> > > >
> > > >> - lets avoid having the cm0 subdirectory - in particular we should do
> > > >> this when there is existing code for other targets in the same source
> > > >> files.  It's OK to have any new files in the main 'arm' directory of the
> > > >> source tree - just name the files appropriately if really needed.
> > > >
> > > > Fair point on the name.  In v1 of this patch, all these files were all
> > > > preprocessor-selected for v6m only.  However, as I've stumbled through
> > > > the finer points of integration, that line has blurred.  Name aside,
> > > > the subdirectory does still represent a standalone library.   I think
> > > > I've managed to add enough integration hooks that it works well in
> > > > a libgcc context, but it still has a very distinct implementation style.
> > > >
> > > > I don't have a strong opinion on this, just preference.  But, keeping
> > > > the subdirectory with a neutral name will probably make maintenance
> > > > easier in the short term.  I would suggest "lib0" (since it caters to
> > > > the lowest common denominator) or "eabi" (since that was the original
> > > > target).  There are precedents in other architectures (e.g. avr).
> > >
> > > The issue here is that the selection of code from the various
> > > subdirectories is not consistent.  In some cases we might be pulling in
> > > a thumb1 implementation into a thumb2 environment, so having the code in
> > > a directory that doesn't reflect this makes maintaining the code harder.
> > >  I don't mind too much if some new files are introduced and their names
> > > reflect both their function and the architecture they support - eg
> > > t1-di-shift.S would obviously contain code for di-mode shifts in thumb1.
> >
> > You didn't say that a neutral directory name is off the table.
> > I will propose something other than 'cm0'.
> >
> > > >
> > > >> - let's avoid the CM0 prefix - this is 'thumb1' code, for want of a
> > > >> better term, and that is used widely elsewhere in the compiler.  So if
> > > >> you really need a term just use THUMB1, or even T1.
> > > >
> > > > Maybe.  The Cortex M0 includes a subset of THUMB2 instructions.  Most
> > > > of this is probably THUMB1 clean, but it wasn't a design requirement.
> > >
> > > It's particularly the Thumb1 issue, just more the name is for a specific
> > > CPU which might cause confusion later.  v6m would be preferable to that
> > > if there really is a dependency on the instructions that are not in the
> > > original Thumb1 ISA.
> >
> > I will remove the CM0 prefix and use/extend the standard macro names.
> >
> > >
> > > >
> > > > The CM0_FUNC_START exists so that I can specify subsections of ".text"
> > > > for each function.  This was a fairly fundamental design decision that
> > > > allowed me to make a number of branch optimizations between functions.
> > > > The other macros are just duplicates for naming symmetry.
> > >
> > > This is something we'll have to get to during the main review of the
> > > code - we used to have support for PE-COFF object files.  That might now
> > > be obsolete, wince support is certainly deprecated - but we can't assume
> > > that ELF is the only object format we'll ever have to support.
> > >
> > > >
> > > > The existing  FUNC_START macro inserts extra conflicting ".text"
> > > > directives that would break the build.  Of course, the prefix was
> > > > arbitrary; I just took CM0 from the library name.  But, there's nothing
> > > > architecturally significant about this macro at all, so THUMB1 and T1
> > > > seems just about as wrong.  Maybe define a FUNC_START_SECTION macro with
> > > > two parameters? For example:
> > > >
> > > >     FUNC_START_SECTION clzdi2 .text.sorted.libgcc.clz2.clzdi2
> > > >
> > > > Instead of:
> > > >
> > > >     .section .text.sorted.libgcc.clz2.clzdi2,"x"
> > > >     CM0_FUNC_START clzdi2
> > > >
> > > >> - For the 64-bit shift functions, I expect the existing code to be
> > > >> preferable whenever possible - I think it can even be tweaked to build
> > > >> for thumb2 by inserting a suitable IT instruction.  So your code should
> > > >> only be used when
> > > >>
> > > >>  #if !__ARM_ARCH_ISA_ARM && __ARM_ARCH_ISA_THUMB == 1
> > > >
> > > > That is the definition of NOT_ISA_TARGET_32BIT, which I am respecting.
> > > > (The name doesn't seem quite right for Cortex M0, since it does support
> > > > some 32 bit instructions, but that's beside the point.)
> > >
> > > The terms Thumb1 and Thumb2 predate the arm-v8m architecture
> > > specifications - even then the term Thumb1 was interpreted as "mostly
> > > 16-bit instructions" and thumb2 as "a mix of 16- and 32-bit".  Yes, the
> > > 16/32-bit spilt has become more blurred and that will likely continue in
> > > future since the 16-bit encoding space is pretty full.
> > >
> > > >
> > > > The current lib1funcs ARM code path still exists, as described above. My
> > > > THUMB1 implementations were 1 - 3 instructions shorter than the current
> > > > versions, which is why I took the time to merge the files.
> > > >
> > > > Unfortunately, the Cortex M0 THUMB2 subset does not provide IT.  I don't
> > > > see an advantage to eliminating the branch unless these functions were
> > > > written with cryptographic side channel attacks in mind.
> > >
> > > On high performance cores branches are predicted - if the branch is
> > > predictable then the common path will be taken and the unneeded
> > > instructions will never be used.  But library functions like this tend
> > > to have very unpredictable values used for calling them, so it's much
> > > less likely that the hardware will predict the right path - at this
> > > point conditional instructions tend to win (especially if there aren't
> > > very many of them) because the cost (on average) of not executing the
> > > unneeded instructions is much lower than the cost (on average) of
> > > unwinding the processor state to execute the other arm of the
> > > conditional branch.
> >
> > Got it.  I have been counting branches as 3 cycles of fixed cost, and
> > ignoring penalties if a branch skips at least 2 instructions.
> >
> > Going forward, I will add 'IT<c>' compile options for any new code with
> > scope beyond v6m.
> >
> > > >> - most, if not all, of your LSYM symbols should not be needed after
> > > >> assembly, so should start with a captial 'L' (and no leading
> > > >> underscores), the assembler will then automatically discard any that are
> > > >> not needed for relocations.
> > > >
> > > > You don't want debugging symbols for libgcc internals :) ?  I sort of
> > > > understand that, but those symbols have been useful to me in the past.
> > > > The "." by itself seems to keep visibility local, so the extra symbols
> > > > won't cause linker issuess. Would you object to a macro variant (e.g.
> > > > LLSYM) that prepends the "L" but is easier to disable?
> > >
> > > It is a matter of taste, but I really prefer the local symbols to
> > > disappear entirely once the file is compiled - it makes things like
> > > backtrace gdb show the proper call heirarchy.
> >
> > Hearing no objection to LLSYM, I'll implement that for debugging.
> > The released version will have ".L" symbols stripped.
> >
> > > >> - you'll need to write suitable commit messages for each patch, which
> > > >> also contain a suitable ChangeLog style entry.
> > > >
> > > > OK.
> > > >
> > > >> - finally, your popcount implementations have data in the code segment.
> > > >>  That's going to cause problems when we have compilation options such as
> > > >> -mpure-code.
> > > >
> > > > I am just following the precedent of existing lib1funcs (e.g. __clz2si).
> > > > If this matters, you'll need to point in the right direction for the
> > > > fix.  I'm not sure it does matter, since these functions are PIC anyway.
> > >
> > > That might be a bug in the clz implementations - Christophe: Any thoughts?
> >
> > __clzsi2() has test coverage in "gcc.c-torture/execute/builtin-bitops-1.c"
> Thanks, I'll have a closer look at why I didn't see problems.
>

So, that's because the code goes to the .text section (as opposed to
.text.noread)
and does not have the PURECODE flag. The compiler takes care of this
when generating code with -mpure-code.
And the simulator does not complain because it only checks loads from
the segment with the PURECODE flag set.

> > The 'clzs' and 'ctz' functions should never have problems.   -mpure-code
> > appears to be valid only when the 'movt' instruction is available, which
> > means that the 'clz' instruction will also be available, so no array loads.
> No, -mpure-code is also supported with v6m.
>
> > Is the -mpure-code state detectable as a preprocessor flag?  While
> No.
>
> > 'movw'/'movt' appears to be the canonical solution, I'm not sure it
> > should be the default just because a processor supports Thumb-2.
> >
> > Do users wanting to use -mpure-code recompile the toolchain to avoid
> > constant data in compiled C functions?  I don't think this is the
> > default for the typical toolchain scripts.
> No, users of -mpure-code do not recompile the toolchain.
>
> > > >> I strongly suggest that, rather than using gcc snapshots (I'm assuming
> > > >> this based on the diff style and directory naming in your patches), you
> > > >> switch to using a git tree, then you'll be able to use tools such as
> > > >> rebasing and the git posting tools to send the patch series for
> > > >> subsequent review.
> > > >
> > > > Your assumption is correct.  I didn't think that I would have to get so
> > > > deep into the gcc development process for this contribution.  I used
> > > > this library as a bare metal alternative for libgcc/libm in the product
> > > > for years, so I thought it would just drop in.  But, the libgcc compile
> > > > mechanics have proved much more 'interesting'. I'm assuming this
> > > > architecture was created years before the introduction of -ffunction-
> > > > sections...
> > > >
> > >
> > > I don't think I've time to write a history lesson, even if you wanted
> > > it.  Suffice to say, this does date back to the days of a.out format
> > > object files (with 4 relocation types, STABS debugging, and one code,
> > > one data and one bss section).
> > >
> > > >>
> > > >> Richard.
> > > >>
> > > >
> > > > Thanks again,
> > > > Daniel
> > > >
> > >
> > > R.
> > >
> >
> > To reiterate what I said above, I intend to push forward and incorporate
> > your current recommendations plus any further feedback I may get.  I
> > expect you to say that this doesn't merit inclusion yet, but I'd rather
> > spend the time while I have it.
> >
> > I'll post a patch series for review within the next day or so.
>
> Here are the results of the validation of your latest version (20210105):
> https://people.linaro.org/~christophe.lyon/cross-validation/gcc-test-patches/r11-5993-g159b0bd9ce263dfb791eff5133b0ca0207201c84-cortex-m0-fplib-20210105.patch/report-build-info.html
>
> "BIG-REGR" just means the regression report is large enough that it's
> provided in compressed form to avoid overloading the browser.
>
> So it really seems your patch introduces regressions in arm*linux* configs.
> For the 2 arm-none-eabi configs which show regressions (cortex-m0 and
> cortex-m3), the logs seem to indicate some tests timed out, and it's
> possible the server used was overloaded.
> The same applies to the 3 aarch64*elf cases, where the regressions
> seem only caused by timed out; there's no reason your patch would have
> an impact on aarch64.
> (there 5 configs were tested on the same machine, so overload is indeed likely).
>
> I didn't check why all the ubsan tests now seem to fail, they are in
> the "unstable" category because in the past some of them had some
> randomness.
> I do not see such noise in trunk validation though.
>
> Thanks,
>
> Christophe
>
> >
> > Thanks again,
> > Daniel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2021-01-11 16:07                   ` Christophe Lyon
@ 2021-01-11 16:18                     ` Daniel Engel
  2021-01-11 16:39                       ` Christophe Lyon
  0 siblings, 1 reply; 26+ messages in thread
From: Daniel Engel @ 2021-01-11 16:18 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: Richard Earnshaw, gcc Patches

On Mon, Jan 11, 2021, at 8:07 AM, Christophe Lyon wrote:
> On Sat, 9 Jan 2021 at 14:09, Christophe Lyon <christophe.lyon@linaro.org> wrote:
> >
> > On Sat, 9 Jan 2021 at 13:27, Daniel Engel <libgcc@danielengel.com> wrote:
> > >
> > > On Thu, Jan 7, 2021, at 4:56 AM, Richard Earnshaw wrote:
> > > > On 07/01/2021 00:59, Daniel Engel wrote:
> > > > > --snip--
> > > > >
> > > > > On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote:
> > > > > --snip--
> > > > >
> > > > >> - finally, your popcount implementations have data in the code segment.
> > > > >>  That's going to cause problems when we have compilation options such as
> > > > >> -mpure-code.
> > > > >
> > > > > I am just following the precedent of existing lib1funcs (e.g. __clz2si).
> > > > > If this matters, you'll need to point in the right direction for the
> > > > > fix.  I'm not sure it does matter, since these functions are PIC anyway.
> > > >
> > > > That might be a bug in the clz implementations - Christophe: Any thoughts?
> > >
> > > __clzsi2() has test coverage in "gcc.c-torture/execute/builtin-bitops-1.c"
> > Thanks, I'll have a closer look at why I didn't see problems.
> >
> 
> So, that's because the code goes to the .text section (as opposed to
> .text.noread)
> and does not have the PURECODE flag. The compiler takes care of this
> when generating code with -mpure-code.
> And the simulator does not complain because it only checks loads from
> the segment with the PURECODE flag set.
> 
This is far out of my depth, but can something like: 

ifeq (,$(findstring __symbian__,$(shell $(gcc_compile_bare) -dM -E - </dev/null)))

be adapted to:

a) detect the state of the -mpure-code switch, and
b) pass that flag to the preprocessor?

If so, I can probably fix both the target section and the data usage.  
Just have to add a few instructions to finish unrolling the loop. 

> > > The 'clzs' and 'ctz' functions should never have problems.   -mpure-code
> > > appears to be valid only when the 'movt' instruction is available, which
> > > means that the 'clz' instruction will also be available, so no array loads.
> > No, -mpure-code is also supported with v6m.
> >
> > > Is the -mpure-code state detectable as a preprocessor flag?  While
> > No.
> >
> > > 'movw'/'movt' appears to be the canonical solution, I'm not sure it
> > > should be the default just because a processor supports Thumb-2.
> > >
> > > Do users wanting to use -mpure-code recompile the toolchain to avoid
> > > constant data in compiled C functions?  I don't think this is the
> > > default for the typical toolchain scripts.
> > No, users of -mpure-code do not recompile the toolchain.
> >
> > --snip --

>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2021-01-11 16:18                     ` Daniel Engel
@ 2021-01-11 16:39                       ` Christophe Lyon
  2021-01-15 11:40                         ` Daniel Engel
  0 siblings, 1 reply; 26+ messages in thread
From: Christophe Lyon @ 2021-01-11 16:39 UTC (permalink / raw)
  To: Daniel Engel; +Cc: Richard Earnshaw, gcc Patches

On Mon, 11 Jan 2021 at 17:18, Daniel Engel <libgcc@danielengel.com> wrote:
>
> On Mon, Jan 11, 2021, at 8:07 AM, Christophe Lyon wrote:
> > On Sat, 9 Jan 2021 at 14:09, Christophe Lyon <christophe.lyon@linaro.org> wrote:
> > >
> > > On Sat, 9 Jan 2021 at 13:27, Daniel Engel <libgcc@danielengel.com> wrote:
> > > >
> > > > On Thu, Jan 7, 2021, at 4:56 AM, Richard Earnshaw wrote:
> > > > > On 07/01/2021 00:59, Daniel Engel wrote:
> > > > > > --snip--
> > > > > >
> > > > > > On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote:
> > > > > > --snip--
> > > > > >
> > > > > >> - finally, your popcount implementations have data in the code segment.
> > > > > >>  That's going to cause problems when we have compilation options such as
> > > > > >> -mpure-code.
> > > > > >
> > > > > > I am just following the precedent of existing lib1funcs (e.g. __clz2si).
> > > > > > If this matters, you'll need to point in the right direction for the
> > > > > > fix.  I'm not sure it does matter, since these functions are PIC anyway.
> > > > >
> > > > > That might be a bug in the clz implementations - Christophe: Any thoughts?
> > > >
> > > > __clzsi2() has test coverage in "gcc.c-torture/execute/builtin-bitops-1.c"
> > > Thanks, I'll have a closer look at why I didn't see problems.
> > >
> >
> > So, that's because the code goes to the .text section (as opposed to
> > .text.noread)
> > and does not have the PURECODE flag. The compiler takes care of this
> > when generating code with -mpure-code.
> > And the simulator does not complain because it only checks loads from
> > the segment with the PURECODE flag set.
> >
> This is far out of my depth, but can something like:
>
> ifeq (,$(findstring __symbian__,$(shell $(gcc_compile_bare) -dM -E - </dev/null)))
>
> be adapted to:
>
> a) detect the state of the -mpure-code switch, and
> b) pass that flag to the preprocessor?
>
> If so, I can probably fix both the target section and the data usage.
> Just have to add a few instructions to finish unrolling the loop.

I must confess I never checked libgcc's Makefile deeply before,
but it looks like you can probably detect whether -mpure-code is
part of $CFLAGS.

However, it might be better to write pure-code-safe code
unconditionally because the toolchain will probably not
be rebuilt with -mpure-code as discussed before.
Or that could mean adding a -mpure-code multilib....

>
> > > > The 'clzs' and 'ctz' functions should never have problems.   -mpure-code
> > > > appears to be valid only when the 'movt' instruction is available, which
> > > > means that the 'clz' instruction will also be available, so no array loads.
> > > No, -mpure-code is also supported with v6m.
> > >
> > > > Is the -mpure-code state detectable as a preprocessor flag?  While
> > > No.
> > >
> > > > 'movw'/'movt' appears to be the canonical solution, I'm not sure it
> > > > should be the default just because a processor supports Thumb-2.
> > > >
> > > > Do users wanting to use -mpure-code recompile the toolchain to avoid
> > > > constant data in compiled C functions?  I don't think this is the
> > > > default for the typical toolchain scripts.
> > > No, users of -mpure-code do not recompile the toolchain.
> > >
> > > --snip --
>
> >

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2021-01-11 16:39                       ` Christophe Lyon
@ 2021-01-15 11:40                         ` Daniel Engel
  2021-01-15 12:30                           ` Christophe Lyon
  0 siblings, 1 reply; 26+ messages in thread
From: Daniel Engel @ 2021-01-15 11:40 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: Richard Earnshaw, gcc Patches

Hi Christophe,

On Mon, Jan 11, 2021, at 8:39 AM, Christophe Lyon wrote:
> On Mon, 11 Jan 2021 at 17:18, Daniel Engel <libgcc@danielengel.com> wrote:
> >
> > On Mon, Jan 11, 2021, at 8:07 AM, Christophe Lyon wrote:
> > > On Sat, 9 Jan 2021 at 14:09, Christophe Lyon <christophe.lyon@linaro.org> wrote:
> > > >
> > > > On Sat, 9 Jan 2021 at 13:27, Daniel Engel <libgcc@danielengel.com> wrote:
> > > > >
> > > > > On Thu, Jan 7, 2021, at 4:56 AM, Richard Earnshaw wrote:
> > > > > > On 07/01/2021 00:59, Daniel Engel wrote:
> > > > > > > --snip--
> > > > > > >
> > > > > > > On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote:
> > > > > > > --snip--
> > > > > > >
> > > > > > >> - finally, your popcount implementations have data in the code segment.
> > > > > > >>  That's going to cause problems when we have compilation options such as
> > > > > > >> -mpure-code.
> > > > > > >
> > > > > > > I am just following the precedent of existing lib1funcs (e.g. __clz2si).
> > > > > > > If this matters, you'll need to point in the right direction for the
> > > > > > > fix.  I'm not sure it does matter, since these functions are PIC anyway.
> > > > > >
> > > > > > That might be a bug in the clz implementations - Christophe: Any thoughts?
> > > > >
> > > > > __clzsi2() has test coverage in "gcc.c-torture/execute/builtin-bitops-1.c"
> > > > Thanks, I'll have a closer look at why I didn't see problems.
> > > >
> > >
> > > So, that's because the code goes to the .text section (as opposed to
> > > .text.noread)
> > > and does not have the PURECODE flag. The compiler takes care of this
> > > when generating code with -mpure-code.
> > > And the simulator does not complain because it only checks loads from
> > > the segment with the PURECODE flag set.
> > >
> > This is far out of my depth, but can something like:
> >
> > ifeq (,$(findstring __symbian__,$(shell $(gcc_compile_bare) -dM -E - </dev/null)))
> >
> > be adapted to:
> >
> > a) detect the state of the -mpure-code switch, and
> > b) pass that flag to the preprocessor?
> >
> > If so, I can probably fix both the target section and the data usage.
> > Just have to add a few instructions to finish unrolling the loop.
> 
> I must confess I never checked libgcc's Makefile deeply before,
> but it looks like you can probably detect whether -mpure-code is
> part of $CFLAGS.
> 
> However, it might be better to write pure-code-safe code
> unconditionally because the toolchain will probably not
> be rebuilt with -mpure-code as discussed before.
> Or that could mean adding a -mpure-code multilib....

I have learned a few things since the last update.  I think I know how
to get -mpure-code out of CFLAGS and into a macro.  However, I have hit
something of a wall with testing.  I can't seem to compile any flavor of
libgcc with CFLAGS_FOR_TARGET="-mpure-code".

1.  Configuring --with-multilib-list=rmprofile results in build failure:

    checking for suffix of object files... configure: error: in `/home/mirdan/gcc-obj/arm-none-eabi/libgcc':
    configure: error: cannot compute suffix of object files: cannot compile
    See `config.log' for more details

   cc1: error: -mpure-code only supports non-pic code on M-profile targets
   
2.  Attempting to filter the multib list results in configuration error.
    This might have been misguided, but it was something I tried:

    Error: --with-multilib-list=armv6s-m not supported.

    Error: --with-multilib-list=mthumb/march=armv6s-m/mfloat-abi=soft not supported

3.  Attempting to configure a single architecture results in a build error.  

    --with-mode=thumb --with-arch=armv6s-m --with-float=soft

    checking for suffix of object files... configure: error: in `/home/mirdan/gcc-obj/arm-none-eabi/arm/autofp/v5te/fpu/libgcc':
    configure: error: cannot compute suffix of object files: cannot compile
    See `config.log' for more details

    conftest.c:9:10: fatal error: ac_nonexistent.h: No such file or directory
        9 | #include <ac_nonexistent.h>
          |          ^~~~~~~~~~~~~~~~~~

This has me wondering whether pure-code in libgcc is a real issue ... 
If there's a way to build libgcc with -mpure-code, please enlighten me.

> >
> > > > > The 'clzs' and 'ctz' functions should never have problems.   -mpure-code
> > > > > appears to be valid only when the 'movt' instruction is available, which
> > > > > means that the 'clz' instruction will also be available, so no array loads.
> > > > No, -mpure-code is also supported with v6m.
> > > >
> > > > > Is the -mpure-code state detectable as a preprocessor flag?  While
> > > > No.
> > > >
> > > > > 'movw'/'movt' appears to be the canonical solution, I'm not sure it
> > > > > should be the default just because a processor supports Thumb-2.
> > > > >
> > > > > Do users wanting to use -mpure-code recompile the toolchain to avoid
> > > > > constant data in compiled C functions?  I don't think this is the
> > > > > default for the typical toolchain scripts.
> > > > No, users of -mpure-code do not recompile the toolchain.
> > > >
> > > > --snip --
> >
> > >
>

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2021-01-15 11:40                         ` Daniel Engel
@ 2021-01-15 12:30                           ` Christophe Lyon
  2021-01-16 16:14                             ` Daniel Engel
  0 siblings, 1 reply; 26+ messages in thread
From: Christophe Lyon @ 2021-01-15 12:30 UTC (permalink / raw)
  To: Daniel Engel; +Cc: Richard Earnshaw, gcc Patches

On Fri, 15 Jan 2021 at 12:39, Daniel Engel <libgcc@danielengel.com> wrote:
>
> Hi Christophe,
>
> On Mon, Jan 11, 2021, at 8:39 AM, Christophe Lyon wrote:
> > On Mon, 11 Jan 2021 at 17:18, Daniel Engel <libgcc@danielengel.com> wrote:
> > >
> > > On Mon, Jan 11, 2021, at 8:07 AM, Christophe Lyon wrote:
> > > > On Sat, 9 Jan 2021 at 14:09, Christophe Lyon <christophe.lyon@linaro.org> wrote:
> > > > >
> > > > > On Sat, 9 Jan 2021 at 13:27, Daniel Engel <libgcc@danielengel.com> wrote:
> > > > > >
> > > > > > On Thu, Jan 7, 2021, at 4:56 AM, Richard Earnshaw wrote:
> > > > > > > On 07/01/2021 00:59, Daniel Engel wrote:
> > > > > > > > --snip--
> > > > > > > >
> > > > > > > > On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote:
> > > > > > > > --snip--
> > > > > > > >
> > > > > > > >> - finally, your popcount implementations have data in the code segment.
> > > > > > > >>  That's going to cause problems when we have compilation options such as
> > > > > > > >> -mpure-code.
> > > > > > > >
> > > > > > > > I am just following the precedent of existing lib1funcs (e.g. __clz2si).
> > > > > > > > If this matters, you'll need to point in the right direction for the
> > > > > > > > fix.  I'm not sure it does matter, since these functions are PIC anyway.
> > > > > > >
> > > > > > > That might be a bug in the clz implementations - Christophe: Any thoughts?
> > > > > >
> > > > > > __clzsi2() has test coverage in "gcc.c-torture/execute/builtin-bitops-1.c"
> > > > > Thanks, I'll have a closer look at why I didn't see problems.
> > > > >
> > > >
> > > > So, that's because the code goes to the .text section (as opposed to
> > > > .text.noread)
> > > > and does not have the PURECODE flag. The compiler takes care of this
> > > > when generating code with -mpure-code.
> > > > And the simulator does not complain because it only checks loads from
> > > > the segment with the PURECODE flag set.
> > > >
> > > This is far out of my depth, but can something like:
> > >
> > > ifeq (,$(findstring __symbian__,$(shell $(gcc_compile_bare) -dM -E - </dev/null)))
> > >
> > > be adapted to:
> > >
> > > a) detect the state of the -mpure-code switch, and
> > > b) pass that flag to the preprocessor?
> > >
> > > If so, I can probably fix both the target section and the data usage.
> > > Just have to add a few instructions to finish unrolling the loop.
> >
> > I must confess I never checked libgcc's Makefile deeply before,
> > but it looks like you can probably detect whether -mpure-code is
> > part of $CFLAGS.
> >
> > However, it might be better to write pure-code-safe code
> > unconditionally because the toolchain will probably not
> > be rebuilt with -mpure-code as discussed before.
> > Or that could mean adding a -mpure-code multilib....
>
> I have learned a few things since the last update.  I think I know how
> to get -mpure-code out of CFLAGS and into a macro.  However, I have hit
> something of a wall with testing.  I can't seem to compile any flavor of
> libgcc with CFLAGS_FOR_TARGET="-mpure-code".
>
> 1.  Configuring --with-multilib-list=rmprofile results in build failure:
>
>     checking for suffix of object files... configure: error: in `/home/mirdan/gcc-obj/arm-none-eabi/libgcc':
>     configure: error: cannot compute suffix of object files: cannot compile
>     See `config.log' for more details
>
>    cc1: error: -mpure-code only supports non-pic code on M-profile targets
>

Yes, I did hit that wall too :-)

Hence what we discussed earlier: the toolchain is not rebuilt with -mpure-code.

Note that there are problems in newlib too, but users of -mpure-code seem
to be able to work around that (eg. using their own startup code and no stdlib)

> 2.  Attempting to filter the multib list results in configuration error.
>     This might have been misguided, but it was something I tried:
>
>     Error: --with-multilib-list=armv6s-m not supported.
>
>     Error: --with-multilib-list=mthumb/march=armv6s-m/mfloat-abi=soft not supported

I think only 2 values are supported: aprofile and rmprofile.

> 3.  Attempting to configure a single architecture results in a build error.
>
>     --with-mode=thumb --with-arch=armv6s-m --with-float=soft
>
>     checking for suffix of object files... configure: error: in `/home/mirdan/gcc-obj/arm-none-eabi/arm/autofp/v5te/fpu/libgcc':
>     configure: error: cannot compute suffix of object files: cannot compile
>     See `config.log' for more details
>
>     conftest.c:9:10: fatal error: ac_nonexistent.h: No such file or directory
>         9 | #include <ac_nonexistent.h>
>           |          ^~~~~~~~~~~~~~~~~~
I never saw that error message, but I never build using --with-arch.
I do use --with-cpu though.

> This has me wondering whether pure-code in libgcc is a real issue ...
> If there's a way to build libgcc with -mpure-code, please enlighten me.
I haven't done so yet. Maybe building the toolchain --with-cpu=cortex-m0
works?

Thanks,

Christophe

> > > > > > The 'clzs' and 'ctz' functions should never have problems.   -mpure-code
> > > > > > appears to be valid only when the 'movt' instruction is available, which
> > > > > > means that the 'clz' instruction will also be available, so no array loads.
> > > > > No, -mpure-code is also supported with v6m.
> > > > >
> > > > > > Is the -mpure-code state detectable as a preprocessor flag?  While
> > > > > No.
> > > > >
> > > > > > 'movw'/'movt' appears to be the canonical solution, I'm not sure it
> > > > > > should be the default just because a processor supports Thumb-2.
> > > > > >
> > > > > > Do users wanting to use -mpure-code recompile the toolchain to avoid
> > > > > > constant data in compiled C functions?  I don't think this is the
> > > > > > default for the typical toolchain scripts.
> > > > > No, users of -mpure-code do not recompile the toolchain.
> > > > >
> > > > > --snip --
> > >
> > > >
> >
>
> Thanks,
> Daniel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2021-01-15 12:30                           ` Christophe Lyon
@ 2021-01-16 16:14                             ` Daniel Engel
  2021-01-21 10:29                               ` Christophe Lyon
  0 siblings, 1 reply; 26+ messages in thread
From: Daniel Engel @ 2021-01-16 16:14 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: Richard Earnshaw, gcc Patches

Hi Christophe,

On Fri, Jan 15, 2021, at 4:30 AM, Christophe Lyon wrote:
> On Fri, 15 Jan 2021 at 12:39, Daniel Engel <libgcc@danielengel.com> wrote:
> >
> > Hi Christophe,
> >
> > On Mon, Jan 11, 2021, at 8:39 AM, Christophe Lyon wrote:
> > > On Mon, 11 Jan 2021 at 17:18, Daniel Engel <libgcc@danielengel.com> wrote:
> > > >
> > > > On Mon, Jan 11, 2021, at 8:07 AM, Christophe Lyon wrote:
> > > > > On Sat, 9 Jan 2021 at 14:09, Christophe Lyon <christophe.lyon@linaro.org> wrote:
> > > > > >
> > > > > > On Sat, 9 Jan 2021 at 13:27, Daniel Engel <libgcc@danielengel.com> wrote:
> > > > > > >
> > > > > > > On Thu, Jan 7, 2021, at 4:56 AM, Richard Earnshaw wrote:
> > > > > > > > On 07/01/2021 00:59, Daniel Engel wrote:
> > > > > > > > > --snip--
> > > > > > > > >
> > > > > > > > > On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote:
> > > > > > > > > --snip--
> > > > > > > > >
> > > > > > > > >> - finally, your popcount implementations have data in the code segment.
> > > > > > > > >>  That's going to cause problems when we have compilation options such as
> > > > > > > > >> -mpure-code.
> > > > > > > > >
> > > > > > > > > I am just following the precedent of existing lib1funcs (e.g. __clz2si).
> > > > > > > > > If this matters, you'll need to point in the right direction for the
> > > > > > > > > fix.  I'm not sure it does matter, since these functions are PIC anyway.
> > > > > > > >
> > > > > > > > That might be a bug in the clz implementations - Christophe: Any thoughts?
> > > > > > >
> > > > > > > __clzsi2() has test coverage in "gcc.c-torture/execute/builtin-bitops-1.c"
> > > > > > Thanks, I'll have a closer look at why I didn't see problems.
> > > > > >
> > > > >
> > > > > So, that's because the code goes to the .text section (as opposed to
> > > > > .text.noread)
> > > > > and does not have the PURECODE flag. The compiler takes care of this
> > > > > when generating code with -mpure-code.
> > > > > And the simulator does not complain because it only checks loads from
> > > > > the segment with the PURECODE flag set.
> > > > >
> > > > This is far out of my depth, but can something like:
> > > >
> > > > ifeq (,$(findstring __symbian__,$(shell $(gcc_compile_bare) -dM -E - </dev/null)))
> > > >
> > > > be adapted to:
> > > >
> > > > a) detect the state of the -mpure-code switch, and
> > > > b) pass that flag to the preprocessor?
> > > >
> > > > If so, I can probably fix both the target section and the data usage.
> > > > Just have to add a few instructions to finish unrolling the loop.
> > >
> > > I must confess I never checked libgcc's Makefile deeply before,
> > > but it looks like you can probably detect whether -mpure-code is
> > > part of $CFLAGS.
> > >
> > > However, it might be better to write pure-code-safe code
> > > unconditionally because the toolchain will probably not
> > > be rebuilt with -mpure-code as discussed before.
> > > Or that could mean adding a -mpure-code multilib....
> >
> > I have learned a few things since the last update.  I think I know how
> > to get -mpure-code out of CFLAGS and into a macro.  However, I have hit
> > something of a wall with testing.  I can't seem to compile any flavor of
> > libgcc with CFLAGS_FOR_TARGET="-mpure-code".
> >
> > 1.  Configuring --with-multilib-list=rmprofile results in build failure:
> >
> >     checking for suffix of object files... configure: error: in `/home/mirdan/gcc-obj/arm-none-eabi/libgcc':
> >     configure: error: cannot compute suffix of object files: cannot compile
> >     See `config.log' for more details
> >
> >    cc1: error: -mpure-code only supports non-pic code on M-profile targets
> >
> 
> Yes, I did hit that wall too :-)
> 
> Hence what we discussed earlier: the toolchain is not rebuilt with -mpure-code.
> 
> Note that there are problems in newlib too, but users of -mpure-code seem
> to be able to work around that (eg. using their own startup code and no stdlib)

Is there a current side project to solve the makefile problems?

I think I'm back to my original question: If libgcc can't be built
with -mpure-code, and users bypass it completely with -nostdlib, then
why this conversation about pure-code compatibility of __clzsi2() etc?

> > 2.  Attempting to filter the multib list results in configuration error.
> >     This might have been misguided, but it was something I tried:
> >
> >     Error: --with-multilib-list=armv6s-m not supported.
> >
> >     Error: --with-multilib-list=mthumb/march=armv6s-m/mfloat-abi=soft not supported
> 
> I think only 2 values are supported: aprofile and rmprofile.

It looks like this might require a custom t-* multilib in gcc/config/arm. 

> > 3.  Attempting to configure a single architecture results in a build error.
> >
> >     --with-mode=thumb --with-arch=armv6s-m --with-float=soft
> >
> >     checking for suffix of object files... configure: error: in `/home/mirdan/gcc-obj/arm-none-eabi/arm/autofp/v5te/fpu/libgcc':
> >     configure: error: cannot compute suffix of object files: cannot compile
> >     See `config.log' for more details
> >
> >     conftest.c:9:10: fatal error: ac_nonexistent.h: No such file or directory
> >         9 | #include <ac_nonexistent.h>
> >           |          ^~~~~~~~~~~~~~~~~~
> I never saw that error message, but I never build using --with-arch.
> I do use --with-cpu though.
> 
> > This has me wondering whether pure-code in libgcc is a real issue ...
> > If there's a way to build libgcc with -mpure-code, please enlighten me.
> I haven't done so yet. Maybe building the toolchain --with-cpu=cortex-m0
> works?

No luck with that.  Same error message as before: 

4.  --with-mode=thumb --with-arch=armv6s-m --with-float=soft --with-cpu=cortex-m0

    Switch "--with-arch" may not be used with switch "--with-cpu"

5.  Then: --with-mode=thumb --with-float=soft --with-cpu=cortex-m0

    checking for suffix of object files... configure: error: in `/home/mirdan/gcc-obj/arm-none-eabi/arm/autofp/v5te/fpu/libgcc':
    configure: error: cannot compute suffix of object files: cannot compile
    See `config.log' for more details

    cc1: error: -mpure-code only supports non-pic code on M-profile targets

6.  Finally! --with-float=soft --with-cpu=cortex-m0 --disable-multilib

Once you know this, and read the docs sideways, the previous errors are
all probably "works as designed".  But, I can still grumble.  

With libgcc compiled with -mpure-code, I can confirm that 
'builtin-bitops-1.c' (the test for __clzsi2) passed with libgcc as-is.

I then added the SHF_ARM_PURECODE flag to the libgcc assembly functions
and re-ran the test.  Still passed.  I then added -mpure-code to
RUNTESTFLAGS and re-ran the test.  Still passed.  readelf confirmed that
the test program is compiling as expected [1]:

    [ 2] .text             PROGBITS        0000800c 00800c 003314 00 AXy  0   0  4
    Key to Flags:
    W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
    L (link order), O (extra OS processing required), G (group), T (TLS),
    C (compressed), x (unknown), o (OS specific), E (exclude),
    y (purecode), p (processor specific)

It was only when I started inserting pure-code test directives into 
'builtin-bitops-1.c' that 'make check' began to report errors.

    /* { dg-do compile } */
    ...
    /* { dg-options "-mpure-code -mfp16-format=ieee" } */
    /* { dg-final { scan-assembler-not "\\.(float|l\\?double|\d?byte|short|int|long|quad|word)\\s+\[^.\]" } } */

However, for reasons [2] [3] [4] [5], this wasn't actually useful.  It's
sufficient to say that there are many reasons that non-pure-code
compatible functions exist in libgcc.

Although I'm not sure how useful this will be in light of the previous
findings, I did take the opportunity with a working compile process to
modify the relevant assembly functions for -mpure-code compatibility.
I can manually disassemble the library and verify correct compilation.
I can manually run a non-pure-code builtin-bitops-1 with a pure-code
library to verify correct execution.  But, I don't think your standard
regression suite will be able to exercise the new paths.

The patch is below; you can consider this as 34/33 in the series.

Regards,
Daniel

[1] It's pretty clear that the section flags in libgcc have never really
    mattered.  When the linker strings all of the used objects together,
    the original sections disappear into a single output object. The
    compiler controls those flags regardless of what libgcc does.)

[2] The existing pure-code tests are compile-only and cover just the
    disassembled 'main.o'.  There is no test of a complete executable
    and there is no execution/simulation.  

[3] While other parts of binutils may understand SHF_ARM_PURECODE, I
    don't think the simulator checks section flags or throws exceptions.

[4] builtin-bitops-1 modified this way will always fail due to the array
    data definitions (longs, longlongs, etc).  GCC can't translate those
    to instructions.  While the ".data" section would presumably be
    readable, scan-assembler-not doesn't know the difference.

[5] Even if the simulator were modified to throw exceptions, this will
    continue to fail because _mainCRTStartup uses a literal pool.

> Thanks,
> 
> Christophe
> 
> > > > > > > The 'clzs' and 'ctz' functions should never have problems.   -mpure-code
> > > > > > > appears to be valid only when the 'movt' instruction is available, which
> > > > > > > means that the 'clz' instruction will also be available, so no array loads.
> > > > > > No, -mpure-code is also supported with v6m.
> > > > > >
> > > > > > > Is the -mpure-code state detectable as a preprocessor flag?  While
> > > > > > No.
> > > > > >
> > > > > > > 'movw'/'movt' appears to be the canonical solution, I'm not sure it
> > > > > > > should be the default just because a processor supports Thumb-2.
> > > > > > >
> > > > > > > Do users wanting to use -mpure-code recompile the toolchain to avoid
> > > > > > > constant data in compiled C functions?  I don't think this is the
> > > > > > > default for the typical toolchain scripts.
> > > > > > No, users of -mpure-code do not recompile the toolchain.
> > > > > >
> > > > > > --snip --
> > > >
> > > > >
> > >
> >
> > Thanks,
> > Daniel

    Add -mpure-code support to the CM0 functions.

    gcc/libgcc/ChangeLog:
    2021-01-16 Daniel Engel <gnu@danielengel.com>

            Makefile.in (MPURE_CODE): New macro defines __PURE_CODE__.
            (gcc_compile): Appended MPURE_CODE.
            lib1funcs.S (FUNC_START_SECTION): Set flags for __PURE_CODE__.
            clz2.S (__clzsi2): Added -mpure-code compatible instructions.
            ctz2.S (__ctzsi2): Same.
            popcnt.S (__popcountsi2, __popcountdi2): Same.

diff --git a/libgcc/Makefile.in b/libgcc/Makefile.in
index 2de57519734..cd6b5f9c1b0 100644
--- a/libgcc/Makefile.in
+++ b/libgcc/Makefile.in
@@ -303,6 +303,9 @@ CRTSTUFF_CFLAGS = -O2 $(GCC_CFLAGS) $(INCLUDES) $(MULTILIB_CFLAGS) -g0 \
 # Extra flags to use when compiling crt{begin,end}.o.
 CRTSTUFF_T_CFLAGS =

+# Pass the -mpure-code flag into assembly for conditional compilation.
+MPURE_CODE = $(if $(findstring -mpure-code,$(CFLAGS)), -D__PURE_CODE__)
+
 MULTIDIR := $(shell $(CC) $(CFLAGS) -print-multi-directory)
 MULTIOSDIR := $(shell $(CC) $(CFLAGS) -print-multi-os-directory)

@@ -312,7 +315,7 @@ inst_slibdir = $(slibdir)$(MULTIOSSUBDIR)

 gcc_compile_bare = $(CC) $(INTERNAL_CFLAGS)
 compile_deps = -MT $@ -MD -MP -MF $(basename $@).dep
-gcc_compile = $(gcc_compile_bare) -o $@ $(compile_deps)
+gcc_compile = $(gcc_compile_bare) -o $@ $(compile_deps) $(MPURE_CODE)
 gcc_s_compile = $(gcc_compile) -DSHARED

 objects = $(filter %$(objext),$^)
diff --git a/libgcc/config/arm/clz2.S b/libgcc/config/arm/clz2.S
index a2de45ff651..97a44f5d187 100644
--- a/libgcc/config/arm/clz2.S
+++ b/libgcc/config/arm/clz2.S
@@ -214,17 +214,40 @@ FUNC_ENTRY clzsi2
      IT(sub,ne) r2,     #4

     LLSYM(__clz2):
+  #if defined(__PURE_CODE__) && __PURE_CODE__
+        // Without access to table data, continue unrolling the loop.
+        lsrs    r1,     r0,     #2
+
+      #ifdef __HAVE_FEATURE_IT
+        do_it   ne,t
+      #else
+        beq     LLSYM(__clz1)
+      #endif
+
+        // Out of 4 bits, the first '1' is somewhere in the highest 2,
+        //  so the lower 2 bits are no longer interesting.
+     IT(mov,ne) r0,     r1
+     IT(sub,ne) r2,     #2
+
+    LLSYM(__clz1):
+        // Convert remainder {0,1,2,3} to {0,1,2,2}.
+        lsrs    r1,     r0,     #1
+        bics    r0,     r1
+
+  #else /* !__PURE_CODE__ */
         // Load the remainder by index
         adr     r1,     LLSYM(__clz_remainder)
         ldrb    r0,     [r1, r0]

+  #endif /* !__PURE_CODE__ */
   #endif /* !__OPTIMIZE_SIZE__ */

         // Account for the remainder.
         subs    r0,     r2,     r0
         RET

-  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+  #if !(defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__) && \
+      !(defined(__PURE_CODE__) && __PURE_CODE__)
         .align 2
     LLSYM(__clz_remainder):
         .byte 0,1,2,2,3,3,3,3,4,4,4,4,4,4,4,4
diff --git a/libgcc/config/arm/ctz2.S b/libgcc/config/arm/ctz2.S
index b9528a061a2..6a49d64f3a6 100644
--- a/libgcc/config/arm/ctz2.S
+++ b/libgcc/config/arm/ctz2.S
@@ -209,11 +209,44 @@ FUNC_ENTRY ctzsi2
      IT(sub,ne) r2,     #4

     LLSYM(__ctz2):
+  #if defined(__PURE_CODE__) && __PURE_CODE__
+        // Without access to table data, continue unrolling the loop.
+        lsls    r1,     r0,     #2
+
+      #ifdef __HAVE_FEATURE_IT
+        do_it   ne, t
+      #else
+        beq     LLSYM(__ctz1)
+      #endif
+
+        // Out of 4 bits, the first '1' is somewhere in the lowest 2,
+        //  so the higher 2 bits are no longer interesting.
+     IT(mov,ne) r0,     r1
+     IT(sub,ne) r2,     #2
+
+    LLSYM(__ctz1):
+        // Convert remainder {0,1,2,3} in $r0[31:30] to {0,2,1,2}.
+        lsrs    r0,     #31
+
+      #ifdef __HAVE_FEATURE_IT
+        do_it   cs, t
+      #else
+        bcc     LLSYM(__ctz_zero)
+      #endif
+
+        // If bit[30] of the remainder is set, neither of these bits count
+        //  towards the result.  Bit[31] must be cleared.
+        // Otherwise, bit[31] becomes the final remainder.
+     IT(sub,cs) r2,     #2
+     IT(eor,cs) r0,     r0
+
+  #else /* !__PURE_CODE__ */
         // Look up the remainder by index.
         lsrs    r0,     #28
         adr     r3,     LLSYM(__ctz_remainder)
         ldrb    r0,     [r3, r0]

+  #endif /* !__PURE_CODE__ */
   #endif /* !__OPTIMIZE_SIZE__ */

     LLSYM(__ctz_zero):
@@ -221,8 +254,9 @@ FUNC_ENTRY ctzsi2
         subs    r0,     r2,     r0
         RET

-  #if (!defined(__ARM_FEATURE_CLZ) || !__ARM_FEATURE_CLZ) && \
-      (!defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__)
+  #if !(defined(__ARM_FEATURE_CLZ) && __ARM_FEATURE_CLZ) && \
+      !(defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__) && \
+      !(defined(__PURE_CODE__) && __PURE_CODE__)
         .align 2
     LLSYM(__ctz_remainder):
         .byte 0,4,3,4,2,4,3,4,1,4,3,4,2,4,3,4
diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index 5148957144b..59b2370e160 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -454,7 +454,12 @@ SYM (\name):
    Use the *_START_SECTION macros for declarations that the linker should
     place in a non-defailt section (e.g. ".rodata", ".text.subsection"). */
 .macro FUNC_START_SECTION name section
-       .section \section,"x"
+#ifdef __PURE_CODE__
+       /* SHF_ARM_PURECODE | SHF_ALLOC | SHF_EXECINSTR */
+       .section \section,"0x20000006",%progbits
+#else
+       .section \section,"ax",%progbits
+#endif
        .align 0
        FUNC_ENTRY \name
 .endm
diff --git a/libgcc/config/arm/popcnt.S b/libgcc/config/arm/popcnt.S
index 51b1ed745ee..d6f65403b5d 100644
--- a/libgcc/config/arm/popcnt.S
+++ b/libgcc/config/arm/popcnt.S
@@ -23,6 +23,29 @@
    <http://www.gnu.org/licenses/>.  */


+#if defined(L_popcountdi2) || defined(L_popcountsi2)
+
+.macro ldmask reg, temp, value
+    #if defined(__PURE_CODE__) && (__PURE_CODE__)
+      #ifdef NOT_ISA_TARGET_32BIT
+        movs    \reg,   \value
+        lsls    \temp,  \reg,   #8
+        orrs    \reg,   \temp
+        lsls    \temp,  \reg,   #16
+        orrs    \reg,   \temp
+      #else
+        // Assumption: __PURE_CODE__ only support M-profile.
+        movw    \reg    ((\value) * 0x101)
+        movt    \reg    ((\value) * 0x101)
+      #endif
+    #else
+        ldr     \reg,   =((\value) * 0x1010101)
+    #endif
+.endm
+
+#endif
+
+
 #ifdef L_popcountdi2

 // int __popcountdi2(int)
@@ -49,7 +72,7 @@ FUNC_START_SECTION popcountdi2 .text.sorted.libgcc.popcountdi2

   #else /* !__OPTIMIZE_SIZE__ */
         // Load the one-bit alternating mask.
-        ldr     r3,     =0x55555555
+        ldmask  r3,     r2,     0x55

         // Reduce the second word.
         lsrs    r2,     r1,     #1
@@ -62,7 +85,7 @@ FUNC_START_SECTION popcountdi2 .text.sorted.libgcc.popcountdi2
         subs    r0,     r2

         // Load the two-bit alternating mask.
-        ldr     r3,     =0x33333333
+        ldmask  r3,     r2,     0x33

         // Reduce the second word.
         lsrs    r2,     r1,     #2
@@ -140,7 +163,7 @@ FUNC_ENTRY popcountsi2
   #else /* !__OPTIMIZE_SIZE__ */

         // Load the one-bit alternating mask.
-        ldr     r3,     =0x55555555
+        ldmask  r3,     r2,     0x55

         // Reduce the word.
         lsrs    r1,     r0,     #1
@@ -148,7 +171,7 @@ FUNC_ENTRY popcountsi2
         subs    r0,     r1

         // Load the two-bit alternating mask.
-        ldr     r3,     =0x33333333
+        ldmask  r3,     r2,     0x33

         // Reduce the word.
         lsrs    r1,     r0,     #2
@@ -158,7 +181,7 @@ FUNC_ENTRY popcountsi2
         adds    r0,     r1

         // Load the four-bit alternating mask.
-        ldr     r3,     =0x0F0F0F0F
+        ldmask  r3,     r2,     0x0F

         // Reduce the word.
         lsrs    r1,     r0,     #4

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2021-01-16 16:14                             ` Daniel Engel
@ 2021-01-21 10:29                               ` Christophe Lyon
  2021-01-21 20:35                                 ` Daniel Engel
  0 siblings, 1 reply; 26+ messages in thread
From: Christophe Lyon @ 2021-01-21 10:29 UTC (permalink / raw)
  To: Daniel Engel; +Cc: Richard Earnshaw, gcc Patches

On Sat, 16 Jan 2021 at 17:13, Daniel Engel <libgcc@danielengel.com> wrote:
>
> Hi Christophe,
>
> On Fri, Jan 15, 2021, at 4:30 AM, Christophe Lyon wrote:
> > On Fri, 15 Jan 2021 at 12:39, Daniel Engel <libgcc@danielengel.com> wrote:
> > >
> > > Hi Christophe,
> > >
> > > On Mon, Jan 11, 2021, at 8:39 AM, Christophe Lyon wrote:
> > > > On Mon, 11 Jan 2021 at 17:18, Daniel Engel <libgcc@danielengel.com> wrote:
> > > > >
> > > > > On Mon, Jan 11, 2021, at 8:07 AM, Christophe Lyon wrote:
> > > > > > On Sat, 9 Jan 2021 at 14:09, Christophe Lyon <christophe.lyon@linaro.org> wrote:
> > > > > > >
> > > > > > > On Sat, 9 Jan 2021 at 13:27, Daniel Engel <libgcc@danielengel.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Jan 7, 2021, at 4:56 AM, Richard Earnshaw wrote:
> > > > > > > > > On 07/01/2021 00:59, Daniel Engel wrote:
> > > > > > > > > > --snip--
> > > > > > > > > >
> > > > > > > > > > On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote:
> > > > > > > > > > --snip--
> > > > > > > > > >
> > > > > > > > > >> - finally, your popcount implementations have data in the code segment.
> > > > > > > > > >>  That's going to cause problems when we have compilation options such as
> > > > > > > > > >> -mpure-code.
> > > > > > > > > >
> > > > > > > > > > I am just following the precedent of existing lib1funcs (e.g. __clz2si).
> > > > > > > > > > If this matters, you'll need to point in the right direction for the
> > > > > > > > > > fix.  I'm not sure it does matter, since these functions are PIC anyway.
> > > > > > > > >
> > > > > > > > > That might be a bug in the clz implementations - Christophe: Any thoughts?
> > > > > > > >
> > > > > > > > __clzsi2() has test coverage in "gcc.c-torture/execute/builtin-bitops-1.c"
> > > > > > > Thanks, I'll have a closer look at why I didn't see problems.
> > > > > > >
> > > > > >
> > > > > > So, that's because the code goes to the .text section (as opposed to
> > > > > > .text.noread)
> > > > > > and does not have the PURECODE flag. The compiler takes care of this
> > > > > > when generating code with -mpure-code.
> > > > > > And the simulator does not complain because it only checks loads from
> > > > > > the segment with the PURECODE flag set.
> > > > > >
> > > > > This is far out of my depth, but can something like:
> > > > >
> > > > > ifeq (,$(findstring __symbian__,$(shell $(gcc_compile_bare) -dM -E - </dev/null)))
> > > > >
> > > > > be adapted to:
> > > > >
> > > > > a) detect the state of the -mpure-code switch, and
> > > > > b) pass that flag to the preprocessor?
> > > > >
> > > > > If so, I can probably fix both the target section and the data usage.
> > > > > Just have to add a few instructions to finish unrolling the loop.
> > > >
> > > > I must confess I never checked libgcc's Makefile deeply before,
> > > > but it looks like you can probably detect whether -mpure-code is
> > > > part of $CFLAGS.
> > > >
> > > > However, it might be better to write pure-code-safe code
> > > > unconditionally because the toolchain will probably not
> > > > be rebuilt with -mpure-code as discussed before.
> > > > Or that could mean adding a -mpure-code multilib....
> > >
> > > I have learned a few things since the last update.  I think I know how
> > > to get -mpure-code out of CFLAGS and into a macro.  However, I have hit
> > > something of a wall with testing.  I can't seem to compile any flavor of
> > > libgcc with CFLAGS_FOR_TARGET="-mpure-code".
> > >
> > > 1.  Configuring --with-multilib-list=rmprofile results in build failure:
> > >
> > >     checking for suffix of object files... configure: error: in `/home/mirdan/gcc-obj/arm-none-eabi/libgcc':
> > >     configure: error: cannot compute suffix of object files: cannot compile
> > >     See `config.log' for more details
> > >
> > >    cc1: error: -mpure-code only supports non-pic code on M-profile targets
> > >
> >
> > Yes, I did hit that wall too :-)
> >
> > Hence what we discussed earlier: the toolchain is not rebuilt with -mpure-code.
> >
> > Note that there are problems in newlib too, but users of -mpure-code seem
> > to be able to work around that (eg. using their own startup code and no stdlib)
>
> Is there a current side project to solve the makefile problems?
None that I know of.


> I think I'm back to my original question: If libgcc can't be built
> with -mpure-code, and users bypass it completely with -nostdlib, then
> why this conversation about pure-code compatibility of __clzsi2() etc?
I think Richard noticed this pre-existing problem as part of the review
of your patches. I don't think I meant fixing this is a prerequisite.
But maybe I misunderstood :-)

> > > 2.  Attempting to filter the multib list results in configuration error.
> > >     This might have been misguided, but it was something I tried:
> > >
> > >     Error: --with-multilib-list=armv6s-m not supported.
> > >
> > >     Error: --with-multilib-list=mthumb/march=armv6s-m/mfloat-abi=soft not supported
> >
> > I think only 2 values are supported: aprofile and rmprofile.
>
> It looks like this might require a custom t-* multilib in gcc/config/arm.
Or we could add -mpure-code to the rmprofile list.

> > > 3.  Attempting to configure a single architecture results in a build error.
> > >
> > >     --with-mode=thumb --with-arch=armv6s-m --with-float=soft
> > >
> > >     checking for suffix of object files... configure: error: in `/home/mirdan/gcc-obj/arm-none-eabi/arm/autofp/v5te/fpu/libgcc':
> > >     configure: error: cannot compute suffix of object files: cannot compile
> > >     See `config.log' for more details
> > >
> > >     conftest.c:9:10: fatal error: ac_nonexistent.h: No such file or directory
> > >         9 | #include <ac_nonexistent.h>
> > >           |          ^~~~~~~~~~~~~~~~~~
> > I never saw that error message, but I never build using --with-arch.
> > I do use --with-cpu though.
> >
> > > This has me wondering whether pure-code in libgcc is a real issue ...
> > > If there's a way to build libgcc with -mpure-code, please enlighten me.
> > I haven't done so yet. Maybe building the toolchain --with-cpu=cortex-m0
> > works?
>
> No luck with that.  Same error message as before:
>
> 4.  --with-mode=thumb --with-arch=armv6s-m --with-float=soft --with-cpu=cortex-m0
>
>     Switch "--with-arch" may not be used with switch "--with-cpu"
>
> 5.  Then: --with-mode=thumb --with-float=soft --with-cpu=cortex-m0
>
>     checking for suffix of object files... configure: error: in `/home/mirdan/gcc-obj/arm-none-eabi/arm/autofp/v5te/fpu/libgcc':
>     configure: error: cannot compute suffix of object files: cannot compile
>     See `config.log' for more details
>
>     cc1: error: -mpure-code only supports non-pic code on M-profile targets
Yes that's because default multilibs include targets incompatible with
-mpure-code

> 6.  Finally! --with-float=soft --with-cpu=cortex-m0 --disable-multilib
>
> Once you know this, and read the docs sideways, the previous errors are
> all probably "works as designed".  But, I can still grumble.
Yes, I think it's "as designed". I faced the "incompatible multilibs" issue
too some time ago. Hence testing is not easy.

> With libgcc compiled with -mpure-code, I can confirm that
> 'builtin-bitops-1.c' (the test for __clzsi2) passed with libgcc as-is.
>
> I then added the SHF_ARM_PURECODE flag to the libgcc assembly functions
> and re-ran the test.  Still passed.  I then added -mpure-code to
> RUNTESTFLAGS and re-ran the test.  Still passed.  readelf confirmed that
> the test program is compiling as expected [1]:
>
>     [ 2] .text             PROGBITS        0000800c 00800c 003314 00 AXy  0   0  4
>     Key to Flags:
>     W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
>     L (link order), O (extra OS processing required), G (group), T (TLS),
>     C (compressed), x (unknown), o (OS specific), E (exclude),
>     y (purecode), p (processor specific)
>
> It was only when I started inserting pure-code test directives into
> 'builtin-bitops-1.c' that 'make check' began to report errors.
>
>     /* { dg-do compile } */
>     ...
>     /* { dg-options "-mpure-code -mfp16-format=ieee" } */
>     /* { dg-final { scan-assembler-not "\\.(float|l\\?double|\d?byte|short|int|long|quad|word)\\s+\[^.\]" } } */
>
> However, for reasons [2] [3] [4] [5], this wasn't actually useful.  It's
> sufficient to say that there are many reasons that non-pure-code
> compatible functions exist in libgcc.

I filed a PR to better track this discussion:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98779

>
> Although I'm not sure how useful this will be in light of the previous
> findings, I did take the opportunity with a working compile process to
> modify the relevant assembly functions for -mpure-code compatibility.
> I can manually disassemble the library and verify correct compilation.
> I can manually run a non-pure-code builtin-bitops-1 with a pure-code
> library to verify correct execution.  But, I don't think your standard
> regression suite will be able to exercise the new paths.
Thanks for doing that. With the SHF_ARM_PURECODE flag you
added to clz, I think my simulator would catch problems.

> The patch is below; you can consider this as 34/33 in the series.
>
> Regards,
> Daniel
>
> [1] It's pretty clear that the section flags in libgcc have never really
>     mattered.  When the linker strings all of the used objects together,
>     the original sections disappear into a single output object. The
>     compiler controls those flags regardless of what libgcc does.)
Not sure what you mean? The linker creates two segments, one
with and one without the SHF_ARM_PURECODE flag.

> [2] The existing pure-code tests are compile-only and cover just the
>     disassembled 'main.o'.  There is no test of a complete executable
>     and there is no execution/simulation.
That's something I did manually: run the full gcc testsuite, forcing -mpure-code
in RUNTESTFLAGS. This way, all execution tests are compiled with -mpure-code,
and this is how I found several bugs.

> [3] While other parts of binutils may understand SHF_ARM_PURECODE, I
>     don't think the simulator checks section flags or throws exceptions.
Indeed, I know of no public simulator that honors this flag. I do have
one though.

> [4] builtin-bitops-1 modified this way will always fail due to the array
>     data definitions (longs, longlongs, etc).  GCC can't translate those
>     to instructions.  While the ".data" section would presumably be
>     readable, scan-assembler-not doesn't know the difference.
Sure, adding such scan-assembler-not is not suitable for any existing testcase.
That's why it is only in place for testcases dedicated to -mpure-code.

> [5] Even if the simulator were modified to throw exceptions, this will
>     continue to fail because _mainCRTStartup uses a literal pool.
Yes, in general, and that's why I mentioned problems with newlib
earlier in this thread.

However, the simulator I use only throws an exception for code executed with
SHF_ARM_PURECODE. Code in the "regular" code segment is not checked.
So this does not catch errors in hand-written assembly code using regular .text,
but it enables to run larger validations (such as the whole GCC testsuite)
without having to fix all of newlib.
Not perfect, as it left the issues in libgcc we are discussing, but it
helped me fix
several bugs in -mpure-code.

Thanks,

Christophe

>
> > Thanks,
> >
> > Christophe
> >
> > > > > > > > The 'clzs' and 'ctz' functions should never have problems.   -mpure-code
> > > > > > > > appears to be valid only when the 'movt' instruction is available, which
> > > > > > > > means that the 'clz' instruction will also be available, so no array loads.
> > > > > > > No, -mpure-code is also supported with v6m.
> > > > > > >
> > > > > > > > Is the -mpure-code state detectable as a preprocessor flag?  While
> > > > > > > No.
> > > > > > >
> > > > > > > > 'movw'/'movt' appears to be the canonical solution, I'm not sure it
> > > > > > > > should be the default just because a processor supports Thumb-2.
> > > > > > > >
> > > > > > > > Do users wanting to use -mpure-code recompile the toolchain to avoid
> > > > > > > > constant data in compiled C functions?  I don't think this is the
> > > > > > > > default for the typical toolchain scripts.
> > > > > > > No, users of -mpure-code do not recompile the toolchain.
> > > > > > >
> > > > > > > --snip --
> > > > >
> > > > > >
> > > >
> > >
> > > Thanks,
> > > Daniel
>
>     Add -mpure-code support to the CM0 functions.
>
>     gcc/libgcc/ChangeLog:
>     2021-01-16 Daniel Engel <gnu@danielengel.com>
>
>             Makefile.in (MPURE_CODE): New macro defines __PURE_CODE__.
>             (gcc_compile): Appended MPURE_CODE.
>             lib1funcs.S (FUNC_START_SECTION): Set flags for __PURE_CODE__.
>             clz2.S (__clzsi2): Added -mpure-code compatible instructions.
>             ctz2.S (__ctzsi2): Same.
>             popcnt.S (__popcountsi2, __popcountdi2): Same.
>
> diff --git a/libgcc/Makefile.in b/libgcc/Makefile.in
> index 2de57519734..cd6b5f9c1b0 100644
> --- a/libgcc/Makefile.in
> +++ b/libgcc/Makefile.in
> @@ -303,6 +303,9 @@ CRTSTUFF_CFLAGS = -O2 $(GCC_CFLAGS) $(INCLUDES) $(MULTILIB_CFLAGS) -g0 \
>  # Extra flags to use when compiling crt{begin,end}.o.
>  CRTSTUFF_T_CFLAGS =
>
> +# Pass the -mpure-code flag into assembly for conditional compilation.
> +MPURE_CODE = $(if $(findstring -mpure-code,$(CFLAGS)), -D__PURE_CODE__)
> +
>  MULTIDIR := $(shell $(CC) $(CFLAGS) -print-multi-directory)
>  MULTIOSDIR := $(shell $(CC) $(CFLAGS) -print-multi-os-directory)
>
> @@ -312,7 +315,7 @@ inst_slibdir = $(slibdir)$(MULTIOSSUBDIR)
>
>  gcc_compile_bare = $(CC) $(INTERNAL_CFLAGS)
>  compile_deps = -MT $@ -MD -MP -MF $(basename $@).dep
> -gcc_compile = $(gcc_compile_bare) -o $@ $(compile_deps)
> +gcc_compile = $(gcc_compile_bare) -o $@ $(compile_deps) $(MPURE_CODE)
>  gcc_s_compile = $(gcc_compile) -DSHARED
>
>  objects = $(filter %$(objext),$^)
> diff --git a/libgcc/config/arm/clz2.S b/libgcc/config/arm/clz2.S
> index a2de45ff651..97a44f5d187 100644
> --- a/libgcc/config/arm/clz2.S
> +++ b/libgcc/config/arm/clz2.S
> @@ -214,17 +214,40 @@ FUNC_ENTRY clzsi2
>       IT(sub,ne) r2,     #4
>
>      LLSYM(__clz2):
> +  #if defined(__PURE_CODE__) && __PURE_CODE__
> +        // Without access to table data, continue unrolling the loop.
> +        lsrs    r1,     r0,     #2
> +
> +      #ifdef __HAVE_FEATURE_IT
> +        do_it   ne,t
> +      #else
> +        beq     LLSYM(__clz1)
> +      #endif
> +
> +        // Out of 4 bits, the first '1' is somewhere in the highest 2,
> +        //  so the lower 2 bits are no longer interesting.
> +     IT(mov,ne) r0,     r1
> +     IT(sub,ne) r2,     #2
> +
> +    LLSYM(__clz1):
> +        // Convert remainder {0,1,2,3} to {0,1,2,2}.
> +        lsrs    r1,     r0,     #1
> +        bics    r0,     r1
> +
> +  #else /* !__PURE_CODE__ */
>          // Load the remainder by index
>          adr     r1,     LLSYM(__clz_remainder)
>          ldrb    r0,     [r1, r0]
>
> +  #endif /* !__PURE_CODE__ */
>    #endif /* !__OPTIMIZE_SIZE__ */
>
>          // Account for the remainder.
>          subs    r0,     r2,     r0
>          RET
>
> -  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
> +  #if !(defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__) && \
> +      !(defined(__PURE_CODE__) && __PURE_CODE__)
>          .align 2
>      LLSYM(__clz_remainder):
>          .byte 0,1,2,2,3,3,3,3,4,4,4,4,4,4,4,4
> diff --git a/libgcc/config/arm/ctz2.S b/libgcc/config/arm/ctz2.S
> index b9528a061a2..6a49d64f3a6 100644
> --- a/libgcc/config/arm/ctz2.S
> +++ b/libgcc/config/arm/ctz2.S
> @@ -209,11 +209,44 @@ FUNC_ENTRY ctzsi2
>       IT(sub,ne) r2,     #4
>
>      LLSYM(__ctz2):
> +  #if defined(__PURE_CODE__) && __PURE_CODE__
> +        // Without access to table data, continue unrolling the loop.
> +        lsls    r1,     r0,     #2
> +
> +      #ifdef __HAVE_FEATURE_IT
> +        do_it   ne, t
> +      #else
> +        beq     LLSYM(__ctz1)
> +      #endif
> +
> +        // Out of 4 bits, the first '1' is somewhere in the lowest 2,
> +        //  so the higher 2 bits are no longer interesting.
> +     IT(mov,ne) r0,     r1
> +     IT(sub,ne) r2,     #2
> +
> +    LLSYM(__ctz1):
> +        // Convert remainder {0,1,2,3} in $r0[31:30] to {0,2,1,2}.
> +        lsrs    r0,     #31
> +
> +      #ifdef __HAVE_FEATURE_IT
> +        do_it   cs, t
> +      #else
> +        bcc     LLSYM(__ctz_zero)
> +      #endif
> +
> +        // If bit[30] of the remainder is set, neither of these bits count
> +        //  towards the result.  Bit[31] must be cleared.
> +        // Otherwise, bit[31] becomes the final remainder.
> +     IT(sub,cs) r2,     #2
> +     IT(eor,cs) r0,     r0
> +
> +  #else /* !__PURE_CODE__ */
>          // Look up the remainder by index.
>          lsrs    r0,     #28
>          adr     r3,     LLSYM(__ctz_remainder)
>          ldrb    r0,     [r3, r0]
>
> +  #endif /* !__PURE_CODE__ */
>    #endif /* !__OPTIMIZE_SIZE__ */
>
>      LLSYM(__ctz_zero):
> @@ -221,8 +254,9 @@ FUNC_ENTRY ctzsi2
>          subs    r0,     r2,     r0
>          RET
>
> -  #if (!defined(__ARM_FEATURE_CLZ) || !__ARM_FEATURE_CLZ) && \
> -      (!defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__)
> +  #if !(defined(__ARM_FEATURE_CLZ) && __ARM_FEATURE_CLZ) && \
> +      !(defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__) && \
> +      !(defined(__PURE_CODE__) && __PURE_CODE__)
>          .align 2
>      LLSYM(__ctz_remainder):
>          .byte 0,4,3,4,2,4,3,4,1,4,3,4,2,4,3,4
> diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
> index 5148957144b..59b2370e160 100644
> --- a/libgcc/config/arm/lib1funcs.S
> +++ b/libgcc/config/arm/lib1funcs.S
> @@ -454,7 +454,12 @@ SYM (\name):
>     Use the *_START_SECTION macros for declarations that the linker should
>      place in a non-defailt section (e.g. ".rodata", ".text.subsection"). */
>  .macro FUNC_START_SECTION name section
> -       .section \section,"x"
> +#ifdef __PURE_CODE__
> +       /* SHF_ARM_PURECODE | SHF_ALLOC | SHF_EXECINSTR */
> +       .section \section,"0x20000006",%progbits
> +#else
> +       .section \section,"ax",%progbits
> +#endif
>         .align 0
>         FUNC_ENTRY \name
>  .endm
> diff --git a/libgcc/config/arm/popcnt.S b/libgcc/config/arm/popcnt.S
> index 51b1ed745ee..d6f65403b5d 100644
> --- a/libgcc/config/arm/popcnt.S
> +++ b/libgcc/config/arm/popcnt.S
> @@ -23,6 +23,29 @@
>     <http://www.gnu.org/licenses/>.  */
>
>
> +#if defined(L_popcountdi2) || defined(L_popcountsi2)
> +
> +.macro ldmask reg, temp, value
> +    #if defined(__PURE_CODE__) && (__PURE_CODE__)
> +      #ifdef NOT_ISA_TARGET_32BIT
> +        movs    \reg,   \value
> +        lsls    \temp,  \reg,   #8
> +        orrs    \reg,   \temp
> +        lsls    \temp,  \reg,   #16
> +        orrs    \reg,   \temp
> +      #else
> +        // Assumption: __PURE_CODE__ only support M-profile.
> +        movw    \reg    ((\value) * 0x101)
> +        movt    \reg    ((\value) * 0x101)
> +      #endif
> +    #else
> +        ldr     \reg,   =((\value) * 0x1010101)
> +    #endif
> +.endm
> +
> +#endif
> +
> +
>  #ifdef L_popcountdi2
>
>  // int __popcountdi2(int)
> @@ -49,7 +72,7 @@ FUNC_START_SECTION popcountdi2 .text.sorted.libgcc.popcountdi2
>
>    #else /* !__OPTIMIZE_SIZE__ */
>          // Load the one-bit alternating mask.
> -        ldr     r3,     =0x55555555
> +        ldmask  r3,     r2,     0x55
>
>          // Reduce the second word.
>          lsrs    r2,     r1,     #1
> @@ -62,7 +85,7 @@ FUNC_START_SECTION popcountdi2 .text.sorted.libgcc.popcountdi2
>          subs    r0,     r2
>
>          // Load the two-bit alternating mask.
> -        ldr     r3,     =0x33333333
> +        ldmask  r3,     r2,     0x33
>
>          // Reduce the second word.
>          lsrs    r2,     r1,     #2
> @@ -140,7 +163,7 @@ FUNC_ENTRY popcountsi2
>    #else /* !__OPTIMIZE_SIZE__ */
>
>          // Load the one-bit alternating mask.
> -        ldr     r3,     =0x55555555
> +        ldmask  r3,     r2,     0x55
>
>          // Reduce the word.
>          lsrs    r1,     r0,     #1
> @@ -148,7 +171,7 @@ FUNC_ENTRY popcountsi2
>          subs    r0,     r1
>
>          // Load the two-bit alternating mask.
> -        ldr     r3,     =0x33333333
> +        ldmask  r3,     r2,     0x33
>
>          // Reduce the word.
>          lsrs    r1,     r0,     #2
> @@ -158,7 +181,7 @@ FUNC_ENTRY popcountsi2
>          adds    r0,     r1
>
>          // Load the four-bit alternating mask.
> -        ldr     r3,     =0x0F0F0F0F
> +        ldmask  r3,     r2,     0x0F
>
>          // Reduce the word.
>          lsrs    r1,     r0,     #4

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2021-01-21 10:29                               ` Christophe Lyon
@ 2021-01-21 20:35                                 ` Daniel Engel
  2021-01-22 18:28                                   ` Christophe Lyon
  0 siblings, 1 reply; 26+ messages in thread
From: Daniel Engel @ 2021-01-21 20:35 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: Richard Earnshaw, gcc Patches

Hi Christophe,

On Thu, Jan 21, 2021, at 2:29 AM, Christophe Lyon wrote:
> On Sat, 16 Jan 2021 at 17:13, Daniel Engel <libgcc@danielengel.com> wrote:
> >
> > Hi Christophe,
> >
> > On Fri, Jan 15, 2021, at 4:30 AM, Christophe Lyon wrote:
> > > On Fri, 15 Jan 2021 at 12:39, Daniel Engel <libgcc@danielengel.com> wrote:
> > > >
> > > > Hi Christophe,
> > > >
> > > > On Mon, Jan 11, 2021, at 8:39 AM, Christophe Lyon wrote:
> > > > > On Mon, 11 Jan 2021 at 17:18, Daniel Engel <libgcc@danielengel.com> wrote:
> > > > > >
> > > > > > On Mon, Jan 11, 2021, at 8:07 AM, Christophe Lyon wrote:
> > > > > > > On Sat, 9 Jan 2021 at 14:09, Christophe Lyon <christophe.lyon@linaro.org> wrote:
> > > > > > > >
> > > > > > > > On Sat, 9 Jan 2021 at 13:27, Daniel Engel <libgcc@danielengel.com> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Jan 7, 2021, at 4:56 AM, Richard Earnshaw wrote:
> > > > > > > > > > On 07/01/2021 00:59, Daniel Engel wrote:
> > > > > > > > > > > --snip--
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote:
> > > > > > > > > > > --snip--
> > > > > > > > > > >
> > > > > > > > > > >> - finally, your popcount implementations have data in the code segment.
> > > > > > > > > > >>  That's going to cause problems when we have compilation options such as
> > > > > > > > > > >> -mpure-code.
> > > > > > > > > > >
> > > > > > > > > > > I am just following the precedent of existing lib1funcs (e.g. __clz2si).
> > > > > > > > > > > If this matters, you'll need to point in the right direction for the
> > > > > > > > > > > fix.  I'm not sure it does matter, since these functions are PIC anyway.
> > > > > > > > > >
> > > > > > > > > > That might be a bug in the clz implementations - Christophe: Any thoughts?
> > > > > > > > >
> > > > > > > > > __clzsi2() has test coverage in "gcc.c-torture/execute/builtin-bitops-1.c"
> > > > > > > > Thanks, I'll have a closer look at why I didn't see problems.
> > > > > > > >
> > > > > > >
> > > > > > > So, that's because the code goes to the .text section (as opposed to
> > > > > > > .text.noread)
> > > > > > > and does not have the PURECODE flag. The compiler takes care of this
> > > > > > > when generating code with -mpure-code.
> > > > > > > And the simulator does not complain because it only checks loads from
> > > > > > > the segment with the PURECODE flag set.
> > > > > > >
> > > > > > This is far out of my depth, but can something like:
> > > > > >
> > > > > > ifeq (,$(findstring __symbian__,$(shell $(gcc_compile_bare) -dM -E - </dev/null)))
> > > > > >
> > > > > > be adapted to:
> > > > > >
> > > > > > a) detect the state of the -mpure-code switch, and
> > > > > > b) pass that flag to the preprocessor?
> > > > > >
> > > > > > If so, I can probably fix both the target section and the data usage.
> > > > > > Just have to add a few instructions to finish unrolling the loop.
> > > > >
> > > > > I must confess I never checked libgcc's Makefile deeply before,
> > > > > but it looks like you can probably detect whether -mpure-code is
> > > > > part of $CFLAGS.
> > > > >
> > > > > However, it might be better to write pure-code-safe code
> > > > > unconditionally because the toolchain will probably not
> > > > > be rebuilt with -mpure-code as discussed before.
> > > > > Or that could mean adding a -mpure-code multilib....
> > > >
> > > > I have learned a few things since the last update.  I think I know how
> > > > to get -mpure-code out of CFLAGS and into a macro.  However, I have hit
> > > > something of a wall with testing.  I can't seem to compile any flavor of
> > > > libgcc with CFLAGS_FOR_TARGET="-mpure-code".
> > > >
> > > > 1.  Configuring --with-multilib-list=rmprofile results in build failure:
> > > >
> > > >     checking for suffix of object files... configure: error: in `/home/mirdan/gcc-obj/arm-none-eabi/libgcc':
> > > >     configure: error: cannot compute suffix of object files: cannot compile
> > > >     See `config.log' for more details
> > > >
> > > >    cc1: error: -mpure-code only supports non-pic code on M-profile targets
> > > >
> > >
> > > Yes, I did hit that wall too :-)
> > >
> > > Hence what we discussed earlier: the toolchain is not rebuilt with -mpure-code.
> > >
> > > Note that there are problems in newlib too, but users of -mpure-code seem
> > > to be able to work around that (eg. using their own startup code and no stdlib)
> >
> > Is there a current side project to solve the makefile problems?
> None that I know of.
> 
> 
> > I think I'm back to my original question: If libgcc can't be built
> > with -mpure-code, and users bypass it completely with -nostdlib, then
> > why this conversation about pure-code compatibility of __clzsi2() etc?
> I think Richard noticed this pre-existing problem as part of the review
> of your patches. I don't think I meant fixing this is a prerequisite.
> But maybe I misunderstood :-)

I might have misunderstood too then.  It was certainly a pre-existing
problem, but I took the comments to mean that I had to own it as part of
touching those functions.
 
> > > > 2.  Attempting to filter the multib list results in configuration error.
> > > >     This might have been misguided, but it was something I tried:
> > > >
> > > >     Error: --with-multilib-list=armv6s-m not supported.
> > > >
> > > >     Error: --with-multilib-list=mthumb/march=armv6s-m/mfloat-abi=soft not supported
> > >
> > > I think only 2 values are supported: aprofile and rmprofile.
> >
> > It looks like this might require a custom t-* multilib in gcc/config/arm.
> Or we could add -mpure-code to the rmprofile list.

I have no strong opinions here.  Are you proposing that the "m" versions
of libgcc be built with -mpure-code enabled by default, or are you
proposing a parallel set of multilibs?  -mpure-code by default would
have costs in both size and speed.

> > > > 3.  Attempting to configure a single architecture results in a build error.
> > > >
> > > >     --with-mode=thumb --with-arch=armv6s-m --with-float=soft
> > > >
> > > >     checking for suffix of object files... configure: error: in `/home/mirdan/gcc-obj/arm-none-eabi/arm/autofp/v5te/fpu/libgcc':
> > > >     configure: error: cannot compute suffix of object files: cannot compile
> > > >     See `config.log' for more details
> > > >
> > > >     conftest.c:9:10: fatal error: ac_nonexistent.h: No such file or directory
> > > >         9 | #include <ac_nonexistent.h>
> > > >           |          ^~~~~~~~~~~~~~~~~~
> > > I never saw that error message, but I never build using --with-arch.
> > > I do use --with-cpu though.
> > >
> > > > This has me wondering whether pure-code in libgcc is a real issue ...
> > > > If there's a way to build libgcc with -mpure-code, please enlighten me.
> > > I haven't done so yet. Maybe building the toolchain --with-cpu=cortex-m0
> > > works?
> >
> > No luck with that.  Same error message as before:
> >
> > 4.  --with-mode=thumb --with-arch=armv6s-m --with-float=soft --with-cpu=cortex-m0
> >
> >     Switch "--with-arch" may not be used with switch "--with-cpu"
> >
> > 5.  Then: --with-mode=thumb --with-float=soft --with-cpu=cortex-m0
> >
> >     checking for suffix of object files... configure: error: in `/home/mirdan/gcc-obj/arm-none-eabi/arm/autofp/v5te/fpu/libgcc':
> >     configure: error: cannot compute suffix of object files: cannot compile
> >     See `config.log' for more details
> >
> >     cc1: error: -mpure-code only supports non-pic code on M-profile targets
> Yes that's because default multilibs include targets incompatible with
> -mpure-code
> 
> > 6.  Finally! --with-float=soft --with-cpu=cortex-m0 --disable-multilib
> >
> > Once you know this, and read the docs sideways, the previous errors are
> > all probably "works as designed".  But, I can still grumble.
> Yes, I think it's "as designed". I faced the "incompatible multilibs" issue
> too some time ago. Hence testing is not easy.
> 
> > With libgcc compiled with -mpure-code, I can confirm that
> > 'builtin-bitops-1.c' (the test for __clzsi2) passed with libgcc as-is.
> >
> > I then added the SHF_ARM_PURECODE flag to the libgcc assembly functions
> > and re-ran the test.  Still passed.  I then added -mpure-code to
> > RUNTESTFLAGS and re-ran the test.  Still passed.  readelf confirmed that
> > the test program is compiling as expected [1]:
> >
> >     [ 2] .text             PROGBITS        0000800c 00800c 003314 00 AXy  0   0  4
> >     Key to Flags:
> >     W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
> >     L (link order), O (extra OS processing required), G (group), T (TLS),
> >     C (compressed), x (unknown), o (OS specific), E (exclude),
> >     y (purecode), p (processor specific)
> >
> > It was only when I started inserting pure-code test directives into
> > 'builtin-bitops-1.c' that 'make check' began to report errors.
> >
> >     /* { dg-do compile } */
> >     ...
> >     /* { dg-options "-mpure-code -mfp16-format=ieee" } */
> >     /* { dg-final { scan-assembler-not "\\.(float|l\\?double|\d?byte|short|int|long|quad|word)\\s+\[^.\]" } } */
> >
> > However, for reasons [2] [3] [4] [5], this wasn't actually useful.  It's
> > sufficient to say that there are many reasons that non-pure-code
> > compatible functions exist in libgcc.
> 
> I filed a PR to better track this discussion:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98779

Possibly worth noting that my patch series addresses __clzsi2, __ctzsi2,
__aeabi_ldivmod, __aeabi_uldivmod, and most of __gnu_float2h_internal as
long as CFLAGS_FOR_TARGET contains -mpure-code.

> 
> >
> > Although I'm not sure how useful this will be in light of the previous
> > findings, I did take the opportunity with a working compile process to
> > modify the relevant assembly functions for -mpure-code compatibility.
> > I can manually disassemble the library and verify correct compilation.
> > I can manually run a non-pure-code builtin-bitops-1 with a pure-code
> > library to verify correct execution.  But, I don't think your standard
> > regression suite will be able to exercise the new paths.
> Thanks for doing that. With the SHF_ARM_PURECODE flag you
> added to clz, I think my simulator would catch problems.

See my more detailed comments following.  Unless you're also using a
custom linker script, I would have expected any simulator capable of
catching errors to have already caught them.  Putting the
SHF_ARM_PURECODE flag on clz actually seems rather cosmetic.

> > The patch is below; you can consider this as 34/33 in the series.
> >
> > Regards,
> > Daniel
> >
> > [1] It's pretty clear that the section flags in libgcc have never really
> >     mattered.  When the linker strings all of the used objects together,
> >     the original sections disappear into a single output object. The
> >     compiler controls those flags regardless of what libgcc does.)
> Not sure what you mean? The linker creates two segments, one
> with and one without the SHF_ARM_PURECODE flag.

When libgcc is compiled "normally", individual objects in libgcc.a are
compiled _without_ SHF_ARM_PURECODE.  Note line 4 below, with flags
"AX" only (no "y").  

`readelf -S arm-none-eabi/thumb/v6-m/nofp/libgcc/libgcc.a`

    File: arm-none-eabi/thumb/v6-m/nofp/libgcc/libgcc.a(_clzsi2.o)
    There are 19 section headers, starting at offset 0x43c:

    Section Headers:
      [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
      [ 0]                   NULL            00000000 000000 000000 00      0   0  0
      [ 1] .text             PROGBITS        00000000 000034 000000 00  AX  0   0  2
      [ 2] .data             PROGBITS        00000000 000034 000000 00  WA  0   0  1
      [ 3] .bss              NOBITS          00000000 000034 000000 00  WA  0   0  1
  ==> [ 4] .text.sorted[...] PROGBITS        00000000 000034 000034 00  AX  0   0  4
      ...
    Key to Flags:
      W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
      L (link order), O (extra OS processing required), G (group), T (TLS),
      C (compressed), x (unknown), o (OS specific), E (exclude),
      y (purecode), p (processor specific)

When this "normal" libgcc is linked into an -mpure-code program
(e.g. with RUNTESTFLAGS), the linker script flattens all of the
sections together into a single output.  The relevant portion of the
linker script:

      .text           :
      {
        *(.text.unlikely .text.*_unlikely .text.unlikely.*)
        *(.text.exit .text.exit.*)
        *(.text.startup .text.startup.*)
        *(.text.hot .text.hot.*)
        *(SORT(.text.sorted.*)) // _clzsi2.o matches here
        *(.text .stub .text.* .gnu.linkonce.t.*) // main.o matches here
        /* .gnu.warning sections are handled specially by elf.em.  */
        *(.gnu.warning)
        *(.glue_7t) *(.glue_7) *(.vfp11_veneer) *(.v4_bx)
      }

I can't pretend to know how the linker merges conflicting flags from the
various input sections, but the final binary has the attributes
"AXy" as expected from the top level compile (line 2):

`readelf -Sl builtin-bitops-1.exe`

    There are 22 section headers, starting at offset 0x10934:

    Section Headers:
      [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
      [ 0]                   NULL            00000000 000000 000000 00      0   0  0
      [ 1] .init             PROGBITS        00008000 008000 00000c 00  AX  0   0  4
  ==> [ 2] .text             PROGBITS        0000800c 00800c 00455c 00 AXy  0   0  4
      [ 3] .fini             PROGBITS        0000c568 00c568 00000c 00  AX  0   0  4
      [ 4] .rodata           PROGBITS        0000c574 00c574 000050 00   A  0   0  4
      [ 5] .ARM.exidx        ARM_EXIDX       0000c5c4 00c5c4 000008 00  AL  2   0  4
      [ 6] .eh_frame         PROGBITS        0000c5cc 00c5cc 000124 00   A  0   0  4
      [ 7] .init_array       INIT_ARRAY      0001c6f0 00c6f0 000004 04  WA  0   0  4
      [ 8] .fini_array       FINI_ARRAY      0001c6f4 00c6f4 000004 04  WA  0   0  4
      [ 9] .data             PROGBITS        0001c6f8 00c6f8 000a30 00  WA  0   0  8
      [10] .bss              NOBITS          0001d128 00d128 000114 00  WA  0   0  4
      ...
    Key to Flags:
      W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
      L (link order), O (extra OS processing required), G (group), T (TLS),
      C (compressed), x (unknown), o (OS specific), E (exclude),
      y (purecode), p (processor specific)

    There are 3 program headers, starting at offset 52

    Program Headers:
      Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
      EXIDX          0x00c5c4 0x0000c5c4 0x0000c5c4 0x00008 0x00008 R   0x4
      LOAD           0x000000 0x00000000 0x00000000 0x0c6f0 0x0c6f0 R E 0x10000
      LOAD           0x00c6f0 0x0001c6f0 0x0001c6f0 0x00a38 0x00b4c RW  0x10000

     Section to Segment mapping:
      Segment Sections...
       00     .ARM.exidx
  ==>   01     .init .text .fini .rodata .ARM.exidx .eh_frame
       02     .init_array .fini_array .data .bss

A segment contains complete sections, not the other way around.  As you
can see above, the first LOAD segment contains the entire ".text" plus
some other sections.   Thus, SHF_ARM_PURECODE flag really appears
to apply to all of "text", even though the bits linked in from libgcc
weren't built or flagged this way.

> 
> > [2] The existing pure-code tests are compile-only and cover just the
> >     disassembled 'main.o'.  There is no test of a complete executable
> >     and there is no execution/simulation.
> That's something I did manually: run the full gcc testsuite, forcing -mpure-code
> in RUNTESTFLAGS. This way, all execution tests are compiled with -mpure-code,
> and this is how I found several bugs.
> 
> > [3] While other parts of binutils may understand SHF_ARM_PURECODE, I
> >     don't think the simulator checks section flags or throws exceptions.
> Indeed, I know of no public simulator that honors this flag. I do have
> one though.
> 
> > [4] builtin-bitops-1 modified this way will always fail due to the array
> >     data definitions (longs, longlongs, etc).  GCC can't translate those
> >     to instructions.  While the ".data" section would presumably be
> >     readable, scan-assembler-not doesn't know the difference.
> Sure, adding such scan-assembler-not is not suitable for any existing testcase.
> That's why it is only in place for testcases dedicated to -mpure-code.
> 
> > [5] Even if the simulator were modified to throw exceptions, this will
> >     continue to fail because _mainCRTStartup uses a literal pool.
> Yes, in general, and that's why I mentioned problems with newlib
> earlier in this thread.
> 
> However, the simulator I use only throws an exception for code executed with
> SHF_ARM_PURECODE. Code in the "regular" code segment is not checked.
> So this does not catch errors in hand-written assembly code using regular .text,
> but it enables to run larger validations (such as the whole GCC testsuite)
> without having to fix all of newlib.
> Not perfect, as it left the issues in libgcc we are discussing, but it
> helped me fix
> several bugs in -mpure-code.

Yet again I suspect that you have a custom linker script, or there's
some other major difference.  Using the public releases of binutils,
newlib, etc, my experiences just aren't lining up with yours.

> Thanks,
> 
> Christophe
> 
> >
> > > Thanks,
> > >
> > > Christophe
> > >
> > > --snip-- 

If the test server farm is free at some point, would you mind running
another set of regression tests on my v5 patch series?

Regards,
Daniel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2021-01-21 20:35                                 ` Daniel Engel
@ 2021-01-22 18:28                                   ` Christophe Lyon
  2021-01-25 17:48                                     ` Christophe Lyon
  0 siblings, 1 reply; 26+ messages in thread
From: Christophe Lyon @ 2021-01-22 18:28 UTC (permalink / raw)
  To: Daniel Engel; +Cc: Richard Earnshaw, gcc Patches

On Thu, 21 Jan 2021 at 21:35, Daniel Engel <libgcc@danielengel.com> wrote:
>
> Hi Christophe,
>
> On Thu, Jan 21, 2021, at 2:29 AM, Christophe Lyon wrote:
> > On Sat, 16 Jan 2021 at 17:13, Daniel Engel <libgcc@danielengel.com> wrote:
> > >
> > > Hi Christophe,
> > >
> > > On Fri, Jan 15, 2021, at 4:30 AM, Christophe Lyon wrote:
> > > > On Fri, 15 Jan 2021 at 12:39, Daniel Engel <libgcc@danielengel.com> wrote:
> > > > >
> > > > > Hi Christophe,
> > > > >
> > > > > On Mon, Jan 11, 2021, at 8:39 AM, Christophe Lyon wrote:
> > > > > > On Mon, 11 Jan 2021 at 17:18, Daniel Engel <libgcc@danielengel.com> wrote:
> > > > > > >
> > > > > > > On Mon, Jan 11, 2021, at 8:07 AM, Christophe Lyon wrote:
> > > > > > > > On Sat, 9 Jan 2021 at 14:09, Christophe Lyon <christophe.lyon@linaro.org> wrote:
> > > > > > > > >
> > > > > > > > > On Sat, 9 Jan 2021 at 13:27, Daniel Engel <libgcc@danielengel.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Thu, Jan 7, 2021, at 4:56 AM, Richard Earnshaw wrote:
> > > > > > > > > > > On 07/01/2021 00:59, Daniel Engel wrote:
> > > > > > > > > > > > --snip--
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote:
> > > > > > > > > > > > --snip--
> > > > > > > > > > > >
> > > > > > > > > > > >> - finally, your popcount implementations have data in the code segment.
> > > > > > > > > > > >>  That's going to cause problems when we have compilation options such as
> > > > > > > > > > > >> -mpure-code.
> > > > > > > > > > > >
> > > > > > > > > > > > I am just following the precedent of existing lib1funcs (e.g. __clz2si).
> > > > > > > > > > > > If this matters, you'll need to point in the right direction for the
> > > > > > > > > > > > fix.  I'm not sure it does matter, since these functions are PIC anyway.
> > > > > > > > > > >
> > > > > > > > > > > That might be a bug in the clz implementations - Christophe: Any thoughts?
> > > > > > > > > >
> > > > > > > > > > __clzsi2() has test coverage in "gcc.c-torture/execute/builtin-bitops-1.c"
> > > > > > > > > Thanks, I'll have a closer look at why I didn't see problems.
> > > > > > > > >
> > > > > > > >
> > > > > > > > So, that's because the code goes to the .text section (as opposed to
> > > > > > > > .text.noread)
> > > > > > > > and does not have the PURECODE flag. The compiler takes care of this
> > > > > > > > when generating code with -mpure-code.
> > > > > > > > And the simulator does not complain because it only checks loads from
> > > > > > > > the segment with the PURECODE flag set.
> > > > > > > >
> > > > > > > This is far out of my depth, but can something like:
> > > > > > >
> > > > > > > ifeq (,$(findstring __symbian__,$(shell $(gcc_compile_bare) -dM -E - </dev/null)))
> > > > > > >
> > > > > > > be adapted to:
> > > > > > >
> > > > > > > a) detect the state of the -mpure-code switch, and
> > > > > > > b) pass that flag to the preprocessor?
> > > > > > >
> > > > > > > If so, I can probably fix both the target section and the data usage.
> > > > > > > Just have to add a few instructions to finish unrolling the loop.
> > > > > >
> > > > > > I must confess I never checked libgcc's Makefile deeply before,
> > > > > > but it looks like you can probably detect whether -mpure-code is
> > > > > > part of $CFLAGS.
> > > > > >
> > > > > > However, it might be better to write pure-code-safe code
> > > > > > unconditionally because the toolchain will probably not
> > > > > > be rebuilt with -mpure-code as discussed before.
> > > > > > Or that could mean adding a -mpure-code multilib....
> > > > >
> > > > > I have learned a few things since the last update.  I think I know how
> > > > > to get -mpure-code out of CFLAGS and into a macro.  However, I have hit
> > > > > something of a wall with testing.  I can't seem to compile any flavor of
> > > > > libgcc with CFLAGS_FOR_TARGET="-mpure-code".
> > > > >
> > > > > 1.  Configuring --with-multilib-list=rmprofile results in build failure:
> > > > >
> > > > >     checking for suffix of object files... configure: error: in `/home/mirdan/gcc-obj/arm-none-eabi/libgcc':
> > > > >     configure: error: cannot compute suffix of object files: cannot compile
> > > > >     See `config.log' for more details
> > > > >
> > > > >    cc1: error: -mpure-code only supports non-pic code on M-profile targets
> > > > >
> > > >
> > > > Yes, I did hit that wall too :-)
> > > >
> > > > Hence what we discussed earlier: the toolchain is not rebuilt with -mpure-code.
> > > >
> > > > Note that there are problems in newlib too, but users of -mpure-code seem
> > > > to be able to work around that (eg. using their own startup code and no stdlib)
> > >
> > > Is there a current side project to solve the makefile problems?
> > None that I know of.
> >
> >
> > > I think I'm back to my original question: If libgcc can't be built
> > > with -mpure-code, and users bypass it completely with -nostdlib, then
> > > why this conversation about pure-code compatibility of __clzsi2() etc?
> > I think Richard noticed this pre-existing problem as part of the review
> > of your patches. I don't think I meant fixing this is a prerequisite.
> > But maybe I misunderstood :-)
>
> I might have misunderstood too then.  It was certainly a pre-existing
> problem, but I took the comments to mean that I had to own it as part of
> touching those functions.
>
> > > > > 2.  Attempting to filter the multib list results in configuration error.
> > > > >     This might have been misguided, but it was something I tried:
> > > > >
> > > > >     Error: --with-multilib-list=armv6s-m not supported.
> > > > >
> > > > >     Error: --with-multilib-list=mthumb/march=armv6s-m/mfloat-abi=soft not supported
> > > >
> > > > I think only 2 values are supported: aprofile and rmprofile.
> > >
> > > It looks like this might require a custom t-* multilib in gcc/config/arm.
> > Or we could add -mpure-code to the rmprofile list.
>
> I have no strong opinions here.  Are you proposing that the "m" versions
> of libgcc be built with -mpure-code enabled by default, or are you
> proposing a parallel set of multilibs?  -mpure-code by default would
> have costs in both size and speed.
I'm not sure how large the penalty is for thumb-2 cores?
Maybe it's acceptable to build thumb-2 with -mpure-code by default,
but probably not for cortex-m0.

> > > > > 3.  Attempting to configure a single architecture results in a build error.
> > > > >
> > > > >     --with-mode=thumb --with-arch=armv6s-m --with-float=soft
> > > > >
> > > > >     checking for suffix of object files... configure: error: in `/home/mirdan/gcc-obj/arm-none-eabi/arm/autofp/v5te/fpu/libgcc':
> > > > >     configure: error: cannot compute suffix of object files: cannot compile
> > > > >     See `config.log' for more details
> > > > >
> > > > >     conftest.c:9:10: fatal error: ac_nonexistent.h: No such file or directory
> > > > >         9 | #include <ac_nonexistent.h>
> > > > >           |          ^~~~~~~~~~~~~~~~~~
> > > > I never saw that error message, but I never build using --with-arch.
> > > > I do use --with-cpu though.
> > > >
> > > > > This has me wondering whether pure-code in libgcc is a real issue ...
> > > > > If there's a way to build libgcc with -mpure-code, please enlighten me.
> > > > I haven't done so yet. Maybe building the toolchain --with-cpu=cortex-m0
> > > > works?
> > >
> > > No luck with that.  Same error message as before:
> > >
> > > 4.  --with-mode=thumb --with-arch=armv6s-m --with-float=soft --with-cpu=cortex-m0
> > >
> > >     Switch "--with-arch" may not be used with switch "--with-cpu"
> > >
> > > 5.  Then: --with-mode=thumb --with-float=soft --with-cpu=cortex-m0
> > >
> > >     checking for suffix of object files... configure: error: in `/home/mirdan/gcc-obj/arm-none-eabi/arm/autofp/v5te/fpu/libgcc':
> > >     configure: error: cannot compute suffix of object files: cannot compile
> > >     See `config.log' for more details
> > >
> > >     cc1: error: -mpure-code only supports non-pic code on M-profile targets
> > Yes that's because default multilibs include targets incompatible with
> > -mpure-code
> >
> > > 6.  Finally! --with-float=soft --with-cpu=cortex-m0 --disable-multilib
> > >
> > > Once you know this, and read the docs sideways, the previous errors are
> > > all probably "works as designed".  But, I can still grumble.
> > Yes, I think it's "as designed". I faced the "incompatible multilibs" issue
> > too some time ago. Hence testing is not easy.
> >
> > > With libgcc compiled with -mpure-code, I can confirm that
> > > 'builtin-bitops-1.c' (the test for __clzsi2) passed with libgcc as-is.
> > >
> > > I then added the SHF_ARM_PURECODE flag to the libgcc assembly functions
> > > and re-ran the test.  Still passed.  I then added -mpure-code to
> > > RUNTESTFLAGS and re-ran the test.  Still passed.  readelf confirmed that
> > > the test program is compiling as expected [1]:
> > >
> > >     [ 2] .text             PROGBITS        0000800c 00800c 003314 00 AXy  0   0  4
> > >     Key to Flags:
> > >     W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
> > >     L (link order), O (extra OS processing required), G (group), T (TLS),
> > >     C (compressed), x (unknown), o (OS specific), E (exclude),
> > >     y (purecode), p (processor specific)
> > >
> > > It was only when I started inserting pure-code test directives into
> > > 'builtin-bitops-1.c' that 'make check' began to report errors.
> > >
> > >     /* { dg-do compile } */
> > >     ...
> > >     /* { dg-options "-mpure-code -mfp16-format=ieee" } */
> > >     /* { dg-final { scan-assembler-not "\\.(float|l\\?double|\d?byte|short|int|long|quad|word)\\s+\[^.\]" } } */
> > >
> > > However, for reasons [2] [3] [4] [5], this wasn't actually useful.  It's
> > > sufficient to say that there are many reasons that non-pure-code
> > > compatible functions exist in libgcc.
> >
> > I filed a PR to better track this discussion:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98779
>
> Possibly worth noting that my patch series addresses __clzsi2, __ctzsi2,
> __aeabi_ldivmod, __aeabi_uldivmod, and most of __gnu_float2h_internal as
> long as CFLAGS_FOR_TARGET contains -mpure-code.
>
Great.

> >
> > >
> > > Although I'm not sure how useful this will be in light of the previous
> > > findings, I did take the opportunity with a working compile process to
> > > modify the relevant assembly functions for -mpure-code compatibility.
> > > I can manually disassemble the library and verify correct compilation.
> > > I can manually run a non-pure-code builtin-bitops-1 with a pure-code
> > > library to verify correct execution.  But, I don't think your standard
> > > regression suite will be able to exercise the new paths.
> > Thanks for doing that. With the SHF_ARM_PURECODE flag you
> > added to clz, I think my simulator would catch problems.
>
> See my more detailed comments following.  Unless you're also using a
> custom linker script, I would have expected any simulator capable of
> catching errors to have already caught them.  Putting the
> SHF_ARM_PURECODE flag on clz actually seems rather cosmetic.
>

Yes, I have a custom linker script with
  .text.noread      :
    {
        INPUT_SECTION_FLAGS (SHF_ARM_PURECODE) *(.text*)
    } > purecode_memory

> > > The patch is below; you can consider this as 34/33 in the series.
> > >
> > > Regards,
> > > Daniel
> > >
> > > [1] It's pretty clear that the section flags in libgcc have never really
> > >     mattered.  When the linker strings all of the used objects together,
> > >     the original sections disappear into a single output object. The
> > >     compiler controls those flags regardless of what libgcc does.)
> > Not sure what you mean? The linker creates two segments, one
> > with and one without the SHF_ARM_PURECODE flag.
>
> When libgcc is compiled "normally", individual objects in libgcc.a are
> compiled _without_ SHF_ARM_PURECODE.  Note line 4 below, with flags
> "AX" only (no "y").
>
> `readelf -S arm-none-eabi/thumb/v6-m/nofp/libgcc/libgcc.a`
>
>     File: arm-none-eabi/thumb/v6-m/nofp/libgcc/libgcc.a(_clzsi2.o)
>     There are 19 section headers, starting at offset 0x43c:
>
>     Section Headers:
>       [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
>       [ 0]                   NULL            00000000 000000 000000 00      0   0  0
>       [ 1] .text             PROGBITS        00000000 000034 000000 00  AX  0   0  2
>       [ 2] .data             PROGBITS        00000000 000034 000000 00  WA  0   0  1
>       [ 3] .bss              NOBITS          00000000 000034 000000 00  WA  0   0  1
>   ==> [ 4] .text.sorted[...] PROGBITS        00000000 000034 000034 00  AX  0   0  4
>       ...
>     Key to Flags:
>       W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
>       L (link order), O (extra OS processing required), G (group), T (TLS),
>       C (compressed), x (unknown), o (OS specific), E (exclude),
>       y (purecode), p (processor specific)
>
> When this "normal" libgcc is linked into an -mpure-code program
> (e.g. with RUNTESTFLAGS), the linker script flattens all of the
> sections together into a single output.  The relevant portion of the
> linker script:
>
>       .text           :
>       {
>         *(.text.unlikely .text.*_unlikely .text.unlikely.*)
>         *(.text.exit .text.exit.*)
>         *(.text.startup .text.startup.*)
>         *(.text.hot .text.hot.*)
>         *(SORT(.text.sorted.*)) // _clzsi2.o matches here
>         *(.text .stub .text.* .gnu.linkonce.t.*) // main.o matches here
>         /* .gnu.warning sections are handled specially by elf.em.  */
>         *(.gnu.warning)
>         *(.glue_7t) *(.glue_7) *(.vfp11_veneer) *(.v4_bx)
>       }
>
> I can't pretend to know how the linker merges conflicting flags from the
> various input sections, but the final binary has the attributes
> "AXy" as expected from the top level compile (line 2):
>
> `readelf -Sl builtin-bitops-1.exe`
>
>     There are 22 section headers, starting at offset 0x10934:
>
>     Section Headers:
>       [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
>       [ 0]                   NULL            00000000 000000 000000 00      0   0  0
>       [ 1] .init             PROGBITS        00008000 008000 00000c 00  AX  0   0  4
>   ==> [ 2] .text             PROGBITS        0000800c 00800c 00455c 00 AXy  0   0  4
>       [ 3] .fini             PROGBITS        0000c568 00c568 00000c 00  AX  0   0  4
>       [ 4] .rodata           PROGBITS        0000c574 00c574 000050 00   A  0   0  4
>       [ 5] .ARM.exidx        ARM_EXIDX       0000c5c4 00c5c4 000008 00  AL  2   0  4
>       [ 6] .eh_frame         PROGBITS        0000c5cc 00c5cc 000124 00   A  0   0  4
>       [ 7] .init_array       INIT_ARRAY      0001c6f0 00c6f0 000004 04  WA  0   0  4
>       [ 8] .fini_array       FINI_ARRAY      0001c6f4 00c6f4 000004 04  WA  0   0  4
>       [ 9] .data             PROGBITS        0001c6f8 00c6f8 000a30 00  WA  0   0  8
>       [10] .bss              NOBITS          0001d128 00d128 000114 00  WA  0   0  4
>       ...
>     Key to Flags:
>       W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
>       L (link order), O (extra OS processing required), G (group), T (TLS),
>       C (compressed), x (unknown), o (OS specific), E (exclude),
>       y (purecode), p (processor specific)
>
>     There are 3 program headers, starting at offset 52
>
>     Program Headers:
>       Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
>       EXIDX          0x00c5c4 0x0000c5c4 0x0000c5c4 0x00008 0x00008 R   0x4
>       LOAD           0x000000 0x00000000 0x00000000 0x0c6f0 0x0c6f0 R E 0x10000
>       LOAD           0x00c6f0 0x0001c6f0 0x0001c6f0 0x00a38 0x00b4c RW  0x10000
>
>      Section to Segment mapping:
>       Segment Sections...
>        00     .ARM.exidx
>   ==>   01     .init .text .fini .rodata .ARM.exidx .eh_frame
>        02     .init_array .fini_array .data .bss
>
> A segment contains complete sections, not the other way around.  As you
> can see above, the first LOAD segment contains the entire ".text" plus
> some other sections.   Thus, SHF_ARM_PURECODE flag really appears
> to apply to all of "text", even though the bits linked in from libgcc
> weren't built or flagged this way.
>
> >
> > > [2] The existing pure-code tests are compile-only and cover just the
> > >     disassembled 'main.o'.  There is no test of a complete executable
> > >     and there is no execution/simulation.
> > That's something I did manually: run the full gcc testsuite, forcing -mpure-code
> > in RUNTESTFLAGS. This way, all execution tests are compiled with -mpure-code,
> > and this is how I found several bugs.
> >
> > > [3] While other parts of binutils may understand SHF_ARM_PURECODE, I
> > >     don't think the simulator checks section flags or throws exceptions.
> > Indeed, I know of no public simulator that honors this flag. I do have
> > one though.
> >
> > > [4] builtin-bitops-1 modified this way will always fail due to the array
> > >     data definitions (longs, longlongs, etc).  GCC can't translate those
> > >     to instructions.  While the ".data" section would presumably be
> > >     readable, scan-assembler-not doesn't know the difference.
> > Sure, adding such scan-assembler-not is not suitable for any existing testcase.
> > That's why it is only in place for testcases dedicated to -mpure-code.
> >
> > > [5] Even if the simulator were modified to throw exceptions, this will
> > >     continue to fail because _mainCRTStartup uses a literal pool.
> > Yes, in general, and that's why I mentioned problems with newlib
> > earlier in this thread.
> >
> > However, the simulator I use only throws an exception for code executed with
> > SHF_ARM_PURECODE. Code in the "regular" code segment is not checked.
> > So this does not catch errors in hand-written assembly code using regular .text,
> > but it enables to run larger validations (such as the whole GCC testsuite)
> > without having to fix all of newlib.
> > Not perfect, as it left the issues in libgcc we are discussing, but it
> > helped me fix
> > several bugs in -mpure-code.
>
> Yet again I suspect that you have a custom linker script, or there's
> some other major difference.  Using the public releases of binutils,
> newlib, etc, my experiences just aren't lining up with yours.

Yep, the linker script makes the difference.

>
> > Thanks,
> >
> > Christophe
> >
> > >
> > > > Thanks,
> > > >
> > > > Christophe
> > > >
> > > > --snip--
>
> If the test server farm is free at some point, would you mind running
> another set of regression tests on my v5 patch series?

Sure. Given the number of sub-patches, can you send it to me as a
single patch file
(git format) that I can directly apply to GCC trunk?
My mailer does not want to help with saving each patch as a proper
patch file :-(

Thanks

Christophe

>
> Regards,
> Daniel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2021-01-22 18:28                                   ` Christophe Lyon
@ 2021-01-25 17:48                                     ` Christophe Lyon
  2021-01-25 23:36                                       ` Daniel Engel
  0 siblings, 1 reply; 26+ messages in thread
From: Christophe Lyon @ 2021-01-25 17:48 UTC (permalink / raw)
  To: Daniel Engel; +Cc: Richard Earnshaw, gcc Patches

On Fri, 22 Jan 2021 at 19:28, Christophe Lyon
<christophe.lyon@linaro.org> wrote:
>
> On Thu, 21 Jan 2021 at 21:35, Daniel Engel <libgcc@danielengel.com> wrote:
> >
> > Hi Christophe,
> >
> > On Thu, Jan 21, 2021, at 2:29 AM, Christophe Lyon wrote:
> > > On Sat, 16 Jan 2021 at 17:13, Daniel Engel <libgcc@danielengel.com> wrote:
> > > >
> > > > Hi Christophe,
> > > >
> > > > On Fri, Jan 15, 2021, at 4:30 AM, Christophe Lyon wrote:
> > > > > On Fri, 15 Jan 2021 at 12:39, Daniel Engel <libgcc@danielengel.com> wrote:
> > > > > >
> > > > > > Hi Christophe,
> > > > > >
> > > > > > On Mon, Jan 11, 2021, at 8:39 AM, Christophe Lyon wrote:
> > > > > > > On Mon, 11 Jan 2021 at 17:18, Daniel Engel <libgcc@danielengel.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, Jan 11, 2021, at 8:07 AM, Christophe Lyon wrote:
> > > > > > > > > On Sat, 9 Jan 2021 at 14:09, Christophe Lyon <christophe.lyon@linaro.org> wrote:
> > > > > > > > > >
> > > > > > > > > > On Sat, 9 Jan 2021 at 13:27, Daniel Engel <libgcc@danielengel.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Jan 7, 2021, at 4:56 AM, Richard Earnshaw wrote:
> > > > > > > > > > > > On 07/01/2021 00:59, Daniel Engel wrote:
> > > > > > > > > > > > > --snip--
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote:
> > > > > > > > > > > > > --snip--
> > > > > > > > > > > > >
> > > > > > > > > > > > >> - finally, your popcount implementations have data in the code segment.
> > > > > > > > > > > > >>  That's going to cause problems when we have compilation options such as
> > > > > > > > > > > > >> -mpure-code.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I am just following the precedent of existing lib1funcs (e.g. __clz2si).
> > > > > > > > > > > > > If this matters, you'll need to point in the right direction for the
> > > > > > > > > > > > > fix.  I'm not sure it does matter, since these functions are PIC anyway.
> > > > > > > > > > > >
> > > > > > > > > > > > That might be a bug in the clz implementations - Christophe: Any thoughts?
> > > > > > > > > > >
> > > > > > > > > > > __clzsi2() has test coverage in "gcc.c-torture/execute/builtin-bitops-1.c"
> > > > > > > > > > Thanks, I'll have a closer look at why I didn't see problems.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > So, that's because the code goes to the .text section (as opposed to
> > > > > > > > > .text.noread)
> > > > > > > > > and does not have the PURECODE flag. The compiler takes care of this
> > > > > > > > > when generating code with -mpure-code.
> > > > > > > > > And the simulator does not complain because it only checks loads from
> > > > > > > > > the segment with the PURECODE flag set.
> > > > > > > > >
> > > > > > > > This is far out of my depth, but can something like:
> > > > > > > >
> > > > > > > > ifeq (,$(findstring __symbian__,$(shell $(gcc_compile_bare) -dM -E - </dev/null)))
> > > > > > > >
> > > > > > > > be adapted to:
> > > > > > > >
> > > > > > > > a) detect the state of the -mpure-code switch, and
> > > > > > > > b) pass that flag to the preprocessor?
> > > > > > > >
> > > > > > > > If so, I can probably fix both the target section and the data usage.
> > > > > > > > Just have to add a few instructions to finish unrolling the loop.
> > > > > > >
> > > > > > > I must confess I never checked libgcc's Makefile deeply before,
> > > > > > > but it looks like you can probably detect whether -mpure-code is
> > > > > > > part of $CFLAGS.
> > > > > > >
> > > > > > > However, it might be better to write pure-code-safe code
> > > > > > > unconditionally because the toolchain will probably not
> > > > > > > be rebuilt with -mpure-code as discussed before.
> > > > > > > Or that could mean adding a -mpure-code multilib....
> > > > > >
> > > > > > I have learned a few things since the last update.  I think I know how
> > > > > > to get -mpure-code out of CFLAGS and into a macro.  However, I have hit
> > > > > > something of a wall with testing.  I can't seem to compile any flavor of
> > > > > > libgcc with CFLAGS_FOR_TARGET="-mpure-code".
> > > > > >
> > > > > > 1.  Configuring --with-multilib-list=rmprofile results in build failure:
> > > > > >
> > > > > >     checking for suffix of object files... configure: error: in `/home/mirdan/gcc-obj/arm-none-eabi/libgcc':
> > > > > >     configure: error: cannot compute suffix of object files: cannot compile
> > > > > >     See `config.log' for more details
> > > > > >
> > > > > >    cc1: error: -mpure-code only supports non-pic code on M-profile targets
> > > > > >
> > > > >
> > > > > Yes, I did hit that wall too :-)
> > > > >
> > > > > Hence what we discussed earlier: the toolchain is not rebuilt with -mpure-code.
> > > > >
> > > > > Note that there are problems in newlib too, but users of -mpure-code seem
> > > > > to be able to work around that (eg. using their own startup code and no stdlib)
> > > >
> > > > Is there a current side project to solve the makefile problems?
> > > None that I know of.
> > >
> > >
> > > > I think I'm back to my original question: If libgcc can't be built
> > > > with -mpure-code, and users bypass it completely with -nostdlib, then
> > > > why this conversation about pure-code compatibility of __clzsi2() etc?
> > > I think Richard noticed this pre-existing problem as part of the review
> > > of your patches. I don't think I meant fixing this is a prerequisite.
> > > But maybe I misunderstood :-)
> >
> > I might have misunderstood too then.  It was certainly a pre-existing
> > problem, but I took the comments to mean that I had to own it as part of
> > touching those functions.
> >
> > > > > > 2.  Attempting to filter the multib list results in configuration error.
> > > > > >     This might have been misguided, but it was something I tried:
> > > > > >
> > > > > >     Error: --with-multilib-list=armv6s-m not supported.
> > > > > >
> > > > > >     Error: --with-multilib-list=mthumb/march=armv6s-m/mfloat-abi=soft not supported
> > > > >
> > > > > I think only 2 values are supported: aprofile and rmprofile.
> > > >
> > > > It looks like this might require a custom t-* multilib in gcc/config/arm.
> > > Or we could add -mpure-code to the rmprofile list.
> >
> > I have no strong opinions here.  Are you proposing that the "m" versions
> > of libgcc be built with -mpure-code enabled by default, or are you
> > proposing a parallel set of multilibs?  -mpure-code by default would
> > have costs in both size and speed.
> I'm not sure how large the penalty is for thumb-2 cores?
> Maybe it's acceptable to build thumb-2 with -mpure-code by default,
> but probably not for cortex-m0.
>
> > > > > > 3.  Attempting to configure a single architecture results in a build error.
> > > > > >
> > > > > >     --with-mode=thumb --with-arch=armv6s-m --with-float=soft
> > > > > >
> > > > > >     checking for suffix of object files... configure: error: in `/home/mirdan/gcc-obj/arm-none-eabi/arm/autofp/v5te/fpu/libgcc':
> > > > > >     configure: error: cannot compute suffix of object files: cannot compile
> > > > > >     See `config.log' for more details
> > > > > >
> > > > > >     conftest.c:9:10: fatal error: ac_nonexistent.h: No such file or directory
> > > > > >         9 | #include <ac_nonexistent.h>
> > > > > >           |          ^~~~~~~~~~~~~~~~~~
> > > > > I never saw that error message, but I never build using --with-arch.
> > > > > I do use --with-cpu though.
> > > > >
> > > > > > This has me wondering whether pure-code in libgcc is a real issue ...
> > > > > > If there's a way to build libgcc with -mpure-code, please enlighten me.
> > > > > I haven't done so yet. Maybe building the toolchain --with-cpu=cortex-m0
> > > > > works?
> > > >
> > > > No luck with that.  Same error message as before:
> > > >
> > > > 4.  --with-mode=thumb --with-arch=armv6s-m --with-float=soft --with-cpu=cortex-m0
> > > >
> > > >     Switch "--with-arch" may not be used with switch "--with-cpu"
> > > >
> > > > 5.  Then: --with-mode=thumb --with-float=soft --with-cpu=cortex-m0
> > > >
> > > >     checking for suffix of object files... configure: error: in `/home/mirdan/gcc-obj/arm-none-eabi/arm/autofp/v5te/fpu/libgcc':
> > > >     configure: error: cannot compute suffix of object files: cannot compile
> > > >     See `config.log' for more details
> > > >
> > > >     cc1: error: -mpure-code only supports non-pic code on M-profile targets
> > > Yes that's because default multilibs include targets incompatible with
> > > -mpure-code
> > >
> > > > 6.  Finally! --with-float=soft --with-cpu=cortex-m0 --disable-multilib
> > > >
> > > > Once you know this, and read the docs sideways, the previous errors are
> > > > all probably "works as designed".  But, I can still grumble.
> > > Yes, I think it's "as designed". I faced the "incompatible multilibs" issue
> > > too some time ago. Hence testing is not easy.
> > >
> > > > With libgcc compiled with -mpure-code, I can confirm that
> > > > 'builtin-bitops-1.c' (the test for __clzsi2) passed with libgcc as-is.
> > > >
> > > > I then added the SHF_ARM_PURECODE flag to the libgcc assembly functions
> > > > and re-ran the test.  Still passed.  I then added -mpure-code to
> > > > RUNTESTFLAGS and re-ran the test.  Still passed.  readelf confirmed that
> > > > the test program is compiling as expected [1]:
> > > >
> > > >     [ 2] .text             PROGBITS        0000800c 00800c 003314 00 AXy  0   0  4
> > > >     Key to Flags:
> > > >     W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
> > > >     L (link order), O (extra OS processing required), G (group), T (TLS),
> > > >     C (compressed), x (unknown), o (OS specific), E (exclude),
> > > >     y (purecode), p (processor specific)
> > > >
> > > > It was only when I started inserting pure-code test directives into
> > > > 'builtin-bitops-1.c' that 'make check' began to report errors.
> > > >
> > > >     /* { dg-do compile } */
> > > >     ...
> > > >     /* { dg-options "-mpure-code -mfp16-format=ieee" } */
> > > >     /* { dg-final { scan-assembler-not "\\.(float|l\\?double|\d?byte|short|int|long|quad|word)\\s+\[^.\]" } } */
> > > >
> > > > However, for reasons [2] [3] [4] [5], this wasn't actually useful.  It's
> > > > sufficient to say that there are many reasons that non-pure-code
> > > > compatible functions exist in libgcc.
> > >
> > > I filed a PR to better track this discussion:
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98779
> >
> > Possibly worth noting that my patch series addresses __clzsi2, __ctzsi2,
> > __aeabi_ldivmod, __aeabi_uldivmod, and most of __gnu_float2h_internal as
> > long as CFLAGS_FOR_TARGET contains -mpure-code.
> >
> Great.
>
> > >
> > > >
> > > > Although I'm not sure how useful this will be in light of the previous
> > > > findings, I did take the opportunity with a working compile process to
> > > > modify the relevant assembly functions for -mpure-code compatibility.
> > > > I can manually disassemble the library and verify correct compilation.
> > > > I can manually run a non-pure-code builtin-bitops-1 with a pure-code
> > > > library to verify correct execution.  But, I don't think your standard
> > > > regression suite will be able to exercise the new paths.
> > > Thanks for doing that. With the SHF_ARM_PURECODE flag you
> > > added to clz, I think my simulator would catch problems.
> >
> > See my more detailed comments following.  Unless you're also using a
> > custom linker script, I would have expected any simulator capable of
> > catching errors to have already caught them.  Putting the
> > SHF_ARM_PURECODE flag on clz actually seems rather cosmetic.
> >
>
> Yes, I have a custom linker script with
>   .text.noread      :
>     {
>         INPUT_SECTION_FLAGS (SHF_ARM_PURECODE) *(.text*)
>     } > purecode_memory
>
> > > > The patch is below; you can consider this as 34/33 in the series.
> > > >
> > > > Regards,
> > > > Daniel
> > > >
> > > > [1] It's pretty clear that the section flags in libgcc have never really
> > > >     mattered.  When the linker strings all of the used objects together,
> > > >     the original sections disappear into a single output object. The
> > > >     compiler controls those flags regardless of what libgcc does.)
> > > Not sure what you mean? The linker creates two segments, one
> > > with and one without the SHF_ARM_PURECODE flag.
> >
> > When libgcc is compiled "normally", individual objects in libgcc.a are
> > compiled _without_ SHF_ARM_PURECODE.  Note line 4 below, with flags
> > "AX" only (no "y").
> >
> > `readelf -S arm-none-eabi/thumb/v6-m/nofp/libgcc/libgcc.a`
> >
> >     File: arm-none-eabi/thumb/v6-m/nofp/libgcc/libgcc.a(_clzsi2.o)
> >     There are 19 section headers, starting at offset 0x43c:
> >
> >     Section Headers:
> >       [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
> >       [ 0]                   NULL            00000000 000000 000000 00      0   0  0
> >       [ 1] .text             PROGBITS        00000000 000034 000000 00  AX  0   0  2
> >       [ 2] .data             PROGBITS        00000000 000034 000000 00  WA  0   0  1
> >       [ 3] .bss              NOBITS          00000000 000034 000000 00  WA  0   0  1
> >   ==> [ 4] .text.sorted[...] PROGBITS        00000000 000034 000034 00  AX  0   0  4
> >       ...
> >     Key to Flags:
> >       W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
> >       L (link order), O (extra OS processing required), G (group), T (TLS),
> >       C (compressed), x (unknown), o (OS specific), E (exclude),
> >       y (purecode), p (processor specific)
> >
> > When this "normal" libgcc is linked into an -mpure-code program
> > (e.g. with RUNTESTFLAGS), the linker script flattens all of the
> > sections together into a single output.  The relevant portion of the
> > linker script:
> >
> >       .text           :
> >       {
> >         *(.text.unlikely .text.*_unlikely .text.unlikely.*)
> >         *(.text.exit .text.exit.*)
> >         *(.text.startup .text.startup.*)
> >         *(.text.hot .text.hot.*)
> >         *(SORT(.text.sorted.*)) // _clzsi2.o matches here
> >         *(.text .stub .text.* .gnu.linkonce.t.*) // main.o matches here
> >         /* .gnu.warning sections are handled specially by elf.em.  */
> >         *(.gnu.warning)
> >         *(.glue_7t) *(.glue_7) *(.vfp11_veneer) *(.v4_bx)
> >       }
> >
> > I can't pretend to know how the linker merges conflicting flags from the
> > various input sections, but the final binary has the attributes
> > "AXy" as expected from the top level compile (line 2):
> >
> > `readelf -Sl builtin-bitops-1.exe`
> >
> >     There are 22 section headers, starting at offset 0x10934:
> >
> >     Section Headers:
> >       [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
> >       [ 0]                   NULL            00000000 000000 000000 00      0   0  0
> >       [ 1] .init             PROGBITS        00008000 008000 00000c 00  AX  0   0  4
> >   ==> [ 2] .text             PROGBITS        0000800c 00800c 00455c 00 AXy  0   0  4
> >       [ 3] .fini             PROGBITS        0000c568 00c568 00000c 00  AX  0   0  4
> >       [ 4] .rodata           PROGBITS        0000c574 00c574 000050 00   A  0   0  4
> >       [ 5] .ARM.exidx        ARM_EXIDX       0000c5c4 00c5c4 000008 00  AL  2   0  4
> >       [ 6] .eh_frame         PROGBITS        0000c5cc 00c5cc 000124 00   A  0   0  4
> >       [ 7] .init_array       INIT_ARRAY      0001c6f0 00c6f0 000004 04  WA  0   0  4
> >       [ 8] .fini_array       FINI_ARRAY      0001c6f4 00c6f4 000004 04  WA  0   0  4
> >       [ 9] .data             PROGBITS        0001c6f8 00c6f8 000a30 00  WA  0   0  8
> >       [10] .bss              NOBITS          0001d128 00d128 000114 00  WA  0   0  4
> >       ...
> >     Key to Flags:
> >       W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
> >       L (link order), O (extra OS processing required), G (group), T (TLS),
> >       C (compressed), x (unknown), o (OS specific), E (exclude),
> >       y (purecode), p (processor specific)
> >
> >     There are 3 program headers, starting at offset 52
> >
> >     Program Headers:
> >       Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
> >       EXIDX          0x00c5c4 0x0000c5c4 0x0000c5c4 0x00008 0x00008 R   0x4
> >       LOAD           0x000000 0x00000000 0x00000000 0x0c6f0 0x0c6f0 R E 0x10000
> >       LOAD           0x00c6f0 0x0001c6f0 0x0001c6f0 0x00a38 0x00b4c RW  0x10000
> >
> >      Section to Segment mapping:
> >       Segment Sections...
> >        00     .ARM.exidx
> >   ==>   01     .init .text .fini .rodata .ARM.exidx .eh_frame
> >        02     .init_array .fini_array .data .bss
> >
> > A segment contains complete sections, not the other way around.  As you
> > can see above, the first LOAD segment contains the entire ".text" plus
> > some other sections.   Thus, SHF_ARM_PURECODE flag really appears
> > to apply to all of "text", even though the bits linked in from libgcc
> > weren't built or flagged this way.
> >
> > >
> > > > [2] The existing pure-code tests are compile-only and cover just the
> > > >     disassembled 'main.o'.  There is no test of a complete executable
> > > >     and there is no execution/simulation.
> > > That's something I did manually: run the full gcc testsuite, forcing -mpure-code
> > > in RUNTESTFLAGS. This way, all execution tests are compiled with -mpure-code,
> > > and this is how I found several bugs.
> > >
> > > > [3] While other parts of binutils may understand SHF_ARM_PURECODE, I
> > > >     don't think the simulator checks section flags or throws exceptions.
> > > Indeed, I know of no public simulator that honors this flag. I do have
> > > one though.
> > >
> > > > [4] builtin-bitops-1 modified this way will always fail due to the array
> > > >     data definitions (longs, longlongs, etc).  GCC can't translate those
> > > >     to instructions.  While the ".data" section would presumably be
> > > >     readable, scan-assembler-not doesn't know the difference.
> > > Sure, adding such scan-assembler-not is not suitable for any existing testcase.
> > > That's why it is only in place for testcases dedicated to -mpure-code.
> > >
> > > > [5] Even if the simulator were modified to throw exceptions, this will
> > > >     continue to fail because _mainCRTStartup uses a literal pool.
> > > Yes, in general, and that's why I mentioned problems with newlib
> > > earlier in this thread.
> > >
> > > However, the simulator I use only throws an exception for code executed with
> > > SHF_ARM_PURECODE. Code in the "regular" code segment is not checked.
> > > So this does not catch errors in hand-written assembly code using regular .text,
> > > but it enables to run larger validations (such as the whole GCC testsuite)
> > > without having to fix all of newlib.
> > > Not perfect, as it left the issues in libgcc we are discussing, but it
> > > helped me fix
> > > several bugs in -mpure-code.
> >
> > Yet again I suspect that you have a custom linker script, or there's
> > some other major difference.  Using the public releases of binutils,
> > newlib, etc, my experiences just aren't lining up with yours.
>
> Yep, the linker script makes the difference.
>
> >
> > > Thanks,
> > >
> > > Christophe
> > >
> > > >
> > > > > Thanks,
> > > > >
> > > > > Christophe
> > > > >
> > > > > --snip--
> >
> > If the test server farm is free at some point, would you mind running
> > another set of regression tests on my v5 patch series?
>
> Sure. Given the number of sub-patches, can you send it to me as a
> single patch file
> (git format) that I can directly apply to GCC trunk?
> My mailer does not want to help with saving each patch as a proper
> patch file :-(
>

The validation results came back clean (no regression found).
Thanks

Christophe

> Thanks
>
> Christophe
>
> >
> > Regards,
> > Daniel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0
  2021-01-25 17:48                                     ` Christophe Lyon
@ 2021-01-25 23:36                                       ` Daniel Engel
  0 siblings, 0 replies; 26+ messages in thread
From: Daniel Engel @ 2021-01-25 23:36 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: Richard Earnshaw, gcc Patches

> > > --snip--
> > >
> > > If the test server farm is free at some point, would you mind running
> > > another set of regression tests on my v5 patch series?
> >
> > Sure. Given the number of sub-patches, can you send it to me as a
> > single patch file
> > (git format) that I can directly apply to GCC trunk?
> > My mailer does not want to help with saving each patch as a proper
> > patch file :-(
> >
> 
> The validation results came back clean (no regression found).
> Thanks
> 
> Christophe

Appreciate the update.  Seems that the linker "bug" really was all that
I was fighting there at the end (see patch number 33/33).

I did see the announcement for stage 4 last week, so I think this is all
I can do for now.  With luck I will be back in October or so.

Thanks again,
Daniel

> 
> > Thanks
> >
> > Christophe
> >
> > >
> > > Regards,
> > > Daniel
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2021-01-25 23:35 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-12 23:04 [PATCH] libgcc: Thumb-1 Floating-Point Library for Cortex M0 Daniel Engel
2020-11-26  9:14 ` Christophe Lyon
2020-12-02  3:32   ` Daniel Engel
2020-12-16 17:15     ` Christophe Lyon
2021-01-06 11:20       ` [PATCH v3] " Daniel Engel
2021-01-06 17:05         ` Richard Earnshaw
2021-01-07  0:59           ` Daniel Engel
2021-01-07 12:56             ` Richard Earnshaw
2021-01-07 13:27               ` Christophe Lyon
2021-01-07 16:44                 ` Richard Earnshaw
2021-01-09 12:28               ` Daniel Engel
2021-01-09 13:09                 ` Christophe Lyon
2021-01-09 18:04                   ` Daniel Engel
2021-01-11 14:49                     ` Richard Earnshaw
2021-01-09 18:48                   ` Daniel Engel
2021-01-11 16:07                   ` Christophe Lyon
2021-01-11 16:18                     ` Daniel Engel
2021-01-11 16:39                       ` Christophe Lyon
2021-01-15 11:40                         ` Daniel Engel
2021-01-15 12:30                           ` Christophe Lyon
2021-01-16 16:14                             ` Daniel Engel
2021-01-21 10:29                               ` Christophe Lyon
2021-01-21 20:35                                 ` Daniel Engel
2021-01-22 18:28                                   ` Christophe Lyon
2021-01-25 17:48                                     ` Christophe Lyon
2021-01-25 23:36                                       ` Daniel Engel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).