* [PATCH 0/11] Improve Mips target
@ 2025-01-23 13:42 Aleksandar Rakic
2025-01-23 13:42 ` [PATCH 00/11] " Aleksandar Rakic
` (11 more replies)
0 siblings, 12 replies; 17+ messages in thread
From: Aleksandar Rakic @ 2025-01-23 13:42 UTC (permalink / raw)
To: libc-alpha; +Cc: aleksandar.rakic, djordje.todorovic, cfu
This patch series improves support for the MIPS target in glibc,
adding several enhancements and bug fixes.
These patches are cherry-picked from the mips_rel/2_28/master branch of
the MIPS repository: https://github.com/MIPS/glibc.
Further details on the individual changes are included in the respective
patches.
* [PATCH 00/11] Improve Mips target
2025-01-23 13:42 [PATCH 0/11] Improve Mips target Aleksandar Rakic
@ 2025-01-23 13:42 ` Aleksandar Rakic
2025-01-23 13:42 ` [PATCH 01/11] Updates for microMIPS Release 6 Aleksandar Rakic
` (10 subsequent siblings)
11 siblings, 0 replies; 17+ messages in thread
From: Aleksandar Rakic @ 2025-01-23 13:42 UTC (permalink / raw)
To: libc-alpha; +Cc: aleksandar.rakic, djordje.todorovic, cfu
Aleksandar Rakic (11):
Updates for microMIPS Release 6
Fix rtld link_map initialization issues
Fix issues with removing no-reorder directives
Add C implementation of memcpy/memset
Add optimized assembly for strcmp
Fix prefetching beyond copied memory
Fix strcmp bug for little endian target
Add script to run tests through a qemu wrapper
Avoid warning from -Wbuiltin-declaration-mismatch
Avoid GCC 11 warning from -Wmaybe-uninitialized
Prevent turning memset into self-recursion
elf/rtld.c | 14 +-
scripts/cross-test-qemu.sh | 152 ++++
sysdeps/ieee754/dbl-64/s_modf.c | 4 +
sysdeps/ieee754/dbl-64/s_sincos.c | 4 +
sysdeps/ieee754/soft-fp/s_fdiv.c | 1 +
sysdeps/mips/Makefile | 5 +
sysdeps/mips/add_n.S | 12 +-
sysdeps/mips/addmul_1.S | 11 +-
sysdeps/mips/bsd-setjmp.S | 2 +-
sysdeps/mips/dl-machine.h | 15 +-
sysdeps/mips/dl-trampoline.c | 4 -
sysdeps/mips/lshift.S | 12 +-
sysdeps/mips/machine-gmon.h | 82 ++
sysdeps/mips/memcpy.S | 868 -------------------
sysdeps/mips/memcpy.c | 449 ++++++++++
sysdeps/mips/memset.S | 426 ---------
sysdeps/mips/memset.c | 187 ++++
sysdeps/mips/mips32/crtn.S | 12 +-
sysdeps/mips/mips64/__longjmp.c | 2 +-
sysdeps/mips/mips64/add_n.S | 12 +-
sysdeps/mips/mips64/addmul_1.S | 11 +-
sysdeps/mips/mips64/lshift.S | 12 +-
sysdeps/mips/mips64/mul_1.S | 11 +-
sysdeps/mips/mips64/n32/crtn.S | 12 +-
sysdeps/mips/mips64/n64/crtn.S | 12 +-
sysdeps/mips/mips64/rshift.S | 12 +-
sysdeps/mips/mips64/sub_n.S | 12 +-
sysdeps/mips/mips64/submul_1.S | 11 +-
sysdeps/mips/mul_1.S | 11 +-
sysdeps/mips/rshift.S | 12 +-
sysdeps/mips/strcmp.S | 229 +++--
sysdeps/mips/sub_n.S | 12 +-
sysdeps/mips/submul_1.S | 11 +-
sysdeps/mips/sys/asm.h | 20 +-
sysdeps/unix/mips/mips32/sysdep.h | 4 -
sysdeps/unix/mips/mips64/sysdep.h | 4 -
sysdeps/unix/mips/sysdep.h | 2 -
sysdeps/unix/sysv/linux/mips/mips32/sysdep.h | 10 -
sysdeps/unix/sysv/linux/mips/mips64/sysdep.h | 14 -
39 files changed, 1108 insertions(+), 1588 deletions(-)
create mode 100755 scripts/cross-test-qemu.sh
delete mode 100644 sysdeps/mips/memcpy.S
create mode 100644 sysdeps/mips/memcpy.c
delete mode 100644 sysdeps/mips/memset.S
create mode 100644 sysdeps/mips/memset.c
--
2.34.1
* [PATCH 01/11] Updates for microMIPS Release 6
2025-01-23 13:42 [PATCH 0/11] Improve Mips target Aleksandar Rakic
2025-01-23 13:42 ` [PATCH 00/11] " Aleksandar Rakic
@ 2025-01-23 13:42 ` Aleksandar Rakic
2025-01-23 13:42 ` [PATCH 02/11] Fix rtld link_map initialization issues Aleksandar Rakic
` (9 subsequent siblings)
11 siblings, 0 replies; 17+ messages in thread
From: Aleksandar Rakic @ 2025-01-23 13:42 UTC (permalink / raw)
To: libc-alpha
Cc: aleksandar.rakic, djordje.todorovic, cfu, Matthew Fortune,
Andrew Bennett, Faraz Shahbazker
* Remove noreorder directives (the reordering pattern is sketched below)
* Fix PC-relative code label calculations for microMIPS R6
* Add special versions of code that would be de-optimised by removing
  noreorder
* Avoid use of the unaligned ADDIUPC instruction for address calculation.
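The delay-slot cleanup follows the same pattern in every .S file touched
below: once .set noreorder is dropped, the instruction that used to sit
in a branch delay slot is moved ahead of the branch, explicit nops
disappear, and the assembler is left to schedule the slots itself. A
minimal schematic of the transformation (illustrative only, not taken
verbatim from any single file; the add_n.S hunks below show real
instances):

    # Before: manual delay-slot scheduling under noreorder
            .set    noreorder
            beq     $9,$0,L(L0)     # branch
            move    $2,$0           # executed in the delay slot
            ...
            j       $31             # return
            or      $2,$2,$8        # delay slot of the return

    # After: reorder mode, assembler fills the delay slots
            move    $2,$0
            beq     $9,$0,L(L0)
            ...
            or      $2,$2,$8
            jr      $31             # explicit register-jump mnemonic

For the microMIPS R6 jump tables, ADDIUPC/lapc needs 4-byte alignment to
produce a valid address, so the tables are instead addressed with an
auipc/addiu %pcrel_hi/%pcrel_lo pair, the entries become 2-byte bc16
branches (PTR_BC), and the table index is scaled by 1 instead of 2.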
Cherry-picked 94a52199502361be4a5b1cc616661e287416cc8d
from https://github.com/MIPS/glibc
Signed-off-by: Matthew Fortune <matthew.fortune@imgtec.com>
Signed-off-by: Andrew Bennett <andrew.bennett@imgtec.com>
Signed-off-by: Faraz Shahbazker <fshahbazker@wavecomp.com>
Signed-off-by: Aleksandar Rakic <aleksandar.rakic@htecgroup.com>
---
sysdeps/mips/add_n.S | 12 +-
sysdeps/mips/addmul_1.S | 11 +-
sysdeps/mips/dl-machine.h | 15 ++-
sysdeps/mips/dl-trampoline.c | 4 -
sysdeps/mips/lshift.S | 12 +-
sysdeps/mips/machine-gmon.h | 82 +++++++++++++
sysdeps/mips/memcpy.S | 120 +++++++++++--------
sysdeps/mips/memset.S | 62 +++++-----
sysdeps/mips/mips32/crtn.S | 12 +-
sysdeps/mips/mips64/__longjmp.c | 2 +-
sysdeps/mips/mips64/add_n.S | 12 +-
sysdeps/mips/mips64/addmul_1.S | 11 +-
sysdeps/mips/mips64/lshift.S | 12 +-
sysdeps/mips/mips64/mul_1.S | 11 +-
sysdeps/mips/mips64/n32/crtn.S | 12 +-
sysdeps/mips/mips64/n64/crtn.S | 12 +-
sysdeps/mips/mips64/rshift.S | 12 +-
sysdeps/mips/mips64/sub_n.S | 12 +-
sysdeps/mips/mips64/submul_1.S | 11 +-
sysdeps/mips/mul_1.S | 11 +-
sysdeps/mips/rshift.S | 12 +-
sysdeps/mips/sub_n.S | 12 +-
sysdeps/mips/submul_1.S | 11 +-
sysdeps/mips/sys/asm.h | 20 +---
sysdeps/unix/mips/mips32/sysdep.h | 4 -
sysdeps/unix/mips/mips64/sysdep.h | 4 -
sysdeps/unix/mips/sysdep.h | 2 -
sysdeps/unix/sysv/linux/mips/mips32/sysdep.h | 10 --
sysdeps/unix/sysv/linux/mips/mips64/sysdep.h | 14 ---
29 files changed, 260 insertions(+), 277 deletions(-)
diff --git a/sysdeps/mips/add_n.S b/sysdeps/mips/add_n.S
index 234e1e3c8d..f4d98fa38c 100644
--- a/sysdeps/mips/add_n.S
+++ b/sysdeps/mips/add_n.S
@@ -31,19 +31,16 @@ along with the GNU MP Library. If not, see
.option pic2
#endif
ENTRY (__mpn_add_n)
- .set noreorder
#ifdef __PIC__
.cpload t9
#endif
- .set nomacro
-
lw $10,0($5)
lw $11,0($6)
addiu $7,$7,-1
and $9,$7,4-1 /* number of limbs in first loop */
- beq $9,$0,L(L0) /* if multiple of 4 limbs, skip first loop */
move $2,$0
+ beq $9,$0,L(L0) /* if multiple of 4 limbs, skip first loop */
subu $7,$7,$9
@@ -61,11 +58,10 @@ L(Loop0): addiu $9,$9,-1
addiu $6,$6,4
move $10,$12
move $11,$13
- bne $9,$0,L(Loop0)
addiu $4,$4,4
+ bne $9,$0,L(Loop0)
L(L0): beq $7,$0,L(end)
- nop
L(Loop): addiu $7,$7,-4
@@ -108,14 +104,14 @@ L(Loop): addiu $7,$7,-4
addiu $5,$5,16
addiu $6,$6,16
- bne $7,$0,L(Loop)
addiu $4,$4,16
+ bne $7,$0,L(Loop)
L(end): addu $11,$11,$2
sltu $8,$11,$2
addu $11,$10,$11
sltu $2,$11,$10
sw $11,0($4)
- j $31
or $2,$2,$8
+ jr $31
END (__mpn_add_n)
diff --git a/sysdeps/mips/addmul_1.S b/sysdeps/mips/addmul_1.S
index 523478d7e8..eea26630fc 100644
--- a/sysdeps/mips/addmul_1.S
+++ b/sysdeps/mips/addmul_1.S
@@ -31,12 +31,9 @@ along with the GNU MP Library. If not, see
.option pic2
#endif
ENTRY (__mpn_addmul_1)
- .set noreorder
#ifdef __PIC__
.cpload t9
#endif
- .set nomacro
-
/* warm up phase 0 */
lw $8,0($5)
@@ -50,12 +47,12 @@ ENTRY (__mpn_addmul_1)
#endif
addiu $6,$6,-1
- beq $6,$0,L(LC0)
move $2,$0 /* zero cy2 */
+ beq $6,$0,L(LC0)
addiu $6,$6,-1
- beq $6,$0,L(LC1)
lw $8,0($5) /* load new s1 limb as early as possible */
+ beq $6,$0,L(LC1)
L(Loop): lw $10,0($4)
#if __mips_isa_rev < 6
@@ -81,8 +78,8 @@ L(Loop): lw $10,0($4)
addu $2,$2,$10
sw $3,0($4)
addiu $4,$4,4
- bne $6,$0,L(Loop) /* should be "bnel" */
addu $2,$9,$2 /* add high product limb and carry from addition */
+ bne $6,$0,L(Loop) /* should be "bnel" */
/* cool down phase 1 */
L(LC1): lw $10,0($4)
@@ -123,6 +120,6 @@ L(LC0): lw $10,0($4)
sltu $10,$3,$10
addu $2,$2,$10
sw $3,0($4)
- j $31
addu $2,$9,$2 /* add high product limb and carry from addition */
+ jr $31
END (__mpn_addmul_1)
diff --git a/sysdeps/mips/dl-machine.h b/sysdeps/mips/dl-machine.h
index 10e30f1e90..a360dfcd63 100644
--- a/sysdeps/mips/dl-machine.h
+++ b/sysdeps/mips/dl-machine.h
@@ -127,16 +127,13 @@ elf_machine_load_address (void)
{
ElfW(Addr) addr;
#ifndef __mips16
- asm (" .set noreorder\n"
- " " STRINGXP (PTR_LA) " %0, 0f\n"
+ asm (" " STRINGXP (PTR_LA) " %0, 0f\n"
# if !defined __mips_isa_rev || __mips_isa_rev < 6
" bltzal $0, 0f\n"
- " nop\n"
+#else
+ " bal 0f\n"
+#endif
"0: " STRINGXP (PTR_SUBU) " %0, $31, %0\n"
-# else
- "0: addiupc $31, 0\n"
- " " STRINGXP (PTR_SUBU) " %0, $31, %0\n"
-# endif
" .set reorder\n"
: "=r" (addr)
: /* No inputs */
@@ -237,7 +234,9 @@ do { \
and not just plain _start. */
#ifndef __mips16
-# if !defined __mips_isa_rev || __mips_isa_rev < 6
+/* Although microMIPSr6 has an ADDIUPC instruction, it must be 4-byte aligned
+ for the address calculation to be valid. */
+# if !defined __mips_isa_rev || __mips_isa_rev < 6 || defined __mips_micromips
# define LCOFF STRINGXP(.Lcof2)
# define LOAD_31 STRINGXP(bltzal $8) "," STRINGXP(.Lcof2)
# else
diff --git a/sysdeps/mips/dl-trampoline.c b/sysdeps/mips/dl-trampoline.c
index 603ee2d2f8..915e1da6ad 100644
--- a/sysdeps/mips/dl-trampoline.c
+++ b/sysdeps/mips/dl-trampoline.c
@@ -301,7 +301,6 @@ asm ("\n\
.ent _dl_runtime_resolve\n\
_dl_runtime_resolve:\n\
.frame $29, " STRINGXP(ELF_DL_FRAME_SIZE) ", $31\n\
- .set noreorder\n\
# Save GP.\n\
1: move $3, $28\n\
# Save arguments and sp value in stack.\n\
@@ -311,7 +310,6 @@ _dl_runtime_resolve:\n\
# Compute GP.\n\
2: " STRINGXP(SETUP_GP) "\n\
" STRINGXV(SETUP_GP64 (0, _dl_runtime_resolve)) "\n\
- .set reorder\n\
# Save slot call pc.\n\
move $2, $31\n\
" IFABIO32(STRINGXP(CPRESTORE(32))) "\n\
@@ -358,7 +356,6 @@ asm ("\n\
.ent _dl_runtime_pltresolve\n\
_dl_runtime_pltresolve:\n\
.frame $29, " STRINGXP(ELF_DL_PLT_FRAME_SIZE) ", $31\n\
- .set noreorder\n\
# Save arguments and sp value in stack.\n\
1: " STRINGXP(PTR_SUBIU) " $29, " STRINGXP(ELF_DL_PLT_FRAME_SIZE) "\n\
" IFABIO32(STRINGXP(PTR_L) " $13, " STRINGXP(PTRSIZE) "($28)") "\n\
@@ -368,7 +365,6 @@ _dl_runtime_pltresolve:\n\
# Compute GP.\n\
2: " STRINGXP(SETUP_GP) "\n\
" STRINGXV(SETUP_GP64 (0, _dl_runtime_pltresolve)) "\n\
- .set reorder\n\
" IFABIO32(STRINGXP(CPRESTORE(32))) "\n\
" ELF_DL_PLT_SAVE_ARG_REGS "\
move $4, $13\n\
diff --git a/sysdeps/mips/lshift.S b/sysdeps/mips/lshift.S
index 04caa76a84..c6c42aa1f5 100644
--- a/sysdeps/mips/lshift.S
+++ b/sysdeps/mips/lshift.S
@@ -30,12 +30,9 @@ along with the GNU MP Library. If not, see
.option pic2
#endif
ENTRY (__mpn_lshift)
- .set noreorder
#ifdef __PIC__
.cpload t9
#endif
- .set nomacro
-
sll $2,$6,2
addu $5,$5,$2 /* make r5 point at end of src */
lw $10,-4($5) /* load first limb */
@@ -43,8 +40,8 @@ ENTRY (__mpn_lshift)
addu $4,$4,$2 /* make r4 point at end of res */
addiu $6,$6,-1
and $9,$6,4-1 /* number of limbs in first loop */
- beq $9,$0,L(L0) /* if multiple of 4 limbs, skip first loop */
srl $2,$10,$13 /* compute function result */
+ beq $9,$0,L(L0) /* if multiple of 4 limbs, skip first loop */
subu $6,$6,$9
@@ -56,11 +53,10 @@ L(Loop0): lw $3,-8($5)
srl $12,$3,$13
move $10,$3
or $8,$11,$12
- bne $9,$0,L(Loop0)
sw $8,0($4)
+ bne $9,$0,L(Loop0)
L(L0): beq $6,$0,L(Lend)
- nop
L(Loop): lw $3,-8($5)
addiu $4,$4,-16
@@ -88,10 +84,10 @@ L(Loop): lw $3,-8($5)
addiu $5,$5,-16
or $8,$14,$9
- bgtz $6,L(Loop)
sw $8,0($4)
+ bgtz $6,L(Loop)
L(Lend): sll $8,$10,$7
- j $31
sw $8,-4($4)
+ jr $31
END (__mpn_lshift)
diff --git a/sysdeps/mips/machine-gmon.h b/sysdeps/mips/machine-gmon.h
index e2e0756575..d890e5ec19 100644
--- a/sysdeps/mips/machine-gmon.h
+++ b/sysdeps/mips/machine-gmon.h
@@ -34,6 +34,42 @@ static void __attribute_used__ __mcount (u_long frompc, u_long selfpc)
# define CPRESTORE
#endif
+#if __mips_isa_rev > 5 && defined (__mips_micromips)
+#define MCOUNT asm(\
+ ".globl _mcount;\n\t" \
+ ".align 2;\n\t" \
+ ".set push;\n\t" \
+ ".set nomips16;\n\t" \
+ ".type _mcount,@function;\n\t" \
+ ".ent _mcount\n\t" \
+ "_mcount:\n\t" \
+ ".frame $sp,44,$31\n\t" \
+ ".set noat;\n\t" \
+ CPLOAD \
+ "subu $29,$29,48;\n\t" \
+ CPRESTORE \
+ "sw $4,24($29);\n\t" \
+ "sw $5,28($29);\n\t" \
+ "sw $6,32($29);\n\t" \
+ "sw $7,36($29);\n\t" \
+ "sw $2,40($29);\n\t" \
+ "sw $1,16($29);\n\t" \
+ "sw $31,20($29);\n\t" \
+ "move $5,$31;\n\t" \
+ "move $4,$1;\n\t" \
+ "balc __mcount;\n\t" \
+ "lw $4,24($29);\n\t" \
+ "lw $5,28($29);\n\t" \
+ "lw $6,32($29);\n\t" \
+ "lw $7,36($29);\n\t" \
+ "lw $2,40($29);\n\t" \
+ "lw $1,20($29);\n\t" \
+ "lw $31,16($29);\n\t" \
+ "addu $29,$29,56;\n\t" \
+ "jrc $1;\n\t" \
+ ".end _mcount;\n\t" \
+ ".set pop");
+#else
#define MCOUNT asm(\
".globl _mcount;\n\t" \
".align 2;\n\t" \
@@ -71,6 +107,7 @@ static void __attribute_used__ __mcount (u_long frompc, u_long selfpc)
"move $31,$1;\n\t" \
".end _mcount;\n\t" \
".set pop");
+#endif
#else
@@ -97,6 +134,50 @@ static void __attribute_used__ __mcount (u_long frompc, u_long selfpc)
# error "Unknown ABI"
#endif
+#if __mips_isa_rev > 5 && defined (__mips_micromips)
+#define MCOUNT asm(\
+ ".globl _mcount;\n\t" \
+ ".align 3;\n\t" \
+ ".set push;\n\t" \
+ ".set nomips16;\n\t" \
+ ".type _mcount,@function;\n\t" \
+ ".ent _mcount\n\t" \
+ "_mcount:\n\t" \
+ ".frame $sp,88,$31\n\t" \
+ ".set noat;\n\t" \
+ PTR_SUBU_STRING " $29,$29,96;\n\t" \
+ CPSETUP \
+ "sd $4,24($29);\n\t" \
+ "sd $5,32($29);\n\t" \
+ "sd $6,40($29);\n\t" \
+ "sd $7,48($29);\n\t" \
+ "sd $8,56($29);\n\t" \
+ "sd $9,64($29);\n\t" \
+ "sd $10,72($29);\n\t" \
+ "sd $11,80($29);\n\t" \
+ "sd $2,16($29);\n\t" \
+ "sd $1,0($29);\n\t" \
+ "sd $31,8($29);\n\t" \
+ "move $5,$31;\n\t" \
+ "move $4,$1;\n\t" \
+ "balc __mcount;\n\t" \
+ "ld $4,24($29);\n\t" \
+ "ld $5,32($29);\n\t" \
+ "ld $6,40($29);\n\t" \
+ "ld $7,48($29);\n\t" \
+ "ld $8,56($29);\n\t" \
+ "ld $9,64($29);\n\t" \
+ "ld $10,72($29);\n\t" \
+ "ld $11,80($29);\n\t" \
+ "ld $2,16($29);\n\t" \
+ "ld $1,8($29);\n\t" \
+ "ld $31,0($29);\n\t" \
+ CPRETURN \
+ PTR_ADDU_STRING " $29,$29,96;\n\t" \
+ "jrc $1;\n\t" \
+ ".end _mcount;\n\t" \
+ ".set pop");
+#else
#define MCOUNT asm(\
".globl _mcount;\n\t" \
".align 3;\n\t" \
@@ -142,5 +223,6 @@ static void __attribute_used__ __mcount (u_long frompc, u_long selfpc)
"move $31,$1;\n\t" \
".end _mcount;\n\t" \
".set pop");
+#endif
#endif
diff --git a/sysdeps/mips/memcpy.S b/sysdeps/mips/memcpy.S
index 5b277e07c5..96d1c92d89 100644
--- a/sysdeps/mips/memcpy.S
+++ b/sysdeps/mips/memcpy.S
@@ -86,6 +86,12 @@
# endif
#endif
+#if __mips_isa_rev > 5 && defined (__mips_micromips)
+# define PTR_BC bc16
+#else
+# define PTR_BC bc
+#endif
+
/*
* Using PREFETCH_HINT_LOAD_STREAMED instead of PREFETCH_LOAD on load
* prefetches appear to offer a slight performance advantage.
@@ -272,7 +278,6 @@ LEAF(MEMCPY_NAME, 0)
LEAF(MEMCPY_NAME)
#endif
.set nomips16
- .set noreorder
/*
* Below we handle the case where memcpy is called with overlapping src and dst.
* Although memcpy is not required to handle this case, some parts of Android
@@ -284,10 +289,9 @@ LEAF(MEMCPY_NAME)
xor t1,t0,t2
PTR_SUBU t0,t1,t2
sltu t2,t0,a2
- beq t2,zero,L(memcpy)
la t9,memmove
+ beq t2,zero,L(memcpy)
jr t9
- nop
L(memcpy):
#endif
/*
@@ -295,12 +299,12 @@ L(memcpy):
* size, copy dst pointer to v0 for the return value.
*/
slti t2,a2,(2 * NSIZE)
- bne t2,zero,L(lasts)
#if defined(RETURN_FIRST_PREFETCH) || defined(RETURN_LAST_PREFETCH)
move v0,zero
#else
move v0,a0
#endif
+ bne t2,zero,L(lasts)
#ifndef R6_CODE
@@ -312,12 +316,12 @@ L(memcpy):
*/
xor t8,a1,a0
andi t8,t8,(NSIZE-1) /* t8 is a0/a1 word-displacement */
- bne t8,zero,L(unaligned)
PTR_SUBU a3, zero, a0
+ bne t8,zero,L(unaligned)
andi a3,a3,(NSIZE-1) /* copy a3 bytes to align a0/a1 */
+ PTR_SUBU a2,a2,a3 /* a2 is the remaining bytes count */
beq a3,zero,L(aligned) /* if a3=0, it is already aligned */
- PTR_SUBU a2,a2,a3 /* a2 is the remaining bytes count */
C_LDHI t8,0(a1)
PTR_ADDU a1,a1,a3
@@ -332,18 +336,24 @@ L(memcpy):
* align instruction.
*/
andi t8,a0,7
+#ifdef __mips_micromips
+ auipc t9,%pcrel_hi(L(atable))
+ addiu t9,t9,%pcrel_lo(L(atable)+4)
+ PTR_LSA t9,t8,t9,1
+#else
lapc t9,L(atable)
PTR_LSA t9,t8,t9,2
+#endif
jrc t9
L(atable):
- bc L(lb0)
- bc L(lb7)
- bc L(lb6)
- bc L(lb5)
- bc L(lb4)
- bc L(lb3)
- bc L(lb2)
- bc L(lb1)
+ PTR_BC L(lb0)
+ PTR_BC L(lb7)
+ PTR_BC L(lb6)
+ PTR_BC L(lb5)
+ PTR_BC L(lb4)
+ PTR_BC L(lb3)
+ PTR_BC L(lb2)
+ PTR_BC L(lb1)
L(lb7):
lb a3, 6(a1)
sb a3, 6(a0)
@@ -374,20 +384,26 @@ L(lb1):
L(lb0):
andi t8,a1,(NSIZE-1)
+#ifdef __mips_micromips
+ auipc t9,%pcrel_hi(L(jtable))
+ addiu t9,t9,%pcrel_lo(L(jtable)+4)
+ PTR_LSA t9,t8,t9,1
+#else
lapc t9,L(jtable)
PTR_LSA t9,t8,t9,2
+#endif
jrc t9
L(jtable):
- bc L(aligned)
- bc L(r6_unaligned1)
- bc L(r6_unaligned2)
- bc L(r6_unaligned3)
-# ifdef USE_DOUBLE
- bc L(r6_unaligned4)
- bc L(r6_unaligned5)
- bc L(r6_unaligned6)
- bc L(r6_unaligned7)
-# endif
+ PTR_BC L(aligned)
+ PTR_BC L(r6_unaligned1)
+ PTR_BC L(r6_unaligned2)
+ PTR_BC L(r6_unaligned3)
+#ifdef USE_DOUBLE
+ PTR_BC L(r6_unaligned4)
+ PTR_BC L(r6_unaligned5)
+ PTR_BC L(r6_unaligned6)
+ PTR_BC L(r6_unaligned7)
+#endif
#endif /* R6_CODE */
L(aligned):
@@ -401,8 +417,8 @@ L(aligned):
*/
andi t8,a2,NSIZEDMASK /* any whole 64-byte/128-byte chunks? */
- beq a2,t8,L(chkw) /* if a2==t8, no 64-byte/128-byte chunks */
PTR_SUBU a3,a2,t8 /* subtract from a2 the reminder */
+ beq a2,t8,L(chkw) /* if a2==t8, no 64-byte/128-byte chunks */
PTR_ADDU a3,a0,a3 /* Now a3 is the final dst after loop */
/* When in the loop we may prefetch with the 'prepare to store' hint,
@@ -428,7 +444,6 @@ L(aligned):
# if PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE
sltu v1,t9,a0
bgtz v1,L(skip_set)
- nop
PTR_ADDIU v0,a0,(PREFETCH_CHUNK*4)
L(skip_set):
# else
@@ -444,11 +459,16 @@ L(skip_set):
#endif
L(loop16w):
C_LD t0,UNIT(0)(a1)
+/* We need to separate out the C_LD instruction here so that it will work
+ both when it is used by itself and when it is used with the branch
+ instruction. */
#if defined(USE_PREFETCH) && (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
sltu v1,t9,a0 /* If a0 > t9 don't use next prefetch */
+ C_LD t1,UNIT(1)(a1)
bgtz v1,L(skip_pref)
-#endif
+#else
C_LD t1,UNIT(1)(a1)
+#endif
#ifdef R6_CODE
PREFETCH_FOR_STORE (2, a0)
#else
@@ -502,8 +522,8 @@ L(skip_pref):
C_ST REG6,UNIT(14)(a0)
C_ST REG7,UNIT(15)(a0)
PTR_ADDIU a0,a0,UNIT(16) /* adding 64/128 to dest */
- bne a0,a3,L(loop16w)
PTR_ADDIU a1,a1,UNIT(16) /* adding 64/128 to src */
+ bne a0,a3,L(loop16w)
move a2,t8
/* Here we have src and dest word-aligned but less than 64-bytes or
@@ -517,7 +537,6 @@ L(chkw):
andi t8,a2,NSIZEMASK /* Is there a 32-byte/64-byte chunk. */
/* The t8 is the reminder count past 32-bytes */
beq a2,t8,L(chk1w) /* When a2=t8, no 32-byte chunk */
- nop
C_LD t0,UNIT(0)(a1)
C_LD t1,UNIT(1)(a1)
C_LD REG2,UNIT(2)(a1)
@@ -546,8 +565,8 @@ L(chkw):
*/
L(chk1w):
andi a2,t8,(NSIZE-1) /* a2 is the reminder past one (d)word chunks */
- beq a2,t8,L(lastw)
PTR_SUBU a3,t8,a2 /* a3 is count of bytes in one (d)word chunks */
+ beq a2,t8,L(lastw)
PTR_ADDU a3,a0,a3 /* a3 is the dst address after loop */
/* copying in words (4-byte or 8-byte chunks) */
@@ -555,8 +574,8 @@ L(wordCopy_loop):
C_LD REG3,UNIT(0)(a1)
PTR_ADDIU a0,a0,UNIT(1)
PTR_ADDIU a1,a1,UNIT(1)
- bne a0,a3,L(wordCopy_loop)
C_ST REG3,UNIT(-1)(a0)
+ bne a0,a3,L(wordCopy_loop)
/* If we have been copying double words, see if we can copy a single word
before doing byte copies. We can have, at most, one word to copy. */
@@ -574,17 +593,16 @@ L(lastw):
/* Copy the last 8 (or 16) bytes */
L(lastb):
- blez a2,L(leave)
PTR_ADDU a3,a0,a2 /* a3 is the last dst address */
+ blez a2,L(leave)
L(lastbloop):
lb v1,0(a1)
PTR_ADDIU a0,a0,1
PTR_ADDIU a1,a1,1
- bne a0,a3,L(lastbloop)
sb v1,-1(a0)
+ bne a0,a3,L(lastbloop)
L(leave):
- j ra
- nop
+ jr ra
/* We jump here with a memcpy of less than 8 or 16 bytes, depending on
whether or not USE_DOUBLE is defined. Instead of just doing byte
@@ -625,8 +643,8 @@ L(wcopy_loop):
L(unaligned):
andi a3,a3,(NSIZE-1) /* copy a3 bytes to align a0/a1 */
+ PTR_SUBU a2,a2,a3 /* a2 is the remaining bytes count */
beqz a3,L(ua_chk16w) /* if a3=0, it is already aligned */
- PTR_SUBU a2,a2,a3 /* a2 is the remaining bytes count */
C_LDHI v1,UNIT(0)(a1)
C_LDLO v1,UNITM1(1)(a1)
@@ -644,8 +662,8 @@ L(unaligned):
L(ua_chk16w):
andi t8,a2,NSIZEDMASK /* any whole 64-byte/128-byte chunks? */
- beq a2,t8,L(ua_chkw) /* if a2==t8, no 64-byte/128-byte chunks */
PTR_SUBU a3,a2,t8 /* subtract from a2 the reminder */
+ beq a2,t8,L(ua_chkw) /* if a2==t8, no 64-byte/128-byte chunks */
PTR_ADDU a3,a0,a3 /* Now a3 is the final dst after loop */
# if defined(USE_PREFETCH) && (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
@@ -664,7 +682,6 @@ L(ua_chk16w):
# if (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
sltu v1,t9,a0
bgtz v1,L(ua_skip_set)
- nop
PTR_ADDIU v0,a0,(PREFETCH_CHUNK*4)
L(ua_skip_set):
# else
@@ -676,11 +693,16 @@ L(ua_loop16w):
C_LDHI t0,UNIT(0)(a1)
C_LDHI t1,UNIT(1)(a1)
C_LDHI REG2,UNIT(2)(a1)
+/* We need to separate out the C_LDHI instruction here so that it will work
+ both when it is used by itself and when it is used with the branch
+ instruction. */
# if defined(USE_PREFETCH) && (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
sltu v1,t9,a0
+ C_LDHI REG3,UNIT(3)(a1)
bgtz v1,L(ua_skip_pref)
-# endif
+# else
C_LDHI REG3,UNIT(3)(a1)
+# endif
PREFETCH_FOR_STORE (4, a0)
PREFETCH_FOR_STORE (5, a0)
L(ua_skip_pref):
@@ -731,8 +753,8 @@ L(ua_skip_pref):
C_ST REG6,UNIT(14)(a0)
C_ST REG7,UNIT(15)(a0)
PTR_ADDIU a0,a0,UNIT(16) /* adding 64/128 to dest */
- bne a0,a3,L(ua_loop16w)
PTR_ADDIU a1,a1,UNIT(16) /* adding 64/128 to src */
+ bne a0,a3,L(ua_loop16w)
move a2,t8
/* Here we have src and dest word-aligned but less than 64-bytes or
@@ -745,7 +767,6 @@ L(ua_chkw):
andi t8,a2,NSIZEMASK /* Is there a 32-byte/64-byte chunk. */
/* t8 is the reminder count past 32-bytes */
beq a2,t8,L(ua_chk1w) /* When a2=t8, no 32-byte chunk */
- nop
C_LDHI t0,UNIT(0)(a1)
C_LDHI t1,UNIT(1)(a1)
C_LDHI REG2,UNIT(2)(a1)
@@ -778,8 +799,8 @@ L(ua_chkw):
*/
L(ua_chk1w):
andi a2,t8,(NSIZE-1) /* a2 is the reminder past one (d)word chunks */
- beq a2,t8,L(ua_smallCopy)
PTR_SUBU a3,t8,a2 /* a3 is count of bytes in one (d)word chunks */
+ beq a2,t8,L(ua_smallCopy)
PTR_ADDU a3,a0,a3 /* a3 is the dst address after loop */
/* copying in words (4-byte or 8-byte chunks) */
@@ -788,22 +809,21 @@ L(ua_wordCopy_loop):
C_LDLO v1,UNITM1(1)(a1)
PTR_ADDIU a0,a0,UNIT(1)
PTR_ADDIU a1,a1,UNIT(1)
- bne a0,a3,L(ua_wordCopy_loop)
C_ST v1,UNIT(-1)(a0)
+ bne a0,a3,L(ua_wordCopy_loop)
/* Copy the last 8 (or 16) bytes */
L(ua_smallCopy):
- beqz a2,L(leave)
PTR_ADDU a3,a0,a2 /* a3 is the last dst address */
+ beqz a2,L(leave)
L(ua_smallCopy_loop):
lb v1,0(a1)
PTR_ADDIU a0,a0,1
PTR_ADDIU a1,a1,1
- bne a0,a3,L(ua_smallCopy_loop)
sb v1,-1(a0)
+ bne a0,a3,L(ua_smallCopy_loop)
- j ra
- nop
+ jr ra
#else /* R6_CODE */
@@ -816,9 +836,9 @@ L(ua_smallCopy_loop):
# endif
# define R6_UNALIGNED_WORD_COPY(BYTEOFFSET) \
andi REG7, a2, (NSIZE-1);/* REG7 is # of bytes to by bytes. */ \
- beq REG7, a2, L(lastb); /* Check for bytes to copy by word */ \
PTR_SUBU a3, a2, REG7; /* a3 is number of bytes to be copied in */ \
/* (d)word chunks. */ \
+ beq REG7, a2, L(lastb); /* Check for bytes to copy by word */ \
move a2, REG7; /* a2 is # of bytes to copy byte by byte */ \
/* after word loop is finished. */ \
PTR_ADDU REG6, a0, a3; /* REG6 is the dst address after loop. */ \
@@ -831,10 +851,9 @@ L(r6_ua_wordcopy##BYTEOFFSET): \
PTR_ADDIU a0, a0, UNIT(1); /* Increment destination pointer. */ \
PTR_ADDIU REG2, REG2, UNIT(1); /* Increment aligned source pointer.*/ \
move t0, t1; /* Move second part of source to first. */ \
- bne a0, REG6,L(r6_ua_wordcopy##BYTEOFFSET); \
C_ST REG3, UNIT(-1)(a0); \
+ bne a0, REG6,L(r6_ua_wordcopy##BYTEOFFSET); \
j L(lastb); \
- nop
/* We are generating R6 code, the destination is 4 byte aligned and
the source is not 4 byte aligned. t8 is 1, 2, or 3 depending on the
@@ -859,7 +878,6 @@ L(r6_unaligned7):
#endif /* R6_CODE */
.set at
- .set reorder
END(MEMCPY_NAME)
#ifndef ANDROID_CHANGES
# ifdef _LIBC
diff --git a/sysdeps/mips/memset.S b/sysdeps/mips/memset.S
index 466599b9f4..0c8375c9f5 100644
--- a/sysdeps/mips/memset.S
+++ b/sysdeps/mips/memset.S
@@ -82,6 +82,12 @@
# endif
#endif
+#if __mips_isa_rev > 5 && defined (__mips_micromips)
+# define PTR_BC bc16
+#else
+# define PTR_BC bc
+#endif
+
/* Using PREFETCH_HINT_PREPAREFORSTORE instead of PREFETCH_STORE
or PREFETCH_STORE_STREAMED offers a large performance advantage
but PREPAREFORSTORE has some special restrictions to consider.
@@ -205,17 +211,16 @@ LEAF(MEMSET_NAME)
#endif
.set nomips16
- .set noreorder
-/* If the size is less than 2*NSIZE (8 or 16), go to L(lastb). Regardless of
+/* If the size is less than 4*NSIZE (16 or 32), go to L(lastb). Regardless of
size, copy dst pointer to v0 for the return value. */
- slti t2,a2,(2 * NSIZE)
- bne t2,zero,L(lastb)
+ slti t2,a2,(4 * NSIZE)
move v0,a0
+ bne t2,zero,L(lastb)
/* If memset value is not zero, we copy it to all the bytes in a 32 or 64
bit word. */
- beq a1,zero,L(set0) /* If memset value is zero no smear */
PTR_SUBU a3,zero,a0
+ beq a1,zero,L(set0) /* If memset value is zero no smear */
nop
/* smear byte into 32 or 64 bit word */
@@ -251,26 +256,30 @@ LEAF(MEMSET_NAME)
L(set0):
#ifndef R6_CODE
andi t2,a3,(NSIZE-1) /* word-unaligned address? */
- beq t2,zero,L(aligned) /* t2 is the unalignment count */
PTR_SUBU a2,a2,t2
+ beq t2,zero,L(aligned) /* t2 is the unalignment count */
C_STHI a1,0(a0)
PTR_ADDU a0,a0,t2
#else /* R6_CODE */
- andi t2,a0,(NSIZE-1)
+ andi t2,a0,7
+# ifdef __mips_micromips
+ auipc t9,%pcrel_hi(L(atable))
+ addiu t9,t9,%pcrel_lo(L(atable)+4)
+ PTR_LSA t9,t2,t9,1
+# else
lapc t9,L(atable)
PTR_LSA t9,t2,t9,2
+# endif
jrc t9
L(atable):
- bc L(aligned)
-# ifdef USE_DOUBLE
- bc L(lb7)
- bc L(lb6)
- bc L(lb5)
- bc L(lb4)
-# endif
- bc L(lb3)
- bc L(lb2)
- bc L(lb1)
+ PTR_BC L(aligned)
+ PTR_BC L(lb7)
+ PTR_BC L(lb6)
+ PTR_BC L(lb5)
+ PTR_BC L(lb4)
+ PTR_BC L(lb3)
+ PTR_BC L(lb2)
+ PTR_BC L(lb1)
L(lb7):
sb a1,6(a0)
L(lb6):
@@ -300,8 +309,8 @@ L(aligned):
left to store or we would have jumped to L(lastb) earlier in the code. */
#ifdef DOUBLE_ALIGN
andi t2,a3,4
- beq t2,zero,L(double_aligned)
PTR_SUBU a2,a2,t2
+ beq t2,zero,L(double_aligned)
sw a1,0(a0)
PTR_ADDU a0,a0,t2
L(double_aligned):
@@ -313,8 +322,8 @@ L(double_aligned):
chunks have been copied. We will loop, incrementing a0 until it equals
a3. */
andi t8,a2,NSIZEDMASK /* any whole 64-byte/128-byte chunks? */
- beq a2,t8,L(chkw) /* if a2==t8, no 64-byte/128-byte chunks */
PTR_SUBU a3,a2,t8 /* subtract from a2 the reminder */
+ beq a2,t8,L(chkw) /* if a2==t8, no 64-byte/128-byte chunks */
PTR_ADDU a3,a0,a3 /* Now a3 is the final dst after loop */
/* When in the loop we may prefetch with the 'prepare to store' hint,
@@ -339,7 +348,6 @@ L(loop16w):
&& (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
sltu v1,t9,a0 /* If a0 > t9 don't use next prefetch */
bgtz v1,L(skip_pref)
- nop
#endif
#ifdef R6_CODE
PREFETCH_FOR_STORE (2, a0)
@@ -366,7 +374,6 @@ L(skip_pref):
C_ST a1,UNIT(15)(a0)
PTR_ADDIU a0,a0,UNIT(16) /* adding 64/128 to dest */
bne a0,a3,L(loop16w)
- nop
move a2,t8
/* Here we have dest word-aligned but less than 64-bytes or 128 bytes to go.
@@ -376,7 +383,6 @@ L(chkw):
andi t8,a2,NSIZEMASK /* is there a 32-byte/64-byte chunk. */
/* the t8 is the reminder count past 32-bytes */
beq a2,t8,L(chk1w)/* when a2==t8, no 32-byte chunk */
- nop
C_ST a1,UNIT(0)(a0)
C_ST a1,UNIT(1)(a0)
C_ST a1,UNIT(2)(a0)
@@ -394,30 +400,28 @@ L(chkw):
been copied. We will loop, incrementing a0 until a0 equals a3. */
L(chk1w):
andi a2,t8,(NSIZE-1) /* a2 is the reminder past one (d)word chunks */
- beq a2,t8,L(lastb)
PTR_SUBU a3,t8,a2 /* a3 is count of bytes in one (d)word chunks */
+ beq a2,t8,L(lastb)
PTR_ADDU a3,a0,a3 /* a3 is the dst address after loop */
/* copying in words (4-byte or 8 byte chunks) */
L(wordCopy_loop):
PTR_ADDIU a0,a0,UNIT(1)
- bne a0,a3,L(wordCopy_loop)
C_ST a1,UNIT(-1)(a0)
+ bne a0,a3,L(wordCopy_loop)
/* Copy the last 8 (or 16) bytes */
L(lastb):
- blez a2,L(leave)
PTR_ADDU a3,a0,a2 /* a3 is the last dst address */
+ blez a2,L(leave)
L(lastbloop):
PTR_ADDIU a0,a0,1
- bne a0,a3,L(lastbloop)
sb a1,-1(a0)
+ bne a0,a3,L(lastbloop)
L(leave):
- j ra
- nop
+ jr ra
.set at
- .set reorder
END(MEMSET_NAME)
#ifndef ANDROID_CHANGES
# ifdef _LIBC
diff --git a/sysdeps/mips/mips32/crtn.S b/sysdeps/mips/mips32/crtn.S
index 89ecbd9882..568aabd86e 100644
--- a/sysdeps/mips/mips32/crtn.S
+++ b/sysdeps/mips/mips32/crtn.S
@@ -40,18 +40,10 @@
.section .init,"ax",@progbits
lw $31,28($sp)
- .set noreorder
- .set nomacro
- j $31
addiu $sp,$sp,32
- .set macro
- .set reorder
+ jr $31
.section .fini,"ax",@progbits
lw $31,28($sp)
- .set noreorder
- .set nomacro
- j $31
addiu $sp,$sp,32
- .set macro
- .set reorder
+ jr $31
diff --git a/sysdeps/mips/mips64/__longjmp.c b/sysdeps/mips/mips64/__longjmp.c
index 4a93e884c0..1a9bb7b23e 100644
--- a/sysdeps/mips/mips64/__longjmp.c
+++ b/sysdeps/mips/mips64/__longjmp.c
@@ -87,7 +87,7 @@ __longjmp (__jmp_buf env_arg, int val_arg)
else
asm volatile ("move $2, %0" : : "r" (val));
- asm volatile ("j $31");
+ asm volatile ("jr $31");
/* Avoid `volatile function does return' warnings. */
for (;;);
diff --git a/sysdeps/mips/mips64/add_n.S b/sysdeps/mips/mips64/add_n.S
index 345d62fbc5..bab523fd5a 100644
--- a/sysdeps/mips/mips64/add_n.S
+++ b/sysdeps/mips/mips64/add_n.S
@@ -37,16 +37,13 @@ ENTRY (__mpn_add_n)
#ifdef __PIC__
SETUP_GP /* ??? unused */
#endif
- .set noreorder
- .set nomacro
-
ld $10,0($5)
ld $11,0($6)
daddiu $7,$7,-1
and $9,$7,4-1 # number of limbs in first loop
- beq $9,$0,L(L0) # if multiple of 4 limbs, skip first loop
move $2,$0
+ beq $9,$0,L(L0) # if multiple of 4 limbs, skip first loop
dsubu $7,$7,$9
@@ -64,11 +61,10 @@ L(Loop0): daddiu $9,$9,-1
daddiu $6,$6,8
move $10,$12
move $11,$13
- bne $9,$0,L(Loop0)
daddiu $4,$4,8
+ bne $9,$0,L(Loop0)
L(L0): beq $7,$0,L(Lend)
- nop
L(Loop): daddiu $7,$7,-4
@@ -111,15 +107,15 @@ L(Loop): daddiu $7,$7,-4
daddiu $5,$5,32
daddiu $6,$6,32
- bne $7,$0,L(Loop)
daddiu $4,$4,32
+ bne $7,$0,L(Loop)
L(Lend): daddu $11,$11,$2
sltu $8,$11,$2
daddu $11,$10,$11
sltu $2,$11,$10
sd $11,0($4)
- j $31
or $2,$2,$8
+ jr $31
END (__mpn_add_n)
diff --git a/sysdeps/mips/mips64/addmul_1.S b/sysdeps/mips/mips64/addmul_1.S
index d105938f00..d84edd76a0 100644
--- a/sysdeps/mips/mips64/addmul_1.S
+++ b/sysdeps/mips/mips64/addmul_1.S
@@ -36,9 +36,6 @@ ENTRY (__mpn_addmul_1)
#ifdef PIC
SETUP_GP /* ??? unused */
#endif
- .set noreorder
- .set nomacro
-
# warm up phase 0
ld $8,0($5)
@@ -52,12 +49,12 @@ ENTRY (__mpn_addmul_1)
#endif
daddiu $6,$6,-1
- beq $6,$0,L(LC0)
move $2,$0 # zero cy2
+ beq $6,$0,L(LC0)
daddiu $6,$6,-1
- beq $6,$0,L(LC1)
ld $8,0($5) # load new s1 limb as early as possible
+ beq $6,$0,L(LC1)
L(Loop): ld $10,0($4)
#if __mips_isa_rev < 6
@@ -83,8 +80,8 @@ L(Loop): ld $10,0($4)
daddu $2,$2,$10
sd $3,0($4)
daddiu $4,$4,8
- bne $6,$0,L(Loop)
daddu $2,$9,$2 # add high product limb and carry from addition
+ bne $6,$0,L(Loop)
# cool down phase 1
L(LC1): ld $10,0($4)
@@ -125,7 +122,7 @@ L(LC0): ld $10,0($4)
sltu $10,$3,$10
daddu $2,$2,$10
sd $3,0($4)
- j $31
daddu $2,$9,$2 # add high product limb and carry from addition
+ jr $31
END (__mpn_addmul_1)
diff --git a/sysdeps/mips/mips64/lshift.S b/sysdeps/mips/mips64/lshift.S
index 2ea2e58b85..ca84385998 100644
--- a/sysdeps/mips/mips64/lshift.S
+++ b/sysdeps/mips/mips64/lshift.S
@@ -36,9 +36,6 @@ ENTRY (__mpn_lshift)
#ifdef __PIC__
SETUP_GP /* ??? unused */
#endif
- .set noreorder
- .set nomacro
-
dsll $2,$6,3
daddu $5,$5,$2 # make r5 point at end of src
ld $10,-8($5) # load first limb
@@ -46,8 +43,8 @@ ENTRY (__mpn_lshift)
daddu $4,$4,$2 # make r4 point at end of res
daddiu $6,$6,-1
and $9,$6,4-1 # number of limbs in first loop
- beq $9,$0,L(L0) # if multiple of 4 limbs, skip first loop
dsrl $2,$10,$13 # compute function result
+ beq $9,$0,L(L0) # if multiple of 4 limbs, skip first loop
dsubu $6,$6,$9
@@ -59,11 +56,10 @@ L(Loop0): ld $3,-16($5)
dsrl $12,$3,$13
move $10,$3
or $8,$11,$12
- bne $9,$0,L(Loop0)
sd $8,0($4)
+ bne $9,$0,L(Loop0)
L(L0): beq $6,$0,L(Lend)
- nop
L(Loop): ld $3,-16($5)
daddiu $4,$4,-32
@@ -91,10 +87,10 @@ L(Loop): ld $3,-16($5)
daddiu $5,$5,-32
or $8,$14,$9
- bgtz $6,L(Loop)
sd $8,0($4)
+ bgtz $6,L(Loop)
L(Lend): dsll $8,$10,$7
- j $31
sd $8,-8($4)
+ jr $31
END (__mpn_lshift)
diff --git a/sysdeps/mips/mips64/mul_1.S b/sysdeps/mips/mips64/mul_1.S
index 321789b345..7604bac3a2 100644
--- a/sysdeps/mips/mips64/mul_1.S
+++ b/sysdeps/mips/mips64/mul_1.S
@@ -37,9 +37,6 @@ ENTRY (__mpn_mul_1)
#ifdef __PIC__
SETUP_GP /* ??? unused */
#endif
- .set noreorder
- .set nomacro
-
# warm up phase 0
ld $8,0($5)
@@ -53,12 +50,12 @@ ENTRY (__mpn_mul_1)
#endif
daddiu $6,$6,-1
- beq $6,$0,L(LC0)
move $2,$0 # zero cy2
+ beq $6,$0,L(LC0)
daddiu $6,$6,-1
- beq $6,$0,L(LC1)
ld $8,0($5) # load new s1 limb as early as possible
+ beq $6,$0,L(LC1)
#if __mips_isa_rev < 6
L(Loop): mflo $10
@@ -80,8 +77,8 @@ L(Loop): move $10,$11
sltu $2,$10,$2 # carry from previous addition -> $2
sd $10,0($4)
daddiu $4,$4,8
- bne $6,$0,L(Loop)
daddu $2,$9,$2 # add high product limb and carry from addition
+ bne $6,$0,L(Loop)
# cool down phase 1
#if __mips_isa_rev < 6
@@ -114,7 +111,7 @@ L(LC0): move $10,$11
daddu $10,$10,$2
sltu $2,$10,$2
sd $10,0($4)
- j $31
daddu $2,$9,$2 # add high product limb and carry from addition
+ jr $31
END (__mpn_mul_1)
diff --git a/sysdeps/mips/mips64/n32/crtn.S b/sysdeps/mips/mips64/n32/crtn.S
index 633d79cfad..8d4c83381c 100644
--- a/sysdeps/mips/mips64/n32/crtn.S
+++ b/sysdeps/mips/mips64/n32/crtn.S
@@ -41,19 +41,11 @@
.section .init,"ax",@progbits
ld $31,8($sp)
ld $28,0($sp)
- .set noreorder
- .set nomacro
- j $31
addiu $sp,$sp,16
- .set macro
- .set reorder
+ jr $31
.section .fini,"ax",@progbits
ld $31,8($sp)
ld $28,0($sp)
- .set noreorder
- .set nomacro
- j $31
addiu $sp,$sp,16
- .set macro
- .set reorder
+ jr $31
diff --git a/sysdeps/mips/mips64/n64/crtn.S b/sysdeps/mips/mips64/n64/crtn.S
index 99ed1e3263..110040c9fc 100644
--- a/sysdeps/mips/mips64/n64/crtn.S
+++ b/sysdeps/mips/mips64/n64/crtn.S
@@ -41,19 +41,11 @@
.section .init,"ax",@progbits
ld $31,8($sp)
ld $28,0($sp)
- .set noreorder
- .set nomacro
- j $31
daddiu $sp,$sp,16
- .set macro
- .set reorder
+ jr $31
.section .fini,"ax",@progbits
ld $31,8($sp)
ld $28,0($sp)
- .set noreorder
- .set nomacro
- j $31
daddiu $sp,$sp,16
- .set macro
- .set reorder
+ jr $31
diff --git a/sysdeps/mips/mips64/rshift.S b/sysdeps/mips/mips64/rshift.S
index 1f6e3a2a12..153aacfd86 100644
--- a/sysdeps/mips/mips64/rshift.S
+++ b/sysdeps/mips/mips64/rshift.S
@@ -36,15 +36,12 @@ ENTRY (__mpn_rshift)
#ifdef __PIC__
SETUP_GP /* ??? unused */
#endif
- .set noreorder
- .set nomacro
-
ld $10,0($5) # load first limb
dsubu $13,$0,$7
daddiu $6,$6,-1
and $9,$6,4-1 # number of limbs in first loop
- beq $9,$0,L(L0) # if multiple of 4 limbs, skip first loop
dsll $2,$10,$13 # compute function result
+ beq $9,$0,L(L0) # if multiple of 4 limbs, skip first loop
dsubu $6,$6,$9
@@ -56,11 +53,10 @@ L(Loop0): ld $3,8($5)
dsll $12,$3,$13
move $10,$3
or $8,$11,$12
- bne $9,$0,L(Loop0)
sd $8,-8($4)
+ bne $9,$0,L(Loop0)
L(L0): beq $6,$0,L(Lend)
- nop
L(Loop): ld $3,8($5)
daddiu $4,$4,32
@@ -88,10 +84,10 @@ L(Loop): ld $3,8($5)
daddiu $5,$5,32
or $8,$14,$9
- bgtz $6,L(Loop)
sd $8,-8($4)
+ bgtz $6,L(Loop)
L(Lend): dsrl $8,$10,$7
- j $31
sd $8,0($4)
+ jr $31
END (__mpn_rshift)
diff --git a/sysdeps/mips/mips64/sub_n.S b/sysdeps/mips/mips64/sub_n.S
index b83d5ccab6..5b7337472f 100644
--- a/sysdeps/mips/mips64/sub_n.S
+++ b/sysdeps/mips/mips64/sub_n.S
@@ -37,16 +37,13 @@ ENTRY (__mpn_sub_n)
#ifdef __PIC__
SETUP_GP /* ??? unused */
#endif
- .set noreorder
- .set nomacro
-
ld $10,0($5)
ld $11,0($6)
daddiu $7,$7,-1
and $9,$7,4-1 # number of limbs in first loop
- beq $9,$0,L(L0) # if multiple of 4 limbs, skip first loop
move $2,$0
+ beq $9,$0,L(L0) # if multiple of 4 limbs, skip first loop
dsubu $7,$7,$9
@@ -64,11 +61,10 @@ L(Loop0): daddiu $9,$9,-1
daddiu $6,$6,8
move $10,$12
move $11,$13
- bne $9,$0,L(Loop0)
daddiu $4,$4,8
+ bne $9,$0,L(Loop0)
L(L0): beq $7,$0,L(Lend)
- nop
L(Loop): daddiu $7,$7,-4
@@ -111,15 +107,15 @@ L(Loop): daddiu $7,$7,-4
daddiu $5,$5,32
daddiu $6,$6,32
- bne $7,$0,L(Loop)
daddiu $4,$4,32
+ bne $7,$0,L(Loop)
L(Lend): daddu $11,$11,$2
sltu $8,$11,$2
dsubu $11,$10,$11
sltu $2,$10,$11
sd $11,0($4)
- j $31
or $2,$2,$8
+ jr $31
END (__mpn_sub_n)
diff --git a/sysdeps/mips/mips64/submul_1.S b/sysdeps/mips/mips64/submul_1.S
index 46f26e8dde..121433d232 100644
--- a/sysdeps/mips/mips64/submul_1.S
+++ b/sysdeps/mips/mips64/submul_1.S
@@ -37,9 +37,6 @@ ENTRY (__mpn_submul_1)
#ifdef __PIC__
SETUP_GP /* ??? unused */
#endif
- .set noreorder
- .set nomacro
-
# warm up phase 0
ld $8,0($5)
@@ -53,12 +50,12 @@ ENTRY (__mpn_submul_1)
#endif
daddiu $6,$6,-1
- beq $6,$0,L(LC0)
move $2,$0 # zero cy2
+ beq $6,$0,L(LC0)
daddiu $6,$6,-1
- beq $6,$0,L(LC1)
ld $8,0($5) # load new s1 limb as early as possible
+ beq $6,$0,L(LC1)
L(Loop): ld $10,0($4)
#if __mips_isa_rev < 6
@@ -84,8 +81,8 @@ L(Loop): ld $10,0($4)
daddu $2,$2,$10
sd $3,0($4)
daddiu $4,$4,8
- bne $6,$0,L(Loop)
daddu $2,$9,$2 # add high product limb and carry from addition
+ bne $6,$0,L(Loop)
# cool down phase 1
L(LC1): ld $10,0($4)
@@ -126,7 +123,7 @@ L(LC0): ld $10,0($4)
sgtu $10,$3,$10
daddu $2,$2,$10
sd $3,0($4)
- j $31
daddu $2,$9,$2 # add high product limb and carry from addition
+ jr $31
END (__mpn_submul_1)
diff --git a/sysdeps/mips/mul_1.S b/sysdeps/mips/mul_1.S
index cfd4cc7cd5..ae65ebe79d 100644
--- a/sysdeps/mips/mul_1.S
+++ b/sysdeps/mips/mul_1.S
@@ -31,12 +31,9 @@ along with the GNU MP Library. If not, see
.option pic2
#endif
ENTRY (__mpn_mul_1)
- .set noreorder
#ifdef __PIC__
.cpload t9
#endif
- .set nomacro
-
/* warm up phase 0 */
lw $8,0($5)
@@ -50,12 +47,12 @@ ENTRY (__mpn_mul_1)
#endif
addiu $6,$6,-1
- beq $6,$0,L(LC0)
move $2,$0 /* zero cy2 */
+ beq $6,$0,L(LC0)
addiu $6,$6,-1
- beq $6,$0,L(LC1)
lw $8,0($5) /* load new s1 limb as early as possible */
+ beq $6,$0,L(LC1)
#if __mips_isa_rev < 6
@@ -78,8 +75,8 @@ L(Loop): move $10,$11
sltu $2,$10,$2 /* carry from previous addition -> $2 */
sw $10,0($4)
addiu $4,$4,4
- bne $6,$0,L(Loop) /* should be "bnel" */
addu $2,$9,$2 /* add high product limb and carry from addition */
+ bne $6,$0,L(Loop) /* should be "bnel" */
/* cool down phase 1 */
#if __mips_isa_rev < 6
@@ -112,6 +109,6 @@ L(LC0): move $10,$11
addu $10,$10,$2
sltu $2,$10,$2
sw $10,0($4)
- j $31
addu $2,$9,$2 /* add high product limb and carry from addition */
+ jr $31
END (__mpn_mul_1)
diff --git a/sysdeps/mips/rshift.S b/sysdeps/mips/rshift.S
index e19fa41234..b453ca2ba7 100644
--- a/sysdeps/mips/rshift.S
+++ b/sysdeps/mips/rshift.S
@@ -30,18 +30,15 @@ along with the GNU MP Library. If not, see
.option pic2
#endif
ENTRY (__mpn_rshift)
- .set noreorder
#ifdef __PIC__
.cpload t9
#endif
- .set nomacro
-
lw $10,0($5) /* load first limb */
subu $13,$0,$7
addiu $6,$6,-1
and $9,$6,4-1 /* number of limbs in first loop */
+ sll $2,$10,$13 /* compute function result */
beq $9,$0,L(L0) /* if multiple of 4 limbs, skip first loop*/
- sll $2,$10,$13 /* compute function result */
subu $6,$6,$9
@@ -53,11 +50,10 @@ L(Loop0): lw $3,4($5)
sll $12,$3,$13
move $10,$3
or $8,$11,$12
+ sw $8,-4($4)
bne $9,$0,L(Loop0)
- sw $8,-4($4)
L(L0): beq $6,$0,L(Lend)
- nop
L(Loop): lw $3,4($5)
addiu $4,$4,16
@@ -85,10 +81,10 @@ L(Loop): lw $3,4($5)
addiu $5,$5,16
or $8,$14,$9
+ sw $8,-4($4)
bgtz $6,L(Loop)
- sw $8,-4($4)
L(Lend): srl $8,$10,$7
- j $31
sw $8,0($4)
+ jr $31
END (__mpn_rshift)
diff --git a/sysdeps/mips/sub_n.S b/sysdeps/mips/sub_n.S
index 3e988ecbb4..9f7cb5458d 100644
--- a/sysdeps/mips/sub_n.S
+++ b/sysdeps/mips/sub_n.S
@@ -31,19 +31,16 @@ along with the GNU MP Library. If not, see
.option pic2
#endif
ENTRY (__mpn_sub_n)
- .set noreorder
#ifdef __PIC__
.cpload t9
#endif
- .set nomacro
-
lw $10,0($5)
lw $11,0($6)
addiu $7,$7,-1
and $9,$7,4-1 /* number of limbs in first loop */
- beq $9,$0,L(L0) /* if multiple of 4 limbs, skip first loop */
move $2,$0
+ beq $9,$0,L(L0) /* if multiple of 4 limbs, skip first loop */
subu $7,$7,$9
@@ -61,11 +58,10 @@ L(Loop0): addiu $9,$9,-1
addiu $6,$6,4
move $10,$12
move $11,$13
- bne $9,$0,L(Loop0)
addiu $4,$4,4
+ bne $9,$0,L(Loop0)
L(L0): beq $7,$0,L(Lend)
- nop
L(Loop): addiu $7,$7,-4
@@ -108,14 +104,14 @@ L(Loop): addiu $7,$7,-4
addiu $5,$5,16
addiu $6,$6,16
- bne $7,$0,L(Loop)
addiu $4,$4,16
+ bne $7,$0,L(Loop)
L(Lend): addu $11,$11,$2
sltu $8,$11,$2
subu $11,$10,$11
sltu $2,$10,$11
sw $11,0($4)
- j $31
or $2,$2,$8
+ jr $31
END (__mpn_sub_n)
diff --git a/sysdeps/mips/submul_1.S b/sysdeps/mips/submul_1.S
index be8e2844ef..8405801c57 100644
--- a/sysdeps/mips/submul_1.S
+++ b/sysdeps/mips/submul_1.S
@@ -31,12 +31,9 @@ along with the GNU MP Library. If not, see
.option pic2
#endif
ENTRY (__mpn_submul_1)
- .set noreorder
#ifdef __PIC__
.cpload t9
#endif
- .set nomacro
-
/* warm up phase 0 */
lw $8,0($5)
@@ -50,12 +47,12 @@ ENTRY (__mpn_submul_1)
#endif
addiu $6,$6,-1
- beq $6,$0,L(LC0)
move $2,$0 /* zero cy2 */
+ beq $6,$0,L(LC0)
addiu $6,$6,-1
- beq $6,$0,L(LC1)
lw $8,0($5) /* load new s1 limb as early as possible */
+ beq $6,$0,L(LC1)
L(Loop): lw $10,0($4)
#if __mips_isa_rev < 6
@@ -81,8 +78,8 @@ L(Loop): lw $10,0($4)
addu $2,$2,$10
sw $3,0($4)
addiu $4,$4,4
- bne $6,$0,L(Loop) /* should be "bnel" */
addu $2,$9,$2 /* add high product limb and carry from addition */
+ bne $6,$0,L(Loop) /* should be "bnel" */
/* cool down phase 1 */
L(LC1): lw $10,0($4)
@@ -123,6 +120,6 @@ L(LC0): lw $10,0($4)
sgtu $10,$3,$10
addu $2,$2,$10
sw $3,0($4)
- j $31
addu $2,$9,$2 /* add high product limb and carry from addition */
+ jr $31
END (__mpn_submul_1)
diff --git a/sysdeps/mips/sys/asm.h b/sysdeps/mips/sys/asm.h
index e43eb39ca3..62f9e549c6 100644
--- a/sysdeps/mips/sys/asm.h
+++ b/sysdeps/mips/sys/asm.h
@@ -71,23 +71,21 @@
.set reorder
/* Set gp when not at 1st instruction */
# define SETUP_GPX(r) \
- .set noreorder; \
move r, $31; /* Save old ra. */ \
bal 10f; /* Find addr of cpload. */ \
- nop; \
10: \
+ .set noreorder; \
.cpload $31; \
- move $31, r; \
- .set reorder
+ .set reorder; \
+ move $31, r;
# define SETUP_GPX_L(r, l) \
- .set noreorder; \
move r, $31; /* Save old ra. */ \
bal l; /* Find addr of cpload. */ \
- nop; \
l: \
+ .set noreorder; \
.cpload $31; \
- move $31, r; \
- .set reorder
+ .set reorder; \
+ move $31, r;
# define SAVE_GP(x) \
.cprestore x /* Save gp trigger t9/jalr conversion. */
# define SETUP_GP64(a, b)
@@ -108,20 +106,14 @@ l: \
.cpsetup $25, gpoffset, proc
# define SETUP_GPX64(cp_reg, ra_save) \
move ra_save, $31; /* Save old ra. */ \
- .set noreorder; \
bal 10f; /* Find addr of .cpsetup. */ \
- nop; \
10: \
- .set reorder; \
.cpsetup $31, cp_reg, 10b; \
move $31, ra_save
# define SETUP_GPX64_L(cp_reg, ra_save, l) \
move ra_save, $31; /* Save old ra. */ \
- .set noreorder; \
bal l; /* Find addr of .cpsetup. */ \
- nop; \
l: \
- .set reorder; \
.cpsetup $31, cp_reg, l; \
move $31, ra_save
# define RESTORE_GP64 \
diff --git a/sysdeps/unix/mips/mips32/sysdep.h b/sysdeps/unix/mips/mips32/sysdep.h
index c515b94540..df3f73a4eb 100644
--- a/sysdeps/unix/mips/mips32/sysdep.h
+++ b/sysdeps/unix/mips/mips32/sysdep.h
@@ -38,18 +38,14 @@
L(syse1):
#else
#define PSEUDO(name, syscall_name, args) \
- .set noreorder; \
.set nomips16; \
.align 2; \
cfi_startproc; \
99: j __syscall_error; \
- nop; \
cfi_endproc; \
ENTRY(name) \
- .set noreorder; \
li v0, SYS_ify(syscall_name); \
syscall; \
- .set reorder; \
bne a3, zero, 99b; \
L(syse1):
#endif
diff --git a/sysdeps/unix/mips/mips64/sysdep.h b/sysdeps/unix/mips/mips64/sysdep.h
index 6565b84e3a..c0772002e6 100644
--- a/sysdeps/unix/mips/mips64/sysdep.h
+++ b/sysdeps/unix/mips/mips64/sysdep.h
@@ -45,18 +45,14 @@
L(syse1):
#else
#define PSEUDO(name, syscall_name, args) \
- .set noreorder; \
.align 2; \
.set nomips16; \
cfi_startproc; \
99: j __syscall_error; \
- nop; \
cfi_endproc; \
ENTRY(name) \
- .set noreorder; \
li v0, SYS_ify(syscall_name); \
syscall; \
- .set reorder; \
bne a3, zero, 99b; \
L(syse1):
#endif
diff --git a/sysdeps/unix/mips/sysdep.h b/sysdeps/unix/mips/sysdep.h
index d1e0460260..07cd5c4a06 100644
--- a/sysdeps/unix/mips/sysdep.h
+++ b/sysdeps/unix/mips/sysdep.h
@@ -48,7 +48,6 @@
.align 2; \
ENTRY(name) \
.set nomips16; \
- .set noreorder; \
li v0, SYS_ify(syscall_name); \
syscall
@@ -61,7 +60,6 @@
.align 2; \
ENTRY(name) \
.set nomips16; \
- .set noreorder; \
li v0, SYS_ify(syscall_name); \
syscall
diff --git a/sysdeps/unix/sysv/linux/mips/mips32/sysdep.h b/sysdeps/unix/sysv/linux/mips/mips32/sysdep.h
index 47a1b97351..647a66ee1f 100644
--- a/sysdeps/unix/sysv/linux/mips/mips32/sysdep.h
+++ b/sysdeps/unix/sysv/linux/mips/mips32/sysdep.h
@@ -140,10 +140,8 @@ union __mips_syscall_return
register long int __v0 asm ("$2"); \
register long int __a3 asm ("$7"); \
__asm__ volatile ( \
- ".set\tnoreorder\n\t" \
v0_init \
"syscall\n\t" \
- ".set reorder" \
: "=r" (__v0), "=r" (__a3) \
: input \
: __SYSCALL_CLOBBERS); \
@@ -164,10 +162,8 @@ union __mips_syscall_return
register long int __a0 asm ("$4") = _arg1; \
register long int __a3 asm ("$7"); \
__asm__ volatile ( \
- ".set\tnoreorder\n\t" \
v0_init \
"syscall\n\t" \
- ".set reorder" \
: "=r" (__v0), "=r" (__a3) \
: input, "r" (__a0) \
: __SYSCALL_CLOBBERS); \
@@ -190,10 +186,8 @@ union __mips_syscall_return
register long int __a1 asm ("$5") = _arg2; \
register long int __a3 asm ("$7"); \
__asm__ volatile ( \
- ".set\tnoreorder\n\t" \
v0_init \
"syscall\n\t" \
- ".set\treorder" \
: "=r" (__v0), "=r" (__a3) \
: input, "r" (__a0), "r" (__a1) \
: __SYSCALL_CLOBBERS); \
@@ -219,10 +213,8 @@ union __mips_syscall_return
register long int __a2 asm ("$6") = _arg3; \
register long int __a3 asm ("$7"); \
__asm__ volatile ( \
- ".set\tnoreorder\n\t" \
v0_init \
"syscall\n\t" \
- ".set\treorder" \
: "=r" (__v0), "=r" (__a3) \
: input, "r" (__a0), "r" (__a1), "r" (__a2) \
: __SYSCALL_CLOBBERS); \
@@ -249,10 +241,8 @@ union __mips_syscall_return
register long int __a2 asm ("$6") = _arg3; \
register long int __a3 asm ("$7") = _arg4; \
__asm__ volatile ( \
- ".set\tnoreorder\n\t" \
v0_init \
"syscall\n\t" \
- ".set\treorder" \
: "=r" (__v0), "+r" (__a3) \
: input, "r" (__a0), "r" (__a1), "r" (__a2) \
: __SYSCALL_CLOBBERS); \
diff --git a/sysdeps/unix/sysv/linux/mips/mips64/sysdep.h b/sysdeps/unix/sysv/linux/mips/mips64/sysdep.h
index 0438bed23d..8f4787352a 100644
--- a/sysdeps/unix/sysv/linux/mips/mips64/sysdep.h
+++ b/sysdeps/unix/sysv/linux/mips/mips64/sysdep.h
@@ -95,10 +95,8 @@
register __syscall_arg_t __v0 asm ("$2"); \
register __syscall_arg_t __a3 asm ("$7"); \
__asm__ volatile ( \
- ".set\tnoreorder\n\t" \
v0_init \
"syscall\n\t" \
- ".set reorder" \
: "=r" (__v0), "=r" (__a3) \
: input \
: __SYSCALL_CLOBBERS); \
@@ -119,10 +117,8 @@
register __syscall_arg_t __a0 asm ("$4") = _arg1; \
register __syscall_arg_t __a3 asm ("$7"); \
__asm__ volatile ( \
- ".set\tnoreorder\n\t" \
v0_init \
"syscall\n\t" \
- ".set reorder" \
: "=r" (__v0), "=r" (__a3) \
: input, "r" (__a0) \
: __SYSCALL_CLOBBERS); \
@@ -145,10 +141,8 @@
register __syscall_arg_t __a1 asm ("$5") = _arg2; \
register __syscall_arg_t __a3 asm ("$7"); \
__asm__ volatile ( \
- ".set\tnoreorder\n\t" \
v0_init \
"syscall\n\t" \
- ".set\treorder" \
: "=r" (__v0), "=r" (__a3) \
: input, "r" (__a0), "r" (__a1) \
: __SYSCALL_CLOBBERS); \
@@ -173,10 +167,8 @@
register __syscall_arg_t __a2 asm ("$6") = _arg3; \
register __syscall_arg_t __a3 asm ("$7"); \
__asm__ volatile ( \
- ".set\tnoreorder\n\t" \
v0_init \
"syscall\n\t" \
- ".set\treorder" \
: "=r" (__v0), "=r" (__a3) \
: input, "r" (__a0), "r" (__a1), "r" (__a2) \
: __SYSCALL_CLOBBERS); \
@@ -203,10 +195,8 @@
register __syscall_arg_t __a2 asm ("$6") = _arg3; \
register __syscall_arg_t __a3 asm ("$7") = _arg4; \
__asm__ volatile ( \
- ".set\tnoreorder\n\t" \
v0_init \
"syscall\n\t" \
- ".set\treorder" \
: "=r" (__v0), "+r" (__a3) \
: input, "r" (__a0), "r" (__a1), "r" (__a2) \
: __SYSCALL_CLOBBERS); \
@@ -235,10 +225,8 @@
register __syscall_arg_t __a3 asm ("$7") = _arg4; \
register __syscall_arg_t __a4 asm ("$8") = _arg5; \
__asm__ volatile ( \
- ".set\tnoreorder\n\t" \
v0_init \
"syscall\n\t" \
- ".set\treorder" \
: "=r" (__v0), "+r" (__a3) \
: input, "r" (__a0), "r" (__a1), "r" (__a2), "r" (__a4) \
: __SYSCALL_CLOBBERS); \
@@ -269,10 +257,8 @@
register __syscall_arg_t __a4 asm ("$8") = _arg5; \
register __syscall_arg_t __a5 asm ("$9") = _arg6; \
__asm__ volatile ( \
- ".set\tnoreorder\n\t" \
v0_init \
"syscall\n\t" \
- ".set\treorder" \
: "=r" (__v0), "+r" (__a3) \
: input, "r" (__a0), "r" (__a1), "r" (__a2), "r" (__a4), \
"r" (__a5) \
--
2.34.1
* [PATCH 02/11] Fix rtld link_map initialization issues
2025-01-23 13:42 [PATCH 0/11] Improve Mips target Aleksandar Rakic
2025-01-23 13:42 ` [PATCH 00/11] " Aleksandar Rakic
2025-01-23 13:42 ` [PATCH 01/11] Updates for microMIPS Release 6 Aleksandar Rakic
@ 2025-01-23 13:42 ` Aleksandar Rakic
2025-01-23 13:42 ` [PATCH 03/11] Fix issues with removing no-reorder directives Aleksandar Rakic
` (8 subsequent siblings)
11 siblings, 0 replies; 17+ messages in thread
From: Aleksandar Rakic @ 2025-01-23 13:42 UTC (permalink / raw)
To: libc-alpha
Cc: aleksandar.rakic, djordje.todorovic, cfu, Matthew Fortune,
Faraz Shahbazker
Import patch fixing rtld link_map initialization issues from:
https://sourceware.org/ml/libc-alpha/2015-03/msg00704.html
Author: Sandra Loosemore
Cherry-picked 1507c7be47ef07d4b264168ab031d8c2ed4678f2
from https://github.com/MIPS/glibc
Signed-off-by: Matthew Fortune <matthew.fortune@imgtec.com>
Signed-off-by: Faraz Shahbazker <fshahbazker@wavecomp.com>
Signed-off-by: Aleksandar Rakic <aleksandar.rakic@htecgroup.com>
---
elf/rtld.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/elf/rtld.c b/elf/rtld.c
index 1e2e9ad5a8..252f4d6666 100644
--- a/elf/rtld.c
+++ b/elf/rtld.c
@@ -522,7 +522,7 @@ _dl_start (void *arg)
rtld_timer_start (&info.start_time);
#endif
- /* Partly clean the `bootstrap_map' structure up. Don't use
+ /* Zero-initialize the `bootstrap_map' structure. Don't use
`memset' since it might not be built in or inlined and we cannot
make function calls at this point. Use '__builtin_memset' if we
know it is available. We do not have to clear the memory if we
@@ -530,12 +530,14 @@ _dl_start (void *arg)
are initialized to zero by default. */
#ifndef DONT_USE_BOOTSTRAP_MAP
# ifdef HAVE_BUILTIN_MEMSET
- __builtin_memset (bootstrap_map.l_info, '\0', sizeof (bootstrap_map.l_info));
+ __builtin_memset (&bootstrap_map, '\0', sizeof (struct link_map));
# else
- for (size_t cnt = 0;
- cnt < sizeof (bootstrap_map.l_info) / sizeof (bootstrap_map.l_info[0]);
- ++cnt)
- bootstrap_map.l_info[cnt] = 0;
+ {
+ char *p = (char *) &bootstrap_map;
+ char *pend = p + sizeof (struct link_map);
+ while (p < pend)
+ *(p++) = '\0';
+ }
# endif
#endif
--
2.34.1
* [PATCH 03/11] Fix issues with removing no-reorder directives
2025-01-23 13:42 [PATCH 0/11] Improve Mips target Aleksandar Rakic
` (2 preceding siblings ...)
2025-01-23 13:42 ` [PATCH 02/11] Fix rtld link_map initialization issues Aleksandar Rakic
@ 2025-01-23 13:42 ` Aleksandar Rakic
2025-01-23 13:43 ` [PATCH 04/11] Add C implementation of memcpy/memset Aleksandar Rakic
` (7 subsequent siblings)
11 siblings, 0 replies; 17+ messages in thread
From: Aleksandar Rakic @ 2025-01-23 13:42 UTC (permalink / raw)
To: libc-alpha
Cc: aleksandar.rakic, djordje.todorovic, cfu, Andrew Bennett,
Faraz Shahbazker
1. Add -O2 to the Makefile to ensure that assembly sources have
their delay slots filled.
2. Move the noreorder directive into the PIC section of the
setjmp code.
Cherry-picked 4e451260675b2e54535eafc2df35d92653acd084
from https://github.com/MIPS/glibc
Signed-off-by: Andrew Bennett <andrew.bennett@imgtec.com>
Signed-off-by: Faraz Shahbazker <fshahbazker@wavecomp.com>
Signed-off-by: Aleksandar Rakic <aleksandar.rakic@htecgroup.com>
---
sysdeps/mips/Makefile | 2 ++
sysdeps/mips/bsd-setjmp.S | 2 +-
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/sysdeps/mips/Makefile b/sysdeps/mips/Makefile
index d189973aa0..17ddc2a97c 100644
--- a/sysdeps/mips/Makefile
+++ b/sysdeps/mips/Makefile
@@ -18,9 +18,11 @@ CPPFLAGS-crtn.S += $(pic-ccflag)
endif
ASFLAGS-.os += $(pic-ccflag)
+
# libc.a and libc_p.a must be compiled with -fPIE/-fpie for static PIE.
ASFLAGS-.o += $(pie-default)
ASFLAGS-.op += $(pie-default)
+ASFLAGS += -O2
ifeq ($(subdir),elf)
diff --git a/sysdeps/mips/bsd-setjmp.S b/sysdeps/mips/bsd-setjmp.S
index 7e4d7dcb0b..8c06b9957c 100644
--- a/sysdeps/mips/bsd-setjmp.S
+++ b/sysdeps/mips/bsd-setjmp.S
@@ -28,8 +28,8 @@
.option pic2
#endif
ENTRY (setjmp)
- .set noreorder
#ifdef __PIC__
+ .set noreorder
.cpload t9
.set reorder
la t9, C_SYMBOL_NAME (__sigsetjmp)
--
2.34.1
* [PATCH 04/11] Add C implementation of memcpy/memset
2025-01-23 13:42 [PATCH 0/11] Improve Mips target Aleksandar Rakic
` (3 preceding siblings ...)
2025-01-23 13:42 ` [PATCH 03/11] Fix issues with removing no-reorder directives Aleksandar Rakic
@ 2025-01-23 13:43 ` Aleksandar Rakic
2025-01-23 13:43 ` [PATCH 05/11] Add optimized assembly for strcmp Aleksandar Rakic
` (6 subsequent siblings)
11 siblings, 0 replies; 17+ messages in thread
From: Aleksandar Rakic @ 2025-01-23 13:43 UTC (permalink / raw)
To: libc-alpha; +Cc: aleksandar.rakic, djordje.todorovic, cfu, Faraz Shahbazker
Add an improved C implementation of memcpy/memset and remove the
corresponding .S files.
Cherry-picked 6b74133706246af94b71e4154e4ca09482828c9f
from https://github.com/MIPS/glibc
Signed-off-by: Faraz Shahbazker <fshahbazker@wavecomp.com>
Signed-off-by: Aleksandar Rakic <aleksandar.rakic@htecgroup.com>
---
sysdeps/mips/memcpy.S | 886 ------------------------------------------
sysdeps/mips/memcpy.c | 415 ++++++++++++++++++++
sysdeps/mips/memset.S | 430 --------------------
sysdeps/mips/memset.c | 187 +++++++++
4 files changed, 602 insertions(+), 1316 deletions(-)
delete mode 100644 sysdeps/mips/memcpy.S
create mode 100644 sysdeps/mips/memcpy.c
delete mode 100644 sysdeps/mips/memset.S
create mode 100644 sysdeps/mips/memset.c
diff --git a/sysdeps/mips/memcpy.S b/sysdeps/mips/memcpy.S
deleted file mode 100644
index 96d1c92d89..0000000000
--- a/sysdeps/mips/memcpy.S
+++ /dev/null
@@ -1,886 +0,0 @@
-/* Copyright (C) 2012-2024 Free Software Foundation, Inc.
- This file is part of the GNU C Library.
-
- The GNU C Library is free software; you can redistribute it and/or
- modify it under the terms of the GNU Lesser General Public
- License as published by the Free Software Foundation; either
- version 2.1 of the License, or (at your option) any later version.
-
- The GNU C Library is distributed in the hope that it will be useful,
- but WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
- Lesser General Public License for more details.
-
- You should have received a copy of the GNU Lesser General Public
- License along with the GNU C Library. If not, see
- <https://www.gnu.org/licenses/>. */
-
-#ifdef ANDROID_CHANGES
-# include "machine/asm.h"
-# include "machine/regdef.h"
-# define USE_MEMMOVE_FOR_OVERLAP
-# define PREFETCH_LOAD_HINT PREFETCH_HINT_LOAD_STREAMED
-# define PREFETCH_STORE_HINT PREFETCH_HINT_PREPAREFORSTORE
-#elif _LIBC
-# include <sysdep.h>
-# include <regdef.h>
-# include <sys/asm.h>
-# define PREFETCH_LOAD_HINT PREFETCH_HINT_LOAD_STREAMED
-# define PREFETCH_STORE_HINT PREFETCH_HINT_PREPAREFORSTORE
-#elif defined _COMPILING_NEWLIB
-# include "machine/asm.h"
-# include "machine/regdef.h"
-# define PREFETCH_LOAD_HINT PREFETCH_HINT_LOAD_STREAMED
-# define PREFETCH_STORE_HINT PREFETCH_HINT_PREPAREFORSTORE
-#else
-# include <regdef.h>
-# include <sys/asm.h>
-#endif
-
-#if (_MIPS_ISA == _MIPS_ISA_MIPS4) || (_MIPS_ISA == _MIPS_ISA_MIPS5) || \
- (_MIPS_ISA == _MIPS_ISA_MIPS32) || (_MIPS_ISA == _MIPS_ISA_MIPS64)
-# ifndef DISABLE_PREFETCH
-# define USE_PREFETCH
-# endif
-#endif
-
-#if defined(_MIPS_SIM) && ((_MIPS_SIM == _ABI64) || (_MIPS_SIM == _ABIN32))
-# ifndef DISABLE_DOUBLE
-# define USE_DOUBLE
-# endif
-#endif
-
-/* Some asm.h files do not have the L macro definition. */
-#ifndef L
-# if _MIPS_SIM == _ABIO32
-# define L(label) $L ## label
-# else
-# define L(label) .L ## label
-# endif
-#endif
-
-/* Some asm.h files do not have the PTR_ADDIU macro definition. */
-#ifndef PTR_ADDIU
-# ifdef USE_DOUBLE
-# define PTR_ADDIU daddiu
-# else
-# define PTR_ADDIU addiu
-# endif
-#endif
-
-/* Some asm.h files do not have the PTR_SRA macro definition. */
-#ifndef PTR_SRA
-# ifdef USE_DOUBLE
-# define PTR_SRA dsra
-# else
-# define PTR_SRA sra
-# endif
-#endif
-
-/* New R6 instructions that may not be in asm.h. */
-#ifndef PTR_LSA
-# if _MIPS_SIM == _ABI64
-# define PTR_LSA dlsa
-# else
-# define PTR_LSA lsa
-# endif
-#endif
-
-#if __mips_isa_rev > 5 && defined (__mips_micromips)
-# define PTR_BC bc16
-#else
-# define PTR_BC bc
-#endif
-
-/*
- * Using PREFETCH_HINT_LOAD_STREAMED instead of PREFETCH_LOAD on load
- * prefetches appear to offer a slight performance advantage.
- *
- * Using PREFETCH_HINT_PREPAREFORSTORE instead of PREFETCH_STORE
- * or PREFETCH_STORE_STREAMED offers a large performance advantage
- * but PREPAREFORSTORE has some special restrictions to consider.
- *
- * Prefetch with the 'prepare for store' hint does not copy a memory
- * location into the cache, it just allocates a cache line and zeros
- * it out. This means that if you do not write to the entire cache
- * line before writing it out to memory some data will get zero'ed out
- * when the cache line is written back to memory and data will be lost.
- *
- * Also if you are using this memcpy to copy overlapping buffers it may
- * not behave correctly when using the 'prepare for store' hint. If you
- * use the 'prepare for store' prefetch on a memory area that is in the
- * memcpy source (as well as the memcpy destination), then you will get
- * some data zero'ed out before you have a chance to read it and data will
- * be lost.
- *
- * If you are going to use this memcpy routine with the 'prepare for store'
- * prefetch you may want to set USE_MEMMOVE_FOR_OVERLAP in order to avoid
- * the problem of running memcpy on overlapping buffers.
- *
- * There are ifdef'ed sections of this memcpy to make sure that it does not
- * do prefetches on cache lines that are not going to be completely written.
- * This code is only needed and only used when PREFETCH_STORE_HINT is set to
- * PREFETCH_HINT_PREPAREFORSTORE. This code assumes that cache lines are
- * 32 bytes and if the cache line is larger it will not work correctly.
- */
-
-#ifdef USE_PREFETCH
-# define PREFETCH_HINT_LOAD 0
-# define PREFETCH_HINT_STORE 1
-# define PREFETCH_HINT_LOAD_STREAMED 4
-# define PREFETCH_HINT_STORE_STREAMED 5
-# define PREFETCH_HINT_LOAD_RETAINED 6
-# define PREFETCH_HINT_STORE_RETAINED 7
-# define PREFETCH_HINT_WRITEBACK_INVAL 25
-# define PREFETCH_HINT_PREPAREFORSTORE 30
-
-/*
- * If we have not picked out what hints to use at this point use the
- * standard load and store prefetch hints.
- */
-# ifndef PREFETCH_STORE_HINT
-# define PREFETCH_STORE_HINT PREFETCH_HINT_STORE
-# endif
-# ifndef PREFETCH_LOAD_HINT
-# define PREFETCH_LOAD_HINT PREFETCH_HINT_LOAD
-# endif
-
-/*
- * We double everything when USE_DOUBLE is true so we do 2 prefetches to
- * get 64 bytes in that case. The assumption is that each individual
- * prefetch brings in 32 bytes.
- */
-
-# ifdef USE_DOUBLE
-# define PREFETCH_CHUNK 64
-# define PREFETCH_FOR_LOAD(chunk, reg) \
- pref PREFETCH_LOAD_HINT, (chunk)*64(reg); \
- pref PREFETCH_LOAD_HINT, ((chunk)*64)+32(reg)
-# define PREFETCH_FOR_STORE(chunk, reg) \
- pref PREFETCH_STORE_HINT, (chunk)*64(reg); \
- pref PREFETCH_STORE_HINT, ((chunk)*64)+32(reg)
-# else
-# define PREFETCH_CHUNK 32
-# define PREFETCH_FOR_LOAD(chunk, reg) \
- pref PREFETCH_LOAD_HINT, (chunk)*32(reg)
-# define PREFETCH_FOR_STORE(chunk, reg) \
- pref PREFETCH_STORE_HINT, (chunk)*32(reg)
-# endif
-/* MAX_PREFETCH_SIZE is the maximum size of a prefetch, it must not be less
- * than PREFETCH_CHUNK, the assumed size of each prefetch. If the real size
- * of a prefetch is greater than MAX_PREFETCH_SIZE and the PREPAREFORSTORE
- * hint is used, the code will not work correctly. If PREPAREFORSTORE is not
- * used then MAX_PREFETCH_SIZE does not matter. */
-# define MAX_PREFETCH_SIZE 128
-/* PREFETCH_LIMIT is set based on the fact that we never use an offset greater
- * than 5 on a STORE prefetch and that a single prefetch can never be larger
- * than MAX_PREFETCH_SIZE. We add the extra 32 when USE_DOUBLE is set because
- * we actually do two prefetches in that case, one 32 bytes after the other. */
-# ifdef USE_DOUBLE
-# define PREFETCH_LIMIT (5 * PREFETCH_CHUNK) + 32 + MAX_PREFETCH_SIZE
-# else
-# define PREFETCH_LIMIT (5 * PREFETCH_CHUNK) + MAX_PREFETCH_SIZE
-# endif
-# if (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE) \
- && ((PREFETCH_CHUNK * 4) < MAX_PREFETCH_SIZE)
-/* We cannot handle this because the initial prefetches may fetch bytes that
- * are before the buffer being copied. We start copies with an offset
- * of 4 so avoid this situation when using PREPAREFORSTORE. */
-#error "PREFETCH_CHUNK is too large and/or MAX_PREFETCH_SIZE is too small."
-# endif
-#else /* USE_PREFETCH not defined */
-# define PREFETCH_FOR_LOAD(offset, reg)
-# define PREFETCH_FOR_STORE(offset, reg)
-#endif
-
-#if __mips_isa_rev > 5
-# if (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
-# undef PREFETCH_STORE_HINT
-# define PREFETCH_STORE_HINT PREFETCH_HINT_STORE_STREAMED
-# endif
-# define R6_CODE
-#endif
-
-/* Allow the routine to be named something else if desired. */
-#ifndef MEMCPY_NAME
-# define MEMCPY_NAME memcpy
-#endif
-
-/* We use these 32/64 bit registers as temporaries to do the copying. */
-#define REG0 t0
-#define REG1 t1
-#define REG2 t2
-#define REG3 t3
-#if defined(_MIPS_SIM) && ((_MIPS_SIM == _ABIO32) || (_MIPS_SIM == _ABIO64))
-# define REG4 t4
-# define REG5 t5
-# define REG6 t6
-# define REG7 t7
-#else
-# define REG4 ta0
-# define REG5 ta1
-# define REG6 ta2
-# define REG7 ta3
-#endif
-
-/* We load/store 64 bits at a time when USE_DOUBLE is true.
- * The C_ prefix stands for CHUNK and is used to avoid macro name
- * conflicts with system header files. */
-
-#ifdef USE_DOUBLE
-# define C_ST sd
-# define C_LD ld
-# ifdef __MIPSEB
-# define C_LDHI ldl /* high part is left in big-endian */
-# define C_STHI sdl /* high part is left in big-endian */
-# define C_LDLO ldr /* low part is right in big-endian */
-# define C_STLO sdr /* low part is right in big-endian */
-# else
-# define C_LDHI ldr /* high part is right in little-endian */
-# define C_STHI sdr /* high part is right in little-endian */
-# define C_LDLO ldl /* low part is left in little-endian */
-# define C_STLO sdl /* low part is left in little-endian */
-# endif
-# define C_ALIGN dalign /* r6 align instruction */
-#else
-# define C_ST sw
-# define C_LD lw
-# ifdef __MIPSEB
-# define C_LDHI lwl /* high part is left in big-endian */
-# define C_STHI swl /* high part is left in big-endian */
-# define C_LDLO lwr /* low part is right in big-endian */
-# define C_STLO swr /* low part is right in big-endian */
-# else
-# define C_LDHI lwr /* high part is right in little-endian */
-# define C_STHI swr /* high part is right in little-endian */
-# define C_LDLO lwl /* low part is left in little-endian */
-# define C_STLO swl /* low part is left in little-endian */
-# endif
-# define C_ALIGN align /* r6 align instruction */
-#endif
-
-/* Bookkeeping values for 32 vs. 64 bit mode. */
-#ifdef USE_DOUBLE
-# define NSIZE 8
-# define NSIZEMASK 0x3f
-# define NSIZEDMASK 0x7f
-#else
-# define NSIZE 4
-# define NSIZEMASK 0x1f
-# define NSIZEDMASK 0x3f
-#endif
-#define UNIT(unit) ((unit)*NSIZE)
-#define UNITM1(unit) (((unit)*NSIZE)-1)
-
-#ifdef ANDROID_CHANGES
-LEAF(MEMCPY_NAME, 0)
-#else
-LEAF(MEMCPY_NAME)
-#endif
- .set nomips16
-/*
- * Below we handle the case where memcpy is called with overlapping src and dst.
- * Although memcpy is not required to handle this case, some parts of Android
- * like Skia rely on such usage. We call memmove to handle such cases.
- */
-#ifdef USE_MEMMOVE_FOR_OVERLAP
- PTR_SUBU t0,a0,a1
- PTR_SRA t2,t0,31
- xor t1,t0,t2
- PTR_SUBU t0,t1,t2
- sltu t2,t0,a2
- la t9,memmove
- beq t2,zero,L(memcpy)
- jr t9
-L(memcpy):
-#endif
-/*
- * If the size is less than 2*NSIZE (8 or 16), go to L(lastb). Regardless of
- * size, copy dst pointer to v0 for the return value.
- */
- slti t2,a2,(2 * NSIZE)
-#if defined(RETURN_FIRST_PREFETCH) || defined(RETURN_LAST_PREFETCH)
- move v0,zero
-#else
- move v0,a0
-#endif
- bne t2,zero,L(lasts)
-
-#ifndef R6_CODE
-
-/*
- * If src and dst have different alignments, go to L(unaligned), if they
- * have the same alignment (but are not actually aligned) do a partial
- * load/store to make them aligned. If they are both already aligned
- * we can start copying at L(aligned).
- */
- xor t8,a1,a0
- andi t8,t8,(NSIZE-1) /* t8 is a0/a1 word-displacement */
- PTR_SUBU a3, zero, a0
- bne t8,zero,L(unaligned)
-
- andi a3,a3,(NSIZE-1) /* copy a3 bytes to align a0/a1 */
- PTR_SUBU a2,a2,a3 /* a2 is the remining bytes count */
- beq a3,zero,L(aligned) /* if a3=0, it is already aligned */
-
- C_LDHI t8,0(a1)
- PTR_ADDU a1,a1,a3
- C_STHI t8,0(a0)
- PTR_ADDU a0,a0,a3
-
-#else /* R6_CODE */
-
-/*
- * Align the destination and hope that the source gets aligned too. If it
- * doesn't we jump to L(r6_unaligned*) to do unaligned copies using the r6
- * align instruction.
- */
- andi t8,a0,7
-#ifdef __mips_micromips
- auipc t9,%pcrel_hi(L(atable))
- addiu t9,t9,%pcrel_lo(L(atable)+4)
- PTR_LSA t9,t8,t9,1
-#else
- lapc t9,L(atable)
- PTR_LSA t9,t8,t9,2
-#endif
- jrc t9
-L(atable):
- PTR_BC L(lb0)
- PTR_BC L(lb7)
- PTR_BC L(lb6)
- PTR_BC L(lb5)
- PTR_BC L(lb4)
- PTR_BC L(lb3)
- PTR_BC L(lb2)
- PTR_BC L(lb1)
-L(lb7):
- lb a3, 6(a1)
- sb a3, 6(a0)
-L(lb6):
- lb a3, 5(a1)
- sb a3, 5(a0)
-L(lb5):
- lb a3, 4(a1)
- sb a3, 4(a0)
-L(lb4):
- lb a3, 3(a1)
- sb a3, 3(a0)
-L(lb3):
- lb a3, 2(a1)
- sb a3, 2(a0)
-L(lb2):
- lb a3, 1(a1)
- sb a3, 1(a0)
-L(lb1):
- lb a3, 0(a1)
- sb a3, 0(a0)
-
- li t9,8
- subu t8,t9,t8
- PTR_SUBU a2,a2,t8
- PTR_ADDU a0,a0,t8
- PTR_ADDU a1,a1,t8
-L(lb0):
-
- andi t8,a1,(NSIZE-1)
-#ifdef __mips_micromips
- auipc t9,%pcrel_hi(L(jtable))
- addiu t9,t9,%pcrel_lo(L(jtable)+4)
- PTR_LSA t9,t8,t9,1
-#else
- lapc t9,L(jtable)
- PTR_LSA t9,t8,t9,2
-#endif
- jrc t9
-L(jtable):
- PTR_BC L(aligned)
- PTR_BC L(r6_unaligned1)
- PTR_BC L(r6_unaligned2)
- PTR_BC L(r6_unaligned3)
-#ifdef USE_DOUBLE
- PTR_BC L(r6_unaligned4)
- PTR_BC L(r6_unaligned5)
- PTR_BC L(r6_unaligned6)
- PTR_BC L(r6_unaligned7)
-#endif
-#endif /* R6_CODE */
-
-L(aligned):
-
-/*
- * Now dst/src are both aligned to (word or double word) aligned addresses
- * Set a2 to count how many bytes we have to copy after all the 64/128 byte
- * chunks are copied and a3 to the dst pointer after all the 64/128 byte
- * chunks have been copied. We will loop, incrementing a0 and a1 until a0
- * equals a3.
- */
-
- andi t8,a2,NSIZEDMASK /* any whole 64-byte/128-byte chunks? */
- PTR_SUBU a3,a2,t8 /* subtract from a2 the reminder */
- beq a2,t8,L(chkw) /* if a2==t8, no 64-byte/128-byte chunks */
- PTR_ADDU a3,a0,a3 /* Now a3 is the final dst after loop */
-
-/* When in the loop we may prefetch with the 'prepare to store' hint,
- * in this case the a0+x should not be past the "t0-32" address. This
- * means: for x=128 the last "safe" a0 address is "t0-160". Alternatively,
- * for x=64 the last "safe" a0 address is "t0-96" In the current version we
- * will use "prefetch hint,128(a0)", so "t0-160" is the limit.
- */
-#if defined(USE_PREFETCH) && (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
- PTR_ADDU t0,a0,a2 /* t0 is the "past the end" address */
- PTR_SUBU t9,t0,PREFETCH_LIMIT /* t9 is the "last safe pref" address */
-#endif
- PREFETCH_FOR_LOAD (0, a1)
- PREFETCH_FOR_LOAD (1, a1)
- PREFETCH_FOR_LOAD (2, a1)
- PREFETCH_FOR_LOAD (3, a1)
-#if defined(USE_PREFETCH) && (PREFETCH_STORE_HINT != PREFETCH_HINT_PREPAREFORSTORE)
- PREFETCH_FOR_STORE (1, a0)
- PREFETCH_FOR_STORE (2, a0)
- PREFETCH_FOR_STORE (3, a0)
-#endif
-#if defined(RETURN_FIRST_PREFETCH) && defined(USE_PREFETCH)
-# if PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE
- sltu v1,t9,a0
- bgtz v1,L(skip_set)
- PTR_ADDIU v0,a0,(PREFETCH_CHUNK*4)
-L(skip_set):
-# else
- PTR_ADDIU v0,a0,(PREFETCH_CHUNK*1)
-# endif
-#endif
-#if defined(RETURN_LAST_PREFETCH) && defined(USE_PREFETCH) \
- && (PREFETCH_STORE_HINT != PREFETCH_HINT_PREPAREFORSTORE)
- PTR_ADDIU v0,a0,(PREFETCH_CHUNK*3)
-# ifdef USE_DOUBLE
- PTR_ADDIU v0,v0,32
-# endif
-#endif
-L(loop16w):
- C_LD t0,UNIT(0)(a1)
-/* We need to separate out the C_LD instruction here so that it will work
- both when it is used by itself and when it is used with the branch
- instruction. */
-#if defined(USE_PREFETCH) && (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
- sltu v1,t9,a0 /* If a0 > t9 don't use next prefetch */
- C_LD t1,UNIT(1)(a1)
- bgtz v1,L(skip_pref)
-#else
- C_LD t1,UNIT(1)(a1)
-#endif
-#ifdef R6_CODE
- PREFETCH_FOR_STORE (2, a0)
-#else
- PREFETCH_FOR_STORE (4, a0)
- PREFETCH_FOR_STORE (5, a0)
-#endif
-#if defined(RETURN_LAST_PREFETCH) && defined(USE_PREFETCH)
- PTR_ADDIU v0,a0,(PREFETCH_CHUNK*5)
-# ifdef USE_DOUBLE
- PTR_ADDIU v0,v0,32
-# endif
-#endif
-L(skip_pref):
- C_LD REG2,UNIT(2)(a1)
- C_LD REG3,UNIT(3)(a1)
- C_LD REG4,UNIT(4)(a1)
- C_LD REG5,UNIT(5)(a1)
- C_LD REG6,UNIT(6)(a1)
- C_LD REG7,UNIT(7)(a1)
-#ifdef R6_CODE
- PREFETCH_FOR_LOAD (3, a1)
-#else
- PREFETCH_FOR_LOAD (4, a1)
-#endif
- C_ST t0,UNIT(0)(a0)
- C_ST t1,UNIT(1)(a0)
- C_ST REG2,UNIT(2)(a0)
- C_ST REG3,UNIT(3)(a0)
- C_ST REG4,UNIT(4)(a0)
- C_ST REG5,UNIT(5)(a0)
- C_ST REG6,UNIT(6)(a0)
- C_ST REG7,UNIT(7)(a0)
-
- C_LD t0,UNIT(8)(a1)
- C_LD t1,UNIT(9)(a1)
- C_LD REG2,UNIT(10)(a1)
- C_LD REG3,UNIT(11)(a1)
- C_LD REG4,UNIT(12)(a1)
- C_LD REG5,UNIT(13)(a1)
- C_LD REG6,UNIT(14)(a1)
- C_LD REG7,UNIT(15)(a1)
-#ifndef R6_CODE
- PREFETCH_FOR_LOAD (5, a1)
-#endif
- C_ST t0,UNIT(8)(a0)
- C_ST t1,UNIT(9)(a0)
- C_ST REG2,UNIT(10)(a0)
- C_ST REG3,UNIT(11)(a0)
- C_ST REG4,UNIT(12)(a0)
- C_ST REG5,UNIT(13)(a0)
- C_ST REG6,UNIT(14)(a0)
- C_ST REG7,UNIT(15)(a0)
- PTR_ADDIU a0,a0,UNIT(16) /* adding 64/128 to dest */
- PTR_ADDIU a1,a1,UNIT(16) /* adding 64/128 to src */
- bne a0,a3,L(loop16w)
- move a2,t8
-
-/* Here we have src and dest word-aligned but less than 64-bytes or
- * 128 bytes to go. Check for a 32(64) byte chunk and copy if there
- * is one. Otherwise jump down to L(chk1w) to handle the tail end of
- * the copy.
- */
-
-L(chkw):
- PREFETCH_FOR_LOAD (0, a1)
- andi t8,a2,NSIZEMASK /* Is there a 32-byte/64-byte chunk. */
- /* The t8 is the reminder count past 32-bytes */
- beq a2,t8,L(chk1w) /* When a2=t8, no 32-byte chunk */
- C_LD t0,UNIT(0)(a1)
- C_LD t1,UNIT(1)(a1)
- C_LD REG2,UNIT(2)(a1)
- C_LD REG3,UNIT(3)(a1)
- C_LD REG4,UNIT(4)(a1)
- C_LD REG5,UNIT(5)(a1)
- C_LD REG6,UNIT(6)(a1)
- C_LD REG7,UNIT(7)(a1)
- PTR_ADDIU a1,a1,UNIT(8)
- C_ST t0,UNIT(0)(a0)
- C_ST t1,UNIT(1)(a0)
- C_ST REG2,UNIT(2)(a0)
- C_ST REG3,UNIT(3)(a0)
- C_ST REG4,UNIT(4)(a0)
- C_ST REG5,UNIT(5)(a0)
- C_ST REG6,UNIT(6)(a0)
- C_ST REG7,UNIT(7)(a0)
- PTR_ADDIU a0,a0,UNIT(8)
-
-/*
- * Here we have less than 32(64) bytes to copy. Set up for a loop to
- * copy one word (or double word) at a time. Set a2 to count how many
- * bytes we have to copy after all the word (or double word) chunks are
- * copied and a3 to the dst pointer after all the (d)word chunks have
- * been copied. We will loop, incrementing a0 and a1 until a0 equals a3.
- */
-L(chk1w):
- andi a2,t8,(NSIZE-1) /* a2 is the reminder past one (d)word chunks */
- PTR_SUBU a3,t8,a2 /* a3 is count of bytes in one (d)word chunks */
- beq a2,t8,L(lastw)
- PTR_ADDU a3,a0,a3 /* a3 is the dst address after loop */
-
-/* copying in words (4-byte or 8-byte chunks) */
-L(wordCopy_loop):
- C_LD REG3,UNIT(0)(a1)
- PTR_ADDIU a0,a0,UNIT(1)
- PTR_ADDIU a1,a1,UNIT(1)
- C_ST REG3,UNIT(-1)(a0)
- bne a0,a3,L(wordCopy_loop)
-
-/* If we have been copying double words, see if we can copy a single word
- before doing byte copies. We can have, at most, one word to copy. */
-
-L(lastw):
-#ifdef USE_DOUBLE
- andi t8,a2,3 /* a2 is the remainder past 4 byte chunks. */
- beq t8,a2,L(lastb)
- move a2,t8
- lw REG3,0(a1)
- sw REG3,0(a0)
- PTR_ADDIU a0,a0,4
- PTR_ADDIU a1,a1,4
-#endif
-
-/* Copy the last 8 (or 16) bytes */
-L(lastb):
- PTR_ADDU a3,a0,a2 /* a3 is the last dst address */
- blez a2,L(leave)
-L(lastbloop):
- lb v1,0(a1)
- PTR_ADDIU a0,a0,1
- PTR_ADDIU a1,a1,1
- sb v1,-1(a0)
- bne a0,a3,L(lastbloop)
-L(leave):
- jr ra
-
-/* We jump here with a memcpy of less than 8 or 16 bytes, depending on
- whether or not USE_DOUBLE is defined. Instead of just doing byte
- copies, check the alignment and size and use lw/sw if possible.
- Otherwise, do byte copies. */
-
-L(lasts):
- andi t8,a2,3
- beq t8,a2,L(lastb)
-
- andi t9,a0,3
- bne t9,zero,L(lastb)
- andi t9,a1,3
- bne t9,zero,L(lastb)
-
- PTR_SUBU a3,a2,t8
- PTR_ADDU a3,a0,a3
-
-L(wcopy_loop):
- lw REG3,0(a1)
- PTR_ADDIU a0,a0,4
- PTR_ADDIU a1,a1,4
- bne a0,a3,L(wcopy_loop)
- sw REG3,-4(a0)
-
- b L(lastb)
- move a2,t8
-
-#ifndef R6_CODE
-/*
- * UNALIGNED case, got here with a3 = "negu a0"
- * This code is nearly identical to the aligned code above
- * but only the destination (not the source) gets aligned
- * so we need to do partial loads of the source followed
- * by normal stores to the destination (once we have aligned
- * the destination).
- */
-
-L(unaligned):
- andi a3,a3,(NSIZE-1) /* copy a3 bytes to align a0/a1 */
- PTR_SUBU a2,a2,a3 /* a2 is the remining bytes count */
- beqz a3,L(ua_chk16w) /* if a3=0, it is already aligned */
-
- C_LDHI v1,UNIT(0)(a1)
- C_LDLO v1,UNITM1(1)(a1)
- PTR_ADDU a1,a1,a3
- C_STHI v1,UNIT(0)(a0)
- PTR_ADDU a0,a0,a3
-
-/*
- * Now the destination (but not the source) is aligned
- * Set a2 to count how many bytes we have to copy after all the 64/128 byte
- * chunks are copied and a3 to the dst pointer after all the 64/128 byte
- * chunks have been copied. We will loop, incrementing a0 and a1 until a0
- * equals a3.
- */
-
-L(ua_chk16w):
- andi t8,a2,NSIZEDMASK /* any whole 64-byte/128-byte chunks? */
- PTR_SUBU a3,a2,t8 /* subtract from a2 the reminder */
- beq a2,t8,L(ua_chkw) /* if a2==t8, no 64-byte/128-byte chunks */
- PTR_ADDU a3,a0,a3 /* Now a3 is the final dst after loop */
-
-# if defined(USE_PREFETCH) && (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
- PTR_ADDU t0,a0,a2 /* t0 is the "past the end" address */
- PTR_SUBU t9,t0,PREFETCH_LIMIT /* t9 is the "last safe pref" address */
-# endif
- PREFETCH_FOR_LOAD (0, a1)
- PREFETCH_FOR_LOAD (1, a1)
- PREFETCH_FOR_LOAD (2, a1)
-# if defined(USE_PREFETCH) && (PREFETCH_STORE_HINT != PREFETCH_HINT_PREPAREFORSTORE)
- PREFETCH_FOR_STORE (1, a0)
- PREFETCH_FOR_STORE (2, a0)
- PREFETCH_FOR_STORE (3, a0)
-# endif
-# if defined(RETURN_FIRST_PREFETCH) && defined(USE_PREFETCH)
-# if (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
- sltu v1,t9,a0
- bgtz v1,L(ua_skip_set)
- PTR_ADDIU v0,a0,(PREFETCH_CHUNK*4)
-L(ua_skip_set):
-# else
- PTR_ADDIU v0,a0,(PREFETCH_CHUNK*1)
-# endif
-# endif
-L(ua_loop16w):
- PREFETCH_FOR_LOAD (3, a1)
- C_LDHI t0,UNIT(0)(a1)
- C_LDHI t1,UNIT(1)(a1)
- C_LDHI REG2,UNIT(2)(a1)
-/* We need to separate out the C_LDHI instruction here so that it will work
- both when it is used by itself and when it is used with the branch
- instruction. */
-# if defined(USE_PREFETCH) && (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
- sltu v1,t9,a0
- C_LDHI REG3,UNIT(3)(a1)
- bgtz v1,L(ua_skip_pref)
-# else
- C_LDHI REG3,UNIT(3)(a1)
-# endif
- PREFETCH_FOR_STORE (4, a0)
- PREFETCH_FOR_STORE (5, a0)
-L(ua_skip_pref):
- C_LDHI REG4,UNIT(4)(a1)
- C_LDHI REG5,UNIT(5)(a1)
- C_LDHI REG6,UNIT(6)(a1)
- C_LDHI REG7,UNIT(7)(a1)
- C_LDLO t0,UNITM1(1)(a1)
- C_LDLO t1,UNITM1(2)(a1)
- C_LDLO REG2,UNITM1(3)(a1)
- C_LDLO REG3,UNITM1(4)(a1)
- C_LDLO REG4,UNITM1(5)(a1)
- C_LDLO REG5,UNITM1(6)(a1)
- C_LDLO REG6,UNITM1(7)(a1)
- C_LDLO REG7,UNITM1(8)(a1)
- PREFETCH_FOR_LOAD (4, a1)
- C_ST t0,UNIT(0)(a0)
- C_ST t1,UNIT(1)(a0)
- C_ST REG2,UNIT(2)(a0)
- C_ST REG3,UNIT(3)(a0)
- C_ST REG4,UNIT(4)(a0)
- C_ST REG5,UNIT(5)(a0)
- C_ST REG6,UNIT(6)(a0)
- C_ST REG7,UNIT(7)(a0)
- C_LDHI t0,UNIT(8)(a1)
- C_LDHI t1,UNIT(9)(a1)
- C_LDHI REG2,UNIT(10)(a1)
- C_LDHI REG3,UNIT(11)(a1)
- C_LDHI REG4,UNIT(12)(a1)
- C_LDHI REG5,UNIT(13)(a1)
- C_LDHI REG6,UNIT(14)(a1)
- C_LDHI REG7,UNIT(15)(a1)
- C_LDLO t0,UNITM1(9)(a1)
- C_LDLO t1,UNITM1(10)(a1)
- C_LDLO REG2,UNITM1(11)(a1)
- C_LDLO REG3,UNITM1(12)(a1)
- C_LDLO REG4,UNITM1(13)(a1)
- C_LDLO REG5,UNITM1(14)(a1)
- C_LDLO REG6,UNITM1(15)(a1)
- C_LDLO REG7,UNITM1(16)(a1)
- PREFETCH_FOR_LOAD (5, a1)
- C_ST t0,UNIT(8)(a0)
- C_ST t1,UNIT(9)(a0)
- C_ST REG2,UNIT(10)(a0)
- C_ST REG3,UNIT(11)(a0)
- C_ST REG4,UNIT(12)(a0)
- C_ST REG5,UNIT(13)(a0)
- C_ST REG6,UNIT(14)(a0)
- C_ST REG7,UNIT(15)(a0)
- PTR_ADDIU a0,a0,UNIT(16) /* adding 64/128 to dest */
- PTR_ADDIU a1,a1,UNIT(16) /* adding 64/128 to src */
- bne a0,a3,L(ua_loop16w)
- move a2,t8
-
-/* Here we have src and dest word-aligned but less than 64-bytes or
- * 128 bytes to go. Check for a 32(64) byte chunk and copy if there
- * is one. Otherwise jump down to L(ua_chk1w) to handle the tail end of
- * the copy. */
-
-L(ua_chkw):
- PREFETCH_FOR_LOAD (0, a1)
- andi t8,a2,NSIZEMASK /* Is there a 32-byte/64-byte chunk. */
- /* t8 is the reminder count past 32-bytes */
- beq a2,t8,L(ua_chk1w) /* When a2=t8, no 32-byte chunk */
- C_LDHI t0,UNIT(0)(a1)
- C_LDHI t1,UNIT(1)(a1)
- C_LDHI REG2,UNIT(2)(a1)
- C_LDHI REG3,UNIT(3)(a1)
- C_LDHI REG4,UNIT(4)(a1)
- C_LDHI REG5,UNIT(5)(a1)
- C_LDHI REG6,UNIT(6)(a1)
- C_LDHI REG7,UNIT(7)(a1)
- C_LDLO t0,UNITM1(1)(a1)
- C_LDLO t1,UNITM1(2)(a1)
- C_LDLO REG2,UNITM1(3)(a1)
- C_LDLO REG3,UNITM1(4)(a1)
- C_LDLO REG4,UNITM1(5)(a1)
- C_LDLO REG5,UNITM1(6)(a1)
- C_LDLO REG6,UNITM1(7)(a1)
- C_LDLO REG7,UNITM1(8)(a1)
- PTR_ADDIU a1,a1,UNIT(8)
- C_ST t0,UNIT(0)(a0)
- C_ST t1,UNIT(1)(a0)
- C_ST REG2,UNIT(2)(a0)
- C_ST REG3,UNIT(3)(a0)
- C_ST REG4,UNIT(4)(a0)
- C_ST REG5,UNIT(5)(a0)
- C_ST REG6,UNIT(6)(a0)
- C_ST REG7,UNIT(7)(a0)
- PTR_ADDIU a0,a0,UNIT(8)
-/*
- * Here we have less than 32(64) bytes to copy. Set up for a loop to
- * copy one word (or double word) at a time.
- */
-L(ua_chk1w):
- andi a2,t8,(NSIZE-1) /* a2 is the reminder past one (d)word chunks */
- PTR_SUBU a3,t8,a2 /* a3 is count of bytes in one (d)word chunks */
- beq a2,t8,L(ua_smallCopy)
- PTR_ADDU a3,a0,a3 /* a3 is the dst address after loop */
-
-/* copying in words (4-byte or 8-byte chunks) */
-L(ua_wordCopy_loop):
- C_LDHI v1,UNIT(0)(a1)
- C_LDLO v1,UNITM1(1)(a1)
- PTR_ADDIU a0,a0,UNIT(1)
- PTR_ADDIU a1,a1,UNIT(1)
- C_ST v1,UNIT(-1)(a0)
- bne a0,a3,L(ua_wordCopy_loop)
-
-/* Copy the last 8 (or 16) bytes */
-L(ua_smallCopy):
- PTR_ADDU a3,a0,a2 /* a3 is the last dst address */
- beqz a2,L(leave)
-L(ua_smallCopy_loop):
- lb v1,0(a1)
- PTR_ADDIU a0,a0,1
- PTR_ADDIU a1,a1,1
- sb v1,-1(a0)
- bne a0,a3,L(ua_smallCopy_loop)
-
- jr ra
-
-#else /* R6_CODE */
-
-# ifdef __MIPSEB
-# define SWAP_REGS(X,Y) X, Y
-# define ALIGN_OFFSET(N) (N)
-# else
-# define SWAP_REGS(X,Y) Y, X
-# define ALIGN_OFFSET(N) (NSIZE-N)
-# endif
-# define R6_UNALIGNED_WORD_COPY(BYTEOFFSET) \
- andi REG7, a2, (NSIZE-1);/* REG7 is # of bytes to by bytes. */ \
- PTR_SUBU a3, a2, REG7; /* a3 is number of bytes to be copied in */ \
- /* (d)word chunks. */ \
- beq REG7, a2, L(lastb); /* Check for bytes to copy by word */ \
- move a2, REG7; /* a2 is # of bytes to copy byte by byte */ \
- /* after word loop is finished. */ \
- PTR_ADDU REG6, a0, a3; /* REG6 is the dst address after loop. */ \
- PTR_SUBU REG2, a1, t8; /* REG2 is the aligned src address. */ \
- PTR_ADDU a1, a1, a3; /* a1 is addr of source after word loop. */ \
- C_LD t0, UNIT(0)(REG2); /* Load first part of source. */ \
-L(r6_ua_wordcopy##BYTEOFFSET): \
- C_LD t1, UNIT(1)(REG2); /* Load second part of source. */ \
- C_ALIGN REG3, SWAP_REGS(t1,t0), ALIGN_OFFSET(BYTEOFFSET); \
- PTR_ADDIU a0, a0, UNIT(1); /* Increment destination pointer. */ \
- PTR_ADDIU REG2, REG2, UNIT(1); /* Increment aligned source pointer.*/ \
- move t0, t1; /* Move second part of source to first. */ \
- C_ST REG3, UNIT(-1)(a0); \
- bne a0, REG6,L(r6_ua_wordcopy##BYTEOFFSET); \
- j L(lastb); \
-
- /* We are generating R6 code, the destination is 4 byte aligned and
- the source is not 4 byte aligned. t8 is 1, 2, or 3 depending on the
- alignment of the source. */
-
-L(r6_unaligned1):
- R6_UNALIGNED_WORD_COPY(1)
-L(r6_unaligned2):
- R6_UNALIGNED_WORD_COPY(2)
-L(r6_unaligned3):
- R6_UNALIGNED_WORD_COPY(3)
-# ifdef USE_DOUBLE
-L(r6_unaligned4):
- R6_UNALIGNED_WORD_COPY(4)
-L(r6_unaligned5):
- R6_UNALIGNED_WORD_COPY(5)
-L(r6_unaligned6):
- R6_UNALIGNED_WORD_COPY(6)
-L(r6_unaligned7):
- R6_UNALIGNED_WORD_COPY(7)
-# endif
-#endif /* R6_CODE */
-
- .set at
-END(MEMCPY_NAME)
-#ifndef ANDROID_CHANGES
-# ifdef _LIBC
-libc_hidden_builtin_def (MEMCPY_NAME)
-# endif
-#endif
diff --git a/sysdeps/mips/memcpy.c b/sysdeps/mips/memcpy.c
new file mode 100644
index 0000000000..8c3aec7b36
--- /dev/null
+++ b/sysdeps/mips/memcpy.c
@@ -0,0 +1,415 @@
+/*
+ * Copyright (C) 2024 MIPS Tech, LLC
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice,
+ * this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ * 3. Neither the name of the copyright holder nor the names of its
+ * contributors may be used to endorse or promote products derived from this
+ * software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+ * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+*/
+
+#ifdef __GNUC__
+
+#undef memcpy
+
+/* Typical observed latency in cycles in fetching from DRAM. */
+#define LATENCY_CYCLES 63
+
+/* Pre-fetch performance is subject to accurate prefetch ahead,
+ which in turn depends on both the cache-line size and the amount
+ of look-ahead. Since cache-line size is not nominally fixed in
+ a typically library built for multiple platforms, we make conservative
+ assumptions in the default case. This code will typically operate
+ on such conservative assumptions, but if compiled with the correct
+ -mtune=xx options, will perform even better on those specific
+ platforms. */
+#if defined(_MIPS_TUNE_OCTEON2) || defined(_MIPS_TUNE_OCTEON3)
+ #define CACHE_LINE 128
+ #define BLOCK_CYCLES 30
+ #undef LATENCY_CYCLES
+ #define LATENCY_CYCLES 150
+#elif defined(_MIPS_TUNE_I6400) || defined(_MIPS_TUNE_I6500)
+ #define CACHE_LINE 64
+ #define BLOCK_CYCLES 16
+#elif defined(_MIPS_TUNE_P6600)
+ #define CACHE_LINE 32
+ #define BLOCK_CYCLES 12
+#elif defined(_MIPS_TUNE_INTERAPTIV) || defined(_MIPS_TUNE_INTERAPTIV_MR2)
+ #define CACHE_LINE 32
+ #define BLOCK_CYCLES 30
+#else
+ #define CACHE_LINE 32
+ #define BLOCK_CYCLES 11
+#endif
+
+/* Pre-fetch look ahead = ceil (latency / block-cycles) */
+#define PREF_AHEAD (LATENCY_CYCLES / BLOCK_CYCLES \
+ + ((LATENCY_CYCLES % BLOCK_CYCLES) == 0 ? 0 : 1))
+
+/* Unroll-factor, controls how many words at a time in the core loop. */
+#define BLOCK (CACHE_LINE == 128 ? 16 : 8)
+
+#define __overloadable
+#if !defined(UNALIGNED_INSTR_SUPPORT)
+/* does target have unaligned lw/ld/ualw/uald instructions? */
+ #define UNALIGNED_INSTR_SUPPORT 0
+#if (__mips_isa_rev < 6 && !defined(__mips1))
+ #undef UNALIGNED_INSTR_SUPPORT
+ #define UNALIGNED_INSTR_SUPPORT 1
+ #endif
+#endif
+#if !defined(HW_UNALIGNED_SUPPORT)
+/* Does target have hardware support for unaligned accesses? */
+ #define HW_UNALIGNED_SUPPORT 0
+ #if __mips_isa_rev >= 6
+ #undef HW_UNALIGNED_SUPPORT
+ #define HW_UNALIGNED_SUPPORT 1
+ #endif
+#endif
+#define ENABLE_PREFETCH 1
+#if ENABLE_PREFETCH
+ #define PREFETCH(addr) __builtin_prefetch (addr, 0, 0)
+#else
+ #define PREFETCH(addr)
+#endif
+
+#include <string.h>
+
+#ifdef __mips64
+typedef unsigned long long reg_t;
+typedef struct
+{
+ reg_t B0:8, B1:8, B2:8, B3:8, B4:8, B5:8, B6:8, B7:8;
+} bits_t;
+#else
+typedef unsigned long reg_t;
+typedef struct
+{
+ reg_t B0:8, B1:8, B2:8, B3:8;
+} bits_t;
+#endif
+
+#define CACHE_LINES_PER_BLOCK ((BLOCK * sizeof (reg_t) > CACHE_LINE) ? \
+ (BLOCK * sizeof (reg_t) / CACHE_LINE) \
+ : 1)
+
+typedef union
+{
+ reg_t v;
+ bits_t b;
+} bitfields_t;
+
+#define DO_BYTE(a, i) \
+ a[i] = bw.b.B##i; \
+ len--; \
+ if(!len) return ret; \
+
+/* This code is called when aligning a pointer, there are remaining bytes
+ after doing word compares, or architecture does not have some form
+ of unaligned support. */
+static inline void * __attribute__ ((always_inline))
+do_bytes (void *a, const void *b, unsigned long len, void *ret)
+{
+ unsigned char *x = (unsigned char *) a;
+ unsigned char *y = (unsigned char *) b;
+ unsigned long i;
+ /* 'len' might be zero here, so preloading the first two values
+ before the loop may access unallocated memory. */
+ for (i = 0; i < len; i++)
+ {
+ *x = *y;
+ x++;
+ y++;
+ }
+ return ret;
+}
+
+/* This code is called to copy only remaining bytes within word or doubleword */
+static inline void * __attribute__ ((always_inline))
+do_bytes_remaining (void *a, const void *b, unsigned long len, void *ret)
+{
+ unsigned char *x = (unsigned char *) a;
+ bitfields_t bw;
+ if(len > 0)
+ {
+ bw.v = *(reg_t *)b;
+ DO_BYTE(x, 0);
+ DO_BYTE(x, 1);
+ DO_BYTE(x, 2);
+#ifdef __mips64
+ DO_BYTE(x, 3);
+ DO_BYTE(x, 4);
+ DO_BYTE(x, 5);
+ DO_BYTE(x, 6);
+#endif
+ }
+ return ret;
+}
+
+static inline void * __attribute__ ((always_inline))
+do_words_remaining (reg_t *a, const reg_t *b, unsigned long words,
+ unsigned long bytes, void *ret)
+{
+ /* Use a set-back so that load/stores have incremented addresses in
+ order to promote bonding. */
+ int off = (BLOCK - words);
+ a -= off;
+ b -= off;
+ switch (off)
+ {
+ case 1: a[1] = b[1]; // Fall through
+ case 2: a[2] = b[2]; // Fall through
+ case 3: a[3] = b[3]; // Fall through
+ case 4: a[4] = b[4]; // Fall through
+ case 5: a[5] = b[5]; // Fall through
+ case 6: a[6] = b[6]; // Fall through
+ case 7: a[7] = b[7]; // Fall through
+#if BLOCK==16
+ case 8: a[8] = b[8]; // Fall through
+ case 9: a[9] = b[9]; // Fall through
+ case 10: a[10] = b[10]; // Fall through
+ case 11: a[11] = b[11]; // Fall through
+ case 12: a[12] = b[12]; // Fall through
+ case 13: a[13] = b[13]; // Fall through
+ case 14: a[14] = b[14]; // Fall through
+ case 15: a[15] = b[15];
+#endif
+ }
+ return do_bytes_remaining (a + BLOCK, b + BLOCK, bytes, ret);
+}
+
+#if !HW_UNALIGNED_SUPPORT
+#if UNALIGNED_INSTR_SUPPORT
+/* For MIPS GCC, there are no unaligned builtins - so this struct forces
+ the compiler to treat the pointer access as unaligned. */
+struct ulw
+{
+ reg_t uli;
+} __attribute__ ((packed));
+static inline void * __attribute__ ((always_inline))
+do_uwords_remaining (struct ulw *a, const reg_t *b, unsigned long words,
+ unsigned long bytes, void *ret)
+{
+ /* Use a set-back so that load/stores have incremented addresses in
+ order to promote bonding. */
+ int off = (BLOCK - words);
+ a -= off;
+ b -= off;
+ switch (off)
+ {
+ case 1: a[1].uli = b[1]; // Fall through
+ case 2: a[2].uli = b[2]; // Fall through
+ case 3: a[3].uli = b[3]; // Fall through
+ case 4: a[4].uli = b[4]; // Fall through
+ case 5: a[5].uli = b[5]; // Fall through
+ case 6: a[6].uli = b[6]; // Fall through
+ case 7: a[7].uli = b[7]; // Fall through
+#if BLOCK==16
+ case 8: a[8].uli = b[8]; // Fall through
+ case 9: a[9].uli = b[9]; // Fall through
+ case 10: a[10].uli = b[10]; // Fall through
+ case 11: a[11].uli = b[11]; // Fall through
+ case 12: a[12].uli = b[12]; // Fall through
+ case 13: a[13].uli = b[13]; // Fall through
+ case 14: a[14].uli = b[14]; // Fall through
+ case 15: a[15].uli = b[15];
+#endif
+ }
+ return do_bytes_remaining (a + BLOCK, b + BLOCK, bytes, ret);
+}
+
+/* The first pointer is not aligned while second pointer is. */
+static void *
+unaligned_words (struct ulw *a, const reg_t * b,
+ unsigned long words, unsigned long bytes, void *ret)
+{
+ unsigned long i, words_by_block, words_by_1;
+ words_by_1 = words % BLOCK;
+ words_by_block = words / BLOCK;
+ for (; words_by_block > 0; words_by_block--)
+ {
+ if (words_by_block >= PREF_AHEAD - CACHE_LINES_PER_BLOCK)
+ for (i = 0; i < CACHE_LINES_PER_BLOCK; i++)
+ PREFETCH (b + (BLOCK / CACHE_LINES_PER_BLOCK) * (PREF_AHEAD + i));
+
+ reg_t y0 = b[0], y1 = b[1], y2 = b[2], y3 = b[3];
+ reg_t y4 = b[4], y5 = b[5], y6 = b[6], y7 = b[7];
+ a[0].uli = y0;
+ a[1].uli = y1;
+ a[2].uli = y2;
+ a[3].uli = y3;
+ a[4].uli = y4;
+ a[5].uli = y5;
+ a[6].uli = y6;
+ a[7].uli = y7;
+#if BLOCK==16
+ y0 = b[8], y1 = b[9], y2 = b[10], y3 = b[11];
+ y4 = b[12], y5 = b[13], y6 = b[14], y7 = b[15];
+ a[8].uli = y0;
+ a[9].uli = y1;
+ a[10].uli = y2;
+ a[11].uli = y3;
+ a[12].uli = y4;
+ a[13].uli = y5;
+ a[14].uli = y6;
+ a[15].uli = y7;
+#endif
+ a += BLOCK;
+ b += BLOCK;
+ }
+
+ /* Mop up any remaining bytes. */
+ return do_uwords_remaining (a, b, words_by_1, bytes, ret);
+}
+
+#else
+
+/* No HW support or unaligned lw/ld/ualw/uald instructions. */
+static void *
+unaligned_words (reg_t * a, const reg_t * b,
+ unsigned long words, unsigned long bytes, void *ret)
+{
+ unsigned long i;
+ unsigned char *x;
+ for (i = 0; i < words; i++)
+ {
+ bitfields_t bw;
+ bw.v = *((reg_t*) b);
+ x = (unsigned char *) a;
+ x[0] = bw.b.B0;
+ x[1] = bw.b.B1;
+ x[2] = bw.b.B2;
+ x[3] = bw.b.B3;
+#ifdef __mips64
+ x[4] = bw.b.B4;
+ x[5] = bw.b.B5;
+ x[6] = bw.b.B6;
+ x[7] = bw.b.B7;
+#endif
+ a += 1;
+ b += 1;
+ }
+ /* Mop up any remaining bytes. */
+ return do_bytes_remaining (a, b, bytes, ret);
+}
+
+#endif /* UNALIGNED_INSTR_SUPPORT */
+#endif /* HW_UNALIGNED_SUPPORT */
+
+/* both pointers are aligned, or first isn't and HW support for unaligned. */
+static void *
+aligned_words (reg_t * a, const reg_t * b,
+ unsigned long words, unsigned long bytes, void *ret)
+{
+ unsigned long i, words_by_block, words_by_1;
+ words_by_1 = words % BLOCK;
+ words_by_block = words / BLOCK;
+ for (; words_by_block > 0; words_by_block--)
+ {
+ if(words_by_block >= PREF_AHEAD - CACHE_LINES_PER_BLOCK)
+ for (i = 0; i < CACHE_LINES_PER_BLOCK; i++)
+ PREFETCH (b + ((BLOCK / CACHE_LINES_PER_BLOCK) * (PREF_AHEAD + i)));
+
+ reg_t x0 = b[0], x1 = b[1], x2 = b[2], x3 = b[3];
+ reg_t x4 = b[4], x5 = b[5], x6 = b[6], x7 = b[7];
+ a[0] = x0;
+ a[1] = x1;
+ a[2] = x2;
+ a[3] = x3;
+ a[4] = x4;
+ a[5] = x5;
+ a[6] = x6;
+ a[7] = x7;
+#if BLOCK==16
+ x0 = b[8], x1 = b[9], x2 = b[10], x3 = b[11];
+ x4 = b[12], x5 = b[13], x6 = b[14], x7 = b[15];
+ a[8] = x0;
+ a[9] = x1;
+ a[10] = x2;
+ a[11] = x3;
+ a[12] = x4;
+ a[13] = x5;
+ a[14] = x6;
+ a[15] = x7;
+#endif
+ a += BLOCK;
+ b += BLOCK;
+ }
+
+ /* mop up any remaining bytes. */
+ return do_words_remaining (a, b, words_by_1, bytes, ret);
+}
+
+void *
+memcpy (void *a, const void *b, size_t len) __overloadable
+{
+ unsigned long bytes, words, i;
+ void *ret = a;
+ /* shouldn't hit that often. */
+ if (len <= 8)
+ return do_bytes (a, b, len, a);
+
+ /* Start pre-fetches ahead of time. */
+ if (len > CACHE_LINE * (PREF_AHEAD - 1))
+ for (i = 1; i < PREF_AHEAD - 1; i++)
+ PREFETCH ((char *)b + CACHE_LINE * i);
+ else
+ for (i = 1; i < len / CACHE_LINE; i++)
+ PREFETCH ((char *)b + CACHE_LINE * i);
+
+ /* Align the second pointer to word/dword alignment.
+ Note that the pointer is only 32-bits for o32/n32 ABIs. For
+ n32, loads are done as 64-bit while address remains 32-bit. */
+ bytes = ((unsigned long) b) % (sizeof (reg_t));
+
+ if (bytes)
+ {
+ bytes = (sizeof (reg_t)) - bytes;
+ if (bytes > len)
+ bytes = len;
+ do_bytes (a, b, bytes, ret);
+ if (len == bytes)
+ return ret;
+ len -= bytes;
+ a = (void *) (((unsigned char *) a) + bytes);
+ b = (const void *) (((unsigned char *) b) + bytes);
+ }
+
+ /* Second pointer now aligned. */
+ words = len / sizeof (reg_t);
+ bytes = len % sizeof (reg_t);
+
+#if HW_UNALIGNED_SUPPORT
+ /* treat possible unaligned first pointer as aligned. */
+ return aligned_words (a, b, words, bytes, ret);
+#else
+ if (((unsigned long) a) % sizeof (reg_t) == 0)
+ return aligned_words (a, b, words, bytes, ret);
+ /* need to use unaligned instructions on first pointer. */
+ return unaligned_words (a, b, words, bytes, ret);
+#endif
+}
+
+libc_hidden_builtin_def (memcpy)
+
+#else
+#include <string/memcpy.c>
+#endif
diff --git a/sysdeps/mips/memset.S b/sysdeps/mips/memset.S
deleted file mode 100644
index 0c8375c9f5..0000000000
--- a/sysdeps/mips/memset.S
+++ /dev/null
@@ -1,430 +0,0 @@
-/* Copyright (C) 2013-2024 Free Software Foundation, Inc.
- This file is part of the GNU C Library.
-
- The GNU C Library is free software; you can redistribute it and/or
- modify it under the terms of the GNU Lesser General Public
- License as published by the Free Software Foundation; either
- version 2.1 of the License, or (at your option) any later version.
-
- The GNU C Library is distributed in the hope that it will be useful,
- but WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
- Lesser General Public License for more details.
-
- You should have received a copy of the GNU Lesser General Public
- License along with the GNU C Library. If not, see
- <https://www.gnu.org/licenses/>. */
-
-#ifdef ANDROID_CHANGES
-# include "machine/asm.h"
-# include "machine/regdef.h"
-# define PREFETCH_STORE_HINT PREFETCH_HINT_PREPAREFORSTORE
-#elif _LIBC
-# include <sysdep.h>
-# include <regdef.h>
-# include <sys/asm.h>
-# define PREFETCH_STORE_HINT PREFETCH_HINT_PREPAREFORSTORE
-#elif defined _COMPILING_NEWLIB
-# include "machine/asm.h"
-# include "machine/regdef.h"
-# define PREFETCH_STORE_HINT PREFETCH_HINT_PREPAREFORSTORE
-#else
-# include <regdef.h>
-# include <sys/asm.h>
-#endif
-
-/* Check to see if the MIPS architecture we are compiling for supports
- prefetching. */
-
-#if (__mips == 4) || (__mips == 5) || (__mips == 32) || (__mips == 64)
-# ifndef DISABLE_PREFETCH
-# define USE_PREFETCH
-# endif
-#endif
-
-#if defined(_MIPS_SIM) && ((_MIPS_SIM == _ABI64) || (_MIPS_SIM == _ABIN32))
-# ifndef DISABLE_DOUBLE
-# define USE_DOUBLE
-# endif
-#endif
-
-#ifndef USE_DOUBLE
-# ifndef DISABLE_DOUBLE_ALIGN
-# define DOUBLE_ALIGN
-# endif
-#endif
-
-
-/* Some asm.h files do not have the L macro definition. */
-#ifndef L
-# if _MIPS_SIM == _ABIO32
-# define L(label) $L ## label
-# else
-# define L(label) .L ## label
-# endif
-#endif
-
-/* Some asm.h files do not have the PTR_ADDIU macro definition. */
-#ifndef PTR_ADDIU
-# ifdef USE_DOUBLE
-# define PTR_ADDIU daddiu
-# else
-# define PTR_ADDIU addiu
-# endif
-#endif
-
-/* New R6 instructions that may not be in asm.h. */
-#ifndef PTR_LSA
-# if _MIPS_SIM == _ABI64
-# define PTR_LSA dlsa
-# else
-# define PTR_LSA lsa
-# endif
-#endif
-
-#if __mips_isa_rev > 5 && defined (__mips_micromips)
-# define PTR_BC bc16
-#else
-# define PTR_BC bc
-#endif
-
-/* Using PREFETCH_HINT_PREPAREFORSTORE instead of PREFETCH_STORE
- or PREFETCH_STORE_STREAMED offers a large performance advantage
- but PREPAREFORSTORE has some special restrictions to consider.
-
- Prefetch with the 'prepare for store' hint does not copy a memory
- location into the cache, it just allocates a cache line and zeros
- it out. This means that if you do not write to the entire cache
- line before writing it out to memory some data will get zero'ed out
- when the cache line is written back to memory and data will be lost.
-
- There are ifdef'ed sections of this memcpy to make sure that it does not
- do prefetches on cache lines that are not going to be completely written.
- This code is only needed and only used when PREFETCH_STORE_HINT is set to
- PREFETCH_HINT_PREPAREFORSTORE. This code assumes that cache lines are
- less than MAX_PREFETCH_SIZE bytes and if the cache line is larger it will
- not work correctly. */
-
-#ifdef USE_PREFETCH
-# define PREFETCH_HINT_STORE 1
-# define PREFETCH_HINT_STORE_STREAMED 5
-# define PREFETCH_HINT_STORE_RETAINED 7
-# define PREFETCH_HINT_PREPAREFORSTORE 30
-
-/* If we have not picked out what hints to use at this point use the
- standard load and store prefetch hints. */
-# ifndef PREFETCH_STORE_HINT
-# define PREFETCH_STORE_HINT PREFETCH_HINT_STORE
-# endif
-
-/* We double everything when USE_DOUBLE is true so we do 2 prefetches to
- get 64 bytes in that case. The assumption is that each individual
- prefetch brings in 32 bytes. */
-# ifdef USE_DOUBLE
-# define PREFETCH_CHUNK 64
-# define PREFETCH_FOR_STORE(chunk, reg) \
- pref PREFETCH_STORE_HINT, (chunk)*64(reg); \
- pref PREFETCH_STORE_HINT, ((chunk)*64)+32(reg)
-# else
-# define PREFETCH_CHUNK 32
-# define PREFETCH_FOR_STORE(chunk, reg) \
- pref PREFETCH_STORE_HINT, (chunk)*32(reg)
-# endif
-
-/* MAX_PREFETCH_SIZE is the maximum size of a prefetch, it must not be less
- than PREFETCH_CHUNK, the assumed size of each prefetch. If the real size
- of a prefetch is greater than MAX_PREFETCH_SIZE and the PREPAREFORSTORE
- hint is used, the code will not work correctly. If PREPAREFORSTORE is not
- used than MAX_PREFETCH_SIZE does not matter. */
-# define MAX_PREFETCH_SIZE 128
-/* PREFETCH_LIMIT is set based on the fact that we never use an offset greater
- than 5 on a STORE prefetch and that a single prefetch can never be larger
- than MAX_PREFETCH_SIZE. We add the extra 32 when USE_DOUBLE is set because
- we actually do two prefetches in that case, one 32 bytes after the other. */
-# ifdef USE_DOUBLE
-# define PREFETCH_LIMIT (5 * PREFETCH_CHUNK) + 32 + MAX_PREFETCH_SIZE
-# else
-# define PREFETCH_LIMIT (5 * PREFETCH_CHUNK) + MAX_PREFETCH_SIZE
-# endif
-
-# if (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE) \
- && ((PREFETCH_CHUNK * 4) < MAX_PREFETCH_SIZE)
-/* We cannot handle this because the initial prefetches may fetch bytes that
- are before the buffer being copied. We start copies with an offset
- of 4 so avoid this situation when using PREPAREFORSTORE. */
-# error "PREFETCH_CHUNK is too large and/or MAX_PREFETCH_SIZE is too small."
-# endif
-#else /* USE_PREFETCH not defined */
-# define PREFETCH_FOR_STORE(offset, reg)
-#endif
-
-#if __mips_isa_rev > 5
-# if (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
-# undef PREFETCH_STORE_HINT
-# define PREFETCH_STORE_HINT PREFETCH_HINT_STORE_STREAMED
-# endif
-# define R6_CODE
-#endif
-
-/* Allow the routine to be named something else if desired. */
-#ifndef MEMSET_NAME
-# define MEMSET_NAME memset
-#endif
-
-/* We load/store 64 bits at a time when USE_DOUBLE is true.
- The C_ prefix stands for CHUNK and is used to avoid macro name
- conflicts with system header files. */
-
-#ifdef USE_DOUBLE
-# define C_ST sd
-# ifdef __MIPSEB
-# define C_STHI sdl /* high part is left in big-endian */
-# else
-# define C_STHI sdr /* high part is right in little-endian */
-# endif
-#else
-# define C_ST sw
-# ifdef __MIPSEB
-# define C_STHI swl /* high part is left in big-endian */
-# else
-# define C_STHI swr /* high part is right in little-endian */
-# endif
-#endif
-
-/* Bookkeeping values for 32 vs. 64 bit mode. */
-#ifdef USE_DOUBLE
-# define NSIZE 8
-# define NSIZEMASK 0x3f
-# define NSIZEDMASK 0x7f
-#else
-# define NSIZE 4
-# define NSIZEMASK 0x1f
-# define NSIZEDMASK 0x3f
-#endif
-#define UNIT(unit) ((unit)*NSIZE)
-#define UNITM1(unit) (((unit)*NSIZE)-1)
-
-#ifdef ANDROID_CHANGES
-LEAF(MEMSET_NAME,0)
-#else
-LEAF(MEMSET_NAME)
-#endif
-
- .set nomips16
-/* If the size is less than 4*NSIZE (16 or 32), go to L(lastb). Regardless of
- size, copy dst pointer to v0 for the return value. */
- slti t2,a2,(4 * NSIZE)
- move v0,a0
- bne t2,zero,L(lastb)
-
-/* If memset value is not zero, we copy it to all the bytes in a 32 or 64
- bit word. */
- PTR_SUBU a3,zero,a0
- beq a1,zero,L(set0) /* If memset value is zero no smear */
- nop
-
- /* smear byte into 32 or 64 bit word */
-#if ((__mips == 64) || (__mips == 32)) && (__mips_isa_rev >= 2)
-# ifdef USE_DOUBLE
- dins a1, a1, 8, 8 /* Replicate fill byte into half-word. */
- dins a1, a1, 16, 16 /* Replicate fill byte into word. */
- dins a1, a1, 32, 32 /* Replicate fill byte into dbl word. */
-# else
- ins a1, a1, 8, 8 /* Replicate fill byte into half-word. */
- ins a1, a1, 16, 16 /* Replicate fill byte into word. */
-# endif
-#else
-# ifdef USE_DOUBLE
- and a1,0xff
- dsll t2,a1,8
- or a1,t2
- dsll t2,a1,16
- or a1,t2
- dsll t2,a1,32
- or a1,t2
-# else
- and a1,0xff
- sll t2,a1,8
- or a1,t2
- sll t2,a1,16
- or a1,t2
-# endif
-#endif
-
-/* If the destination address is not aligned do a partial store to get it
- aligned. If it is already aligned just jump to L(aligned). */
-L(set0):
-#ifndef R6_CODE
- andi t2,a3,(NSIZE-1) /* word-unaligned address? */
- PTR_SUBU a2,a2,t2
- beq t2,zero,L(aligned) /* t2 is the unalignment count */
- C_STHI a1,0(a0)
- PTR_ADDU a0,a0,t2
-#else /* R6_CODE */
- andi t2,a0,7
-# ifdef __mips_micromips
- auipc t9,%pcrel_hi(L(atable))
- addiu t9,t9,%pcrel_lo(L(atable)+4)
- PTR_LSA t9,t2,t9,1
-# else
- lapc t9,L(atable)
- PTR_LSA t9,t2,t9,2
-# endif
- jrc t9
-L(atable):
- PTR_BC L(aligned)
- PTR_BC L(lb7)
- PTR_BC L(lb6)
- PTR_BC L(lb5)
- PTR_BC L(lb4)
- PTR_BC L(lb3)
- PTR_BC L(lb2)
- PTR_BC L(lb1)
-L(lb7):
- sb a1,6(a0)
-L(lb6):
- sb a1,5(a0)
-L(lb5):
- sb a1,4(a0)
-L(lb4):
- sb a1,3(a0)
-L(lb3):
- sb a1,2(a0)
-L(lb2):
- sb a1,1(a0)
-L(lb1):
- sb a1,0(a0)
-
- li t9,NSIZE
- subu t2,t9,t2
- PTR_SUBU a2,a2,t2
- PTR_ADDU a0,a0,t2
-#endif /* R6_CODE */
-
-L(aligned):
-/* If USE_DOUBLE is not set we may still want to align the data on a 16
- byte boundary instead of an 8 byte boundary to maximize the opportunity
- of proAptiv chips to do memory bonding (combining two sequential 4
- byte stores into one 8 byte store). We know there are at least 4 bytes
- left to store or we would have jumped to L(lastb) earlier in the code. */
-#ifdef DOUBLE_ALIGN
- andi t2,a3,4
- PTR_SUBU a2,a2,t2
- beq t2,zero,L(double_aligned)
- sw a1,0(a0)
- PTR_ADDU a0,a0,t2
-L(double_aligned):
-#endif
-
-/* Now the destination is aligned to (word or double word) aligned address
- Set a2 to count how many bytes we have to copy after all the 64/128 byte
- chunks are copied and a3 to the dest pointer after all the 64/128 byte
- chunks have been copied. We will loop, incrementing a0 until it equals
- a3. */
- andi t8,a2,NSIZEDMASK /* any whole 64-byte/128-byte chunks? */
- PTR_SUBU a3,a2,t8 /* subtract from a2 the reminder */
- beq a2,t8,L(chkw) /* if a2==t8, no 64-byte/128-byte chunks */
- PTR_ADDU a3,a0,a3 /* Now a3 is the final dst after loop */
-
-/* When in the loop we may prefetch with the 'prepare to store' hint,
- in this case the a0+x should not be past the "t0-32" address. This
- means: for x=128 the last "safe" a0 address is "t0-160". Alternatively,
- for x=64 the last "safe" a0 address is "t0-96" In the current version we
- will use "prefetch hint,128(a0)", so "t0-160" is the limit. */
-#if defined(USE_PREFETCH) \
- && (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
- PTR_ADDU t0,a0,a2 /* t0 is the "past the end" address */
- PTR_SUBU t9,t0,PREFETCH_LIMIT /* t9 is the "last safe pref" address */
-#endif
-#if defined(USE_PREFETCH) \
- && (PREFETCH_STORE_HINT != PREFETCH_HINT_PREPAREFORSTORE)
- PREFETCH_FOR_STORE (1, a0)
- PREFETCH_FOR_STORE (2, a0)
- PREFETCH_FOR_STORE (3, a0)
-#endif
-
-L(loop16w):
-#if defined(USE_PREFETCH) \
- && (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
- sltu v1,t9,a0 /* If a0 > t9 don't use next prefetch */
- bgtz v1,L(skip_pref)
-#endif
-#ifdef R6_CODE
- PREFETCH_FOR_STORE (2, a0)
-#else
- PREFETCH_FOR_STORE (4, a0)
- PREFETCH_FOR_STORE (5, a0)
-#endif
-L(skip_pref):
- C_ST a1,UNIT(0)(a0)
- C_ST a1,UNIT(1)(a0)
- C_ST a1,UNIT(2)(a0)
- C_ST a1,UNIT(3)(a0)
- C_ST a1,UNIT(4)(a0)
- C_ST a1,UNIT(5)(a0)
- C_ST a1,UNIT(6)(a0)
- C_ST a1,UNIT(7)(a0)
- C_ST a1,UNIT(8)(a0)
- C_ST a1,UNIT(9)(a0)
- C_ST a1,UNIT(10)(a0)
- C_ST a1,UNIT(11)(a0)
- C_ST a1,UNIT(12)(a0)
- C_ST a1,UNIT(13)(a0)
- C_ST a1,UNIT(14)(a0)
- C_ST a1,UNIT(15)(a0)
- PTR_ADDIU a0,a0,UNIT(16) /* adding 64/128 to dest */
- bne a0,a3,L(loop16w)
- move a2,t8
-
-/* Here we have dest word-aligned but less than 64-bytes or 128 bytes to go.
- Check for a 32(64) byte chunk and copy if there is one. Otherwise
- jump down to L(chk1w) to handle the tail end of the copy. */
-L(chkw):
- andi t8,a2,NSIZEMASK /* is there a 32-byte/64-byte chunk. */
- /* the t8 is the reminder count past 32-bytes */
- beq a2,t8,L(chk1w)/* when a2==t8, no 32-byte chunk */
- C_ST a1,UNIT(0)(a0)
- C_ST a1,UNIT(1)(a0)
- C_ST a1,UNIT(2)(a0)
- C_ST a1,UNIT(3)(a0)
- C_ST a1,UNIT(4)(a0)
- C_ST a1,UNIT(5)(a0)
- C_ST a1,UNIT(6)(a0)
- C_ST a1,UNIT(7)(a0)
- PTR_ADDIU a0,a0,UNIT(8)
-
-/* Here we have less than 32(64) bytes to set. Set up for a loop to
- copy one word (or double word) at a time. Set a2 to count how many
- bytes we have to copy after all the word (or double word) chunks are
- copied and a3 to the dest pointer after all the (d)word chunks have
- been copied. We will loop, incrementing a0 until a0 equals a3. */
-L(chk1w):
- andi a2,t8,(NSIZE-1) /* a2 is the reminder past one (d)word chunks */
- PTR_SUBU a3,t8,a2 /* a3 is count of bytes in one (d)word chunks */
- beq a2,t8,L(lastb)
- PTR_ADDU a3,a0,a3 /* a3 is the dst address after loop */
-
-/* copying in words (4-byte or 8 byte chunks) */
-L(wordCopy_loop):
- PTR_ADDIU a0,a0,UNIT(1)
- C_ST a1,UNIT(-1)(a0)
- bne a0,a3,L(wordCopy_loop)
-
-/* Copy the last 8 (or 16) bytes */
-L(lastb):
- PTR_ADDU a3,a0,a2 /* a3 is the last dst address */
- blez a2,L(leave)
-L(lastbloop):
- PTR_ADDIU a0,a0,1
- sb a1,-1(a0)
- bne a0,a3,L(lastbloop)
-L(leave):
- jr ra
-
- .set at
-END(MEMSET_NAME)
-#ifndef ANDROID_CHANGES
-# ifdef _LIBC
-libc_hidden_builtin_def (MEMSET_NAME)
-# endif
-#endif
diff --git a/sysdeps/mips/memset.c b/sysdeps/mips/memset.c
new file mode 100644
index 0000000000..813b3bc0e6
--- /dev/null
+++ b/sysdeps/mips/memset.c
@@ -0,0 +1,187 @@
+/*
+ * Copyright (C) 2024 MIPS Tech, LLC
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice,
+ * this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ * 3. Neither the name of the copyright holder nor the names of its
+ * contributors may be used to endorse or promote products derived from this
+ * software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+ * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+*/
+
+#ifdef __GNUC__
+
+#undef memset
+
+#include <string.h>
+
+#if _MIPS_SIM == _ABIO32
+#define SIZEOF_reg_t 4
+typedef unsigned long reg_t;
+#else
+#define SIZEOF_reg_t 8
+typedef unsigned long long reg_t;
+#endif
+
+typedef struct bits8
+{
+ reg_t B0:8, B1:8, B2:8, B3:8;
+#if SIZEOF_reg_t == 8
+ reg_t B4:8, B5:8, B6:8, B7:8;
+#endif
+} bits8_t;
+typedef struct bits16
+{
+ reg_t B0:16, B1:16;
+#if SIZEOF_reg_t == 8
+ reg_t B2:16, B3:16;
+#endif
+} bits16_t;
+typedef struct bits32
+{
+ reg_t B0:32;
+#if SIZEOF_reg_t == 8
+ reg_t B1:32;
+#endif
+} bits32_t;
+
+/* This union assumes that small structures can be in registers. If
+ not, then memory accesses will be done - not optimal, but ok. */
+typedef union
+{
+ reg_t v;
+ bits8_t b8;
+ bits16_t b16;
+ bits32_t b32;
+} bitfields_t;
+
+/* This code is called when aligning a pointer or there are remaining bytes
+ after doing word sets. */
+static inline void * __attribute__ ((always_inline))
+do_bytes (void *a, void *retval, unsigned char fill, const unsigned long len)
+{
+ unsigned char *x = ((unsigned char *) a);
+ unsigned long i;
+
+ for (i = 0; i < len; i++)
+ *x++ = fill;
+
+ return retval;
+}
+
+/* Pointer is aligned. */
+static void *
+do_aligned_words (reg_t * a, void * retval, reg_t fill,
+ unsigned long words, unsigned long bytes)
+{
+ unsigned long i, words_by_1, words_by_16;
+
+ words_by_1 = words % 16;
+ words_by_16 = words / 16;
+
+ /*
+ * Note: prefetching the store memory is not beneficial on most
+ * cores since the ls/st unit has store buffers that will be filled
+ * before the cache line is actually needed.
+ *
+ * Also, using prepare-for-store cache op is problematic since we
+ * don't know the implementation-defined cache line length and we
+ * don't want to touch unintended memory.
+ */
+ for (i = 0; i < words_by_16; i++)
+ {
+ a[0] = fill;
+ a[1] = fill;
+ a[2] = fill;
+ a[3] = fill;
+ a[4] = fill;
+ a[5] = fill;
+ a[6] = fill;
+ a[7] = fill;
+ a[8] = fill;
+ a[9] = fill;
+ a[10] = fill;
+ a[11] = fill;
+ a[12] = fill;
+ a[13] = fill;
+ a[14] = fill;
+ a[15] = fill;
+ a += 16;
+ }
+
+ /* do remaining words. */
+ for (i = 0; i < words_by_1; i++)
+ *a++ = fill;
+
+ /* mop up any remaining bytes. */
+ return do_bytes (a, retval, fill, bytes);
+}
+
+void *
+memset (void *a, int ifill, size_t len)
+{
+ unsigned long bytes, words;
+ bitfields_t fill;
+ void *retval = (void *) a;
+
+ /* shouldn't hit that often. */
+ if (len < 16)
+ return do_bytes (a, retval, ifill, len);
+
+ /* Align the pointer to word/dword alignment.
+ Note that the pointer is only 32-bits for o32/n32 ABIs. For
+ n32, loads are done as 64-bit while address remains 32-bit. */
+ bytes = ((unsigned long) a) % (sizeof (reg_t) * 2);
+ if (bytes)
+ {
+ bytes = (sizeof (reg_t) * 2 - bytes);
+ if (bytes > len)
+ bytes = len;
+ do_bytes (a, retval, ifill, bytes);
+ if (len == bytes)
+ return retval;
+ len -= bytes;
+ a = (void *) (((unsigned char *) a) + bytes);
+ }
+
+ /* Create correct fill value for reg_t sized variable. */
+ if (ifill != 0)
+ {
+ fill.b8.B0 = (unsigned char) ifill;
+ fill.b8.B1 = fill.b8.B0;
+ fill.b16.B1 = fill.b16.B0;
+#if SIZEOF_reg_t == 8
+ fill.b32.B1 = fill.b32.B0;
+#endif
+ }
+ else
+ fill.v = 0;
+
+ words = len / sizeof (reg_t);
+ bytes = len % sizeof (reg_t);
+ return do_aligned_words (a, retval, fill.v, words, bytes);
+}
+
+
+libc_hidden_builtin_def (memset)
+
+#else
+#include <string/memset.c>
+#endif
--
2.34.1
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 05/11] Add optimized assembly for strcmp
2025-01-23 13:42 [PATCH 0/11] Improve Mips target Aleksandar Rakic
` (4 preceding siblings ...)
2025-01-23 13:43 ` [PATCH 04/11] Add C implementation of memcpy/memset Aleksandar Rakic
@ 2025-01-23 13:43 ` Aleksandar Rakic
2025-01-23 13:43 ` [PATCH 06/11] Fix prefetching beyond copied memory Aleksandar Rakic
` (5 subsequent siblings)
11 siblings, 0 replies; 17+ messages in thread
From: Aleksandar Rakic @ 2025-01-23 13:43 UTC (permalink / raw)
To: libc-alpha; +Cc: aleksandar.rakic, djordje.todorovic, cfu, Faraz Shahbazker
Cherry-picked ff356419673a5d122335dd81bd5726de7bc5e08f
from https://github.com/MIPS/glibc
Signed-off-by: Faraz Shahbazker <fshahbazker@wavecomp.com>
Signed-off-by: Aleksandar Rakic <aleksandar.rakic@htecgroup.com>
---
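Note (illustration only, kept below the cut so it is not part of the
commit message): the STRCMP32 macro in this patch finds a NUL byte in a
32-bit word with the classic (x - 0x01010101) & ~x & 0x80808080 test.
A rough C equivalent of the non-DSP sequence (subu/nor/and against
t8 = 0x01010101 and t9 = 0x7f7f7f7f) is:

    /* Nonzero iff some byte of x is zero (32-bit sketch only).  */
    static unsigned int
    has_zero_byte (unsigned int x)
    {
      return (x - 0x01010101U) & ~x & 0x80808080U;
    }

The DSP variant reaches the same zero-byte test with a single
saturating per-byte subtraction (subu_s.qb) against 0x01010101.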
sysdeps/mips/strcmp.S | 228 +++++++++++++++++++++++++-----------------
1 file changed, 137 insertions(+), 91 deletions(-)
diff --git a/sysdeps/mips/strcmp.S b/sysdeps/mips/strcmp.S
index 36379be021..4878cd3aac 100644
--- a/sysdeps/mips/strcmp.S
+++ b/sysdeps/mips/strcmp.S
@@ -1,4 +1,5 @@
/* Copyright (C) 2014-2024 Free Software Foundation, Inc.
+ Optimized strcmp for MIPS
This file is part of the GNU C Library.
The GNU C Library is free software; you can redistribute it and/or
@@ -22,9 +23,6 @@
# include <sysdep.h>
# include <regdef.h>
# include <sys/asm.h>
-#elif defined _COMPILING_NEWLIB
-# include "machine/asm.h"
-# include "machine/regdef.h"
#else
# include <regdef.h>
# include <sys/asm.h>
@@ -46,6 +44,10 @@
performance loss, so we are not turning it on by default. */
#if defined(ENABLE_CLZ) && (__mips_isa_rev > 1)
# define USE_CLZ
+#elif (__mips_isa_rev >= 2)
+# define USE_EXT 1
+#else
+# define USE_EXT 0
#endif
/* Some asm.h files do not have the L macro definition. */
@@ -66,6 +68,10 @@
# endif
#endif
+/* Haven't yet found a configuration where DSP code outperforms
+ normal assembly. */
+#define __mips_using_dsp 0
+
/* Allow the routine to be named something else if desired. */
#ifndef STRCMP_NAME
# define STRCMP_NAME strcmp
@@ -77,28 +83,35 @@ LEAF(STRCMP_NAME, 0)
LEAF(STRCMP_NAME)
#endif
.set nomips16
- .set noreorder
-
or t0, a0, a1
- andi t0,0x3
+ andi t0, t0, 0x3
bne t0, zero, L(byteloop)
/* Both strings are 4 byte aligned at this point. */
+ li t8, 0x01010101
+#if !__mips_using_dsp
+ li t9, 0x7f7f7f7f
+#endif
- lui t8, 0x0101
- ori t8, t8, 0x0101
- lui t9, 0x7f7f
- ori t9, 0x7f7f
-
-#define STRCMP32(OFFSET) \
- lw v0, OFFSET(a0); \
- lw v1, OFFSET(a1); \
- subu t0, v0, t8; \
- bne v0, v1, L(worddiff); \
- nor t1, v0, t9; \
- and t0, t0, t1; \
+#if __mips_using_dsp
+# define STRCMP32(OFFSET) \
+ lw a2, OFFSET(a0); \
+ lw a3, OFFSET(a1); \
+ subu_s.qb t0, t8, a2; \
+ bne a2, a3, L(worddiff); \
bne t0, zero, L(returnzero)
+#else /* !__mips_using_dsp */
+# define STRCMP32(OFFSET) \
+ lw a2, OFFSET(a0); \
+ lw a3, OFFSET(a1); \
+ subu t0, a2, t8; \
+ nor t1, a2, t9; \
+ bne a2, a3, L(worddiff); \
+ and t1, t0, t1; \
+ bne t1, zero, L(returnzero)
+#endif /* __mips_using_dsp */
+ .align 2
L(wordloop):
STRCMP32(0)
DELAY_READ
@@ -113,112 +126,143 @@ L(wordloop):
STRCMP32(20)
DELAY_READ
STRCMP32(24)
- DELAY_READ
- STRCMP32(28)
+ lw a2, 28(a0)
+ lw a3, 28(a1)
+#if __mips_using_dsp
+ subu_s.qb t0, t8, a2
+#else
+ subu t0, a2, t8
+ nor t1, a2, t9
+ and t1, t0, t1
+#endif
+
PTR_ADDIU a0, a0, 32
- b L(wordloop)
+ bne a2, a3, L(worddiff)
PTR_ADDIU a1, a1, 32
+ beq t1, zero, L(wordloop)
L(returnzero):
- j ra
move v0, zero
+ jr ra
+ .align 2
L(worddiff):
#ifdef USE_CLZ
- subu t0, v0, t8
- nor t1, v0, t9
- and t1, t0, t1
- xor t0, v0, v1
+ xor t0, a2, a3
or t0, t0, t1
# if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
wsbh t0, t0
rotr t0, t0, 16
-# endif
+# endif /* LITTLE_ENDIAN */
clz t1, t0
- and t1, 0xf8
-# if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
- neg t1
- addu t1, 24
+ or t0, t1, 24 /* Only care about multiples of 8. */
+ xor t1, t1, t0 /* {0,8,16,24} => {24,16,8,0} */
+# if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+ sllv a2,a2,t1
+ sllv a3,a3,t1
+# else
+ srlv a2,a2,t1
+ srlv a3,a3,t1
# endif
- rotrv v0, v0, t1
- rotrv v1, v1, t1
- and v0, v0, 0xff
- and v1, v1, 0xff
- j ra
- subu v0, v0, v1
+ subu v0, a2, a3
+ jr ra
#else /* USE_CLZ */
# if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
- andi t0, v0, 0xff
- beq t0, zero, L(wexit01)
- andi t1, v1, 0xff
- bne t0, t1, L(wexit01)
-
- srl t8, v0, 8
- srl t9, v1, 8
- andi t8, t8, 0xff
+ andi a0, a2, 0xff /* abcd => d */
+ andi a1, a3, 0xff
+ beq a0, zero, L(wexit01)
+# if USE_EXT
+ ext t8, a2, 8, 8
+ bne a0, a1, L(wexit01)
+ ext t9, a3, 8, 8
beq t8, zero, L(wexit89)
+ ext a0, a2, 16, 8
+ bne t8, t9, L(wexit89)
+ ext a1, a3, 16, 8
+# else /* !USE_EXT */
+ srl t8, a2, 8
+ bne a0, a1, L(wexit01)
+ srl t9, a3, 8
+ andi t8, t8, 0xff
andi t9, t9, 0xff
+ beq t8, zero, L(wexit89)
+ srl a0, a2, 16
bne t8, t9, L(wexit89)
+ srl a1, a3, 16
+ andi a0, a0, 0xff
+ andi a1, a1, 0xff
+# endif /* !USE_EXT */
- srl t0, v0, 16
- srl t1, v1, 16
- andi t0, t0, 0xff
- beq t0, zero, L(wexit01)
- andi t1, t1, 0xff
- bne t0, t1, L(wexit01)
-
- srl t8, v0, 24
- srl t9, v1, 24
# else /* __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ */
- srl t0, v0, 24
- beq t0, zero, L(wexit01)
- srl t1, v1, 24
- bne t0, t1, L(wexit01)
+ srl a0, a2, 24 /* abcd => a */
+ srl a1, a3, 24
+ beq a0, zero, L(wexit01)
- srl t8, v0, 16
- srl t9, v1, 16
- andi t8, t8, 0xff
+# if USE_EXT
+ ext t8, a2, 16, 8
+ bne a0, a1, L(wexit01)
+ ext t9, a3, 16, 8
beq t8, zero, L(wexit89)
+ ext a0, a2, 8, 8
+ bne t8, t9, L(wexit89)
+ ext a1, a3, 8, 8
+# else /* ! USE_EXT */
+ srl t8, a2, 8
+ bne a0, a1, L(wexit01)
+ srl t9, a3, 8
+ andi t8, t8, 0xff
andi t9, t9, 0xff
+ beq t8, zero, L(wexit89)
+ srl a0, a2, 16
bne t8, t9, L(wexit89)
+ srl a1, a3, 16
+ andi a0, a0, 0xff
+ andi a1, a1, 0xff
+# endif /* USE_EXT */
- srl t0, v0, 8
- srl t1, v1, 8
- andi t0, t0, 0xff
- beq t0, zero, L(wexit01)
- andi t1, t1, 0xff
- bne t0, t1, L(wexit01)
-
- andi t8, v0, 0xff
- andi t9, v1, 0xff
# endif /* __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ */
+ beq a0, zero, L(wexit01)
+ bne a0, a1, L(wexit01)
+
+ /* The other bytes are identical, so just subract the 2 words
+ and return the difference. */
+ move a0, a2
+ move a1, a3
+
+L(wexit01):
+ subu v0, a0, a1
+ jr ra
+
L(wexit89):
- j ra
subu v0, t8, t9
-L(wexit01):
- j ra
- subu v0, t0, t1
+ jr ra
+
#endif /* USE_CLZ */
+#define DELAY_NOP nop
+
/* It might seem better to do the 'beq' instruction between the two 'lbu'
instructions so that the nop is not needed but testing showed that this
code is actually faster (based on glibc strcmp test). */
-#define BYTECMP01(OFFSET) \
- lbu v0, OFFSET(a0); \
- lbu v1, OFFSET(a1); \
- beq v0, zero, L(bexit01); \
- nop; \
- bne v0, v1, L(bexit01)
-
-#define BYTECMP89(OFFSET) \
- lbu t8, OFFSET(a0); \
+
+#define BYTECMP01(OFFSET) \
+ lbu a3, OFFSET(a1); \
+ DELAY_NOP; \
+ beq a2, zero, L(bexit01); \
+ lbu t8, OFFSET+1(a0); \
+ bne a2, a3, L(bexit01)
+
+#define BYTECMP89(OFFSET) \
lbu t9, OFFSET(a1); \
+ DELAY_NOP; \
beq t8, zero, L(bexit89); \
- nop; \
+ lbu a2, OFFSET+1(a0); \
bne t8, t9, L(bexit89)
+ .align 2
L(byteloop):
+ lbu a2, 0(a0)
BYTECMP01(0)
BYTECMP89(1)
BYTECMP01(2)
@@ -226,20 +270,22 @@ L(byteloop):
BYTECMP01(4)
BYTECMP89(5)
BYTECMP01(6)
- BYTECMP89(7)
+ lbu t9, 7(a1)
+
PTR_ADDIU a0, a0, 8
- b L(byteloop)
+ beq t8, zero, L(bexit89)
PTR_ADDIU a1, a1, 8
+ beq t8, t9, L(byteloop)
-L(bexit01):
- j ra
- subu v0, v0, v1
L(bexit89):
- j ra
subu v0, t8, t9
+ jr ra
+
+L(bexit01):
+ subu v0, a2, a3
+ jr ra
.set at
- .set reorder
END(STRCMP_NAME)
#ifndef ANDROID_CHANGES
--
2.34.1
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 06/11] Fix prefetching beyond copied memory
2025-01-23 13:42 [PATCH 0/11] Improve Mips target Aleksandar Rakic
` (5 preceding siblings ...)
2025-01-23 13:43 ` [PATCH 05/11] Add optimized assembly for strcmp Aleksandar Rakic
@ 2025-01-23 13:43 ` Aleksandar Rakic
2025-01-23 13:43 ` [PATCH 07/11] Fix strcmp bug for little endian target Aleksandar Rakic
` (4 subsequent siblings)
11 siblings, 0 replies; 17+ messages in thread
From: Aleksandar Rakic @ 2025-01-23 13:43 UTC (permalink / raw)
To: libc-alpha; +Cc: aleksandar.rakic, djordje.todorovic, cfu, Faraz Shahbazker
GTM18-287/PP118771: memcpy prefetches beyond copied memory.
Fix prefetching in the core loop so that it cannot reach beyond the
memory region being operated on. Revert the accidentally changed
prefetch hint back to streaming mode. Refactor various bits and add
preprocessor checks so that the tuning parameters can be overridden
from the compiler command line.
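A minimal sketch of the look-ahead arithmetic, assuming the LATENCY_CYCLES
and BLOCK_CYCLES values that appear in the patch; the 64-byte block size and
the copy_blocks helper are purely illustrative, not glibc code:

  #include <stdio.h>

  #define LATENCY_CYCLES 63      /* assumed DRAM fetch latency */
  #define BLOCK_CYCLES   15      /* assumed cycles per unrolled block */
  /* Look-ahead = ceil (latency / block-cycles).  */
  #define PREF_AHEAD (LATENCY_CYCLES / BLOCK_CYCLES \
                      + ((LATENCY_CYCLES % BLOCK_CYCLES) == 0 ? 0 : 1))

  static void
  copy_blocks (const char *src, unsigned long blocks)
  {
    for (; blocks > 0; blocks--)
      {
        /* Conservative guard from the patch: prefetch only while at least
           PREF_AHEAD blocks remain, so the prefetch address never runs
           past the end of the source region.  */
        if (blocks > PREF_AHEAD)
          __builtin_prefetch (src + PREF_AHEAD * 64, 0, 1);
        /* ... copy one 64-byte block here ... */
        src += 64;
      }
  }

  int
  main (void)
  {
    char buf[64 * 32] = { 0 };
    copy_blocks (buf, 32);
    printf ("look-ahead = %d blocks\n", PREF_AHEAD);  /* ceil (63/15) = 5 */
    return 0;
  }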
Cherry-picked 132e0bbbbed01f95ec88b68b5f7f2056f6125531
from https://github.com/MIPS/glibc
Signed-off-by: Faraz Shahbazker <fshahbazker@wavecomp.com>
Signed-off-by: Aleksandar Rakic <aleksandar.rakic@htecgroup.com>
---
sysdeps/mips/memcpy.c | 188 +++++++++++++++++++++++++-----------------
1 file changed, 111 insertions(+), 77 deletions(-)
diff --git a/sysdeps/mips/memcpy.c b/sysdeps/mips/memcpy.c
index 8c3aec7b36..798e991f6d 100644
--- a/sysdeps/mips/memcpy.c
+++ b/sysdeps/mips/memcpy.c
@@ -1,37 +1,29 @@
-/*
- * Copyright (C) 2024 MIPS Tech, LLC
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- *
- * 1. Redistributions of source code must retain the above copyright notice,
- * this list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright notice,
- * this list of conditions and the following disclaimer in the documentation
- * and/or other materials provided with the distribution.
- * 3. Neither the name of the copyright holder nor the names of its
- * contributors may be used to endorse or promote products derived from this
- * software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
- * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
- * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
- * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
- * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
- * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
- * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
- * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
- * POSSIBILITY OF SUCH DAMAGE.
-*/
+/* Copyright (C) 2024 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+ Contributed by Wave Computing
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library. If not, see
+ <http://www.gnu.org/licenses/>. */
#ifdef __GNUC__
#undef memcpy
/* Typical observed latency in cycles in fetching from DRAM. */
-#define LATENCY_CYCLES 63
+#ifndef LATENCY_CYCLES
+ #define LATENCY_CYCLES 63
+#endif
/* Pre-fetch performance is subject to accurate prefetch ahead,
which in turn depends on both the cache-line size and the amount
@@ -48,30 +40,42 @@
#define LATENCY_CYCLES 150
#elif defined(_MIPS_TUNE_I6400) || defined(_MIPS_TUNE_I6500)
#define CACHE_LINE 64
- #define BLOCK_CYCLES 16
+ #define BLOCK_CYCLES 15
#elif defined(_MIPS_TUNE_P6600)
#define CACHE_LINE 32
- #define BLOCK_CYCLES 12
+ #define BLOCK_CYCLES 15
#elif defined(_MIPS_TUNE_INTERAPTIV) || defined(_MIPS_TUNE_INTERAPTIV_MR2)
#define CACHE_LINE 32
#define BLOCK_CYCLES 30
#else
- #define CACHE_LINE 32
- #define BLOCK_CYCLES 11
+ #ifndef CACHE_LINE
+ #define CACHE_LINE 32
+ #endif
+ #ifndef BLOCK_CYCLES
+ #ifdef __nanomips__
+ #define BLOCK_CYCLES 20
+ #else
+ #define BLOCK_CYCLES 11
+ #endif
+ #endif
#endif
/* Pre-fetch look ahead = ceil (latency / block-cycles) */
#define PREF_AHEAD (LATENCY_CYCLES / BLOCK_CYCLES \
+ ((LATENCY_CYCLES % BLOCK_CYCLES) == 0 ? 0 : 1))
-/* Unroll-factor, controls how many words at a time in the core loop. */
-#define BLOCK (CACHE_LINE == 128 ? 16 : 8)
+/* The unroll-factor controls how many words at a time in the core loop. */
+#ifndef BLOCK_SIZE
+ #define BLOCK_SIZE (CACHE_LINE == 128 ? 16 : 8)
+#elif BLOCK_SIZE != 8 && BLOCK_SIZE != 16
+ #error "BLOCK_SIZE must be 8 or 16"
+#endif
#define __overloadable
#if !defined(UNALIGNED_INSTR_SUPPORT)
/* does target have unaligned lw/ld/ualw/uald instructions? */
#define UNALIGNED_INSTR_SUPPORT 0
-#if (__mips_isa_rev < 6 && !defined(__mips1))
+#if (__mips_isa_rev < 6 && !defined(__mips1)) || defined(__nanomips__)
#undef UNALIGNED_INSTR_SUPPORT
#define UNALIGNED_INSTR_SUPPORT 1
#endif
@@ -79,17 +83,35 @@
#if !defined(HW_UNALIGNED_SUPPORT)
/* Does target have hardware support for unaligned accesses? */
#define HW_UNALIGNED_SUPPORT 0
- #if __mips_isa_rev >= 6
+ #if __mips_isa_rev >= 6 && !defined(__nanomips__)
#undef HW_UNALIGNED_SUPPORT
#define HW_UNALIGNED_SUPPORT 1
#endif
#endif
-#define ENABLE_PREFETCH 1
+
+#ifndef ENABLE_PREFETCH
+ #define ENABLE_PREFETCH 1
+#endif
+
+#ifndef ENABLE_PREFETCH_CHECK
+ #define ENABLE_PREFETCH_CHECK 0
+#endif
+
#if ENABLE_PREFETCH
- #define PREFETCH(addr) __builtin_prefetch (addr, 0, 0)
-#else
+ #if ENABLE_PREFETCH_CHECK
+#include <assert.h>
+static char *limit;
+#define PREFETCH(addr) \
+ do { \
+ assert ((char *)(addr) < limit); \
+ __builtin_prefetch ((addr), 0, 1); \
+ } while (0)
+#else /* ENABLE_PREFETCH_CHECK */
+ #define PREFETCH(addr) __builtin_prefetch (addr, 0, 1)
+ #endif /* ENABLE_PREFETCH_CHECK */
+#else /* ENABLE_PREFETCH */
#define PREFETCH(addr)
-#endif
+#endif /* ENABLE_PREFETCH */
#include <string.h>
@@ -99,17 +121,18 @@ typedef struct
{
reg_t B0:8, B1:8, B2:8, B3:8, B4:8, B5:8, B6:8, B7:8;
} bits_t;
-#else
+#else /* __mips64 */
typedef unsigned long reg_t;
typedef struct
{
reg_t B0:8, B1:8, B2:8, B3:8;
} bits_t;
-#endif
+#endif /* __mips64 */
-#define CACHE_LINES_PER_BLOCK ((BLOCK * sizeof (reg_t) > CACHE_LINE) ? \
- (BLOCK * sizeof (reg_t) / CACHE_LINE) \
- : 1)
+#define CACHE_LINES_PER_BLOCK \
+ ((BLOCK_SIZE * sizeof (reg_t) > CACHE_LINE) \
+ ? (BLOCK_SIZE * sizeof (reg_t) / CACHE_LINE) \
+ : 1)
typedef union
{
@@ -120,7 +143,7 @@ typedef union
#define DO_BYTE(a, i) \
a[i] = bw.b.B##i; \
len--; \
- if(!len) return ret; \
+ if (!len) return ret; \
/* This code is called when aligning a pointer, there are remaining bytes
after doing word compares, or architecture does not have some form
@@ -148,7 +171,7 @@ do_bytes_remaining (void *a, const void *b, unsigned long len, void *ret)
{
unsigned char *x = (unsigned char *) a;
bitfields_t bw;
- if(len > 0)
+ if (len > 0)
{
bw.v = *(reg_t *)b;
DO_BYTE(x, 0);
@@ -159,7 +182,7 @@ do_bytes_remaining (void *a, const void *b, unsigned long len, void *ret)
DO_BYTE(x, 4);
DO_BYTE(x, 5);
DO_BYTE(x, 6);
-#endif
+#endif /* __mips64 */
}
return ret;
}
@@ -170,7 +193,7 @@ do_words_remaining (reg_t *a, const reg_t *b, unsigned long words,
{
/* Use a set-back so that load/stores have incremented addresses in
order to promote bonding. */
- int off = (BLOCK - words);
+ int off = (BLOCK_SIZE - words);
a -= off;
b -= off;
switch (off)
@@ -182,7 +205,7 @@ do_words_remaining (reg_t *a, const reg_t *b, unsigned long words,
case 5: a[5] = b[5]; // Fall through
case 6: a[6] = b[6]; // Fall through
case 7: a[7] = b[7]; // Fall through
-#if BLOCK==16
+#if BLOCK_SIZE==16
case 8: a[8] = b[8]; // Fall through
case 9: a[9] = b[9]; // Fall through
case 10: a[10] = b[10]; // Fall through
@@ -191,9 +214,9 @@ do_words_remaining (reg_t *a, const reg_t *b, unsigned long words,
case 13: a[13] = b[13]; // Fall through
case 14: a[14] = b[14]; // Fall through
case 15: a[15] = b[15];
-#endif
+#endif /* BLOCK_SIZE==16 */
}
- return do_bytes_remaining (a + BLOCK, b + BLOCK, bytes, ret);
+ return do_bytes_remaining (a + BLOCK_SIZE, b + BLOCK_SIZE, bytes, ret);
}
#if !HW_UNALIGNED_SUPPORT
@@ -210,7 +233,7 @@ do_uwords_remaining (struct ulw *a, const reg_t *b, unsigned long words,
{
/* Use a set-back so that load/stores have incremented addresses in
order to promote bonding. */
- int off = (BLOCK - words);
+ int off = (BLOCK_SIZE - words);
a -= off;
b -= off;
switch (off)
@@ -222,7 +245,7 @@ do_uwords_remaining (struct ulw *a, const reg_t *b, unsigned long words,
case 5: a[5].uli = b[5]; // Fall through
case 6: a[6].uli = b[6]; // Fall through
case 7: a[7].uli = b[7]; // Fall through
-#if BLOCK==16
+#if BLOCK_SIZE==16
case 8: a[8].uli = b[8]; // Fall through
case 9: a[9].uli = b[9]; // Fall through
case 10: a[10].uli = b[10]; // Fall through
@@ -231,9 +254,9 @@ do_uwords_remaining (struct ulw *a, const reg_t *b, unsigned long words,
case 13: a[13].uli = b[13]; // Fall through
case 14: a[14].uli = b[14]; // Fall through
case 15: a[15].uli = b[15];
-#endif
+#endif /* BLOCK_SIZE==16 */
}
- return do_bytes_remaining (a + BLOCK, b + BLOCK, bytes, ret);
+ return do_bytes_remaining (a + BLOCK_SIZE, b + BLOCK_SIZE, bytes, ret);
}
/* The first pointer is not aligned while second pointer is. */
@@ -242,13 +265,19 @@ unaligned_words (struct ulw *a, const reg_t * b,
unsigned long words, unsigned long bytes, void *ret)
{
unsigned long i, words_by_block, words_by_1;
- words_by_1 = words % BLOCK;
- words_by_block = words / BLOCK;
+ words_by_1 = words % BLOCK_SIZE;
+ words_by_block = words / BLOCK_SIZE;
+
for (; words_by_block > 0; words_by_block--)
{
- if (words_by_block >= PREF_AHEAD - CACHE_LINES_PER_BLOCK)
+ /* This condition is deliberately conservative. One could theoretically
+ pre-fetch another time around in some cases without crossing the page
+ boundary at the limit, but checking for the right conditions here is
+ too expensive to be worth it. */
+ if (words_by_block > PREF_AHEAD)
for (i = 0; i < CACHE_LINES_PER_BLOCK; i++)
- PREFETCH (b + (BLOCK / CACHE_LINES_PER_BLOCK) * (PREF_AHEAD + i));
+ PREFETCH (b + ((BLOCK_SIZE / CACHE_LINES_PER_BLOCK)
+ * (PREF_AHEAD + i)));
reg_t y0 = b[0], y1 = b[1], y2 = b[2], y3 = b[3];
reg_t y4 = b[4], y5 = b[5], y6 = b[6], y7 = b[7];
@@ -260,7 +289,7 @@ unaligned_words (struct ulw *a, const reg_t * b,
a[5].uli = y5;
a[6].uli = y6;
a[7].uli = y7;
-#if BLOCK==16
+#if BLOCK_SIZE==16
y0 = b[8], y1 = b[9], y2 = b[10], y3 = b[11];
y4 = b[12], y5 = b[13], y6 = b[14], y7 = b[15];
a[8].uli = y0;
@@ -271,16 +300,16 @@ unaligned_words (struct ulw *a, const reg_t * b,
a[13].uli = y5;
a[14].uli = y6;
a[15].uli = y7;
-#endif
- a += BLOCK;
- b += BLOCK;
+#endif /* BLOCK_SIZE==16 */
+ a += BLOCK_SIZE;
+ b += BLOCK_SIZE;
}
/* Mop up any remaining bytes. */
return do_uwords_remaining (a, b, words_by_1, bytes, ret);
}
-#else
+#else /* !UNALIGNED_INSTR_SUPPORT */
/* No HW support or unaligned lw/ld/ualw/uald instructions. */
static void *
@@ -320,13 +349,15 @@ aligned_words (reg_t * a, const reg_t * b,
unsigned long words, unsigned long bytes, void *ret)
{
unsigned long i, words_by_block, words_by_1;
- words_by_1 = words % BLOCK;
- words_by_block = words / BLOCK;
+ words_by_1 = words % BLOCK_SIZE;
+ words_by_block = words / BLOCK_SIZE;
+
for (; words_by_block > 0; words_by_block--)
{
- if(words_by_block >= PREF_AHEAD - CACHE_LINES_PER_BLOCK)
+ if (words_by_block > PREF_AHEAD)
for (i = 0; i < CACHE_LINES_PER_BLOCK; i++)
- PREFETCH (b + ((BLOCK / CACHE_LINES_PER_BLOCK) * (PREF_AHEAD + i)));
+ PREFETCH (b + ((BLOCK_SIZE / CACHE_LINES_PER_BLOCK)
+ * (PREF_AHEAD + i)));
reg_t x0 = b[0], x1 = b[1], x2 = b[2], x3 = b[3];
reg_t x4 = b[4], x5 = b[5], x6 = b[6], x7 = b[7];
@@ -338,7 +369,7 @@ aligned_words (reg_t * a, const reg_t * b,
a[5] = x5;
a[6] = x6;
a[7] = x7;
-#if BLOCK==16
+#if BLOCK_SIZE==16
x0 = b[8], x1 = b[9], x2 = b[10], x3 = b[11];
x4 = b[12], x5 = b[13], x6 = b[14], x7 = b[15];
a[8] = x0;
@@ -349,9 +380,9 @@ aligned_words (reg_t * a, const reg_t * b,
a[13] = x5;
a[14] = x6;
a[15] = x7;
-#endif
- a += BLOCK;
- b += BLOCK;
+#endif /* BLOCK_SIZE==16 */
+ a += BLOCK_SIZE;
+ b += BLOCK_SIZE;
}
/* mop up any remaining bytes. */
@@ -363,13 +394,16 @@ memcpy (void *a, const void *b, size_t len) __overloadable
{
unsigned long bytes, words, i;
void *ret = a;
+#if ENABLE_PREFETCH_CHECK
+ limit = (char *)b + len;
+#endif /* ENABLE_PREFETCH_CHECK */
/* shouldn't hit that often. */
if (len <= 8)
return do_bytes (a, b, len, a);
/* Start pre-fetches ahead of time. */
- if (len > CACHE_LINE * (PREF_AHEAD - 1))
- for (i = 1; i < PREF_AHEAD - 1; i++)
+ if (len > CACHE_LINE * PREF_AHEAD)
+ for (i = 1; i < PREF_AHEAD; i++)
PREFETCH ((char *)b + CACHE_LINE * i);
else
for (i = 1; i < len / CACHE_LINE; i++)
@@ -400,12 +434,12 @@ memcpy (void *a, const void *b, size_t len) __overloadable
#if HW_UNALIGNED_SUPPORT
/* treat possible unaligned first pointer as aligned. */
return aligned_words (a, b, words, bytes, ret);
-#else
+#else /* !HW_UNALIGNED_SUPPORT */
if (((unsigned long) a) % sizeof (reg_t) == 0)
return aligned_words (a, b, words, bytes, ret);
/* need to use unaligned instructions on first pointer. */
return unaligned_words (a, b, words, bytes, ret);
-#endif
+#endif /* HW_UNALIGNED_SUPPORT */
}
libc_hidden_builtin_def (memcpy)
--
2.34.1
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 07/11] Fix strcmp bug for little endian target
2025-01-23 13:42 [PATCH 0/11] Improve Mips target Aleksandar Rakic
` (6 preceding siblings ...)
2025-01-23 13:43 ` [PATCH 06/11] Fix prefetching beyond copied memory Aleksandar Rakic
@ 2025-01-23 13:43 ` Aleksandar Rakic
2025-01-23 16:20 ` Joseph Myers
2025-01-23 18:23 ` Adhemerval Zanella Netto
2025-01-23 13:43 ` [PATCH 08/11] Add script to run tests through a qemu wrapper Aleksandar Rakic
` (3 subsequent siblings)
11 siblings, 2 replies; 17+ messages in thread
From: Aleksandar Rakic @ 2025-01-23 13:43 UTC (permalink / raw)
To: libc-alpha; +Cc: aleksandar.rakic, djordje.todorovic, cfu, Faraz Shahbazker
Strcmp gives an incorrect result for little-endian targets under
the following conditions:
1. The length of the first string is 1 less than a multiple of 4
   (i.e. len % 4 == 3).
2. The first string is a prefix of the second string.
3. The first differing character in the second string is extended
   ASCII (that is, > 127).
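A reproduction sketch for these conditions (illustrative only, not part of
the patch or the glibc testsuite): s1 has length 3 (len % 4 == 3), s1 is a
prefix of s2, and the first differing byte of s2 is 0x80.  Since strcmp
compares bytes as unsigned char, the result must be negative here:

  #include <assert.h>
  #include <stdio.h>
  #include <string.h>

  int
  main (void)
  {
    /* Word-aligned buffers so the word-at-a-time loop is exercised.  */
    static const char s1[8] __attribute__ ((aligned (4))) = "abc";
    static const char s2[8] __attribute__ ((aligned (4))) = "abc\x80";

    int r = strcmp (s1, s2);
    printf ("strcmp (\"abc\", \"abc\\x80\") = %d\n", r);
    assert (r < 0);   /* '\0' (0) must compare below 0x80 (128).  */
    return 0;
  }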
Cherry-picked 7c709e878f836069bbdbf42979937794623cfa68
from https://github.com/MIPS/glibc
Signed-off-by: Faraz Shahbazker <fshahbazker@wavecomp.com>
Signed-off-by: Aleksandar Rakic <aleksandar.rakic@htecgroup.com>
---
sysdeps/mips/strcmp.S | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/sysdeps/mips/strcmp.S b/sysdeps/mips/strcmp.S
index 4878cd3aac..8d1bab12ec 100644
--- a/sysdeps/mips/strcmp.S
+++ b/sysdeps/mips/strcmp.S
@@ -225,10 +225,13 @@ L(worddiff):
beq a0, zero, L(wexit01)
bne a0, a1, L(wexit01)
- /* The other bytes are identical, so just subract the 2 words
- and return the difference. */
+# if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+ srl a0, a2, 24
+ srl a1, a3, 24
+# else /* __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ */
move a0, a2
move a1, a3
+# endif /* __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ */
L(wexit01):
subu v0, a0, a1
--
2.34.1
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 08/11] Add script to run tests through a qemu wrapper
2025-01-23 13:42 [PATCH 0/11] Improve Mips target Aleksandar Rakic
` (7 preceding siblings ...)
2025-01-23 13:43 ` [PATCH 07/11] Fix strcmp bug for little endian target Aleksandar Rakic
@ 2025-01-23 13:43 ` Aleksandar Rakic
2025-01-23 13:43 ` [PATCH 09/11] Avoid warning from -Wbuiltin-declaration-mismatch Aleksandar Rakic
` (2 subsequent siblings)
11 siblings, 0 replies; 17+ messages in thread
From: Aleksandar Rakic @ 2025-01-23 13:43 UTC (permalink / raw)
To: libc-alpha; +Cc: aleksandar.rakic, djordje.todorovic, cfu, Faraz Shahbazker
GTM19-545: Add script to run tests through a qemu wrapper
Cherry-picked 9f9923a4f14406026426d857acf9c2babe2908bf
from https://github.com/MIPS/glibc
Signed-off-by: Faraz Shahbazker <fshahbazker@wavecomp.com>
Signed-off-by: Aleksandar Rakic <aleksandar.rakic@htecgroup.com>
---
scripts/cross-test-qemu.sh | 152 +++++++++++++++++++++++++++++++++++++
1 file changed, 152 insertions(+)
create mode 100755 scripts/cross-test-qemu.sh
diff --git a/scripts/cross-test-qemu.sh b/scripts/cross-test-qemu.sh
new file mode 100755
index 0000000000..7636414141
--- /dev/null
+++ b/scripts/cross-test-qemu.sh
@@ -0,0 +1,152 @@
+#!/bin/bash
+# Run a testcase on a remote system, via qemu.
+# Copyright (C) 2024 Free Software Foundation, Inc.
+# This file is part of the GNU C Library.
+
+# The GNU C Library is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+
+# The GNU C Library is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+
+# You should have received a copy of the GNU Lesser General Public
+# License along with the GNU C Library; if not, see
+# <http://www.gnu.org/licenses/>.
+
+# usage: cross-test-qemu.sh [OPTIONS] EMULATOR COMMAND ...
+# Run with --help flag to get more detailed help.
+
+progname="$(basename $0)"
+
+usage="usage: ${progname} [--timeoutfactor N] [--addon-libpath PATH] EMULATOR COMMAND ..."
+timeoutfactor=$TIMEOUTFACTOR
+addon_libpath=""
+while [ $# -gt 0 ]; do
+ case "$1" in
+
+ "--timeoutfactor")
+ shift
+ if [ $# -lt 1 ]; then
+ break
+ fi
+ timeoutfactor="$1"
+ ;;
+
+ "--addon-libpath")
+ shift
+ if [ $# -lt 1 ]; then
+ break
+ fi
+ addon_libpath="$1"
+ ;;
+
+ "--help")
+ echo "$usage"
+ echo "$help"
+ exit 0
+ ;;
+
+ *)
+ break
+ ;;
+ esac
+ shift
+done
+
+if [ $# -lt 1 ]; then
+ echo "$usage" >&2
+ echo "Type '${progname} --help' for more detailed help." >&2
+ exit 1
+fi
+
+emulator="$1"; shift
+envpat="[:alpha:]*=.*"
+ldpat=".*/.*ld.*\.so.*"
+lgccpat="libgcc_s.so.1"
+libpat="--library-path"
+ldpath=""
+lgccpath=""
+envlist=""
+liblist=""
+command=""
+toolchain=`dirname \`dirname $emulator\``
+target=`ls $toolchain | grep -e linux-gnu`
+# Print the sequence of arguments as strings properly quoted for the
+# Bourne shell, separated by spaces.
+bourne_quote ()
+{
+ local arg qarg libflag variant
+ libflag=0
+
+ for arg in $@; do
+ if [ "x$done" != "x" ]; then
+ command="$command $arg"
+ elif [[ $arg =~ $envpat ]]; then
+ if [ -z $envlist ]; then
+ envlist="$arg"
+ else
+ envlist="$arg,$envlist"
+ fi
+ elif [[ $arg =~ $ldpat ]]; then
+ ldfile=`basename $arg`
+ variant=`basename \`dirname \\\`dirname $arg\\\`\``
+ libdir=${variant##*_}
+ variant=${variant%_*}
+ variant=${variant#obj_}
+ ldpath=$toolchain/sysroot/$variant
+ if [ ! -f $ldpath/$libdir/$ldfile ]; then
+ ldpath=`dirname $arg`
+ fi
+ lgccpath=$toolchain/$target/lib/$variant/$libdir
+ liblist="$ldpath:$lgccpath:$liblist"
+ elif [[ $arg =~ $libpat ]]; then
+ libflag=1
+ elif [ $libflag -ne 0 ]; then
+ liblist="$arg:$liblist"
+ libflag=0
+ elif [ "x$arg" != "xenv" ]; then
+ if [[ $arg =~ "tst-" ]]; then
+ if [ -f $arg ]; then
+ done=1
+ fi
+ fi
+ command="$command $arg"
+ fi
+ done
+}
+
+# Transform the current argument list into a properly quoted Bourne shell
+# command string.
+bourne_quote "$@"
+
+liblist=$addon_libpath:$liblist
+liblist=`tr -s : <<< $liblist`
+liblist=${liblist#:*}
+liblist=${liblist%*:}
+
+if [ "x$liblist" != "x" ]; then
+ LIBPATH_OPT="-E LD_LIBRARY_PATH=$liblist"
+fi
+
+if [ "x$envlist" != "x" ]; then
+ ENV_OPT="-E $envlist"
+fi
+
+if [ "x$ldpath" != "x" ]; then
+ LDPATH_OPT="-L $ldpath"
+fi
+
+if [ "x$timeoutfactor" != "x" ]; then
+ $emulator $LDPATH_OPT $LIBPATH_OPT $ENV_OPT $command &
+ pid=$!
+ trap "kill -SIGINT $pid" SIGALRM
+ sleep $timeoutfactor && kill -SIGALRM $$
+ exit 1
+else
+ $emulator $LDPATH_OPT $LIBPATH_OPT $ENV_OPT $command
+fi
+
--
2.34.1
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 09/11] Avoid warning from -Wbuiltin-declaration-mismatch
2025-01-23 13:42 [PATCH 0/11] Improve Mips target Aleksandar Rakic
` (8 preceding siblings ...)
2025-01-23 13:43 ` [PATCH 08/11] Add script to run tests through a qemu wrapper Aleksandar Rakic
@ 2025-01-23 13:43 ` Aleksandar Rakic
2025-01-23 16:16 ` Joseph Myers
2025-01-23 13:43 ` [PATCH 10/11] Avoid GCC 11 warning from -Wmaybe-uninitialized Aleksandar Rakic
2025-01-23 13:43 ` [PATCH 11/11] Prevent turning memset into self-recursion Aleksandar Rakic
11 siblings, 1 reply; 17+ messages in thread
From: Aleksandar Rakic @ 2025-01-23 13:43 UTC (permalink / raw)
To: libc-alpha; +Cc: aleksandar.rakic, djordje.todorovic, cfu
Avoid GCC 11 warning from -Wbuiltin-declaration-mismatch for modfl and
sincosl under MIPS o32 ABI.
Cherry-picked 056065bbe644d396a6fadd7c759f91bba1855bd6
from https://github.com/MIPS/glibc
Signed-off-by: Chao-ying Fu <cfu@mips.com>
Signed-off-by: Aleksandar Rakic <aleksandar.rakic@htecgroup.com>
---
sysdeps/ieee754/dbl-64/s_modf.c | 4 ++++
sysdeps/ieee754/dbl-64/s_sincos.c | 4 ++++
2 files changed, 8 insertions(+)
diff --git a/sysdeps/ieee754/dbl-64/s_modf.c b/sysdeps/ieee754/dbl-64/s_modf.c
index 0de2084caf..eda2d65b51 100644
--- a/sysdeps/ieee754/dbl-64/s_modf.c
+++ b/sysdeps/ieee754/dbl-64/s_modf.c
@@ -23,6 +23,7 @@
#include <math_private.h>
#include <libm-alias-double.h>
#include <stdint.h>
+#include <libc-diag.h>
static const double one = 1.0;
@@ -60,5 +61,8 @@ __modf(double x, double *iptr)
}
}
#ifndef __modf
+DIAG_PUSH_NEEDS_COMMENT;
+DIAG_IGNORE_NEEDS_COMMENT (11, "-Wbuiltin-declaration-mismatch");
libm_alias_double (__modf, modf)
+DIAG_POP_NEEDS_COMMENT;
#endif
diff --git a/sysdeps/ieee754/dbl-64/s_sincos.c b/sysdeps/ieee754/dbl-64/s_sincos.c
index adbc57af28..531940d4c8 100644
--- a/sysdeps/ieee754/dbl-64/s_sincos.c
+++ b/sysdeps/ieee754/dbl-64/s_sincos.c
@@ -23,6 +23,7 @@
#include <fenv_private.h>
#include <math-underflow.h>
#include <libm-alias-double.h>
+#include <libc-diag.h>
#ifndef SECTION
# define SECTION
@@ -106,5 +107,8 @@ __sincos (double x, double *sinx, double *cosx)
*sinx = *cosx = x / x;
}
#ifndef __sincos
+DIAG_PUSH_NEEDS_COMMENT;
+DIAG_IGNORE_NEEDS_COMMENT (11, "-Wbuiltin-declaration-mismatch");
libm_alias_double (__sincos, sincos)
+DIAG_POP_NEEDS_COMMENT;
#endif
--
2.34.1
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 10/11] Avoid GCC 11 warning from -Wmaybe-uninitialized
2025-01-23 13:42 [PATCH 0/11] Improve Mips target Aleksandar Rakic
` (9 preceding siblings ...)
2025-01-23 13:43 ` [PATCH 09/11] Avoid warning from -Wbuiltin-declaration-mismatch Aleksandar Rakic
@ 2025-01-23 13:43 ` Aleksandar Rakic
2025-01-23 13:43 ` [PATCH 11/11] Prevent turning memset into self-recursion Aleksandar Rakic
11 siblings, 0 replies; 17+ messages in thread
From: Aleksandar Rakic @ 2025-01-23 13:43 UTC (permalink / raw)
To: libc-alpha; +Cc: aleksandar.rakic, djordje.todorovic, cfu
Cherry-picked 4dad697124b3bc82d9f4fbad62f30224216ab996
from https://github.com/MIPS/glibc
Signed-off-by: Chao-ying Fu <cfu@mips.com>
Signed-off-by: Aleksandar Rakic <aleksandar.rakic@htecgroup.com>
---
sysdeps/ieee754/soft-fp/s_fdiv.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/sysdeps/ieee754/soft-fp/s_fdiv.c b/sysdeps/ieee754/soft-fp/s_fdiv.c
index 8c92aa6fb2..d02da4ca71 100644
--- a/sysdeps/ieee754/soft-fp/s_fdiv.c
+++ b/sysdeps/ieee754/soft-fp/s_fdiv.c
@@ -35,6 +35,7 @@
may be where the macro is defined. This happens only with -O1. */
DIAG_PUSH_NEEDS_COMMENT;
DIAG_IGNORE_NEEDS_COMMENT (8, "-Wmaybe-uninitialized");
+DIAG_IGNORE_NEEDS_COMMENT (11, "-Wmaybe-uninitialized");
#include <soft-fp.h>
#include <single.h>
#include <double.h>
--
2.34.1
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 11/11] Prevent turning memset into self-recursion
2025-01-23 13:42 [PATCH 0/11] Improve Mips target Aleksandar Rakic
` (10 preceding siblings ...)
2025-01-23 13:43 ` [PATCH 10/11] Avoid GCC 11 warning from -Wmaybe-uninitialized Aleksandar Rakic
@ 2025-01-23 13:43 ` Aleksandar Rakic
2025-01-23 16:19 ` Joseph Myers
11 siblings, 1 reply; 17+ messages in thread
From: Aleksandar Rakic @ 2025-01-23 13:43 UTC (permalink / raw)
To: libc-alpha; +Cc: aleksandar.rakic, djordje.todorovic, cfu, Dragan Mladjenovic
Prevent GCC 11 from turning memset into self-recursion.
GCC 11 transforms the byte-by-byte set loop in memset.c into a call
to memset, causing runtime failures. Apply -fno-builtin to both
memset.c and memcpy.c to prevent similar bugs in the future.
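An illustrative sketch of the failure mode; toy_memset is a made-up
function, not glibc's memset.c.  At -O2 the -ftree-loop-distribute-patterns
pass may recognize the byte loop and replace it with a call to memset, and
when the surrounding file is memset.c itself that call becomes
self-recursion at run time.  Building with -fno-builtin (this patch)
suppresses the transformation:

  void *
  toy_memset (void *dst, int c, unsigned long len)
  {
    unsigned char *p = dst;
    while (len--)
      *p++ = (unsigned char) c;   /* GCC can recognize this as a memset idiom */
    return dst;
  }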
Cherry-picked 31906b3556bc18cfdb7a3d84a669d95486450704
from https://github.com/MIPS/glibc
Signed-off-by: Dragan Mladjenovic <dragan.mladjenovic@syrmia.com>
Signed-off-by: Aleksandar Rakic <aleksandar.rakic@htecgroup.com>
---
sysdeps/mips/Makefile | 3 +++
1 file changed, 3 insertions(+)
diff --git a/sysdeps/mips/Makefile b/sysdeps/mips/Makefile
index 17ddc2a97c..4464d73902 100644
--- a/sysdeps/mips/Makefile
+++ b/sysdeps/mips/Makefile
@@ -24,6 +24,9 @@ ASFLAGS-.o += $(pie-default)
ASFLAGS-.op += $(pie-default)
ASFLAGS += -O2
+CFLAGS-memset.c += -fno-builtin
+CFLAGS-memcpy.c += -fno-builtin
+
ifeq ($(subdir),elf)
# These tests fail on all mips configurations (BZ 29404)
--
2.34.1
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 09/11] Avoid warning from -Wbuiltin-declaration-mismatch
2025-01-23 13:43 ` [PATCH 09/11] Avoid warning from -Wbuiltin-declaration-mismatch Aleksandar Rakic
@ 2025-01-23 16:16 ` Joseph Myers
0 siblings, 0 replies; 17+ messages in thread
From: Joseph Myers @ 2025-01-23 16:16 UTC (permalink / raw)
To: Aleksandar Rakic; +Cc: libc-alpha, aleksandar.rakic, djordje.todorovic, cfu
On Thu, 23 Jan 2025, Aleksandar Rakic wrote:
> Avoid GCC 11 warning from -Wbuiltin-declaration-mismatch for modfl and
> sincosl under MIPS o32 ABI.
This should not be needed. math/Makefile has
CFLAGS-s_modf.c += -fno-builtin-modfl
CFLAGS-s_sincos.c += -fno-builtin-sincosl
which are supposed to avoid such warnings. (It wouldn't surprise me if
we're missing some such -fno-builtin-* for functions not currently
supported as built-in functions in GCC, but if such built-in functions get
added in future and result in glibc build failures, we can add the
corresponding options that were previously missed - the options work fine
even when there is no such built-in function.)
--
Joseph S. Myers
josmyers@redhat.com
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 11/11] Prevent turning memset into self-recursion
2025-01-23 13:43 ` [PATCH 11/11] Prevent turning memset into self-recursion Aleksandar Rakic
@ 2025-01-23 16:19 ` Joseph Myers
0 siblings, 0 replies; 17+ messages in thread
From: Joseph Myers @ 2025-01-23 16:19 UTC (permalink / raw)
To: Aleksandar Rakic
Cc: libc-alpha, aleksandar.rakic, djordje.todorovic, cfu, Dragan Mladjenovic
On Thu, 23 Jan 2025, Aleksandar Rakic wrote:
> Prevent GCC 11 from turning memset into self-recursion.
> GCC 11 transforms the byte-by-byte set loop in memset.c into a call
> to memset, causing runtime failures. Apply -fno-builtin to both
> memset.c and memcpy.c to prevent similar bugs in the future.
We use inhibit_loop_to_libcall to provide __attribute__ ((__optimize__
("-fno-tree-loop-distribute-patterns"))) in such cases.
--
Joseph S. Myers
josmyers@redhat.com
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 07/11] Fix strcmp bug for little endian target
2025-01-23 13:43 ` [PATCH 07/11] Fix strcmp bug for little endian target Aleksandar Rakic
@ 2025-01-23 16:20 ` Joseph Myers
2025-01-23 18:23 ` Adhemerval Zanella Netto
1 sibling, 0 replies; 17+ messages in thread
From: Joseph Myers @ 2025-01-23 16:20 UTC (permalink / raw)
To: Aleksandar Rakic
Cc: libc-alpha, aleksandar.rakic, djordje.todorovic, cfu, Faraz Shahbazker
On Thu, 23 Jan 2025, Aleksandar Rakic wrote:
> Strcmp gives an incorrect result for little-endian targets under
> the following conditions:
> 1. The length of the first string is 1 less than a multiple of 4
>    (i.e. len % 4 == 3).
> 2. The first string is a prefix of the second string.
> 3. The first differing character in the second string is extended
>    ASCII (that is, > 127).
Is there a test in the glibc testsuite that fails before and passes after
this patch? If not, one should be added.
--
Joseph S. Myers
josmyers@redhat.com
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 07/11] Fix strcmp bug for little endian target
2025-01-23 13:43 ` [PATCH 07/11] Fix strcmp bug for little endian target Aleksandar Rakic
2025-01-23 16:20 ` Joseph Myers
@ 2025-01-23 18:23 ` Adhemerval Zanella Netto
1 sibling, 0 replies; 17+ messages in thread
From: Adhemerval Zanella Netto @ 2025-01-23 18:23 UTC (permalink / raw)
To: Aleksandar Rakic, libc-alpha
Cc: aleksandar.rakic, djordje.todorovic, cfu, Faraz Shahbazker
On 23/01/25 10:43, Aleksandar Rakic wrote:
> Strcmp gives an incorrect result for little-endian targets under
> the following conditions:
> 1. The length of the first string is 1 less than a multiple of 4
>    (i.e. len % 4 == 3).
> 2. The first string is a prefix of the second string.
> 3. The first differing character in the second string is extended
>    ASCII (that is, > 127).
>
> Cherry-picked 7c709e878f836069bbdbf42979937794623cfa68
> from https://github.com/MIPS/glibc
>
> Signed-off-by: Faraz Shahbazker <fshahbazker@wavecomp.com>
> Signed-off-by: Aleksandar Rakic <aleksandar.rakic@htecgroup.com>
> ---
> sysdeps/mips/strcmp.S | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/sysdeps/mips/strcmp.S b/sysdeps/mips/strcmp.S
> index 4878cd3aac..8d1bab12ec 100644
> --- a/sysdeps/mips/strcmp.S
> +++ b/sysdeps/mips/strcmp.S
> @@ -225,10 +225,13 @@ L(worddiff):
> beq a0, zero, L(wexit01)
> bne a0, a1, L(wexit01)
>
> - /* The other bytes are identical, so just subract the 2 words
> - and return the difference. */
> +# if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
> + srl a0, a2, 24
> + srl a1, a3, 24
> +# else /* __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ */
> move a0, a2
> move a1, a3
> +# endif /* __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ */
>
> L(wexit01):
> subu v0, a0, a1
Can't you use the generic implementation instead? If I understand correctly,
mips optimizes only the aligned case, while the generic code also does
word-sized reads for the unaligned case (with the MERGE and shift tricks).
The only trick I see the mips implementation adding is loop unrolling, which
I think you could get by adding some compiler flags to the mips Makefile.
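For reference, a sketch of the zero-byte test that a word-at-a-time C
implementation typically relies on (32-bit constants shown; the 64-bit case
just widens them).  It is the same check the MIPS strcmp.S above performs
with the 0x01010101 and 0x7f7f7f7f constants in t8/t9:

  #include <assert.h>
  #include <stdint.h>

  static int
  has_zero_byte (uint32_t w)
  {
    /* Non-zero iff some byte of w is zero.  */
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
  }

  int
  main (void)
  {
    assert (has_zero_byte (0x61620063u));   /* word containing a NUL byte */
    assert (!has_zero_byte (0x61626364u));  /* "abcd", no NUL byte */
    return 0;
  }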
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2025-01-23 18:23 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-23 13:42 [PATCH 0/11] Improve Mips target Aleksandar Rakic
2025-01-23 13:42 ` [PATCH 00/11] " Aleksandar Rakic
2025-01-23 13:42 ` [PATCH 01/11] Updates for microMIPS Release 6 Aleksandar Rakic
2025-01-23 13:42 ` [PATCH 02/11] Fix rtld link_map initialization issues Aleksandar Rakic
2025-01-23 13:42 ` [PATCH 03/11] Fix issues with removing no-reorder directives Aleksandar Rakic
2025-01-23 13:43 ` [PATCH 04/11] Add C implementation of memcpy/memset Aleksandar Rakic
2025-01-23 13:43 ` [PATCH 05/11] Add optimized assembly for strcmp Aleksandar Rakic
2025-01-23 13:43 ` [PATCH 06/11] Fix prefetching beyond copied memory Aleksandar Rakic
2025-01-23 13:43 ` [PATCH 07/11] Fix strcmp bug for little endian target Aleksandar Rakic
2025-01-23 16:20 ` Joseph Myers
2025-01-23 18:23 ` Adhemerval Zanella Netto
2025-01-23 13:43 ` [PATCH 08/11] Add script to run tests through a qemu wrapper Aleksandar Rakic
2025-01-23 13:43 ` [PATCH 09/11] Avoid warning from -Wbuiltin-declaration-mismatch Aleksandar Rakic
2025-01-23 16:16 ` Joseph Myers
2025-01-23 13:43 ` [PATCH 10/11] Avoid GCC 11 warning from -Wmaybe-uninitialized Aleksandar Rakic
2025-01-23 13:43 ` [PATCH 11/11] Prevent turning memset into self-recursion Aleksandar Rakic
2025-01-23 16:19 ` Joseph Myers