public inbox for libc-alpha@sourceware.org
* [PATCH 0/3] aarch64: Update ld.so for vector abi
@ 2018-08-01 22:23 rth
  2018-08-01 22:24 ` [PATCH 2/3] aarch64: Clean up _dl_runtime_profile rth
                   ` (5 more replies)
  0 siblings, 6 replies; 27+ messages in thread
From: rth @ 2018-08-01 22:23 UTC (permalink / raw)
  To: libc-alpha; +Cc: marcus.shawcroft, szabolcs.nagy, Richard Henderson

From: Richard Henderson <richard.henderson@linaro.org>

There is a new calling convention defined for vectorized functions [1].

This is similar to what has happened for x86_64, where the original ABI
did not pass or preserve full vector contents, but then a new ABI is
defined that does.

There was an old patch for [BZ #15128] that saved the full AdvSIMD registers
in _dl_runtime_resolve, but failed to do so for _dl_runtime_profile.
In the chatter for the BZ [2], Marcus Shawcroft mentions that it should
save and restore d0-d7, which is indeed correct for the original ABI.
It is not clear from the BZ why q0-q7 are saved instead.

That said, the new abi for AdvSIMD does use q0-q7.
When SVE is enabled, we need to save even more: z0-z7 plus p0-p3.

This fixes a number of minor issues with _dl_runtime_resolve itself,
and more major issues with _dl_runtime_profile before copying those
routines and making the modifications required for SVE.

I have *not* attempted to extend the <bits/link.h> interface for
the new ABI.  This should be done with more discussion on list.
I have instead simply saved and restored registers as the abi
requires, so that the actual callee gets the correct data.

I have lightly tested this under QEMU, in that the new sve paths
pass the same glibc tests as the old paths.


r~


[1] http://infocenter.arm.com/help/topic/com.arm.doc.ecm0665628/abi_sve_aapcs64_100986_0000_00_en.pdf
[2] https://sourceware.org/bugzilla/show_bug.cgi?id=15128#c11

Richard Henderson (3):
  aarch64: Clean up _dl_runtime_resolve
  aarch64: Clean up _dl_runtime_profile
  aarch64: Save and restore SVE registers in ld.so

 sysdeps/aarch64/dl-machine.h    |  13 +-
 sysdeps/aarch64/dl-trampoline.S | 531 +++++++++++++++++++++++++-------
 2 files changed, 438 insertions(+), 106 deletions(-)

-- 
2.17.1

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 2/3] aarch64: Clean up _dl_runtime_profile
  2018-08-01 22:23 [PATCH 0/3] aarch64: Update ld.so for vector abi rth
@ 2018-08-01 22:24 ` rth
  2018-08-02 17:25   ` Szabolcs Nagy
  2018-08-01 22:24 ` [PATCH 3/3] aarch64: Save and restore SVE registers in ld.so rth
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 27+ messages in thread
From: rth @ 2018-08-01 22:24 UTC (permalink / raw)
  To: libc-alpha; +Cc: marcus.shawcroft, szabolcs.nagy, Richard Henderson

From: Richard Henderson <richard.henderson@linaro.org>

This does not adjust La_aarch64_regs or La_aarch64_retval for the new
AdvSIMD vector ABI; that will require more thought and coordination.  In
the meantime, this will at least pass the proper values to each callee,
even if the values are not visible to auditing.

	* sysdeps/aarch64/dl-trampoline.S (_dl_runtime_profile):
	Do not record unwind info for arguments -- this is unneeded;
	properly include the 16 bytes of PLT stack into the unwind;
	save and restore the structure return pointer, x8;
	save all of the AdvSIMD registers defined for the vector ABI.
---
 sysdeps/aarch64/dl-trampoline.S | 138 ++++++++++++++++----------------
 1 file changed, 71 insertions(+), 67 deletions(-)

diff --git a/sysdeps/aarch64/dl-trampoline.S b/sysdeps/aarch64/dl-trampoline.S
index e8e2af485a..67a7c1b207 100644
--- a/sysdeps/aarch64/dl-trampoline.S
+++ b/sysdeps/aarch64/dl-trampoline.S
@@ -101,7 +101,6 @@ _dl_runtime_resolve:
 #ifndef PROF
 	.globl _dl_runtime_profile
 	.type _dl_runtime_profile, #function
-	cfi_startproc
 	.align 2
 _dl_runtime_profile:
 	/* AArch64 we get called with:
@@ -111,15 +110,16 @@ _dl_runtime_profile:
 	   [sp, #0]	&PLTGOT[n]
 
 	   Stack frame layout:
-	   [sp,   #...] lr
-	   [sp,   #...] &PLTGOT[n]
-	   [sp,    #96] La_aarch64_regs
-	   [sp,    #48] La_aarch64_retval
-	   [sp,    #40] frame size return from pltenter
-	   [sp,    #32] dl_profile_call saved x1
-	   [sp,    #24] dl_profile_call saved x0
-	   [sp,    #16] t1
-	   [sp,     #0] x29, lr   <- x29
+	   [x29,  #...] lr
+	   [x29,  #...] &PLTGOT[n]
+	   [x29,   #96] La_aarch64_regs
+	   [x29,   #48] La_aarch64_retval
+	   [x29,   #40] frame size return from pltenter
+	   [x29,   #32] dl_profile_call saved x1
+	   [x29,   #24] dl_profile_call saved x0
+	   [x29,   #16] t1
+	   [x29,    #0] x29, x8
+	   [x29, #-128] full q[0-7] contents
 	 */
 
 # define OFFSET_T1		16
@@ -127,46 +127,39 @@ _dl_runtime_profile:
 # define OFFSET_FS		OFFSET_SAVED_CALL_X0 + 16
 # define OFFSET_RV		OFFSET_FS + 8
 # define OFFSET_RG		OFFSET_RV + DL_SIZEOF_RV
+# define OFFSET_SAVED_VEC	(-16 * 8)
 
-# define SF_SIZE		OFFSET_RG + DL_SIZEOF_RG
+# define SF_SIZE		(OFFSET_RG + DL_SIZEOF_RG)
 
 # define OFFSET_PLTGOTN		SF_SIZE
 # define OFFSET_LR		OFFSET_PLTGOTN + 8
 
-	/* Save arguments.  */
-	sub	sp, sp, #SF_SIZE
-	cfi_adjust_cfa_offset (SF_SIZE)
-	stp	x29, x30, [SP, #0]
-	mov	x29, sp
-	cfi_def_cfa_register (x29)
-	cfi_rel_offset (x29, 0)
+	cfi_startproc
+	cfi_adjust_cfa_offset(16)	/* Incorporate PLT */
 	cfi_rel_offset (lr, 8)
 
-	stp	x0, x1, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*0]
-	cfi_rel_offset (x0, OFFSET_RG + DL_OFFSET_RG_X0 + 16*0 + 0)
-	cfi_rel_offset (x1, OFFSET_RG + DL_OFFSET_RG_X0 + 16*0 + 8)
-	stp	x2, x3, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*1]
-	cfi_rel_offset (x2, OFFSET_RG + DL_OFFSET_RG_X0 + 16*1 + 0)
-	cfi_rel_offset (x3, OFFSET_RG + DL_OFFSET_RG_X0 + 16*1 + 8)
-	stp	x4, x5, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*2]
-	cfi_rel_offset (x4, OFFSET_RG + DL_OFFSET_RG_X0 + 16*2 + 0)
-	cfi_rel_offset (x5, OFFSET_RG + DL_OFFSET_RG_X0 + 16*2 + 8)
-	stp	x6, x7, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*3]
-	cfi_rel_offset (x6, OFFSET_RG + DL_OFFSET_RG_X0 + 16*3 + 0)
-	cfi_rel_offset (x7, OFFSET_RG + DL_OFFSET_RG_X0 + 16*3 + 8)
+	stp	x29, x8, [SP, #-SF_SIZE]!
+	cfi_adjust_cfa_offset (SF_SIZE)
+	cfi_rel_offset (x29, 0)
+	mov	x29, sp
+	cfi_def_cfa_register (x29)
+	sub	sp, sp, #-OFFSET_SAVED_VEC
 
-	stp	d0, d1, [X29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*0]
-	cfi_rel_offset (d0, OFFSET_RG + DL_OFFSET_RG_D0 + 16*0)
-	cfi_rel_offset (d1, OFFSET_RG + DL_OFFSET_RG_D0 + 16*0 + 8)
-	stp	d2, d3, [X29, #OFFSET_RG+ DL_OFFSET_RG_D0 + 16*1]
-	cfi_rel_offset (d2, OFFSET_RG + DL_OFFSET_RG_D0 + 16*1 + 0)
-	cfi_rel_offset (d3, OFFSET_RG + DL_OFFSET_RG_D0 + 16*1 + 8)
-	stp	d4, d5, [X29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*2]
-	cfi_rel_offset (d4, OFFSET_RG + DL_OFFSET_RG_D0 + 16*2 + 0)
-	cfi_rel_offset (d5, OFFSET_RG + DL_OFFSET_RG_D0 + 16*2 + 8)
-	stp	d6, d7, [X29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*3]
-	cfi_rel_offset (d6, OFFSET_RG + DL_OFFSET_RG_D0 + 16*3 + 0)
-	cfi_rel_offset (d7, OFFSET_RG + DL_OFFSET_RG_D0 + 16*3 + 8)
+	/* Save La_aarch64_regs.  */
+	stp	x0, x1, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*0]
+	stp	x2, x3, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*1]
+	stp	x4, x5, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*2]
+	stp	x6, x7, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*3]
+	stp	d0, d1, [x29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*0]
+	stp	d2, d3, [x29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*1]
+	stp	d4, d5, [x29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*2]
+	stp	d6, d7, [x29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*3]
+
+	/* Re-save the full contents of the vector arguments.  */
+	stp	q0, q1, [x29, #OFFSET_SAVED_VEC + 16*0]
+	stp	q2, q3, [x29, #OFFSET_SAVED_VEC + 16*2]
+	stp	q4, q5, [x29, #OFFSET_SAVED_VEC + 16*4]
+	stp	q6, q7, [x29, #OFFSET_SAVED_VEC + 16*6]
 
 	add     x0, x29, #SF_SIZE + 16
 	ldr	x1, [x29, #OFFSET_LR]
@@ -201,31 +194,28 @@ _dl_runtime_profile:
 	mov	ip0, x0
 
 	/* Get arguments and return address back.  */
+	ldr	lr, [x29, #OFFSET_LR]
 	ldp	x0, x1, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*0]
 	ldp	x2, x3, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*1]
 	ldp	x4, x5, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*2]
 	ldp	x6, x7, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*3]
-	ldp	d0, d1, [x29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*0]
-	ldp	d2, d3, [x29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*1]
-	ldp	d4, d5, [x29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*2]
-	ldp	d6, d7, [x29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*3]
+	ldp	q0, q1, [x29, #OFFSET_SAVED_VEC + 16*0]
+	ldp	q2, q3, [x29, #OFFSET_SAVED_VEC + 16*2]
+	ldp	q4, q5, [x29, #OFFSET_SAVED_VEC + 16*4]
+	ldp	q6, q7, [x29, #OFFSET_SAVED_VEC + 16*6]
 
-	cfi_def_cfa_register (sp)
-	ldp	x29, x30, [x29, #0]
-	cfi_restore(x29)
-	cfi_restore(x30)
-
-	add	sp, sp, SF_SIZE + 16
-	cfi_adjust_cfa_offset (- SF_SIZE - 16)
+	mov	sp, x29
+	ldp	x29, x8, [sp], SF_SIZE + 16
+	cfi_def_cfa (sp, 0)
+	cfi_restore (x29)
+	cfi_restore (lr)
 
 	/* Jump to the newly found address.  */
 	br	ip0
 
 	cfi_restore_state
-1:
-	/* The new frame size is in ip0.  */
-
-	sub	PTR_REG (1), PTR_REG (29), ip0l
+	/* The new frame size is in ip0, extended for pointer size.  */
+1:	sub	x1, sp, ip0
 	and	sp, x1, #0xfffffffffffffff0
 
 	str	x0, [x29, #OFFSET_T1]
@@ -237,42 +227,56 @@ _dl_runtime_profile:
 
 	ldr	ip0, [x29, #OFFSET_T1]
 
-	/* Call the function.  */
+	/* Load the original arguments.  */
 	ldp	x0, x1, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*0]
 	ldp	x2, x3, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*1]
 	ldp	x4, x5, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*2]
 	ldp	x6, x7, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*3]
-	ldp	d0, d1, [x29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*0]
-	ldp	d2, d3, [x29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*1]
-	ldp	d4, d5, [x29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*2]
-	ldp	d6, d7, [x29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*3]
+	ldr	x8, [x29, 8]
+	ldp	q0, q1, [x29, #OFFSET_SAVED_VEC + 16*0]
+	ldp	q2, q3, [x29, #OFFSET_SAVED_VEC + 16*2]
+	ldp	q4, q5, [x29, #OFFSET_SAVED_VEC + 16*4]
+	ldp	q6, q7, [x29, #OFFSET_SAVED_VEC + 16*6]
+
+	/* Call the function.  */
 	blr	ip0
+
+	/* Save La_aarch64_retval.  */
 	stp	x0, x1, [x29, #OFFSET_RV + DL_OFFSET_RV_X0]
 	stp	d0, d1, [x29, #OFFSET_RV + DL_OFFSET_RV_D0 + 16*0]
 	stp	d2, d3, [x29, #OFFSET_RV + DL_OFFSET_RV_D0 + 16*1]
 
+	/* Re-save the full contents of the vector return.  */
+	stp	q0, q1, [x29, #OFFSET_SAVED_VEC + 16*0]
+	stp	q2, q3, [x29, #OFFSET_SAVED_VEC + 16*2]
+	stp	q4, q5, [x29, #OFFSET_SAVED_VEC + 16*4]
+	stp	q6, q7, [x29, #OFFSET_SAVED_VEC + 16*6]
+
 	/* Setup call to pltexit  */
 	ldp	x0, x1, [x29, #OFFSET_SAVED_CALL_X0]
 	add	x2, x29, #OFFSET_RG
 	add	x3, x29, #OFFSET_RV
 	bl	_dl_call_pltexit
 
+	/* Restore the full return value.  */
 	ldp	x0, x1, [x29, #OFFSET_RV + DL_OFFSET_RV_X0]
-	ldp	d0, d1, [x29, #OFFSET_RV + DL_OFFSET_RV_D0 + 16*0]
-	ldp	d2, d3, [x29, #OFFSET_RV + DL_OFFSET_RV_D0 + 16*1]
+	ldp	q0, q1, [x29, #OFFSET_SAVED_VEC + 16*0]
+	ldp	q2, q3, [x29, #OFFSET_SAVED_VEC + 16*2]
+	ldp	q4, q5, [x29, #OFFSET_SAVED_VEC + 16*4]
+	ldp	q6, q7, [x29, #OFFSET_SAVED_VEC + 16*6]
+
 	/* LR from within La_aarch64_reg */
 	ldr	lr, [x29, #OFFSET_RG + DL_OFFSET_RG_LR]
-	cfi_restore(lr)
 	mov	sp, x29
 	cfi_def_cfa_register (sp)
 	ldr	x29, [x29, #0]
-	cfi_restore(x29)
 	add	sp, sp, SF_SIZE + 16
 	cfi_adjust_cfa_offset (- SF_SIZE - 16)
+	cfi_restore(x29)
+	cfi_restore(lr)
 
 	br	lr
 
 	cfi_endproc
 	.size _dl_runtime_profile, .-_dl_runtime_profile
 #endif
-	.previous
-- 
2.17.1


* [PATCH 3/3] aarch64: Save and restore SVE registers in ld.so
  2018-08-01 22:23 [PATCH 0/3] aarch64: Update ld.so for vector abi rth
  2018-08-01 22:24 ` [PATCH 2/3] aarch64: Clean up _dl_runtime_profile rth
@ 2018-08-01 22:24 ` rth
  2018-08-02  7:29   ` Florian Weimer
  2018-08-01 22:24 ` [PATCH 1/3] aarch64: Clean up _dl_runtime_resolve rth
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 27+ messages in thread
From: rth @ 2018-08-01 22:24 UTC (permalink / raw)
  To: libc-alpha; +Cc: marcus.shawcroft, szabolcs.nagy, Richard Henderson

From: Richard Henderson <richard.henderson@linaro.org>

Add SVE versions of _dl_runtime_resolve and _dl_runtime_profile.
This honors the extended vector calling convention described in
ARM_100986_0000_00_en (SVEpcs 00bet1).

	* sysdeps/aarch64/dl-trampoline.S (_dl_runtime_resolve_sve): New.
	(_dl_runtime_profile_sve): New.
	* sysdeps/aarch64/dl-machine.h (elf_machine_runtime_set): Use the
	new routines if HWCAP_SVE is set.
---
 sysdeps/aarch64/dl-machine.h    |  13 +-
 sysdeps/aarch64/dl-trampoline.S | 343 ++++++++++++++++++++++++++++++++
 2 files changed, 353 insertions(+), 3 deletions(-)

diff --git a/sysdeps/aarch64/dl-machine.h b/sysdeps/aarch64/dl-machine.h
index 4935aa7c54..ea7c5c71d5 100644
--- a/sysdeps/aarch64/dl-machine.h
+++ b/sysdeps/aarch64/dl-machine.h
@@ -69,6 +69,9 @@ elf_machine_runtime_setup (struct link_map *l, int lazy, int profile)
       ElfW(Addr) *got;
       extern void _dl_runtime_resolve (ElfW(Word));
       extern void _dl_runtime_profile (ElfW(Word));
+      extern void _dl_runtime_resolve_sve (ElfW(Word));
+      extern void _dl_runtime_profile_sve (ElfW(Word));
+      unsigned has_sve = GLRO(dl_hwcap) & HWCAP_SVE;
 
       got = (ElfW(Addr) *) D_PTR (l, l_info[DT_PLTGOT]);
       if (got[1])
@@ -83,9 +86,11 @@ elf_machine_runtime_setup (struct link_map *l, int lazy, int profile)
 	 to intercept the calls to collect information.  In this case we
 	 don't store the address in the GOT so that all future calls also
 	 end in this function.  */
-      if ( profile)
+      if (profile)
 	{
-	   got[2] = (ElfW(Addr)) &_dl_runtime_profile;
+	  got[2] = (has_sve
+		    ? (ElfW(Addr)) &_dl_runtime_profile_sve
+		    : (ElfW(Addr)) &_dl_runtime_profile);
 
 	  if (GLRO(dl_profile) != NULL
 	      && _dl_name_match_p (GLRO(dl_profile), l))
@@ -98,7 +103,9 @@ elf_machine_runtime_setup (struct link_map *l, int lazy, int profile)
 	  /* This function will get called to fix up the GOT entry
 	     indicated by the offset on the stack, and then jump to
 	     the resolved address.  */
-	  got[2] = (ElfW(Addr)) &_dl_runtime_resolve;
+	  got[2] = (has_sve
+		    ? (ElfW(Addr)) &_dl_runtime_resolve_sve
+		    : (ElfW(Addr)) &_dl_runtime_resolve);
 	}
     }
 
diff --git a/sysdeps/aarch64/dl-trampoline.S b/sysdeps/aarch64/dl-trampoline.S
index 67a7c1b207..e23e5f1aad 100644
--- a/sysdeps/aarch64/dl-trampoline.S
+++ b/sysdeps/aarch64/dl-trampoline.S
@@ -280,3 +280,346 @@ _dl_runtime_profile:
 	cfi_endproc
 	.size _dl_runtime_profile, .-_dl_runtime_profile
 #endif
+
+/*
+ * For functions conforming to the procedure call standard as
+ * amended for SVE support (ARM_100986_0000_00_en (SVEpcs 00bet1)),
+ * we must save the entire contents of Z0-Z7 as well as P0-P3.
+ */
+        .arch   armv8-a+sve
+
+	.globl _dl_runtime_resolve_sve
+	.type _dl_runtime_resolve_sve, #function
+	.align 2
+_dl_runtime_resolve_sve:
+	/* AArch64 we get called with:
+	   ip0		&PLTGOT[2]
+	   ip1		temp(dl resolver entry point)
+	   [sp, #8]	lr
+	   [sp, #0]	&PLTGOT[n]
+	 */
+	cfi_startproc
+	cfi_adjust_cfa_offset(16)	/* Incorporate PLT */
+	cfi_rel_offset (lr, 8)
+
+	/* Save arguments.  */
+	stp	x29, x8, [sp, #-80]!
+	cfi_adjust_cfa_offset (80)
+	cfi_rel_offset (x29, 0)
+	mov	x29, sp
+	cfi_def_cfa_register (x29)
+
+	stp	x6, x7, [sp,  #16]
+	stp	x4, x5, [sp,  #32]
+	stp	x2, x3, [sp,  #48]
+	stp	x0, x1, [sp,  #64]
+
+	/* Allocate space for, and store, Z[0-7].  */
+	addvl	sp, sp, #-8
+	str	z0, [sp, #0, mul vl]
+	str	z1, [sp, #1, mul vl]
+	str	z2, [sp, #2, mul vl]
+	str	z3, [sp, #3, mul vl]
+	str	z4, [sp, #4, mul vl]
+	str	z5, [sp, #5, mul vl]
+	str	z6, [sp, #6, mul vl]
+	str	z7, [sp, #7, mul vl]
+
+	/* Allocate space for, and store, P[0-3].  */
+	addpl	sp, sp, #-4
+	str	p0, [sp, #0, mul vl]
+	str	p1, [sp, #1, mul vl]
+	str	p2, [sp, #2, mul vl]
+	str	p3, [sp, #3, mul vl]
+
+	/* Get pointer to linker struct.  */
+	ldr	PTR_REG (0), [ip0, #-PTR_SIZE]
+
+	/* Prepare to call _dl_fixup().  */
+	ldr	x1, [x29, 80]		/* Recover &PLTGOT[n] */
+
+	sub     x1, x1, ip0
+	add     x1, x1, x1, lsl #1
+	lsl     x1, x1, #3
+	sub     x1, x1, #(RELA_SIZE<<3)
+	lsr     x1, x1, #3
+
+	/* Call fixup routine.  */
+	bl	_dl_fixup
+
+	/* Save the return.  */
+	mov	ip0, x0
+
+	/* Get arguments and return address back.  */
+	ldr	p0, [sp, #0, mul vl]
+	ldr	p1, [sp, #1, mul vl]
+	ldr	p2, [sp, #2, mul vl]
+	ldr	p3, [sp, #3, mul vl]
+	addpl	sp, sp, #4
+
+	ldr	z0, [sp, #0, mul vl]
+	ldr	z1, [sp, #1, mul vl]
+	ldr	z2, [sp, #2, mul vl]
+	ldr	z3, [sp, #3, mul vl]
+	ldr	z4, [sp, #4, mul vl]
+	ldr	z5, [sp, #5, mul vl]
+	ldr	z6, [sp, #6, mul vl]
+	ldr	z7, [sp, #7, mul vl]
+	addvl	sp, sp, #8
+
+	ldr	lr, [sp, #88]
+	ldp	x0, x1, [sp, #64]
+	ldp	x2, x3, [sp, #48]
+	ldp	x4, x5, [sp, #32]
+	ldp	x6, x7, [sp, #16]
+	ldp	x29, x8, [sp], #96
+	cfi_def_cfa (sp, 0)
+	cfi_restore (lr)
+	cfi_restore (x29)
+
+	/* Jump to the newly found address.  */
+	br	ip0
+
+	cfi_endproc
+	.size _dl_runtime_resolve_sve, .-_dl_runtime_resolve_sve
+
+#ifndef PROF
+	.globl _dl_runtime_profile_sve
+	.type _dl_runtime_profile_sve, #function
+	.align 2
+_dl_runtime_profile_sve:
+	/* AArch64 we get called with:
+	   ip0		&PLTGOT[2]
+	   ip1		temp(dl resolver entry point)
+	   [sp, #8]	lr
+	   [sp, #0]	&PLTGOT[n]
+
+	   Stack frame layout:
+	         [x29,  #...] lr
+	         [x29,  #...] &PLTGOT[n]
+	         [x29,   #96] La_aarch64_regs
+	         [x29,   #48] La_aarch64_retval
+	         [x29,   #40] frame size return from pltenter
+	         [x29,   #32] dl_profile_call saved x1
+	         [x29,   #24] dl_profile_call saved x0
+	         [x29,   #16] t1
+	         [x29,    #0] x29, lr       <- x29
+           [x29, #-1, mul vl] full p[0-3]
+	   [x29, #-2, mul vl] full z[0-8]   <- sp
+
+	   ??? Extending the profiling hook for full SVE register export
+	   is tricky given the variable register size.  Perhaps the new
+	   La_aarch64_regs should contain pointers to Z0 and P0, and
+	   the current VL, and one infers the addresses from there.
+
+	   This one new form could be used for all, with AdvSIMD
+	   devolving into VL=16 with no predicate registers.
+
+	   In the meantime, this function simply saves the contents of
+	   the SVE registers, but only exposes the AdvSIMD portion to
+	   the profile hooks.
+	 */
+
+	cfi_startproc
+	cfi_adjust_cfa_offset(16)	/* Incorporate PLT */
+	cfi_rel_offset (lr, 8)
+
+	stp	x29, x8, [SP, #-SF_SIZE]!
+	cfi_adjust_cfa_offset (SF_SIZE)
+	cfi_rel_offset (x29, 0)
+	mov	x29, sp
+	cfi_def_cfa_register (x29)
+
+	/* Save La_aarch64_regs.  */
+	stp	x0, x1, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*0]
+	stp	x2, x3, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*1]
+	stp	x4, x5, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*2]
+	stp	x6, x7, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*3]
+	stp	d0, d1, [X29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*0]
+	stp	d2, d3, [X29, #OFFSET_RG+ DL_OFFSET_RG_D0 + 16*1]
+	stp	d4, d5, [X29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*2]
+	stp	d6, d7, [X29, #OFFSET_RG + DL_OFFSET_RG_D0 + 16*3]
+
+	/* Re-save the full contents of the vector arguments.
+
+	   Note that PL = VL/8, so we can save all 4 predicates
+	   in (less than) the space of one vector; this minimizes
+	   the number of stack adjustments required, and gives a
+	   predictable place for each register.
+
+	   Despite the unfortunate assembler mnemonics, the vector
+	   stores do not overlap the preceding predicate stores.  */
+	addvl	sp, sp, #-9
+
+	str	p0, [x29, #-1, mul vl]
+	str	p1, [x29, #-2, mul vl]
+	str	p2, [x29, #-3, mul vl]
+	str	p3, [x29, #-4, mul vl]
+
+	str	z0, [x29, #-2, mul vl]
+	str	z1, [x29, #-3, mul vl]
+	str	z2, [x29, #-4, mul vl]
+	str	z3, [x29, #-5, mul vl]
+	str	z4, [x29, #-6, mul vl]
+	str	z5, [x29, #-7, mul vl]
+	str	z6, [x29, #-8, mul vl]
+	str	z7, [x29, #-9, mul vl]
+
+	add     x0, x29, #SF_SIZE + 16
+	ldr	x1, [x29, #OFFSET_LR]
+	stp	x0, x1, [x29, #OFFSET_RG + DL_OFFSET_RG_SP]
+
+	/* Get pointer to linker struct.  */
+	ldr	PTR_REG (0), [ip0, #-PTR_SIZE]
+
+	/* Prepare to call _dl_profile_fixup().  */
+	ldr	x1, [x29, OFFSET_PLTGOTN]	/* Recover &PLTGOT[n] */
+
+	sub     x1, x1, ip0
+	add     x1, x1, x1, lsl #1
+	lsl     x1, x1, #3
+	sub     x1, x1, #(RELA_SIZE<<3)
+	lsr     x1, x1, #3
+
+	stp	x0, x1, [x29, #OFFSET_SAVED_CALL_X0]
+
+	/* Set up extra args for _dl_profile_fixup */
+	ldr	x2, [x29, #OFFSET_LR]		/* load saved LR */
+	add	x3, x29, #OFFSET_RG		/* address of La_aarch64_reg */
+	add	x4, x29, #OFFSET_FS		/* address of framesize */
+	bl	_dl_profile_fixup
+
+	ldr	ip0l, [x29, #OFFSET_FS]		/* framesize == 0 */
+	cmp	ip0l, #0
+	bge	1f
+	cfi_remember_state
+
+	/* Save the return.  */
+	mov	ip0, x0
+
+	/* Get arguments and return address back.  */
+	ldr	p0, [x29, #-1, mul vl]
+	ldr	p1, [x29, #-2, mul vl]
+	ldr	p2, [x29, #-3, mul vl]
+	ldr	p3, [x29, #-4, mul vl]
+
+	ldr	z0, [x29, #-2, mul vl]
+	ldr	z1, [x29, #-3, mul vl]
+	ldr	z2, [x29, #-4, mul vl]
+	ldr	z3, [x29, #-5, mul vl]
+	ldr	z4, [x29, #-6, mul vl]
+	ldr	z5, [x29, #-7, mul vl]
+	ldr	z6, [x29, #-8, mul vl]
+	ldr	z7, [x29, #-9, mul vl]
+
+	ldr	lr, [x29, #OFFSET_LR]
+	ldp	x0, x1, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*0]
+	ldp	x2, x3, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*1]
+	ldp	x4, x5, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*2]
+	ldp	x6, x7, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*3]
+
+	mov	sp, x29
+	ldp	x29, x8, [sp], SF_SIZE + 16
+	cfi_def_cfa (sp, 0)
+	cfi_restore(x29)
+	cfi_restore(lr)
+
+	/* Jump to the newly found address.  */
+	br	ip0
+
+	cfi_restore_state
+	/* The new frame size is in ip0, extended for pointer size.  */
+1:	sub	x1, sp, ip0
+	and	sp, x1, #0xfffffffffffffff0
+
+	str	x0, [x29, #OFFSET_T1]
+
+	mov	x0, sp
+	add	x1, x29, #SF_SIZE + 16
+	mov	x2, ip0
+	bl	memcpy
+
+	ldr	ip0, [x29, #OFFSET_T1]
+
+	/* Reload the full arguments.  */
+	ldp	x0, x1, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*0]
+	ldp	x2, x3, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*1]
+	ldp	x4, x5, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*2]
+	ldp	x6, x7, [x29, #OFFSET_RG + DL_OFFSET_RG_X0 + 16*3]
+	ldr	x8, [x29, 8]
+
+	ldr	p0, [x29, #-1, mul vl]
+	ldr	p1, [x29, #-2, mul vl]
+	ldr	p2, [x29, #-3, mul vl]
+	ldr	p3, [x29, #-4, mul vl]
+
+	ldr	z0, [x29, #-2, mul vl]
+	ldr	z1, [x29, #-3, mul vl]
+	ldr	z2, [x29, #-4, mul vl]
+	ldr	z3, [x29, #-5, mul vl]
+	ldr	z4, [x29, #-6, mul vl]
+	ldr	z5, [x29, #-7, mul vl]
+	ldr	z6, [x29, #-8, mul vl]
+	ldr	z7, [x29, #-9, mul vl]
+
+	/* Call the function.  */
+	blr	ip0
+
+	/* Store La_aarch64_retval, as if for the non-vector ABI.  */
+	stp	x0, x1, [x29, #OFFSET_RV + DL_OFFSET_RV_X0]
+	stp	d0, d1, [x29, #OFFSET_RV + DL_OFFSET_RV_D0 + 16*0]
+	stp	d2, d3, [x29, #OFFSET_RV + DL_OFFSET_RV_D0 + 16*1]
+
+	/* Store the full contents of the vector return.  */
+	str	p0, [x29, #-1, mul vl]
+	str	p1, [x29, #-2, mul vl]
+	str	p2, [x29, #-3, mul vl]
+	str	p3, [x29, #-4, mul vl]
+
+	str	z0, [x29, #-2, mul vl]
+	str	z1, [x29, #-3, mul vl]
+	str	z2, [x29, #-4, mul vl]
+	str	z3, [x29, #-5, mul vl]
+	str	z4, [x29, #-6, mul vl]
+	str	z5, [x29, #-7, mul vl]
+	str	z6, [x29, #-8, mul vl]
+	str	z7, [x29, #-9, mul vl]
+
+	/* Setup call to pltexit  */
+	ldp	x0, x1, [x29, #OFFSET_SAVED_CALL_X0]
+	add	x2, x29, #OFFSET_RG
+	add	x3, x29, #OFFSET_RV
+	bl	_dl_call_pltexit
+
+	/* Reload the full return value.  */
+	ldp	x0, x1, [x29, #OFFSET_RV + DL_OFFSET_RV_X0]
+
+	ldr	p0, [x29, #-1, mul vl]
+	ldr	p1, [x29, #-2, mul vl]
+	ldr	p2, [x29, #-3, mul vl]
+	ldr	p3, [x29, #-4, mul vl]
+
+	ldr	z0, [x29, #-2, mul vl]
+	ldr	z1, [x29, #-3, mul vl]
+	ldr	z2, [x29, #-4, mul vl]
+	ldr	z3, [x29, #-5, mul vl]
+	ldr	z4, [x29, #-6, mul vl]
+	ldr	z5, [x29, #-7, mul vl]
+	ldr	z6, [x29, #-8, mul vl]
+	ldr	z7, [x29, #-9, mul vl]
+
+	/* LR from within La_aarch64_reg */
+	ldr	lr, [x29, #OFFSET_RG + DL_OFFSET_RG_LR]
+	mov	sp, x29
+	cfi_def_cfa_register (sp)
+	ldr	x29, [x29, #0]
+	add	sp, sp, SF_SIZE + 16
+	cfi_adjust_cfa_offset (- SF_SIZE - 16)
+	cfi_restore(x29)
+	cfi_restore(lr)
+
+	br	lr
+
+	cfi_endproc
+	.size _dl_runtime_profile_sve, .-_dl_runtime_profile_sve
+#endif
-- 
2.17.1


* [PATCH 1/3] aarch64: Clean up _dl_runtime_resolve
  2018-08-01 22:23 [PATCH 0/3] aarch64: Update ld.so for vector abi rth
  2018-08-01 22:24 ` [PATCH 2/3] aarch64: Clean up _dl_runtime_profile rth
  2018-08-01 22:24 ` [PATCH 3/3] aarch64: Save and restore SVE registers in ld.so rth
@ 2018-08-01 22:24 ` rth
  2018-08-02 16:03   ` Szabolcs Nagy
  2018-08-01 22:32 ` [PATCH 0/3] aarch64: Update ld.so for vector abi Richard Henderson
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 27+ messages in thread
From: rth @ 2018-08-01 22:24 UTC (permalink / raw)
  To: libc-alpha; +Cc: marcus.shawcroft, szabolcs.nagy, Richard Henderson

From: Richard Henderson <richard.henderson@linaro.org>

	* sysdeps/aarch64/dl-trampoline.S (_dl_runtime_resolve):
	Do not record unwind info for arguments; this is unneeded;
	do not save x9 just to have a register to pair with x8;
	properly include the 16 bytes of PLT stack into the unwind;
	create a frame pointer with the spare stack slot;
	rearrange the exit to only adjust the stack once.
---
 sysdeps/aarch64/dl-trampoline.S | 50 +++++++++------------------------
 1 file changed, 14 insertions(+), 36 deletions(-)

diff --git a/sysdeps/aarch64/dl-trampoline.S b/sysdeps/aarch64/dl-trampoline.S
index a86d0722d4..e8e2af485a 100644
--- a/sysdeps/aarch64/dl-trampoline.S
+++ b/sysdeps/aarch64/dl-trampoline.S
@@ -32,7 +32,6 @@
 	.text
 	.globl _dl_runtime_resolve
 	.type _dl_runtime_resolve, #function
-	cfi_startproc
 	.align 2
 _dl_runtime_resolve:
 	/* AArch64 we get called with:
@@ -41,46 +40,24 @@ _dl_runtime_resolve:
 	   [sp, #8]	lr
 	   [sp, #0]	&PLTGOT[n]
 	 */
-
+	cfi_startproc
+	cfi_adjust_cfa_offset(16)	/* Incorporate PLT */
 	cfi_rel_offset (lr, 8)
 
 	/* Save arguments.  */
-	stp	x8, x9, [sp, #-(80+8*16)]!
+	stp	x29, x8, [sp, #-(80+8*16)]!
 	cfi_adjust_cfa_offset (80+8*16)
-	cfi_rel_offset (x8, 0)
-	cfi_rel_offset (x9, 8)
+	cfi_rel_offset (x29, 0)
+	mov	x29, sp
 
 	stp	x6, x7, [sp,  #16]
-	cfi_rel_offset (x6, 16)
-	cfi_rel_offset (x7, 24)
-
 	stp	x4, x5, [sp,  #32]
-	cfi_rel_offset (x4, 32)
-	cfi_rel_offset (x5, 40)
-
 	stp	x2, x3, [sp,  #48]
-	cfi_rel_offset (x2, 48)
-	cfi_rel_offset (x3, 56)
-
 	stp	x0, x1, [sp,  #64]
-	cfi_rel_offset (x0, 64)
-	cfi_rel_offset (x1, 72)
-
 	stp	q0, q1, [sp, #(80+0*16)]
-	cfi_rel_offset (q0, 80+0*16)
-	cfi_rel_offset (q1, 80+1*16)
-
 	stp	q2, q3, [sp, #(80+2*16)]
-	cfi_rel_offset (q0, 80+2*16)
-	cfi_rel_offset (q1, 80+3*16)
-
 	stp	q4, q5, [sp, #(80+4*16)]
-	cfi_rel_offset (q0, 80+4*16)
-	cfi_rel_offset (q1, 80+5*16)
-
 	stp	q6, q7, [sp, #(80+6*16)]
-	cfi_rel_offset (q0, 80+6*16)
-	cfi_rel_offset (q1, 80+7*16)
 
 	/* Get pointer to linker struct.  */
 	ldr	PTR_REG (0), [ip0, #-PTR_SIZE]
@@ -101,25 +78,26 @@ _dl_runtime_resolve:
 	mov	ip0, x0
 
 	/* Get arguments and return address back.  */
-	ldp	q0, q1, [sp, #(80+0*16)]
-	ldp	q2, q3, [sp, #(80+2*16)]
-	ldp	q4, q5, [sp, #(80+4*16)]
+	ldr	lr, [sp, #80+8*16+8]
 	ldp	q6, q7, [sp, #(80+6*16)]
+	ldp	q4, q5, [sp, #(80+4*16)]
+	ldp	q2, q3, [sp, #(80+2*16)]
+	ldp	q0, q1, [sp, #(80+0*16)]
 	ldp	x0, x1, [sp, #64]
 	ldp	x2, x3, [sp, #48]
 	ldp	x4, x5, [sp, #32]
 	ldp	x6, x7, [sp, #16]
-	ldp	x8, x9, [sp], #(80+8*16)
-	cfi_adjust_cfa_offset (-(80+8*16))
-
-	ldp	ip1, lr, [sp], #16
-	cfi_adjust_cfa_offset (-16)
+	ldp	x29, x8, [sp], 80+8*16+16
+	cfi_adjust_cfa_offset (-(80+8*16+16))
+	cfi_restore (lr)
+	cfi_restore (x29)
 
 	/* Jump to the newly found address.  */
 	br	ip0
 
 	cfi_endproc
 	.size _dl_runtime_resolve, .-_dl_runtime_resolve
+
 #ifndef PROF
 	.globl _dl_runtime_profile
 	.type _dl_runtime_profile, #function
-- 
2.17.1


* Re: [PATCH 0/3] aarch64: Update ld.so for vector abi
  2018-08-01 22:23 [PATCH 0/3] aarch64: Update ld.so for vector abi rth
                   ` (2 preceding siblings ...)
  2018-08-01 22:24 ` [PATCH 1/3] aarch64: Clean up _dl_runtime_resolve rth
@ 2018-08-01 22:32 ` Richard Henderson
  2018-08-02  9:26   ` Szabolcs Nagy
  2018-08-02 10:44 ` Szabolcs Nagy
  2018-08-02 15:24 ` Carlos O'Donell
  5 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2018-08-01 22:32 UTC (permalink / raw)
  To: libc-alpha; +Cc: marcus.shawcroft, szabolcs.nagy, Richard Henderson

On 08/01/2018 06:23 PM, rth@twiddle.net wrote:
> There is a new calling convention defined for vectorized functions [1].

I should have added that it appears that armclang is already
using this ABI, and QEMU has received a bug report about "not
working properly" with dynamic linking.

Whee!


r~


* Re: [PATCH 3/3] aarch64: Save and restore SVE registers in ld.so
  2018-08-01 22:24 ` [PATCH 3/3] aarch64: Save and restore SVE registers in ld.so rth
@ 2018-08-02  7:29   ` Florian Weimer
  2018-08-02 13:05     ` Richard Henderson
  0 siblings, 1 reply; 27+ messages in thread
From: Florian Weimer @ 2018-08-02  7:29 UTC (permalink / raw)
  To: rth, libc-alpha; +Cc: marcus.shawcroft, szabolcs.nagy, Richard Henderson

On 08/02/2018 12:23 AM, rth@twiddle.net wrote:
> +/*
> + * For functions conforming to the procedure call standard as
> + * amended for SVE support (ARM_100986_0000_00_en (SVEpcs 00bet1)),
> + * we must save the entire contents of Z0-Z7 as well as P0-P3.
> + */

What's the worst-case additional stack usage due to this change?

Thanks,
Florian


* Re: [PATCH 0/3] aarch64: Update ld.so for vector abi
  2018-08-01 22:32 ` [PATCH 0/3] aarch64: Update ld.so for vector abi Richard Henderson
@ 2018-08-02  9:26   ` Szabolcs Nagy
  0 siblings, 0 replies; 27+ messages in thread
From: Szabolcs Nagy @ 2018-08-02  9:26 UTC (permalink / raw)
  To: Richard Henderson, libc-alpha; +Cc: nd, marcus.shawcroft, Richard Henderson

On 01/08/18 23:32, Richard Henderson wrote:
> On 08/01/2018 06:23 PM, rth@twiddle.net wrote:
>> There is a new calling convention defined for vectorized functions [1].
> 
> I should have added that it appears that armclang is already
> using this ABI, and QEMU has received a bug report about "not
> working properly" with dynamic linking.
> 
> Whee!
> 

this abi is only usable without lazy binding otherwise
it's not backward compatible with existing dynamic linkers.

the pcs document does not talk about this since lazy
binding is for elf targets only, i think it's possible
to make the code generation backward compatible: e.g.
make the vector pcs attribute imply the noplt attribute,
this fixes pic/pie code, for executables somehow bindnow
should be turned on (or move the corresponding relocations
out from DT_JMPREL part of the relocation table so the
dynamic linker processes them at load time)


* Re: [PATCH 0/3] aarch64: Update ld.so for vector abi
  2018-08-01 22:23 [PATCH 0/3] aarch64: Update ld.so for vector abi rth
                   ` (3 preceding siblings ...)
  2018-08-01 22:32 ` [PATCH 0/3] aarch64: Update ld.so for vector abi Richard Henderson
@ 2018-08-02 10:44 ` Szabolcs Nagy
  2018-08-02 13:17   ` Carlos O'Donell
  2018-08-02 15:24 ` Carlos O'Donell
  5 siblings, 1 reply; 27+ messages in thread
From: Szabolcs Nagy @ 2018-08-02 10:44 UTC (permalink / raw)
  To: rth, libc-alpha; +Cc: nd, marcus.shawcroft, Richard Henderson

On 01/08/18 23:23, rth@twiddle.net wrote:
> From: Richard Henderson <richard.henderson@linaro.org>
> 
> There is a new calling convention defined for vectorized functions [1].
> 
> This is similar to what has happened for x86_64, where the original ABI
> did not pass or preserve full vector contents, but then a new ABI is
> defined that does.
> 
> There was an old patch for [BZ #15128] that saved the full AdvSIMD registers
> in _dl_runtime_resolve, but failed to do so for _dl_runtime_profile.
> In the chatter for the BZ [2], Marcus Shawcroft mentions that it should
> save and restore d0-d7, which is indeed correct for the original ABI.
> It is not clear from the BZ why q0-q7 are saved instead.
> 

i think that comment was wrong, q0-q7 are argument registers in
the pcs so they have to be saved/restored (otherwise long double
args would be clobbered)

> That said, the new abi for AdvSIMD does use q0-q7.

the new abi also requires q8-q25 to be preserved (callee saved regs
for the vector pcs, but not for the normal pcs so the dynamic linker
may clobber them, although current code may not do so)

> When SVE is enabled, we need to save even more: z0-z7 plus p0-p3.

note that z8-z31 and p4-p15 have to be preserved across an sve
vector call, but the dynamic linker may clobber the z regs so
saving z0-z7 is not enough for lazy binding to work.

if a process touches sve regs then the kernel remembers that
the process uses sve and starts saving/restoring regs at kernel
entry/return; always accessing sve regs in the dynamic linker
defeats the purpose of that optimization.

> This fixes a number of minor issues with _dl_runtime_resolve itself,
> and more major issues with _dl_runtime_profile before copying those
> routines and making the modifications required for SVE.
> 

i will review the fixes, but the sve parts have to wait
(i will have to discuss with other ppl what to do with that).

> I have *not* attempted to extend the <bits/link.h> interface for
> the new ABI.  This should be done with more discussion on list.
> I have instead simply saved and restored registers as the abi
> requires, so that the actual callee gets the correct data.
> 

ok (ld audit is another reason to avoid plt for vector functions..)

> I have lightly tested this under QEMU, in that the new sve paths
> pass the same glibc tests as the old paths.
> 
> 
> r~
> 
> 
> [1] http://infocenter.arm.com/help/topic/com.arm.doc.ecm0665628/abi_sve_aapcs64_100986_0000_00_en.pdf
> [2] https://sourceware.org/bugzilla/show_bug.cgi?id=15128#c11
> 
> Richard Henderson (3):
>    aarch64: Clean up _dl_runtime_resolve
>    aarch64: Clean up _dl_runtime_profile
>    aarch64: Save and restore SVE registers in ld.so
> 
>   sysdeps/aarch64/dl-machine.h    |  13 +-
>   sysdeps/aarch64/dl-trampoline.S | 531 +++++++++++++++++++++++++-------
>   2 files changed, 438 insertions(+), 106 deletions(-)
> 


* Re: [PATCH 3/3] aarch64: Save and restore SVE registers in ld.so
  2018-08-02  7:29   ` Florian Weimer
@ 2018-08-02 13:05     ` Richard Henderson
  2018-08-02 13:33       ` Szabolcs Nagy
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2018-08-02 13:05 UTC (permalink / raw)
  To: Florian Weimer, rth, libc-alpha; +Cc: marcus.shawcroft, szabolcs.nagy

On 08/02/2018 03:29 AM, Florian Weimer wrote:
> On 08/02/2018 12:23 AM, rth@twiddle.net wrote:
>> +/*
>> + * For functions conforming to the procedure call standard as
>> + * amended for SVE support (ARM_100986_0000_00_en (SVEpcs 00bet1)),
>> + * we must save the entire contents of Z0-Z7 as well as P0-P3.
>> + */
> 
> What's the worst-case additional stack usage due to this change?

The current architectural maximum vector size is 256 bytes,
so 2176 bytes total.


r~


* Re: [PATCH 0/3] aarch64: Update ld.so for vector abi
  2018-08-02 10:44 ` Szabolcs Nagy
@ 2018-08-02 13:17   ` Carlos O'Donell
  2018-08-02 13:27     ` Szabolcs Nagy
  2018-08-02 14:29     ` Szabolcs Nagy
  0 siblings, 2 replies; 27+ messages in thread
From: Carlos O'Donell @ 2018-08-02 13:17 UTC (permalink / raw)
  To: Szabolcs Nagy, rth, libc-alpha; +Cc: nd, marcus.shawcroft, Richard Henderson

On 08/02/2018 06:44 AM, Szabolcs Nagy wrote:
> ok (ld audit is another reason to avoid plt for vector functions..)

If you avoid the PLT with no way to add back a call through the PLT
this completely breaks developer tooling that uses the PLT, like
LD_AUDIT, latrace, and custom tooling. Be aware that there are large
customers in various industry verticals that use LD_AUDIT extensively.

-- 
Cheers,
Carlos.


* Re: [PATCH 0/3] aarch64: Update ld.so for vector abi
  2018-08-02 13:17   ` Carlos O'Donell
@ 2018-08-02 13:27     ` Szabolcs Nagy
  2018-08-02 14:28       ` Carlos O'Donell
  2018-08-02 14:29     ` Szabolcs Nagy
  1 sibling, 1 reply; 27+ messages in thread
From: Szabolcs Nagy @ 2018-08-02 13:27 UTC (permalink / raw)
  To: Carlos O'Donell, rth, libc-alpha
  Cc: nd, marcus.shawcroft, Richard Henderson

On 02/08/18 14:17, Carlos O'Donell wrote:
> On 08/02/2018 06:44 AM, Szabolcs Nagy wrote:
>> ok (ld audit is another reason to avoid plt for vector functions..)
> 
> If you avoid the PLT with no way to add back a call through the PLT
> this completely breaks developer tooling that uses the PLT, like
> LD_AUDIT, latrace, and custom tooling. Be aware that there are large
> customers in various industry verticals that use LD_AUDIT extensively.
> 

ld_audit cannot possibly work correctly (variadic functions
with large arguments are broken and can cause runtime crashes)
so it is a 'best effort hack' that sometimes happens to work.

so i'm sad that users depend on it.. possibly because of
false advertisement somewhere that this is a reliable mechanism.


* Re: [PATCH 3/3] aarch64: Save and restore SVE registers in ld.so
  2018-08-02 13:05     ` Richard Henderson
@ 2018-08-02 13:33       ` Szabolcs Nagy
  2018-08-02 14:18         ` Carlos O'Donell
  0 siblings, 1 reply; 27+ messages in thread
From: Szabolcs Nagy @ 2018-08-02 13:33 UTC (permalink / raw)
  To: Richard Henderson, Florian Weimer, rth, libc-alpha; +Cc: nd, marcus.shawcroft

On 02/08/18 14:05, Richard Henderson wrote:
> On 08/02/2018 03:29 AM, Florian Weimer wrote:
>> On 08/02/2018 12:23 AM, rth@twiddle.net wrote:
>>> +/*
>>> + * For functions conforming to the procedure call standard as
>>> + * amended for SVE support (ARM_100986_0000_00_en (SVEpcs 00bet1)),
>>> + * we must save the entire contents of Z0-Z7 as well as P0-P3.
>>> + */
>>
>> What's the worst-case additional stack usage due to this change?
> 
> The current architectural maximum vector size is 256 bytes,
> so 2176 bytes total.
> 

that limit can increase in the future and saving 8 z regs is not enough,
you need to save all 32 if the resolver code may touch the 32 fp regs.
(so with 256byte regs it's 8k + 128bytes for 4 pregs)


* Re: [PATCH 3/3] aarch64: Save and restore SVE registers in ld.so
  2018-08-02 13:33       ` Szabolcs Nagy
@ 2018-08-02 14:18         ` Carlos O'Donell
  2018-08-02 14:26           ` Florian Weimer
  0 siblings, 1 reply; 27+ messages in thread
From: Carlos O'Donell @ 2018-08-02 14:18 UTC (permalink / raw)
  To: Szabolcs Nagy, Richard Henderson, Florian Weimer, rth, libc-alpha
  Cc: nd, marcus.shawcroft

On 08/02/2018 09:32 AM, Szabolcs Nagy wrote:
> On 02/08/18 14:05, Richard Henderson wrote:
>> On 08/02/2018 03:29 AM, Florian Weimer wrote:
>>> On 08/02/2018 12:23 AM, rth@twiddle.net wrote:
>>>> +/*
>>>> + * For functions conforming to the procedure call standard as
>>>> + * amended for SVE support (ARM_100986_0000_00_en (SVEpcs 00bet1)),
>>>> + * we must save the entire contents of Z0-Z7 as well as P0-P3.
>>>> + */
>>>
>>> What's the worst-case additional stack usage due to this change?
>>
>> The current architectural maximum vector size is 256 bytes,
>> so 2176 bytes total.
>>
> 
> that limit can increase in the future and saving 8 z regs is not enough,
> you need to save all 32 if the resolver code may touch the 32 fp regs.
> (so with 256byte regs it's 8k + 128bytes for 4 pregs)

AArch64 has PTHREAD_STACK_MIN of 128KiB so this should be more than enough
to save and restore SVE registers for some simple function calling.

-- 
Cheers,
Carlos.


* Re: [PATCH 3/3] aarch64: Save and restore SVE registers in ld.so
  2018-08-02 14:18         ` Carlos O'Donell
@ 2018-08-02 14:26           ` Florian Weimer
  2018-08-02 14:31             ` Carlos O'Donell
  0 siblings, 1 reply; 27+ messages in thread
From: Florian Weimer @ 2018-08-02 14:26 UTC (permalink / raw)
  To: Carlos O'Donell, Szabolcs Nagy, Richard Henderson, rth, libc-alpha
  Cc: nd, marcus.shawcroft

On 08/02/2018 04:18 PM, Carlos O'Donell wrote:
> On 08/02/2018 09:32 AM, Szabolcs Nagy wrote:
>> On 02/08/18 14:05, Richard Henderson wrote:
>>> On 08/02/2018 03:29 AM, Florian Weimer wrote:
>>>> On 08/02/2018 12:23 AM, rth@twiddle.net wrote:
>>>>> +/*
>>>>> + * For functions conforming to the procedure call standard as
>>>>> + * amended for SVE support (ARM_100986_0000_00_en (SVEpcs 00bet1)),
>>>>> + * we must save the entire contents of Z0-Z7 as well as P0-P3.
>>>>> + */
>>>>
>>>> What's the worst-case additional stack usage due to this change?
>>>
>>> The current architectural maximum vector size is 256 bytes,
>>> so 2176 bytes total.
>>>
>>
>> that limit can increase in the future and saving 8 z regs is not enough,
>> you need to save all 32 if the resolver code may touch the 32 fp regs.
>> (so with 256byte regs it's 8k + 128bytes for 4 pregs)
> 
> AArch64 has PTHREAD_STACK_MIN of 128KiB so this should be more than enough
> to save and restore SVE registers for some simple function calling.

On the other hand, there is also this:

/* Minimum stack size for a signal handler.  */
#define MINSIGSTKSZ     5120

/* System default stack size.  */
#define SIGSTKSZ        16384

Since the kernel will also push >8KiB on the signal handler stack, 
SIGSTKSZ is no longer sufficient.

Thanks,
Florian


* Re: [PATCH 0/3] aarch64: Update ld.so for vector abi
  2018-08-02 13:27     ` Szabolcs Nagy
@ 2018-08-02 14:28       ` Carlos O'Donell
  0 siblings, 0 replies; 27+ messages in thread
From: Carlos O'Donell @ 2018-08-02 14:28 UTC (permalink / raw)
  To: Szabolcs Nagy, rth, libc-alpha; +Cc: nd, marcus.shawcroft, Richard Henderson

On 08/02/2018 09:26 AM, Szabolcs Nagy wrote:
> On 02/08/18 14:17, Carlos O'Donell wrote:
>> On 08/02/2018 06:44 AM, Szabolcs Nagy wrote:
>>> ok (ld audit is another reason to avoid plt for vector functions..)
>>
>> If you avoid the PLT with no way to add back a call through the PLT
>> this completely breaks developer tooling that uses the PLT, like
>> LD_AUDIT, latrace, and custom tooling. Be aware that there are large
>> customers in various industry verticals that use LD_AUDIT extensively.
>>
> 
> ld_audit cannot possibly work correctly (because variadic
> functions with large arguments are broken, it can cause
> runtime crashes) so it is a 'best effort hack' that sometimes
> happens to work.

If it works for 100% of the cases the developer is interested in
supporting then it is perfect for the developer and the user can
use it without any problem.

Don't fall into the engineer's fallacy that because you can't make
it 100% provably correct that it is not useful or not valuable.

Everything we do, to some extent, is a "best effort hack" within
the limits of the machines we operate under, particularly the
constraints of memory buses, addressing, and reasonable performance.

The lack of unlimited variadic argument support is a corner case
where each architecture could document the maximum limit (of the 
stack copying we do) or we could implement something *newer* that
operates *like* LD_AUDIT but works differently.

> so i'm sad that users depend on it.. possibly because of
> false advertisement somewhere that this is a reliable mechanism. 
No one has made any false advertisements. What is happening is that
silicon vendors are looking for performance by removing the PLT.
The PLT has been around for long enough that there is developer
tooling that considers it a part of the ELF ABI. If you remove support
for it without providing a way to add it back, like we are discussing
with H.J. for x86_64, you will negatively impact your architecture's
ability to compete on a feature that developers use and deploy into
production. The feature, LD_AUDIT in particular, has been around since
Solaris implemented it, so it has become a relevant feature for some
large companies.

On x86_64 with -fno-plt, we hope to leave in the PLT and the required
relocations to route the GOT-direct calls through the PLT if LD_AUDIT
is enabled. If you can't do that, then you will have to document that
calls with the new vector ABI do not support LD_AUDIT, and again this
will impact some customers.

-- 
Cheers,
Carlos.


* Re: [PATCH 0/3] aarch64: Update ld.so for vector abi
  2018-08-02 13:17   ` Carlos O'Donell
  2018-08-02 13:27     ` Szabolcs Nagy
@ 2018-08-02 14:29     ` Szabolcs Nagy
  2018-08-02 14:36       ` Carlos O'Donell
  1 sibling, 1 reply; 27+ messages in thread
From: Szabolcs Nagy @ 2018-08-02 14:29 UTC (permalink / raw)
  To: Carlos O'Donell, rth, libc-alpha
  Cc: nd, marcus.shawcroft, Richard Henderson

On 02/08/18 14:17, Carlos O'Donell wrote:
> On 08/02/2018 06:44 AM, Szabolcs Nagy wrote:
>> ok (ld audit is another reason to avoid plt for vector functions..)
> 
> If you avoid the PLT with no way to add back a call through the PLT
> this completely breaks developer tooling that uses the PLT, like
> LD_AUDIT, latrace, and custom tooling. Be aware that there are large
> customers in various industry verticals that use LD_AUDIT extensively.
> 

it seems x86_64 also broke the 'plt entry abi' several times
(whenever new vector extensions were introduced)

we can do the same on aarch64, but that means plt entry does
not have a stable abi (new extensions don't work on old glibc)

in case of sve, supporting the plt entry has significant costs.
(stack usage, linux optimizations, save/restore overhead)
so i would like to avoid that.

it may be possible to keep the plt (and make ld_audit work),
but force bind now semantics for vector functions somehow, so
at least the lazy binding entry point remains backward compatible
and efficient.  but i'm not sure if the compiler can do this.


* Re: [PATCH 3/3] aarch64: Save and restore SVE registers in ld.so
  2018-08-02 14:26           ` Florian Weimer
@ 2018-08-02 14:31             ` Carlos O'Donell
  0 siblings, 0 replies; 27+ messages in thread
From: Carlos O'Donell @ 2018-08-02 14:31 UTC (permalink / raw)
  To: Florian Weimer, Szabolcs Nagy, Richard Henderson, rth, libc-alpha
  Cc: nd, marcus.shawcroft

On 08/02/2018 10:25 AM, Florian Weimer wrote:
> On 08/02/2018 04:18 PM, Carlos O'Donell wrote:
>> On 08/02/2018 09:32 AM, Szabolcs Nagy wrote:
>>> On 02/08/18 14:05, Richard Henderson wrote:
>>>> On 08/02/2018 03:29 AM, Florian Weimer wrote:
>>>>> On 08/02/2018 12:23 AM, rth@twiddle.net wrote:
>>>>>> +/*
>>>>>> + * For functions conforming to the procedure call standard as
>>>>>> + * amended for SVE support (ARM_100986_0000_00_en (SVEpcs 00bet1)),
>>>>>> + * we must save the entire contents of Z0-Z7 as well as P0-P3.
>>>>>> + */
>>>>>
>>>>> What's the worst-case additional stack usage due to this change?
>>>>
>>>> The current architectural maximum vector size is 256 bytes,
>>>> so 2176 bytes total.
>>>>
>>>
>>> that limit can increase in the future and saving 8 z regs is not enough,
>>> you need to save all 32 if the resolver code may touch the 32 fp regs.
>>> (so with 256byte regs it's 8k + 128bytes for 4 pregs)
>>
>> AArch64 has PTHREAD_STACK_MIN of 128KiB so this should be more than enough
>> to save and restore SVE registers for some simple function calling.
> 
> On the other hand, there is also this:
> 
> /* Minimum stack size for a signal handler.  */
> #define MINSIGSTKSZ     5120
> 
> /* System default stack size.  */
> #define SIGSTKSZ        16384
> 
> Since the kernel will also push >8KiB on the signal handler stack, SIGSTKSZ is no longer sufficient.

Absolutely correct.

Clearly old binaries won't work, and you have to bump these values and then
draw a line in the sand for SVE working, or not working.

We could bump SIGSTKSZ and MINSIGSTKSZ, and then use a flag day ABI change
with object style markup like x86 did for Intel CET.

You markup all of the objects, check the markup at runtime, and enable or
disable the use of SVE at runtime if you have a mixed set of objects that
can't possibly work.

-- 
Cheers,
Carlos.


* Re: [PATCH 0/3] aarch64: Update ld.so for vector abi
  2018-08-02 14:29     ` Szabolcs Nagy
@ 2018-08-02 14:36       ` Carlos O'Donell
  0 siblings, 0 replies; 27+ messages in thread
From: Carlos O'Donell @ 2018-08-02 14:36 UTC (permalink / raw)
  To: Szabolcs Nagy, rth, libc-alpha; +Cc: nd, marcus.shawcroft, Richard Henderson

On 08/02/2018 10:29 AM, Szabolcs Nagy wrote:
> On 02/08/18 14:17, Carlos O'Donell wrote:
>> On 08/02/2018 06:44 AM, Szabolcs Nagy wrote:
>>> ok (ld audit is another reason to avoid plt for vector functions..)
>>
>> If you avoid the PLT with no way to add back a call through the PLT
>> this completely breaks developer tooling that uses the PLT, like
>> LD_AUDIT, latrace, and custom tooling. Be aware that there are large
>> customers in various industry verticals that use LD_AUDIT extensively.
>>
> 
> it seems x86_64 also broke the 'plt entry abi' several times
> (whenever new vector extensions were introduced)

Yes, and that's OK IMO, because LD_AUDIT is developer tooling and it's
the kind of scenario where we can tell users that they need to upgrade
their audit objects.

> we can do the same on aarch64, but that means plt entry does
> not have a stable abi (new extensions don't work on old glibc)

Yes, and I think that's OK.

> in case of sve, supporting the plt entry has significant costs.
> (stack usage, linux optimizations, save/restore overhead)
> so i would like to avoid that.

That's your choice. My position is to give you enough information to
make an informed choice. I'm reminding you that my experience within
Red Hat is that the lack of LD_AUDIT will impact an important set of
customers. The lack of PLT may not, as long as you have a GOT entry
that you indirect through and the GOT can be rewritten, we may still
support some of the key developer uses.

I will continue to work with HJ to ensure x86_64 doesn't lose the
LD_AUDIT capabilities, but there it's a little easier, we have a GOT
entry for the indirect call, and an unused PLT entry (extra binary
space), and we can redirect the GOT entry to the unused PLT entry to
get back auditing.

> it may be possible to keep the plt (and make ld_audit work),
> but force bind now semantics for vector functions somehow, so
> at least the lazy binding entry point remains backward compatible
> and efficient.  but i'm not sure if the compiler can do this.

Right, I'm not suggesting you use the PLT entry; you just have it
there so you can use it in the case of LD_AUDIT, and have enough
relocation information present to make the vector functions callable
through the PLT entries. Is that possible?

-- 
Cheers,
Carlos.


* Re: [PATCH 0/3] aarch64: Update ld.so for vector abi
  2018-08-01 22:23 [PATCH 0/3] aarch64: Update ld.so for vector abi rth
                   ` (4 preceding siblings ...)
  2018-08-02 10:44 ` Szabolcs Nagy
@ 2018-08-02 15:24 ` Carlos O'Donell
  2018-08-02 15:52   ` Ramana Radhakrishnan
  2018-08-02 16:54   ` Richard Henderson
  5 siblings, 2 replies; 27+ messages in thread
From: Carlos O'Donell @ 2018-08-02 15:24 UTC (permalink / raw)
  To: rth, libc-alpha; +Cc: szabolcs.nagy, Richard Henderson

On 08/01/2018 06:23 PM, rth@twiddle.net wrote:
> From: Richard Henderson <richard.henderson@linaro.org>
> 
> There is a new calling convention defined for vectorized functions [1].

Correct me if I'm wrong.

(a) We have a lot of stuff to save/restore in SVE.

(b) It appears Szabolcs really wants to avoid the PLT at all with the
    new vector procedure call standard, since this avoids ever
    having to save/restore the large amounts of register data.

(b.1) Assumes that save and restore of SVE has serious negative performance
      consequences, both in userspace and in kernel save/restore for 
      context switches.

(c) A PLT generally has only one kind of save/restore ABI that it follows
    and it follows pessimistically the worst case to support all possible
    calling conventions.

(d) The compiler you are using is generating calls using the new ABI and
    those are going through the PLT, something in the dynamic loader is
    also using these registers and corrupting call results, otherwise
    you would never have made this patch to fix the problem.

If all this is true, I think this is the wrong solution.

The better solution for aarch64 is:

(1) All new-style SVE calls do *not* go through the PLT by default, but
    indirect through the GOT and are always bind-now.

(2) By default ld.so does not save/restore the SVE registers during
    lazy binding.

(3) If ld.so detects LD_AUDIT in use, or BIND_NOW=0, or lazy binding
    is being forced, then it flips to PLT save/restore sequences that
    do save all the required SVE registers, and routes the GOT entries
    to the PLT entries, and we get *slow* lazy binding semantics that
    work.

I don't expect you signed up for this, but that's my analysis.

> I have *not* attempted to extend the <bits/link.h> interface for
> the new ABI.  This should be done with more discussion on list.
> I have instead simply saved and restored registers as the abi
> requires, so that the actual callee gets the correct data.

We *should* adjust bits/link.h at the same time and extend it like
we did for x86_64. LD_AUDIT should work.

-- 
Cheers,
Carlos.


* Re: [PATCH 0/3] aarch64: Update ld.so for vector abi
  2018-08-02 15:24 ` Carlos O'Donell
@ 2018-08-02 15:52   ` Ramana Radhakrishnan
  2018-08-02 16:54   ` Richard Henderson
  1 sibling, 0 replies; 27+ messages in thread
From: Ramana Radhakrishnan @ 2018-08-02 15:52 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: rth, GNU C Library, Szabolcs Nagy, Richard Henderson

On Thu, Aug 2, 2018 at 4:24 PM, Carlos O'Donell <carlos@redhat.com> wrote:
> On 08/01/2018 06:23 PM, rth@twiddle.net wrote:
>> From: Richard Henderson <richard.henderson@linaro.org>
>>
>> There is a new calling convention defined for vectorized functions [1].
>
> Correct me if I'm wrong.
>
> (a) We have a lot of stuff to save/restore in SVE.
>
> (b) It appears Szabolcs really wants to avoid the PLT at all with the
>     new vector procedure call standard, since this avoids ever
>     having to save/restore the large amounts of register data.
>
> (b.1) Assumes that save and restore of SVE has serious negative performance
>       consequences, both in userspace and in kernel save/restore for
>       context switches.
>
> (c) A PLT generally has only one kind of save/restore ABI that it follows
>     and it follows pessimistically the worst case to support all possible
>     calling conventions.
>
> (d) The compiler you are using is generating calls using the new ABI and
>     those are going through the PLT, something in the dynamic loader is
>     also using these registers and corrupting call results, otherwise
>     you would never have made this patch to fix the problem.
>
> If all this is true, I think this is the wrong solution.
>
> The better solution for aarch64 is:
>
> (1) All new-style SVE calls do *not* go through the PLT by default, but
>     indirect through the GOT and are always bind-now.
>
> (2) By default ld.so does not save/restore the SVE registers during
>     lazy binding.
>
> (3) If ld.so detects LD_AUDIT in use, or BIND_NOW=0, or lazy binding
>     is being forced, then it flips to PLT save/restore sequences that
>     do save all the required SVE registers, and routes the GOT entries
>     to the PLT entries, and we get *slow* lazy binding semantics that
>     work.
>
> I don't expect you signed up for this, but that's my analysis.

This is a good summary. Thank you Carlos.

We would also need to discuss this internally with some ABI people before
taking a direction here.

regards
Ramana
>
>> I have *not* attempted to extend the <bits/link.h> interface for
>> the new ABI.  This should be done with more discussion on list.
>> I have instead simply saved and restored registers as the abi
>> requires, so that the actual callee gets the correct data.
>
> We *should* adjust bits/link.h at the same time and extend it like
> we did for x86_64. LD_AUDIT should work.
>
> --
> Cheers,
> Carlos.


* Re: [PATCH 1/3] aarch64: Clean up _dl_runtime_resolve
  2018-08-01 22:24 ` [PATCH 1/3] aarch64: Clean up _dl_runtime_resolve rth
@ 2018-08-02 16:03   ` Szabolcs Nagy
  2018-08-02 17:04     ` Richard Henderson
  0 siblings, 1 reply; 27+ messages in thread
From: Szabolcs Nagy @ 2018-08-02 16:03 UTC (permalink / raw)
  To: rth, libc-alpha; +Cc: nd, marcus.shawcroft, Richard Henderson

On 01/08/18 23:23, rth@twiddle.net wrote:
> From: Richard Henderson <richard.henderson@linaro.org>
> 
> 	* sysdeps/aarch64/dl-trampoline.S (_dl_runtime_resolve):
> 	Do not record unwind info for arguments; this is unneeded;
> 	do not save x9 just to have a register to pair with x8;
> 	properly include the 16 bytes of PLT stack into the unwind;
> 	create a frame pointer with the spare stack slot;
> 	rearrange the exit to only adjust the stack once.

i thought the cfi annotations were needed for all registers
in case the debugger wants to investigate register content
across a _dl_runtime_resolve frame (possibly several frames
up in the call stack), this may not be a common use case though
and i don't know what's the convention in glibc asm, the compiler
seems to emit annotation for all spilled registers with -g.

> --- a/sysdeps/aarch64/dl-trampoline.S
> +++ b/sysdeps/aarch64/dl-trampoline.S
> @@ -32,7 +32,6 @@
>   	.text
>   	.globl _dl_runtime_resolve
>   	.type _dl_runtime_resolve, #function
> -	cfi_startproc
>   	.align 2
>   _dl_runtime_resolve:
>   	/* AArch64 we get called with:
> @@ -41,46 +40,24 @@ _dl_runtime_resolve:
>   	   [sp, #8]	lr
>   	   [sp, #0]	&PLTGOT[n]
>   	 */
> -
> +	cfi_startproc

is there a problem keeping it at its original place above?
the tlsdesc asm has cfi_startproc at the same place.

otherwise the patch looks ok.


* Re: [PATCH 0/3] aarch64: Update ld.so for vector abi
  2018-08-02 15:24 ` Carlos O'Donell
  2018-08-02 15:52   ` Ramana Radhakrishnan
@ 2018-08-02 16:54   ` Richard Henderson
  1 sibling, 0 replies; 27+ messages in thread
From: Richard Henderson @ 2018-08-02 16:54 UTC (permalink / raw)
  To: Carlos O'Donell, rth, libc-alpha; +Cc: szabolcs.nagy

On 08/02/2018 11:24 AM, Carlos O'Donell wrote:
> On 08/01/2018 06:23 PM, rth@twiddle.net wrote:
>> From: Richard Henderson <richard.henderson@linaro.org>
>>
>> There is a new calling convention defined for vectorized functions [1].
> 
> Correct me if I'm wrong.
> 
> (a) We have a lot of stuff to save/restore in SVE.

Yes.

> 
> (b) It appears Szabolcs really wants to avoid the PLT at all with the
>     new vector procedure call stanadard, since this avoids ever
>     having to save/restore the large amounts of register data.

Yes.

> 
> (b.1) Assumes that save and restore of SVE has serious negative performance
>       consequences, both in userspace and in kernel save/restore for 
>       context switches.

Yes.  This is the biggie, really.

> (c) A PLT generally has only one kind of save/restore ABI that it follows
>     and it follows pessimistically the worse case to support all possible
>     calling conventions.

Yes.

> (d) The compiler you are using is generating calls using the new ABI and
>     those are going through the PLT, something in the dynamic loader is
>     also using these registers and corrupting call results, otherwise
>     you would never have made this patch to fix the problem.

Yes.  Indeed, the normal AdvSIMD restore within _dl_runtime_resolve is exactly
what clobbers the SVE state.

> The better solution for aarch64 is:
> 
> (1) All new-style SVE calls do *not* go through the PLT by default, but
>     indirect through the GOT and are always bind-now.

This would be the ideal solution, yes.

> I don't expect you signed up for this, but that's my analysis.

The right people are now talking about the problem, which is the main thing.

>> I have *not* attempted to extend the <bits/link.h> interface for
>> the new ABI.  This should be done with more discussion on list.
>> I have instead simply saved and restored registers as the abi
>> requires, so that the actual callee gets the correct data.
> 
> We *should* adjust bits/link.h at the same time and extend it like
> we did for x86_64. LD_AUDIT should work.

Yes.  We do need to talk about the design for this though; it's not as simple
as for x86_64.  I did comment on this more in patch 3:

+	   ??? Extending the profiling hook for full SVE register export
+	   is tricky given the variable register size.  Perhaps the new
+	   La_aarch64_regs should contain pointers to Z0 and P0, and
+	   the current VL, and one infers the addresses from there.
+
+	   This one new form could be used for all, with AdvSIMD
+	   devolving into VL=16 with no predicate registers.



r~


* Re: [PATCH 1/3] aarch64: Clean up _dl_runtime_resolve
  2018-08-02 16:03   ` Szabolcs Nagy
@ 2018-08-02 17:04     ` Richard Henderson
  2018-08-02 17:21       ` Szabolcs Nagy
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2018-08-02 17:04 UTC (permalink / raw)
  To: Szabolcs Nagy, libc-alpha; +Cc: nd, marcus.shawcroft, Richard Henderson

On 08/02/2018 12:03 PM, Szabolcs Nagy wrote:
> On 01/08/18 23:23, rth@twiddle.net wrote:
>> From: Richard Henderson <richard.henderson@linaro.org>
>>
>>     * sysdeps/aarch64/dl-trampoline.S (_dl_runtime_resolve):
>>     Do not record unwind info for arguments; this is unneeded;
>>     do not save x9 just to have a register to pair with x8;
>>     properly include the 16 bytes of PLT stack into the unwind;
>>     create a frame pointer with the spare stack slot;
>>     rearrange the exit to only adjust the stack once.
> 
> i thought the cfi annotations were needed for all registers
> in case the debugger wants to investigate register content
> across a _dl_runtime_resolve frame (possibly several frames
> up in the call stack),

However that's typically for the call-saved registers, where the compiler might
save data in that register across the call.  These are not call-saved
registers.  They are argument registers.  There will not be any debug info that
refers to them.

> this may not be a common use case though
> and i don't know what's the convention in glibc asm, the compiler
> seems to emit annotation for all spilled registers with -g.

Sure, because the compiler is spilling call-saved registers.
The others it just clobbers with no annotation.

>> -    cfi_startproc
>>       .align 2
>>   _dl_runtime_resolve:
>>       /* AArch64 we get called with:
>> @@ -41,46 +40,24 @@ _dl_runtime_resolve:
>>          [sp, #8]    lr
>>          [sp, #0]    &PLTGOT[n]
>>        */
>> -
>> +    cfi_startproc
> 
> is there a problem keeping it at its original place above?
> the tlsdesc asm has cfi_startproc at the same place.

It could stay where it is, but I thought it clearer to place the following two
annotations immediately adjacent (because it's state incoming, not anything we
are doing here, and should not be separated from the start).  Further, to place
all of the annotations immediately after the comment that describes why.


r~


* Re: [PATCH 1/3] aarch64: Clean up _dl_runtime_resolve
  2018-08-02 17:04     ` Richard Henderson
@ 2018-08-02 17:21       ` Szabolcs Nagy
  0 siblings, 0 replies; 27+ messages in thread
From: Szabolcs Nagy @ 2018-08-02 17:21 UTC (permalink / raw)
  To: Richard Henderson, libc-alpha; +Cc: nd, marcus.shawcroft, Richard Henderson

On 02/08/18 18:04, Richard Henderson wrote:
> On 08/02/2018 12:03 PM, Szabolcs Nagy wrote:
>> On 01/08/18 23:23, rth@twiddle.net wrote:
>>> From: Richard Henderson <richard.henderson@linaro.org>
>>>
>>>      * sysdeps/aarch64/dl-trampoline.S (_dl_runtime_resolve):
>>>      Do not record unwind info for arguments; this is unneeded;
>>>      do not save x9 just to have a register to pair with x8;
>>>      properly include the 16 bytes of PLT stack into the unwind;
>>>      create a frame pointer with the spare stack slot;
>>>      rearrange the exit to only adjust the stack once.
>>
>> i thought the cfi annotations were needed for all registers
>> in case the debugger wants to investigate register content
>> across a _dl_runtime_resolve frame (possibly several frames
>> up in the call stack),
> 
> However, that's typically for call-saved registers, where the compiler
> might save data in those registers across the call.  These are not
> call-saved registers; they are argument registers.  There will not be any
> debug info that refers to them.
> 
>> this may not be a common use case though
>> and i don't know what's the convention in glibc asm, the compiler
>> seems to emit annotation for all spilled registers with -g.
> 
> Sure, because the compiler is spilling call-saved registers.
> The others it just clobbers with no annotation.
> 
>>> -    cfi_startproc
>>>        .align 2
>>>    _dl_runtime_resolve:
>>>        /* AArch64 we get called with:
>>> @@ -41,46 +40,24 @@ _dl_runtime_resolve:
>>>           [sp, #8]    lr
>>>           [sp, #0]    &PLTGOT[n]
>>>         */
>>> -
>>> +    cfi_startproc
>>
>> is there a problem keeping it at its original place above?
>> the tlsdesc asm has cfi_startproc at the same place.
> 
> It could stay where it is, but I thought it clearer to place the following
> two annotations immediately adjacent, because they describe incoming state,
> not anything we are doing here, and should not be separated from the start.
> Further, this places all of the annotations immediately after the comment
> that explains them.
> 

i see, then this patch is OK to commit, thanks.


* Re: [PATCH 2/3] aarch64: Clean up _dl_runtime_profile
  2018-08-01 22:24 ` [PATCH 2/3] aarch64: Clean up _dl_runtime_profile rth
@ 2018-08-02 17:25   ` Szabolcs Nagy
  0 siblings, 0 replies; 27+ messages in thread
From: Szabolcs Nagy @ 2018-08-02 17:25 UTC (permalink / raw)
  To: rth, libc-alpha; +Cc: nd, marcus.shawcroft, Richard Henderson

On 01/08/18 23:23, rth@twiddle.net wrote:
> From: Richard Henderson <richard.henderson@linaro.org>
> 
> Not adjusting La_aarch64_regs or La_aarch64_retval for the new AdvSIMD
> vector ABI; that will require more thought and coordination.  In the
> meantime, this will at least pass the proper values to each callee,
> even if the values are not visible to auditing.
> 
> 	* sysdeps/aarch64/dl-trampoline.S (_dl_runtime_profile):
> 	Do not record unwind info for arguments -- this is unneeded;
> 	properly include the 16 bytes of PLT stack into the unwind;
> 	save and restore the structure return pointer, x8;
> 	save all of the AdvSIMD registers defined for the vector ABI.
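For context, the structures mentioned above belong to the AArch64
<bits/link.h> audit interface.  A sketch of their pre-vector-ABI shape
(field names follow the real `lr_` convention, but the layout here is an
illustrative approximation; the authoritative declarations are in
sysdeps/aarch64/bits/link.h) shows why the new ABI's values are invisible
to auditing: only the 64-bit d-register views of the SIMD argument
registers are recorded.

```c
#include <stdint.h>

/* Sketch of the audit register-save area for the base AAPCS64 ABI.
   A structure along these lines is what la_pltenter-style hooks receive;
   it has no room for the upper halves of q0-q7, let alone the SVE z/p
   registers, which is why the trampolines save and restore the full
   registers around the callee without widening this structure.  */
typedef struct La_aarch64_regs_sketch
{
  uint64_t lr_xreg[9];  /* x0-x7 arguments and x8 (indirect result).  */
  uint64_t lr_dreg[8];  /* d0-d7: the low 64 bits of q0-q7 only.  */
  uint64_t lr_sp;       /* Stack pointer at PLT entry.  */
  uint64_t lr_lr;       /* Return address.  */
} La_aarch64_regs_sketch;
```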

This is OK to commit, thanks.

Reviewed-By: Szabolcs Nagy


* Re: [PATCH 0/3] aarch64: Update ld.so for vector abi
  2018-08-02 10:43 ` Alex Bennée
@ 2018-08-02 13:33   ` Richard Henderson
  0 siblings, 0 replies; 27+ messages in thread
From: Richard Henderson @ 2018-08-02 13:33 UTC (permalink / raw)
  To: Alex Bennée, libc-alpha

On 08/02/2018 06:43 AM, Alex Bennée wrote:
> 
> I've tested that the linker works with the failing test case, so you can
> have a:
> 
>   Tested-by: Alex Bennée <alex.bennee@linaro.org>
> 
> The patches look sane to me, but I'm not really that familiar with the
> guts of glibc's resolver hooks.  One question, though: given that we gate
> on GLRO(dl_hwcap) & HWCAP_SVE, is this something that will be true for
> all binaries that run on a kernel on SVE-enabled hardware, even if the
> library or application doesn't use any SVE registers?

Yes, it would.  That is among the reasons why Szabolcs is saying these
functions should never go through the PLT at all, so that we don't have to
deal with it.

Which is of course an upstream compiler problem.
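A standalone sketch of the gate under discussion (hypothetical code; inside
ld.so the check reads GLRO(dl_hwcap) directly rather than the auxiliary
vector) makes the point concrete: the decision depends only on the
kernel-reported capability bit, so every process on SVE-capable hardware
takes the SVE path, whether or not it uses SVE itself.

```c
/* HWCAP_SVE is bit 22 of the AArch64 AT_HWCAP word (Linux asm/hwcap.h).  */
#define HWCAP_SVE_BIT (1UL << 22)

/* Select the trampoline variant from the hwcap word alone.  There is no
   per-binary input: an SVE-capable kernel/CPU enables the SVE save/restore
   path for everything, which is the overhead the question above is
   pointing at.  */
static int use_sve_trampoline(unsigned long hwcap)
{
  return (hwcap & HWCAP_SVE_BIT) != 0;
}
```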


r~


* Re: [PATCH 0/3] aarch64: Update ld.so for vector abi
       [not found] <20180801222347 dot 18903-1-rth at twiddle dot net>
@ 2018-08-02 10:43 ` Alex Bennée
  2018-08-02 13:33   ` Richard Henderson
  0 siblings, 1 reply; 27+ messages in thread
From: Alex Bennée @ 2018-08-02 10:43 UTC (permalink / raw)
  To: libc-alpha; +Cc: Richard Henderson


I've tested that the linker works with the failing test case, so you can
have a:

  Tested-by: Alex Bennée <alex.bennee@linaro.org>

The patches look sane to me, but I'm not really that familiar with the
guts of glibc's resolver hooks.  One question, though: given that we gate
on GLRO(dl_hwcap) & HWCAP_SVE, is this something that will be true for
all binaries that run on a kernel on SVE-enabled hardware, even if the
library or application doesn't use any SVE registers?

--
Alex Bennée


end of thread, other threads:[~2018-08-02 17:25 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-01 22:23 [PATCH 0/3] aarch64: Update ld.so for vector abi rth
2018-08-01 22:24 ` [PATCH 2/3] aarch64: Clean up _dl_runtime_profile rth
2018-08-02 17:25   ` Szabolcs Nagy
2018-08-01 22:24 ` [PATCH 3/3] aarch64: Save and restore SVE registers in ld.so rth
2018-08-02  7:29   ` Florian Weimer
2018-08-02 13:05     ` Richard Henderson
2018-08-02 13:33       ` Szabolcs Nagy
2018-08-02 14:18         ` Carlos O'Donell
2018-08-02 14:26           ` Florian Weimer
2018-08-02 14:31             ` Carlos O'Donell
2018-08-01 22:24 ` [PATCH 1/3] aarch64: Clean up _dl_runtime_resolve rth
2018-08-02 16:03   ` Szabolcs Nagy
2018-08-02 17:04     ` Richard Henderson
2018-08-02 17:21       ` Szabolcs Nagy
2018-08-01 22:32 ` [PATCH 0/3] aarch64: Update ld.so for vector abi Richard Henderson
2018-08-02  9:26   ` Szabolcs Nagy
2018-08-02 10:44 ` Szabolcs Nagy
2018-08-02 13:17   ` Carlos O'Donell
2018-08-02 13:27     ` Szabolcs Nagy
2018-08-02 14:28       ` Carlos O'Donell
2018-08-02 14:29     ` Szabolcs Nagy
2018-08-02 14:36       ` Carlos O'Donell
2018-08-02 15:24 ` Carlos O'Donell
2018-08-02 15:52   ` Ramana Radhakrishnan
2018-08-02 16:54   ` Richard Henderson
     [not found] <20180801222347 dot 18903-1-rth at twiddle dot net>
2018-08-02 10:43 ` Alex Bennée
2018-08-02 13:33   ` Richard Henderson
