public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
* [PATCH 2/4] S390: Use 64bit instruction to check for copies of > 1MB with mvcle.
  2016-04-26 12:08 [PATCH 1/4] S390: Use mvcle for copies > 1MB on 32bit with default memcpy variant Stefan Liebler
  2016-04-26 12:08 ` [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765] Stefan Liebler
  2016-04-26 12:08 ` [PATCH 3/4] S390: Do not call memcpy, memcmp, memset within libc.so via ifunc-plt Stefan Liebler
@ 2016-04-26 12:08 ` Stefan Liebler
  2016-04-26 14:16   ` Florian Weimer
  2 siblings, 1 reply; 36+ messages in thread
From: Stefan Liebler @ 2016-04-26 12:08 UTC (permalink / raw)
  To: libc-alpha; +Cc: carlos, Wilco.Dijkstra, neleai, Stefan Liebler

The __memcpy_default variant on s390 64bit calculates the number of
256byte blocks in a 64bit register and checks, if they exceed 1MB
to jump to mvcle. Otherwise a mvc-loop is used. The compare-instruction
only checks a 32bit value.
This patch uses a 64bit compare.

ChangeLog:

	* sysdeps/s390/s390-64/memcpy.S (memcpy):
	Use cghi instead of chi to compare 64bit value.
---
 sysdeps/s390/s390-64/memcpy.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sysdeps/s390/s390-64/memcpy.S b/sysdeps/s390/s390-64/memcpy.S
index e84a357..9d60a14 100644
--- a/sysdeps/s390/s390-64/memcpy.S
+++ b/sysdeps/s390/s390-64/memcpy.S
@@ -47,7 +47,7 @@ ENTRY(memcpy)
 .L_Z900_4:
 	br      %r14
 .L_Z900_13:
-	chi	%r5,4096             # Switch to mvcle for copies >1MB
+	cghi	%r5,4096            # Switch to mvcle for copies >1MB
 	jh      __memcpy_mvcle
 .L_Z900_12:
 	mvc     0(256,%r1),0(%r3)
-- 
2.3.0

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/4] S390: Use mvcle for copies > 1MB on 32bit with default memcpy variant.
@ 2016-04-26 12:08 Stefan Liebler
  2016-04-26 12:08 ` [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765] Stefan Liebler
                   ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Stefan Liebler @ 2016-04-26 12:08 UTC (permalink / raw)
  To: libc-alpha; +Cc: carlos, Wilco.Dijkstra, neleai, Stefan Liebler

If more than 255 bytes should be copied, the algorithm jumps away.
Before this patch, it jumps to the mvc-loop (.L_G5_12).
Now it jumps first to the "> 1MB" check, which jumps away to
__memcpy_mvcle. Otherwise, the mvc-loop (.L_G5_12) copies the bytes.

ChangeLog:

	* sysdeps/s390/s390-32/memcpy.S (memcpy):
	Jump to 1MB check before executing mvc-loop.
---
 sysdeps/s390/s390-32/memcpy.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sysdeps/s390/s390-32/memcpy.S b/sysdeps/s390/s390-32/memcpy.S
index 62ecbbf..2ac51ab 100644
--- a/sysdeps/s390/s390-32/memcpy.S
+++ b/sysdeps/s390/s390-32/memcpy.S
@@ -42,7 +42,7 @@ ENTRY(memcpy)
 	srl     %r5,8
 	ltr     %r5,%r5
 	lr      %r1,%r2
-	jne     .L_G5_12
+	jne     .L_G5_13
 	ex      %r4,.L_G5_17-.L_G5_16(%r13)
 .L_G5_4:
 	l       %r13,52(%r15)
-- 
2.3.0

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-04-26 12:08 [PATCH 1/4] S390: Use mvcle for copies > 1MB on 32bit with default memcpy variant Stefan Liebler
@ 2016-04-26 12:08 ` Stefan Liebler
  2016-04-26 13:33   ` Adhemerval Zanella
  2016-05-04 13:20   ` Stefan Liebler
  2016-04-26 12:08 ` [PATCH 3/4] S390: Do not call memcpy, memcmp, memset within libc.so via ifunc-plt Stefan Liebler
  2016-04-26 12:08 ` [PATCH 2/4] S390: Use 64bit instruction to check for copies of > 1MB with mvcle Stefan Liebler
  2 siblings, 2 replies; 36+ messages in thread
From: Stefan Liebler @ 2016-04-26 12:08 UTC (permalink / raw)
  To: libc-alpha; +Cc: carlos, Wilco.Dijkstra, neleai, Stefan Liebler

There exist optimized memcpy functions on s390, but no optimized mempcpy.
This patch adds mempcpy entry points in memcpy.S files, which
use the memcpy implementation. Now mempcpy itself is also an IFUNC function
as memcpy is and the variants are listed in ifunc-impl-list.c.

The s390 string.h defines _HAVE_STRING_ARCH_mempcpy and __mempcpy/mempcpy
definitions. If n is constant and small enough, __mempcpy/mempcpy is an inlined
function, which uses memcpy and returns dest + n. In this constant case,
GCC emits instructions like mvi or mvc and avoids the function call to memcpy.
If n is not constant, then __mempcpy is called.

ChangeLog:

	[BZ #19765]
	* sysdeps/s390/bits/string.h (mempcpy) Redirect to
	__mempcpy_inline_memcpy or __mempcpy_inline_mempcpy.
	(__mempcpy) Likewise.
	(__mempcpy_inline_mempcpy): New inline function.
	(__mempcpy_inline_memcpy): Likewise.
	(_HAVE_STRING_ARCH_mempcpy): New define.
	* sysdeps/s390/mempcpy.S: New File.
	* sysdeps/s390/multiarch/mempcpy.c: Likewise.
	* sysdeps/s390/multiarch/Makefile (sysdep_routines): Add mempcpy.
	* sysdeps/s390/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
	Add mempcpy variants.
	* sysdeps/s390/s390-32/memcpy.S: Add mempcpy entry point.
	(memcpy): Adjust to be usable from mempcpy entry point.
	(__memcpy_mvcle): Likewise.
	* sysdeps/s390/s390-64/memcpy.S: Likewise.
	* sysdeps/s390/s390-32/multiarch/memcpy-s390.S: Add entry points
	____mempcpy_z196, ____mempcpy_z10 and add __GI_ symbols for mempcpy.
	(__memcpy_z196): Adjust to be usable from mempcpy entry point.
	(__memcpy_z10): Likewise.
	* sysdeps/s390/s390-64/multiarch/memcpy-s390x.S: Likewise.
---
 sysdeps/s390/bits/string.h                    | 36 +++++++++++++++++++
 sysdeps/s390/mempcpy.S                        | 19 ++++++++++
 sysdeps/s390/multiarch/Makefile               |  3 +-
 sysdeps/s390/multiarch/ifunc-impl-list.c      |  7 ++++
 sysdeps/s390/multiarch/mempcpy.c              | 26 ++++++++++++++
 sysdeps/s390/s390-32/memcpy.S                 | 50 ++++++++++++++++-----------
 sysdeps/s390/s390-32/multiarch/memcpy-s390.S  | 31 +++++++++++++++--
 sysdeps/s390/s390-64/memcpy.S                 | 47 ++++++++++++++-----------
 sysdeps/s390/s390-64/multiarch/memcpy-s390x.S | 29 ++++++++++++++--
 9 files changed, 203 insertions(+), 45 deletions(-)
 create mode 100644 sysdeps/s390/mempcpy.S
 create mode 100644 sysdeps/s390/multiarch/mempcpy.c

diff --git a/sysdeps/s390/bits/string.h b/sysdeps/s390/bits/string.h
index 39e0b7f..eae3c8e 100644
--- a/sysdeps/s390/bits/string.h
+++ b/sysdeps/s390/bits/string.h
@@ -24,6 +24,42 @@
 /* Use the unaligned string inline ABI.  */
 #define _STRING_INLINE_unaligned 1
 
+#if defined __USE_GNU && defined __OPTIMIZE__ \
+    && defined __extern_always_inline && __GNUC_PREREQ (3,2)
+# if !defined _FORCE_INLINES && !defined _HAVE_STRING_ARCH_mempcpy
+
+/* Don't inline mempcpy into memcpy as s390 has an optimized mempcpy.  */
+#  define _HAVE_STRING_ARCH_mempcpy 1
+
+__extern_always_inline void *
+__mempcpy_inline_memcpy (void *__restrict __dest,
+			 const void *__restrict __src, size_t __n)
+{
+  return (char *) memcpy (__dest, __src, __n) + __n;
+}
+
+__extern_always_inline void *
+__mempcpy_inline_mempcpy (void *__restrict __dest,
+			 const void *__restrict __src, size_t __n)
+{
+  return __mempcpy (__dest, __src, __n);
+}
+
+/* If n is constant and small enough, __mempcpy/mempcpy is an inlined function,
+   which uses memcpy and returns dest + n. In this constant case, GCC emits
+   instructions like mvi or mvc and avoids the function call to memcpy.
+   If n is not constant, then __mempcpy is called.  */
+#  undef mempcpy
+#  undef __mempcpy
+#  define __mempcpy(dest, src, n)					\
+  (__extension__ (__builtin_constant_p (n) && n <= 65536		\
+		  ? __mempcpy_inline_memcpy (dest, src, n)		\
+		  : __mempcpy_inline_mempcpy (dest, src, n)))
+#  define mempcpy(dest, src, n) __mempcpy (dest, src, n)
+
+# endif
+#endif
+
 /* We only provide optimizations if the user selects them and if
    GNU CC is used.  */
 #if !defined __NO_STRING_INLINES && defined __USE_STRING_INLINES \
diff --git a/sysdeps/s390/mempcpy.S b/sysdeps/s390/mempcpy.S
new file mode 100644
index 0000000..c62074b
--- /dev/null
+++ b/sysdeps/s390/mempcpy.S
@@ -0,0 +1,19 @@
+/* CPU specific mempcpy without multiarch - 32/64 bit S/390 version.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* mempcpy is implemented in memcpy.S.  */
diff --git a/sysdeps/s390/multiarch/Makefile b/sysdeps/s390/multiarch/Makefile
index 0805b07..89324ca 100644
--- a/sysdeps/s390/multiarch/Makefile
+++ b/sysdeps/s390/multiarch/Makefile
@@ -18,7 +18,8 @@ sysdep_routines += strlen strlen-vx strlen-c \
 		   memchr memchr-vx \
 		   rawmemchr rawmemchr-vx rawmemchr-c \
 		   memccpy memccpy-vx memccpy-c \
-		   memrchr memrchr-vx memrchr-c
+		   memrchr memrchr-vx memrchr-c \
+		   mempcpy
 endif
 
 ifeq ($(subdir),wcsmbs)
diff --git a/sysdeps/s390/multiarch/ifunc-impl-list.c b/sysdeps/s390/multiarch/ifunc-impl-list.c
index 62a4359..89898ec 100644
--- a/sysdeps/s390/multiarch/ifunc-impl-list.c
+++ b/sysdeps/s390/multiarch/ifunc-impl-list.c
@@ -69,6 +69,13 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 			      S390_IS_Z10 (stfle_bits), __memcpy_z10)
 	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_default))
 
+  IFUNC_IMPL (i, name, mempcpy,
+	      IFUNC_IMPL_ADD (array, i, mempcpy,
+			      S390_IS_Z196 (stfle_bits), ____mempcpy_z196)
+	      IFUNC_IMPL_ADD (array, i, mempcpy,
+			      S390_IS_Z10 (stfle_bits), ____mempcpy_z10)
+	      IFUNC_IMPL_ADD (array, i, mempcpy, 1, ____mempcpy_default))
+
 #endif /* SHARED */
 
 #ifdef HAVE_S390_VX_ASM_SUPPORT
diff --git a/sysdeps/s390/multiarch/mempcpy.c b/sysdeps/s390/multiarch/mempcpy.c
new file mode 100644
index 0000000..34d8329
--- /dev/null
+++ b/sysdeps/s390/multiarch/mempcpy.c
@@ -0,0 +1,26 @@
+/* Multiple versions of mempcpy.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#if defined SHARED && IS_IN (libc)
+# include <ifunc-resolve.h>
+s390_libc_ifunc (__mempcpy)
+
+__asm__ (".weak mempcpy\n\t"
+	 ".set mempcpy,__mempcpy\n\t");
+#endif
diff --git a/sysdeps/s390/s390-32/memcpy.S b/sysdeps/s390/s390-32/memcpy.S
index 2ac51ab..6be5104 100644
--- a/sysdeps/s390/s390-32/memcpy.S
+++ b/sysdeps/s390/s390-32/memcpy.S
@@ -25,12 +25,23 @@
      %r3 = address of source memory area
      %r4 = number of bytes to copy.  */
 
-#ifdef USE_MULTIARCH
-ENTRY(__memcpy_default)
-#else
-ENTRY(memcpy)
+       .text
+ENTRY(__mempcpy)
+	.machine "g5"
+	lr      %r1,%r2             # Use as dest
+	la      %r2,0(%r4,%r2)      # Return dest + n
+	j	.L_G5_start
+END(__mempcpy)
+#ifndef USE_MULTIARCH
+libc_hidden_def (__mempcpy)
+weak_alias (__mempcpy, mempcpy)
+libc_hidden_builtin_def (mempcpy)
 #endif
+
+ENTRY(memcpy)
 	.machine "g5"
+	lr      %r1,%r2             # r1: Use as dest ; r2: Return dest
+.L_G5_start:
 	st      %r13,52(%r15)
 	.cfi_offset 13, -44
 	basr    %r13,0
@@ -41,14 +52,13 @@ ENTRY(memcpy)
 	lr      %r5,%r4
 	srl     %r5,8
 	ltr     %r5,%r5
-	lr      %r1,%r2
 	jne     .L_G5_13
 	ex      %r4,.L_G5_17-.L_G5_16(%r13)
 .L_G5_4:
 	l       %r13,52(%r15)
 	br      %r14
 .L_G5_13:
-	chi	%r5,4096             # Switch to mvcle for copies >1MB
+	chi	%r5,4096            # Switch to mvcle for copies >1MB
 	jh	__memcpy_mvcle
 .L_G5_12:
 	mvc     0(256,%r1),0(%r3)
@@ -59,24 +69,24 @@ ENTRY(memcpy)
 	j       .L_G5_4
 .L_G5_17:
 	mvc     0(1,%r1),0(%r3)
-#ifdef USE_MULTIARCH
-END(__memcpy_default)
-#else
 END(memcpy)
+#ifndef USE_MULTIARCH
 libc_hidden_builtin_def (memcpy)
 #endif
 
 ENTRY(__memcpy_mvcle)
-       # Using as standalone function will result in unexpected
-       # results since the length field is incremented by 1 in order to
-       # compensate the changes already done in the functions above.
-       ahi     %r4,1               # length + 1
-       lr      %r5,%r4             # source length
-       lr      %r4,%r3             # source address
-       lr      %r3,%r5             # destination length = source length
+	# Using as standalone function will result in unexpected
+	# results since the length field is incremented by 1 in order to
+	# compensate the changes already done in the functions above.
+	lr      %r0,%r2             # backup return dest [ + n ]
+	ahi     %r4,1               # length + 1
+	lr      %r5,%r4             # source length
+	lr      %r4,%r3             # source address
+	lr      %r2,%r1             # destination address
+	lr      %r3,%r5             # destination length = source length
 .L_MVCLE_1:
-       mvcle   %r2,%r4,0           # thats it, MVCLE is your friend
-       jo      .L_MVCLE_1
-       lr      %r2,%r1             # return destination address
-       br      %r14
+	mvcle   %r2,%r4,0           # thats it, MVCLE is your friend
+	jo      .L_MVCLE_1
+	lr      %r2,%r0             # return destination address
+	br      %r14
 END(__memcpy_mvcle)
diff --git a/sysdeps/s390/s390-32/multiarch/memcpy-s390.S b/sysdeps/s390/s390-32/multiarch/memcpy-s390.S
index 92ffaea..297a894 100644
--- a/sysdeps/s390/s390-32/multiarch/memcpy-s390.S
+++ b/sysdeps/s390/s390-32/multiarch/memcpy-s390.S
@@ -29,14 +29,23 @@
 
 #if defined SHARED && IS_IN (libc)
 
+ENTRY(____mempcpy_z196)
+	.machine "z196"
+	.machinemode "zarch_nohighgprs"
+	lr      %r1,%r2         # Use as dest
+	la      %r2,0(%r4,%r2)  # Return dest + n
+	j	.L_Z196_start
+END(____mempcpy_z196)
+
 ENTRY(__memcpy_z196)
 	.machine "z196"
 	.machinemode "zarch_nohighgprs"
+	lr      %r1,%r2         # r1: Use as dest ; r2: Return dest
+.L_Z196_start:
 	llgfr   %r4,%r4
 	ltgr    %r4,%r4
 	je      .L_Z196_4
 	aghi    %r4,-1
-	lr      %r1,%r2
 	srlg    %r5,%r4,8
 	ltgr    %r5,%r5
 	jne     .L_Z196_5
@@ -60,13 +69,22 @@ ENTRY(__memcpy_z196)
 	mvc     0(1,%r1),0(%r3)
 END(__memcpy_z196)
 
+ENTRY(____mempcpy_z10)
+	.machine "z10"
+	.machinemode "zarch_nohighgprs"
+	lr      %r1,%r2         # Use as dest
+	la      %r2,0(%r4,%r2)  # Return dest + n
+	j	.L_Z10_start
+END(____mempcpy_z10)
+
 ENTRY(__memcpy_z10)
 	.machine "z10"
 	.machinemode "zarch_nohighgprs"
+	lr      %r1,%r2         # r1: Use as dest ; r2: Return dest
+.L_Z10_start:
 	llgfr   %r4,%r4
 	cgije   %r4,0,.L_Z10_4
 	aghi    %r4,-1
-	lr      %r1,%r2
 	srlg    %r5,%r4,8
 	cgijlh  %r5,0,.L_Z10_13
 .L_Z10_3:
@@ -88,14 +106,23 @@ ENTRY(__memcpy_z10)
 	mvc     0(1,%r1),0(%r3)
 END(__memcpy_z10)
 
+# define __mempcpy ____mempcpy_default
 #endif /* SHARED && IS_IN (libc) */
 
+#define memcpy __memcpy_default
 #include "../memcpy.S"
+#undef memcpy
 
 #if defined SHARED && IS_IN (libc)
 .globl   __GI_memcpy
 .set     __GI_memcpy,__memcpy_default
+.globl   __GI_mempcpy
+.set     __GI_mempcpy,____mempcpy_default
+.globl   __GI___mempcpy
+.set     __GI___mempcpy,____mempcpy_default
 #else
 .globl   memcpy
 .set     memcpy,__memcpy_default
+.weak    mempcpy
+.set     mempcpy,__mempcpy
 #endif
diff --git a/sysdeps/s390/s390-64/memcpy.S b/sysdeps/s390/s390-64/memcpy.S
index 9d60a14..e3ace0c 100644
--- a/sysdeps/s390/s390-64/memcpy.S
+++ b/sysdeps/s390/s390-64/memcpy.S
@@ -27,19 +27,27 @@
 
 
        .text
+ENTRY(__mempcpy)
+	.machine "z900"
+	lgr     %r1,%r2             # Use as dest
+	la      %r2,0(%r4,%r2)      # Return dest + n
+	j	.L_Z900_start
+END(__mempcpy)
+#ifndef USE_MULTIARCH
+libc_hidden_def (__mempcpy)
+weak_alias (__mempcpy, mempcpy)
+libc_hidden_builtin_def (mempcpy)
+#endif
 
-#ifdef USE_MULTIARCH
-ENTRY(__memcpy_default)
-#else
 ENTRY(memcpy)
-#endif
 	.machine "z900"
+	lgr     %r1,%r2             # r1: Use as dest ; r2: Return dest
+.L_Z900_start:
 	ltgr    %r4,%r4
 	je      .L_Z900_4
 	aghi    %r4,-1
 	srlg    %r5,%r4,8
 	ltgr    %r5,%r5
-	lgr     %r1,%r2
 	jne     .L_Z900_13
 .L_Z900_3:
 	larl    %r5,.L_Z900_15
@@ -57,25 +65,24 @@ ENTRY(memcpy)
 	j       .L_Z900_3
 .L_Z900_15:
 	mvc     0(1,%r1),0(%r3)
-
-#ifdef USE_MULTIARCH
-END(__memcpy_default)
-#else
 END(memcpy)
+#ifndef USE_MULTIARCH
 libc_hidden_builtin_def (memcpy)
 #endif
 
 ENTRY(__memcpy_mvcle)
-       # Using as standalone function will result in unexpected
-       # results since the length field is incremented by 1 in order to
-       # compensate the changes already done in the functions above.
-       aghi    %r4,1               # length + 1
-       lgr     %r5,%r4             # source length
-       lgr     %r4,%r3             # source address
-       lgr     %r3,%r5             # destination length = source length
+	# Using as standalone function will result in unexpected
+	# results since the length field is incremented by 1 in order to
+	# compensate the changes already done in the functions above.
+	lgr     %r0,%r2             # backup return dest [ + n ]
+	aghi    %r4,1               # length + 1
+	lgr     %r5,%r4             # source length
+	lgr     %r4,%r3             # source address
+	lgr     %r2,%r1             # destination address
+	lgr     %r3,%r5             # destination length = source length
 .L_MVCLE_1:
-       mvcle   %r2,%r4,0           # thats it, MVCLE is your friend
-       jo      .L_MVCLE_1
-       lgr     %r2,%r1             # return destination address
-       br      %r14
+	mvcle   %r2,%r4,0           # thats it, MVCLE is your friend
+	jo      .L_MVCLE_1
+	lgr     %r2,%r0             # return destination address
+	br      %r14
 END(__memcpy_mvcle)
diff --git a/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S b/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S
index 8f54526..0f5a36e 100644
--- a/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S
+++ b/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S
@@ -29,12 +29,20 @@
 
 #if defined SHARED && IS_IN (libc)
 
+ENTRY(____mempcpy_z196)
+	.machine "z196"
+	lgr     %r1,%r2         # Use as dest
+	la      %r2,0(%r4,%r2)  # Return dest + n
+	j	.L_Z196_start
+END(____mempcpy_z196)
+
 ENTRY(__memcpy_z196)
 	.machine "z196"
+	lgr     %r1,%r2         # r1: Use as dest ; r2: Return dest
+.L_Z196_start:
 	ltgr    %r4,%r4
 	je      .L_Z196_4
 	aghi    %r4,-1
-	lgr     %r1,%r2
 	srlg    %r5,%r4,8
 	ltgr    %r5,%r5
 	jne     .L_Z196_5
@@ -58,11 +66,19 @@ ENTRY(__memcpy_z196)
 	mvc     0(1,%r1),0(%r3)
 END(__memcpy_z196)
 
+ENTRY(____mempcpy_z10)
+	.machine "z10"
+	lgr     %r1,%r2         # Use as dest
+	la      %r2,0(%r4,%r2)  # Return dest + n
+	j	.L_Z10_start
+END(____mempcpy_z10)
+
 ENTRY(__memcpy_z10)
 	.machine "z10"
+	lgr     %r1,%r2         # r1: Use as dest ; r2: Return dest
+.L_Z10_start:
 	cgije   %r4,0,.L_Z10_4
 	aghi    %r4,-1
-	lgr     %r1,%r2
 	srlg    %r5,%r4,8
 	cgijlh  %r5,0,.L_Z10_13
 .L_Z10_3:
@@ -84,14 +100,23 @@ ENTRY(__memcpy_z10)
 	mvc     0(1,%r1),0(%r3)
 END(__memcpy_z10)
 
+# define __mempcpy ____mempcpy_default
 #endif /* SHARED && IS_IN (libc) */
 
+#define memcpy __memcpy_default
 #include "../memcpy.S"
+#undef memcpy
 
 #if defined SHARED && IS_IN (libc)
 .globl   __GI_memcpy
 .set     __GI_memcpy,__memcpy_default
+.globl   __GI_mempcpy
+.set     __GI_mempcpy,____mempcpy_default
+.globl   __GI___mempcpy
+.set     __GI___mempcpy,____mempcpy_default
 #else
 .globl   memcpy
 .set     memcpy,__memcpy_default
+.weak    mempcpy
+.set     mempcpy,__mempcpy
 #endif
-- 
2.3.0

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 3/4] S390: Do not call memcpy, memcmp, memset within libc.so via ifunc-plt.
  2016-04-26 12:08 [PATCH 1/4] S390: Use mvcle for copies > 1MB on 32bit with default memcpy variant Stefan Liebler
  2016-04-26 12:08 ` [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765] Stefan Liebler
@ 2016-04-26 12:08 ` Stefan Liebler
  2016-04-26 13:35   ` Adhemerval Zanella
  2016-04-26 12:08 ` [PATCH 2/4] S390: Use 64bit instruction to check for copies of > 1MB with mvcle Stefan Liebler
  2 siblings, 1 reply; 36+ messages in thread
From: Stefan Liebler @ 2016-04-26 12:08 UTC (permalink / raw)
  To: libc-alpha; +Cc: carlos, Wilco.Dijkstra, neleai, Stefan Liebler

On s390, the memcpy, memcmp, memset functions are IFUNC symbols,
which are created with s390_libc_ifunc-macro.
This macro creates a __GI_ symbol which is set to the
ifunced symbol. Thus calls within libc.so to e.g. memcpy
result in a call to *ABS*+0x954c0@plt stub and afterwards
to the resolved memcpy-ifunc-variant.

This patch sets the __GI_ symbol to the default-ifunc-variant
to avoid the plt call. The __GI_ symbols are now created at the
default variant of ifunced function.

ChangeLog:

	* sysdeps/s390/multiarch/ifunc-resolve.h (s390_libc_ifunc):
	Remove __GI_ symbol.
	* sysdeps/s390/s390-32/multiarch/memcmp-s390.S: Add __GI_memcmp symbol.
	* sysdeps/s390/s390-64/multiarch/memcmp-s390x.S: Likewise.
	* sysdeps/s390/s390-32/multiarch/memcpy-s390.S: Add __GI_memcpy symbol.
	* sysdeps/s390/s390-64/multiarch/memcpy-s390x.S: Likewise.
	* sysdeps/s390/s390-32/multiarch/memset-s390.S: Add __GI_memset symbol.
	* sysdeps/s390/s390-64/multiarch/memset-s390x.S: Likewise.
---
 sysdeps/s390/multiarch/ifunc-resolve.h        | 4 +---
 sysdeps/s390/s390-32/multiarch/memcmp-s390.S  | 3 +++
 sysdeps/s390/s390-32/multiarch/memcpy-s390.S  | 5 ++++-
 sysdeps/s390/s390-32/multiarch/memset-s390.S  | 3 +++
 sysdeps/s390/s390-64/multiarch/memcmp-s390x.S | 3 +++
 sysdeps/s390/s390-64/multiarch/memcpy-s390x.S | 5 ++++-
 sysdeps/s390/s390-64/multiarch/memset-s390x.S | 3 +++
 7 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/sysdeps/s390/multiarch/ifunc-resolve.h b/sysdeps/s390/multiarch/ifunc-resolve.h
index 744a0d8..26e097a 100644
--- a/sysdeps/s390/multiarch/ifunc-resolve.h
+++ b/sysdeps/s390/multiarch/ifunc-resolve.h
@@ -44,9 +44,7 @@
 #define s390_libc_ifunc(FUNC)						\
   __asm__ (".globl " #FUNC "\n\t"					\
 	   ".type  " #FUNC ",@gnu_indirect_function\n\t"		\
-	   ".set   " #FUNC ",__resolve_" #FUNC "\n\t"			\
-	   ".globl __GI_" #FUNC "\n\t"					\
-	   ".set   __GI_" #FUNC "," #FUNC "\n");			\
+	   ".set   " #FUNC ",__resolve_" #FUNC "\n\t");			\
 									\
   /* Make the declarations of the optimized functions hidden in order
      to prevent GOT slots being generated for them. */			\
diff --git a/sysdeps/s390/s390-32/multiarch/memcmp-s390.S b/sysdeps/s390/s390-32/multiarch/memcmp-s390.S
index e9ee6d2..a01f3b7 100644
--- a/sysdeps/s390/s390-32/multiarch/memcmp-s390.S
+++ b/sysdeps/s390/s390-32/multiarch/memcmp-s390.S
@@ -101,4 +101,7 @@ END(__memcmp_z10)
 .set     memcmp,__memcmp_default
 .weak    bcmp
 .set	 bcmp,__memcmp_default
+#elif defined SHARED && IS_IN (libc)
+.globl   __GI_memcmp
+.set     __GI_memcmp,__memcmp_default
 #endif
diff --git a/sysdeps/s390/s390-32/multiarch/memcpy-s390.S b/sysdeps/s390/s390-32/multiarch/memcpy-s390.S
index 4e30cdf..92ffaea 100644
--- a/sysdeps/s390/s390-32/multiarch/memcpy-s390.S
+++ b/sysdeps/s390/s390-32/multiarch/memcpy-s390.S
@@ -92,7 +92,10 @@ END(__memcpy_z10)
 
 #include "../memcpy.S"
 
-#if !defined SHARED || !IS_IN (libc)
+#if defined SHARED && IS_IN (libc)
+.globl   __GI_memcpy
+.set     __GI_memcpy,__memcpy_default
+#else
 .globl   memcpy
 .set     memcpy,__memcpy_default
 #endif
diff --git a/sysdeps/s390/s390-32/multiarch/memset-s390.S b/sysdeps/s390/s390-32/multiarch/memset-s390.S
index 47277c1..a2ddd98 100644
--- a/sysdeps/s390/s390-32/multiarch/memset-s390.S
+++ b/sysdeps/s390/s390-32/multiarch/memset-s390.S
@@ -110,4 +110,7 @@ END(__memset_mvcle)
 #if !IS_IN (libc)
 .globl   memset
 .set     memset,__memset_default
+#elif defined SHARED && IS_IN (libc)
+.globl   __GI_memset
+.set     __GI_memset,__memset_default
 #endif
diff --git a/sysdeps/s390/s390-64/multiarch/memcmp-s390x.S b/sysdeps/s390/s390-64/multiarch/memcmp-s390x.S
index 2a4c0ae..b28ccaf 100644
--- a/sysdeps/s390/s390-64/multiarch/memcmp-s390x.S
+++ b/sysdeps/s390/s390-64/multiarch/memcmp-s390x.S
@@ -98,4 +98,7 @@ END(__memcmp_z10)
 .set     memcmp,__memcmp_default
 .weak    bcmp
 .set	 bcmp,__memcmp_default
+#elif defined SHARED && IS_IN (libc)
+.globl   __GI_memcmp
+.set     __GI_memcmp,__memcmp_default
 #endif
diff --git a/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S b/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S
index 69fa562..8f54526 100644
--- a/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S
+++ b/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S
@@ -88,7 +88,10 @@ END(__memcpy_z10)
 
 #include "../memcpy.S"
 
-#if !defined SHARED || !IS_IN (libc)
+#if defined SHARED && IS_IN (libc)
+.globl   __GI_memcpy
+.set     __GI_memcpy,__memcpy_default
+#else
 .globl   memcpy
 .set     memcpy,__memcpy_default
 #endif
diff --git a/sysdeps/s390/s390-64/multiarch/memset-s390x.S b/sysdeps/s390/s390-64/multiarch/memset-s390x.S
index 05e0682..a77e798 100644
--- a/sysdeps/s390/s390-64/multiarch/memset-s390x.S
+++ b/sysdeps/s390/s390-64/multiarch/memset-s390x.S
@@ -106,4 +106,7 @@ END(__memset_mvcle)
 #if !IS_IN (libc)
 .globl   memset
 .set     memset,__memset_default
+#elif defined SHARED && IS_IN (libc)
+.globl   __GI_memset
+.set     __GI_memset,__memset_default
 #endif
-- 
2.3.0

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-04-26 12:08 ` [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765] Stefan Liebler
@ 2016-04-26 13:33   ` Adhemerval Zanella
  2016-04-27  8:15     ` Stefan Liebler
  2016-05-04 13:20   ` Stefan Liebler
  1 sibling, 1 reply; 36+ messages in thread
From: Adhemerval Zanella @ 2016-04-26 13:33 UTC (permalink / raw)
  To: libc-alpha



On 26/04/2016 09:07, Stefan Liebler wrote:
> There exist optimized memcpy functions on s390, but no optimized mempcpy.
> This patch adds mempcpy entry points in memcpy.S files, which
> use the memcpy implementation. Now mempcpy itself is also an IFUNC function
> as memcpy is and the variants are listed in ifunc-impl-list.c.
> 
> The s390 string.h defines _HAVE_STRING_ARCH_mempcpy and __mempcpy/mempcpy
> definitions. If n is constant and small enough, __mempcpy/mempcpy is an inlined
> function, which uses memcpy and returns dest + n. In this constant case,
> GCC emits instructions like mvi or mvc and avoids the function call to memcpy.
> If n is not constant, then __mempcpy is called.
> 

AFAIk GLIBC does not have any ports that implements mempcpy and not
memcpy and IMHO the inline version to convert mempcpy to memcpy is 
always a gain.

So I would suggest to just avoid adding arch-specific bits and move the
mempcpy inline optimization at string/string.h to string/string2.h as

--

#define mempcpy(dest, src, n) __mempcpy_inline (dest, src, n)
#define __mempcpy(dest, src, n) __mempcpy_inline (dest, src, n)

__extern_always_inline void *
__mempcpy_inline (void *__restrict __dest,
                  const void *__restrict __src, size_t __n)
{
  return (char *) memcpy (__dest, __src, __n) + __n;
}

--

This is convert any mempcpy to mempcy plus length and provide
the symbol only for older binaries.

> ChangeLog:
> 
> 	[BZ #19765]
> 	* sysdeps/s390/bits/string.h (mempcpy) Redirect to
> 	__mempcpy_inline_memcpy or __mempcpy_inline_mempcpy.
> 	(__mempcpy) Likewise.
> 	(__mempcpy_inline_mempcpy): New inline function.
> 	(__mempcpy_inline_memcpy): Likewise.
> 	(_HAVE_STRING_ARCH_mempcpy): New define.
> 	* sysdeps/s390/mempcpy.S: New File.
> 	* sysdeps/s390/multiarch/mempcpy.c: Likewise.
> 	* sysdeps/s390/multiarch/Makefile (sysdep_routines): Add mempcpy.
> 	* sysdeps/s390/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
> 	Add mempcpy variants.
> 	* sysdeps/s390/s390-32/memcpy.S: Add mempcpy entry point.
> 	(memcpy): Adjust to be usable from mempcpy entry point.
> 	(__memcpy_mvcle): Likewise.
> 	* sysdeps/s390/s390-64/memcpy.S: Likewise.
> 	* sysdeps/s390/s390-32/multiarch/memcpy-s390.S: Add entry points
> 	____mempcpy_z196, ____mempcpy_z10 and add __GI_ symbols for mempcpy.
> 	(__memcpy_z196): Adjust to be usable from mempcpy entry point.
> 	(__memcpy_z10): Likewise.
> 	* sysdeps/s390/s390-64/multiarch/memcpy-s390x.S: Likewise.
> ---
>  sysdeps/s390/bits/string.h                    | 36 +++++++++++++++++++
>  sysdeps/s390/mempcpy.S                        | 19 ++++++++++
>  sysdeps/s390/multiarch/Makefile               |  3 +-
>  sysdeps/s390/multiarch/ifunc-impl-list.c      |  7 ++++
>  sysdeps/s390/multiarch/mempcpy.c              | 26 ++++++++++++++
>  sysdeps/s390/s390-32/memcpy.S                 | 50 ++++++++++++++++-----------
>  sysdeps/s390/s390-32/multiarch/memcpy-s390.S  | 31 +++++++++++++++--
>  sysdeps/s390/s390-64/memcpy.S                 | 47 ++++++++++++++-----------
>  sysdeps/s390/s390-64/multiarch/memcpy-s390x.S | 29 ++++++++++++++--
>  9 files changed, 203 insertions(+), 45 deletions(-)
>  create mode 100644 sysdeps/s390/mempcpy.S
>  create mode 100644 sysdeps/s390/multiarch/mempcpy.c
> 
> diff --git a/sysdeps/s390/bits/string.h b/sysdeps/s390/bits/string.h
> index 39e0b7f..eae3c8e 100644
> --- a/sysdeps/s390/bits/string.h
> +++ b/sysdeps/s390/bits/string.h
> @@ -24,6 +24,42 @@
>  /* Use the unaligned string inline ABI.  */
>  #define _STRING_INLINE_unaligned 1
>  
> +#if defined __USE_GNU && defined __OPTIMIZE__ \
> +    && defined __extern_always_inline && __GNUC_PREREQ (3,2)
> +# if !defined _FORCE_INLINES && !defined _HAVE_STRING_ARCH_mempcpy
> +
> +/* Don't inline mempcpy into memcpy as s390 has an optimized mempcpy.  */
> +#  define _HAVE_STRING_ARCH_mempcpy 1
> +
> +__extern_always_inline void *
> +__mempcpy_inline_memcpy (void *__restrict __dest,
> +			 const void *__restrict __src, size_t __n)
> +{
> +  return (char *) memcpy (__dest, __src, __n) + __n;
> +}
> +
> +__extern_always_inline void *
> +__mempcpy_inline_mempcpy (void *__restrict __dest,
> +			 const void *__restrict __src, size_t __n)
> +{
> +  return __mempcpy (__dest, __src, __n);
> +}
> +
> +/* If n is constant and small enough, __mempcpy/mempcpy is an inlined function,
> +   which uses memcpy and returns dest + n. In this constant case, GCC emits
> +   instructions like mvi or mvc and avoids the function call to memcpy.
> +   If n is not constant, then __mempcpy is called.  */
> +#  undef mempcpy
> +#  undef __mempcpy
> +#  define __mempcpy(dest, src, n)					\
> +  (__extension__ (__builtin_constant_p (n) && n <= 65536		\
> +		  ? __mempcpy_inline_memcpy (dest, src, n)		\
> +		  : __mempcpy_inline_mempcpy (dest, src, n)))
> +#  define mempcpy(dest, src, n) __mempcpy (dest, src, n)
> +
> +# endif
> +#endif
> +
>  /* We only provide optimizations if the user selects them and if
>     GNU CC is used.  */
>  #if !defined __NO_STRING_INLINES && defined __USE_STRING_INLINES \
> diff --git a/sysdeps/s390/mempcpy.S b/sysdeps/s390/mempcpy.S
> new file mode 100644
> index 0000000..c62074b
> --- /dev/null
> +++ b/sysdeps/s390/mempcpy.S
> @@ -0,0 +1,19 @@
> +/* CPU specific mempcpy without multiarch - 32/64 bit S/390 version.
> +   Copyright (C) 2016 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* mempcpy is implemented in memcpy.S.  */
> diff --git a/sysdeps/s390/multiarch/Makefile b/sysdeps/s390/multiarch/Makefile
> index 0805b07..89324ca 100644
> --- a/sysdeps/s390/multiarch/Makefile
> +++ b/sysdeps/s390/multiarch/Makefile
> @@ -18,7 +18,8 @@ sysdep_routines += strlen strlen-vx strlen-c \
>  		   memchr memchr-vx \
>  		   rawmemchr rawmemchr-vx rawmemchr-c \
>  		   memccpy memccpy-vx memccpy-c \
> -		   memrchr memrchr-vx memrchr-c
> +		   memrchr memrchr-vx memrchr-c \
> +		   mempcpy
>  endif
>  
>  ifeq ($(subdir),wcsmbs)
> diff --git a/sysdeps/s390/multiarch/ifunc-impl-list.c b/sysdeps/s390/multiarch/ifunc-impl-list.c
> index 62a4359..89898ec 100644
> --- a/sysdeps/s390/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/s390/multiarch/ifunc-impl-list.c
> @@ -69,6 +69,13 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>  			      S390_IS_Z10 (stfle_bits), __memcpy_z10)
>  	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_default))
>  
> +  IFUNC_IMPL (i, name, mempcpy,
> +	      IFUNC_IMPL_ADD (array, i, mempcpy,
> +			      S390_IS_Z196 (stfle_bits), ____mempcpy_z196)
> +	      IFUNC_IMPL_ADD (array, i, mempcpy,
> +			      S390_IS_Z10 (stfle_bits), ____mempcpy_z10)
> +	      IFUNC_IMPL_ADD (array, i, mempcpy, 1, ____mempcpy_default))
> +
>  #endif /* SHARED */
>  
>  #ifdef HAVE_S390_VX_ASM_SUPPORT
> diff --git a/sysdeps/s390/multiarch/mempcpy.c b/sysdeps/s390/multiarch/mempcpy.c
> new file mode 100644
> index 0000000..34d8329
> --- /dev/null
> +++ b/sysdeps/s390/multiarch/mempcpy.c
> @@ -0,0 +1,26 @@
> +/* Multiple versions of mempcpy.
> +   Copyright (C) 2016 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +
> +#if defined SHARED && IS_IN (libc)
> +# include <ifunc-resolve.h>
> +s390_libc_ifunc (__mempcpy)
> +
> +__asm__ (".weak mempcpy\n\t"
> +	 ".set mempcpy,__mempcpy\n\t");
> +#endif
> diff --git a/sysdeps/s390/s390-32/memcpy.S b/sysdeps/s390/s390-32/memcpy.S
> index 2ac51ab..6be5104 100644
> --- a/sysdeps/s390/s390-32/memcpy.S
> +++ b/sysdeps/s390/s390-32/memcpy.S
> @@ -25,12 +25,23 @@
>       %r3 = address of source memory area
>       %r4 = number of bytes to copy.  */
>  
> -#ifdef USE_MULTIARCH
> -ENTRY(__memcpy_default)
> -#else
> -ENTRY(memcpy)
> +       .text
> +ENTRY(__mempcpy)
> +	.machine "g5"
> +	lr      %r1,%r2             # Use as dest
> +	la      %r2,0(%r4,%r2)      # Return dest + n
> +	j	.L_G5_start
> +END(__mempcpy)
> +#ifndef USE_MULTIARCH
> +libc_hidden_def (__mempcpy)
> +weak_alias (__mempcpy, mempcpy)
> +libc_hidden_builtin_def (mempcpy)
>  #endif
> +
> +ENTRY(memcpy)
>  	.machine "g5"
> +	lr      %r1,%r2             # r1: Use as dest ; r2: Return dest
> +.L_G5_start:
>  	st      %r13,52(%r15)
>  	.cfi_offset 13, -44
>  	basr    %r13,0
> @@ -41,14 +52,13 @@ ENTRY(memcpy)
>  	lr      %r5,%r4
>  	srl     %r5,8
>  	ltr     %r5,%r5
> -	lr      %r1,%r2
>  	jne     .L_G5_13
>  	ex      %r4,.L_G5_17-.L_G5_16(%r13)
>  .L_G5_4:
>  	l       %r13,52(%r15)
>  	br      %r14
>  .L_G5_13:
> -	chi	%r5,4096             # Switch to mvcle for copies >1MB
> +	chi	%r5,4096            # Switch to mvcle for copies >1MB
>  	jh	__memcpy_mvcle
>  .L_G5_12:
>  	mvc     0(256,%r1),0(%r3)
> @@ -59,24 +69,24 @@ ENTRY(memcpy)
>  	j       .L_G5_4
>  .L_G5_17:
>  	mvc     0(1,%r1),0(%r3)
> -#ifdef USE_MULTIARCH
> -END(__memcpy_default)
> -#else
>  END(memcpy)
> +#ifndef USE_MULTIARCH
>  libc_hidden_builtin_def (memcpy)
>  #endif
>  
>  ENTRY(__memcpy_mvcle)
> -       # Using as standalone function will result in unexpected
> -       # results since the length field is incremented by 1 in order to
> -       # compensate the changes already done in the functions above.
> -       ahi     %r4,1               # length + 1
> -       lr      %r5,%r4             # source length
> -       lr      %r4,%r3             # source address
> -       lr      %r3,%r5             # destination length = source length
> +	# Using as standalone function will result in unexpected
> +	# results since the length field is incremented by 1 in order to
> +	# compensate the changes already done in the functions above.
> +	lr      %r0,%r2             # backup return dest [ + n ]
> +	ahi     %r4,1               # length + 1
> +	lr      %r5,%r4             # source length
> +	lr      %r4,%r3             # source address
> +	lr      %r2,%r1             # destination address
> +	lr      %r3,%r5             # destination length = source length
>  .L_MVCLE_1:
> -       mvcle   %r2,%r4,0           # thats it, MVCLE is your friend
> -       jo      .L_MVCLE_1
> -       lr      %r2,%r1             # return destination address
> -       br      %r14
> +	mvcle   %r2,%r4,0           # thats it, MVCLE is your friend
> +	jo      .L_MVCLE_1
> +	lr      %r2,%r0             # return destination address
> +	br      %r14
>  END(__memcpy_mvcle)
> diff --git a/sysdeps/s390/s390-32/multiarch/memcpy-s390.S b/sysdeps/s390/s390-32/multiarch/memcpy-s390.S
> index 92ffaea..297a894 100644
> --- a/sysdeps/s390/s390-32/multiarch/memcpy-s390.S
> +++ b/sysdeps/s390/s390-32/multiarch/memcpy-s390.S
> @@ -29,14 +29,23 @@
>  
>  #if defined SHARED && IS_IN (libc)
>  
> +ENTRY(____mempcpy_z196)
> +	.machine "z196"
> +	.machinemode "zarch_nohighgprs"
> +	lr      %r1,%r2         # Use as dest
> +	la      %r2,0(%r4,%r2)  # Return dest + n
> +	j	.L_Z196_start
> +END(____mempcpy_z196)
> +
>  ENTRY(__memcpy_z196)
>  	.machine "z196"
>  	.machinemode "zarch_nohighgprs"
> +	lr      %r1,%r2         # r1: Use as dest ; r2: Return dest
> +.L_Z196_start:
>  	llgfr   %r4,%r4
>  	ltgr    %r4,%r4
>  	je      .L_Z196_4
>  	aghi    %r4,-1
> -	lr      %r1,%r2
>  	srlg    %r5,%r4,8
>  	ltgr    %r5,%r5
>  	jne     .L_Z196_5
> @@ -60,13 +69,22 @@ ENTRY(__memcpy_z196)
>  	mvc     0(1,%r1),0(%r3)
>  END(__memcpy_z196)
>  
> +ENTRY(____mempcpy_z10)
> +	.machine "z10"
> +	.machinemode "zarch_nohighgprs"
> +	lr      %r1,%r2         # Use as dest
> +	la      %r2,0(%r4,%r2)  # Return dest + n
> +	j	.L_Z10_start
> +END(____mempcpy_z10)
> +
>  ENTRY(__memcpy_z10)
>  	.machine "z10"
>  	.machinemode "zarch_nohighgprs"
> +	lr      %r1,%r2         # r1: Use as dest ; r2: Return dest
> +.L_Z10_start:
>  	llgfr   %r4,%r4
>  	cgije   %r4,0,.L_Z10_4
>  	aghi    %r4,-1
> -	lr      %r1,%r2
>  	srlg    %r5,%r4,8
>  	cgijlh  %r5,0,.L_Z10_13
>  .L_Z10_3:
> @@ -88,14 +106,23 @@ ENTRY(__memcpy_z10)
>  	mvc     0(1,%r1),0(%r3)
>  END(__memcpy_z10)
>  
> +# define __mempcpy ____mempcpy_default
>  #endif /* SHARED && IS_IN (libc) */
>  
> +#define memcpy __memcpy_default
>  #include "../memcpy.S"
> +#undef memcpy
>  
>  #if defined SHARED && IS_IN (libc)
>  .globl   __GI_memcpy
>  .set     __GI_memcpy,__memcpy_default
> +.globl   __GI_mempcpy
> +.set     __GI_mempcpy,____mempcpy_default
> +.globl   __GI___mempcpy
> +.set     __GI___mempcpy,____mempcpy_default
>  #else
>  .globl   memcpy
>  .set     memcpy,__memcpy_default
> +.weak    mempcpy
> +.set     mempcpy,__mempcpy
>  #endif
> diff --git a/sysdeps/s390/s390-64/memcpy.S b/sysdeps/s390/s390-64/memcpy.S
> index 9d60a14..e3ace0c 100644
> --- a/sysdeps/s390/s390-64/memcpy.S
> +++ b/sysdeps/s390/s390-64/memcpy.S
> @@ -27,19 +27,27 @@
>  
>  
>         .text
> +ENTRY(__mempcpy)
> +	.machine "z900"
> +	lgr     %r1,%r2             # Use as dest
> +	la      %r2,0(%r4,%r2)      # Return dest + n
> +	j	.L_Z900_start
> +END(__mempcpy)
> +#ifndef USE_MULTIARCH
> +libc_hidden_def (__mempcpy)
> +weak_alias (__mempcpy, mempcpy)
> +libc_hidden_builtin_def (mempcpy)
> +#endif
>  
> -#ifdef USE_MULTIARCH
> -ENTRY(__memcpy_default)
> -#else
>  ENTRY(memcpy)
> -#endif
>  	.machine "z900"
> +	lgr     %r1,%r2             # r1: Use as dest ; r2: Return dest
> +.L_Z900_start:
>  	ltgr    %r4,%r4
>  	je      .L_Z900_4
>  	aghi    %r4,-1
>  	srlg    %r5,%r4,8
>  	ltgr    %r5,%r5
> -	lgr     %r1,%r2
>  	jne     .L_Z900_13
>  .L_Z900_3:
>  	larl    %r5,.L_Z900_15
> @@ -57,25 +65,24 @@ ENTRY(memcpy)
>  	j       .L_Z900_3
>  .L_Z900_15:
>  	mvc     0(1,%r1),0(%r3)
> -
> -#ifdef USE_MULTIARCH
> -END(__memcpy_default)
> -#else
>  END(memcpy)
> +#ifndef USE_MULTIARCH
>  libc_hidden_builtin_def (memcpy)
>  #endif
>  
>  ENTRY(__memcpy_mvcle)
> -       # Using as standalone function will result in unexpected
> -       # results since the length field is incremented by 1 in order to
> -       # compensate the changes already done in the functions above.
> -       aghi    %r4,1               # length + 1
> -       lgr     %r5,%r4             # source length
> -       lgr     %r4,%r3             # source address
> -       lgr     %r3,%r5             # destination length = source length
> +	# Using as standalone function will result in unexpected
> +	# results since the length field is incremented by 1 in order to
> +	# compensate the changes already done in the functions above.
> +	lgr     %r0,%r2             # backup return dest [ + n ]
> +	aghi    %r4,1               # length + 1
> +	lgr     %r5,%r4             # source length
> +	lgr     %r4,%r3             # source address
> +	lgr     %r2,%r1             # destination address
> +	lgr     %r3,%r5             # destination length = source length
>  .L_MVCLE_1:
> -       mvcle   %r2,%r4,0           # thats it, MVCLE is your friend
> -       jo      .L_MVCLE_1
> -       lgr     %r2,%r1             # return destination address
> -       br      %r14
> +	mvcle   %r2,%r4,0           # thats it, MVCLE is your friend
> +	jo      .L_MVCLE_1
> +	lgr     %r2,%r0             # return destination address
> +	br      %r14
>  END(__memcpy_mvcle)
> diff --git a/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S b/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S
> index 8f54526..0f5a36e 100644
> --- a/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S
> +++ b/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S
> @@ -29,12 +29,20 @@
>  
>  #if defined SHARED && IS_IN (libc)
>  
> +ENTRY(____mempcpy_z196)
> +	.machine "z196"
> +	lgr     %r1,%r2         # Use as dest
> +	la      %r2,0(%r4,%r2)  # Return dest + n
> +	j	.L_Z196_start
> +END(____mempcpy_z196)
> +
>  ENTRY(__memcpy_z196)
>  	.machine "z196"
> +	lgr     %r1,%r2         # r1: Use as dest ; r2: Return dest
> +.L_Z196_start:
>  	ltgr    %r4,%r4
>  	je      .L_Z196_4
>  	aghi    %r4,-1
> -	lgr     %r1,%r2
>  	srlg    %r5,%r4,8
>  	ltgr    %r5,%r5
>  	jne     .L_Z196_5
> @@ -58,11 +66,19 @@ ENTRY(__memcpy_z196)
>  	mvc     0(1,%r1),0(%r3)
>  END(__memcpy_z196)
>  
> +ENTRY(____mempcpy_z10)
> +	.machine "z10"
> +	lgr     %r1,%r2         # Use as dest
> +	la      %r2,0(%r4,%r2)  # Return dest + n
> +	j	.L_Z10_start
> +END(____mempcpy_z10)
> +
>  ENTRY(__memcpy_z10)
>  	.machine "z10"
> +	lgr     %r1,%r2         # r1: Use as dest ; r2: Return dest
> +.L_Z10_start:
>  	cgije   %r4,0,.L_Z10_4
>  	aghi    %r4,-1
> -	lgr     %r1,%r2
>  	srlg    %r5,%r4,8
>  	cgijlh  %r5,0,.L_Z10_13
>  .L_Z10_3:
> @@ -84,14 +100,23 @@ ENTRY(__memcpy_z10)
>  	mvc     0(1,%r1),0(%r3)
>  END(__memcpy_z10)
>  
> +# define __mempcpy ____mempcpy_default
>  #endif /* SHARED && IS_IN (libc) */
>  
> +#define memcpy __memcpy_default
>  #include "../memcpy.S"
> +#undef memcpy
>  
>  #if defined SHARED && IS_IN (libc)
>  .globl   __GI_memcpy
>  .set     __GI_memcpy,__memcpy_default
> +.globl   __GI_mempcpy
> +.set     __GI_mempcpy,____mempcpy_default
> +.globl   __GI___mempcpy
> +.set     __GI___mempcpy,____mempcpy_default
>  #else
>  .globl   memcpy
>  .set     memcpy,__memcpy_default
> +.weak    mempcpy
> +.set     mempcpy,__mempcpy
>  #endif
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] S390: Do not call memcpy, memcmp, memset within libc.so via ifunc-plt.
  2016-04-26 12:08 ` [PATCH 3/4] S390: Do not call memcpy, memcmp, memset within libc.so via ifunc-plt Stefan Liebler
@ 2016-04-26 13:35   ` Adhemerval Zanella
  2016-04-27  8:20     ` Stefan Liebler
  0 siblings, 1 reply; 36+ messages in thread
From: Adhemerval Zanella @ 2016-04-26 13:35 UTC (permalink / raw)
  To: libc-alpha



On 26/04/2016 09:07, Stefan Liebler wrote:
> On s390, the memcpy, memcmp, memset functions are IFUNC symbols,
> which are created with s390_libc_ifunc-macro.
> This macro creates a __GI_ symbol which is set to the
> ifunced symbol. Thus calls within libc.so to e.g. memcpy
> result in a call to *ABS*+0x954c0@plt stub and afterwards
> to the resolved memcpy-ifunc-variant.
> 
> This patch sets the __GI_ symbol to the default-ifunc-variant
> to avoid the plt call. The __GI_ symbols are now created at the
> default variant of ifunced function.

Is the internal ifunc plt usage leading to a failure in s390/s390x
(as for powerpc32 and i686) or is it an optimization fix?

> 
> ChangeLog:
> 
> 	* sysdeps/s390/multiarch/ifunc-resolve.h (s390_libc_ifunc):
> 	Remove __GI_ symbol.
> 	* sysdeps/s390/s390-32/multiarch/memcmp-s390.S: Add __GI_memcmp symbol.
> 	* sysdeps/s390/s390-64/multiarch/memcmp-s390x.S: Likewise.
> 	* sysdeps/s390/s390-32/multiarch/memcpy-s390.S: Add __GI_memcpy symbol.
> 	* sysdeps/s390/s390-64/multiarch/memcpy-s390x.S: Likewise.
> 	* sysdeps/s390/s390-32/multiarch/memset-s390.S: Add __GI_memset symbol.
> 	* sysdeps/s390/s390-64/multiarch/memset-s390x.S: Likewise.
> ---
>  sysdeps/s390/multiarch/ifunc-resolve.h        | 4 +---
>  sysdeps/s390/s390-32/multiarch/memcmp-s390.S  | 3 +++
>  sysdeps/s390/s390-32/multiarch/memcpy-s390.S  | 5 ++++-
>  sysdeps/s390/s390-32/multiarch/memset-s390.S  | 3 +++
>  sysdeps/s390/s390-64/multiarch/memcmp-s390x.S | 3 +++
>  sysdeps/s390/s390-64/multiarch/memcpy-s390x.S | 5 ++++-
>  sysdeps/s390/s390-64/multiarch/memset-s390x.S | 3 +++
>  7 files changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/sysdeps/s390/multiarch/ifunc-resolve.h b/sysdeps/s390/multiarch/ifunc-resolve.h
> index 744a0d8..26e097a 100644
> --- a/sysdeps/s390/multiarch/ifunc-resolve.h
> +++ b/sysdeps/s390/multiarch/ifunc-resolve.h
> @@ -44,9 +44,7 @@
>  #define s390_libc_ifunc(FUNC)						\
>    __asm__ (".globl " #FUNC "\n\t"					\
>  	   ".type  " #FUNC ",@gnu_indirect_function\n\t"		\
> -	   ".set   " #FUNC ",__resolve_" #FUNC "\n\t"			\
> -	   ".globl __GI_" #FUNC "\n\t"					\
> -	   ".set   __GI_" #FUNC "," #FUNC "\n");			\
> +	   ".set   " #FUNC ",__resolve_" #FUNC "\n\t");			\
>  									\
>    /* Make the declarations of the optimized functions hidden in order
>       to prevent GOT slots being generated for them. */			\
> diff --git a/sysdeps/s390/s390-32/multiarch/memcmp-s390.S b/sysdeps/s390/s390-32/multiarch/memcmp-s390.S
> index e9ee6d2..a01f3b7 100644
> --- a/sysdeps/s390/s390-32/multiarch/memcmp-s390.S
> +++ b/sysdeps/s390/s390-32/multiarch/memcmp-s390.S
> @@ -101,4 +101,7 @@ END(__memcmp_z10)
>  .set     memcmp,__memcmp_default
>  .weak    bcmp
>  .set	 bcmp,__memcmp_default
> +#elif defined SHARED && IS_IN (libc)
> +.globl   __GI_memcmp
> +.set     __GI_memcmp,__memcmp_default
>  #endif
> diff --git a/sysdeps/s390/s390-32/multiarch/memcpy-s390.S b/sysdeps/s390/s390-32/multiarch/memcpy-s390.S
> index 4e30cdf..92ffaea 100644
> --- a/sysdeps/s390/s390-32/multiarch/memcpy-s390.S
> +++ b/sysdeps/s390/s390-32/multiarch/memcpy-s390.S
> @@ -92,7 +92,10 @@ END(__memcpy_z10)
>  
>  #include "../memcpy.S"
>  
> -#if !defined SHARED || !IS_IN (libc)
> +#if defined SHARED && IS_IN (libc)
> +.globl   __GI_memcpy
> +.set     __GI_memcpy,__memcpy_default
> +#else
>  .globl   memcpy
>  .set     memcpy,__memcpy_default
>  #endif
> diff --git a/sysdeps/s390/s390-32/multiarch/memset-s390.S b/sysdeps/s390/s390-32/multiarch/memset-s390.S
> index 47277c1..a2ddd98 100644
> --- a/sysdeps/s390/s390-32/multiarch/memset-s390.S
> +++ b/sysdeps/s390/s390-32/multiarch/memset-s390.S
> @@ -110,4 +110,7 @@ END(__memset_mvcle)
>  #if !IS_IN (libc)
>  .globl   memset
>  .set     memset,__memset_default
> +#elif defined SHARED && IS_IN (libc)
> +.globl   __GI_memset
> +.set     __GI_memset,__memset_default
>  #endif
> diff --git a/sysdeps/s390/s390-64/multiarch/memcmp-s390x.S b/sysdeps/s390/s390-64/multiarch/memcmp-s390x.S
> index 2a4c0ae..b28ccaf 100644
> --- a/sysdeps/s390/s390-64/multiarch/memcmp-s390x.S
> +++ b/sysdeps/s390/s390-64/multiarch/memcmp-s390x.S
> @@ -98,4 +98,7 @@ END(__memcmp_z10)
>  .set     memcmp,__memcmp_default
>  .weak    bcmp
>  .set	 bcmp,__memcmp_default
> +#elif defined SHARED && IS_IN (libc)
> +.globl   __GI_memcmp
> +.set     __GI_memcmp,__memcmp_default
>  #endif
> diff --git a/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S b/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S
> index 69fa562..8f54526 100644
> --- a/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S
> +++ b/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S
> @@ -88,7 +88,10 @@ END(__memcpy_z10)
>  
>  #include "../memcpy.S"
>  
> -#if !defined SHARED || !IS_IN (libc)
> +#if defined SHARED && IS_IN (libc)
> +.globl   __GI_memcpy
> +.set     __GI_memcpy,__memcpy_default
> +#else
>  .globl   memcpy
>  .set     memcpy,__memcpy_default
>  #endif
> diff --git a/sysdeps/s390/s390-64/multiarch/memset-s390x.S b/sysdeps/s390/s390-64/multiarch/memset-s390x.S
> index 05e0682..a77e798 100644
> --- a/sysdeps/s390/s390-64/multiarch/memset-s390x.S
> +++ b/sysdeps/s390/s390-64/multiarch/memset-s390x.S
> @@ -106,4 +106,7 @@ END(__memset_mvcle)
>  #if !IS_IN (libc)
>  .globl   memset
>  .set     memset,__memset_default
> +#elif defined SHARED && IS_IN (libc)
> +.globl   __GI_memset
> +.set     __GI_memset,__memset_default
>  #endif
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/4] S390: Use 64bit instruction to check for copies of > 1MB with mvcle.
  2016-04-26 12:08 ` [PATCH 2/4] S390: Use 64bit instruction to check for copies of > 1MB with mvcle Stefan Liebler
@ 2016-04-26 14:16   ` Florian Weimer
  2016-04-26 14:20     ` Stefan Liebler
  0 siblings, 1 reply; 36+ messages in thread
From: Florian Weimer @ 2016-04-26 14:16 UTC (permalink / raw)
  To: Stefan Liebler, libc-alpha; +Cc: carlos, Wilco.Dijkstra, neleai

On 04/26/2016 02:07 PM, Stefan Liebler wrote:
> The __memcpy_default variant on s390 64bit calculates the number of
> 256byte blocks in a 64bit register and checks, if they exceed 1MB
> to jump to mvcle. Otherwise a mvc-loop is used. The compare-instruction
> only checks a 32bit value.
> This patch uses a 64bit compare.

This is purely an optimization, right?  Did the previous implementation 
still perform a complete copy (although perhaps in a less efficient way)?

Thanks,
Florian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/4] S390: Use 64bit instruction to check for copies of > 1MB with mvcle.
  2016-04-26 14:16   ` Florian Weimer
@ 2016-04-26 14:20     ` Stefan Liebler
  0 siblings, 0 replies; 36+ messages in thread
From: Stefan Liebler @ 2016-04-26 14:20 UTC (permalink / raw)
  To: libc-alpha

On 04/26/2016 04:16 PM, Florian Weimer wrote:
> On 04/26/2016 02:07 PM, Stefan Liebler wrote:
>> The __memcpy_default variant on s390 64bit calculates the number of
>> 256byte blocks in a 64bit register and checks, if they exceed 1MB
>> to jump to mvcle. Otherwise a mvc-loop is used. The compare-instruction
>> only checks a 32bit value.
>> This patch uses a 64bit compare.
>
> This is purely an optimization, right?  Did the previous implementation
> still perform a complete copy (although perhaps in a less efficient way)?
>
> Thanks,
> Florian
>
>
Yes it performs a complete copy with mcv-loop instead of mvcle.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-04-26 13:33   ` Adhemerval Zanella
@ 2016-04-27  8:15     ` Stefan Liebler
  2016-05-04 15:42       ` Adhemerval Zanella
  0 siblings, 1 reply; 36+ messages in thread
From: Stefan Liebler @ 2016-04-27  8:15 UTC (permalink / raw)
  To: libc-alpha

On 04/26/2016 03:32 PM, Adhemerval Zanella wrote:
>
>
> On 26/04/2016 09:07, Stefan Liebler wrote:
>> There exist optimized memcpy functions on s390, but no optimized mempcpy.
>> This patch adds mempcpy entry points in memcpy.S files, which
>> use the memcpy implementation. Now mempcpy itself is also an IFUNC function
>> as memcpy is and the variants are listed in ifunc-impl-list.c.
>>
>> The s390 string.h defines _HAVE_STRING_ARCH_mempcpy and __mempcpy/mempcpy
>> definitions. If n is constant and small enough, __mempcpy/mempcpy is an inlined
>> function, which uses memcpy and returns dest + n. In this constant case,
>> GCC emits instructions like mvi or mvc and avoids the function call to memcpy.
>> If n is not constant, then __mempcpy is called.
>>
>
> AFAIk GLIBC does not have any ports that implements mempcpy and not
> memcpy and IMHO the inline version to convert mempcpy to memcpy is
> always a gain.
>
> So I would suggest to just avoid adding arch-specific bits and move the
> mempcpy inline optimization at string/string.h to string/string2.h as
>
> --
>
> #define mempcpy(dest, src, n) __mempcpy_inline (dest, src, n)
> #define __mempcpy(dest, src, n) __mempcpy_inline (dest, src, n)
>
> __extern_always_inline void *
> __mempcpy_inline (void *__restrict __dest,
>                    const void *__restrict __src, size_t __n)
> {
>    return (char *) memcpy (__dest, __src, __n) + __n;
> }
>
> --
>
> This is convert any mempcpy to mempcy plus length and provide
> the symbol only for older binaries.
>

According to the current behaviour on s390, a call to memcpy with 
constant byte-length below 65536 is done by emitting mvc, mvi or
a suitable instruction sequence.
If gcc don't know about a constant byte-length, then memcpy function
in glibc is called.

For mempcpy cascades like in iconv/gconv_conf.c:
466:  __mempcpy (__mempcpy (__mempcpy (gconv_path, __gconv_path_envvar,
					   user_len),
				":", 1),
		     default_gconv_path, sizeof (default_gconv_path));

The current behaviour is to perform a call to memcpy(), add user_len to
the return value of the call and use mvi, mvc for the two calls with 
constant byte-lengths.

If only _HAVE_STRING_ARCH_mempcpy is defined, there would be three calls
to mempcpy. Here you are right, this is not what you want.

The behaviour with this patch is to call mempcpy() once and use mvi, mvc
directly without the need to add user_len after the mempcpy call.
This way, there is no need to store user_len in an extra register or on 
stack.

I don't know the conditions on other archs, whether memcpy is inlined or 
called, thus I added the s390-specific condition to s390-string.h.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] S390: Do not call memcpy, memcmp, memset within libc.so via ifunc-plt.
  2016-04-26 13:35   ` Adhemerval Zanella
@ 2016-04-27  8:20     ` Stefan Liebler
  2016-05-04 15:22       ` Adhemerval Zanella
  0 siblings, 1 reply; 36+ messages in thread
From: Stefan Liebler @ 2016-04-27  8:20 UTC (permalink / raw)
  To: libc-alpha

On 04/26/2016 03:35 PM, Adhemerval Zanella wrote:
>
>
> On 26/04/2016 09:07, Stefan Liebler wrote:
>> On s390, the memcpy, memcmp, memset functions are IFUNC symbols,
>> which are created with s390_libc_ifunc-macro.
>> This macro creates a __GI_ symbol which is set to the
>> ifunced symbol. Thus calls within libc.so to e.g. memcpy
>> result in a call to *ABS*+0x954c0@plt stub and afterwards
>> to the resolved memcpy-ifunc-variant.
>>
>> This patch sets the __GI_ symbol to the default-ifunc-variant
>> to avoid the plt call. The __GI_ symbols are now created at the
>> default variant of ifunced function.
>
> Is the internal ifunc plt usage leading to a failure in s390/s390x
> (as for powerpc32 and i686) or is it an optimization fix?

No it does not lead to a failure on s390.
It is an optimization to avoid the extra call to the plt-stub if called 
within libc.so.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-04-26 12:08 ` [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765] Stefan Liebler
  2016-04-26 13:33   ` Adhemerval Zanella
@ 2016-05-04 13:20   ` Stefan Liebler
  1 sibling, 0 replies; 36+ messages in thread
From: Stefan Liebler @ 2016-05-04 13:20 UTC (permalink / raw)
  To: libc-alpha

ping.

Any objection to the mempcpy macro in sysdeps/s390/bits/string.h?

On 04/26/2016 02:07 PM, Stefan Liebler wrote:
> There exist optimized memcpy functions on s390, but no optimized mempcpy.
> This patch adds mempcpy entry points in memcpy.S files, which
> use the memcpy implementation. Now mempcpy itself is also an IFUNC function
> as memcpy is and the variants are listed in ifunc-impl-list.c.
>
> The s390 string.h defines _HAVE_STRING_ARCH_mempcpy and __mempcpy/mempcpy
> definitions. If n is constant and small enough, __mempcpy/mempcpy is an inlined
> function, which uses memcpy and returns dest + n. In this constant case,
> GCC emits instructions like mvi or mvc and avoids the function call to memcpy.
> If n is not constant, then __mempcpy is called.
>
> ChangeLog:
>
> 	[BZ #19765]
> 	* sysdeps/s390/bits/string.h (mempcpy) Redirect to
> 	__mempcpy_inline_memcpy or __mempcpy_inline_mempcpy.
> 	(__mempcpy) Likewise.
> 	(__mempcpy_inline_mempcpy): New inline function.
> 	(__mempcpy_inline_memcpy): Likewise.
> 	(_HAVE_STRING_ARCH_mempcpy): New define.
> 	* sysdeps/s390/mempcpy.S: New File.
> 	* sysdeps/s390/multiarch/mempcpy.c: Likewise.
> 	* sysdeps/s390/multiarch/Makefile (sysdep_routines): Add mempcpy.
> 	* sysdeps/s390/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
> 	Add mempcpy variants.
> 	* sysdeps/s390/s390-32/memcpy.S: Add mempcpy entry point.
> 	(memcpy): Adjust to be usable from mempcpy entry point.
> 	(__memcpy_mvcle): Likewise.
> 	* sysdeps/s390/s390-64/memcpy.S: Likewise.
> 	* sysdeps/s390/s390-32/multiarch/memcpy-s390.S: Add entry points
> 	____mempcpy_z196, ____mempcpy_z10 and add __GI_ symbols for mempcpy.
> 	(__memcpy_z196): Adjust to be usable from mempcpy entry point.
> 	(__memcpy_z10): Likewise.
> 	* sysdeps/s390/s390-64/multiarch/memcpy-s390x.S: Likewise.
> ---
>   sysdeps/s390/bits/string.h                    | 36 +++++++++++++++++++
>   sysdeps/s390/mempcpy.S                        | 19 ++++++++++
>   sysdeps/s390/multiarch/Makefile               |  3 +-
>   sysdeps/s390/multiarch/ifunc-impl-list.c      |  7 ++++
>   sysdeps/s390/multiarch/mempcpy.c              | 26 ++++++++++++++
>   sysdeps/s390/s390-32/memcpy.S                 | 50 ++++++++++++++++-----------
>   sysdeps/s390/s390-32/multiarch/memcpy-s390.S  | 31 +++++++++++++++--
>   sysdeps/s390/s390-64/memcpy.S                 | 47 ++++++++++++++-----------
>   sysdeps/s390/s390-64/multiarch/memcpy-s390x.S | 29 ++++++++++++++--
>   9 files changed, 203 insertions(+), 45 deletions(-)
>   create mode 100644 sysdeps/s390/mempcpy.S
>   create mode 100644 sysdeps/s390/multiarch/mempcpy.c
>
> diff --git a/sysdeps/s390/bits/string.h b/sysdeps/s390/bits/string.h
> index 39e0b7f..eae3c8e 100644
> --- a/sysdeps/s390/bits/string.h
> +++ b/sysdeps/s390/bits/string.h
> @@ -24,6 +24,42 @@
>   /* Use the unaligned string inline ABI.  */
>   #define _STRING_INLINE_unaligned 1
>
> +#if defined __USE_GNU && defined __OPTIMIZE__ \
> +    && defined __extern_always_inline && __GNUC_PREREQ (3,2)
> +# if !defined _FORCE_INLINES && !defined _HAVE_STRING_ARCH_mempcpy
> +
> +/* Don't inline mempcpy into memcpy as s390 has an optimized mempcpy.  */
> +#  define _HAVE_STRING_ARCH_mempcpy 1
> +
> +__extern_always_inline void *
> +__mempcpy_inline_memcpy (void *__restrict __dest,
> +			 const void *__restrict __src, size_t __n)
> +{
> +  return (char *) memcpy (__dest, __src, __n) + __n;
> +}
> +
> +__extern_always_inline void *
> +__mempcpy_inline_mempcpy (void *__restrict __dest,
> +			 const void *__restrict __src, size_t __n)
> +{
> +  return __mempcpy (__dest, __src, __n);
> +}
> +
> +/* If n is constant and small enough, __mempcpy/mempcpy is an inlined function,
> +   which uses memcpy and returns dest + n. In this constant case, GCC emits
> +   instructions like mvi or mvc and avoids the function call to memcpy.
> +   If n is not constant, then __mempcpy is called.  */
> +#  undef mempcpy
> +#  undef __mempcpy
> +#  define __mempcpy(dest, src, n)					\
> +  (__extension__ (__builtin_constant_p (n) && n <= 65536		\
> +		  ? __mempcpy_inline_memcpy (dest, src, n)		\
> +		  : __mempcpy_inline_mempcpy (dest, src, n)))
> +#  define mempcpy(dest, src, n) __mempcpy (dest, src, n)
> +
> +# endif
> +#endif
> +
>   /* We only provide optimizations if the user selects them and if
>      GNU CC is used.  */
>   #if !defined __NO_STRING_INLINES && defined __USE_STRING_INLINES \
> diff --git a/sysdeps/s390/mempcpy.S b/sysdeps/s390/mempcpy.S
> new file mode 100644
> index 0000000..c62074b
> --- /dev/null
> +++ b/sysdeps/s390/mempcpy.S
> @@ -0,0 +1,19 @@
> +/* CPU specific mempcpy without multiarch - 32/64 bit S/390 version.
> +   Copyright (C) 2016 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* mempcpy is implemented in memcpy.S.  */
> diff --git a/sysdeps/s390/multiarch/Makefile b/sysdeps/s390/multiarch/Makefile
> index 0805b07..89324ca 100644
> --- a/sysdeps/s390/multiarch/Makefile
> +++ b/sysdeps/s390/multiarch/Makefile
> @@ -18,7 +18,8 @@ sysdep_routines += strlen strlen-vx strlen-c \
>   		   memchr memchr-vx \
>   		   rawmemchr rawmemchr-vx rawmemchr-c \
>   		   memccpy memccpy-vx memccpy-c \
> -		   memrchr memrchr-vx memrchr-c
> +		   memrchr memrchr-vx memrchr-c \
> +		   mempcpy
>   endif
>
>   ifeq ($(subdir),wcsmbs)
> diff --git a/sysdeps/s390/multiarch/ifunc-impl-list.c b/sysdeps/s390/multiarch/ifunc-impl-list.c
> index 62a4359..89898ec 100644
> --- a/sysdeps/s390/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/s390/multiarch/ifunc-impl-list.c
> @@ -69,6 +69,13 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>   			      S390_IS_Z10 (stfle_bits), __memcpy_z10)
>   	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_default))
>
> +  IFUNC_IMPL (i, name, mempcpy,
> +	      IFUNC_IMPL_ADD (array, i, mempcpy,
> +			      S390_IS_Z196 (stfle_bits), ____mempcpy_z196)
> +	      IFUNC_IMPL_ADD (array, i, mempcpy,
> +			      S390_IS_Z10 (stfle_bits), ____mempcpy_z10)
> +	      IFUNC_IMPL_ADD (array, i, mempcpy, 1, ____mempcpy_default))
> +
>   #endif /* SHARED */
>
>   #ifdef HAVE_S390_VX_ASM_SUPPORT
> diff --git a/sysdeps/s390/multiarch/mempcpy.c b/sysdeps/s390/multiarch/mempcpy.c
> new file mode 100644
> index 0000000..34d8329
> --- /dev/null
> +++ b/sysdeps/s390/multiarch/mempcpy.c
> @@ -0,0 +1,26 @@
> +/* Multiple versions of mempcpy.
> +   Copyright (C) 2016 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +
> +#if defined SHARED && IS_IN (libc)
> +# include <ifunc-resolve.h>
> +s390_libc_ifunc (__mempcpy)
> +
> +__asm__ (".weak mempcpy\n\t"
> +	 ".set mempcpy,__mempcpy\n\t");
> +#endif
> diff --git a/sysdeps/s390/s390-32/memcpy.S b/sysdeps/s390/s390-32/memcpy.S
> index 2ac51ab..6be5104 100644
> --- a/sysdeps/s390/s390-32/memcpy.S
> +++ b/sysdeps/s390/s390-32/memcpy.S
> @@ -25,12 +25,23 @@
>        %r3 = address of source memory area
>        %r4 = number of bytes to copy.  */
>
> -#ifdef USE_MULTIARCH
> -ENTRY(__memcpy_default)
> -#else
> -ENTRY(memcpy)
> +       .text
> +ENTRY(__mempcpy)
> +	.machine "g5"
> +	lr      %r1,%r2             # Use as dest
> +	la      %r2,0(%r4,%r2)      # Return dest + n
> +	j	.L_G5_start
> +END(__mempcpy)
> +#ifndef USE_MULTIARCH
> +libc_hidden_def (__mempcpy)
> +weak_alias (__mempcpy, mempcpy)
> +libc_hidden_builtin_def (mempcpy)
>   #endif
> +
> +ENTRY(memcpy)
>   	.machine "g5"
> +	lr      %r1,%r2             # r1: Use as dest ; r2: Return dest
> +.L_G5_start:
>   	st      %r13,52(%r15)
>   	.cfi_offset 13, -44
>   	basr    %r13,0
> @@ -41,14 +52,13 @@ ENTRY(memcpy)
>   	lr      %r5,%r4
>   	srl     %r5,8
>   	ltr     %r5,%r5
> -	lr      %r1,%r2
>   	jne     .L_G5_13
>   	ex      %r4,.L_G5_17-.L_G5_16(%r13)
>   .L_G5_4:
>   	l       %r13,52(%r15)
>   	br      %r14
>   .L_G5_13:
> -	chi	%r5,4096             # Switch to mvcle for copies >1MB
> +	chi	%r5,4096            # Switch to mvcle for copies >1MB
>   	jh	__memcpy_mvcle
>   .L_G5_12:
>   	mvc     0(256,%r1),0(%r3)
> @@ -59,24 +69,24 @@ ENTRY(memcpy)
>   	j       .L_G5_4
>   .L_G5_17:
>   	mvc     0(1,%r1),0(%r3)
> -#ifdef USE_MULTIARCH
> -END(__memcpy_default)
> -#else
>   END(memcpy)
> +#ifndef USE_MULTIARCH
>   libc_hidden_builtin_def (memcpy)
>   #endif
>
>   ENTRY(__memcpy_mvcle)
> -       # Using as standalone function will result in unexpected
> -       # results since the length field is incremented by 1 in order to
> -       # compensate the changes already done in the functions above.
> -       ahi     %r4,1               # length + 1
> -       lr      %r5,%r4             # source length
> -       lr      %r4,%r3             # source address
> -       lr      %r3,%r5             # destination length = source length
> +	# Using as standalone function will result in unexpected
> +	# results since the length field is incremented by 1 in order to
> +	# compensate the changes already done in the functions above.
> +	lr      %r0,%r2             # backup return dest [ + n ]
> +	ahi     %r4,1               # length + 1
> +	lr      %r5,%r4             # source length
> +	lr      %r4,%r3             # source address
> +	lr      %r2,%r1             # destination address
> +	lr      %r3,%r5             # destination length = source length
>   .L_MVCLE_1:
> -       mvcle   %r2,%r4,0           # thats it, MVCLE is your friend
> -       jo      .L_MVCLE_1
> -       lr      %r2,%r1             # return destination address
> -       br      %r14
> +	mvcle   %r2,%r4,0           # thats it, MVCLE is your friend
> +	jo      .L_MVCLE_1
> +	lr      %r2,%r0             # return destination address
> +	br      %r14
>   END(__memcpy_mvcle)
> diff --git a/sysdeps/s390/s390-32/multiarch/memcpy-s390.S b/sysdeps/s390/s390-32/multiarch/memcpy-s390.S
> index 92ffaea..297a894 100644
> --- a/sysdeps/s390/s390-32/multiarch/memcpy-s390.S
> +++ b/sysdeps/s390/s390-32/multiarch/memcpy-s390.S
> @@ -29,14 +29,23 @@
>
>   #if defined SHARED && IS_IN (libc)
>
> +ENTRY(____mempcpy_z196)
> +	.machine "z196"
> +	.machinemode "zarch_nohighgprs"
> +	lr      %r1,%r2         # Use as dest
> +	la      %r2,0(%r4,%r2)  # Return dest + n
> +	j	.L_Z196_start
> +END(____mempcpy_z196)
> +
>   ENTRY(__memcpy_z196)
>   	.machine "z196"
>   	.machinemode "zarch_nohighgprs"
> +	lr      %r1,%r2         # r1: Use as dest ; r2: Return dest
> +.L_Z196_start:
>   	llgfr   %r4,%r4
>   	ltgr    %r4,%r4
>   	je      .L_Z196_4
>   	aghi    %r4,-1
> -	lr      %r1,%r2
>   	srlg    %r5,%r4,8
>   	ltgr    %r5,%r5
>   	jne     .L_Z196_5
> @@ -60,13 +69,22 @@ ENTRY(__memcpy_z196)
>   	mvc     0(1,%r1),0(%r3)
>   END(__memcpy_z196)
>
> +ENTRY(____mempcpy_z10)
> +	.machine "z10"
> +	.machinemode "zarch_nohighgprs"
> +	lr      %r1,%r2         # Use as dest
> +	la      %r2,0(%r4,%r2)  # Return dest + n
> +	j	.L_Z10_start
> +END(____mempcpy_z10)
> +
>   ENTRY(__memcpy_z10)
>   	.machine "z10"
>   	.machinemode "zarch_nohighgprs"
> +	lr      %r1,%r2         # r1: Use as dest ; r2: Return dest
> +.L_Z10_start:
>   	llgfr   %r4,%r4
>   	cgije   %r4,0,.L_Z10_4
>   	aghi    %r4,-1
> -	lr      %r1,%r2
>   	srlg    %r5,%r4,8
>   	cgijlh  %r5,0,.L_Z10_13
>   .L_Z10_3:
> @@ -88,14 +106,23 @@ ENTRY(__memcpy_z10)
>   	mvc     0(1,%r1),0(%r3)
>   END(__memcpy_z10)
>
> +# define __mempcpy ____mempcpy_default
>   #endif /* SHARED && IS_IN (libc) */
>
> +#define memcpy __memcpy_default
>   #include "../memcpy.S"
> +#undef memcpy
>
>   #if defined SHARED && IS_IN (libc)
>   .globl   __GI_memcpy
>   .set     __GI_memcpy,__memcpy_default
> +.globl   __GI_mempcpy
> +.set     __GI_mempcpy,____mempcpy_default
> +.globl   __GI___mempcpy
> +.set     __GI___mempcpy,____mempcpy_default
>   #else
>   .globl   memcpy
>   .set     memcpy,__memcpy_default
> +.weak    mempcpy
> +.set     mempcpy,__mempcpy
>   #endif
> diff --git a/sysdeps/s390/s390-64/memcpy.S b/sysdeps/s390/s390-64/memcpy.S
> index 9d60a14..e3ace0c 100644
> --- a/sysdeps/s390/s390-64/memcpy.S
> +++ b/sysdeps/s390/s390-64/memcpy.S
> @@ -27,19 +27,27 @@
>
>
>          .text
> +ENTRY(__mempcpy)
> +	.machine "z900"
> +	lgr     %r1,%r2             # Use as dest
> +	la      %r2,0(%r4,%r2)      # Return dest + n
> +	j	.L_Z900_start
> +END(__mempcpy)
> +#ifndef USE_MULTIARCH
> +libc_hidden_def (__mempcpy)
> +weak_alias (__mempcpy, mempcpy)
> +libc_hidden_builtin_def (mempcpy)
> +#endif
>
> -#ifdef USE_MULTIARCH
> -ENTRY(__memcpy_default)
> -#else
>   ENTRY(memcpy)
> -#endif
>   	.machine "z900"
> +	lgr     %r1,%r2             # r1: Use as dest ; r2: Return dest
> +.L_Z900_start:
>   	ltgr    %r4,%r4
>   	je      .L_Z900_4
>   	aghi    %r4,-1
>   	srlg    %r5,%r4,8
>   	ltgr    %r5,%r5
> -	lgr     %r1,%r2
>   	jne     .L_Z900_13
>   .L_Z900_3:
>   	larl    %r5,.L_Z900_15
> @@ -57,25 +65,24 @@ ENTRY(memcpy)
>   	j       .L_Z900_3
>   .L_Z900_15:
>   	mvc     0(1,%r1),0(%r3)
> -
> -#ifdef USE_MULTIARCH
> -END(__memcpy_default)
> -#else
>   END(memcpy)
> +#ifndef USE_MULTIARCH
>   libc_hidden_builtin_def (memcpy)
>   #endif
>
>   ENTRY(__memcpy_mvcle)
> -       # Using as standalone function will result in unexpected
> -       # results since the length field is incremented by 1 in order to
> -       # compensate the changes already done in the functions above.
> -       aghi    %r4,1               # length + 1
> -       lgr     %r5,%r4             # source length
> -       lgr     %r4,%r3             # source address
> -       lgr     %r3,%r5             # destination length = source length
> +	# Using as standalone function will result in unexpected
> +	# results since the length field is incremented by 1 in order to
> +	# compensate the changes already done in the functions above.
> +	lgr     %r0,%r2             # backup return dest [ + n ]
> +	aghi    %r4,1               # length + 1
> +	lgr     %r5,%r4             # source length
> +	lgr     %r4,%r3             # source address
> +	lgr     %r2,%r1             # destination address
> +	lgr     %r3,%r5             # destination length = source length
>   .L_MVCLE_1:
> -       mvcle   %r2,%r4,0           # thats it, MVCLE is your friend
> -       jo      .L_MVCLE_1
> -       lgr     %r2,%r1             # return destination address
> -       br      %r14
> +	mvcle   %r2,%r4,0           # thats it, MVCLE is your friend
> +	jo      .L_MVCLE_1
> +	lgr     %r2,%r0             # return destination address
> +	br      %r14
>   END(__memcpy_mvcle)
> diff --git a/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S b/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S
> index 8f54526..0f5a36e 100644
> --- a/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S
> +++ b/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S
> @@ -29,12 +29,20 @@
>
>   #if defined SHARED && IS_IN (libc)
>
> +ENTRY(____mempcpy_z196)
> +	.machine "z196"
> +	lgr     %r1,%r2         # Use as dest
> +	la      %r2,0(%r4,%r2)  # Return dest + n
> +	j	.L_Z196_start
> +END(____mempcpy_z196)
> +
>   ENTRY(__memcpy_z196)
>   	.machine "z196"
> +	lgr     %r1,%r2         # r1: Use as dest ; r2: Return dest
> +.L_Z196_start:
>   	ltgr    %r4,%r4
>   	je      .L_Z196_4
>   	aghi    %r4,-1
> -	lgr     %r1,%r2
>   	srlg    %r5,%r4,8
>   	ltgr    %r5,%r5
>   	jne     .L_Z196_5
> @@ -58,11 +66,19 @@ ENTRY(__memcpy_z196)
>   	mvc     0(1,%r1),0(%r3)
>   END(__memcpy_z196)
>
> +ENTRY(____mempcpy_z10)
> +	.machine "z10"
> +	lgr     %r1,%r2         # Use as dest
> +	la      %r2,0(%r4,%r2)  # Return dest + n
> +	j	.L_Z10_start
> +END(____mempcpy_z10)
> +
>   ENTRY(__memcpy_z10)
>   	.machine "z10"
> +	lgr     %r1,%r2         # r1: Use as dest ; r2: Return dest
> +.L_Z10_start:
>   	cgije   %r4,0,.L_Z10_4
>   	aghi    %r4,-1
> -	lgr     %r1,%r2
>   	srlg    %r5,%r4,8
>   	cgijlh  %r5,0,.L_Z10_13
>   .L_Z10_3:
> @@ -84,14 +100,23 @@ ENTRY(__memcpy_z10)
>   	mvc     0(1,%r1),0(%r3)
>   END(__memcpy_z10)
>
> +# define __mempcpy ____mempcpy_default
>   #endif /* SHARED && IS_IN (libc) */
>
> +#define memcpy __memcpy_default
>   #include "../memcpy.S"
> +#undef memcpy
>
>   #if defined SHARED && IS_IN (libc)
>   .globl   __GI_memcpy
>   .set     __GI_memcpy,__memcpy_default
> +.globl   __GI_mempcpy
> +.set     __GI_mempcpy,____mempcpy_default
> +.globl   __GI___mempcpy
> +.set     __GI___mempcpy,____mempcpy_default
>   #else
>   .globl   memcpy
>   .set     memcpy,__memcpy_default
> +.weak    mempcpy
> +.set     mempcpy,__mempcpy
>   #endif
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] S390: Do not call memcpy, memcmp, memset within libc.so via ifunc-plt.
  2016-04-27  8:20     ` Stefan Liebler
@ 2016-05-04 15:22       ` Adhemerval Zanella
  2016-05-11 13:41         ` Stefan Liebler
  0 siblings, 1 reply; 36+ messages in thread
From: Adhemerval Zanella @ 2016-05-04 15:22 UTC (permalink / raw)
  To: libc-alpha



On 27/04/2016 05:14, Stefan Liebler wrote:
> On 04/26/2016 03:35 PM, Adhemerval Zanella wrote:
>>
>>
>> On 26/04/2016 09:07, Stefan Liebler wrote:
>>> On s390, the memcpy, memcmp, memset functions are IFUNC symbols,
>>> which are created with s390_libc_ifunc-macro.
>>> This macro creates a __GI_ symbol which is set to the
>>> ifunced symbol. Thus calls within libc.so to e.g. memcpy
>>> result in a call to *ABS*+0x954c0@plt stub and afterwards
>>> to the resolved memcpy-ifunc-variant.
>>>
>>> This patch sets the __GI_ symbol to the default-ifunc-variant
>>> to avoid the plt call. The __GI_ symbols are now created at the
>>> default variant of ifunced function.
>>
>> Is the internal ifunc plt usage leading to a failure in s390/s390x
>> (as for powerpc32 and i686) or is it an optimization fix?
> 
> No it does not lead to a failure on s390.
> It is an optimization to avoid the extra call to the plt-stub if called within libc.so.
> 

Right because on 19c4bec0f43599eecc2f32de96ae179cd7d64053 I did the
exact opposite because for POWER it shows that the gains of using
optimized version over default one shows a good improvement in
algorithms that use these symbol internally (like regex).

It might be the case where the default s390 version is good enough
and shows no performance difference.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-04-27  8:15     ` Stefan Liebler
@ 2016-05-04 15:42       ` Adhemerval Zanella
  0 siblings, 0 replies; 36+ messages in thread
From: Adhemerval Zanella @ 2016-05-04 15:42 UTC (permalink / raw)
  To: libc-alpha



On 27/04/2016 05:14, Stefan Liebler wrote:
> On 04/26/2016 03:32 PM, Adhemerval Zanella wrote:
>>
>>
>> On 26/04/2016 09:07, Stefan Liebler wrote:
>>> There exist optimized memcpy functions on s390, but no optimized mempcpy.
>>> This patch adds mempcpy entry points in memcpy.S files, which
>>> use the memcpy implementation. Now mempcpy itself is also an IFUNC function
>>> as memcpy is and the variants are listed in ifunc-impl-list.c.
>>>
>>> The s390 string.h defines _HAVE_STRING_ARCH_mempcpy and __mempcpy/mempcpy
>>> definitions. If n is constant and small enough, __mempcpy/mempcpy is an inlined
>>> function, which uses memcpy and returns dest + n. In this constant case,
>>> GCC emits instructions like mvi or mvc and avoids the function call to memcpy.
>>> If n is not constant, then __mempcpy is called.
>>>
>>
>> AFAIk GLIBC does not have any ports that implements mempcpy and not
>> memcpy and IMHO the inline version to convert mempcpy to memcpy is
>> always a gain.
>>
>> So I would suggest to just avoid adding arch-specific bits and move the
>> mempcpy inline optimization at string/string.h to string/string2.h as
>>
>> -- 
>>
>> #define mempcpy(dest, src, n) __mempcpy_inline (dest, src, n)
>> #define __mempcpy(dest, src, n) __mempcpy_inline (dest, src, n)
>>
>> __extern_always_inline void *
>> __mempcpy_inline (void *__restrict __dest,
>>                    const void *__restrict __src, size_t __n)
>> {
>>    return (char *) memcpy (__dest, __src, __n) + __n;
>> }
>>
>> -- 
>>
>> This is convert any mempcpy to mempcy plus length and provide
>> the symbol only for older binaries.
>>
> 
> According to the current behaviour on s390, a call to memcpy with constant byte-length below 65536 is done by emitting mvc, mvi or
> a suitable instruction sequence.
> If gcc don't know about a constant byte-length, then memcpy function
> in glibc is called.
> 
> For mempcpy cascades like in iconv/gconv_conf.c:
> 466:  __mempcpy (__mempcpy (__mempcpy (gconv_path, __gconv_path_envvar,
>                        user_len),
>                 ":", 1),
>              default_gconv_path, sizeof (default_gconv_path));
> 
> The current behaviour is to perform a call to memcpy(), add user_len to
> the return value of the call and use mvi, mvc for the two calls with constant byte-lengths.
> 
> If only _HAVE_STRING_ARCH_mempcpy is defined, there would be three calls
> to mempcpy. Here you are right, this is not what you want.
> 
> The behaviour with this patch is to call mempcpy() once and use mvi, mvc
> directly without the need to add user_len after the mempcpy call.
> This way, there is no need to store user_len in an extra register or on stack.
> 
> I don't know the conditions on other archs, whether memcpy is inlined or called, thus I added the s390-specific condition to s390-string.h.
> 

Right, but I *think* compiler would be smart enough to just avoid the
extra spilling. Take this example for instance [1], using GCC 5.3
for s390x I see no difference in generated assembly if I the strategy
I proposed (-DMEMPCPY_TO_MEMCPY) to the s390 specific you are suggesting.

In the end, I am proposing that architecture specific micro-optimization
should be avoid in favor of a more specific one.  Specially the one that
tend to avoid one or two extra spilling based on quite complex macro
expansion.

[1] http://pastie.org/10824072

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] S390: Do not call memcpy, memcmp, memset within libc.so via ifunc-plt.
  2016-05-04 15:22       ` Adhemerval Zanella
@ 2016-05-11 13:41         ` Stefan Liebler
  2016-05-11 19:27           ` Adhemerval Zanella
  0 siblings, 1 reply; 36+ messages in thread
From: Stefan Liebler @ 2016-05-11 13:41 UTC (permalink / raw)
  To: libc-alpha

On 05/04/2016 05:22 PM, Adhemerval Zanella wrote:
>
>
> On 27/04/2016 05:14, Stefan Liebler wrote:
>> On 04/26/2016 03:35 PM, Adhemerval Zanella wrote:
>>>
>>>
>>> On 26/04/2016 09:07, Stefan Liebler wrote:
>>>> On s390, the memcpy, memcmp, memset functions are IFUNC symbols,
>>>> which are created with s390_libc_ifunc-macro.
>>>> This macro creates a __GI_ symbol which is set to the
>>>> ifunced symbol. Thus calls within libc.so to e.g. memcpy
>>>> result in a call to *ABS*+0x954c0@plt stub and afterwards
>>>> to the resolved memcpy-ifunc-variant.
>>>>
>>>> This patch sets the __GI_ symbol to the default-ifunc-variant
>>>> to avoid the plt call. The __GI_ symbols are now created at the
>>>> default variant of ifunced function.
>>>
>>> Is the internal ifunc plt usage leading to a failure in s390/s390x
>>> (as for powerpc32 and i686) or is it an optimization fix?
>>
>> No it does not lead to a failure on s390.
>> It is an optimization to avoid the extra call to the plt-stub if called within libc.so.
>>
>
> Right because on 19c4bec0f43599eecc2f32de96ae179cd7d64053 I did the
> exact opposite because for POWER it shows that the gains of using
> optimized version over default one shows a good improvement in
> algorithms that use these symbol internally (like regex).
>
> It might be the case where the default s390 version is good enough
> and shows no performance difference.
>

Hi Adhemerval,

I've retested the patchset in context of regex with these functions 
called with/without ifunc-plt.
The version without ifunc-plt is slightly faster.

But you are right. This decision has to be redetermined with further 
variants or newer machines.

In case of the string-functions, calling them via ifunc within libc.so 
could gain improvement.

Bye
Stefan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] S390: Do not call memcpy, memcmp, memset within libc.so via ifunc-plt.
  2016-05-11 13:41         ` Stefan Liebler
@ 2016-05-11 19:27           ` Adhemerval Zanella
  0 siblings, 0 replies; 36+ messages in thread
From: Adhemerval Zanella @ 2016-05-11 19:27 UTC (permalink / raw)
  To: libc-alpha



On 11/05/2016 10:40, Stefan Liebler wrote:
> On 05/04/2016 05:22 PM, Adhemerval Zanella wrote:
>>
>>
>> On 27/04/2016 05:14, Stefan Liebler wrote:
>>> On 04/26/2016 03:35 PM, Adhemerval Zanella wrote:
>>>>
>>>>
>>>> On 26/04/2016 09:07, Stefan Liebler wrote:
>>>>> On s390, the memcpy, memcmp, memset functions are IFUNC symbols,
>>>>> which are created with s390_libc_ifunc-macro.
>>>>> This macro creates a __GI_ symbol which is set to the
>>>>> ifunced symbol. Thus calls within libc.so to e.g. memcpy
>>>>> result in a call to *ABS*+0x954c0@plt stub and afterwards
>>>>> to the resolved memcpy-ifunc-variant.
>>>>>
>>>>> This patch sets the __GI_ symbol to the default-ifunc-variant
>>>>> to avoid the plt call. The __GI_ symbols are now created at the
>>>>> default variant of ifunced function.
>>>>
>>>> Is the internal ifunc plt usage leading to a failure in s390/s390x
>>>> (as for powerpc32 and i686) or is it an optimization fix?
>>>
>>> No it does not lead to a failure on s390.
>>> It is an optimization to avoid the extra call to the plt-stub if called within libc.so.
>>>
>>
>> Right because on 19c4bec0f43599eecc2f32de96ae179cd7d64053 I did the
>> exact opposite because for POWER it shows that the gains of using
>> optimized version over default one shows a good improvement in
>> algorithms that use these symbol internally (like regex).
>>
>> It might be the case where the default s390 version is good enough
>> and shows no performance difference.
>>
> 
> Hi Adhemerval,
> 
> I've retested the patchset in context of regex with these functions called with/without ifunc-plt.
> The version without ifunc-plt is slightly faster.
> 
> But you are right. This decision has to be redetermined with further variants or newer machines.
> 
> In case of the string-functions, calling them via ifunc within libc.so could gain improvement.
> 
> Bye
> Stefan
> 

Fair enough them.  I would expect without ifunc-plt to be slower in newer
architectures, but it could be case where the performance differences
between the base implementation and the arch optimized one is that large
(for instance powerpc64 memcpy the base one only uses 8-bytes store/loads,
which show large differences depending of the workload compared to power7
one).

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-05-18 15:25                               ` Stefan Liebler
@ 2016-05-24  9:11                                 ` Stefan Liebler
  0 siblings, 0 replies; 36+ messages in thread
From: Stefan Liebler @ 2016-05-24  9:11 UTC (permalink / raw)
  To: libc-alpha

On 05/18/2016 05:19 PM, Stefan Liebler wrote:
> On 05/13/2016 05:30 PM, Stefan Liebler wrote:
>>
>>
>> On 05/13/2016 05:06 PM, H.J. Lu wrote:
>>> On Fri, May 13, 2016 at 7:59 AM, Stefan Liebler
>>> <stli@linux.vnet.ibm.com> wrote:
>>>>
>>>>
>>>> On 05/13/2016 04:49 PM, H.J. Lu wrote:
>>>>>
>>>>> On Fri, May 13, 2016 at 7:42 AM, Stefan Liebler
>>>>> <stli@linux.vnet.ibm.com>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 05/12/2016 04:10 PM, Adhemerval Zanella wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 09/05/2016 11:15, Stefan Liebler wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 05/05/2016 06:36 PM, H.J. Lu wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, May 5, 2016 at 9:34 AM, Adhemerval Zanella
>>>>>>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On May 5, 2016, at 11:45, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Thu, May 5, 2016 at 7:15 AM, Adhemerval Zanella
>>>>>>>>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On 05/05/2016 10:37, H.J. Lu wrote:
>>>>>>>>>>>>> On Wed, May 4, 2016 at 1:58 PM, Adhemerval Zanella
>>>>>>>>>>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 04/05/2016 17:51, Wilco Dijkstra wrote:
>>>>>>>>>>>>>>> Adhemerval Zanella wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But my point is all the architectures which provide an
>>>>>>>>>>>>>>>> optimized
>>>>>>>>>>>>>>>> mempcpy is
>>>>>>>>>>>>>>>> though either 1. jump directly to optimized memcpy (s390
>>>>>>>>>>>>>>>> case
>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>> this patchset),
>>>>>>>>>>>>>>>> 2. clonning the same memcpy implementation and adjusting
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> pointers (x86_64) or
>>>>>>>>>>>>>>>> 3. using a similar strategy for both implementations
>>>>>>>>>>>>>>>> (powerpc).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Indeed, which of those are used doesn't matter much.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So for this change I am proposing compiler support won't be
>>>>>>>>>>>>>>>> required because both
>>>>>>>>>>>>>>>> memcpy and __mempcpy will be transformed to memcpy + s.
>>>>>>>>>>>>>>>> Based
>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>> assumption that
>>>>>>>>>>>>>>>> memcpy is fast as mempcpy implementation I think there
>>>>>>>>>>>>>>>> is no
>>>>>>>>>>>>>>>> need
>>>>>>>>>>>>>>>> to just add
>>>>>>>>>>>>>>>> this micro-optimization to only s390, but rather make is
>>>>>>>>>>>>>>>> general.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> GLIBC already has this optimization in the generic string
>>>>>>>>>>>>>>> header,
>>>>>>>>>>>>>>> it's just that s390 wants
>>>>>>>>>>>>>>> to do something different again. As long as GCC isn't
>>>>>>>>>>>>>>> fixed this
>>>>>>>>>>>>>>> isn't possible to support
>>>>>>>>>>>>>>> s390 without this header workaround. And we need GCC to
>>>>>>>>>>>>>>> improve
>>>>>>>>>>>>>>> so
>>>>>>>>>>>>>>> things work
>>>>>>>>>>>>>>> better for all the other C libraries...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But the current one at string/string.h is only enabled with
>>>>>>>>>>>>>> !defined _HAVE_STRING_ARCH_mempcpy,
>>>>>>>>>>>>>> so if a port actually adds a mempcpy one it won't be enabled.
>>>>>>>>>>>>>> What
>>>>>>>>>>>>>> I am trying to argue it
>>>>>>>>>>>>>> to just remove the !defined _HAVE_STRING_ARCH_mempcpy and
>>>>>>>>>>>>>> enable
>>>>>>>>>>>>>> it
>>>>>>>>>>>>>> as default for all
>>>>>>>>>>>>>> ports.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please don't enable it for x86.  Calling memcpy means we
>>>>>>>>>>>>> have to
>>>>>>>>>>>>> save and restore 2 registers for no good reasons.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, direct call will require save and restore the size for
>>>>>>>>>>>> further
>>>>>>>>>>>> add
>>>>>>>>>>>> and this is true for most architectures.  My question is if
>>>>>>>>>>>> does
>>>>>>>>>>>> this
>>>>>>>>>>>> really matter in currently GLIBC internal usage and on programs
>>>>>>>>>>>> that
>>>>>>>>>>>> might use it compared against the burden of keeping the various
>>>>>>>>>>>> string*.h header in check for multiple architectures or
>>>>>>>>>>>> adding this
>>>>>>>>>>>> logic (mempcpy transformation to memcpy) on compiler.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> What burden? There is nothing to do in glibc for x86.  GCC can
>>>>>>>>>>> inline mempcpy for x86.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In fact I am objecting all the bits GLIBC added on string*.h that
>>>>>>>>>> only
>>>>>>>>>> adds complexity for some micro-optimizations. For x86 I do
>>>>>>>>>> agree that
>>>>>>>>>> transforming mempcpy to memcpy is no the best strategy.
>>>>>>>>>>
>>>>>>>>>> My rationale is to avoid add even more arch-specific bits in
>>>>>>>>>> installed
>>>>>>>>>> headers to add such optimizations.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I believe most of those micro-optimizations belong to GCC, not
>>>>>>>>> glibc.
>>>>>>>>> Of course, we should keep the existing ones for older GCCs.  We
>>>>>>>>> should avoid adding new ones.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Does this mean, the proposed way is to not add a macro for
>>>>>>>> mempcpy in
>>>>>>>> sysdeps/s390/bits/string.h or e.g. for another architecture?
>>>>>>>>
>>>>>>>> If _HAVE_STRING_ARCH_mempcpy is not defined for s390, then it will
>>>>>>>> always
>>>>>>>> use the macro defined in string/string.h, which inlines memcpy +
>>>>>>>> n and
>>>>>>>> the
>>>>>>>> mempcpy-function is not called. The memcpy is transformed to mvc,
>>>>>>>> mvhi
>>>>>>>> for
>>>>>>>> e.g. constant lengths. Otherwise, the memcpy-function is called
>>>>>>>> and the
>>>>>>>> length will be added after the call.
>>>>>>>>
>>>>>>>> According to PR70140
>>>>>>>> (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70140), the macro in
>>>>>>>> string/string.h will be removed after this bug is fixed in GCC?
>>>>>>>> Then I think the decision, if mempcpy or memcpy is called or an
>>>>>>>> inline
>>>>>>>> version is emitted, will be done per architecture backend?
>>>>>>>>
>>>>>>>> If this is all true, then the mempcpy-function will be called in
>>>>>>>> future
>>>>>>>> if it makes sense!?
>>>>>>>>
>>>>>>>>
>>>>>>>> If _HAVE_STRING_ARCH_mempcpy will be defined for s390, then
>>>>>>>> mempcpy-function is always called - even for cases with constant
>>>>>>>> length
>>>>>>>> where mvc or mvhi is emitted.
>>>>>>>> This is definitely not the intention of this patch.
>>>>>>>> After fixing PR70140, GCC will correctly handle the constant length
>>>>>>>> cases!?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> What I am proposing is to avoid add more arch-specific
>>>>>>> optimization on
>>>>>>> installed
>>>>>>> string*.h headers and instead work on adding them on compiler
>>>>>>> side.  My
>>>>>>> view is
>>>>>>> we should cleanup as much as possible the string headers and only
>>>>>>> add
>>>>>>> optimization
>>>>>>> that are more architecture neutral.
>>>>>>>
>>>>>>> Related to patch, my understanding is s390x does not really
>>>>>>> provide an
>>>>>>> optimized
>>>>>>> mempcpy (it uses the default mempcpy.c) and I think a better
>>>>>>> approach
>>>>>>> would be
>>>>>>> to add an optimized mempcpy like x88_64
>>>>>>> (c365e615f7429aee302f8af7bf07ae262278febb).
>>>>>>> The mempcpy won't be optimized directly to mvc instruction, but at
>>>>>>> least
>>>>>>> it will
>>>>>>> call the optimized memcpy.  The full optimization will be done by
>>>>>>> correctly
>>>>>>> handling the transformation on compiler size (the PR#70140 as you
>>>>>>> referred).
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Okay. Here is the updated patch. It implements mempcpy with memcpy as
>>>>>> before, but does not change the s390-specific string.h and
>>>>>> _HAVE_STRING_ARCH_mempcpy is not defined. Thus at the moment
>>>>>> mempcpy()
>>>>>> won't
>>>>>> be called, but instead memcpy + n is inlined by the mempcpy macro in
>>>>>> string/string.h. After fxing PR70140, either the mempcpy macro in
>>>>>> string/string.h has to be removed or _HAVE_STRING_ARCH_mempcpy has
>>>>>> to be
>>>>>> defined for S390.
>>>>>>
>>>>>> Okay to commit?
>>>>>
>>>>>
>>>>> +__asm__ (".weak mempcpy\n\t"
>>>>> + ".set mempcpy,__mempcpy\n\t");
>>>>>
>>>>> Don't we have a macro for this?
>>>>>
>>>> Yes that is true, but I can't use it in mempcpy.c in this case,
>>>> because the macro s390_libc_ifunc (__mempcpy) uses an inline assembly
>>>> to implement the __mempcpy ifunc symbol.
>>>> If I use the weak_alias macro here in the c-file, gcc fails because
>>>> it has
>>>> no knowledge of symbol __mempcpy.
>>>>
>>>
>>> Can you make s390_libc_ifunc similar to libc_ifunc to expose
>>> FUNC to gcc?
>>>
>>
>> No I won't do that.
>> At the moment s390_vx_libc_ifunc is implemented in this way,
>> then the weak_alias works.
>> But I have to remove the asm(FUNCNAME) in
>> "extern void *resolverfunction(..) asm (FUNCNAME)"!
>>
>> The dwarf information for the resolverfunction has a DW_AT_linkage_name
>> entry, which shows to FUNCNAME. If you do an inferior function call to
>> FUNCNAME in lldb, then it fails due to something like that:
>> "error: no matching function for call to 'FUNCNAME'
>> candidate function not viable: no known conversion from 'const char [6]'
>> to 'unsigned long' for 1st argument"
>> (The unsigned long is the dl_hwcap argument of the resolverfunction).
>>
>> Here is the dwarf information for e.g. __resolve___strlen:
>> <1><1e6424>: Abbrev Number: 43 (DW_TAG_subprogram)
>>      <1e6425>   DW_AT_external    : 1
>>      <1e6425>   DW_AT_name        : (indirect string, offset: 0x1146e):
>> __resolve___strlen
>>      <1e6429>   DW_AT_decl_file   : 1
>>      <1e642a>   DW_AT_decl_line   : 23
>>      <1e642b>   DW_AT_linkage_name: (indirect string, offset: 0x1147a):
>> strlen
>>      <1e642f>   DW_AT_prototyped  : 1
>>      <1e642f>   DW_AT_type        : <0x1e4ccd>
>>      <1e6433>   DW_AT_low_pc      : 0x998e0
>>      <1e643b>   DW_AT_high_pc     : 0x16
>>      <1e6443>   DW_AT_frame_base  : 1 byte block: 9c
>> (DW_OP_call_frame_cfa)
>>      <1e6445>   DW_AT_GNU_all_call_sites: 1
>>      <1e6445>   DW_AT_sibling     : <0x1e6459>
>>   <2><1e6449>: Abbrev Number: 44 (DW_TAG_formal_parameter)
>>      <1e644a>   DW_AT_name        : (indirect string, offset: 0x1845):
>> dl_hwcap
>>      <1e644e>   DW_AT_decl_file   : 1
>>      <1e644f>   DW_AT_decl_line   : 23
>>      <1e6450>   DW_AT_type        : <0x1e4c8d>
>>      <1e6454>   DW_AT_location    : 0x122115 (location list)
>>
>>
>> Without the asm (FUNCNAME), there is no DW_AT_linkage_name entry and the
>> inferior function call is working.
>>
>> Do you have any better idea?
>>
>>
> Any objection?
> Otherwise I'll commit the patch series soon.
>
> Bye
> Stefan
>
>
Committed.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-05-13 15:31                             ` Stefan Liebler
@ 2016-05-18 15:25                               ` Stefan Liebler
  2016-05-24  9:11                                 ` Stefan Liebler
  0 siblings, 1 reply; 36+ messages in thread
From: Stefan Liebler @ 2016-05-18 15:25 UTC (permalink / raw)
  To: libc-alpha

On 05/13/2016 05:30 PM, Stefan Liebler wrote:
>
>
> On 05/13/2016 05:06 PM, H.J. Lu wrote:
>> On Fri, May 13, 2016 at 7:59 AM, Stefan Liebler
>> <stli@linux.vnet.ibm.com> wrote:
>>>
>>>
>>> On 05/13/2016 04:49 PM, H.J. Lu wrote:
>>>>
>>>> On Fri, May 13, 2016 at 7:42 AM, Stefan Liebler
>>>> <stli@linux.vnet.ibm.com>
>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 05/12/2016 04:10 PM, Adhemerval Zanella wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 09/05/2016 11:15, Stefan Liebler wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 05/05/2016 06:36 PM, H.J. Lu wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, May 5, 2016 at 9:34 AM, Adhemerval Zanella
>>>>>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On May 5, 2016, at 11:45, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Thu, May 5, 2016 at 7:15 AM, Adhemerval Zanella
>>>>>>>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> On 05/05/2016 10:37, H.J. Lu wrote:
>>>>>>>>>>>> On Wed, May 4, 2016 at 1:58 PM, Adhemerval Zanella
>>>>>>>>>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 04/05/2016 17:51, Wilco Dijkstra wrote:
>>>>>>>>>>>>>> Adhemerval Zanella wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> But my point is all the architectures which provide an
>>>>>>>>>>>>>>> optimized
>>>>>>>>>>>>>>> mempcpy is
>>>>>>>>>>>>>>> though either 1. jump directly to optimized memcpy (s390
>>>>>>>>>>>>>>> case
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>> this patchset),
>>>>>>>>>>>>>>> 2. clonning the same memcpy implementation and adjusting the
>>>>>>>>>>>>>>> pointers (x86_64) or
>>>>>>>>>>>>>>> 3. using a similar strategy for both implementations
>>>>>>>>>>>>>>> (powerpc).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Indeed, which of those are used doesn't matter much.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So for this change I am proposing compiler support won't be
>>>>>>>>>>>>>>> required because both
>>>>>>>>>>>>>>> memcpy and __mempcpy will be transformed to memcpy + s.
>>>>>>>>>>>>>>> Based
>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>> assumption that
>>>>>>>>>>>>>>> memcpy is fast as mempcpy implementation I think there is no
>>>>>>>>>>>>>>> need
>>>>>>>>>>>>>>> to just add
>>>>>>>>>>>>>>> this micro-optimization to only s390, but rather make is
>>>>>>>>>>>>>>> general.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> GLIBC already has this optimization in the generic string
>>>>>>>>>>>>>> header,
>>>>>>>>>>>>>> it's just that s390 wants
>>>>>>>>>>>>>> to do something different again. As long as GCC isn't
>>>>>>>>>>>>>> fixed this
>>>>>>>>>>>>>> isn't possible to support
>>>>>>>>>>>>>> s390 without this header workaround. And we need GCC to
>>>>>>>>>>>>>> improve
>>>>>>>>>>>>>> so
>>>>>>>>>>>>>> things work
>>>>>>>>>>>>>> better for all the other C libraries...
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> But the current one at string/string.h is only enabled with
>>>>>>>>>>>>> !defined _HAVE_STRING_ARCH_mempcpy,
>>>>>>>>>>>>> so if a port actually adds a mempcpy one it won't be enabled.
>>>>>>>>>>>>> What
>>>>>>>>>>>>> I am trying to argue it
>>>>>>>>>>>>> to just remove the !defined _HAVE_STRING_ARCH_mempcpy and
>>>>>>>>>>>>> enable
>>>>>>>>>>>>> it
>>>>>>>>>>>>> as default for all
>>>>>>>>>>>>> ports.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Please don't enable it for x86.  Calling memcpy means we
>>>>>>>>>>>> have to
>>>>>>>>>>>> save and restore 2 registers for no good reasons.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Yes, direct call will require save and restore the size for
>>>>>>>>>>> further
>>>>>>>>>>> add
>>>>>>>>>>> and this is true for most architectures.  My question is if does
>>>>>>>>>>> this
>>>>>>>>>>> really matter in currently GLIBC internal usage and on programs
>>>>>>>>>>> that
>>>>>>>>>>> might use it compared against the burden of keeping the various
>>>>>>>>>>> string*.h header in check for multiple architectures or
>>>>>>>>>>> adding this
>>>>>>>>>>> logic (mempcpy transformation to memcpy) on compiler.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> What burden? There is nothing to do in glibc for x86.  GCC can
>>>>>>>>>> inline mempcpy for x86.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In fact I am objecting all the bits GLIBC added on string*.h that
>>>>>>>>> only
>>>>>>>>> adds complexity for some micro-optimizations. For x86 I do
>>>>>>>>> agree that
>>>>>>>>> transforming mempcpy to memcpy is no the best strategy.
>>>>>>>>>
>>>>>>>>> My rationale is to avoid add even more arch-specific bits in
>>>>>>>>> installed
>>>>>>>>> headers to add such optimizations.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I believe most of those micro-optimizations belong to GCC, not
>>>>>>>> glibc.
>>>>>>>> Of course, we should keep the existing ones for older GCCs.  We
>>>>>>>> should avoid adding new ones.
>>>>>>>>
>>>>>>>
>>>>>>> Does this mean, the proposed way is to not add a macro for
>>>>>>> mempcpy in
>>>>>>> sysdeps/s390/bits/string.h or e.g. for another architecture?
>>>>>>>
>>>>>>> If _HAVE_STRING_ARCH_mempcpy is not defined for s390, then it will
>>>>>>> always
>>>>>>> use the macro defined in string/string.h, which inlines memcpy +
>>>>>>> n and
>>>>>>> the
>>>>>>> mempcpy-function is not called. The memcpy is transformed to mvc,
>>>>>>> mvhi
>>>>>>> for
>>>>>>> e.g. constant lengths. Otherwise, the memcpy-function is called
>>>>>>> and the
>>>>>>> length will be added after the call.
>>>>>>>
>>>>>>> According to PR70140
>>>>>>> (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70140), the macro in
>>>>>>> string/string.h will be removed after this bug is fixed in GCC?
>>>>>>> Then I think the decision, if mempcpy or memcpy is called or an
>>>>>>> inline
>>>>>>> version is emitted, will be done per architecture backend?
>>>>>>>
>>>>>>> If this is all true, then the mempcpy-function will be called in
>>>>>>> future
>>>>>>> if it makes sense!?
>>>>>>>
>>>>>>>
>>>>>>> If _HAVE_STRING_ARCH_mempcpy will be defined for s390, then
>>>>>>> mempcpy-function is always called - even for cases with constant
>>>>>>> length
>>>>>>> where mvc or mvhi is emitted.
>>>>>>> This is definitely not the intention of this patch.
>>>>>>> After fixing PR70140, GCC will correctly handle the constant length
>>>>>>> cases!?
>>>>>>
>>>>>>
>>>>>>
>>>>>> What I am proposing is to avoid add more arch-specific
>>>>>> optimization on
>>>>>> installed
>>>>>> string*.h headers and instead work on adding them on compiler
>>>>>> side.  My
>>>>>> view is
>>>>>> we should cleanup as much as possible the string headers and only add
>>>>>> optimization
>>>>>> that are more architecture neutral.
>>>>>>
>>>>>> Related to patch, my understanding is s390x does not really
>>>>>> provide an
>>>>>> optimized
>>>>>> mempcpy (it uses the default mempcpy.c) and I think a better approach
>>>>>> would be
>>>>>> to add an optimized mempcpy like x88_64
>>>>>> (c365e615f7429aee302f8af7bf07ae262278febb).
>>>>>> The mempcpy won't be optimized directly to mvc instruction, but at
>>>>>> least
>>>>>> it will
>>>>>> call the optimized memcpy.  The full optimization will be done by
>>>>>> correctly
>>>>>> handling the transformation on compiler size (the PR#70140 as you
>>>>>> referred).
>>>>>>
>>>>>>
>>>>>
>>>>> Okay. Here is the updated patch. It implements mempcpy with memcpy as
>>>>> before, but does not change the s390-specific string.h and
>>>>> _HAVE_STRING_ARCH_mempcpy is not defined. Thus at the moment mempcpy()
>>>>> won't
>>>>> be called, but instead memcpy + n is inlined by the mempcpy macro in
>>>>> string/string.h. After fxing PR70140, either the mempcpy macro in
>>>>> string/string.h has to be removed or _HAVE_STRING_ARCH_mempcpy has
>>>>> to be
>>>>> defined for S390.
>>>>>
>>>>> Okay to commit?
>>>>
>>>>
>>>> +__asm__ (".weak mempcpy\n\t"
>>>> + ".set mempcpy,__mempcpy\n\t");
>>>>
>>>> Don't we have a macro for this?
>>>>
>>> Yes that is true, but I can't use it in mempcpy.c in this case,
>>> because the macro s390_libc_ifunc (__mempcpy) uses an inline assembly
>>> to implement the __mempcpy ifunc symbol.
>>> If I use the weak_alias macro here in the c-file, gcc fails because
>>> it has
>>> no knowledge of symbol __mempcpy.
>>>
>>
>> Can you make s390_libc_ifunc similar to libc_ifunc to expose
>> FUNC to gcc?
>>
>
> No I won't do that.
> At the moment s390_vx_libc_ifunc is implemented in this way,
> then the weak_alias works.
> But I have to remove the asm(FUNCNAME) in
> "extern void *resolverfunction(..) asm (FUNCNAME)"!
>
> The dwarf information for the resolverfunction has a DW_AT_linkage_name
> entry, which shows to FUNCNAME. If you do an inferior function call to
> FUNCNAME in lldb, then it fails due to something like that:
> "error: no matching function for call to 'FUNCNAME'
> candidate function not viable: no known conversion from 'const char [6]'
> to 'unsigned long' for 1st argument"
> (The unsigned long is the dl_hwcap argument of the resolverfunction).
>
> Here is the dwarf information for e.g. __resolve___strlen:
> <1><1e6424>: Abbrev Number: 43 (DW_TAG_subprogram)
>      <1e6425>   DW_AT_external    : 1
>      <1e6425>   DW_AT_name        : (indirect string, offset: 0x1146e):
> __resolve___strlen
>      <1e6429>   DW_AT_decl_file   : 1
>      <1e642a>   DW_AT_decl_line   : 23
>      <1e642b>   DW_AT_linkage_name: (indirect string, offset: 0x1147a):
> strlen
>      <1e642f>   DW_AT_prototyped  : 1
>      <1e642f>   DW_AT_type        : <0x1e4ccd>
>      <1e6433>   DW_AT_low_pc      : 0x998e0
>      <1e643b>   DW_AT_high_pc     : 0x16
>      <1e6443>   DW_AT_frame_base  : 1 byte block: 9c
> (DW_OP_call_frame_cfa)
>      <1e6445>   DW_AT_GNU_all_call_sites: 1
>      <1e6445>   DW_AT_sibling     : <0x1e6459>
>   <2><1e6449>: Abbrev Number: 44 (DW_TAG_formal_parameter)
>      <1e644a>   DW_AT_name        : (indirect string, offset: 0x1845):
> dl_hwcap
>      <1e644e>   DW_AT_decl_file   : 1
>      <1e644f>   DW_AT_decl_line   : 23
>      <1e6450>   DW_AT_type        : <0x1e4c8d>
>      <1e6454>   DW_AT_location    : 0x122115 (location list)
>
>
> Without the asm (FUNCNAME), there is no DW_AT_linkage_name entry and the
> inferior function call is working.
>
> Do you have any better idea?
>
>
Any objection?
Otherwise I'll commit the patch series soon.

Bye
Stefan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-05-13 15:06                           ` H.J. Lu
@ 2016-05-13 15:31                             ` Stefan Liebler
  2016-05-18 15:25                               ` Stefan Liebler
  0 siblings, 1 reply; 36+ messages in thread
From: Stefan Liebler @ 2016-05-13 15:31 UTC (permalink / raw)
  To: libc-alpha



On 05/13/2016 05:06 PM, H.J. Lu wrote:
> On Fri, May 13, 2016 at 7:59 AM, Stefan Liebler <stli@linux.vnet.ibm.com> wrote:
>>
>>
>> On 05/13/2016 04:49 PM, H.J. Lu wrote:
>>>
>>> On Fri, May 13, 2016 at 7:42 AM, Stefan Liebler <stli@linux.vnet.ibm.com>
>>> wrote:
>>>>
>>>>
>>>>
>>>> On 05/12/2016 04:10 PM, Adhemerval Zanella wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 09/05/2016 11:15, Stefan Liebler wrote:
>>>>>>
>>>>>>
>>>>>> On 05/05/2016 06:36 PM, H.J. Lu wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Thu, May 5, 2016 at 9:34 AM, Adhemerval Zanella
>>>>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> On May 5, 2016, at 11:45, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> On Thu, May 5, 2016 at 7:15 AM, Adhemerval Zanella
>>>>>>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On 05/05/2016 10:37, H.J. Lu wrote:
>>>>>>>>>>> On Wed, May 4, 2016 at 1:58 PM, Adhemerval Zanella
>>>>>>>>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On 04/05/2016 17:51, Wilco Dijkstra wrote:
>>>>>>>>>>>>> Adhemerval Zanella wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But my point is all the architectures which provide an
>>>>>>>>>>>>>> optimized
>>>>>>>>>>>>>> mempcpy is
>>>>>>>>>>>>>> though either 1. jump directly to optimized memcpy (s390 case
>>>>>>>>>>>>>> for
>>>>>>>>>>>>>> this patchset),
>>>>>>>>>>>>>> 2. clonning the same memcpy implementation and adjusting the
>>>>>>>>>>>>>> pointers (x86_64) or
>>>>>>>>>>>>>> 3. using a similar strategy for both implementations (powerpc).
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Indeed, which of those are used doesn't matter much.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> So for this change I am proposing compiler support won't be
>>>>>>>>>>>>>> required because both
>>>>>>>>>>>>>> memcpy and __mempcpy will be transformed to memcpy + s.  Based
>>>>>>>>>>>>>> on
>>>>>>>>>>>>>> assumption that
>>>>>>>>>>>>>> memcpy is fast as mempcpy implementation I think there is no
>>>>>>>>>>>>>> need
>>>>>>>>>>>>>> to just add
>>>>>>>>>>>>>> this micro-optimization to only s390, but rather make is
>>>>>>>>>>>>>> general.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> GLIBC already has this optimization in the generic string
>>>>>>>>>>>>> header,
>>>>>>>>>>>>> it's just that s390 wants
>>>>>>>>>>>>> to do something different again. As long as GCC isn't fixed this
>>>>>>>>>>>>> isn't possible to support
>>>>>>>>>>>>> s390 without this header workaround. And we need GCC to improve
>>>>>>>>>>>>> so
>>>>>>>>>>>>> things work
>>>>>>>>>>>>> better for all the other C libraries...
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> But the current one at string/string.h is only enabled with
>>>>>>>>>>>> !defined _HAVE_STRING_ARCH_mempcpy,
>>>>>>>>>>>> so if a port actually adds a mempcpy one it won't be enabled.
>>>>>>>>>>>> What
>>>>>>>>>>>> I am trying to argue it
>>>>>>>>>>>> to just remove the !defined _HAVE_STRING_ARCH_mempcpy and enable
>>>>>>>>>>>> it
>>>>>>>>>>>> as default for all
>>>>>>>>>>>> ports.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Please don't enable it for x86.  Calling memcpy means we have to
>>>>>>>>>>> save and restore 2 registers for no good reasons.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes, direct call will require save and restore the size for further
>>>>>>>>>> add
>>>>>>>>>> and this is true for most architectures.  My question is if does
>>>>>>>>>> this
>>>>>>>>>> really matter in currently GLIBC internal usage and on programs
>>>>>>>>>> that
>>>>>>>>>> might use it compared against the burden of keeping the various
>>>>>>>>>> string*.h header in check for multiple architectures or adding this
>>>>>>>>>> logic (mempcpy transformation to memcpy) on compiler.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> What burden? There is nothing to do in glibc for x86.  GCC can
>>>>>>>>> inline mempcpy for x86.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> In fact I am objecting all the bits GLIBC added on string*.h that
>>>>>>>> only
>>>>>>>> adds complexity for some micro-optimizations. For x86 I do agree that
>>>>>>>> transforming mempcpy to memcpy is no the best strategy.
>>>>>>>>
>>>>>>>> My rationale is to avoid add even more arch-specific bits in
>>>>>>>> installed
>>>>>>>> headers to add such optimizations.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I believe most of those micro-optimizations belong to GCC, not glibc.
>>>>>>> Of course, we should keep the existing ones for older GCCs.  We
>>>>>>> should avoid adding new ones.
>>>>>>>
>>>>>>
>>>>>> Does this mean, the proposed way is to not add a macro for mempcpy in
>>>>>> sysdeps/s390/bits/string.h or e.g. for another architecture?
>>>>>>
>>>>>> If _HAVE_STRING_ARCH_mempcpy is not defined for s390, then it will
>>>>>> always
>>>>>> use the macro defined in string/string.h, which inlines memcpy + n and
>>>>>> the
>>>>>> mempcpy-function is not called. The memcpy is transformed to mvc, mvhi
>>>>>> for
>>>>>> e.g. constant lengths. Otherwise, the memcpy-function is called and the
>>>>>> length will be added after the call.
>>>>>>
>>>>>> According to PR70140
>>>>>> (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70140), the macro in
>>>>>> string/string.h will be removed after this bug is fixed in GCC?
>>>>>> Then I think the decision, if mempcpy or memcpy is called or an inline
>>>>>> version is emitted, will be done per architecture backend?
>>>>>>
>>>>>> If this is all true, then the mempcpy-function will be called in future
>>>>>> if it makes sense!?
>>>>>>
>>>>>>
>>>>>> If _HAVE_STRING_ARCH_mempcpy will be defined for s390, then
>>>>>> mempcpy-function is always called - even for cases with constant length
>>>>>> where mvc or mvhi is emitted.
>>>>>> This is definitely not the intention of this patch.
>>>>>> After fixing PR70140, GCC will correctly handle the constant length
>>>>>> cases!?
>>>>>
>>>>>
>>>>>
>>>>> What I am proposing is to avoid add more arch-specific optimization on
>>>>> installed
>>>>> string*.h headers and instead work on adding them on compiler side.  My
>>>>> view is
>>>>> we should cleanup as much as possible the string headers and only add
>>>>> optimization
>>>>> that are more architecture neutral.
>>>>>
>>>>> Related to patch, my understanding is s390x does not really provide an
>>>>> optimized
>>>>> mempcpy (it uses the default mempcpy.c) and I think a better approach
>>>>> would be
>>>>> to add an optimized mempcpy like x88_64
>>>>> (c365e615f7429aee302f8af7bf07ae262278febb).
>>>>> The mempcpy won't be optimized directly to mvc instruction, but at least
>>>>> it will
>>>>> call the optimized memcpy.  The full optimization will be done by
>>>>> correctly
>>>>> handling the transformation on compiler size (the PR#70140 as you
>>>>> referred).
>>>>>
>>>>>
>>>>
>>>> Okay. Here is the updated patch. It implements mempcpy with memcpy as
>>>> before, but does not change the s390-specific string.h and
>>>> _HAVE_STRING_ARCH_mempcpy is not defined. Thus at the moment mempcpy()
>>>> won't
>>>> be called, but instead memcpy + n is inlined by the mempcpy macro in
>>>> string/string.h. After fxing PR70140, either the mempcpy macro in
>>>> string/string.h has to be removed or _HAVE_STRING_ARCH_mempcpy has to be
>>>> defined for S390.
>>>>
>>>> Okay to commit?
>>>
>>>
>>> +__asm__ (".weak mempcpy\n\t"
>>> + ".set mempcpy,__mempcpy\n\t");
>>>
>>> Don't we have a macro for this?
>>>
>> Yes that is true, but I can't use it in mempcpy.c in this case,
>> because the macro s390_libc_ifunc (__mempcpy) uses an inline assembly
>> to implement the __mempcpy ifunc symbol.
>> If I use the weak_alias macro here in the c-file, gcc fails because it has
>> no knowledge of symbol __mempcpy.
>>
>
> Can you make s390_libc_ifunc similar to libc_ifunc to expose
> FUNC to gcc?
>

No I won't do that.
At the moment s390_vx_libc_ifunc is implemented in this way,
then the weak_alias works.
But I have to remove the asm(FUNCNAME) in
"extern void *resolverfunction(..) asm (FUNCNAME)"!

The dwarf information for the resolverfunction has a DW_AT_linkage_name 
entry, which shows to FUNCNAME. If you do an inferior function call to 
FUNCNAME in lldb, then it fails due to something like that:
"error: no matching function for call to 'FUNCNAME'
candidate function not viable: no known conversion from 'const char [6]' 
to 'unsigned long' for 1st argument"
(The unsigned long is the dl_hwcap argument of the resolverfunction).

Here is the dwarf information for e.g. __resolve___strlen:
<1><1e6424>: Abbrev Number: 43 (DW_TAG_subprogram)
     <1e6425>   DW_AT_external    : 1
     <1e6425>   DW_AT_name        : (indirect string, offset: 0x1146e): 
__resolve___strlen
     <1e6429>   DW_AT_decl_file   : 1
     <1e642a>   DW_AT_decl_line   : 23
     <1e642b>   DW_AT_linkage_name: (indirect string, offset: 0x1147a): 
strlen
     <1e642f>   DW_AT_prototyped  : 1
     <1e642f>   DW_AT_type        : <0x1e4ccd>
     <1e6433>   DW_AT_low_pc      : 0x998e0
     <1e643b>   DW_AT_high_pc     : 0x16
     <1e6443>   DW_AT_frame_base  : 1 byte block: 
9c     (DW_OP_call_frame_cfa)
     <1e6445>   DW_AT_GNU_all_call_sites: 1
     <1e6445>   DW_AT_sibling     : <0x1e6459>
  <2><1e6449>: Abbrev Number: 44 (DW_TAG_formal_parameter)
     <1e644a>   DW_AT_name        : (indirect string, offset: 0x1845): 
dl_hwcap
     <1e644e>   DW_AT_decl_file   : 1
     <1e644f>   DW_AT_decl_line   : 23
     <1e6450>   DW_AT_type        : <0x1e4c8d>
     <1e6454>   DW_AT_location    : 0x122115 (location list)


Without the asm (FUNCNAME), there is no DW_AT_linkage_name entry and the 
inferior function call is working.

Do you have any better idea?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-05-13 14:59                         ` Stefan Liebler
@ 2016-05-13 15:06                           ` H.J. Lu
  2016-05-13 15:31                             ` Stefan Liebler
  0 siblings, 1 reply; 36+ messages in thread
From: H.J. Lu @ 2016-05-13 15:06 UTC (permalink / raw)
  To: Stefan Liebler; +Cc: GNU C Library

On Fri, May 13, 2016 at 7:59 AM, Stefan Liebler <stli@linux.vnet.ibm.com> wrote:
>
>
> On 05/13/2016 04:49 PM, H.J. Lu wrote:
>>
>> On Fri, May 13, 2016 at 7:42 AM, Stefan Liebler <stli@linux.vnet.ibm.com>
>> wrote:
>>>
>>>
>>>
>>> On 05/12/2016 04:10 PM, Adhemerval Zanella wrote:
>>>>
>>>>
>>>>
>>>>
>>>> On 09/05/2016 11:15, Stefan Liebler wrote:
>>>>>
>>>>>
>>>>> On 05/05/2016 06:36 PM, H.J. Lu wrote:
>>>>>>
>>>>>>
>>>>>> On Thu, May 5, 2016 at 9:34 AM, Adhemerval Zanella
>>>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On May 5, 2016, at 11:45, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>>>>>
>>>>>>>> On Thu, May 5, 2016 at 7:15 AM, Adhemerval Zanella
>>>>>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On 05/05/2016 10:37, H.J. Lu wrote:
>>>>>>>>>> On Wed, May 4, 2016 at 1:58 PM, Adhemerval Zanella
>>>>>>>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> On 04/05/2016 17:51, Wilco Dijkstra wrote:
>>>>>>>>>>>> Adhemerval Zanella wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> But my point is all the architectures which provide an
>>>>>>>>>>>>> optimized
>>>>>>>>>>>>> mempcpy is
>>>>>>>>>>>>> though either 1. jump directly to optimized memcpy (s390 case
>>>>>>>>>>>>> for
>>>>>>>>>>>>> this patchset),
>>>>>>>>>>>>> 2. clonning the same memcpy implementation and adjusting the
>>>>>>>>>>>>> pointers (x86_64) or
>>>>>>>>>>>>> 3. using a similar strategy for both implementations (powerpc).
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Indeed, which of those are used doesn't matter much.
>>>>>>>>>>>>
>>>>>>>>>>>>> So for this change I am proposing compiler support won't be
>>>>>>>>>>>>> required because both
>>>>>>>>>>>>> memcpy and __mempcpy will be transformed to memcpy + s.  Based
>>>>>>>>>>>>> on
>>>>>>>>>>>>> assumption that
>>>>>>>>>>>>> memcpy is fast as mempcpy implementation I think there is no
>>>>>>>>>>>>> need
>>>>>>>>>>>>> to just add
>>>>>>>>>>>>> this micro-optimization to only s390, but rather make is
>>>>>>>>>>>>> general.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> GLIBC already has this optimization in the generic string
>>>>>>>>>>>> header,
>>>>>>>>>>>> it's just that s390 wants
>>>>>>>>>>>> to do something different again. As long as GCC isn't fixed this
>>>>>>>>>>>> isn't possible to support
>>>>>>>>>>>> s390 without this header workaround. And we need GCC to improve
>>>>>>>>>>>> so
>>>>>>>>>>>> things work
>>>>>>>>>>>> better for all the other C libraries...
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> But the current one at string/string.h is only enabled with
>>>>>>>>>>> !defined _HAVE_STRING_ARCH_mempcpy,
>>>>>>>>>>> so if a port actually adds a mempcpy one it won't be enabled.
>>>>>>>>>>> What
>>>>>>>>>>> I am trying to argue it
>>>>>>>>>>> to just remove the !defined _HAVE_STRING_ARCH_mempcpy and enable
>>>>>>>>>>> it
>>>>>>>>>>> as default for all
>>>>>>>>>>> ports.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Please don't enable it for x86.  Calling memcpy means we have to
>>>>>>>>>> save and restore 2 registers for no good reasons.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes, direct call will require save and restore the size for further
>>>>>>>>> add
>>>>>>>>> and this is true for most architectures.  My question is if does
>>>>>>>>> this
>>>>>>>>> really matter in currently GLIBC internal usage and on programs
>>>>>>>>> that
>>>>>>>>> might use it compared against the burden of keeping the various
>>>>>>>>> string*.h header in check for multiple architectures or adding this
>>>>>>>>> logic (mempcpy transformation to memcpy) on compiler.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> What burden? There is nothing to do in glibc for x86.  GCC can
>>>>>>>> inline mempcpy for x86.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> In fact I am objecting all the bits GLIBC added on string*.h that
>>>>>>> only
>>>>>>> adds complexity for some micro-optimizations. For x86 I do agree that
>>>>>>> transforming mempcpy to memcpy is no the best strategy.
>>>>>>>
>>>>>>> My rationale is to avoid add even more arch-specific bits in
>>>>>>> installed
>>>>>>> headers to add such optimizations.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I believe most of those micro-optimizations belong to GCC, not glibc.
>>>>>> Of course, we should keep the existing ones for older GCCs.  We
>>>>>> should avoid adding new ones.
>>>>>>
>>>>>
>>>>> Does this mean, the proposed way is to not add a macro for mempcpy in
>>>>> sysdeps/s390/bits/string.h or e.g. for another architecture?
>>>>>
>>>>> If _HAVE_STRING_ARCH_mempcpy is not defined for s390, then it will
>>>>> always
>>>>> use the macro defined in string/string.h, which inlines memcpy + n and
>>>>> the
>>>>> mempcpy-function is not called. The memcpy is transformed to mvc, mvhi
>>>>> for
>>>>> e.g. constant lengths. Otherwise, the memcpy-function is called and the
>>>>> length will be added after the call.
>>>>>
>>>>> According to PR70140
>>>>> (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70140), the macro in
>>>>> string/string.h will be removed after this bug is fixed in GCC?
>>>>> Then I think the decision, if mempcpy or memcpy is called or an inline
>>>>> version is emitted, will be done per architecture backend?
>>>>>
>>>>> If this is all true, then the mempcpy-function will be called in future
>>>>> if it makes sense!?
>>>>>
>>>>>
>>>>> If _HAVE_STRING_ARCH_mempcpy will be defined for s390, then
>>>>> mempcpy-function is always called - even for cases with constant length
>>>>> where mvc or mvhi is emitted.
>>>>> This is definitely not the intention of this patch.
>>>>> After fixing PR70140, GCC will correctly handle the constant length
>>>>> cases!?
>>>>
>>>>
>>>>
>>>> What I am proposing is to avoid add more arch-specific optimization on
>>>> installed
>>>> string*.h headers and instead work on adding them on compiler side.  My
>>>> view is
>>>> we should cleanup as much as possible the string headers and only add
>>>> optimization
>>>> that are more architecture neutral.
>>>>
>>>> Related to patch, my understanding is s390x does not really provide an
>>>> optimized
>>>> mempcpy (it uses the default mempcpy.c) and I think a better approach
>>>> would be
>>>> to add an optimized mempcpy like x88_64
>>>> (c365e615f7429aee302f8af7bf07ae262278febb).
>>>> The mempcpy won't be optimized directly to mvc instruction, but at least
>>>> it will
>>>> call the optimized memcpy.  The full optimization will be done by
>>>> correctly
>>>> handling the transformation on compiler size (the PR#70140 as you
>>>> referred).
>>>>
>>>>
>>>
>>> Okay. Here is the updated patch. It implements mempcpy with memcpy as
>>> before, but does not change the s390-specific string.h and
>>> _HAVE_STRING_ARCH_mempcpy is not defined. Thus at the moment mempcpy()
>>> won't
>>> be called, but instead memcpy + n is inlined by the mempcpy macro in
>>> string/string.h. After fxing PR70140, either the mempcpy macro in
>>> string/string.h has to be removed or _HAVE_STRING_ARCH_mempcpy has to be
>>> defined for S390.
>>>
>>> Okay to commit?
>>
>>
>> +__asm__ (".weak mempcpy\n\t"
>> + ".set mempcpy,__mempcpy\n\t");
>>
>> Don't we have a macro for this?
>>
> Yes that is true, but I can't use it in mempcpy.c in this case,
> because the macro s390_libc_ifunc (__mempcpy) uses an inline assembly
> to implement the __mempcpy ifunc symbol.
> If I use the weak_alias macro here in the c-file, gcc fails because it has
> no knowledge of symbol __mempcpy.
>

Can you make s390_libc_ifunc similar to libc_ifunc to expose
FUNC to gcc?

-- 
H.J.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-05-13 14:49                       ` H.J. Lu
@ 2016-05-13 14:59                         ` Stefan Liebler
  2016-05-13 15:06                           ` H.J. Lu
  0 siblings, 1 reply; 36+ messages in thread
From: Stefan Liebler @ 2016-05-13 14:59 UTC (permalink / raw)
  To: libc-alpha



On 05/13/2016 04:49 PM, H.J. Lu wrote:
> On Fri, May 13, 2016 at 7:42 AM, Stefan Liebler <stli@linux.vnet.ibm.com> wrote:
>>
>>
>> On 05/12/2016 04:10 PM, Adhemerval Zanella wrote:
>>>
>>>
>>>
>>> On 09/05/2016 11:15, Stefan Liebler wrote:
>>>>
>>>> On 05/05/2016 06:36 PM, H.J. Lu wrote:
>>>>>
>>>>> On Thu, May 5, 2016 at 9:34 AM, Adhemerval Zanella
>>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On May 5, 2016, at 11:45, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>>>>
>>>>>>> On Thu, May 5, 2016 at 7:15 AM, Adhemerval Zanella
>>>>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 05/05/2016 10:37, H.J. Lu wrote:
>>>>>>>>> On Wed, May 4, 2016 at 1:58 PM, Adhemerval Zanella
>>>>>>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On 04/05/2016 17:51, Wilco Dijkstra wrote:
>>>>>>>>>>> Adhemerval Zanella wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> But my point is all the architectures which provide an optimized
>>>>>>>>>>>> mempcpy is
>>>>>>>>>>>> though either 1. jump directly to optimized memcpy (s390 case for
>>>>>>>>>>>> this patchset),
>>>>>>>>>>>> 2. clonning the same memcpy implementation and adjusting the
>>>>>>>>>>>> pointers (x86_64) or
>>>>>>>>>>>> 3. using a similar strategy for both implementations (powerpc).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Indeed, which of those are used doesn't matter much.
>>>>>>>>>>>
>>>>>>>>>>>> So for this change I am proposing compiler support won't be
>>>>>>>>>>>> required because both
>>>>>>>>>>>> memcpy and __mempcpy will be transformed to memcpy + s.  Based on
>>>>>>>>>>>> assumption that
>>>>>>>>>>>> memcpy is fast as mempcpy implementation I think there is no need
>>>>>>>>>>>> to just add
>>>>>>>>>>>> this micro-optimization to only s390, but rather make is general.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> GLIBC already has this optimization in the generic string header,
>>>>>>>>>>> it's just that s390 wants
>>>>>>>>>>> to do something different again. As long as GCC isn't fixed this
>>>>>>>>>>> isn't possible to support
>>>>>>>>>>> s390 without this header workaround. And we need GCC to improve so
>>>>>>>>>>> things work
>>>>>>>>>>> better for all the other C libraries...
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> But the current one at string/string.h is only enabled with
>>>>>>>>>> !defined _HAVE_STRING_ARCH_mempcpy,
>>>>>>>>>> so if a port actually adds a mempcpy one it won't be enabled.  What
>>>>>>>>>> I am trying to argue it
>>>>>>>>>> to just remove the !defined _HAVE_STRING_ARCH_mempcpy and enable it
>>>>>>>>>> as default for all
>>>>>>>>>> ports.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Please don't enable it for x86.  Calling memcpy means we have to
>>>>>>>>> save and restore 2 registers for no good reasons.
>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, direct call will require save and restore the size for further
>>>>>>>> add
>>>>>>>> and this is true for most architectures.  My question is if does this
>>>>>>>> really matter in currently GLIBC internal usage and on programs that
>>>>>>>> might use it compared against the burden of keeping the various
>>>>>>>> string*.h header in check for multiple architectures or adding this
>>>>>>>> logic (mempcpy transformation to memcpy) on compiler.
>>>>>>>
>>>>>>>
>>>>>>> What burden? There is nothing to do in glibc for x86.  GCC can
>>>>>>> inline mempcpy for x86.
>>>>>>
>>>>>>
>>>>>> In fact I am objecting all the bits GLIBC added on string*.h that only
>>>>>> adds complexity for some micro-optimizations. For x86 I do agree that
>>>>>> transforming mempcpy to memcpy is no the best strategy.
>>>>>>
>>>>>> My rationale is to avoid add even more arch-specific bits in installed
>>>>>> headers to add such optimizations.
>>>>>
>>>>>
>>>>> I believe most of those micro-optimizations belong to GCC, not glibc.
>>>>> Of course, we should keep the existing ones for older GCCs.  We
>>>>> should avoid adding new ones.
>>>>>
>>>>
>>>> Does this mean, the proposed way is to not add a macro for mempcpy in
>>>> sysdeps/s390/bits/string.h or e.g. for another architecture?
>>>>
>>>> If _HAVE_STRING_ARCH_mempcpy is not defined for s390, then it will always
>>>> use the macro defined in string/string.h, which inlines memcpy + n and the
>>>> mempcpy-function is not called. The memcpy is transformed to mvc, mvhi for
>>>> e.g. constant lengths. Otherwise, the memcpy-function is called and the
>>>> length will be added after the call.
>>>>
>>>> According to PR70140
>>>> (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70140), the macro in
>>>> string/string.h will be removed after this bug is fixed in GCC?
>>>> Then I think the decision, if mempcpy or memcpy is called or an inline
>>>> version is emitted, will be done per architecture backend?
>>>>
>>>> If this is all true, then the mempcpy-function will be called in future
>>>> if it makes sense!?
>>>>
>>>>
>>>> If _HAVE_STRING_ARCH_mempcpy will be defined for s390, then
>>>> mempcpy-function is always called - even for cases with constant length
>>>> where mvc or mvhi is emitted.
>>>> This is definitely not the intention of this patch.
>>>> After fixing PR70140, GCC will correctly handle the constant length
>>>> cases!?
>>>
>>>
>>> What I am proposing is to avoid add more arch-specific optimization on
>>> installed
>>> string*.h headers and instead work on adding them on compiler side.  My
>>> view is
>>> we should cleanup as much as possible the string headers and only add
>>> optimization
>>> that are more architecture neutral.
>>>
>>> Related to patch, my understanding is s390x does not really provide an
>>> optimized
>>> mempcpy (it uses the default mempcpy.c) and I think a better approach
>>> would be
>>> to add an optimized mempcpy like x88_64
>>> (c365e615f7429aee302f8af7bf07ae262278febb).
>>> The mempcpy won't be optimized directly to mvc instruction, but at least
>>> it will
>>> call the optimized memcpy.  The full optimization will be done by
>>> correctly
>>> handling the transformation on compiler size (the PR#70140 as you
>>> referred).
>>>
>>>
>>
>> Okay. Here is the updated patch. It implements mempcpy with memcpy as
>> before, but does not change the s390-specific string.h and
>> _HAVE_STRING_ARCH_mempcpy is not defined. Thus at the moment mempcpy() won't
>> be called, but instead memcpy + n is inlined by the mempcpy macro in
>> string/string.h. After fxing PR70140, either the mempcpy macro in
>> string/string.h has to be removed or _HAVE_STRING_ARCH_mempcpy has to be
>> defined for S390.
>>
>> Okay to commit?
>
> +__asm__ (".weak mempcpy\n\t"
> + ".set mempcpy,__mempcpy\n\t");
>
> Don't we have a macro for this?
>
Yes that is true, but I can't use it in mempcpy.c in this case,
because the macro s390_libc_ifunc (__mempcpy) uses an inline assembly
to implement the __mempcpy ifunc symbol.
If I use the weak_alias macro here in the c-file, gcc fails because it 
has no knowledge of symbol __mempcpy.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-05-13 14:43                     ` Stefan Liebler
@ 2016-05-13 14:49                       ` H.J. Lu
  2016-05-13 14:59                         ` Stefan Liebler
  0 siblings, 1 reply; 36+ messages in thread
From: H.J. Lu @ 2016-05-13 14:49 UTC (permalink / raw)
  To: Stefan Liebler; +Cc: GNU C Library

On Fri, May 13, 2016 at 7:42 AM, Stefan Liebler <stli@linux.vnet.ibm.com> wrote:
>
>
> On 05/12/2016 04:10 PM, Adhemerval Zanella wrote:
>>
>>
>>
>> On 09/05/2016 11:15, Stefan Liebler wrote:
>>>
>>> On 05/05/2016 06:36 PM, H.J. Lu wrote:
>>>>
>>>> On Thu, May 5, 2016 at 9:34 AM, Adhemerval Zanella
>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> On May 5, 2016, at 11:45, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>>>
>>>>>> On Thu, May 5, 2016 at 7:15 AM, Adhemerval Zanella
>>>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On 05/05/2016 10:37, H.J. Lu wrote:
>>>>>>>> On Wed, May 4, 2016 at 1:58 PM, Adhemerval Zanella
>>>>>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On 04/05/2016 17:51, Wilco Dijkstra wrote:
>>>>>>>>>> Adhemerval Zanella wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> But my point is all the architectures which provide an optimized
>>>>>>>>>>> mempcpy is
>>>>>>>>>>> though either 1. jump directly to optimized memcpy (s390 case for
>>>>>>>>>>> this patchset),
>>>>>>>>>>> 2. clonning the same memcpy implementation and adjusting the
>>>>>>>>>>> pointers (x86_64) or
>>>>>>>>>>> 3. using a similar strategy for both implementations (powerpc).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Indeed, which of those are used doesn't matter much.
>>>>>>>>>>
>>>>>>>>>>> So for this change I am proposing compiler support won't be
>>>>>>>>>>> required because both
>>>>>>>>>>> memcpy and __mempcpy will be transformed to memcpy + s.  Based on
>>>>>>>>>>> assumption that
>>>>>>>>>>> memcpy is fast as mempcpy implementation I think there is no need
>>>>>>>>>>> to just add
>>>>>>>>>>> this micro-optimization to only s390, but rather make is general.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> GLIBC already has this optimization in the generic string header,
>>>>>>>>>> it's just that s390 wants
>>>>>>>>>> to do something different again. As long as GCC isn't fixed this
>>>>>>>>>> isn't possible to support
>>>>>>>>>> s390 without this header workaround. And we need GCC to improve so
>>>>>>>>>> things work
>>>>>>>>>> better for all the other C libraries...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> But the current one at string/string.h is only enabled with
>>>>>>>>> !defined _HAVE_STRING_ARCH_mempcpy,
>>>>>>>>> so if a port actually adds a mempcpy one it won't be enabled.  What
>>>>>>>>> I am trying to argue it
>>>>>>>>> to just remove the !defined _HAVE_STRING_ARCH_mempcpy and enable it
>>>>>>>>> as default for all
>>>>>>>>> ports.
>>>>>>>>
>>>>>>>>
>>>>>>>> Please don't enable it for x86.  Calling memcpy means we have to
>>>>>>>> save and restore 2 registers for no good reasons.
>>>>>>>
>>>>>>>
>>>>>>> Yes, direct call will require save and restore the size for further
>>>>>>> add
>>>>>>> and this is true for most architectures.  My question is if does this
>>>>>>> really matter in currently GLIBC internal usage and on programs that
>>>>>>> might use it compared against the burden of keeping the various
>>>>>>> string*.h header in check for multiple architectures or adding this
>>>>>>> logic (mempcpy transformation to memcpy) on compiler.
>>>>>>
>>>>>>
>>>>>> What burden? There is nothing to do in glibc for x86.  GCC can
>>>>>> inline mempcpy for x86.
>>>>>
>>>>>
>>>>> In fact I am objecting all the bits GLIBC added on string*.h that only
>>>>> adds complexity for some micro-optimizations. For x86 I do agree that
>>>>> transforming mempcpy to memcpy is no the best strategy.
>>>>>
>>>>> My rationale is to avoid add even more arch-specific bits in installed
>>>>> headers to add such optimizations.
>>>>
>>>>
>>>> I believe most of those micro-optimizations belong to GCC, not glibc.
>>>> Of course, we should keep the existing ones for older GCCs.  We
>>>> should avoid adding new ones.
>>>>
>>>
>>> Does this mean, the proposed way is to not add a macro for mempcpy in
>>> sysdeps/s390/bits/string.h or e.g. for another architecture?
>>>
>>> If _HAVE_STRING_ARCH_mempcpy is not defined for s390, then it will always
>>> use the macro defined in string/string.h, which inlines memcpy + n and the
>>> mempcpy-function is not called. The memcpy is transformed to mvc, mvhi for
>>> e.g. constant lengths. Otherwise, the memcpy-function is called and the
>>> length will be added after the call.
>>>
>>> According to PR70140
>>> (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70140), the macro in
>>> string/string.h will be removed after this bug is fixed in GCC?
>>> Then I think the decision, if mempcpy or memcpy is called or an inline
>>> version is emitted, will be done per architecture backend?
>>>
>>> If this is all true, then the mempcpy-function will be called in future
>>> if it makes sense!?
>>>
>>>
>>> If _HAVE_STRING_ARCH_mempcpy will be defined for s390, then
>>> mempcpy-function is always called - even for cases with constant length
>>> where mvc or mvhi is emitted.
>>> This is definitely not the intention of this patch.
>>> After fixing PR70140, GCC will correctly handle the constant length
>>> cases!?
>>
>>
>> What I am proposing is to avoid add more arch-specific optimization on
>> installed
>> string*.h headers and instead work on adding them on compiler side.  My
>> view is
>> we should cleanup as much as possible the string headers and only add
>> optimization
>> that are more architecture neutral.
>>
>> Related to patch, my understanding is s390x does not really provide an
>> optimized
>> mempcpy (it uses the default mempcpy.c) and I think a better approach
>> would be
>> to add an optimized mempcpy like x88_64
>> (c365e615f7429aee302f8af7bf07ae262278febb).
>> The mempcpy won't be optimized directly to mvc instruction, but at least
>> it will
>> call the optimized memcpy.  The full optimization will be done by
>> correctly
>> handling the transformation on compiler size (the PR#70140 as you
>> referred).
>>
>>
>
> Okay. Here is the updated patch. It implements mempcpy with memcpy as
> before, but does not change the s390-specific string.h and
> _HAVE_STRING_ARCH_mempcpy is not defined. Thus at the moment mempcpy() won't
> be called, but instead memcpy + n is inlined by the mempcpy macro in
> string/string.h. After fxing PR70140, either the mempcpy macro in
> string/string.h has to be removed or _HAVE_STRING_ARCH_mempcpy has to be
> defined for S390.
>
> Okay to commit?

+__asm__ (".weak mempcpy\n\t"
+ ".set mempcpy,__mempcpy\n\t");

Don't we have a macro for this?

-- 
H.J.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-05-12 14:11                   ` Adhemerval Zanella
@ 2016-05-13 14:43                     ` Stefan Liebler
  2016-05-13 14:49                       ` H.J. Lu
  0 siblings, 1 reply; 36+ messages in thread
From: Stefan Liebler @ 2016-05-13 14:43 UTC (permalink / raw)
  To: libc-alpha

[-- Attachment #1: Type: text/plain, Size: 5781 bytes --]



On 05/12/2016 04:10 PM, Adhemerval Zanella wrote:
>
>
> On 09/05/2016 11:15, Stefan Liebler wrote:
>> On 05/05/2016 06:36 PM, H.J. Lu wrote:
>>> On Thu, May 5, 2016 at 9:34 AM, Adhemerval Zanella
>>> <adhemerval.zanella@linaro.org> wrote:
>>>>
>>>>
>>>>> On May 5, 2016, at 11:45, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>>
>>>>> On Thu, May 5, 2016 at 7:15 AM, Adhemerval Zanella
>>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>>
>>>>>>
>>>>>>> On 05/05/2016 10:37, H.J. Lu wrote:
>>>>>>> On Wed, May 4, 2016 at 1:58 PM, Adhemerval Zanella
>>>>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 04/05/2016 17:51, Wilco Dijkstra wrote:
>>>>>>>>> Adhemerval Zanella wrote:
>>>>>>>>>>
>>>>>>>>>> But my point is all the architectures which provide an optimized mempcpy is
>>>>>>>>>> though either 1. jump directly to optimized memcpy (s390 case for this patchset),
>>>>>>>>>> 2. clonning the same memcpy implementation and adjusting the pointers (x86_64) or
>>>>>>>>>> 3. using a similar strategy for both implementations (powerpc).
>>>>>>>>>
>>>>>>>>> Indeed, which of those are used doesn't matter much.
>>>>>>>>>
>>>>>>>>>> So for this change I am proposing compiler support won't be required because both
>>>>>>>>>> memcpy and __mempcpy will be transformed to memcpy + s.  Based on assumption that
>>>>>>>>>> memcpy is fast as mempcpy implementation I think there is no need to just add
>>>>>>>>>> this micro-optimization to only s390, but rather make is general.
>>>>>>>>>
>>>>>>>>> GLIBC already has this optimization in the generic string header, it's just that s390 wants
>>>>>>>>> to do something different again. As long as GCC isn't fixed this isn't possible to support
>>>>>>>>> s390 without this header workaround. And we need GCC to improve so things work
>>>>>>>>> better for all the other C libraries...
>>>>>>>>
>>>>>>>> But the current one at string/string.h is only enabled with !defined _HAVE_STRING_ARCH_mempcpy,
>>>>>>>> so if a port actually adds a mempcpy one it won't be enabled.  What I am trying to argue it
>>>>>>>> to just remove the !defined _HAVE_STRING_ARCH_mempcpy and enable it as default for all
>>>>>>>> ports.
>>>>>>>
>>>>>>> Please don't enable it for x86.  Calling memcpy means we have to
>>>>>>> save and restore 2 registers for no good reasons.
>>>>>>
>>>>>> Yes, direct call will require save and restore the size for further add
>>>>>> and this is true for most architectures.  My question is if does this
>>>>>> really matter in currently GLIBC internal usage and on programs that
>>>>>> might use it compared against the burden of keeping the various
>>>>>> string*.h header in check for multiple architectures or adding this
>>>>>> logic (mempcpy transformation to memcpy) on compiler.
>>>>>
>>>>> What burden? There is nothing to do in glibc for x86.  GCC can
>>>>> inline mempcpy for x86.
>>>>
>>>> In fact I am objecting all the bits GLIBC added on string*.h that only adds complexity for some micro-optimizations. For x86 I do agree that transforming mempcpy to memcpy is no the best strategy.
>>>>
>>>> My rationale is to avoid add even more arch-specific bits in installed headers to add such optimizations.
>>>
>>> I believe most of those micro-optimizations belong to GCC, not glibc.
>>> Of course, we should keep the existing ones for older GCCs.  We
>>> should avoid adding new ones.
>>>
>>
>> Does this mean, the proposed way is to not add a macro for mempcpy in sysdeps/s390/bits/string.h or e.g. for another architecture?
>>
>> If _HAVE_STRING_ARCH_mempcpy is not defined for s390, then it will always use the macro defined in string/string.h, which inlines memcpy + n and the mempcpy-function is not called. The memcpy is transformed to mvc, mvhi for e.g. constant lengths. Otherwise, the memcpy-function is called and the length will be added after the call.
>>
>> According to PR70140 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70140), the macro in string/string.h will be removed after this bug is fixed in GCC?
>> Then I think the decision, if mempcpy or memcpy is called or an inline version is emitted, will be done per architecture backend?
>>
>> If this is all true, then the mempcpy-function will be called in future if it makes sense!?
>>
>>
>> If _HAVE_STRING_ARCH_mempcpy will be defined for s390, then mempcpy-function is always called - even for cases with constant length where mvc or mvhi is emitted.
>> This is definitely not the intention of this patch.
>> After fixing PR70140, GCC will correctly handle the constant length cases!?
>
> What I am proposing is to avoid add more arch-specific optimization on installed
> string*.h headers and instead work on adding them on compiler side.  My view is
> we should cleanup as much as possible the string headers and only add optimization
> that are more architecture neutral.
>
> Related to patch, my understanding is s390x does not really provide an optimized
> mempcpy (it uses the default mempcpy.c) and I think a better approach would be
> to add an optimized mempcpy like x88_64 (c365e615f7429aee302f8af7bf07ae262278febb).
> The mempcpy won't be optimized directly to mvc instruction, but at least it will
> call the optimized memcpy.  The full optimization will be done by correctly
> handling the transformation on compiler size (the PR#70140 as you referred).
>
>

Okay. Here is the updated patch. It implements mempcpy with memcpy as 
before, but does not change the s390-specific string.h and 
_HAVE_STRING_ARCH_mempcpy is not defined. Thus at the moment mempcpy() 
won't be called, but instead memcpy + n is inlined by the mempcpy macro 
in string/string.h. After fxing PR70140, either the mempcpy macro in 
string/string.h has to be removed or _HAVE_STRING_ARCH_mempcpy has to be 
defined for S390.

Okay to commit?

[-- Attachment #2: 20160513_Implement_mempcpy_with_memcpy.patch --]
[-- Type: text/x-patch, Size: 14555 bytes --]

commit 4017e270314135561e309170ddb547857f9cc16b
Author: Stefan Liebler <stli@linux.vnet.ibm.com>
Date:   Tue Apr 26 13:11:23 2016 +0200

    S390: Implement mempcpy with help of memcpy. [BZ #19765]
    
    There exist optimized memcpy functions on s390, but no optimized mempcpy.
    This patch adds mempcpy entry points in memcpy.S files, which
    use the memcpy implementation. Now mempcpy itself is also an IFUNC function
    as memcpy is and the variants are listed in ifunc-impl-list.c.
    
    The s390 string.h does not define _HAVE_STRING_ARCH_mempcpy.
    Instead mempcpy string/string.h inlines memcpy() + n.
    If n is constant and small enough, GCC emits instructions like mvi or mvc
    and avoids the function call to memcpy.
    If n is not constant, then memcpy is called and n is added afterwards.
    If _HAVE_STRING_ARCH_mempcpy would be defined, mempcpy would be called in
    every case.
    
    According to PR70140 "Inefficient expansion of __builtin_mempcpy"
    (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70140) GCC should handle a
    call to mempcpy in the same way as memcpy. Then either the mempcpy macro
    in string/string.h has to be removed or _HAVE_STRING_ARCH_mempcpy has to
    be defined for S390.
    
    ChangeLog:
    
    	[BZ #19765]
    	* sysdeps/s390/mempcpy.S: New File.
    	* sysdeps/s390/multiarch/mempcpy.c: Likewise.
    	* sysdeps/s390/multiarch/Makefile (sysdep_routines): Add mempcpy.
    	* sysdeps/s390/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
    	Add mempcpy variants.
    	* sysdeps/s390/s390-32/memcpy.S: Add mempcpy entry point.
    	(memcpy): Adjust to be usable from mempcpy entry point.
    	(__memcpy_mvcle): Likewise.
    	* sysdeps/s390/s390-64/memcpy.S: Likewise.
    	* sysdeps/s390/s390-32/multiarch/memcpy-s390.S: Add entry points
    	____mempcpy_z196, ____mempcpy_z10 and add __GI_ symbols for mempcpy.
    	(__memcpy_z196): Adjust to be usable from mempcpy entry point.
    	(__memcpy_z10): Likewise.
    	* sysdeps/s390/s390-64/multiarch/memcpy-s390x.S: Likewise.

diff --git a/sysdeps/s390/mempcpy.S b/sysdeps/s390/mempcpy.S
new file mode 100644
index 0000000..c62074b
--- /dev/null
+++ b/sysdeps/s390/mempcpy.S
@@ -0,0 +1,19 @@
+/* CPU specific mempcpy without multiarch - 32/64 bit S/390 version.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* mempcpy is implemented in memcpy.S.  */
diff --git a/sysdeps/s390/multiarch/Makefile b/sysdeps/s390/multiarch/Makefile
index 0805b07..89324ca 100644
--- a/sysdeps/s390/multiarch/Makefile
+++ b/sysdeps/s390/multiarch/Makefile
@@ -18,7 +18,8 @@ sysdep_routines += strlen strlen-vx strlen-c \
 		   memchr memchr-vx \
 		   rawmemchr rawmemchr-vx rawmemchr-c \
 		   memccpy memccpy-vx memccpy-c \
-		   memrchr memrchr-vx memrchr-c
+		   memrchr memrchr-vx memrchr-c \
+		   mempcpy
 endif
 
 ifeq ($(subdir),wcsmbs)
diff --git a/sysdeps/s390/multiarch/ifunc-impl-list.c b/sysdeps/s390/multiarch/ifunc-impl-list.c
index 62a4359..89898ec 100644
--- a/sysdeps/s390/multiarch/ifunc-impl-list.c
+++ b/sysdeps/s390/multiarch/ifunc-impl-list.c
@@ -69,6 +69,13 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 			      S390_IS_Z10 (stfle_bits), __memcpy_z10)
 	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_default))
 
+  IFUNC_IMPL (i, name, mempcpy,
+	      IFUNC_IMPL_ADD (array, i, mempcpy,
+			      S390_IS_Z196 (stfle_bits), ____mempcpy_z196)
+	      IFUNC_IMPL_ADD (array, i, mempcpy,
+			      S390_IS_Z10 (stfle_bits), ____mempcpy_z10)
+	      IFUNC_IMPL_ADD (array, i, mempcpy, 1, ____mempcpy_default))
+
 #endif /* SHARED */
 
 #ifdef HAVE_S390_VX_ASM_SUPPORT
diff --git a/sysdeps/s390/multiarch/mempcpy.c b/sysdeps/s390/multiarch/mempcpy.c
new file mode 100644
index 0000000..34d8329
--- /dev/null
+++ b/sysdeps/s390/multiarch/mempcpy.c
@@ -0,0 +1,26 @@
+/* Multiple versions of mempcpy.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#if defined SHARED && IS_IN (libc)
+# include <ifunc-resolve.h>
+s390_libc_ifunc (__mempcpy)
+
+__asm__ (".weak mempcpy\n\t"
+	 ".set mempcpy,__mempcpy\n\t");
+#endif
diff --git a/sysdeps/s390/s390-32/memcpy.S b/sysdeps/s390/s390-32/memcpy.S
index 2ac51ab..6be5104 100644
--- a/sysdeps/s390/s390-32/memcpy.S
+++ b/sysdeps/s390/s390-32/memcpy.S
@@ -25,12 +25,23 @@
      %r3 = address of source memory area
      %r4 = number of bytes to copy.  */
 
-#ifdef USE_MULTIARCH
-ENTRY(__memcpy_default)
-#else
-ENTRY(memcpy)
+       .text
+ENTRY(__mempcpy)
+	.machine "g5"
+	lr      %r1,%r2             # Use as dest
+	la      %r2,0(%r4,%r2)      # Return dest + n
+	j	.L_G5_start
+END(__mempcpy)
+#ifndef USE_MULTIARCH
+libc_hidden_def (__mempcpy)
+weak_alias (__mempcpy, mempcpy)
+libc_hidden_builtin_def (mempcpy)
 #endif
+
+ENTRY(memcpy)
 	.machine "g5"
+	lr      %r1,%r2             # r1: Use as dest ; r2: Return dest
+.L_G5_start:
 	st      %r13,52(%r15)
 	.cfi_offset 13, -44
 	basr    %r13,0
@@ -41,14 +52,13 @@ ENTRY(memcpy)
 	lr      %r5,%r4
 	srl     %r5,8
 	ltr     %r5,%r5
-	lr      %r1,%r2
 	jne     .L_G5_13
 	ex      %r4,.L_G5_17-.L_G5_16(%r13)
 .L_G5_4:
 	l       %r13,52(%r15)
 	br      %r14
 .L_G5_13:
-	chi	%r5,4096             # Switch to mvcle for copies >1MB
+	chi	%r5,4096            # Switch to mvcle for copies >1MB
 	jh	__memcpy_mvcle
 .L_G5_12:
 	mvc     0(256,%r1),0(%r3)
@@ -59,24 +69,24 @@ ENTRY(memcpy)
 	j       .L_G5_4
 .L_G5_17:
 	mvc     0(1,%r1),0(%r3)
-#ifdef USE_MULTIARCH
-END(__memcpy_default)
-#else
 END(memcpy)
+#ifndef USE_MULTIARCH
 libc_hidden_builtin_def (memcpy)
 #endif
 
 ENTRY(__memcpy_mvcle)
-       # Using as standalone function will result in unexpected
-       # results since the length field is incremented by 1 in order to
-       # compensate the changes already done in the functions above.
-       ahi     %r4,1               # length + 1
-       lr      %r5,%r4             # source length
-       lr      %r4,%r3             # source address
-       lr      %r3,%r5             # destination length = source length
+	# Using as standalone function will result in unexpected
+	# results since the length field is incremented by 1 in order to
+	# compensate the changes already done in the functions above.
+	lr      %r0,%r2             # backup return dest [ + n ]
+	ahi     %r4,1               # length + 1
+	lr      %r5,%r4             # source length
+	lr      %r4,%r3             # source address
+	lr      %r2,%r1             # destination address
+	lr      %r3,%r5             # destination length = source length
 .L_MVCLE_1:
-       mvcle   %r2,%r4,0           # thats it, MVCLE is your friend
-       jo      .L_MVCLE_1
-       lr      %r2,%r1             # return destination address
-       br      %r14
+	mvcle   %r2,%r4,0           # thats it, MVCLE is your friend
+	jo      .L_MVCLE_1
+	lr      %r2,%r0             # return destination address
+	br      %r14
 END(__memcpy_mvcle)
diff --git a/sysdeps/s390/s390-32/multiarch/memcpy-s390.S b/sysdeps/s390/s390-32/multiarch/memcpy-s390.S
index 92ffaea..297a894 100644
--- a/sysdeps/s390/s390-32/multiarch/memcpy-s390.S
+++ b/sysdeps/s390/s390-32/multiarch/memcpy-s390.S
@@ -29,14 +29,23 @@
 
 #if defined SHARED && IS_IN (libc)
 
+ENTRY(____mempcpy_z196)
+	.machine "z196"
+	.machinemode "zarch_nohighgprs"
+	lr      %r1,%r2         # Use as dest
+	la      %r2,0(%r4,%r2)  # Return dest + n
+	j	.L_Z196_start
+END(____mempcpy_z196)
+
 ENTRY(__memcpy_z196)
 	.machine "z196"
 	.machinemode "zarch_nohighgprs"
+	lr      %r1,%r2         # r1: Use as dest ; r2: Return dest
+.L_Z196_start:
 	llgfr   %r4,%r4
 	ltgr    %r4,%r4
 	je      .L_Z196_4
 	aghi    %r4,-1
-	lr      %r1,%r2
 	srlg    %r5,%r4,8
 	ltgr    %r5,%r5
 	jne     .L_Z196_5
@@ -60,13 +69,22 @@ ENTRY(__memcpy_z196)
 	mvc     0(1,%r1),0(%r3)
 END(__memcpy_z196)
 
+ENTRY(____mempcpy_z10)
+	.machine "z10"
+	.machinemode "zarch_nohighgprs"
+	lr      %r1,%r2         # Use as dest
+	la      %r2,0(%r4,%r2)  # Return dest + n
+	j	.L_Z10_start
+END(____mempcpy_z10)
+
 ENTRY(__memcpy_z10)
 	.machine "z10"
 	.machinemode "zarch_nohighgprs"
+	lr      %r1,%r2         # r1: Use as dest ; r2: Return dest
+.L_Z10_start:
 	llgfr   %r4,%r4
 	cgije   %r4,0,.L_Z10_4
 	aghi    %r4,-1
-	lr      %r1,%r2
 	srlg    %r5,%r4,8
 	cgijlh  %r5,0,.L_Z10_13
 .L_Z10_3:
@@ -88,14 +106,23 @@ ENTRY(__memcpy_z10)
 	mvc     0(1,%r1),0(%r3)
 END(__memcpy_z10)
 
+# define __mempcpy ____mempcpy_default
 #endif /* SHARED && IS_IN (libc) */
 
+#define memcpy __memcpy_default
 #include "../memcpy.S"
+#undef memcpy
 
 #if defined SHARED && IS_IN (libc)
 .globl   __GI_memcpy
 .set     __GI_memcpy,__memcpy_default
+.globl   __GI_mempcpy
+.set     __GI_mempcpy,____mempcpy_default
+.globl   __GI___mempcpy
+.set     __GI___mempcpy,____mempcpy_default
 #else
 .globl   memcpy
 .set     memcpy,__memcpy_default
+.weak    mempcpy
+.set     mempcpy,__mempcpy
 #endif
diff --git a/sysdeps/s390/s390-64/memcpy.S b/sysdeps/s390/s390-64/memcpy.S
index 9d60a14..e3ace0c 100644
--- a/sysdeps/s390/s390-64/memcpy.S
+++ b/sysdeps/s390/s390-64/memcpy.S
@@ -27,19 +27,27 @@
 
 
        .text
+ENTRY(__mempcpy)
+	.machine "z900"
+	lgr     %r1,%r2             # Use as dest
+	la      %r2,0(%r4,%r2)      # Return dest + n
+	j	.L_Z900_start
+END(__mempcpy)
+#ifndef USE_MULTIARCH
+libc_hidden_def (__mempcpy)
+weak_alias (__mempcpy, mempcpy)
+libc_hidden_builtin_def (mempcpy)
+#endif
 
-#ifdef USE_MULTIARCH
-ENTRY(__memcpy_default)
-#else
 ENTRY(memcpy)
-#endif
 	.machine "z900"
+	lgr     %r1,%r2             # r1: Use as dest ; r2: Return dest
+.L_Z900_start:
 	ltgr    %r4,%r4
 	je      .L_Z900_4
 	aghi    %r4,-1
 	srlg    %r5,%r4,8
 	ltgr    %r5,%r5
-	lgr     %r1,%r2
 	jne     .L_Z900_13
 .L_Z900_3:
 	larl    %r5,.L_Z900_15
@@ -57,25 +65,24 @@ ENTRY(memcpy)
 	j       .L_Z900_3
 .L_Z900_15:
 	mvc     0(1,%r1),0(%r3)
-
-#ifdef USE_MULTIARCH
-END(__memcpy_default)
-#else
 END(memcpy)
+#ifndef USE_MULTIARCH
 libc_hidden_builtin_def (memcpy)
 #endif
 
 ENTRY(__memcpy_mvcle)
-       # Using as standalone function will result in unexpected
-       # results since the length field is incremented by 1 in order to
-       # compensate the changes already done in the functions above.
-       aghi    %r4,1               # length + 1
-       lgr     %r5,%r4             # source length
-       lgr     %r4,%r3             # source address
-       lgr     %r3,%r5             # destination length = source length
+	# Using as standalone function will result in unexpected
+	# results since the length field is incremented by 1 in order to
+	# compensate the changes already done in the functions above.
+	lgr     %r0,%r2             # backup return dest [ + n ]
+	aghi    %r4,1               # length + 1
+	lgr     %r5,%r4             # source length
+	lgr     %r4,%r3             # source address
+	lgr     %r2,%r1             # destination address
+	lgr     %r3,%r5             # destination length = source length
 .L_MVCLE_1:
-       mvcle   %r2,%r4,0           # thats it, MVCLE is your friend
-       jo      .L_MVCLE_1
-       lgr     %r2,%r1             # return destination address
-       br      %r14
+	mvcle   %r2,%r4,0           # thats it, MVCLE is your friend
+	jo      .L_MVCLE_1
+	lgr     %r2,%r0             # return destination address
+	br      %r14
 END(__memcpy_mvcle)
diff --git a/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S b/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S
index 8f54526..0f5a36e 100644
--- a/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S
+++ b/sysdeps/s390/s390-64/multiarch/memcpy-s390x.S
@@ -29,12 +29,20 @@
 
 #if defined SHARED && IS_IN (libc)
 
+ENTRY(____mempcpy_z196)
+	.machine "z196"
+	lgr     %r1,%r2         # Use as dest
+	la      %r2,0(%r4,%r2)  # Return dest + n
+	j	.L_Z196_start
+END(____mempcpy_z196)
+
 ENTRY(__memcpy_z196)
 	.machine "z196"
+	lgr     %r1,%r2         # r1: Use as dest ; r2: Return dest
+.L_Z196_start:
 	ltgr    %r4,%r4
 	je      .L_Z196_4
 	aghi    %r4,-1
-	lgr     %r1,%r2
 	srlg    %r5,%r4,8
 	ltgr    %r5,%r5
 	jne     .L_Z196_5
@@ -58,11 +66,19 @@ ENTRY(__memcpy_z196)
 	mvc     0(1,%r1),0(%r3)
 END(__memcpy_z196)
 
+ENTRY(____mempcpy_z10)
+	.machine "z10"
+	lgr     %r1,%r2         # Use as dest
+	la      %r2,0(%r4,%r2)  # Return dest + n
+	j	.L_Z10_start
+END(____mempcpy_z10)
+
 ENTRY(__memcpy_z10)
 	.machine "z10"
+	lgr     %r1,%r2         # r1: Use as dest ; r2: Return dest
+.L_Z10_start:
 	cgije   %r4,0,.L_Z10_4
 	aghi    %r4,-1
-	lgr     %r1,%r2
 	srlg    %r5,%r4,8
 	cgijlh  %r5,0,.L_Z10_13
 .L_Z10_3:
@@ -84,14 +100,23 @@ ENTRY(__memcpy_z10)
 	mvc     0(1,%r1),0(%r3)
 END(__memcpy_z10)
 
+# define __mempcpy ____mempcpy_default
 #endif /* SHARED && IS_IN (libc) */
 
+#define memcpy __memcpy_default
 #include "../memcpy.S"
+#undef memcpy
 
 #if defined SHARED && IS_IN (libc)
 .globl   __GI_memcpy
 .set     __GI_memcpy,__memcpy_default
+.globl   __GI_mempcpy
+.set     __GI_mempcpy,____mempcpy_default
+.globl   __GI___mempcpy
+.set     __GI___mempcpy,____mempcpy_default
 #else
 .globl   memcpy
 .set     memcpy,__memcpy_default
+.weak    mempcpy
+.set     mempcpy,__mempcpy
 #endif

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-05-09 14:39                 ` Stefan Liebler
@ 2016-05-12 14:11                   ` Adhemerval Zanella
  2016-05-13 14:43                     ` Stefan Liebler
  0 siblings, 1 reply; 36+ messages in thread
From: Adhemerval Zanella @ 2016-05-12 14:11 UTC (permalink / raw)
  To: libc-alpha



On 09/05/2016 11:15, Stefan Liebler wrote:
> On 05/05/2016 06:36 PM, H.J. Lu wrote:
>> On Thu, May 5, 2016 at 9:34 AM, Adhemerval Zanella
>> <adhemerval.zanella@linaro.org> wrote:
>>>
>>>
>>>> On May 5, 2016, at 11:45, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>
>>>> On Thu, May 5, 2016 at 7:15 AM, Adhemerval Zanella
>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>
>>>>>
>>>>>> On 05/05/2016 10:37, H.J. Lu wrote:
>>>>>> On Wed, May 4, 2016 at 1:58 PM, Adhemerval Zanella
>>>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On 04/05/2016 17:51, Wilco Dijkstra wrote:
>>>>>>>> Adhemerval Zanella wrote:
>>>>>>>>>
>>>>>>>>> But my point is all the architectures which provide an optimized mempcpy is
>>>>>>>>> though either 1. jump directly to optimized memcpy (s390 case for this patchset),
>>>>>>>>> 2. clonning the same memcpy implementation and adjusting the pointers (x86_64) or
>>>>>>>>> 3. using a similar strategy for both implementations (powerpc).
>>>>>>>>
>>>>>>>> Indeed, which of those are used doesn't matter much.
>>>>>>>>
>>>>>>>>> So for this change I am proposing compiler support won't be required because both
>>>>>>>>> memcpy and __mempcpy will be transformed to memcpy + s.  Based on assumption that
>>>>>>>>> memcpy is fast as mempcpy implementation I think there is no need to just add
>>>>>>>>> this micro-optimization to only s390, but rather make is general.
>>>>>>>>
>>>>>>>> GLIBC already has this optimization in the generic string header, it's just that s390 wants
>>>>>>>> to do something different again. As long as GCC isn't fixed this isn't possible to support
>>>>>>>> s390 without this header workaround. And we need GCC to improve so things work
>>>>>>>> better for all the other C libraries...
>>>>>>>
>>>>>>> But the current one at string/string.h is only enabled with !defined _HAVE_STRING_ARCH_mempcpy,
>>>>>>> so if a port actually adds a mempcpy one it won't be enabled.  What I am trying to argue it
>>>>>>> to just remove the !defined _HAVE_STRING_ARCH_mempcpy and enable it as default for all
>>>>>>> ports.
>>>>>>
>>>>>> Please don't enable it for x86.  Calling memcpy means we have to
>>>>>> save and restore 2 registers for no good reasons.
>>>>>
>>>>> Yes, direct call will require save and restore the size for further add
>>>>> and this is true for most architectures.  My question is if does this
>>>>> really matter in currently GLIBC internal usage and on programs that
>>>>> might use it compared against the burden of keeping the various
>>>>> string*.h header in check for multiple architectures or adding this
>>>>> logic (mempcpy transformation to memcpy) on compiler.
>>>>
>>>> What burden? There is nothing to do in glibc for x86.  GCC can
>>>> inline mempcpy for x86.
>>>
>>> In fact I am objecting all the bits GLIBC added on string*.h that only adds complexity for some micro-optimizations. For x86 I do agree that transforming mempcpy to memcpy is no the best strategy.
>>>
>>> My rationale is to avoid add even more arch-specific bits in installed headers to add such optimizations.
>>
>> I believe most of those micro-optimizations belong to GCC, not glibc.
>> Of course, we should keep the existing ones for older GCCs.  We
>> should avoid adding new ones.
>>
> 
> Does this mean, the proposed way is to not add a macro for mempcpy in sysdeps/s390/bits/string.h or e.g. for another architecture?
> 
> If _HAVE_STRING_ARCH_mempcpy is not defined for s390, then it will always use the macro defined in string/string.h, which inlines memcpy + n and the mempcpy-function is not called. The memcpy is transformed to mvc, mvhi for e.g. constant lengths. Otherwise, the memcpy-function is called and the length will be added after the call.
> 
> According to PR70140 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70140), the macro in string/string.h will be removed after this bug is fixed in GCC?
> Then I think the decision, if mempcpy or memcpy is called or an inline version is emitted, will be done per architecture backend?
> 
> If this is all true, then the mempcpy-function will be called in future if it makes sense!?
> 
> 
> If _HAVE_STRING_ARCH_mempcpy will be defined for s390, then mempcpy-function is always called - even for cases with constant length where mvc or mvhi is emitted.
> This is definitely not the intention of this patch.
> After fixing PR70140, GCC will correctly handle the constant length cases!?

What I am proposing is to avoid add more arch-specific optimization on installed
string*.h headers and instead work on adding them on compiler side.  My view is
we should cleanup as much as possible the string headers and only add optimization
that are more architecture neutral.

Related to patch, my understanding is s390x does not really provide an optimized
mempcpy (it uses the default mempcpy.c) and I think a better approach would be
to add an optimized mempcpy like x88_64 (c365e615f7429aee302f8af7bf07ae262278febb).
The mempcpy won't be optimized directly to mvc instruction, but at least it will
call the optimized memcpy.  The full optimization will be done by correctly
handling the transformation on compiler size (the PR#70140 as you referred).

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-05-05 16:36               ` H.J. Lu
@ 2016-05-09 14:39                 ` Stefan Liebler
  2016-05-12 14:11                   ` Adhemerval Zanella
  0 siblings, 1 reply; 36+ messages in thread
From: Stefan Liebler @ 2016-05-09 14:39 UTC (permalink / raw)
  To: libc-alpha

On 05/05/2016 06:36 PM, H.J. Lu wrote:
> On Thu, May 5, 2016 at 9:34 AM, Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>>
>>
>>> On May 5, 2016, at 11:45, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>
>>> On Thu, May 5, 2016 at 7:15 AM, Adhemerval Zanella
>>> <adhemerval.zanella@linaro.org> wrote:
>>>>
>>>>
>>>>> On 05/05/2016 10:37, H.J. Lu wrote:
>>>>> On Wed, May 4, 2016 at 1:58 PM, Adhemerval Zanella
>>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>>
>>>>>>
>>>>>>> On 04/05/2016 17:51, Wilco Dijkstra wrote:
>>>>>>> Adhemerval Zanella wrote:
>>>>>>>>
>>>>>>>> But my point is all the architectures which provide an optimized mempcpy is
>>>>>>>> though either 1. jump directly to optimized memcpy (s390 case for this patchset),
>>>>>>>> 2. clonning the same memcpy implementation and adjusting the pointers (x86_64) or
>>>>>>>> 3. using a similar strategy for both implementations (powerpc).
>>>>>>>
>>>>>>> Indeed, which of those are used doesn't matter much.
>>>>>>>
>>>>>>>> So for this change I am proposing compiler support won't be required because both
>>>>>>>> memcpy and __mempcpy will be transformed to memcpy + s.  Based on assumption that
>>>>>>>> memcpy is fast as mempcpy implementation I think there is no need to just add
>>>>>>>> this micro-optimization to only s390, but rather make is general.
>>>>>>>
>>>>>>> GLIBC already has this optimization in the generic string header, it's just that s390 wants
>>>>>>> to do something different again. As long as GCC isn't fixed this isn't possible to support
>>>>>>> s390 without this header workaround. And we need GCC to improve so things work
>>>>>>> better for all the other C libraries...
>>>>>>
>>>>>> But the current one at string/string.h is only enabled with !defined _HAVE_STRING_ARCH_mempcpy,
>>>>>> so if a port actually adds a mempcpy one it won't be enabled.  What I am trying to argue it
>>>>>> to just remove the !defined _HAVE_STRING_ARCH_mempcpy and enable it as default for all
>>>>>> ports.
>>>>>
>>>>> Please don't enable it for x86.  Calling memcpy means we have to
>>>>> save and restore 2 registers for no good reasons.
>>>>
>>>> Yes, direct call will require save and restore the size for further add
>>>> and this is true for most architectures.  My question is if does this
>>>> really matter in currently GLIBC internal usage and on programs that
>>>> might use it compared against the burden of keeping the various
>>>> string*.h header in check for multiple architectures or adding this
>>>> logic (mempcpy transformation to memcpy) on compiler.
>>>
>>> What burden? There is nothing to do in glibc for x86.  GCC can
>>> inline mempcpy for x86.
>>
>> In fact I am objecting all the bits GLIBC added on string*.h that only adds complexity for some micro-optimizations. For x86 I do agree that transforming mempcpy to memcpy is no the best strategy.
>>
>> My rationale is to avoid add even more arch-specific bits in installed headers to add such optimizations.
>
> I believe most of those micro-optimizations belong to GCC, not glibc.
> Of course, we should keep the existing ones for older GCCs.  We
> should avoid adding new ones.
>

Does this mean, the proposed way is to not add a macro for mempcpy in 
sysdeps/s390/bits/string.h or e.g. for another architecture?

If _HAVE_STRING_ARCH_mempcpy is not defined for s390, then it will 
always use the macro defined in string/string.h, which inlines memcpy + 
n and the mempcpy-function is not called. The memcpy is transformed to 
mvc, mvhi for e.g. constant lengths. Otherwise, the memcpy-function is 
called and the length will be added after the call.

According to PR70140 
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70140), the macro in 
string/string.h will be removed after this bug is fixed in GCC?
Then I think the decision, if mempcpy or memcpy is called or an inline 
version is emitted, will be done per architecture backend?

If this is all true, then the mempcpy-function will be called in future 
if it makes sense!?


If _HAVE_STRING_ARCH_mempcpy will be defined for s390, then 
mempcpy-function is always called - even for cases with constant length 
where mvc or mvhi is emitted.
This is definitely not the intention of this patch.
After fixing PR70140, GCC will correctly handle the constant length cases!?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-05-05 16:34             ` Adhemerval Zanella
@ 2016-05-05 16:36               ` H.J. Lu
  2016-05-09 14:39                 ` Stefan Liebler
  0 siblings, 1 reply; 36+ messages in thread
From: H.J. Lu @ 2016-05-05 16:36 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Wilco Dijkstra, nd, GNU C Library

On Thu, May 5, 2016 at 9:34 AM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
>> On May 5, 2016, at 11:45, H.J. Lu <hjl.tools@gmail.com> wrote:
>>
>> On Thu, May 5, 2016 at 7:15 AM, Adhemerval Zanella
>> <adhemerval.zanella@linaro.org> wrote:
>>>
>>>
>>>> On 05/05/2016 10:37, H.J. Lu wrote:
>>>> On Wed, May 4, 2016 at 1:58 PM, Adhemerval Zanella
>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>
>>>>>
>>>>>> On 04/05/2016 17:51, Wilco Dijkstra wrote:
>>>>>> Adhemerval Zanella wrote:
>>>>>>>
>>>>>>> But my point is all the architectures which provide an optimized mempcpy is
>>>>>>> though either 1. jump directly to optimized memcpy (s390 case for this patchset),
>>>>>>> 2. clonning the same memcpy implementation and adjusting the pointers (x86_64) or
>>>>>>> 3. using a similar strategy for both implementations (powerpc).
>>>>>>
>>>>>> Indeed, which of those are used doesn't matter much.
>>>>>>
>>>>>>> So for this change I am proposing compiler support won't be required because both
>>>>>>> memcpy and __mempcpy will be transformed to memcpy + s.  Based on assumption that
>>>>>>> memcpy is fast as mempcpy implementation I think there is no need to just add
>>>>>>> this micro-optimization to only s390, but rather make is general.
>>>>>>
>>>>>> GLIBC already has this optimization in the generic string header, it's just that s390 wants
>>>>>> to do something different again. As long as GCC isn't fixed this isn't possible to support
>>>>>> s390 without this header workaround. And we need GCC to improve so things work
>>>>>> better for all the other C libraries...
>>>>>
>>>>> But the current one at string/string.h is only enabled with !defined _HAVE_STRING_ARCH_mempcpy,
>>>>> so if a port actually adds a mempcpy one it won't be enabled.  What I am trying to argue it
>>>>> to just remove the !defined _HAVE_STRING_ARCH_mempcpy and enable it as default for all
>>>>> ports.
>>>>
>>>> Please don't enable it for x86.  Calling memcpy means we have to
>>>> save and restore 2 registers for no good reasons.
>>>
>>> Yes, direct call will require save and restore the size for further add
>>> and this is true for most architectures.  My question is if does this
>>> really matter in currently GLIBC internal usage and on programs that
>>> might use it compared against the burden of keeping the various
>>> string*.h header in check for multiple architectures or adding this
>>> logic (mempcpy transformation to memcpy) on compiler.
>>
>> What burden? There is nothing to do in glibc for x86.  GCC can
>> inline mempcpy for x86.
>
> In fact I am objecting all the bits GLIBC added on string*.h that only adds complexity for some micro-optimizations. For x86 I do agree that transforming mempcpy to memcpy is no the best strategy.
>
> My rationale is to avoid add even more arch-specific bits in installed headers to add such optimizations.

I believe most of those micro-optimizations belong to GCC, not glibc.
Of course, we should keep the existing ones for older GCCs.  We
should avoid adding new ones.

-- 
H.J.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-05-05 14:45           ` H.J. Lu
@ 2016-05-05 16:34             ` Adhemerval Zanella
  2016-05-05 16:36               ` H.J. Lu
  0 siblings, 1 reply; 36+ messages in thread
From: Adhemerval Zanella @ 2016-05-05 16:34 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Wilco Dijkstra, nd, GNU C Library



> On May 5, 2016, at 11:45, H.J. Lu <hjl.tools@gmail.com> wrote:
> 
> On Thu, May 5, 2016 at 7:15 AM, Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>> 
>> 
>>> On 05/05/2016 10:37, H.J. Lu wrote:
>>> On Wed, May 4, 2016 at 1:58 PM, Adhemerval Zanella
>>> <adhemerval.zanella@linaro.org> wrote:
>>>> 
>>>> 
>>>>> On 04/05/2016 17:51, Wilco Dijkstra wrote:
>>>>> Adhemerval Zanella wrote:
>>>>>> 
>>>>>> But my point is all the architectures which provide an optimized mempcpy is
>>>>>> though either 1. jump directly to optimized memcpy (s390 case for this patchset),
>>>>>> 2. clonning the same memcpy implementation and adjusting the pointers (x86_64) or
>>>>>> 3. using a similar strategy for both implementations (powerpc).
>>>>> 
>>>>> Indeed, which of those are used doesn't matter much.
>>>>> 
>>>>>> So for this change I am proposing compiler support won't be required because both
>>>>>> memcpy and __mempcpy will be transformed to memcpy + s.  Based on assumption that
>>>>>> memcpy is fast as mempcpy implementation I think there is no need to just add
>>>>>> this micro-optimization to only s390, but rather make is general.
>>>>> 
>>>>> GLIBC already has this optimization in the generic string header, it's just that s390 wants
>>>>> to do something different again. As long as GCC isn't fixed this isn't possible to support
>>>>> s390 without this header workaround. And we need GCC to improve so things work
>>>>> better for all the other C libraries...
>>>> 
>>>> But the current one at string/string.h is only enabled with !defined _HAVE_STRING_ARCH_mempcpy,
>>>> so if a port actually adds a mempcpy one it won't be enabled.  What I am trying to argue it
>>>> to just remove the !defined _HAVE_STRING_ARCH_mempcpy and enable it as default for all
>>>> ports.
>>> 
>>> Please don't enable it for x86.  Calling memcpy means we have to
>>> save and restore 2 registers for no good reasons.
>> 
>> Yes, direct call will require save and restore the size for further add
>> and this is true for most architectures.  My question is if does this
>> really matter in currently GLIBC internal usage and on programs that
>> might use it compared against the burden of keeping the various
>> string*.h header in check for multiple architectures or adding this
>> logic (mempcpy transformation to memcpy) on compiler.
> 
> What burden? There is nothing to do in glibc for x86.  GCC can
> inline mempcpy for x86.

In fact I am objecting all the bits GLIBC added on string*.h that only adds complexity for some micro-optimizations. For x86 I do agree that transforming mempcpy to memcpy is no the best strategy.

My rationale is to avoid add even more arch-specific bits in installed headers to add such optimizations.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-05-05 14:16         ` Adhemerval Zanella
@ 2016-05-05 14:45           ` H.J. Lu
  2016-05-05 16:34             ` Adhemerval Zanella
  0 siblings, 1 reply; 36+ messages in thread
From: H.J. Lu @ 2016-05-05 14:45 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Wilco Dijkstra, nd, GNU C Library

On Thu, May 5, 2016 at 7:15 AM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
> On 05/05/2016 10:37, H.J. Lu wrote:
>> On Wed, May 4, 2016 at 1:58 PM, Adhemerval Zanella
>> <adhemerval.zanella@linaro.org> wrote:
>>>
>>>
>>> On 04/05/2016 17:51, Wilco Dijkstra wrote:
>>>> Adhemerval Zanella wrote:
>>>>>
>>>>> But my point is all the architectures which provide an optimized mempcpy is
>>>>> though either 1. jump directly to optimized memcpy (s390 case for this patchset),
>>>>> 2. clonning the same memcpy implementation and adjusting the pointers (x86_64) or
>>>>> 3. using a similar strategy for both implementations (powerpc).
>>>>
>>>> Indeed, which of those are used doesn't matter much.
>>>>
>>>>> So for this change I am proposing compiler support won't be required because both
>>>>> memcpy and __mempcpy will be transformed to memcpy + s.  Based on assumption that
>>>>> memcpy is fast as mempcpy implementation I think there is no need to just add
>>>>> this micro-optimization to only s390, but rather make is general.
>>>>
>>>> GLIBC already has this optimization in the generic string header, it's just that s390 wants
>>>> to do something different again. As long as GCC isn't fixed this isn't possible to support
>>>> s390 without this header workaround. And we need GCC to improve so things work
>>>> better for all the other C libraries...
>>>
>>> But the current one at string/string.h is only enabled with !defined _HAVE_STRING_ARCH_mempcpy,
>>> so if a port actually adds a mempcpy one it won't be enabled.  What I am trying to argue it
>>> to just remove the !defined _HAVE_STRING_ARCH_mempcpy and enable it as default for all
>>> ports.
>>
>> Please don't enable it for x86.  Calling memcpy means we have to
>> save and restore 2 registers for no good reasons.
>
> Yes, direct call will require save and restore the size for further add
> and this is true for most architectures.  My question is if does this
> really matter in currently GLIBC internal usage and on programs that
> might use it compared against the burden of keeping the various
> string*.h header in check for multiple architectures or adding this
> logic (mempcpy transformation to memcpy) on compiler.
>

What burden? There is nothing to do in glibc for x86.  GCC can
inline mempcpy for x86.

-- 
H.J.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-05-05 13:37       ` H.J. Lu
@ 2016-05-05 14:16         ` Adhemerval Zanella
  2016-05-05 14:45           ` H.J. Lu
  0 siblings, 1 reply; 36+ messages in thread
From: Adhemerval Zanella @ 2016-05-05 14:16 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Wilco Dijkstra, nd, GNU C Library



On 05/05/2016 10:37, H.J. Lu wrote:
> On Wed, May 4, 2016 at 1:58 PM, Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>>
>>
>> On 04/05/2016 17:51, Wilco Dijkstra wrote:
>>> Adhemerval Zanella wrote:
>>>>
>>>> But my point is all the architectures which provide an optimized mempcpy is
>>>> though either 1. jump directly to optimized memcpy (s390 case for this patchset),
>>>> 2. clonning the same memcpy implementation and adjusting the pointers (x86_64) or
>>>> 3. using a similar strategy for both implementations (powerpc).
>>>
>>> Indeed, which of those are used doesn't matter much.
>>>
>>>> So for this change I am proposing compiler support won't be required because both
>>>> memcpy and __mempcpy will be transformed to memcpy + s.  Based on assumption that
>>>> memcpy is fast as mempcpy implementation I think there is no need to just add
>>>> this micro-optimization to only s390, but rather make is general.
>>>
>>> GLIBC already has this optimization in the generic string header, it's just that s390 wants
>>> to do something different again. As long as GCC isn't fixed this isn't possible to support
>>> s390 without this header workaround. And we need GCC to improve so things work
>>> better for all the other C libraries...
>>
>> But the current one at string/string.h is only enabled with !defined _HAVE_STRING_ARCH_mempcpy,
>> so if a port actually adds a mempcpy one it won't be enabled.  What I am trying to argue it
>> to just remove the !defined _HAVE_STRING_ARCH_mempcpy and enable it as default for all
>> ports.
> 
> Please don't enable it for x86.  Calling memcpy means we have to
> save and restore 2 registers for no good reasons.

Yes, direct call will require save and restore the size for further add
and this is true for most architectures.  My question is if does this
really matter in currently GLIBC internal usage and on programs that
might use it compared against the burden of keeping the various
string*.h header in check for multiple architectures or adding this
logic (mempcpy transformation to memcpy) on compiler.

But I understand in the end is up to arch maintainer to add or not such
micro-optimizations.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-05-04 20:58     ` Adhemerval Zanella
  2016-05-05 13:24       ` Wilco Dijkstra
@ 2016-05-05 13:37       ` H.J. Lu
  2016-05-05 14:16         ` Adhemerval Zanella
  1 sibling, 1 reply; 36+ messages in thread
From: H.J. Lu @ 2016-05-05 13:37 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Wilco Dijkstra, nd, GNU C Library

On Wed, May 4, 2016 at 1:58 PM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
> On 04/05/2016 17:51, Wilco Dijkstra wrote:
>> Adhemerval Zanella wrote:
>>>
>>> But my point is all the architectures which provide an optimized mempcpy is
>>> though either 1. jump directly to optimized memcpy (s390 case for this patchset),
>>> 2. clonning the same memcpy implementation and adjusting the pointers (x86_64) or
>>> 3. using a similar strategy for both implementations (powerpc).
>>
>> Indeed, which of those are used doesn't matter much.
>>
>>> So for this change I am proposing compiler support won't be required because both
>>> memcpy and __mempcpy will be transformed to memcpy + s.  Based on assumption that
>>> memcpy is fast as mempcpy implementation I think there is no need to just add
>>> this micro-optimization to only s390, but rather make is general.
>>
>> GLIBC already has this optimization in the generic string header, it's just that s390 wants
>> to do something different again. As long as GCC isn't fixed this isn't possible to support
>> s390 without this header workaround. And we need GCC to improve so things work
>> better for all the other C libraries...
>
> But the current one at string/string.h is only enabled with !defined _HAVE_STRING_ARCH_mempcpy,
> so if a port actually adds a mempcpy one it won't be enabled.  What I am trying to argue it
> to just remove the !defined _HAVE_STRING_ARCH_mempcpy and enable it as default for all
> ports.

Please don't enable it for x86.  Calling memcpy means we have to
save and restore 2 registers for no good reasons.


-- 
H.J.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-05-04 20:58     ` Adhemerval Zanella
@ 2016-05-05 13:24       ` Wilco Dijkstra
  2016-05-05 13:37       ` H.J. Lu
  1 sibling, 0 replies; 36+ messages in thread
From: Wilco Dijkstra @ 2016-05-05 13:24 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: nd, 'GNU C Library'

Adhemerval Zanella wrote:
> But the current one at string/string.h is only enabled with !defined _HAVE_STRING_ARCH_mempcpy,
> so if a port actually adds a mempcpy one it won't be enabled.  What I am trying to argue it
> to just remove the !defined _HAVE_STRING_ARCH_mempcpy and enable it as default for all
> ports.

I'd be happy with treating mempcpy like bzero/bcopy - old nonstandard ABIs that should be
discouraged in new code and removed from existing code. But could we reach consensus on that?

Wilco

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-05-04 20:51   ` Wilco Dijkstra
@ 2016-05-04 20:58     ` Adhemerval Zanella
  2016-05-05 13:24       ` Wilco Dijkstra
  2016-05-05 13:37       ` H.J. Lu
  0 siblings, 2 replies; 36+ messages in thread
From: Adhemerval Zanella @ 2016-05-04 20:58 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: nd, 'GNU C Library'



On 04/05/2016 17:51, Wilco Dijkstra wrote:
> Adhemerval Zanella wrote:
>>
>> But my point is all the architectures which provide an optimized mempcpy is
>> though either 1. jump directly to optimized memcpy (s390 case for this patchset),
>> 2. clonning the same memcpy implementation and adjusting the pointers (x86_64) or
>> 3. using a similar strategy for both implementations (powerpc).
> 
> Indeed, which of those are used doesn't matter much.
> 
>> So for this change I am proposing compiler support won't be required because both
>> memcpy and __mempcpy will be transformed to memcpy + s.  Based on assumption that
>> memcpy is fast as mempcpy implementation I think there is no need to just add
>> this micro-optimization to only s390, but rather make is general.
> 
> GLIBC already has this optimization in the generic string header, it's just that s390 wants
> to do something different again. As long as GCC isn't fixed this isn't possible to support
> s390 without this header workaround. And we need GCC to improve so things work
> better for all the other C libraries...

But the current one at string/string.h is only enabled with !defined _HAVE_STRING_ARCH_mempcpy,
so if a port actually adds a mempcpy one it won't be enabled.  What I am trying to argue it
to just remove the !defined _HAVE_STRING_ARCH_mempcpy and enable it as default for all
ports.

> 
> In an ideal world mempcy and friends should never have been introduced as micro
> optimizations, they just cause performance and maintenance headaches for no actual
> benefit...
> 

Agreed ...

> Wilco
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-05-04 18:13 ` Adhemerval Zanella
  2016-05-04 18:21   ` H.J. Lu
@ 2016-05-04 20:51   ` Wilco Dijkstra
  2016-05-04 20:58     ` Adhemerval Zanella
  1 sibling, 1 reply; 36+ messages in thread
From: Wilco Dijkstra @ 2016-05-04 20:51 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: nd, 'GNU C Library'

Adhemerval Zanella wrote:
>
> But my point is all the architectures which provide an optimized mempcpy is
> though either 1. jump directly to optimized memcpy (s390 case for this patchset),
> 2. clonning the same memcpy implementation and adjusting the pointers (x86_64) or
> 3. using a similar strategy for both implementations (powerpc).

Indeed, which of those are used doesn't matter much.

> So for this change I am proposing compiler support won't be required because both
> memcpy and __mempcpy will be transformed to memcpy + s.  Based on assumption that
> memcpy is fast as mempcpy implementation I think there is no need to just add
> this micro-optimization to only s390, but rather make is general.

GLIBC already has this optimization in the generic string header, it's just that s390 wants
to do something different again. As long as GCC isn't fixed this isn't possible to support
s390 without this header workaround. And we need GCC to improve so things work
better for all the other C libraries...

In an ideal world mempcy and friends should never have been introduced as micro
optimizations, they just cause performance and maintenance headaches for no actual
benefit...

Wilco

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-05-04 18:21   ` H.J. Lu
@ 2016-05-04 18:23     ` Adhemerval Zanella
  0 siblings, 0 replies; 36+ messages in thread
From: Adhemerval Zanella @ 2016-05-04 18:23 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Wilco Dijkstra, nd, GNU C Library



On 04/05/2016 15:20, H.J. Lu wrote:
> On Wed, May 4, 2016 at 11:12 AM, Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>>
>>
>> On 04/05/2016 14:17, Wilco Dijkstra wrote:
>>> Adhemerval Zanella wrote:
>>>> Right, but I *think* compiler would be smart enough to just avoid the extra spilling.
>>>> Take this example for instance [1], using GCC 5.3 for s390x I see no difference in
>>>> generated assembly if I the strategy I proposed (-DMEMPCPY_TO_MEMCPY) to
>>>> the s390 specific you are suggesting.  In the end, I am proposing that architecture
>>>> specific micro-optimization should be avoid in favor of a more specific one.
>>>> Specially the one that tend to avoid one or two extra spilling based on quite complex
>>>> macro expansion.  [1] http://pastie.org/10824072
>>>
>>> You need to use something like this to show the difference:
>>>
>>> return __mempcpy (__mempcpy (__mempcpy (p1, s, len), p2, 1), p3, 16);
>>>
>>> GCC doesn't even optimize mempcpy of constant size (PR70140), so if you do have
>>> an optimized mempcpy like s390 here, you *still* need to use memcpy for small immediate
>>> sizes (so they get inlined), and only use mempcpy for unknown or very large sizes.
>>>
>>> We end up having to do these header tricks because GCC doesn't implement mempcpy
>>> as a first-class builtin or allow targets to defer to memcpy.
>>>
>>> There are similar issues with strchr (s, 0) being used instead of the faster strlen (s) + s.
>>>
>>> Wilco
>>
>> But my point is all the architectures which provide an optimized mempcpy is
>> though either 1. jump directly to optimized memcpy (s390 case for this patchset),
>> 2. clonning the same memcpy implementation and adjusting the pointers (x86_64) or
> 
> X86-64 doesn't do that after
> 
> commit c365e615f7429aee302f8af7bf07ae262278febb
> Author: H.J. Lu <hjl.tools@gmail.com>
> Date:   Mon Mar 28 13:13:36 2016 -0700
> 
>     Implement x86-64 multiarch mempcpy in memcpy
> 
>     Implement x86-64 multiarch mempcpy in memcpy to share most of code.  It
>     reduces code size of libc.so.
> 
>       [BZ #18858]

Right, so it follows s390 strategy as well.

> 
>> 3. using a similar strategy for both implementations (powerpc).
>>
>> So for this change I am proposing compiler support won't be required because both
>> memcpy and __mempcpy will be transformed to memcpy + s.  Based on assumption that
>> memcpy is fast as mempcpy implementation I think there is no need to just add
>> this micro-optimization to only s390, but rather make is general.
> 
> 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-05-04 18:13 ` Adhemerval Zanella
@ 2016-05-04 18:21   ` H.J. Lu
  2016-05-04 18:23     ` Adhemerval Zanella
  2016-05-04 20:51   ` Wilco Dijkstra
  1 sibling, 1 reply; 36+ messages in thread
From: H.J. Lu @ 2016-05-04 18:21 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Wilco Dijkstra, nd, GNU C Library

On Wed, May 4, 2016 at 11:12 AM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
> On 04/05/2016 14:17, Wilco Dijkstra wrote:
>> Adhemerval Zanella wrote:
>>> Right, but I *think* compiler would be smart enough to just avoid the extra spilling.
>>> Take this example for instance [1], using GCC 5.3 for s390x I see no difference in
>>> generated assembly if I the strategy I proposed (-DMEMPCPY_TO_MEMCPY) to
>>> the s390 specific you are suggesting.  In the end, I am proposing that architecture
>>> specific micro-optimization should be avoid in favor of a more specific one.
>>> Specially the one that tend to avoid one or two extra spilling based on quite complex
>>> macro expansion.  [1] http://pastie.org/10824072
>>
>> You need to use something like this to show the difference:
>>
>> return __mempcpy (__mempcpy (__mempcpy (p1, s, len), p2, 1), p3, 16);
>>
>> GCC doesn't even optimize mempcpy of constant size (PR70140), so if you do have
>> an optimized mempcpy like s390 here, you *still* need to use memcpy for small immediate
>> sizes (so they get inlined), and only use mempcpy for unknown or very large sizes.
>>
>> We end up having to do these header tricks because GCC doesn't implement mempcpy
>> as a first-class builtin or allow targets to defer to memcpy.
>>
>> There are similar issues with strchr (s, 0) being used instead of the faster strlen (s) + s.
>>
>> Wilco
>
> But my point is all the architectures which provide an optimized mempcpy is
> though either 1. jump directly to optimized memcpy (s390 case for this patchset),
> 2. clonning the same memcpy implementation and adjusting the pointers (x86_64) or

X86-64 doesn't do that after

commit c365e615f7429aee302f8af7bf07ae262278febb
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Mar 28 13:13:36 2016 -0700

    Implement x86-64 multiarch mempcpy in memcpy

    Implement x86-64 multiarch mempcpy in memcpy to share most of code.  It
    reduces code size of libc.so.

      [BZ #18858]

> 3. using a similar strategy for both implementations (powerpc).
>
> So for this change I am proposing compiler support won't be required because both
> memcpy and __mempcpy will be transformed to memcpy + s.  Based on assumption that
> memcpy is fast as mempcpy implementation I think there is no need to just add
> this micro-optimization to only s390, but rather make is general.



-- 
H.J.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
  2016-05-04 17:17 [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765] Wilco Dijkstra
@ 2016-05-04 18:13 ` Adhemerval Zanella
  2016-05-04 18:21   ` H.J. Lu
  2016-05-04 20:51   ` Wilco Dijkstra
  0 siblings, 2 replies; 36+ messages in thread
From: Adhemerval Zanella @ 2016-05-04 18:13 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: nd, 'GNU C Library'



On 04/05/2016 14:17, Wilco Dijkstra wrote:
> Adhemerval Zanella wrote:
>> Right, but I *think* compiler would be smart enough to just avoid the extra spilling. 
>> Take this example for instance [1], using GCC 5.3 for s390x I see no difference in
>> generated assembly if I the strategy I proposed (-DMEMPCPY_TO_MEMCPY) to
>> the s390 specific you are suggesting.  In the end, I am proposing that architecture
>> specific micro-optimization should be avoid in favor of a more specific one.  
>> Specially the one that tend to avoid one or two extra spilling based on quite complex
>> macro expansion.  [1] http://pastie.org/10824072
> 
> You need to use something like this to show the difference:
> 
> return __mempcpy (__mempcpy (__mempcpy (p1, s, len), p2, 1), p3, 16);
> 
> GCC doesn't even optimize mempcpy of constant size (PR70140), so if you do have
> an optimized mempcpy like s390 here, you *still* need to use memcpy for small immediate
> sizes (so they get inlined), and only use mempcpy for unknown or very large sizes.
> 
> We end up having to do these header tricks because GCC doesn't implement mempcpy
> as a first-class builtin or allow targets to defer to memcpy.
> 
> There are similar issues with strchr (s, 0) being used instead of the faster strlen (s) + s.
> 
> Wilco

But my point is all the architectures which provide an optimized mempcpy is
though either 1. jump directly to optimized memcpy (s390 case for this patchset),
2. clonning the same memcpy implementation and adjusting the pointers (x86_64) or 
3. using a similar strategy for both implementations (powerpc).

So for this change I am proposing compiler support won't be required because both
memcpy and __mempcpy will be transformed to memcpy + s.  Based on assumption that
memcpy is fast as mempcpy implementation I think there is no need to just add
this micro-optimization to only s390, but rather make is general.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765]
@ 2016-05-04 17:17 Wilco Dijkstra
  2016-05-04 18:13 ` Adhemerval Zanella
  0 siblings, 1 reply; 36+ messages in thread
From: Wilco Dijkstra @ 2016-05-04 17:17 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: nd, 'GNU C Library'

Adhemerval Zanella wrote:
> Right, but I *think* compiler would be smart enough to just avoid the extra spilling. 
> Take this example for instance [1], using GCC 5.3 for s390x I see no difference in
> generated assembly if I the strategy I proposed (-DMEMPCPY_TO_MEMCPY) to
> the s390 specific you are suggesting.  In the end, I am proposing that architecture
> specific micro-optimization should be avoid in favor of a more specific one.  
> Specially the one that tend to avoid one or two extra spilling based on quite complex
> macro expansion.  [1] http://pastie.org/10824072

You need to use something like this to show the difference:

return __mempcpy (__mempcpy (__mempcpy (p1, s, len), p2, 1), p3, 16);

GCC doesn't even optimize mempcpy of constant size (PR70140), so if you do have
an optimized mempcpy like s390 here, you *still* need to use memcpy for small immediate
sizes (so they get inlined), and only use mempcpy for unknown or very large sizes.

We end up having to do these header tricks because GCC doesn't implement mempcpy
as a first-class builtin or allow targets to defer to memcpy.

There are similar issues with strchr (s, 0) being used instead of the faster strlen (s) + s.

Wilco

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2016-05-24  8:55 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-26 12:08 [PATCH 1/4] S390: Use mvcle for copies > 1MB on 32bit with default memcpy variant Stefan Liebler
2016-04-26 12:08 ` [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765] Stefan Liebler
2016-04-26 13:33   ` Adhemerval Zanella
2016-04-27  8:15     ` Stefan Liebler
2016-05-04 15:42       ` Adhemerval Zanella
2016-05-04 13:20   ` Stefan Liebler
2016-04-26 12:08 ` [PATCH 3/4] S390: Do not call memcpy, memcmp, memset within libc.so via ifunc-plt Stefan Liebler
2016-04-26 13:35   ` Adhemerval Zanella
2016-04-27  8:20     ` Stefan Liebler
2016-05-04 15:22       ` Adhemerval Zanella
2016-05-11 13:41         ` Stefan Liebler
2016-05-11 19:27           ` Adhemerval Zanella
2016-04-26 12:08 ` [PATCH 2/4] S390: Use 64bit instruction to check for copies of > 1MB with mvcle Stefan Liebler
2016-04-26 14:16   ` Florian Weimer
2016-04-26 14:20     ` Stefan Liebler
2016-05-04 17:17 [PATCH 4/4] S390: Implement mempcpy with help of memcpy. [BZ #19765] Wilco Dijkstra
2016-05-04 18:13 ` Adhemerval Zanella
2016-05-04 18:21   ` H.J. Lu
2016-05-04 18:23     ` Adhemerval Zanella
2016-05-04 20:51   ` Wilco Dijkstra
2016-05-04 20:58     ` Adhemerval Zanella
2016-05-05 13:24       ` Wilco Dijkstra
2016-05-05 13:37       ` H.J. Lu
2016-05-05 14:16         ` Adhemerval Zanella
2016-05-05 14:45           ` H.J. Lu
2016-05-05 16:34             ` Adhemerval Zanella
2016-05-05 16:36               ` H.J. Lu
2016-05-09 14:39                 ` Stefan Liebler
2016-05-12 14:11                   ` Adhemerval Zanella
2016-05-13 14:43                     ` Stefan Liebler
2016-05-13 14:49                       ` H.J. Lu
2016-05-13 14:59                         ` Stefan Liebler
2016-05-13 15:06                           ` H.J. Lu
2016-05-13 15:31                             ` Stefan Liebler
2016-05-18 15:25                               ` Stefan Liebler
2016-05-24  9:11                                 ` Stefan Liebler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).