public inbox for libc-alpha@sourceware.org
* [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583
@ 2016-03-17 10:52 Pawar, Amit
  2016-03-17 11:53 ` H.J. Lu
  0 siblings, 1 reply; 23+ messages in thread
From: Pawar, Amit @ 2016-03-17 10:52 UTC (permalink / raw)
  To: libc-alpha

[-- Attachment #1: Type: text/plain, Size: 457 bytes --]

This patch is a fix for https://sourceware.org/bugzilla/show_bug.cgi?id=19583

As per the suggestion, this patch defines new bit_arch_Prefer_Fast_Copy_Backward and index_arch_Prefer_Fast_Copy_Backward feature bit macros to update the IFUNC selection order for the memcpy, memcpy_chk, mempcpy, mempcpy_chk, memmove and memmove_chk functions without affecting other targets. The patch and ChangeLog files are attached to this mail. If it is OK, please commit it on my behalf.


Thanks,
Amit.

[-- Attachment #2: 0001-x86_64-Update-memcpy-mempcpy-and-memmove-selection-o.patch --]
[-- Type: application/octet-stream, Size: 7266 bytes --]

From d55a2156aece42785ac9f2c768775a3825366b19 Mon Sep 17 00:00:00 2001
From: Amit Pawar <Amit.Pawar@amd.com>
Date: Thu, 17 Mar 2016 16:08:36 +0530
Subject: [PATCH] x86_64 Update memcpy, mempcpy and memmove selection order for
 Excavator CPU.

The performance of the Fast_Copy_Backward based implementations is better
than that of the AVX_Fast_Unaligned_Load based implementations of the
memcpy, memcpy_chk, mempcpy, mempcpy_chk, memmove and memmove_chk functions
on the Excavator CPU.  A new feature bit is required to fix this in the
IFUNC functions without affecting other targets.  So define the new
bit_arch_Prefer_Fast_Copy_Backward and index_arch_Prefer_Fast_Copy_Backward
feature bit macros and update the selection order for these functions based
on this feature bit.

	[BZ #19583]
	* sysdeps/x86/cpu-features.h: Define new
	bit_arch_Prefer_Fast_Copy_Backward and
	index_arch_Prefer_Fast_Copy_Backward feature bit macros.
	* sysdeps/x86/cpu-features.c: Enable Fast_Copy_Backward and
	Prefer_Fast_Copy_Backward for Excavator core.
	* sysdeps/x86_64/multiarch/memcpy.S: Check Prefer_Fast_Copy_Backward to
	enable __memcpy_ssse3_back.
	* sysdeps/x86_64/multiarch/memcpy_chk.S: Check Prefer_Fast_Copy_Backward
	to enable __memcpy_chk_ssse3_back.
	* sysdeps/x86_64/multiarch/mempcpy.S: Check Prefer_Fast_Copy_Backward to
	enable __mempcpy_ssse3_back.
	* sysdeps/x86_64/multiarch/mempcpy_chk.S: Check
	Prefer_Fast_Copy_Backward to enable __mempcpy_chk_ssse3_back.
	* sysdeps/x86_64/multiarch/memmove.c: Check Prefer_Fast_Copy_Backward to
	enable __memmove_ssse3_back.
	* sysdeps/x86_64/multiarch/memmove_chk.c: Check
	Prefer_Fast_Copy_Backward to enable __memmove_chk_ssse3_back.
---
 sysdeps/x86/cpu-features.c             | 13 +++++++++++--
 sysdeps/x86/cpu-features.h             |  3 +++
 sysdeps/x86_64/multiarch/memcpy.S      |  5 ++++-
 sysdeps/x86_64/multiarch/memcpy_chk.S  |  3 +++
 sysdeps/x86_64/multiarch/memmove.c     |  3 ++-
 sysdeps/x86_64/multiarch/memmove_chk.c |  4 +++-
 sysdeps/x86_64/multiarch/mempcpy.S     |  3 +++
 sysdeps/x86_64/multiarch/mempcpy_chk.S |  3 +++
 8 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index 1787716..84a84cc 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -160,8 +160,17 @@ init_cpu_features (struct cpu_features *cpu_features)
 	{
 	  /* "Excavator"   */
 	  if (model >= 0x60 && model <= 0x7f)
-	    cpu_features->feature[index_arch_Fast_Unaligned_Load]
-	      |= bit_arch_Fast_Unaligned_Load;
+	    {
+	      cpu_features->feature[index_arch_Fast_Unaligned_Load]
+		|= bit_arch_Fast_Unaligned_Load;
+
+	      cpu_features->feature[index_arch_Fast_Copy_Backward]
+		|= bit_arch_Fast_Copy_Backward;
+
+	      cpu_features->feature[index_arch_Prefer_Fast_Copy_Backward]
+		|= bit_arch_Prefer_Fast_Copy_Backward;
+
+	    }
 	}
     }
   else
diff --git a/sysdeps/x86/cpu-features.h b/sysdeps/x86/cpu-features.h
index 0624a92..9750f2f 100644
--- a/sysdeps/x86/cpu-features.h
+++ b/sysdeps/x86/cpu-features.h
@@ -35,6 +35,7 @@
 #define bit_arch_I686				(1 << 15)
 #define bit_arch_Prefer_MAP_32BIT_EXEC		(1 << 16)
 #define bit_arch_Prefer_No_VZEROUPPER		(1 << 17)
+#define bit_arch_Prefer_Fast_Copy_Backward	(1 << 18)
 
 /* CPUID Feature flags.  */
 
@@ -101,6 +102,7 @@
 # define index_arch_I686		FEATURE_INDEX_1*FEATURE_SIZE
 # define index_arch_Prefer_MAP_32BIT_EXEC FEATURE_INDEX_1*FEATURE_SIZE
 # define index_arch_Prefer_No_VZEROUPPER FEATURE_INDEX_1*FEATURE_SIZE
+# define index_arch_Prefer_Fast_Copy_Backward FEATURE_INDEX_1*FEATURE_SIZE
 
 
 # if defined (_LIBC) && !IS_IN (nonlib)
@@ -259,6 +261,7 @@ extern const struct cpu_features *__get_cpu_features (void)
 # define index_arch_I686		FEATURE_INDEX_1
 # define index_arch_Prefer_MAP_32BIT_EXEC FEATURE_INDEX_1
 # define index_arch_Prefer_No_VZEROUPPER FEATURE_INDEX_1
+# define index_arch_Prefer_Fast_Copy_Backward FEATURE_INDEX_1
 
 #endif	/* !__ASSEMBLER__ */
 
diff --git a/sysdeps/x86_64/multiarch/memcpy.S b/sysdeps/x86_64/multiarch/memcpy.S
index 8882590..ca5235b 100644
--- a/sysdeps/x86_64/multiarch/memcpy.S
+++ b/sysdeps/x86_64/multiarch/memcpy.S
@@ -38,7 +38,10 @@ ENTRY(__new_memcpy)
 	lea    __memcpy_avx512_no_vzeroupper(%rip), %RAX_LP
 	ret
 #endif
-1:	lea	__memcpy_avx_unaligned(%rip), %RAX_LP
+1:	lea	__memcpy_ssse3_back(%rip), %RAX_LP
+	HAS_ARCH_FEATURE (Prefer_Fast_Copy_Backward)
+	jnz	2f
+	lea	__memcpy_avx_unaligned(%rip), %RAX_LP
 	HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
 	jnz	2f
 	lea	__memcpy_sse2_unaligned(%rip), %RAX_LP
diff --git a/sysdeps/x86_64/multiarch/memcpy_chk.S b/sysdeps/x86_64/multiarch/memcpy_chk.S
index 648217e..af64e15 100644
--- a/sysdeps/x86_64/multiarch/memcpy_chk.S
+++ b/sysdeps/x86_64/multiarch/memcpy_chk.S
@@ -48,6 +48,9 @@ ENTRY(__memcpy_chk)
 	HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
 	jz  2f
 	leaq    __memcpy_chk_avx_unaligned(%rip), %rax
+	HAS_ARCH_FEATURE (Prefer_Fast_Copy_Backward)
+	jz  2f
+	leaq    __memcpy_chk_ssse3_back(%rip), %rax
 2:	ret
 END(__memcpy_chk)
 # else
diff --git a/sysdeps/x86_64/multiarch/memmove.c b/sysdeps/x86_64/multiarch/memmove.c
index 8da5640..dbd14f1 100644
--- a/sysdeps/x86_64/multiarch/memmove.c
+++ b/sysdeps/x86_64/multiarch/memmove.c
@@ -59,7 +59,8 @@ libc_ifunc (__libc_memmove,
 	    :
 #endif
 	    (HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
-	    ? __memmove_avx_unaligned
+	    ? (HAS_ARCH_FEATURE (Prefer_Fast_Copy_Backward)
+	       ? __memmove_ssse3_back : __memmove_avx_unaligned)
 	    : (HAS_CPU_FEATURE (SSSE3)
 	       ? (HAS_ARCH_FEATURE (Fast_Copy_Backward)
 	          ? __memmove_ssse3_back : __memmove_ssse3)
diff --git a/sysdeps/x86_64/multiarch/memmove_chk.c b/sysdeps/x86_64/multiarch/memmove_chk.c
index f64da63..7d38b99 100644
--- a/sysdeps/x86_64/multiarch/memmove_chk.c
+++ b/sysdeps/x86_64/multiarch/memmove_chk.c
@@ -39,7 +39,9 @@ libc_ifunc (__memmove_chk,
 	    ? __memmove_chk_avx512_no_vzeroupper
 	    :
 #endif
-	    HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load) ? __memmove_chk_avx_unaligned :
+	    HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
+	    ? (HAS_ARCH_FEATURE (Prefer_Fast_Copy_Backward)
+	       ? __memmove_chk_ssse3_back : __memmove_chk_avx_unaligned) :
 	    (HAS_CPU_FEATURE (SSSE3)
 	    ? (HAS_ARCH_FEATURE (Fast_Copy_Backward)
 	       ? __memmove_chk_ssse3_back : __memmove_chk_ssse3)
diff --git a/sysdeps/x86_64/multiarch/mempcpy.S b/sysdeps/x86_64/multiarch/mempcpy.S
index ed78623..ddb4689 100644
--- a/sysdeps/x86_64/multiarch/mempcpy.S
+++ b/sysdeps/x86_64/multiarch/mempcpy.S
@@ -46,6 +46,9 @@ ENTRY(__mempcpy)
 	HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
 	jz	2f
 	leaq	__mempcpy_avx_unaligned(%rip), %rax
+	HAS_ARCH_FEATURE (Prefer_Fast_Copy_Backward)
+	jz	2f
+	leaq	__mempcpy_ssse3_back(%rip), %rax
 2:	ret
 END(__mempcpy)
 
diff --git a/sysdeps/x86_64/multiarch/mempcpy_chk.S b/sysdeps/x86_64/multiarch/mempcpy_chk.S
index 6e8a89d..f72ad18 100644
--- a/sysdeps/x86_64/multiarch/mempcpy_chk.S
+++ b/sysdeps/x86_64/multiarch/mempcpy_chk.S
@@ -48,6 +48,9 @@ ENTRY(__mempcpy_chk)
 	HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
 	jz	2f
 	leaq	__mempcpy_chk_avx_unaligned(%rip), %rax
+	HAS_ARCH_FEATURE (Prefer_Fast_Copy_Backward)
+	jz	2f
+	leaq	__mempcpy_chk_ssse3_back(%rip), %rax
 2:	ret
 END(__mempcpy_chk)
 # else
-- 
2.1.4


[-- Attachment #3: ChangeLog --]
[-- Type: application/octet-stream, Size: 954 bytes --]

2016-03-17  Amit Pawar  <Amit.Pawar@amd.com>

	[BZ #19583]
	* sysdeps/x86/cpu-features.h: Define new
	bit_arch_Prefer_Fast_Copy_Backward and
	index_arch_Prefer_Fast_Copy_Backward feature bit macros.
	* sysdeps/x86/cpu-features.c: Enable Fast_Copy_Backward and
	Prefer_Fast_Copy_Backward for Excavator core.
	* sysdeps/x86_64/multiarch/memcpy.S: Check Prefer_Fast_Copy_Backward to
	enable __memcpy_ssse3_back.
	* sysdeps/x86_64/multiarch/memcpy_chk.S: Check Prefer_Fast_Copy_Backward
	to enable __memcpy_chk_ssse3_back.
	* sysdeps/x86_64/multiarch/mempcpy.S: Check Prefer_Fast_Copy_Backward to
	enable __mempcpy_ssse3_back.
	* sysdeps/x86_64/multiarch/mempcpy_chk.S: Check
	Prefer_Fast_Copy_Backward to enable __mempcpy_chk_ssse3_back.
	* sysdeps/x86_64/multiarch/memmove.c: Check Prefer_Fast_Copy_Backward to
	enable __memmove_ssse3_back.
	* sysdeps/x86_64/multiarch/memmove_chk.c: Check 
	Prefer_Fast_Copy_Backward to enable __memmove_chk_ssse3_back.



* Re: [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583
  2016-03-17 10:52 [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583 Pawar, Amit
@ 2016-03-17 11:53 ` H.J. Lu
  2016-03-17 14:16   ` Pawar, Amit
  0 siblings, 1 reply; 23+ messages in thread
From: H.J. Lu @ 2016-03-17 11:53 UTC (permalink / raw)
  To: Pawar, Amit; +Cc: libc-alpha

On Thu, Mar 17, 2016 at 3:52 AM, Pawar, Amit <Amit.Pawar@amd.com> wrote:
> This patch is fix for https://sourceware.org/bugzilla/show_bug.cgi?id=19583
>
> As per the suggestion, defining new bit_arch_Prefer_Fast_Copy_Backward and index_arch_Prefer_Fast_Copy_Backward feature bit macros to update IFUNC selection order for memcpy, memcpy_chk, mempcpy, mempcpy_chk, memmove and memmove_chk functions without affecting other targets. PFA patch and ChangeLog files to this mail. If OK please commit it from my side.
>
>

A few comments:

1. Since there is bit_arch_Fast_Copy_Backward already, please
add bit_arch_Avoid_AVX_Fast_Unaligned_Load instead.
2. Please verify that the index_arch_XXX values are the same and use
one index_arch_XXX to set all bits, as in the sketch below.  There are
examples in sysdeps/x86/cpu-features.c.
3. Please use proper ChangeLog format:

        * file (name of function, macro, ...): What changed.
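
A minimal sketch of the pattern meant in comment 2 (illustration only,
using feature names that already appear in the patch above):

#if index_arch_Fast_Unaligned_Load != index_arch_Fast_Copy_Backward
# error index_arch_Fast_Unaligned_Load != index_arch_Fast_Copy_Backward
#endif
	      /* Verified above to share one feature[] slot, so a single
		 |= through one index_arch_XXX sets all the bits.  */
	      cpu_features->feature[index_arch_Fast_Unaligned_Load]
		|= (bit_arch_Fast_Unaligned_Load
		    | bit_arch_Fast_Copy_Backward);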

-- 
H.J.


* RE: [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583
  2016-03-17 11:53 ` H.J. Lu
@ 2016-03-17 14:16   ` Pawar, Amit
  2016-03-17 14:46     ` H.J. Lu
  0 siblings, 1 reply; 23+ messages in thread
From: Pawar, Amit @ 2016-03-17 14:16 UTC (permalink / raw)
  To: H.J. Lu; +Cc: libc-alpha

>A few comments:
>
>1. Since there is bit_arch_Fast_Copy_Backward already, please add bit_arch_Avoid_AVX_Fast_Unaligned_Load instead.
One more question before I change it: bit_arch_Avoid_AVX_Fast_Unaligned_Load is more readable, but what should be selected when the AVX version is avoided? It can be used together with any feature bit, not only Fast_Copy_Backward, right?

>2. Please verify that index_arch_XXX are the same and use one index_arch_XXX to set all bits.  There are examples in sysdeps/x86/cpu-features.c.
My fault, didn’t notice that.

>3. Please use proper ChangeLog format:
>
>        * file (name of function, macro, ...): What changed.
>

Thanks,
Amit Pawar


* Re: [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583
  2016-03-17 14:16   ` Pawar, Amit
@ 2016-03-17 14:46     ` H.J. Lu
  2016-03-18 11:43       ` Pawar, Amit
  0 siblings, 1 reply; 23+ messages in thread
From: H.J. Lu @ 2016-03-17 14:46 UTC (permalink / raw)
  To: Pawar, Amit; +Cc: libc-alpha

On Thu, Mar 17, 2016 at 7:16 AM, Pawar, Amit <Amit.Pawar@amd.com> wrote:
>>A few comments:
>>
>>1. Since there is bit_arch_Fast_Copy_Backward already, please add bit_arch_Avoid_AVX_Fast_Unaligned_Load instead.
> Thought to ask for one more suggestion, bit_arch_Avoid_AVX_Fast_Unaligned_Load is more readable but what to select by avoiding it? Can it be used to set any feature bit and not only Fast_Copy_Backward right?

If bit_arch_Avoid_AVX_Fast_Unaligned_Load is set, the next best
one will be selected.  See bit_arch_Slow_SSE4_2 for an example.
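
As an illustration only (not a proposed diff), the C memmove resolver
shown earlier in this thread (sysdeps/x86_64/multiarch/memmove.c) would
express that "next best" fallback roughly as:

	/* Sketch: when the Avoid_* bit is set, skip the AVX variant and
	   fall back to the next best choice.  The AVX512 case and the
	   SSSE3/__memmove_sse2 fallback are omitted for brevity.  */
	libc_ifunc (__libc_memmove,
		    (HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
		     && !HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load))
		    ? __memmove_avx_unaligned
		    : (HAS_ARCH_FEATURE (Fast_Copy_Backward)
		       ? __memmove_ssse3_back : __memmove_ssse3));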

>>2. Please verify that index_arch_XXX are the same and use one index_arch_XXX to set all bits.  There are examples in sysdeps/x86/cpu-features.c.
> My fault, didn’t notice that.
>
>>3. Please use proper ChangeLog format:
>>
>>        * file (name of function, macro, ...): What changed.
>>
>
> Thanks,
> Amit Pawar



-- 
H.J.


* RE: [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583
  2016-03-17 14:46     ` H.J. Lu
@ 2016-03-18 11:43       ` Pawar, Amit
  2016-03-18 11:51         ` H.J. Lu
  0 siblings, 1 reply; 23+ messages in thread
From: Pawar, Amit @ 2016-03-18 11:43 UTC (permalink / raw)
  To: H.J. Lu; +Cc: libc-alpha

[-- Attachment #1: Type: text/plain, Size: 88 bytes --]

As per the comments, the attached files are updated. If OK, please commit.

Thanks,
Amit

[-- Attachment #2: 0001-x86_64-Update-memcpy-mempcpy-and-memmove-selection-o.patch --]
[-- Type: application/octet-stream, Size: 7681 bytes --]

From 5e20f755973e0ae752c872f1c1215b3d53c05772 Mon Sep 17 00:00:00 2001
From: Amit Pawar <Amit.Pawar@amd.com>
Date: Fri, 18 Mar 2016 16:58:46 +0530
Subject: [PATCH] x86_64 Update memcpy, mempcpy and memmove selection order for
 Excavator CPU.

The performance of the Fast_Copy_Backward based implementations is better
than that of the AVX_Fast_Unaligned_Load based implementations of the
memcpy, memcpy_chk, mempcpy, mempcpy_chk, memmove and memmove_chk functions
on the Excavator CPU.  A new feature bit is required to fix this in the
IFUNC functions without affecting other targets.  So define the two new
bit_arch_Avoid_AVX_Fast_Unaligned_Load and
index_arch_Avoid_AVX_Fast_Unaligned_Load feature bit macros and update the
selection order of these functions.

	[BZ #19583]
	* sysdeps/x86/cpu-features.h (bit_arch_Avoid_AVX_Fast_Unaligned_Load):
	New.
	(index_arch_Avoid_AVX_Fast_Unaligned_Load): Likewise.
	* sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Copy_Backward
	and Avoid_AVX_Fast_Unaligned_Load bits for Excavator core.
	* sysdeps/x86_64/multiarch/memcpy.S
	(__new_memcpy, Avoid_AVX_Fast_Unaligned_Load): Add check for
	Avoid_AVX_Fast_Unaligned_Load bit and select on Excavator core instead
	of AVX_Fast_Unaligned_Load.
	* sysdeps/x86_64/multiarch/memcpy_chk.S
	(__memcpy_chk, Avoid_AVX_Fast_Unaligned_Load): Likewise.
	* sysdeps/x86_64/multiarch/mempcpy.S
	(__mempcpy, Avoid_AVX_Fast_Unaligned_Load): Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk.S
	(__mempcpy_chk, Avoid_AVX_Fast_Unaligned_Load): Likewise.
	* sysdeps/x86_64/multiarch/memmove.c
	(__libc_memmove, Avoid_AVX_Fast_Unaligned_Load): Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.c
	(__memmove_chk, Avoid_AVX_Fast_Unaligned_Load): Likewise.
---
 sysdeps/x86/cpu-features.c             | 15 +++++++++++++--
 sysdeps/x86/cpu-features.h             |  3 +++
 sysdeps/x86_64/multiarch/memcpy.S      |  6 +++++-
 sysdeps/x86_64/multiarch/memcpy_chk.S  |  3 +++
 sysdeps/x86_64/multiarch/memmove.c     |  3 ++-
 sysdeps/x86_64/multiarch/memmove_chk.c |  4 +++-
 sysdeps/x86_64/multiarch/mempcpy.S     |  3 +++
 sysdeps/x86_64/multiarch/mempcpy_chk.S |  3 +++
 8 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index 1787716..f0baf0e 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -160,8 +160,19 @@ init_cpu_features (struct cpu_features *cpu_features)
 	{
 	  /* "Excavator"   */
 	  if (model >= 0x60 && model <= 0x7f)
-	    cpu_features->feature[index_arch_Fast_Unaligned_Load]
-	      |= bit_arch_Fast_Unaligned_Load;
+	    {
+#if index_arch_Avoid_AVX_Fast_Unaligned_Load != index_arch_Fast_Copy_Backward
+# error index_arch_Avoid_AVX_Fast_Unaligned_Load != index_arch_Fast_Copy_Backward
+#endif
+#if index_arch_Avoid_AVX_Fast_Unaligned_Load != index_arch_Fast_Unaligned_Load
+# error index_arch_Avoid_AVX_Fast_Unaligned_Load != index_arch_Fast_Unaligned_Load
+#endif
+	      cpu_features->feature[index_arch_Avoid_AVX_Fast_Unaligned_Load]
+		|= (bit_arch_Fast_Unaligned_Load
+		    | bit_arch_Fast_Copy_Backward
+		    | bit_arch_Avoid_AVX_Fast_Unaligned_Load);
+
+	    }
 	}
     }
   else
diff --git a/sysdeps/x86/cpu-features.h b/sysdeps/x86/cpu-features.h
index 0624a92..fe59828 100644
--- a/sysdeps/x86/cpu-features.h
+++ b/sysdeps/x86/cpu-features.h
@@ -35,6 +35,7 @@
 #define bit_arch_I686				(1 << 15)
 #define bit_arch_Prefer_MAP_32BIT_EXEC		(1 << 16)
 #define bit_arch_Prefer_No_VZEROUPPER		(1 << 17)
+#define bit_arch_Avoid_AVX_Fast_Unaligned_Load	(1 << 18)
 
 /* CPUID Feature flags.  */
 
@@ -101,6 +102,7 @@
 # define index_arch_I686		FEATURE_INDEX_1*FEATURE_SIZE
 # define index_arch_Prefer_MAP_32BIT_EXEC FEATURE_INDEX_1*FEATURE_SIZE
 # define index_arch_Prefer_No_VZEROUPPER FEATURE_INDEX_1*FEATURE_SIZE
+# define index_arch_Avoid_AVX_Fast_Unaligned_Load FEATURE_INDEX_1*FEATURE_SIZE
 
 
 # if defined (_LIBC) && !IS_IN (nonlib)
@@ -259,6 +261,7 @@ extern const struct cpu_features *__get_cpu_features (void)
 # define index_arch_I686		FEATURE_INDEX_1
 # define index_arch_Prefer_MAP_32BIT_EXEC FEATURE_INDEX_1
 # define index_arch_Prefer_No_VZEROUPPER FEATURE_INDEX_1
+# define index_arch_Avoid_AVX_Fast_Unaligned_Load FEATURE_INDEX_1
 
 #endif	/* !__ASSEMBLER__ */
 
diff --git a/sysdeps/x86_64/multiarch/memcpy.S b/sysdeps/x86_64/multiarch/memcpy.S
index 8882590..3c67da8 100644
--- a/sysdeps/x86_64/multiarch/memcpy.S
+++ b/sysdeps/x86_64/multiarch/memcpy.S
@@ -40,7 +40,7 @@ ENTRY(__new_memcpy)
 #endif
 1:	lea	__memcpy_avx_unaligned(%rip), %RAX_LP
 	HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
-	jnz	2f
+	jnz	3f
 	lea	__memcpy_sse2_unaligned(%rip), %RAX_LP
 	HAS_ARCH_FEATURE (Fast_Unaligned_Load)
 	jnz	2f
@@ -52,6 +52,10 @@ ENTRY(__new_memcpy)
 	jnz	2f
 	lea	__memcpy_ssse3(%rip), %RAX_LP
 2:	ret
+3:	HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)
+	jz	2b
+	lea	__memcpy_ssse3_back(%rip), %RAX_LP
+	ret
 END(__new_memcpy)
 
 # undef ENTRY
diff --git a/sysdeps/x86_64/multiarch/memcpy_chk.S b/sysdeps/x86_64/multiarch/memcpy_chk.S
index 648217e..5b608f6 100644
--- a/sysdeps/x86_64/multiarch/memcpy_chk.S
+++ b/sysdeps/x86_64/multiarch/memcpy_chk.S
@@ -48,6 +48,9 @@ ENTRY(__memcpy_chk)
 	HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
 	jz  2f
 	leaq    __memcpy_chk_avx_unaligned(%rip), %rax
+	HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)
+	jz  2f
+	leaq    __memcpy_chk_ssse3_back(%rip), %rax
 2:	ret
 END(__memcpy_chk)
 # else
diff --git a/sysdeps/x86_64/multiarch/memmove.c b/sysdeps/x86_64/multiarch/memmove.c
index 8da5640..b9d9439 100644
--- a/sysdeps/x86_64/multiarch/memmove.c
+++ b/sysdeps/x86_64/multiarch/memmove.c
@@ -59,7 +59,8 @@ libc_ifunc (__libc_memmove,
 	    :
 #endif
 	    (HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
-	    ? __memmove_avx_unaligned
+	    ? (HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)
+	       ? __memmove_ssse3_back : __memmove_avx_unaligned)
 	    : (HAS_CPU_FEATURE (SSSE3)
 	       ? (HAS_ARCH_FEATURE (Fast_Copy_Backward)
 	          ? __memmove_ssse3_back : __memmove_ssse3)
diff --git a/sysdeps/x86_64/multiarch/memmove_chk.c b/sysdeps/x86_64/multiarch/memmove_chk.c
index f64da63..b8a4282 100644
--- a/sysdeps/x86_64/multiarch/memmove_chk.c
+++ b/sysdeps/x86_64/multiarch/memmove_chk.c
@@ -39,7 +39,9 @@ libc_ifunc (__memmove_chk,
 	    ? __memmove_chk_avx512_no_vzeroupper
 	    :
 #endif
-	    HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load) ? __memmove_chk_avx_unaligned :
+	    HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
+	    ? (HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)
+	       ? __memmove_chk_ssse3_back : __memmove_chk_avx_unaligned) :
 	    (HAS_CPU_FEATURE (SSSE3)
 	    ? (HAS_ARCH_FEATURE (Fast_Copy_Backward)
 	       ? __memmove_chk_ssse3_back : __memmove_chk_ssse3)
diff --git a/sysdeps/x86_64/multiarch/mempcpy.S b/sysdeps/x86_64/multiarch/mempcpy.S
index ed78623..6a8898d 100644
--- a/sysdeps/x86_64/multiarch/mempcpy.S
+++ b/sysdeps/x86_64/multiarch/mempcpy.S
@@ -46,6 +46,9 @@ ENTRY(__mempcpy)
 	HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
 	jz	2f
 	leaq	__mempcpy_avx_unaligned(%rip), %rax
+	HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)
+	jz	2f
+	leaq	__mempcpy_ssse3_back(%rip), %rax
 2:	ret
 END(__mempcpy)
 
diff --git a/sysdeps/x86_64/multiarch/mempcpy_chk.S b/sysdeps/x86_64/multiarch/mempcpy_chk.S
index 6e8a89d..ecea9c2 100644
--- a/sysdeps/x86_64/multiarch/mempcpy_chk.S
+++ b/sysdeps/x86_64/multiarch/mempcpy_chk.S
@@ -48,6 +48,9 @@ ENTRY(__mempcpy_chk)
 	HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
 	jz	2f
 	leaq	__mempcpy_chk_avx_unaligned(%rip), %rax
+	HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)
+	jz	2f
+	leaq	__mempcpy_chk_ssse3_back(%rip), %rax
 2:	ret
 END(__mempcpy_chk)
 # else
-- 
2.1.4


[-- Attachment #3: ChangeLog --]
[-- Type: application/octet-stream, Size: 1022 bytes --]

2016-03-18  Amit Pawar  <Amit.Pawar@amd.com>

	[BZ #19583]
	* sysdeps/x86/cpu-features.h (bit_arch_Avoid_AVX_Fast_Unaligned_Load): 
	New.
	(index_arch_Avoid_AVX_Fast_Unaligned_Load): Likewise.
	* sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Copy_Backward
	and Avoid_AVX_Fast_Unaligned_Load bits for Excavator core.
	* sysdeps/x86_64/multiarch/memcpy.S 
	(__new_memcpy, Avoid_AVX_Fast_Unaligned_Load): Add check for
	Avoid_AVX_Fast_Unaligned_Load bit and select on Excavator core instead
	of AVX_Fast_Unaligned_Load.
	* sysdeps/x86_64/multiarch/memcpy_chk.S
	(__memcpy_chk, Avoid_AVX_Fast_Unaligned_Load): Likewise. 
	* sysdeps/x86_64/multiarch/mempcpy.S
	(__mempcpy, Avoid_AVX_Fast_Unaligned_Load): Likewise. 
	* sysdeps/x86_64/multiarch/mempcpy_chk.S
	(__mempcpy_chk, Avoid_AVX_Fast_Unaligned_Load): Likewise.
	* sysdeps/x86_64/multiarch/memmove.c
	(__libc_memmove, Avoid_AVX_Fast_Unaligned_Load): Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.c
	(__memmove_chk, Avoid_AVX_Fast_Unaligned_Load): Likewise.



* Re: [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583
  2016-03-18 11:43       ` Pawar, Amit
@ 2016-03-18 11:51         ` H.J. Lu
  2016-03-18 12:25           ` Pawar, Amit
  0 siblings, 1 reply; 23+ messages in thread
From: H.J. Lu @ 2016-03-18 11:51 UTC (permalink / raw)
  To: Pawar, Amit; +Cc: libc-alpha

On Fri, Mar 18, 2016 at 4:43 AM, Pawar, Amit <Amit.Pawar@amd.com> wrote:
> As per the comments, attached files are updated. If OK please commit.
>
> Thanks,
> Amit

diff --git a/sysdeps/x86_64/multiarch/memcpy.S
b/sysdeps/x86_64/multiarch/memcpy.S
index 8882590..3c67da8 100644
--- a/sysdeps/x86_64/multiarch/memcpy.S
+++ b/sysdeps/x86_64/multiarch/memcpy.S
@@ -40,7 +40,7 @@ ENTRY(__new_memcpy)
 #endif
 1: lea __memcpy_avx_unaligned(%rip), %RAX_LP
  HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
- jnz 2f
+ jnz 3f
  lea __memcpy_sse2_unaligned(%rip), %RAX_LP
  HAS_ARCH_FEATURE (Fast_Unaligned_Load)
  jnz 2f
@@ -52,6 +52,10 @@ ENTRY(__new_memcpy)
  jnz 2f
  lea __memcpy_ssse3(%rip), %RAX_LP
 2: ret
+3: HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)
+ jz 2b
+ lea __memcpy_ssse3_back(%rip), %RAX_LP
+ ret
 END(__new_memcpy)

This is wrong.  You should check Avoid_AVX_Fast_Unaligned_Load
to disable __memcpy_avx_unaligned, not select
 __memcpy_ssse3_back.  Each selection should be loaded
only once.


-- 
H.J.


* RE: [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583
  2016-03-18 11:51         ` H.J. Lu
@ 2016-03-18 12:25           ` Pawar, Amit
  2016-03-18 12:34             ` H.J. Lu
  0 siblings, 1 reply; 23+ messages in thread
From: Pawar, Amit @ 2016-03-18 12:25 UTC (permalink / raw)
  To: H.J. Lu; +Cc: libc-alpha

[-- Attachment #1: Type: text/plain, Size: 917 bytes --]

>diff --git a/sysdeps/x86_64/multiarch/memcpy.S
>b/sysdeps/x86_64/multiarch/memcpy.S
>index 8882590..3c67da8 100644
>--- a/sysdeps/x86_64/multiarch/memcpy.S
>+++ b/sysdeps/x86_64/multiarch/memcpy.S
>@@ -40,7 +40,7 @@ ENTRY(__new_memcpy)
> #endif
> 1: lea __memcpy_avx_unaligned(%rip), %RAX_LP
>  HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
>- jnz 2f
>+ jnz 3f
>  lea __memcpy_sse2_unaligned(%rip), %RAX_LP
>  HAS_ARCH_FEATURE (Fast_Unaligned_Load)
>  jnz 2f
>@@ -52,6 +52,10 @@ ENTRY(__new_memcpy)
>  jnz 2f
>  lea __memcpy_ssse3(%rip), %RAX_LP
> 2: ret
>+3: HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)  jz 2b  lea 
>+__memcpy_ssse3_back(%rip), %RAX_LP  ret
> END(__new_memcpy)
>
>This is wrong.  You should check Avoid_AVX_Fast_Unaligned_Load
>to disable __memcpy_avx_unaligned, not select
> __memcpy_ssse3_back.  Each selection should be loaded
>only once.

Now OK?

--Amit Pawar

[-- Attachment #2: 0001-x86_64-Update-memcpy-mempcpy-and-memmove-selection-o.patch --]
[-- Type: application/octet-stream, Size: 7593 bytes --]

From be3fa748e35802871ce61291e22966ae10d62901 Mon Sep 17 00:00:00 2001
From: Amit Pawar <Amit.Pawar@amd.com>
Date: Fri, 18 Mar 2016 17:33:22 +0530
Subject: [PATCH] x86_64 Update memcpy, mempcpy and memmove selection order for
 Excavator CPU.

The performance of the Fast_Copy_Backward based implementations is better
than that of the AVX_Fast_Unaligned_Load based implementations of the
memcpy, memcpy_chk, mempcpy, mempcpy_chk, memmove and memmove_chk functions
on the Excavator CPU.  A new feature bit is required to fix this in the
IFUNC functions without affecting other targets.  So define the two new
bit_arch_Avoid_AVX_Fast_Unaligned_Load and
index_arch_Avoid_AVX_Fast_Unaligned_Load feature bit macros and update the
selection order of these functions.

	[BZ #19583]
	* sysdeps/x86/cpu-features.h (bit_arch_Avoid_AVX_Fast_Unaligned_Load):
	New.
	(index_arch_Avoid_AVX_Fast_Unaligned_Load): Likewise.
	* sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Copy_Backward
	and Avoid_AVX_Fast_Unaligned_Load bits for Excavator core.
	* sysdeps/x86_64/multiarch/memcpy.S
	(__new_memcpy, Avoid_AVX_Fast_Unaligned_Load): Add check for
	Avoid_AVX_Fast_Unaligned_Load bit and select on Excavator core instead
	of AVX_Fast_Unaligned_Load.
	* sysdeps/x86_64/multiarch/memcpy_chk.S
	(__memcpy_chk, Avoid_AVX_Fast_Unaligned_Load): Likewise.
	* sysdeps/x86_64/multiarch/mempcpy.S
	(__mempcpy, Avoid_AVX_Fast_Unaligned_Load): Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk.S
	(__mempcpy_chk, Avoid_AVX_Fast_Unaligned_Load): Likewise.
	* sysdeps/x86_64/multiarch/memmove.c
	(__libc_memmove, Avoid_AVX_Fast_Unaligned_Load): Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.c
	(__memmove_chk, Avoid_AVX_Fast_Unaligned_Load): Likewise.
---
 sysdeps/x86/cpu-features.c             | 15 +++++++++++++--
 sysdeps/x86/cpu-features.h             |  3 +++
 sysdeps/x86_64/multiarch/memcpy.S      |  5 ++++-
 sysdeps/x86_64/multiarch/memcpy_chk.S  |  3 +++
 sysdeps/x86_64/multiarch/memmove.c     |  3 ++-
 sysdeps/x86_64/multiarch/memmove_chk.c |  4 +++-
 sysdeps/x86_64/multiarch/mempcpy.S     |  3 +++
 sysdeps/x86_64/multiarch/mempcpy_chk.S |  3 +++
 8 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index 1787716..f0baf0e 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -160,8 +160,19 @@ init_cpu_features (struct cpu_features *cpu_features)
 	{
 	  /* "Excavator"   */
 	  if (model >= 0x60 && model <= 0x7f)
-	    cpu_features->feature[index_arch_Fast_Unaligned_Load]
-	      |= bit_arch_Fast_Unaligned_Load;
+	    {
+#if index_arch_Avoid_AVX_Fast_Unaligned_Load != index_arch_Fast_Copy_Backward
+# error index_arch_Avoid_AVX_Fast_Unaligned_Load != index_arch_Fast_Copy_Backward
+#endif
+#if index_arch_Avoid_AVX_Fast_Unaligned_Load != index_arch_Fast_Unaligned_Load
+# error index_arch_Avoid_AVX_Fast_Unaligned_Load != index_arch_Fast_Unaligned_Load
+#endif
+	      cpu_features->feature[index_arch_Avoid_AVX_Fast_Unaligned_Load]
+		|= (bit_arch_Fast_Unaligned_Load
+		    | bit_arch_Fast_Copy_Backward
+		    | bit_arch_Avoid_AVX_Fast_Unaligned_Load);
+
+	    }
 	}
     }
   else
diff --git a/sysdeps/x86/cpu-features.h b/sysdeps/x86/cpu-features.h
index 0624a92..fe59828 100644
--- a/sysdeps/x86/cpu-features.h
+++ b/sysdeps/x86/cpu-features.h
@@ -35,6 +35,7 @@
 #define bit_arch_I686				(1 << 15)
 #define bit_arch_Prefer_MAP_32BIT_EXEC		(1 << 16)
 #define bit_arch_Prefer_No_VZEROUPPER		(1 << 17)
+#define bit_arch_Avoid_AVX_Fast_Unaligned_Load	(1 << 18)
 
 /* CPUID Feature flags.  */
 
@@ -101,6 +102,7 @@
 # define index_arch_I686		FEATURE_INDEX_1*FEATURE_SIZE
 # define index_arch_Prefer_MAP_32BIT_EXEC FEATURE_INDEX_1*FEATURE_SIZE
 # define index_arch_Prefer_No_VZEROUPPER FEATURE_INDEX_1*FEATURE_SIZE
+# define index_arch_Avoid_AVX_Fast_Unaligned_Load FEATURE_INDEX_1*FEATURE_SIZE
 
 
 # if defined (_LIBC) && !IS_IN (nonlib)
@@ -259,6 +261,7 @@ extern const struct cpu_features *__get_cpu_features (void)
 # define index_arch_I686		FEATURE_INDEX_1
 # define index_arch_Prefer_MAP_32BIT_EXEC FEATURE_INDEX_1
 # define index_arch_Prefer_No_VZEROUPPER FEATURE_INDEX_1
+# define index_arch_Avoid_AVX_Fast_Unaligned_Load FEATURE_INDEX_1
 
 #endif	/* !__ASSEMBLER__ */
 
diff --git a/sysdeps/x86_64/multiarch/memcpy.S b/sysdeps/x86_64/multiarch/memcpy.S
index 8882590..7983153 100644
--- a/sysdeps/x86_64/multiarch/memcpy.S
+++ b/sysdeps/x86_64/multiarch/memcpy.S
@@ -38,7 +38,10 @@ ENTRY(__new_memcpy)
 	lea    __memcpy_avx512_no_vzeroupper(%rip), %RAX_LP
 	ret
 #endif
-1:	lea	__memcpy_avx_unaligned(%rip), %RAX_LP
+1:	lea	__memcpy_ssse3_back(%rip), %RAX_LP
+	HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)
+	jnz	2f
+	lea	__memcpy_avx_unaligned(%rip), %RAX_LP
 	HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
 	jnz	2f
 	lea	__memcpy_sse2_unaligned(%rip), %RAX_LP
diff --git a/sysdeps/x86_64/multiarch/memcpy_chk.S b/sysdeps/x86_64/multiarch/memcpy_chk.S
index 648217e..5b608f6 100644
--- a/sysdeps/x86_64/multiarch/memcpy_chk.S
+++ b/sysdeps/x86_64/multiarch/memcpy_chk.S
@@ -48,6 +48,9 @@ ENTRY(__memcpy_chk)
 	HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
 	jz  2f
 	leaq    __memcpy_chk_avx_unaligned(%rip), %rax
+	HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)
+	jz  2f
+	leaq    __memcpy_chk_ssse3_back(%rip), %rax
 2:	ret
 END(__memcpy_chk)
 # else
diff --git a/sysdeps/x86_64/multiarch/memmove.c b/sysdeps/x86_64/multiarch/memmove.c
index 8da5640..b9d9439 100644
--- a/sysdeps/x86_64/multiarch/memmove.c
+++ b/sysdeps/x86_64/multiarch/memmove.c
@@ -59,7 +59,8 @@ libc_ifunc (__libc_memmove,
 	    :
 #endif
 	    (HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
-	    ? __memmove_avx_unaligned
+	    ? (HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)
+	       ? __memmove_ssse3_back : __memmove_avx_unaligned)
 	    : (HAS_CPU_FEATURE (SSSE3)
 	       ? (HAS_ARCH_FEATURE (Fast_Copy_Backward)
 	          ? __memmove_ssse3_back : __memmove_ssse3)
diff --git a/sysdeps/x86_64/multiarch/memmove_chk.c b/sysdeps/x86_64/multiarch/memmove_chk.c
index f64da63..b8a4282 100644
--- a/sysdeps/x86_64/multiarch/memmove_chk.c
+++ b/sysdeps/x86_64/multiarch/memmove_chk.c
@@ -39,7 +39,9 @@ libc_ifunc (__memmove_chk,
 	    ? __memmove_chk_avx512_no_vzeroupper
 	    :
 #endif
-	    HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load) ? __memmove_chk_avx_unaligned :
+	    HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
+	    ? (HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)
+	       ? __memmove_chk_ssse3_back : __memmove_chk_avx_unaligned) :
 	    (HAS_CPU_FEATURE (SSSE3)
 	    ? (HAS_ARCH_FEATURE (Fast_Copy_Backward)
 	       ? __memmove_chk_ssse3_back : __memmove_chk_ssse3)
diff --git a/sysdeps/x86_64/multiarch/mempcpy.S b/sysdeps/x86_64/multiarch/mempcpy.S
index ed78623..6a8898d 100644
--- a/sysdeps/x86_64/multiarch/mempcpy.S
+++ b/sysdeps/x86_64/multiarch/mempcpy.S
@@ -46,6 +46,9 @@ ENTRY(__mempcpy)
 	HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
 	jz	2f
 	leaq	__mempcpy_avx_unaligned(%rip), %rax
+	HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)
+	jz	2f
+	leaq	__mempcpy_ssse3_back(%rip), %rax
 2:	ret
 END(__mempcpy)
 
diff --git a/sysdeps/x86_64/multiarch/mempcpy_chk.S b/sysdeps/x86_64/multiarch/mempcpy_chk.S
index 6e8a89d..ecea9c2 100644
--- a/sysdeps/x86_64/multiarch/mempcpy_chk.S
+++ b/sysdeps/x86_64/multiarch/mempcpy_chk.S
@@ -48,6 +48,9 @@ ENTRY(__mempcpy_chk)
 	HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
 	jz	2f
 	leaq	__mempcpy_chk_avx_unaligned(%rip), %rax
+	HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)
+	jz	2f
+	leaq	__mempcpy_chk_ssse3_back(%rip), %rax
 2:	ret
 END(__mempcpy_chk)
 # else
-- 
2.1.4


[-- Attachment #3: ChangeLog --]
[-- Type: application/octet-stream, Size: 1022 bytes --]

2016-03-18  Amit Pawar  <Amit.Pawar@amd.com>

	[BZ #19583]
	* sysdeps/x86/cpu-features.h (bit_arch_Avoid_AVX_Fast_Unaligned_Load): 
	New.
	(index_arch_Avoid_AVX_Fast_Unaligned_Load): Likewise.
	* sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Copy_Backward
	and Avoid_AVX_Fast_Unaligned_Load bits for Excavator core.
	* sysdeps/x86_64/multiarch/memcpy.S 
	(__new_memcpy, Avoid_AVX_Fast_Unaligned_Load): Add check for
	Avoid_AVX_Fast_Unaligned_Load bit and select on Excavator core instead
	of AVX_Fast_Unaligned_Load.
	* sysdeps/x86_64/multiarch/memcpy_chk.S
	(__memcpy_chk, Avoid_AVX_Fast_Unaligned_Load): Likewise. 
	* sysdeps/x86_64/multiarch/mempcpy.S
	(__mempcpy, Avoid_AVX_Fast_Unaligned_Load): Likewise. 
	* sysdeps/x86_64/multiarch/mempcpy_chk.S
	(__mempcpy_chk, Avoid_AVX_Fast_Unaligned_Load): Likewise.
	* sysdeps/x86_64/multiarch/memmove.c
	(__libc_memmove, Avoid_AVX_Fast_Unaligned_Load): Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.c
	(__memmove_chk, Avoid_AVX_Fast_Unaligned_Load): Likewise.



* Re: [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583
  2016-03-18 12:25           ` Pawar, Amit
@ 2016-03-18 12:34             ` H.J. Lu
  2016-03-18 13:22               ` Pawar, Amit
  0 siblings, 1 reply; 23+ messages in thread
From: H.J. Lu @ 2016-03-18 12:34 UTC (permalink / raw)
  To: Pawar, Amit; +Cc: libc-alpha

On Fri, Mar 18, 2016 at 5:25 AM, Pawar, Amit <Amit.Pawar@amd.com> wrote:
>>diff --git a/sysdeps/x86_64/multiarch/memcpy.S
>>b/sysdeps/x86_64/multiarch/memcpy.S
>>index 8882590..3c67da8 100644
>>--- a/sysdeps/x86_64/multiarch/memcpy.S
>>+++ b/sysdeps/x86_64/multiarch/memcpy.S
>>@@ -40,7 +40,7 @@ ENTRY(__new_memcpy)
>> #endif
>> 1: lea __memcpy_avx_unaligned(%rip), %RAX_LP
>>  HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
>>- jnz 2f
>>+ jnz 3f
>>  lea __memcpy_sse2_unaligned(%rip), %RAX_LP
>>  HAS_ARCH_FEATURE (Fast_Unaligned_Load)
>>  jnz 2f
>>@@ -52,6 +52,10 @@ ENTRY(__new_memcpy)
>>  jnz 2f
>>  lea __memcpy_ssse3(%rip), %RAX_LP
>> 2: ret
>>+3: HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)  jz 2b  lea
>>+__memcpy_ssse3_back(%rip), %RAX_LP  ret
>> END(__new_memcpy)
>>
>>This is wrong.  You should check Avoid_AVX_Fast_Unaligned_Load
>>to disable __memcpy_avx_unaligned, not select
>> __memcpy_ssse3_back.  Each selection should be loaded
>>only once.
>
> Now OK?.

No, it isn't fixed.  Avoid_AVX_Fast_Unaligned_Load should
disable __memcpy_avx_unaligned and nothing more.  Also
you need to fix ALL selections.

-- 
H.J.


* RE: [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583
  2016-03-18 12:34             ` H.J. Lu
@ 2016-03-18 13:22               ` Pawar, Amit
  2016-03-18 13:51                 ` H.J. Lu
  0 siblings, 1 reply; 23+ messages in thread
From: Pawar, Amit @ 2016-03-18 13:22 UTC (permalink / raw)
  To: H.J. Lu; +Cc: libc-alpha

>No, it isn't fixed.  Avoid_AVX_Fast_Unaligned_Load should disable __memcpy_avx_unaligned and nothing more.  Also you need to fix ALL selections.

diff --git a/sysdeps/x86_64/multiarch/memcpy.S b/sysdeps/x86_64/multiarch/memcpy.S
index 8882590..a5afaf4 100644
--- a/sysdeps/x86_64/multiarch/memcpy.S
+++ b/sysdeps/x86_64/multiarch/memcpy.S
@@ -39,6 +39,8 @@ ENTRY(__new_memcpy)
        ret
 #endif
 1:     lea     __memcpy_avx_unaligned(%rip), %RAX_LP
+       HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)
+       jnz     3f
        HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
        jnz     2f
        lea     __memcpy_sse2_unaligned(%rip), %RAX_LP
@@ -52,6 +54,8 @@ ENTRY(__new_memcpy)
        jnz     2f
        lea     __memcpy_ssse3(%rip), %RAX_LP
 2:     ret
+3:     lea     __memcpy_ssse3(%rip), %RAX_LP
+       ret
 END(__new_memcpy)

 # undef ENTRY

Will update all IFUNCs if this is OK, else please suggest.

--Amit Pawar



* Re: [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583
  2016-03-18 13:22               ` Pawar, Amit
@ 2016-03-18 13:51                 ` H.J. Lu
  2016-03-18 13:55                   ` Adhemerval Zanella
  2016-03-18 14:45                   ` H.J. Lu
  0 siblings, 2 replies; 23+ messages in thread
From: H.J. Lu @ 2016-03-18 13:51 UTC (permalink / raw)
  To: Pawar, Amit; +Cc: libc-alpha

On Fri, Mar 18, 2016 at 6:22 AM, Pawar, Amit <Amit.Pawar@amd.com> wrote:
>>No, it isn't fixed.  Avoid_AVX_Fast_Unaligned_Load should disable __memcpy_avx_unaligned and nothing more.  Also you need to fix ALL selections.
>
> diff --git a/sysdeps/x86_64/multiarch/memcpy.S b/sysdeps/x86_64/multiarch/memcpy.S
> index 8882590..a5afaf4 100644
> --- a/sysdeps/x86_64/multiarch/memcpy.S
> +++ b/sysdeps/x86_64/multiarch/memcpy.S
> @@ -39,6 +39,8 @@ ENTRY(__new_memcpy)
>         ret
>  #endif
>  1:     lea     __memcpy_avx_unaligned(%rip), %RAX_LP
> +       HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)
> +       jnz     3f
>         HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
>         jnz     2f
>         lea     __memcpy_sse2_unaligned(%rip), %RAX_LP
> @@ -52,6 +54,8 @@ ENTRY(__new_memcpy)
>         jnz     2f
>         lea     __memcpy_ssse3(%rip), %RAX_LP
>  2:     ret
> +3:     lea     __memcpy_ssse3(%rip), %RAX_LP
> +       ret
>  END(__new_memcpy)
>
>  # undef ENTRY
>
> Will update all IFUNC's if this ok else please suggest.
>

Better, but not OK.  Try something like

diff --git a/sysdeps/x86_64/multiarch/memcpy.S
b/sysdeps/x86_64/multiarch/memcpy.S
index ab5998c..2abe2fd 100644
--- a/sysdeps/x86_64/multiarch/memcpy.S
+++ b/sysdeps/x86_64/multiarch/memcpy.S
@@ -42,9 +42,11 @@ ENTRY(__new_memcpy)
   ret
 #endif
 1:   lea   __memcpy_avx_unaligned(%rip), %RAX_LP
+  HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)
+  jnz   3f
   HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
   jnz   2f
-  lea   __memcpy_sse2_unaligned(%rip), %RAX_LP
+3:   lea   __memcpy_sse2_unaligned(%rip), %RAX_LP
   HAS_ARCH_FEATURE (Fast_Unaligned_Load)
   jnz   2f
   lea   __memcpy_sse2(%rip), %RAX_LP


-- 
H.J.


* Re: [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583
  2016-03-18 13:51                 ` H.J. Lu
@ 2016-03-18 13:55                   ` Adhemerval Zanella
  2016-03-18 14:43                     ` H.J. Lu
  2016-03-18 14:45                   ` H.J. Lu
  1 sibling, 1 reply; 23+ messages in thread
From: Adhemerval Zanella @ 2016-03-18 13:55 UTC (permalink / raw)
  To: libc-alpha



On 18-03-2016 10:51, H.J. Lu wrote:
> On Fri, Mar 18, 2016 at 6:22 AM, Pawar, Amit <Amit.Pawar@amd.com> wrote:
>>> No, it isn't fixed.  Avoid_AVX_Fast_Unaligned_Load should disable __memcpy_avx_unaligned and nothing more.  Also you need to fix ALL selections.
>>
>> diff --git a/sysdeps/x86_64/multiarch/memcpy.S b/sysdeps/x86_64/multiarch/memcpy.S
>> index 8882590..a5afaf4 100644
>> --- a/sysdeps/x86_64/multiarch/memcpy.S
>> +++ b/sysdeps/x86_64/multiarch/memcpy.S
>> @@ -39,6 +39,8 @@ ENTRY(__new_memcpy)
>>         ret
>>  #endif
>>  1:     lea     __memcpy_avx_unaligned(%rip), %RAX_LP
>> +       HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)
>> +       jnz     3f
>>         HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
>>         jnz     2f
>>         lea     __memcpy_sse2_unaligned(%rip), %RAX_LP
>> @@ -52,6 +54,8 @@ ENTRY(__new_memcpy)
>>         jnz     2f
>>         lea     __memcpy_ssse3(%rip), %RAX_LP
>>  2:     ret
>> +3:     lea     __memcpy_ssse3(%rip), %RAX_LP
>> +       ret
>>  END(__new_memcpy)
>>
>>  # undef ENTRY
>>
>> Will update all IFUNC's if this ok else please suggest.
>>
> 
> Better, but not OK.  Try something like
> 
> iff --git a/sysdeps/x86_64/multiarch/memcpy.S
> b/sysdeps/x86_64/multiarch/memcpy.S
> index ab5998c..2abe2fd 100644
> --- a/sysdeps/x86_64/multiarch/memcpy.S
> +++ b/sysdeps/x86_64/multiarch/memcpy.S
> @@ -42,9 +42,11 @@ ENTRY(__new_memcpy)
>    ret
>  #endif
>  1:   lea   __memcpy_avx_unaligned(%rip), %RAX_LP
> +  HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)
> +  jnz   3f
>    HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
>    jnz   2f
> -  lea   __memcpy_sse2_unaligned(%rip), %RAX_LP
> +3:   lea   __memcpy_sse2_unaligned(%rip), %RAX_LP
>    HAS_ARCH_FEATURE (Fast_Unaligned_Load)
>    jnz   2f
>    lea   __memcpy_sse2(%rip), %RAX_LP
> 
> 

I know this is not related to this patch, but any reason to not code the
resolver using the libc_ifunc macros?


* Re: [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583
  2016-03-18 13:55                   ` Adhemerval Zanella
@ 2016-03-18 14:43                     ` H.J. Lu
  0 siblings, 0 replies; 23+ messages in thread
From: H.J. Lu @ 2016-03-18 14:43 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: GNU C Library

On Fri, Mar 18, 2016 at 6:55 AM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
> On 18-03-2016 10:51, H.J. Lu wrote:
>> On Fri, Mar 18, 2016 at 6:22 AM, Pawar, Amit <Amit.Pawar@amd.com> wrote:
>>>> No, it isn't fixed.  Avoid_AVX_Fast_Unaligned_Load should disable __memcpy_avx_unaligned and nothing more.  Also you need to fix ALL selections.
>>>
>>> diff --git a/sysdeps/x86_64/multiarch/memcpy.S b/sysdeps/x86_64/multiarch/memcpy.S
>>> index 8882590..a5afaf4 100644
>>> --- a/sysdeps/x86_64/multiarch/memcpy.S
>>> +++ b/sysdeps/x86_64/multiarch/memcpy.S
>>> @@ -39,6 +39,8 @@ ENTRY(__new_memcpy)
>>>         ret
>>>  #endif
>>>  1:     lea     __memcpy_avx_unaligned(%rip), %RAX_LP
>>> +       HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)
>>> +       jnz     3f
>>>         HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
>>>         jnz     2f
>>>         lea     __memcpy_sse2_unaligned(%rip), %RAX_LP
>>> @@ -52,6 +54,8 @@ ENTRY(__new_memcpy)
>>>         jnz     2f
>>>         lea     __memcpy_ssse3(%rip), %RAX_LP
>>>  2:     ret
>>> +3:     lea     __memcpy_ssse3(%rip), %RAX_LP
>>> +       ret
>>>  END(__new_memcpy)
>>>
>>>  # undef ENTRY
>>>
>>> Will update all IFUNC's if this ok else please suggest.
>>>
>>
>> Better, but not OK.  Try something like
>>
>> iff --git a/sysdeps/x86_64/multiarch/memcpy.S
>> b/sysdeps/x86_64/multiarch/memcpy.S
>> index ab5998c..2abe2fd 100644
>> --- a/sysdeps/x86_64/multiarch/memcpy.S
>> +++ b/sysdeps/x86_64/multiarch/memcpy.S
>> @@ -42,9 +42,11 @@ ENTRY(__new_memcpy)
>>    ret
>>  #endif
>>  1:   lea   __memcpy_avx_unaligned(%rip), %RAX_LP
>> +  HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)
>> +  jnz   3f
>>    HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
>>    jnz   2f
>> -  lea   __memcpy_sse2_unaligned(%rip), %RAX_LP
>> +3:   lea   __memcpy_sse2_unaligned(%rip), %RAX_LP
>>    HAS_ARCH_FEATURE (Fast_Unaligned_Load)
>>    jnz   2f
>>    lea   __memcpy_sse2(%rip), %RAX_LP
>>
>>
>
> I know this is not related to this patch, but any reason to not code the
> resolver using the libc_ifunc macros?

Did you mean writing them in C?  It can be done.  Someone
needs to write patches.
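
For what it's worth, a memcpy resolver written in C with libc_ifunc
could mirror memmove.c.  A sketch only: the resolver name below is made
up, the extern declarations of the variants are omitted, and the AVX512
case is left out.

	libc_ifunc (memcpy_resolver_sketch,
		    HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
		    ? __memcpy_avx_unaligned
		    : (HAS_ARCH_FEATURE (Fast_Unaligned_Load)
		       ? __memcpy_sse2_unaligned
		       : (HAS_CPU_FEATURE (SSSE3)
			  ? (HAS_ARCH_FEATURE (Fast_Copy_Backward)
			     ? __memcpy_ssse3_back : __memcpy_ssse3)
			  : __memcpy_sse2)));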

-- 
H.J.


* Re: [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583
  2016-03-18 13:51                 ` H.J. Lu
  2016-03-18 13:55                   ` Adhemerval Zanella
@ 2016-03-18 14:45                   ` H.J. Lu
  2016-03-18 15:19                     ` Pawar, Amit
  1 sibling, 1 reply; 23+ messages in thread
From: H.J. Lu @ 2016-03-18 14:45 UTC (permalink / raw)
  To: Pawar, Amit; +Cc: libc-alpha

On Fri, Mar 18, 2016 at 6:51 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Fri, Mar 18, 2016 at 6:22 AM, Pawar, Amit <Amit.Pawar@amd.com> wrote:
>>>No, it isn't fixed.  Avoid_AVX_Fast_Unaligned_Load should disable __memcpy_avx_unaligned and nothing more.  Also you need to fix ALL selections.
>>
>> diff --git a/sysdeps/x86_64/multiarch/memcpy.S b/sysdeps/x86_64/multiarch/memcpy.S
>> index 8882590..a5afaf4 100644
>> --- a/sysdeps/x86_64/multiarch/memcpy.S
>> +++ b/sysdeps/x86_64/multiarch/memcpy.S
>> @@ -39,6 +39,8 @@ ENTRY(__new_memcpy)
>>         ret
>>  #endif
>>  1:     lea     __memcpy_avx_unaligned(%rip), %RAX_LP
>> +       HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)
>> +       jnz     3f
>>         HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
>>         jnz     2f
>>         lea     __memcpy_sse2_unaligned(%rip), %RAX_LP
>> @@ -52,6 +54,8 @@ ENTRY(__new_memcpy)
>>         jnz     2f
>>         lea     __memcpy_ssse3(%rip), %RAX_LP
>>  2:     ret
>> +3:     lea     __memcpy_ssse3(%rip), %RAX_LP
>> +       ret
>>  END(__new_memcpy)
>>
>>  # undef ENTRY
>>
>> Will update all IFUNC's if this ok else please suggest.
>>
>
> Better, but not OK.  Try something like
>
> iff --git a/sysdeps/x86_64/multiarch/memcpy.S
> b/sysdeps/x86_64/multiarch/memcpy.S
> index ab5998c..2abe2fd 100644
> --- a/sysdeps/x86_64/multiarch/memcpy.S
> +++ b/sysdeps/x86_64/multiarch/memcpy.S
> @@ -42,9 +42,11 @@ ENTRY(__new_memcpy)
>    ret
>  #endif
>  1:   lea   __memcpy_avx_unaligned(%rip), %RAX_LP
> +  HAS_ARCH_FEATURE (Avoid_AVX_Fast_Unaligned_Load)
> +  jnz   3f
>    HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
>    jnz   2f
> -  lea   __memcpy_sse2_unaligned(%rip), %RAX_LP
> +3:   lea   __memcpy_sse2_unaligned(%rip), %RAX_LP
>    HAS_ARCH_FEATURE (Fast_Unaligned_Load)
>    jnz   2f
>    lea   __memcpy_sse2(%rip), %RAX_LP
>

One question.  If you don't want __memcpy_avx_unaligned,
why do you set AVX_Fast_Unaligned_Load?

-- 
H.J.


* RE: [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583
  2016-03-18 14:45                   ` H.J. Lu
@ 2016-03-18 15:19                     ` Pawar, Amit
  2016-03-18 15:24                       ` H.J. Lu
  0 siblings, 1 reply; 23+ messages in thread
From: Pawar, Amit @ 2016-03-18 15:19 UTC (permalink / raw)
  To: H.J. Lu; +Cc: libc-alpha

>One question.  If  you don't want __memcpy_avx_unaligned, why do you set AVX_Fast_Unaligned_Load?
Do you know whether any other string or memory functions are currently being implemented based on this? If not, then let me just verify it.

Also, this feature is enabled in the generic code; to disable it, a change would be needed after that point.

--Amit Pawar


* Re: [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583
  2016-03-18 15:19                     ` Pawar, Amit
@ 2016-03-18 15:24                       ` H.J. Lu
  2016-03-22 11:08                         ` Pawar, Amit
  0 siblings, 1 reply; 23+ messages in thread
From: H.J. Lu @ 2016-03-18 15:24 UTC (permalink / raw)
  To: Pawar, Amit; +Cc: libc-alpha

On Fri, Mar 18, 2016 at 8:19 AM, Pawar, Amit <Amit.Pawar@amd.com> wrote:
>>One question.  If  you don't want __memcpy_avx_unaligned, why do you set AVX_Fast_Unaligned_Load?
> Any idea whether currently any other string and memory functions are under implementation based on this? If not then let me just verify it.
>
> Also this feature is enabled in generic code. To disable it, need to change after this.
>

It was done based on the assumption that an AVX enabled machine has
fast AVX unaligned loads.  If that isn't true for AMD CPUs, we can enable it
for all Intel AVX CPUs and you can set it for AMD CPUs properly.
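
Concretely, that could mean setting the bit only on the Intel side of
init_cpu_features -- a sketch only, the exact placement is assumed and
not shown in this thread:

	  /* Intel path (hypothetical placement): assume AVX machines have
	     fast AVX unaligned loads.  AMD models then set or leave this
	     bit in their own branch as measurements dictate.  */
	  cpu_features->feature[index_arch_AVX_Fast_Unaligned_Load]
	    |= bit_arch_AVX_Fast_Unaligned_Load;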

-- 
H.J.


* RE: [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583
  2016-03-18 15:24                       ` H.J. Lu
@ 2016-03-22 11:08                         ` Pawar, Amit
  2016-03-22 14:50                           ` H.J. Lu
  0 siblings, 1 reply; 23+ messages in thread
From: Pawar, Amit @ 2016-03-22 11:08 UTC (permalink / raw)
  To: H.J. Lu; +Cc: libc-alpha

>It was done based on assumption that AVX enabled machine has fast AVX unaligned load.  If it isn't true for AMD CPUs, we can enable it for all Intel AVX CPUs and you can set it for AMD CPUs properly.

Memcpy still needs to be fixed, otherwise the SSE2_Unaligned version is selected.  Is it OK to fix it in the following way?  If not, please suggest an alternative.

diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index 1787716..e5c7184 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -159,9 +159,17 @@ init_cpu_features (struct cpu_features *cpu_features)
       if (family == 0x15)
        {
          /* "Excavator"   */
+#if index_arch_Fast_Unaligned_Load != index_arch_Prefer_Fast_Copy_Backward
+# error index_arch_Fast_Unaligned_Load != index_arch_Prefer_Fast_Copy_Backward
+#endif
+#if index_arch_Fast_Unaligned_Load != index_arch_Fast_Copy_Backward
+# error index_arch_Fast_Unaligned_Load != index_arch_Fast_Copy_Backward
+#endif
          if (model >= 0x60 && model <= 0x7f)
            cpu_features->feature[index_arch_Fast_Unaligned_Load]
-             |= bit_arch_Fast_Unaligned_Load;
+             |= (bit_arch_Fast_Unaligned_Load
+                 | bit_arch_Fast_Copy_Backward
+                 | bit_arch_Prefer_Fast_Copy_Backward);
        }
     }
   else
diff --git a/sysdeps/x86/cpu-features.h b/sysdeps/x86/cpu-features.h
index 0624a92..9750f2f 100644
--- a/sysdeps/x86/cpu-features.h
+++ b/sysdeps/x86/cpu-features.h
@@ -35,6 +35,7 @@
 #define bit_arch_I686                          (1 << 15)
 #define bit_arch_Prefer_MAP_32BIT_EXEC         (1 << 16)
 #define bit_arch_Prefer_No_VZEROUPPER          (1 << 17)
+#define bit_arch_Prefer_Fast_Copy_Backward     (1 << 18)

 /* CPUID Feature flags.  */

@@ -101,6 +102,7 @@
 # define index_arch_I686               FEATURE_INDEX_1*FEATURE_SIZE
 # define index_arch_Prefer_MAP_32BIT_EXEC FEATURE_INDEX_1*FEATURE_SIZE
 # define index_arch_Prefer_No_VZEROUPPER FEATURE_INDEX_1*FEATURE_SIZE
+# define index_arch_Prefer_Fast_Copy_Backward FEATURE_INDEX_1*FEATURE_SIZE


 # if defined (_LIBC) && !IS_IN (nonlib)
@@ -259,6 +261,7 @@ extern const struct cpu_features *__get_cpu_features (void)
 # define index_arch_I686               FEATURE_INDEX_1
 # define index_arch_Prefer_MAP_32BIT_EXEC FEATURE_INDEX_1
 # define index_arch_Prefer_No_VZEROUPPER FEATURE_INDEX_1
+# define index_arch_Prefer_Fast_Copy_Backward FEATURE_INDEX_1

 #endif /* !__ASSEMBLER__ */

diff --git a/sysdeps/x86_64/multiarch/memcpy.S b/sysdeps/x86_64/multiarch/memcpy.S
index 8882590..6fad5cb 100644
--- a/sysdeps/x86_64/multiarch/memcpy.S
+++ b/sysdeps/x86_64/multiarch/memcpy.S
@@ -40,18 +40,20 @@ ENTRY(__new_memcpy)
 #endif
 1:     lea     __memcpy_avx_unaligned(%rip), %RAX_LP
        HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
+       jnz     3f
+       HAS_ARCH_FEATURE (Prefer_Fast_Copy_Backward)
        jnz     2f
        lea     __memcpy_sse2_unaligned(%rip), %RAX_LP
        HAS_ARCH_FEATURE (Fast_Unaligned_Load)
-       jnz     2f
-       lea     __memcpy_sse2(%rip), %RAX_LP
+       jnz     3f
+2:     lea     __memcpy_sse2(%rip), %RAX_LP
        HAS_CPU_FEATURE (SSSE3)
-       jz      2f
+       jz      3f
        lea    __memcpy_ssse3_back(%rip), %RAX_LP
        HAS_ARCH_FEATURE (Fast_Copy_Backward)
-       jnz     2f
+       jnz     3f
        lea     __memcpy_ssse3(%rip), %RAX_LP
-2:     ret
+3:     ret
 END(__new_memcpy)

 # undef ENTRY


--Amit


* Re: [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583
  2016-03-22 11:08                         ` Pawar, Amit
@ 2016-03-22 14:50                           ` H.J. Lu
  2016-03-22 14:57                             ` Pawar, Amit
  0 siblings, 1 reply; 23+ messages in thread
From: H.J. Lu @ 2016-03-22 14:50 UTC (permalink / raw)
  To: Pawar, Amit; +Cc: libc-alpha

On Tue, Mar 22, 2016 at 4:08 AM, Pawar, Amit <Amit.Pawar@amd.com> wrote:
>>It was done based on assumption that AVX enabled machine has fast AVX unaligned load.  If it isn't true for AMD CPUs, we can enable it for all Intel AVX CPUs and you can set it for AMD CPUs properly.
>
> Memcpy still needs to be fixed otherwise SSE2_Unaligned version is selected. Is it OK to fix in following way else please suggest.

So the AMD processor doesn't want Fast_Unaligned_Load.  Why is it
set?


-- 
H.J.


* RE: [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583
  2016-03-22 14:50                           ` H.J. Lu
@ 2016-03-22 14:57                             ` Pawar, Amit
  2016-03-22 15:03                               ` H.J. Lu
  0 siblings, 1 reply; 23+ messages in thread
From: Pawar, Amit @ 2016-03-22 14:57 UTC (permalink / raw)
  To: H.J. Lu; +Cc: libc-alpha

>>>It was done based on assumption that AVX enabled machine has fast AVX unaligned load.  If it isn't true for AMD CPUs, we can enable it for all Intel AVX CPUs and you can set it for AMD CPUs properly.
>>
>> Memcpy still needs to be fixed otherwise SSE2_Unaligned version is selected. Is it OK to fix in following way else please suggest.
>
>So AMD processor doesn't want Fast_Unaligned_Load.  Why is it set?

For this function it is not better, but it is good for other routines like strcat, strncat, stpcpy, stpncpy, strcpy and strncpy.

Thanks,
Amit Pawar


* Re: [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583
  2016-03-22 14:57                             ` Pawar, Amit
@ 2016-03-22 15:03                               ` H.J. Lu
  2016-03-23 10:12                                 ` Pawar, Amit
  0 siblings, 1 reply; 23+ messages in thread
From: H.J. Lu @ 2016-03-22 15:03 UTC (permalink / raw)
  To: Pawar, Amit; +Cc: libc-alpha

On Tue, Mar 22, 2016 at 7:57 AM, Pawar, Amit <Amit.Pawar@amd.com> wrote:
>>>>It was done based on assumption that AVX enabled machine has fast AVX unaligned load.  If it isn't true for AMD CPUs, we can enable it for all Intel AVX CPUs and you can set it for AMD CPUs properly.
>>>
>>> Memcpy still needs to be fixed otherwise SSE2_Unaligned version is selected. Is it OK to fix in following way else please suggest.
>>
>>So AMD processor doesn't want Fast_Unaligned_Load.  Why is it set?
>
> For this function it is not better but good for other routines like  strcat, strncat, stpcpy, stpncpy, strcpy and  strncpy.

Then we should add Fast_Unaligned_Copy and only use it in
memcpy.

-- 
H.J.


* RE: [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583
  2016-03-22 15:03                               ` H.J. Lu
@ 2016-03-23 10:12                                 ` Pawar, Amit
  2016-03-23 17:59                                   ` H.J. Lu
  0 siblings, 1 reply; 23+ messages in thread
From: Pawar, Amit @ 2016-03-23 10:12 UTC (permalink / raw)
  To: H.J. Lu; +Cc: libc-alpha

[-- Attachment #1: Type: text/plain, Size: 220 bytes --]

> Then we should add Fast_Unaligned_Copy and only use it in memcpy.
Please find attached the patch and ChangeLog files containing the fix for the memcpy IFUNC function. Is it OK? If not, please suggest the required changes.

Thanks,
Amit Pawar

[-- Attachment #2: 0001-x86_64-Fix-memcpy-IFUNC-selection-order-for-Excavato.patch --]
[-- Type: application/octet-stream, Size: 4108 bytes --]

From 77b89b605ed498e6ab32132e97b0efb8088fd4a6 Mon Sep 17 00:00:00 2001
From: Amit Pawar <Amit.Pawar@amd.com>
Date: Wed, 23 Mar 2016 15:35:27 +0530
Subject: [PATCH] x86_64 Fix memcpy IFUNC selection order for Excavator CPU.

The performance of the Fast_Copy_Backward based memcpy implementation is
better than that of the currently selected Fast_Unaligned_Load based
implementation on the Excavator CPU.  A new feature bit is required to fix
this in the memcpy IFUNC function without affecting other targets.  So
define the two new bit_arch_Fast_Unaligned_Copy and
index_arch_Fast_Unaligned_Copy feature bit macros and update the selection
order of this function.

	[BZ #19583]
	* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
	New.
	(index_arch_Fast_Unaligned_Copy): Likewise.
	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
	Fast_Copy_Backward and Fast_Unaligned_Copy for Excavator core.
	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Add check for
	Fast_Unaligned_Copy bit and select it on Excavator core.
---
 sysdeps/x86/cpu-features.c        | 11 ++++++++++-
 sysdeps/x86/cpu-features.h        |  3 +++
 sysdeps/x86_64/multiarch/memcpy.S | 12 +++++++-----
 3 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index c8f81ef..7701548 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -220,10 +220,19 @@ init_cpu_features (struct cpu_features *cpu_features)
 
       if (family == 0x15)
 	{
+#if index_arch_Fast_Unaligned_Load != index_arch_Fast_Unaligned_Copy
+# error index_arch_Fast_Unaligned_Load != index_arch_Fast_Unaligned_Copy
+#endif
+#if index_arch_Fast_Unaligned_Load != index_arch_Fast_Copy_Backward
+# error index_arch_Fast_Unaligned_Load != index_arch_Fast_Copy_Backward
+#endif
 	  /* "Excavator"   */
 	  if (model >= 0x60 && model <= 0x7f)
 	    cpu_features->feature[index_arch_Fast_Unaligned_Load]
-	      |= bit_arch_Fast_Unaligned_Load;
+	      |= (bit_arch_Fast_Unaligned_Load
+		  | bit_arch_Fast_Unaligned_Copy
+		  | bit_arch_Fast_Copy_Backward);
+
 	}
     }
   else
diff --git a/sysdeps/x86/cpu-features.h b/sysdeps/x86/cpu-features.h
index e06eb7e..bfe1f4c 100644
--- a/sysdeps/x86/cpu-features.h
+++ b/sysdeps/x86/cpu-features.h
@@ -35,6 +35,7 @@
 #define bit_arch_I686				(1 << 15)
 #define bit_arch_Prefer_MAP_32BIT_EXEC		(1 << 16)
 #define bit_arch_Prefer_No_VZEROUPPER		(1 << 17)
+#define bit_arch_Fast_Unaligned_Copy		(1 << 18)
 
 /* CPUID Feature flags.  */
 
@@ -101,6 +102,7 @@
 # define index_arch_I686		FEATURE_INDEX_1*FEATURE_SIZE
 # define index_arch_Prefer_MAP_32BIT_EXEC FEATURE_INDEX_1*FEATURE_SIZE
 # define index_arch_Prefer_No_VZEROUPPER FEATURE_INDEX_1*FEATURE_SIZE
+# define index_arch_Fast_Unaligned_Copy	FEATURE_INDEX_1*FEATURE_SIZE
 
 
 # if defined (_LIBC) && !IS_IN (nonlib)
@@ -265,6 +267,7 @@ extern const struct cpu_features *__get_cpu_features (void)
 # define index_arch_I686		FEATURE_INDEX_1
 # define index_arch_Prefer_MAP_32BIT_EXEC FEATURE_INDEX_1
 # define index_arch_Prefer_No_VZEROUPPER FEATURE_INDEX_1
+# define index_arch_Fast_Unaligned_Copy	FEATURE_INDEX_1
 
 #endif	/* !__ASSEMBLER__ */
 
diff --git a/sysdeps/x86_64/multiarch/memcpy.S b/sysdeps/x86_64/multiarch/memcpy.S
index 8882590..9b37626 100644
--- a/sysdeps/x86_64/multiarch/memcpy.S
+++ b/sysdeps/x86_64/multiarch/memcpy.S
@@ -40,18 +40,20 @@ ENTRY(__new_memcpy)
 #endif
 1:	lea	__memcpy_avx_unaligned(%rip), %RAX_LP
 	HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
+	jnz	3f
+	HAS_ARCH_FEATURE (Fast_Unaligned_Copy)
 	jnz	2f
 	lea	__memcpy_sse2_unaligned(%rip), %RAX_LP
 	HAS_ARCH_FEATURE (Fast_Unaligned_Load)
-	jnz	2f
-	lea	__memcpy_sse2(%rip), %RAX_LP
+	jnz	3f
+2:	lea	__memcpy_sse2(%rip), %RAX_LP
 	HAS_CPU_FEATURE (SSSE3)
-	jz	2f
+	jz	3f
 	lea    __memcpy_ssse3_back(%rip), %RAX_LP
 	HAS_ARCH_FEATURE (Fast_Copy_Backward)
-	jnz	2f
+	jnz	3f
 	lea	__memcpy_ssse3(%rip), %RAX_LP
-2:	ret
+3:	ret
 END(__new_memcpy)
 
 # undef ENTRY
-- 
2.1.4


[-- Attachment #3: ChangeLog --]
[-- Type: application/octet-stream, Size: 489 bytes --]

2016-03-23  Amit Pawar  <Amit.Pawar@amd.com>

	[BZ #19583]
	* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
	New.
	(index_arch_Fast_Unaligned_Copy): Likewise.
	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
	Fast_Copy_Backward and Fast_Unaligned_Copy for Excavator core.
	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Add check for
	Fast_Unaligned_Copy bit and select it on Excavator core.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583
  2016-03-23 10:12                                 ` Pawar, Amit
@ 2016-03-23 17:59                                   ` H.J. Lu
  2016-03-28  7:43                                     ` Pawar, Amit
  0 siblings, 1 reply; 23+ messages in thread
From: H.J. Lu @ 2016-03-23 17:59 UTC (permalink / raw)
  To: Pawar, Amit; +Cc: libc-alpha

[-- Attachment #1: Type: text/plain, Size: 308 bytes --]

On Wed, Mar 23, 2016 at 3:12 AM, Pawar, Amit <Amit.Pawar@amd.com> wrote:
>> Then we should add Fast_Unaligned_Copy and only use it in memcpy.
> PFA patch and ChangeLog files containing fix for memcpy IFUNC function. Is it OK else please suggest for any required changes.
>

It isn't OK.  Try this.

-- 
H.J.

[-- Attachment #2: 0001-x86-Add-a-feature-bit-Fast_Unaligned_Copy.patch --]
[-- Type: text/x-patch, Size: 4586 bytes --]

From 327aadf6348bd41d1fae46ee7780e214c0a493c1 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Wed, 23 Mar 2016 10:33:19 -0700
Subject: [PATCH] [x86] Add a feature bit: Fast_Unaligned_Copy

On AMD processors, memcpy optimized with unaligned SSE load is
slower than memcpy optimized with aligned SSSE3, while other string
functions are faster with unaligned SSE load.  A feature bit,
Fast_Unaligned_Copy, is added to select memcpy optimized with
unaligned SSE load.

	[BZ #19583]
	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
	Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
	processors.  Set Fast_Copy_Backward for AMD Excavator
	processors.
	* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
	New.
	(index_arch_Fast_Unaligned_Copy): Likewise.
	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
	Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
---
 sysdeps/x86/cpu-features.c        | 14 +++++++++++++-
 sysdeps/x86/cpu-features.h        |  3 +++
 sysdeps/x86_64/multiarch/memcpy.S |  2 +-
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index c8f81ef..de75c79 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -153,8 +153,12 @@ init_cpu_features (struct cpu_features *cpu_features)
 #if index_arch_Fast_Unaligned_Load != index_arch_Slow_SSE4_2
 # error index_arch_Fast_Unaligned_Load != index_arch_Slow_SSE4_2
 #endif
+#if index_arch_Fast_Unaligned_Load != index_arch_Fast_Unaligned_Copy
+# error index_arch_Fast_Unaligned_Load != index_arch_Fast_Unaligned_Copy
+#endif
 	      cpu_features->feature[index_arch_Fast_Unaligned_Load]
 		|= (bit_arch_Fast_Unaligned_Load
+		    | bit_arch_Fast_Unaligned_Copy
 		    | bit_arch_Prefer_PMINUB_for_stringop
 		    | bit_arch_Slow_SSE4_2);
 	      break;
@@ -183,10 +187,14 @@ init_cpu_features (struct cpu_features *cpu_features)
 #if index_arch_Fast_Rep_String != index_arch_Prefer_PMINUB_for_stringop
 # error index_arch_Fast_Rep_String != index_arch_Prefer_PMINUB_for_stringop
 #endif
+#if index_arch_Fast_Rep_String != index_arch_Fast_Unaligned_Copy
+# error index_arch_Fast_Rep_String != index_arch_Fast_Unaligned_Copy
+#endif
 	      cpu_features->feature[index_arch_Fast_Rep_String]
 		|= (bit_arch_Fast_Rep_String
 		    | bit_arch_Fast_Copy_Backward
 		    | bit_arch_Fast_Unaligned_Load
+		    | bit_arch_Fast_Unaligned_Copy
 		    | bit_arch_Prefer_PMINUB_for_stringop);
 	      break;
 	    }
@@ -220,10 +228,14 @@ init_cpu_features (struct cpu_features *cpu_features)
 
       if (family == 0x15)
 	{
+#if index_arch_Fast_Unaligned_Load != index_arch_Fast_Copy_Backward
+# error index_arch_Fast_Unaligned_Load != index_arch_Fast_Copy_Backward
+#endif
 	  /* "Excavator"   */
 	  if (model >= 0x60 && model <= 0x7f)
 	    cpu_features->feature[index_arch_Fast_Unaligned_Load]
-	      |= bit_arch_Fast_Unaligned_Load;
+	      |= (bit_arch_Fast_Unaligned_Load
+		  | bit_arch_Fast_Copy_Backward);
 	}
     }
   else
diff --git a/sysdeps/x86/cpu-features.h b/sysdeps/x86/cpu-features.h
index e06eb7e..bfe1f4c 100644
--- a/sysdeps/x86/cpu-features.h
+++ b/sysdeps/x86/cpu-features.h
@@ -35,6 +35,7 @@
 #define bit_arch_I686				(1 << 15)
 #define bit_arch_Prefer_MAP_32BIT_EXEC		(1 << 16)
 #define bit_arch_Prefer_No_VZEROUPPER		(1 << 17)
+#define bit_arch_Fast_Unaligned_Copy		(1 << 18)
 
 /* CPUID Feature flags.  */
 
@@ -101,6 +102,7 @@
 # define index_arch_I686		FEATURE_INDEX_1*FEATURE_SIZE
 # define index_arch_Prefer_MAP_32BIT_EXEC FEATURE_INDEX_1*FEATURE_SIZE
 # define index_arch_Prefer_No_VZEROUPPER FEATURE_INDEX_1*FEATURE_SIZE
+# define index_arch_Fast_Unaligned_Copy	FEATURE_INDEX_1*FEATURE_SIZE
 
 
 # if defined (_LIBC) && !IS_IN (nonlib)
@@ -265,6 +267,7 @@ extern const struct cpu_features *__get_cpu_features (void)
 # define index_arch_I686		FEATURE_INDEX_1
 # define index_arch_Prefer_MAP_32BIT_EXEC FEATURE_INDEX_1
 # define index_arch_Prefer_No_VZEROUPPER FEATURE_INDEX_1
+# define index_arch_Fast_Unaligned_Copy	FEATURE_INDEX_1
 
 #endif	/* !__ASSEMBLER__ */
 
diff --git a/sysdeps/x86_64/multiarch/memcpy.S b/sysdeps/x86_64/multiarch/memcpy.S
index 8882590..5b045d7 100644
--- a/sysdeps/x86_64/multiarch/memcpy.S
+++ b/sysdeps/x86_64/multiarch/memcpy.S
@@ -42,7 +42,7 @@ ENTRY(__new_memcpy)
 	HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
 	jnz	2f
 	lea	__memcpy_sse2_unaligned(%rip), %RAX_LP
-	HAS_ARCH_FEATURE (Fast_Unaligned_Load)
+	HAS_ARCH_FEATURE (Fast_Unaligned_Copy)
 	jnz	2f
 	lea	__memcpy_sse2(%rip), %RAX_LP
 	HAS_CPU_FEATURE (SSSE3)
-- 
2.5.5
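
For readability, here is a C restatement of the branch ordering the patched memcpy.S above implements, with plain int arguments standing in for the HAS_ARCH_FEATURE / HAS_CPU_FEATURE checks; this is an illustration of the selection order only, not the actual IFUNC resolver.  Note the contrast with the earlier proposal: here Fast_Unaligned_Copy means "unaligned SSE copy is fast" and is set for Intel CPUs alongside Fast_Unaligned_Load, while Excavator keeps Fast_Unaligned_Load for the string routines and gains Fast_Copy_Backward, so it ends up with __memcpy_ssse3_back.

#include <stdio.h>

/* Returns the name of the variant the patched
   sysdeps/x86_64/multiarch/memcpy.S would pick; the arguments stand in
   for the HAS_ARCH_FEATURE / HAS_CPU_FEATURE checks.  Illustration of
   the branch ordering only, not the actual resolver.  */
static const char *
select_memcpy (int avx_fast_unaligned_load, int fast_unaligned_copy,
	       int has_ssse3, int fast_copy_backward)
{
  if (avx_fast_unaligned_load)
    return "__memcpy_avx_unaligned";
  if (fast_unaligned_copy)	/* Intel: set with Fast_Unaligned_Load.  */
    return "__memcpy_sse2_unaligned";
  if (!has_ssse3)
    return "__memcpy_sse2";
  if (fast_copy_backward)	/* Excavator: set by this patch.  */
    return "__memcpy_ssse3_back";
  return "__memcpy_ssse3";
}

int
main (void)
{
  /* Excavator after this patch, assuming AVX_Fast_Unaligned_Load is
     not set for it, as discussed earlier in the thread.  */
  printf ("Excavator -> %s\n", select_memcpy (0, 0, 1, 1));
  /* An Intel AVX CPU with both unaligned bits set.  */
  printf ("Intel AVX -> %s\n", select_memcpy (1, 1, 1, 0));
  return 0;
}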


^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583
  2016-03-23 17:59                                   ` H.J. Lu
@ 2016-03-28  7:43                                     ` Pawar, Amit
  2016-03-28 12:12                                       ` H.J. Lu
  0 siblings, 1 reply; 23+ messages in thread
From: Pawar, Amit @ 2016-03-28  7:43 UTC (permalink / raw)
  To: H.J. Lu; +Cc: libc-alpha

>It isn't OK.  Try this.
This is OK. Can you please commit this change?

Thanks,
Amit Pawar

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583
  2016-03-28  7:43                                     ` Pawar, Amit
@ 2016-03-28 12:12                                       ` H.J. Lu
  0 siblings, 0 replies; 23+ messages in thread
From: H.J. Lu @ 2016-03-28 12:12 UTC (permalink / raw)
  To: Pawar, Amit; +Cc: libc-alpha

On Mon, Mar 28, 2016 at 12:43 AM, Pawar, Amit <Amit.Pawar@amd.com> wrote:
>>It isn't OK.  Try this.
> This is OK. Can you please commit this change?
>
> Thanks,
> Amit Pawar

Tested on ia32 and x86-64.  I am checking it in.


-- 
H.J.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2016-03-28 12:12 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-17 10:52 [PATCH x86_64] Update memcpy, mempcpy and memmove selection order for Excavator CPU BZ #19583 Pawar, Amit
2016-03-17 11:53 ` H.J. Lu
2016-03-17 14:16   ` Pawar, Amit
2016-03-17 14:46     ` H.J. Lu
2016-03-18 11:43       ` Pawar, Amit
2016-03-18 11:51         ` H.J. Lu
2016-03-18 12:25           ` Pawar, Amit
2016-03-18 12:34             ` H.J. Lu
2016-03-18 13:22               ` Pawar, Amit
2016-03-18 13:51                 ` H.J. Lu
2016-03-18 13:55                   ` Adhemerval Zanella
2016-03-18 14:43                     ` H.J. Lu
2016-03-18 14:45                   ` H.J. Lu
2016-03-18 15:19                     ` Pawar, Amit
2016-03-18 15:24                       ` H.J. Lu
2016-03-22 11:08                         ` Pawar, Amit
2016-03-22 14:50                           ` H.J. Lu
2016-03-22 14:57                             ` Pawar, Amit
2016-03-22 15:03                               ` H.J. Lu
2016-03-23 10:12                                 ` Pawar, Amit
2016-03-23 17:59                                   ` H.J. Lu
2016-03-28  7:43                                     ` Pawar, Amit
2016-03-28 12:12                                       ` H.J. Lu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).