* x86-64: strlen-evex performance performance degradation compared to strlen-avx2
@ 2024-04-26  4:03 abush wang
  2024-04-26 13:30 ` H.J. Lu
  0 siblings, 1 reply; 12+ messages in thread
From: abush wang @ 2024-04-26  4:03 UTC
To: H.J. Lu, abushwang via Libc-alpha

Hi H.J.,

When comparing glibc performance between 2.28 and 2.38, I found a
performance degradation in strlen. The difference comes from the
implementation selected at run time: __strlen_avx2 in 2.28 and
__strlen_evex in 2.38.

```
2.28
__strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:42
42 ENTRY (STRLEN)


2.38
__strlen_evex () at ../sysdeps/x86_64/multiarch/strlen-evex.S:79
79 ENTRY_P2ALIGN (STRLEN, 6)
```

This is my test:
```
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

#define MAX_STRINGS 100

uint64_t rdtsc() {
    uint32_t lo, hi;
    __asm__ __volatile__ (
        "rdtsc" : "=a"(lo), "=d"(hi)
    );
    return ((uint64_t)hi << 32) | lo;
}

int main(int argc, char *argv[]) {
    char *input_str[MAX_STRINGS];
    size_t lengths[MAX_STRINGS];
    int num_strings = 0; // Number of input strings
    uint64_t start_cycles, end_cycles;

    // Parse command line arguments and store pointers in input_str array
    for (int i = 1; i < argc && num_strings < MAX_STRINGS; ++i) {
        input_str[num_strings] = argv[i];
        num_strings++;
    }

    // Without any input strings the average below would divide by zero
    if (num_strings == 0) {
        fprintf(stderr, "usage: %s STRING...\n", argv[0]);
        return 1;
    }

    // Measure the strlen operation for each string
    start_cycles = rdtsc();
    for (int i = 0; i < num_strings; ++i) {
        lengths[i] = strlen(input_str[i]);
    }
    end_cycles = rdtsc();

    unsigned long long total_cycle = end_cycles - start_cycles;
    unsigned long long av_cycle = total_cycle / num_strings;
    // Print the total cycles taken for the strlen operations
    printf("Total cycles: %llu av cycle: %llu \n", total_cycle, av_cycle);

    // Print the recorded lengths
    printf("Lengths of the input strings:\n");
    for (int i = 0; i < num_strings; ++i) {
        printf("String %d length: %zu\n", i, lengths[i]);
    }

    return 0;
}
```

This is the result:
```
2.28
./strlen_test str1 str2 str3 str4 str5
Total cycles: 1468 av cycle: 293
Lengths of the input strings:
String 0 length: 4
String 1 length: 4
String 2 length: 4
String 3 length: 4
String 4 length: 4

2.38
./strlen_test str1 str2 str3 str4 str5
Total cycles: 1814 av cycle: 362
Lengths of the input strings:
String 0 length: 4
String 1 length: 4
String 2 length: 4
String 3 length: 4
String 4 length: 4
```

Thanks,
abush
* Re: x86-64: strlen-evex performance performance degradation compared to strlen-avx2
From: H.J. Lu @ 2024-04-26 13:30 UTC
To: abush wang, Sunil K Pandey, Noah Goldstein; +Cc: abushwang via Libc-alpha

On Thu, Apr 25, 2024 at 9:03 PM abush wang <abushwangs@gmail.com> wrote:
>
> Hi, H.J.
> When I test glibc performance between 2.28 and 2.38,
> I found there is a performance degradation about strlen.
> In fact, this difference comes from __strlen_avx2 and __strlen_evex
>
> [...]
>
> Thanks,
> abush

Which processors did you use?  Sunil, Noah, can we reproduce it?

--
H.J.
* Re: x86-64: strlen-evex performance performance degradation compared to strlen-avx2
From: Sunil Pandey @ 2024-04-26 16:53 UTC
To: H.J. Lu; +Cc: abush wang, Noah Goldstein, abushwang via Libc-alpha

On Fri, Apr 26, 2024 at 6:30 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> On Thu, Apr 25, 2024 at 9:03 PM abush wang <abushwangs@gmail.com> wrote:
> >
> > Hi, H.J.
> > When I test glibc performance between 2.28 and 2.38,
> > I found there is a performance degradation about strlen.
> > In fact, this difference comes from __strlen_avx2 and __strlen_evex
> >
> > [...]

I'm not sure how you are measuring the performance of the strlen function.
Are you drawing a performance conclusion from these two runs?

2.28
Total cycles: 1468 av cycle: 293

2.38
Total cycles: 1814 av cycle: 362

Please use the glibc microbenchmark to see whether you can reproduce the
perf drop.

> Which processors did you use?  Sunil, Noah, can we reproduce it?
>
> --
> H.J.
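The concern about drawing a conclusion from two runs can be illustrated with a repeated measurement. The following is a minimal sketch only, not the glibc benchtests harness: it times many trials and reports the median, so that one-time costs carry less weight. That one-time costs such as first-call symbol resolution inflate the original five-call numbers is an assumption, not something the thread establishes. The helpers `now_ns`, `median_u64` and `time_strlen` are hypothetical names, and `clock_gettime` is used in place of `rdtsc` purely for portability.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* qsort comparator for uint64_t values. */
static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sort the samples in place and return the median
   (the lower middle element when n is even). */
static uint64_t median_u64(uint64_t *samples, size_t n) {
    assert(n > 0);
    qsort(samples, n, sizeof samples[0], cmp_u64);
    return samples[(n - 1) / 2];
}

/* Run `trials` trials of `calls` strlen calls each and store the elapsed
   nanoseconds of every trial in out[0..trials-1].  The sum of all returned
   lengths is handed back so the calls cannot be optimised away. */
static size_t time_strlen(const char *s, size_t trials, size_t calls,
                          uint64_t *out) {
    /* A volatile function pointer keeps the compiler from folding or
       hoisting the strlen call out of the loop. */
    size_t (*volatile fn)(const char *) = strlen;
    size_t sink = 0;
    for (size_t t = 0; t < trials; ++t) {
        uint64_t start = now_ns();
        for (size_t c = 0; c < calls; ++c)
            sink += fn(s);
        out[t] = now_ns() - start;
    }
    return sink;
}
```

Calling `time_strlen("str1", 1000, 10000, samples)` and then `median_u64(samples, 1000)` gives a per-trial figure that can be compared across glibc builds; taking the minimum instead of the median is another common choice.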
* Re: x86-64: strlen-evex performance performance degradation compared to strlen-avx2
From: abush wang @ 2024-04-28  2:13 UTC
To: Sunil Pandey; +Cc: H.J. Lu, Noah Goldstein, abushwang via Libc-alpha

Actually, I was investigating a performance issue seen with libmicro on our
distro OS. I found that the degradation in libmicro's localtime_r benchmark
is caused by strlen, so I extracted this test case.

On Sat, Apr 27, 2024 at 12:54 AM Sunil Pandey <skpgkp2@gmail.com> wrote:
> I'm not sure how you are measuring the performance of strlen function.
> Are you making performance conclusion based on these 2 runs?
>
> 2.28
> Total cycles: 1468 av cycle: 293
>
> 2.38
> Total cycles: 1814 av cycle: 362
>
> Please use glibc microbenchmark to see if you can reproduce perf drop.
* Re: x86-64: strlen-evex performance performance degradation compared to strlen-avx2
From: Sunil Pandey @ 2024-04-28 16:12 UTC
To: abush wang; +Cc: H.J. Lu, Noah Goldstein, abushwang via Libc-alpha

On Sat, Apr 27, 2024 at 7:13 PM abush wang <abushwangs@gmail.com> wrote:
> Actually, I was handling performance issue from libmicro in our distro OS.
> I found that the performance degradation of localtime_r benchmark from
> libmicro is blame to strlen.
> So I abstracted this test case.

Can you consistently reproduce the strlen perf behaviour by running the test
multiple times back-to-back?

You can see a high swing from run to run.
* Re: x86-64: strlen-evex performance performance degradation compared to strlen-avx2
From: H.J. Lu @ 2024-04-28 16:16 UTC
To: Sunil Pandey; +Cc: abush wang, Noah Goldstein, abushwang via Libc-alpha

On Sun, Apr 28, 2024 at 9:13 AM Sunil Pandey <skpgkp2@gmail.com> wrote:
>
> Can you consistently reproduce strlen perf behaviour by running multiple times back-to-back?

Hi Sunil,

Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz is SKX.  Please add this test to
benchtests/bench-strlen.c and check its performance on SKX.

--
H.J.
* Re: x86-64: strlen-evex performance performance degradation compared to strlen-avx2
From: Sunil Pandey @ 2024-04-29 17:41 UTC
To: H.J. Lu; +Cc: abush wang, Noah Goldstein, abushwang via Libc-alpha

On Sun, Apr 28, 2024 at 9:17 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> Hi Sunil,
>
> Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz is SKX.  Please add this test to
> benchtests/bench-strlen.c and check its performance on SKX.
>
> --
> H.J.

I collected the glibc micro-benchmark data for the string length in question.

2.38 evex data:

length=4, alignment=4: 4.40
length=4, alignment=0: 4.29
length=4, alignment=0: 3.64
length=4, alignment=7: 3.64
length=4, alignment=2: 3.64

2.28 evex data:

Length 4, alignment 4: 6.46875
Length 4, alignment 0: 6.5
Length 4, alignment 0: 6.53125
Length 4, alignment 7: 6.46875
Length 4, alignment 2: 6.53125

Data collected on Machine: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz

2.38 perf numbers are better than 2.28 as expected.

--Sunil
* Re: x86-64: strlen-evex performance performance degradation compared to strlen-avx2
From: H.J. Lu @ 2024-04-29 20:19 UTC
To: Sunil Pandey; +Cc: abush wang, Noah Goldstein, abushwang via Libc-alpha

On Mon, Apr 29, 2024 at 10:42 AM Sunil Pandey <skpgkp2@gmail.com> wrote:
>
> I collected the glibc micro-benchmark data for the string length in question.
>
> [...]
>
> Data collected on Machine: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
>
> 2.38 perf numbers are better than 2.28 as expected.

1. Please compare AVX2 vs EVEX strlen on glibc master branch.
2. Please check strlen on strings of length == 4 and alignments = 0, 1, 2, 3.

--
H.J.
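The request to test length-4 strings at alignments 0 to 3 can be sketched as follows. This is an illustrative helper only, assuming a 64-byte reference boundary; it is not how benchtests/bench-strlen.c lays out its buffers, and `place_string` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Build a NUL-terminated string of `len` bytes inside `buf`, starting
   `align` bytes past the first 64-byte boundary in the buffer, and return
   a pointer to it.  The buffer must be large enough for the padding, the
   string and its terminator. */
static char *place_string(char *buf, size_t buf_size, size_t align,
                          size_t len) {
    /* Bytes to skip so that buf + pad sits on a 64-byte boundary. */
    size_t pad = (64 - ((uintptr_t)buf & 63)) & 63;
    assert(pad + align + len + 1 <= buf_size);
    char *s = buf + pad + align;
    memset(s, 'a', len);
    s[len] = '\0';
    return s;
}
```

With a 256-byte buffer, `place_string(buf, sizeof buf, 3, 4)` yields a 4-character string whose address is 3 bytes past a 64-byte boundary, which can then be passed to whatever timing loop is in use.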
* Re: x86-64: strlen-evex performance performance degradation compared to strlen-avx2
From: Sunil Pandey @ 2024-04-30  0:54 UTC
To: H.J. Lu; +Cc: abush wang, Noah Goldstein, abushwang via Libc-alpha

On Mon, Apr 29, 2024 at 1:20 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> 1. Please compare AVX2 vs EVEX strlen on glibc master branch.
> 2. Please check strlen on strings of length == 4 and alignments = 0, 1, 2,
> 3.
>
> --
> H.J.

Data from master branch:

Data collected on Machine: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz

                          __strlen_evex   __strlen_avx2
=======================================================
length=4, alignment=0:         5.00            5.11
length=4, alignment=1:         4.92            4.80
length=4, alignment=2:         4.82            4.62
length=4, alignment=3:         4.62            4.92
length=4, alignment=4:         4.44            4.44
length=4, alignment=5:         4.59            4.29
length=4, alignment=6:         4.39            4.29
length=4, alignment=7:         4.14            4.14
length=4, alignment=8:         4.19            4.00
length=4, alignment=9:         4.00            4.00
length=4, alignment=10:        4.31            3.87
length=4, alignment=11:        3.96            3.87
length=4, alignment=12:        3.86            3.75
length=4, alignment=13:        3.75            3.75
length=4, alignment=14:        3.64            3.64
length=4, alignment=15:        3.64            3.72
length=4, alignment=16:        3.64            3.53
length=4, alignment=17:        3.63            3.53
length=4, alignment=18:        4.12            3.53
length=4, alignment=19:        3.43            3.43
length=4, alignment=20:        3.43            3.43
length=4, alignment=21:        3.33            3.33
length=4, alignment=22:        3.33            3.42
length=4, alignment=23:        3.33            3.33
length=4, alignment=24:        3.33            3.33
length=4, alignment=25:        3.33            3.33
length=4, alignment=26:        3.96            3.33
length=4, alignment=27:        3.33            3.41
length=4, alignment=28:        3.33            3.33
length=4, alignment=29:        3.41            3.33
length=4, alignment=30:        3.33            3.41
length=4, alignment=31:        3.33            3.33

--Sunil
* Re: x86-64: strlen-evex performance performance degradation compared to strlen-avx2
From: H.J. Lu @ 2024-04-30  2:51 UTC
To: Sunil Pandey; +Cc: abush wang, Noah Goldstein, abushwang via Libc-alpha

On Mon, Apr 29, 2024 at 5:55 PM Sunil Pandey <skpgkp2@gmail.com> wrote:
>
> Data from master branch:
>
> Data collected on Machine: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
>
> [...]
>
> --Sunil

Hi Sunil,

strlen-avx2.S in the glibc 2.28 release (tag glibc-2.28) is
different from strlen-avx2.S on the master branch.  Please
compare their performance.

--
H.J.
* Re: x86-64: strlen-evex performance performance degradation compared to strlen-avx2
  2024-04-30  2:51 ` H.J. Lu
@ 2024-04-30 20:16 ` Sunil Pandey
From: Sunil Pandey @ 2024-04-30 20:16 UTC
To: H.J. Lu; +Cc: abush wang, Noah Goldstein, abushwang via Libc-alpha

On Mon, Apr 29, 2024 at 7:52 PM H.J. Lu <hjl.tools@gmail.com> wrote:

> On Mon, Apr 29, 2024 at 5:55 PM Sunil Pandey <skpgkp2@gmail.com> wrote:
> >
> > On Mon, Apr 29, 2024 at 1:20 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >>
> >> On Mon, Apr 29, 2024 at 10:42 AM Sunil Pandey <skpgkp2@gmail.com> wrote:
> >> >
> >> > On Sun, Apr 28, 2024 at 9:17 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >> >>
> >> >> On Sun, Apr 28, 2024 at 9:13 AM Sunil Pandey <skpgkp2@gmail.com> wrote:
> >> >> >
> >> >> > On Sat, Apr 27, 2024 at 7:13 PM abush wang <abushwangs@gmail.com> wrote:
> >> >> >>
> >> >> >> Actually, I was handling a performance issue from libmicro in our distro OS.
> >> >> >> I found that the performance degradation of the localtime_r benchmark from libmicro is caused by strlen.
> >> >> >> So I extracted this test case.
> >> >> >>
> >> >> >
> >> >> > Can you consistently reproduce strlen perf behaviour by running multiple times back-to-back?
> >> >> >
> >> >> > You can see high swing from run
> >> >>
> >> >> Hi Sunil,
> >> >>
> >> >> Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz is SKX. Please add this test to
> >> >> benchtests/bench-strlen.c and check its performance on SKX.
> >> >>
> >> >> --
> >> >> H.J.
> >> >
> >> > I collected the glibc micro-benchmark data for the string length in question.
> >> >
> >> > 2.38 evex data:
> >> >
> >> > length=4, alignment=4: 4.40
> >> > length=4, alignment=0: 4.29
> >> > length=4, alignment=0: 3.64
> >> > length=4, alignment=7: 3.64
> >> > length=4, alignment=2: 3.64
> >> >
> >> > 2.28 evex data:
> >> >
> >> > Length 4, alignment 4: 6.46875
> >> > Length 4, alignment 0: 6.5
> >> > Length 4, alignment 0: 6.53125
> >> > Length 4, alignment 7: 6.46875
> >> > Length 4, alignment 2: 6.53125
> >> >
> >> > Data collected on Machine: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
> >> >
> >> > 2.38 perf numbers are better than 2.28 as expected.
> >>
> >> 1. Please compare AVX2 vs EVEX strlen on glibc master branch.
> >> 2. Please check strlen on strings of length == 4 and alignments = 0, 1, 2, 3.
> >>
> >> --
> >> H.J.
> >
> > Data from master branch:
> >
> > Data collected on Machine: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
> >
> >                           __strlen_evex   __strlen_avx2
> > =======================================================
> > length=4, alignment=0:        5.00            5.11
> > length=4, alignment=1:        4.92            4.80
> > length=4, alignment=2:        4.82            4.62
> > length=4, alignment=3:        4.62            4.92
> > length=4, alignment=4:        4.44            4.44
> > length=4, alignment=5:        4.59            4.29
> > length=4, alignment=6:        4.39            4.29
> > length=4, alignment=7:        4.14            4.14
> > length=4, alignment=8:        4.19            4.00
> > length=4, alignment=9:        4.00            4.00
> > length=4, alignment=10:       4.31            3.87
> > length=4, alignment=11:       3.96            3.87
> > length=4, alignment=12:       3.86            3.75
> > length=4, alignment=13:       3.75            3.75
> > length=4, alignment=14:       3.64            3.64
> > length=4, alignment=15:       3.64            3.72
> > length=4, alignment=16:       3.64            3.53
> > length=4, alignment=17:       3.63            3.53
> > length=4, alignment=18:       4.12            3.53
> > length=4, alignment=19:       3.43            3.43
> > length=4, alignment=20:       3.43            3.43
> > length=4, alignment=21:       3.33            3.33
> > length=4, alignment=22:       3.33            3.42
> > length=4, alignment=23:       3.33            3.33
> > length=4, alignment=24:       3.33            3.33
> > length=4, alignment=25:       3.33            3.33
> > length=4, alignment=26:       3.96            3.33
> > length=4, alignment=27:       3.33            3.41
> > length=4, alignment=28:       3.33            3.33
> > length=4, alignment=29:       3.41            3.33
> > length=4, alignment=30:       3.33            3.41
> > length=4, alignment=31:       3.33            3.33
> >
> > --Sunil
>
> Hi Sunil,
>
> strlen-avx2.S in glibc 2.28 release (tag glibc-2.28) is
> different from strlen-avx2.S on master branch. Please
> compare their performances.
>
> --
> H.J.
>

I tested the strlen implementations with different alignment combinations.

                          __strlen_evex(master)  __strlen_avx2(master)  __strlen_avx2(2.28)
===========================================================================================
length=4, alignment=0:           5.00                   5.09                  8.00
length=4, alignment=1:           4.80                   4.80                  7.78
length=4, alignment=2:           4.71                   4.62                  7.46
length=4, alignment=3:           4.44                   4.55                  7.11
length=4, alignment=4:           4.44                   4.45                  7.23
length=4, alignment=5:           4.29                   4.29                  6.86
length=4, alignment=6:           4.14                   4.14                  6.76
length=4, alignment=7:           4.00                   4.00                  6.40
length=4, alignment=8:           4.00                   4.00                  6.50
length=4, alignment=9:           3.87                   3.87                  6.29
length=4, alignment=10:          3.75                   3.85                  6.00
length=4, alignment=11:          3.75                   3.75                  6.00
length=4, alignment=12:          3.76                   3.64                  5.82
length=4, alignment=13:          3.64                   3.64                  6.08
length=4, alignment=14:          3.53                   3.53                  5.74
length=4, alignment=15:          3.53                   3.53                  5.74
length=4, alignment=16:          3.43                   3.43                  5.57
length=4, alignment=17:          3.43                   3.43                  5.67
length=4, alignment=18:          3.33                   3.33                  5.41
length=4, alignment=19:          3.33                   3.33                  5.44
length=4, alignment=20:          3.33                   3.33                  5.41
length=4, alignment=21:          3.33                   3.33                  5.43
length=4, alignment=22:          3.33                   3.33                  5.41
length=4, alignment=23:          3.33                   3.33                  5.41
length=4, alignment=24:          3.33                   3.33                  5.33
length=4, alignment=25:          3.41                   3.33                  5.33
length=4, alignment=26:          3.86                   3.33                  5.33
length=4, alignment=27:          3.42                   3.33                  5.33
length=4, alignment=28:          3.33                   3.33                  5.33
length=4, alignment=29:          3.33                   3.33                  5.33
length=4, alignment=30:          3.33                   3.33                  5.33
length=4, alignment=31:          3.33                   3.33                  5.33

Based on the data, the avx2 and evex versions on master are both faster than
the avx2 version in glibc-2.28, as expected.

--Sunil
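The per-alignment numbers in this message come from the glibc micro-benchmark. As a rough illustration of that kind of measurement, and not the benchtests code itself, the sketch below places a short string at a chosen offset from a 64-byte-aligned buffer and averages the time of many strlen calls. The helper names `make_aligned_string` and `time_strlen_ns` are hypothetical, and `clock_gettime` is an assumed stand-in for the benchmark's own timing.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: put a string of `len` 'a' characters at `align`
   bytes past a 64-byte-aligned base.  The caller frees *base_out.  */
char *make_aligned_string(size_t align, size_t len, void **base_out)
{
    void *base = NULL;

    /* One 4096-byte buffer is plenty for the small offsets used here.  */
    if (align + len + 1 > 4096 || posix_memalign(&base, 64, 4096) != 0)
        return NULL;

    char *s = (char *)base + align;
    memset(s, 'a', len);
    s[len] = '\0';
    *base_out = base;
    return s;
}

/* Hypothetical helper: average nanoseconds per strlen call over
   `iters` calls (iters must be non-zero).  */
double time_strlen_ns(const char *s, size_t iters)
{
    /* Calling through a volatile pointer keeps the compiler from
       folding or hoisting the strlen call out of the loop.  */
    size_t (*volatile fn)(const char *) = strlen;
    volatile size_t sink = 0;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; ++i)
        sink += fn(s);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
                + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}
```

Calling `make_aligned_string(a, 4, &base)` for a = 0..31 and passing each result to `time_strlen_ns` gives one number per alignment. Which strlen variant runs is chosen by glibc on the host, so this sketch alone does not compare AVX2 against EVEX.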
* Re: x86-64: strlen-evex performance performance degradation compared to strlen-avx2
  2024-04-26 13:30 ` H.J. Lu
  2024-04-26 16:53   ` Sunil Pandey
@ 2024-04-28  2:06   ` abush wang
From: abush wang @ 2024-04-28 2:06 UTC
To: H.J. Lu; +Cc: Sunil K Pandey, Noah Goldstein, abushwang via Libc-alpha

This is my env:

lscpu
...
BIOS Vendor ID:                     Intel(R) Corporation
Model name:                         Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz
BIOS Model name:                    Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz CPU @ 2.5GHz
...

I think you can run my demo in these environments to reproduce it

On Fri, Apr 26, 2024 at 9:30 PM H.J. Lu <hjl.tools@gmail.com> wrote:

> On Thu, Apr 25, 2024 at 9:03 PM abush wang <abushwangs@gmail.com> wrote:
> >
> > Hi, H.J.
> > When I test glibc performance between 2.28 and 2.38,
> > I found there is a performance degradation about strlen.
> > In fact, this difference comes from __strlen_avx2 and __strlen_evex
> >
> > ```
> > 2.28
> > __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:42
> > 42 ENTRY (STRLEN)
> >
> >
> > 2.38
> > __strlen_evex () at ../sysdeps/x86_64/multiarch/strlen-evex.S:79
> > 79 ENTRY_P2ALIGN (STRLEN, 6)
> > ```
> >
> > This is my test:
> > ```
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <stdint.h>
> > #include <string.h>
> >
> > #define MAX_STRINGS 100
> >
> > uint64_t rdtsc() {
> >     uint32_t lo, hi;
> >     __asm__ __volatile__ (
> >         "rdtsc" : "=a"(lo), "=d"(hi)
> >     );
> >     return ((uint64_t)hi << 32) | lo;
> > }
> >
> > int main(int argc, char *argv[]) {
> >     char *input_str[MAX_STRINGS];
> >     size_t lengths[MAX_STRINGS];
> >     int num_strings = 0; // Number of input strings
> >     uint64_t start_cycles, end_cycles;
> >
> >     // Parse command line arguments and store pointers in input_str array
> >     for (int i = 1; i < argc && num_strings < MAX_STRINGS; ++i) {
> >         input_str[num_strings] = argv[i];
> >         num_strings++;
> >     }
> >
> >     // Measure the strlen operation for each string
> >     start_cycles = rdtsc();
> >     for (int i = 0; i < num_strings; ++i) {
> >         lengths[i] = strlen(input_str[i]);
> >     }
> >     end_cycles = rdtsc();
> >
> >     unsigned long long total_cycle = end_cycles - start_cycles;
> >     unsigned long long av_cycle = total_cycle / num_strings;
> >     // Print the total cycles taken for the strlen operations
> >     printf("Total cycles: %llu av cycle: %llu \n", total_cycle, av_cycle);
> >
> >     // Print the recorded lengths
> >     printf("Lengths of the input strings:\n");
> >     for (int i = 0; i < num_strings; ++i) {
> >         printf("String %d length: %zu\n", i, lengths[i]);
> >     }
> >
> >     return 0;
> > }
> > ```
> >
> > This is result
> > ```
> > 2.28
> > ./strlen_test str1 str2 str3 str4 str5
> > Total cycles: 1468 av cycle: 293
> > Lengths of the input strings:
> > String 0 length: 4
> > String 1 length: 4
> > String 2 length: 4
> > String 3 length: 4
> > String 4 length: 4
> >
> > 2.38
> > ./strlen_test str1 str2 str3 str4 str5
> > Total cycles: 1814 av cycle: 362
> > Lengths of the input strings:
> > String 0 length: 4
> > String 1 length: 4
> > String 2 length: 4
> > String 3 length: 4
> > String 4 length: 4
> > ```
> >
> > Thanks,
> > abush
>
> Which processors did you use? Sunil, Noah, can we reproduce it?
>
> --
> H.J.
>
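The quoted test times five strlen calls in a single pass, so one-time costs of the first call (for example lazy symbol resolution, if the binary uses it, and cold caches) may be included in the total. A minimal sketch of a steadier measurement follows, assuming an x86-64 compiler that provides `<x86intrin.h>`: it makes one untimed warm-up call, fences around the TSC read, and keeps the minimum over many samples. `read_ticks` and `min_strlen_ticks` are hypothetical names, and the `clock_gettime` fallback for other architectures is an addition for portability only.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#if defined(__x86_64__)
#include <x86intrin.h>
#endif

/* Read a timestamp: fenced TSC on x86-64, monotonic nanoseconds elsewhere.  */
uint64_t read_ticks(void)
{
#if defined(__x86_64__)
    _mm_lfence();               /* let earlier instructions finish first */
    uint64_t t = __rdtsc();
    _mm_lfence();               /* keep later instructions from starting early */
    return t;
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* Hypothetical helper: smallest tick count observed for one strlen(s)
   call over `samples` timed calls, after one untimed warm-up call.
   The measured length is stored in *len_out when it is non-NULL.  */
uint64_t min_strlen_ticks(const char *s, size_t samples, size_t *len_out)
{
    /* The volatile function pointer prevents the call from being
       folded away or replaced by a compile-time constant.  */
    size_t (*volatile fn)(const char *) = strlen;
    uint64_t best = UINT64_MAX;
    size_t len = fn(s);         /* warm-up: symbol resolution, caches */

    for (size_t i = 0; i < samples; ++i) {
        uint64_t t0 = read_ticks();
        len = fn(s);
        uint64_t t1 = read_ticks();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    if (len_out != NULL)
        *len_out = len;
    return best;
}
```

The result still includes the cost of the two timestamp reads, so it is best used to compare two builds on the same machine and not as an absolute cycle count for strlen.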
Thread overview: 12+ messages

2024-04-26  4:03 x86-64: strlen-evex performance performance degradation compared to strlen-avx2 abush wang
2024-04-26 13:30 ` H.J. Lu
2024-04-26 16:53   ` Sunil Pandey
2024-04-28  2:13     ` abush wang
2024-04-28 16:12       ` Sunil Pandey
2024-04-28 16:16         ` H.J. Lu
2024-04-29 17:41           ` Sunil Pandey
2024-04-29 20:19             ` H.J. Lu
2024-04-30  0:54               ` Sunil Pandey
2024-04-30  2:51                 ` H.J. Lu
2024-04-30 20:16                   ` Sunil Pandey
2024-04-28  2:06   ` abush wang