* x86-64: strlen-evex performance degradation compared to strlen-avx2
@ 2024-04-26 4:03 abush wang
2024-04-26 13:30 ` H.J. Lu
0 siblings, 1 reply; 12+ messages in thread
From: abush wang @ 2024-04-26 4:03 UTC (permalink / raw)
To: H.J. Lu, abushwang via Libc-alpha
Hi, H.J.
While comparing glibc 2.28 and 2.38, I found a performance
degradation in strlen. The difference comes from the implementation
that is selected: 2.28 uses __strlen_avx2, while 2.38 uses __strlen_evex.
```
2.28
__strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:42
42 ENTRY (STRLEN)
2.38
__strlen_evex () at ../sysdeps/x86_64/multiarch/strlen-evex.S:79
79 ENTRY_P2ALIGN (STRLEN, 6)
```
This is my test:
```
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

#define MAX_STRINGS 100

// Read the x86 time-stamp counter
static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__ (
        "rdtsc" : "=a"(lo), "=d"(hi)
    );
    return ((uint64_t)hi << 32) | lo;
}

int main(int argc, char *argv[]) {
    char *input_str[MAX_STRINGS];
    size_t lengths[MAX_STRINGS];
    int num_strings = 0; // Number of input strings
    uint64_t start_cycles, end_cycles;

    // Parse command line arguments and store pointers in input_str array
    for (int i = 1; i < argc && num_strings < MAX_STRINGS; ++i) {
        input_str[num_strings] = argv[i];
        num_strings++;
    }

    // Avoid dividing by zero below when no strings are given
    if (num_strings == 0) {
        fprintf(stderr, "usage: %s string...\n", argv[0]);
        return 1;
    }

    // Measure the strlen operation for each string
    start_cycles = rdtsc();
    for (int i = 0; i < num_strings; ++i) {
        lengths[i] = strlen(input_str[i]);
    }
    end_cycles = rdtsc();

    unsigned long long total_cycle = end_cycles - start_cycles;
    unsigned long long av_cycle = total_cycle / num_strings;
    // Print the total cycles taken for the strlen operations
    printf("Total cycles: %llu av cycle: %llu \n", total_cycle, av_cycle);

    // Print the recorded lengths
    printf("Lengths of the input strings:\n");
    for (int i = 0; i < num_strings; ++i) {
        printf("String %d length: %zu\n", i, lengths[i]);
    }

    return 0;
}
```
This is the result:
```
2.28
./strlen_test str1 str2 str3 str4 str5
Total cycles: 1468 av cycle: 293
Lengths of the input strings:
String 0 length: 4
String 1 length: 4
String 2 length: 4
String 3 length: 4
String 4 length: 4
2.38
./strlen_test str1 str2 str3 str4 str5
Total cycles: 1814 av cycle: 362
Lengths of the input strings:
String 0 length: 4
String 1 length: 4
String 2 length: 4
String 3 length: 4
String 4 length: 4
```
Thanks,
abush
* Re: x86-64: strlen-evex performance degradation compared to strlen-avx2
2024-04-26 4:03 x86-64: strlen-evex performance degradation compared to strlen-avx2 abush wang
@ 2024-04-26 13:30 ` H.J. Lu
2024-04-26 16:53 ` Sunil Pandey
2024-04-28 2:06 ` abush wang
0 siblings, 2 replies; 12+ messages in thread
From: H.J. Lu @ 2024-04-26 13:30 UTC (permalink / raw)
To: abush wang, Sunil K Pandey, Noah Goldstein; +Cc: abushwang via Libc-alpha
Which processors did you use? Sunil, Noah, can we reproduce it?
--
H.J.
* Re: x86-64: strlen-evex performance degradation compared to strlen-avx2
2024-04-26 13:30 ` H.J. Lu
@ 2024-04-26 16:53 ` Sunil Pandey
2024-04-28 2:13 ` abush wang
2024-04-28 2:06 ` abush wang
1 sibling, 1 reply; 12+ messages in thread
From: Sunil Pandey @ 2024-04-26 16:53 UTC (permalink / raw)
To: H.J. Lu; +Cc: abush wang, Noah Goldstein, abushwang via Libc-alpha
I'm not sure how you are measuring the performance of the strlen function.
Are you drawing a performance conclusion from these 2 runs?
2.28
Total cycles: 1468 av cycle: 293
2.38
Total cycles: 1814 av cycle: 362
Please use the glibc microbenchmark to see whether you can reproduce the performance drop.
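One way to check whether a difference of this size is stable is to repeat the measurement many times and compare medians rather than single runs. The following is only a minimal sketch under stated assumptions, not the glibc benchtests harness: the helper names `now_ns`, `cmp_u64`, and `median_strlen_ns` and the sample counts are made up for illustration, and it uses POSIX `clock_gettime` rather than `rdtsc`.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* qsort comparator for uint64_t values. */
static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Hypothetical helper: take `samples` timings of `calls` strlen calls each
   and return the median timing in nanoseconds. */
static uint64_t median_strlen_ns(const char *s, size_t samples, size_t calls) {
    /* Calling through a volatile pointer keeps the compiler from folding
       or hoisting the strlen calls out of the loop. */
    size_t (*volatile fn)(const char *) = strlen;
    volatile size_t sink = 0;
    uint64_t *t = malloc(samples * sizeof *t);
    uint64_t median;

    assert(samples > 0 && calls > 0 && t != NULL);

    /* Warm-up pass so lazy symbol binding and cold caches are not timed. */
    for (size_t j = 0; j < calls; ++j)
        sink += fn(s);

    for (size_t i = 0; i < samples; ++i) {
        uint64_t start = now_ns();
        for (size_t j = 0; j < calls; ++j)
            sink += fn(s);
        t[i] = now_ns() - start;
    }

    /* The median is less sensitive to outlier runs than the mean. */
    qsort(t, samples, sizeof *t, cmp_u64);
    median = t[samples / 2];
    free(t);
    return median;
}
```

For example, `median_strlen_ns("str1", 101, 100000)` could be called from `main` under each glibc build and the two medians compared.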
* Re: x86-64: strlen-evex performance degradation compared to strlen-avx2
2024-04-26 13:30 ` H.J. Lu
2024-04-26 16:53 ` Sunil Pandey
@ 2024-04-28 2:06 ` abush wang
1 sibling, 0 replies; 12+ messages in thread
From: abush wang @ 2024-04-28 2:06 UTC (permalink / raw)
To: H.J. Lu; +Cc: Sunil K Pandey, Noah Goldstein, abushwang via Libc-alpha
This is my environment:
lscpu
...
BIOS Vendor ID: Intel(R) Corporation
Model name: Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz
BIOS Model name: Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz CPU @ 2.5GHz
...
I think you can run my demo in this environment to reproduce it.
* Re: x86-64: strlen-evex performance degradation compared to strlen-avx2
2024-04-26 16:53 ` Sunil Pandey
@ 2024-04-28 2:13 ` abush wang
2024-04-28 16:12 ` Sunil Pandey
0 siblings, 1 reply; 12+ messages in thread
From: abush wang @ 2024-04-28 2:13 UTC (permalink / raw)
To: Sunil Pandey; +Cc: H.J. Lu, Noah Goldstein, abushwang via Libc-alpha
Actually, I was investigating a performance issue reported by libmicro on our distro.
I found that the degradation in libmicro's localtime_r benchmark is
attributable to strlen, so I extracted this test case from it.
* Re: x86-64: strlen-evex performance degradation compared to strlen-avx2
2024-04-28 2:13 ` abush wang
@ 2024-04-28 16:12 ` Sunil Pandey
2024-04-28 16:16 ` H.J. Lu
0 siblings, 1 reply; 12+ messages in thread
From: Sunil Pandey @ 2024-04-28 16:12 UTC (permalink / raw)
To: abush wang; +Cc: H.J. Lu, Noah Goldstein, abushwang via Libc-alpha
On Sat, Apr 27, 2024 at 7:13 PM abush wang <abushwangs@gmail.com> wrote:
> Actually, I was handling performance issue from libmicro in our distro OS.
> I found that the performance degradation of localtime_r benchmark from
> libmicro is blame to strlen.
> So I abstracted this test case.
>
>
Can you consistently reproduce the strlen performance behaviour by running
the test multiple times back-to-back?
You may see a large swing from run to run.
* Re: x86-64: strlen-evex performance degradation compared to strlen-avx2
2024-04-28 16:12 ` Sunil Pandey
@ 2024-04-28 16:16 ` H.J. Lu
2024-04-29 17:41 ` Sunil Pandey
0 siblings, 1 reply; 12+ messages in thread
From: H.J. Lu @ 2024-04-28 16:16 UTC (permalink / raw)
To: Sunil Pandey; +Cc: abush wang, Noah Goldstein, abushwang via Libc-alpha
On Sun, Apr 28, 2024 at 9:13 AM Sunil Pandey <skpgkp2@gmail.com> wrote:
>
>
>
> On Sat, Apr 27, 2024 at 7:13 PM abush wang <abushwangs@gmail.com> wrote:
>>
>> Actually, I was handling performance issue from libmicro in our distro OS.
>> I found that the performance degradation of localtime_r benchmark from libmicro is blame to strlen.
>> So I abstracted this test case.
>>
>
> Can you consistently reproduce strlen perf behaviour by running multiple times back-to-back?
>
> You can see high swing from run
Hi Sunil,
Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz is SKX (Skylake server). Please add
this test to benchtests/bench-strlen.c and check its performance on SKX.
--
H.J.
* Re: x86-64: strlen-evex performance degradation compared to strlen-avx2
2024-04-28 16:16 ` H.J. Lu
@ 2024-04-29 17:41 ` Sunil Pandey
2024-04-29 20:19 ` H.J. Lu
0 siblings, 1 reply; 12+ messages in thread
From: Sunil Pandey @ 2024-04-29 17:41 UTC (permalink / raw)
To: H.J. Lu; +Cc: abush wang, Noah Goldstein, abushwang via Libc-alpha
I collected the glibc micro-benchmark data for the string length in question.
2.38 evex data:
length=4, alignment=4: 4.40
length=4, alignment=0: 4.29
length=4, alignment=0: 3.64
length=4, alignment=7: 3.64
length=4, alignment=2: 3.64
2.28 evex data:
Length 4, alignment 4: 6.46875
Length 4, alignment 0: 6.5
Length 4, alignment 0: 6.53125
Length 4, alignment 7: 6.46875
Length 4, alignment 2: 6.53125
Data collected on machine: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
As expected, the 2.38 performance numbers are better than the 2.28 ones.
--Sunil
* Re: x86-64: strlen-evex performance degradation compared to strlen-avx2
2024-04-29 17:41 ` Sunil Pandey
@ 2024-04-29 20:19 ` H.J. Lu
2024-04-30 0:54 ` Sunil Pandey
0 siblings, 1 reply; 12+ messages in thread
From: H.J. Lu @ 2024-04-29 20:19 UTC (permalink / raw)
To: Sunil Pandey; +Cc: abush wang, Noah Goldstein, abushwang via Libc-alpha
1. Please compare AVX2 vs EVEX strlen on the glibc master branch.
2. Please check strlen on strings with length == 4 and alignment = 0, 1, 2, 3.
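The alignment cases requested above can be set up by placing the same short string at different byte offsets inside an aligned buffer. This is a rough sketch, not the code in benchtests/bench-strlen.c; the names `BUF_ALIGN`, `BUF_SIZE`, `place_string`, and `check_alignments` are hypothetical, and it only verifies the setup. A timing loop would go where `strlen` is called.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { BUF_ALIGN = 64, BUF_SIZE = 128 };

/* Hypothetical helper: write a string of `len` 'a' characters at byte
   offset `align` inside `buf` and return a pointer to it. `buf` must be
   BUF_ALIGN-aligned and BUF_SIZE bytes long. */
static char *place_string(char *buf, size_t align, size_t len) {
    assert((uintptr_t)buf % BUF_ALIGN == 0);
    assert(align + len + 1 <= BUF_SIZE);
    memset(buf, 0, BUF_SIZE);       /* guarantees the terminating NUL */
    memset(buf + align, 'a', len);
    return buf + align;
}

/* Hypothetical helper: for alignments 0..max_align, count how many strings
   of length `len` are placed at the requested alignment and measured
   correctly by strlen. */
static size_t check_alignments(size_t max_align, size_t len) {
    char *buf = aligned_alloc(BUF_ALIGN, BUF_SIZE);
    size_t ok = 0;

    assert(buf != NULL && max_align < BUF_ALIGN);
    for (size_t align = 0; align <= max_align; ++align) {
        const char *s = place_string(buf, align, len);
        ok += ((uintptr_t)s % BUF_ALIGN == align) && (strlen(s) == len);
    }
    free(buf);
    return ok;
}
```

Calling `check_alignments(3, 4)` covers length 4 at alignments 0, 1, 2, and 3.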
--
H.J.
* Re: x86-64: strlen-evex performance degradation compared to strlen-avx2
2024-04-29 20:19 ` H.J. Lu
@ 2024-04-30 0:54 ` Sunil Pandey
2024-04-30 2:51 ` H.J. Lu
0 siblings, 1 reply; 12+ messages in thread
From: Sunil Pandey @ 2024-04-30 0:54 UTC (permalink / raw)
To: H.J. Lu; +Cc: abush wang, Noah Goldstein, abushwang via Libc-alpha
Data from master branch:
Data collected on Machine: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
                           __strlen_evex   __strlen_avx2
========================================================
length=4, alignment=0:         5.00            5.11
length=4, alignment=1:         4.92            4.80
length=4, alignment=2:         4.82            4.62
length=4, alignment=3:         4.62            4.92
length=4, alignment=4:         4.44            4.44
length=4, alignment=5:         4.59            4.29
length=4, alignment=6:         4.39            4.29
length=4, alignment=7:         4.14            4.14
length=4, alignment=8:         4.19            4.00
length=4, alignment=9:         4.00            4.00
length=4, alignment=10:        4.31            3.87
length=4, alignment=11:        3.96            3.87
length=4, alignment=12:        3.86            3.75
length=4, alignment=13:        3.75            3.75
length=4, alignment=14:        3.64            3.64
length=4, alignment=15:        3.64            3.72
length=4, alignment=16:        3.64            3.53
length=4, alignment=17:        3.63            3.53
length=4, alignment=18:        4.12            3.53
length=4, alignment=19:        3.43            3.43
length=4, alignment=20:        3.43            3.43
length=4, alignment=21:        3.33            3.33
length=4, alignment=22:        3.33            3.42
length=4, alignment=23:        3.33            3.33
length=4, alignment=24:        3.33            3.33
length=4, alignment=25:        3.33            3.33
length=4, alignment=26:        3.96            3.33
length=4, alignment=27:        3.33            3.41
length=4, alignment=28:        3.33            3.33
length=4, alignment=29:        3.41            3.33
length=4, alignment=30:        3.33            3.41
length=4, alignment=31:        3.33            3.33
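To put the two columns on a common scale, one option is the ratio of summed timings. The sketch below does this for the four alignments (0 to 3) that were asked about, using the values from the table above; the names `evex_time`, `avx2_time`, and `sum_ratio` are hypothetical, and it assumes that lower numbers mean faster.

```c
#include <assert.h>
#include <stddef.h>

/* Timings copied from the table above: length 4, alignments 0 to 3. */
static const double evex_time[] = { 5.00, 4.92, 4.82, 4.62 };
static const double avx2_time[] = { 5.11, 4.80, 4.62, 4.92 };

/* Hypothetical helper: ratio of summed timings, a / b. A value above 1.0
   means `a` was slower overall (assuming lower is better). */
static double sum_ratio(const double *a, const double *b, size_t n) {
    double sum_a = 0.0, sum_b = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; ++i) {
        sum_a += a[i];
        sum_b += b[i];
    }
    assert(sum_b > 0.0);
    return sum_a / sum_b;
}
```

`sum_ratio(evex_time, avx2_time, 4)` comes out slightly below 1.0 for these four rows, so the two implementations are within about half a percent of each other on this machine for these cases.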
--Sunil
* Re: x86-64: strlen-evex performance degradation compared to strlen-avx2
2024-04-30 0:54 ` Sunil Pandey
@ 2024-04-30 2:51 ` H.J. Lu
2024-04-30 20:16 ` Sunil Pandey
0 siblings, 1 reply; 12+ messages in thread
From: H.J. Lu @ 2024-04-30 2:51 UTC (permalink / raw)
To: Sunil Pandey; +Cc: abush wang, Noah Goldstein, abushwang via Libc-alpha
Hi Sunil,
strlen-avx2.S in glibc 2.28 release (tag glibc-2.28) is
different from strlen-avx2.S on master branch. Please
compare their performances.
--
H.J.
* Re: x86-64: strlen-evex performance performance degradation compared to strlen-avx2
2024-04-30 2:51 ` H.J. Lu
@ 2024-04-30 20:16 ` Sunil Pandey
0 siblings, 0 replies; 12+ messages in thread
From: Sunil Pandey @ 2024-04-30 20:16 UTC (permalink / raw)
To: H.J. Lu; +Cc: abush wang, Noah Goldstein, abushwang via Libc-alpha
On Mon, Apr 29, 2024 at 7:52 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> On Mon, Apr 29, 2024 at 5:55 PM Sunil Pandey <skpgkp2@gmail.com> wrote:
> >
> >
> >
> > On Mon, Apr 29, 2024 at 1:20 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >>
> >> On Mon, Apr 29, 2024 at 10:42 AM Sunil Pandey <skpgkp2@gmail.com> wrote:
> >> >
> >> >
> >> >
> >> > On Sun, Apr 28, 2024 at 9:17 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >> >>
> >> >> On Sun, Apr 28, 2024 at 9:13 AM Sunil Pandey <skpgkp2@gmail.com> wrote:
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Sat, Apr 27, 2024 at 7:13 PM abush wang <abushwangs@gmail.com> wrote:
> >> >> >>
> >> >> >> Actually, I was handling a performance issue from libmicro in our distro OS.
> >> >> >> I found that the performance degradation of the localtime_r benchmark from libmicro is due to strlen.
> >> >> >> So I extracted this test case.
> >> >> >>
> >> >> >
> >> >> > Can you consistently reproduce strlen perf behaviour by running multiple times back-to-back?
> >> >> >
> >> >> > You can see high swing from run
> >> >>
> >> >> Hi Sunil,
> >> >>
> >> >> Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz is SKX. Please add this test to
> >> >> benchtests/bench-strlen.c and check its performance on SKX.
> >> >>
> >> >> --
> >> >> H.J.
> >> >
> >> >
> >> > I collected the glibc micro-benchmark data for the string length in question.
> >> >
> >> > 2.38 evex data:
> >> >
> >> > length=4, alignment=4: 4.40
> >> > length=4, alignment=0: 4.29
> >> > length=4, alignment=0: 3.64
> >> > length=4, alignment=7: 3.64
> >> > length=4, alignment=2: 3.64
> >> >
> >> > 2.28 evex data:
> >> >
> >> > Length 4, alignment 4: 6.46875
> >> > Length 4, alignment 0: 6.5
> >> > Length 4, alignment 0: 6.53125
> >> > Length 4, alignment 7: 6.46875
> >> > Length 4, alignment 2: 6.53125
> >> >
> >> > Data collected on Machine: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
> >> >
> >> > 2.38 perf numbers are better than 2.28 as expected.
> >>
> >> 1. Please compare AVX2 vs EVEX strlen on glibc master branch.
> >> 2. Please check strlen on strings of length == 4 and alignments = 0, 1, 2, 3.
> >>
> >> --
> >> H.J.
> >
> >
> > Data from master branch:
> >
> > Data collected on Machine: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
> >
> > __strlen_evex __strlen_avx2
> > =======================================================
> > length=4, alignment=0: 5.00 5.11
> > length=4, alignment=1: 4.92 4.80
> > length=4, alignment=2: 4.82 4.62
> > length=4, alignment=3: 4.62 4.92
> > length=4, alignment=4: 4.44 4.44
> > length=4, alignment=5: 4.59 4.29
> > length=4, alignment=6: 4.39 4.29
> > length=4, alignment=7: 4.14 4.14
> > length=4, alignment=8: 4.19 4.00
> > length=4, alignment=9: 4.00 4.00
> > length=4, alignment=10: 4.31 3.87
> > length=4, alignment=11: 3.96 3.87
> > length=4, alignment=12: 3.86 3.75
> > length=4, alignment=13: 3.75 3.75
> > length=4, alignment=14: 3.64 3.64
> > length=4, alignment=15: 3.64 3.72
> > length=4, alignment=16: 3.64 3.53
> > length=4, alignment=17: 3.63 3.53
> > length=4, alignment=18: 4.12 3.53
> > length=4, alignment=19: 3.43 3.43
> > length=4, alignment=20: 3.43 3.43
> > length=4, alignment=21: 3.33 3.33
> > length=4, alignment=22: 3.33 3.42
> > length=4, alignment=23: 3.33 3.33
> > length=4, alignment=24: 3.33 3.33
> > length=4, alignment=25: 3.33 3.33
> > length=4, alignment=26: 3.96 3.33
> > length=4, alignment=27: 3.33 3.41
> > length=4, alignment=28: 3.33 3.33
> > length=4, alignment=29: 3.41 3.33
> > length=4, alignment=30: 3.33 3.41
> > length=4, alignment=31: 3.33 3.33
> >
> > --Sunil
>
> Hi Sunil,
>
> strlen-avx2.S in glibc 2.28 release (tag glibc-2.28) is
> different from strlen-avx2.S on master branch. Please
> compare their performances.
>
> --
> H.J.
>
I tested strlen implementations with different alignment combinations.
                         __strlen_evex(master)  __strlen_avx2(master)  __strlen_avx2(2.28)
==========================================================================================
length=4, alignment=0:   5.00                   5.09                   8.00
length=4, alignment=1:   4.80                   4.80                   7.78
length=4, alignment=2:   4.71                   4.62                   7.46
length=4, alignment=3:   4.44                   4.55                   7.11
length=4, alignment=4:   4.44                   4.45                   7.23
length=4, alignment=5:   4.29                   4.29                   6.86
length=4, alignment=6:   4.14                   4.14                   6.76
length=4, alignment=7:   4.00                   4.00                   6.40
length=4, alignment=8:   4.00                   4.00                   6.50
length=4, alignment=9:   3.87                   3.87                   6.29
length=4, alignment=10:  3.75                   3.85                   6.00
length=4, alignment=11:  3.75                   3.75                   6.00
length=4, alignment=12:  3.76                   3.64                   5.82
length=4, alignment=13:  3.64                   3.64                   6.08
length=4, alignment=14:  3.53                   3.53                   5.74
length=4, alignment=15:  3.53                   3.53                   5.74
length=4, alignment=16:  3.43                   3.43                   5.57
length=4, alignment=17:  3.43                   3.43                   5.67
length=4, alignment=18:  3.33                   3.33                   5.41
length=4, alignment=19:  3.33                   3.33                   5.44
length=4, alignment=20:  3.33                   3.33                   5.41
length=4, alignment=21:  3.33                   3.33                   5.43
length=4, alignment=22:  3.33                   3.33                   5.41
length=4, alignment=23:  3.33                   3.33                   5.41
length=4, alignment=24:  3.33                   3.33                   5.33
length=4, alignment=25:  3.41                   3.33                   5.33
length=4, alignment=26:  3.86                   3.33                   5.33
length=4, alignment=27:  3.42                   3.33                   5.33
length=4, alignment=28:  3.33                   3.33                   5.33
length=4, alignment=29:  3.33                   3.33                   5.33
length=4, alignment=30:  3.33                   3.33                   5.33
length=4, alignment=31:  3.33                   3.33                   5.33
Based on the data, both the AVX2 and EVEX versions on master are faster than
the AVX2 version in glibc-2.28, as expected.
--Sunil
Thread overview: 12+ messages
2024-04-26 4:03 x86-64: strlen-evex performance performance degradation compared to strlen-avx2 abush wang
2024-04-26 13:30 ` H.J. Lu
2024-04-26 16:53 ` Sunil Pandey
2024-04-28 2:13 ` abush wang
2024-04-28 16:12 ` Sunil Pandey
2024-04-28 16:16 ` H.J. Lu
2024-04-29 17:41 ` Sunil Pandey
2024-04-29 20:19 ` H.J. Lu
2024-04-30 0:54 ` Sunil Pandey
2024-04-30 2:51 ` H.J. Lu
2024-04-30 20:16 ` Sunil Pandey
2024-04-28 2:06 ` abush wang