Re: [PATCH v1 3/6] x86: Remove mem{move|cpy}-ssse3

public inbox for libc-alpha@sourceware.org

* Re: [PATCH v1 3/6] x86: Remove mem{move|cpy}-ssse3
@ 2022-03-28  8:10 Mayshao-oc
  2022-03-28 13:07 ` H.J. Lu
  0 siblings, 1 reply; 10+ messages in thread
From: Mayshao-oc @ 2022-03-28  8:10 UTC (permalink / raw)
  To: goldstein.w.n
  Cc: GNU C Library, H.J. Lu, Florian Weimer, Carlos O'Donell,
	Louis Qi(BJ-RD)

On Fri, Mar 25, 2022 at 6:36 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:

> With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
> SSSE3. As a result it's no longer worth the code size cost.
> ---
> sysdeps/x86_64/multiarch/Makefile          |    2 -
> sysdeps/x86_64/multiarch/ifunc-impl-list.c |   15 -
> sysdeps/x86_64/multiarch/ifunc-memmove.h   |   18 +-
> sysdeps/x86_64/multiarch/memcpy-ssse3.S    | 3151 --------------------
> sysdeps/x86_64/multiarch/memmove-ssse3.S   |    4 -
> 5 files changed, 7 insertions(+), 3183 deletions(-)
> delete mode 100644 sysdeps/x86_64/multiarch/memcpy-ssse3.S
> delete mode 100644 sysdeps/x86_64/multiarch/memmove-ssse3.S


On some platforms, such as Zhaoxin, the SSSE3 memcpy performs better
than the AVX2 version, and current systems have enough disk and memory
capacity that the extra code size is not a concern.

We strongly recommend keeping the SSSE3 version.

Best Regards,
May Shao




* Re: [PATCH v1 3/6] x86: Remove mem{move|cpy}-ssse3
  2022-03-28  8:10 [PATCH v1 3/6] x86: Remove mem{move|cpy}-ssse3 Mayshao-oc
@ 2022-03-28 13:07 ` H.J. Lu
  2022-03-29  2:51   ` Mayshao-oc
  0 siblings, 1 reply; 10+ messages in thread
From: H.J. Lu @ 2022-03-28 13:07 UTC (permalink / raw)
  To: Mayshao-oc
  Cc: goldstein.w.n, GNU C Library, Florian Weimer,
	Carlos O'Donell, Louis Qi(BJ-RD)

On Mon, Mar 28, 2022 at 1:10 AM Mayshao-oc <Mayshao-oc@zhaoxin.com> wrote:
>
> On Fri, Mar 25, 2022 at 6:36 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> > With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
> > SSSE3. As a result its no longer with the code size cost.
> > ---
> > sysdeps/x86_64/multiarch/Makefile          |    2 -
> > sysdeps/x86_64/multiarch/ifunc-impl-list.c |   15 -
> > sysdeps/x86_64/multiarch/ifunc-memmove.h   |   18 +-
> > sysdeps/x86_64/multiarch/memcpy-ssse3.S    | 3151 --------------------
> > sysdeps/x86_64/multiarch/memmove-ssse3.S   |    4 -
> > 5 files changed, 7 insertions(+), 3183 deletions(-)
> > delete mode 100644 sysdeps/x86_64/multiarch/memcpy-ssse3.S
> > delete mode 100644 sysdeps/x86_64/multiarch/memmove-ssse3.S
>
> On some platforms, such as Zhaoxin, the memcpy performance of SSSE3
> is better than that of AVX2, and the current computer system has sufficient
> disk capacity and memory capacity.

How does the SSSE3 version compare against the SSE2 version?

> It is strongly recommended to keep the SSSE3 version.
>


-- 
H.J.


* Re: [PATCH v1 3/6] x86: Remove mem{move|cpy}-ssse3
  2022-03-28 13:07 ` H.J. Lu
@ 2022-03-29  2:51   ` Mayshao-oc
  2022-03-29  2:57     ` Noah Goldstein
  0 siblings, 1 reply; 10+ messages in thread
From: Mayshao-oc @ 2022-03-29  2:51 UTC (permalink / raw)
  To: H.J. Lu
  Cc: goldstein.w.n, GNU C Library, Florian Weimer,
	Carlos O'Donell, Louis Qi(BJ-RD)

On Mon, Mar 28, 2022 at  9:07 PM H.J. Lu <hjl.tools@gmail.com> wrote:

> On Mon, Mar 28, 2022 at 1:10 AM Mayshao-oc <Mayshao-oc@zhaoxin.com> wrote:
> >
> > On Fri, Mar 25, 2022 at 6:36 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > > With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
> > > SSSE3. As a result its no longer with the code size cost.
> > > ---
> > > sysdeps/x86_64/multiarch/Makefile          |    2 -
> > > sysdeps/x86_64/multiarch/ifunc-impl-list.c |   15 -
> > > sysdeps/x86_64/multiarch/ifunc-memmove.h   |   18 +-
> > > sysdeps/x86_64/multiarch/memcpy-ssse3.S    | 3151 --------------------
> > > sysdeps/x86_64/multiarch/memmove-ssse3.S   |    4 -
> > > 5 files changed, 7 insertions(+), 3183 deletions(-)
> > > delete mode 100644 sysdeps/x86_64/multiarch/memcpy-ssse3.S
> > > delete mode 100644 sysdeps/x86_64/multiarch/memmove-ssse3.S
> >
> > On some platforms, such as Zhaoxin, the memcpy performance of SSSE3
> > is better than that of AVX2, and the current computer system has sufficient
> > disk capacity and memory capacity.
>
> How does the SSSE3 version compare against the SSE2 version?

On some Zhaoxin processors, the overall performance of SSSE3 is about
10% higher than that of SSE2.


Best Regards,
May Shao

> > It is strongly recommended to keep the SSSE3 version.
> >
>
>
> --
> H.J.


* Re: [PATCH v1 3/6] x86: Remove mem{move|cpy}-ssse3
  2022-03-29  2:51   ` Mayshao-oc
@ 2022-03-29  2:57     ` Noah Goldstein
  2022-03-30  9:56       ` Mayshao-oc
  0 siblings, 1 reply; 10+ messages in thread
From: Noah Goldstein @ 2022-03-29  2:57 UTC (permalink / raw)
  To: Mayshao-oc
  Cc: H.J. Lu, GNU C Library, Florian Weimer, Carlos O'Donell,
	Louis Qi(BJ-RD)

On Mon, Mar 28, 2022 at 9:51 PM Mayshao-oc <Mayshao-oc@zhaoxin.com> wrote:
>
> On Mon, Mar 28, 2022 at  9:07 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
>
> > On Mon, Mar 28, 2022 at 1:10 AM Mayshao-oc <Mayshao-oc@zhaoxin.com> wrote:
> > >
> > > On Fri, Mar 25, 2022 at 6:36 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > > With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
> > > > SSSE3. As a result its no longer with the code size cost.
> > > > ---
> > > > sysdeps/x86_64/multiarch/Makefile          |    2 -
> > > > sysdeps/x86_64/multiarch/ifunc-impl-list.c |   15 -
> > > > sysdeps/x86_64/multiarch/ifunc-memmove.h   |   18 +-
> > > > sysdeps/x86_64/multiarch/memcpy-ssse3.S    | 3151 --------------------
> > > > sysdeps/x86_64/multiarch/memmove-ssse3.S   |    4 -
> > > > 5 files changed, 7 insertions(+), 3183 deletions(-)
> > > > delete mode 100644 sysdeps/x86_64/multiarch/memcpy-ssse3.S
> > > > delete mode 100644 sysdeps/x86_64/multiarch/memmove-ssse3.S
> > >
> > > On some platforms, such as Zhaoxin, the memcpy performance of SSSE3
> > > is better than that of AVX2, and the current computer system has sufficient
> > > disk capacity and memory capacity.
> >
> > How does the SSSE3 version compare against the SSE2 version?
>
> On some Zhaoxin processors, the overall performance of SSSE3 is about
> 10% higher than that of SSE2.
>
>
> Best Regards,
> May Shao

Any chance you can post the results from running `bench-memset` or some
equivalent benchmark? I'm curious where the regressions are. Ideally we would
fix the SSE2 version so it's optimal.
>
> > > It is strongly recommended to keep the SSSE3 version.
> > >
> >
> >
> > --
> > H.J.


* Re: [PATCH v1 3/6] x86: Remove mem{move|cpy}-ssse3
  2022-03-29  2:57     ` Noah Goldstein
@ 2022-03-30  9:56       ` Mayshao-oc
  2022-03-30 16:45         ` Noah Goldstein
  0 siblings, 1 reply; 10+ messages in thread
From: Mayshao-oc @ 2022-03-30  9:56 UTC (permalink / raw)
  To: Noah Goldstein
  Cc: H.J. Lu, GNU C Library, Florian Weimer, Carlos O'Donell,
	Louis Qi(BJ-RD)

[-- Attachment #1: Type: text/plain, Size: 2145 bytes --]

On Tue, Mar 29, 2022 at 10:57 AM Noah Goldstein<goldstein.w.n@gmail.com> wrote:


>On Mon, Mar 28, 2022 at 9:51 PM Mayshao-oc <Mayshao-oc@zhaoxin.com> wrote:
> >
> > On Mon, Mar 28, 2022 at  9:07 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> >
> > > On Mon, Mar 28, 2022 at 1:10 AM Mayshao-oc <Mayshao-oc@zhaoxin.com> wrote:
> > > >
> > > > On Fri, Mar 25, 2022 at 6:36 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > >
> > > > > With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
> > > > > SSSE3. As a result its no longer with the code size cost.
> > > > > ---
> > > > > sysdeps/x86_64/multiarch/Makefile          |    2 -
> > > > > sysdeps/x86_64/multiarch/ifunc-impl-list.c |   15 -
> > > > > sysdeps/x86_64/multiarch/ifunc-memmove.h   |   18 +-
> > > > > sysdeps/x86_64/multiarch/memcpy-ssse3.S    | 3151 --------------------
> > > > > sysdeps/x86_64/multiarch/memmove-ssse3.S   |    4 -
> > > > > 5 files changed, 7 insertions(+), 3183 deletions(-)
> > > > > delete mode 100644 sysdeps/x86_64/multiarch/memcpy-ssse3.S
> > > > > delete mode 100644 sysdeps/x86_64/multiarch/memmove-ssse3.S
> > > >
> > > > On some platforms, such as Zhaoxin, the memcpy performance of SSSE3
> > > > is better than that of AVX2, and the current computer system has sufficient
> > > > disk capacity and memory capacity.
> > >
> > > How does the SSSE3 version compare against the SSE2 version?
> >
> > On some Zhaoxin processors, the overall performance of SSSE3 is about
> > 10% higher than that of SSE2.
> >
> >
> > Best Regards,
> > May Shao
>
> Any chance you can post the result from running `bench-memset` or some
> equivalent benchmark? Curious where the regressions are. Ideally we would
> fix the SSE2 version so its optimal.

bench-memcpy on a Zhaoxin KX-6000 processor shows that, when length <= 4 or
length >= 128, the SSSE3 memcpy achieves an average performance improvement
of 25% over SSE2.

I have attached the test results; I hope this is what you wanted to see.

> > > > It is strongly recommended to keep the SSSE3 version.
> > > >
> > >
> > >
> > > --
> > > H.J.

[-- Attachment #2: bench-memcpy.pdf --]
[-- Type: application/pdf, Size: 238958 bytes --]


* Re: [PATCH v1 3/6] x86: Remove mem{move|cpy}-ssse3
  2022-03-30  9:56       ` Mayshao-oc
@ 2022-03-30 16:45         ` Noah Goldstein
  2022-03-30 16:54           ` Noah Goldstein
  2022-03-31  3:34           ` Mayshao-oc
  0 siblings, 2 replies; 10+ messages in thread
From: Noah Goldstein @ 2022-03-30 16:45 UTC (permalink / raw)
  To: Mayshao-oc
  Cc: H.J. Lu, GNU C Library, Florian Weimer, Carlos O'Donell,
	Louis Qi(BJ-RD)

On Wed, Mar 30, 2022 at 4:57 AM Mayshao-oc <Mayshao-oc@zhaoxin.com> wrote:
>
> On Tue, Mar 29, 2022 at 10:57 AM Noah Goldstein<goldstein.w.n@gmail.com> wrote:
>
>
> >On Mon, Mar 28, 2022 at 9:51 PM Mayshao-oc <Mayshao-oc@zhaoxin.com> wrote:
> > >
> > > On Mon, Mar 28, 2022 at  9:07 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > >
> > > > On Mon, Mar 28, 2022 at 1:10 AM Mayshao-oc <Mayshao-oc@zhaoxin.com> wrote:
> > > > >
> > > > > On Fri, Mar 25, 2022 at 6:36 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > > >
> > > > > > With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
> > > > > > SSSE3. As a result its no longer with the code size cost.
> > > > > > ---
> > > > > > sysdeps/x86_64/multiarch/Makefile          |    2 -
> > > > > > sysdeps/x86_64/multiarch/ifunc-impl-list.c |   15 -
> > > > > > sysdeps/x86_64/multiarch/ifunc-memmove.h   |   18 +-
> > > > > > sysdeps/x86_64/multiarch/memcpy-ssse3.S    | 3151 --------------------
> > > > > > sysdeps/x86_64/multiarch/memmove-ssse3.S   |    4 -
> > > > > > 5 files changed, 7 insertions(+), 3183 deletions(-)
> > > > > > delete mode 100644 sysdeps/x86_64/multiarch/memcpy-ssse3.S
> > > > > > delete mode 100644 sysdeps/x86_64/multiarch/memmove-ssse3.S
> > > > >
> > > > > On some platforms, such as Zhaoxin, the memcpy performance of SSSE3
> > > > > is better than that of AVX2, and the current computer system has sufficient
> > > > > disk capacity and memory capacity.
> > > >
> > > > How does the SSSE3 version compare against the SSE2 version?
> > >
> > > On some Zhaoxin processors, the overall performance of SSSE3 is about
> > > 10% higher than that of SSE2.
> > >
> > >
> > > Best Regards,
> > > May Shao
> >
> > Any chance you can post the result from running `bench-memset` or some
> > equivalent benchmark? Curious where the regressions are. Ideally we would
> > fix the SSE2 version so its optimal.
>
> Bench-memcpy on Zhaoxin KX-6000 processor shows that, when length <=4 or
> length >= 128, memcpy SSSE3 can achieve an average performance improvement
> of 25% compared to SSSE2.

Thanks

The size <= 4 regression is expected, as profiles of SPEC show the [5, 32] sized
copies to be significantly hotter.

Regarding the large sizes, it seems to be because the SSSE3 version avoids
unaligned loads/stores much more aggressively.
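
To make "avoids unaligned stores" concrete, here is a minimal C sketch of a
copy whose main loop only ever stores to 16-byte-aligned destination
addresses. It is an illustration under stated assumptions, not the glibc
code: `VEC_SIZE`, `bytes_to_alignment` and `copy_dst_aligned` are
hypothetical names, and the inner `memcpy` calls stand in for vector
loads and stores. The loads in this sketch can still be unaligned;
removing those as well needs a byte-realignment step on the source data.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VEC_SIZE 16   /* assumed vector width in bytes (one XMM register) */

/* Number of bytes to copy before `p` reaches a VEC_SIZE boundary.  */
static size_t
bytes_to_alignment (const void *p)
{
  return (size_t) (-(uintptr_t) p & (VEC_SIZE - 1));
}

/* Copy n bytes so that every VEC_SIZE-sized store in the main loop
   targets a VEC_SIZE-aligned destination address.  */
static void *
copy_dst_aligned (void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;
  size_t head = bytes_to_alignment (d);

  if (head > n)
    head = n;
  memcpy (d, s, head);          /* unaligned head, at most VEC_SIZE - 1 bytes */
  d += head;
  s += head;
  n -= head;

  while (n >= VEC_SIZE)
    {
      memcpy (d, s, VEC_SIZE);  /* d is VEC_SIZE-aligned here */
      d += VEC_SIZE;
      s += VEC_SIZE;
      n -= VEC_SIZE;
    }

  memcpy (d, s, n);             /* tail, fewer than VEC_SIZE bytes */
  return dst;
}
```

The head and tail are short and cheap; the point of the split is that the
bulk of a large copy runs through the aligned loop.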

For now we will keep the function. Will look into a replacement that isn't so
costly to code size.

Out of curiosity, is bench-memcpy-random performance also improved with
SSSE3? The jump table / branches generally look really nice in micro-benchmarks
but that may not be fully indicative of how it will perform in an
application.
>
> I have attached the test results, hope this is what you want to see.
>
> > > > > It is strongly recommended to keep the SSSE3 version.
> > > > >
> > > >
> > > >
> > > > --
> > > > H.J.


* Re: [PATCH v1 3/6] x86: Remove mem{move|cpy}-ssse3
  2022-03-30 16:45         ` Noah Goldstein
@ 2022-03-30 16:54           ` Noah Goldstein
  2022-03-31  3:34           ` Mayshao-oc
  1 sibling, 0 replies; 10+ messages in thread
From: Noah Goldstein @ 2022-03-30 16:54 UTC (permalink / raw)
  To: Mayshao-oc
  Cc: H.J. Lu, GNU C Library, Florian Weimer, Carlos O'Donell,
	Louis Qi(BJ-RD)

On Wed, Mar 30, 2022 at 11:45 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Wed, Mar 30, 2022 at 4:57 AM Mayshao-oc <Mayshao-oc@zhaoxin.com> wrote:
> >
> > On Tue, Mar 29, 2022 at 10:57 AM Noah Goldstein<goldstein.w.n@gmail.com> wrote:
> >
> >
> > >On Mon, Mar 28, 2022 at 9:51 PM Mayshao-oc <Mayshao-oc@zhaoxin.com> wrote:
> > > >
> > > > On Mon, Mar 28, 2022 at  9:07 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > >
> > > >
> > > > > On Mon, Mar 28, 2022 at 1:10 AM Mayshao-oc <Mayshao-oc@zhaoxin.com> wrote:
> > > > > >
> > > > > > On Fri, Mar 25, 2022 at 6:36 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > > > >
> > > > > > > With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
> > > > > > > SSSE3. As a result its no longer with the code size cost.
> > > > > > > ---
> > > > > > > sysdeps/x86_64/multiarch/Makefile          |    2 -
> > > > > > > sysdeps/x86_64/multiarch/ifunc-impl-list.c |   15 -
> > > > > > > sysdeps/x86_64/multiarch/ifunc-memmove.h   |   18 +-
> > > > > > > sysdeps/x86_64/multiarch/memcpy-ssse3.S    | 3151 --------------------
> > > > > > > sysdeps/x86_64/multiarch/memmove-ssse3.S   |    4 -
> > > > > > > 5 files changed, 7 insertions(+), 3183 deletions(-)
> > > > > > > delete mode 100644 sysdeps/x86_64/multiarch/memcpy-ssse3.S
> > > > > > > delete mode 100644 sysdeps/x86_64/multiarch/memmove-ssse3.S
> > > > > >
> > > > > > On some platforms, such as Zhaoxin, the memcpy performance of SSSE3
> > > > > > is better than that of AVX2, and the current computer system has sufficient
> > > > > > disk capacity and memory capacity.
> > > > >
> > > > > How does the SSSE3 version compare against the SSE2 version?
> > > >
> > > > On some Zhaoxin processors, the overall performance of SSSE3 is about
> > > > 10% higher than that of SSE2.
> > > >
> > > >
> > > > Best Regards,
> > > > May Shao
> > >
> > > Any chance you can post the result from running `bench-memset` or some
> > > equivalent benchmark? Curious where the regressions are. Ideally we would
> > > fix the SSE2 version so its optimal.
> >
> > Bench-memcpy on Zhaoxin KX-6000 processor shows that, when length <=4 or
> > length >= 128, memcpy SSSE3 can achieve an average performance improvement
> > of 25% compared to SSSE2.
>
> Thanks
>
> The size <= 4 regression is expected as profiles of SPEC show the [5, 32] sized
> copies to significantly hotter.
>
> Regarding the large sizes, it seems to be because the SSSE3 version avoids
> unaligned loads/stores much more aggressively.
>
> For now we will keep the function. Will look into a replacement that isn't so
> costly to code size.
>
> Out of curiosity, is bench-memcpy-random performance also improved with
> SSSE3? The jump table / branches generally look really nice in micro-benchmarks
> but that may not be fully indicative of how it will performance in an
> application.
> >
> > I have attached the test results, hope this is what you want to see.
> >
> > > > > > It is strongly recommended to keep the SSSE3 version.

Will you guys have any issues if we upgrade the unaligned memcpy
to sse4.1? That will allow us to use `pshufb` and get rid of the jump
table and excessive code size.
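
As a rough illustration of why a register-controlled byte shuffle can
replace a jump table, the scalar C model below extracts 16 bytes at an
arbitrary byte offset from two adjacent 16-byte blocks. It assumes the
jump table exists because each misalignment needs its own code path when
the shift amount is encoded as an immediate. With a shuffle, the shift only
changes the control bytes, so a single code path covers all 16 cases.
`shuffle16`, `extract16` and the self-check helper are hypothetical names;
`shuffle16` models the byte-shuffle semantics in plain C and is not the
instruction itself.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of a 16-byte byte shuffle: out[i] is 0 if the high bit of
   ctrl[i] is set, otherwise in[ctrl[i] & 15].  */
static void
shuffle16 (uint8_t out[16], const uint8_t in[16], const uint8_t ctrl[16])
{
  for (int i = 0; i < 16; i++)
    out[i] = (ctrl[i] & 0x80) ? 0 : in[ctrl[i] & 15];
}

/* Extract 16 bytes starting at byte offset `shift` (0..15) from the
   32-byte concatenation lo:hi.  The shift is plain data here, so no
   per-shift code path is needed.  */
static void
extract16 (uint8_t out[16], const uint8_t lo[16], const uint8_t hi[16],
           unsigned shift)
{
  uint8_t ctrl_lo[16], ctrl_hi[16], from_lo[16], from_hi[16];

  for (unsigned i = 0; i < 16; i++)
    {
      unsigned idx = i + shift;
      /* Each output byte comes from exactly one block; the other
         block's control byte zeroes its contribution.  */
      ctrl_lo[i] = idx < 16 ? (uint8_t) idx : 0x80;
      ctrl_hi[i] = idx >= 16 ? (uint8_t) (idx - 16) : 0x80;
    }

  shuffle16 (from_lo, lo, ctrl_lo);
  shuffle16 (from_hi, hi, ctrl_hi);
  for (int i = 0; i < 16; i++)
    out[i] = from_lo[i] | from_hi[i];
}

/* Self-check: returns 1 if extract16 matches a plain byte-offset read.  */
static int
extract16_matches_offset_read (unsigned shift)
{
  uint8_t buf[32], out[16];

  for (int i = 0; i < 32; i++)
    buf[i] = (uint8_t) (i * 7 + 1);
  extract16 (out, buf, buf + 16, shift);
  return memcmp (out, buf + shift, 16) == 0;
}
```

In a real implementation the two control vectors would typically be loaded
from a small table indexed by the shift, which is far smaller than one
unrolled copy loop per shift value.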

> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > H.J.


* Re:Re: [PATCH v1 3/6] x86: Remove mem{move|cpy}-ssse3
  2022-03-30 16:45         ` Noah Goldstein
  2022-03-30 16:54           ` Noah Goldstein
@ 2022-03-31  3:34           ` Mayshao-oc
  2022-03-31  3:47             ` Noah Goldstein
  1 sibling, 1 reply; 10+ messages in thread
From: Mayshao-oc @ 2022-03-31  3:34 UTC (permalink / raw)
  To: Noah Goldstein
  Cc: H.J. Lu, GNU C Library, Florian Weimer, Carlos O'Donell,
	Louis Qi(BJ-RD)

On Thur, Mar 31, 2022 at 12:45 AM Noah Goldstein<goldstein.w.n@gmail.com> wrote:


> On Wed, Mar 30, 2022 at 4:57 AM Mayshao-oc <Mayshao-oc@zhaoxin.com> wrote:
> >
> > On Tue, Mar 29, 2022 at 10:57 AM Noah Goldstein<goldstein.w.n@gmail.com> wrote:
> >
> >
> > >On Mon, Mar 28, 2022 at 9:51 PM Mayshao-oc <Mayshao-oc@zhaoxin.com> wrote:
> > > >
> > > > On Mon, Mar 28, 2022 at  9:07 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > >
> > > >
> > > > > On Mon, Mar 28, 2022 at 1:10 AM Mayshao-oc <Mayshao-oc@zhaoxin.com> wrote:
> > > > > >
> > > > > > On Fri, Mar 25, 2022 at 6:36 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > > > >
> > > > > > > With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
> > > > > > > SSSE3. As a result its no longer with the code size cost.
> > > > > > > ---
> > > > > > > sysdeps/x86_64/multiarch/Makefile          |    2 -
> > > > > > > sysdeps/x86_64/multiarch/ifunc-impl-list.c |   15 -
> > > > > > > sysdeps/x86_64/multiarch/ifunc-memmove.h   |   18 +-
> > > > > > > sysdeps/x86_64/multiarch/memcpy-ssse3.S    | 3151 --------------------
> > > > > > > sysdeps/x86_64/multiarch/memmove-ssse3.S   |    4 -
> > > > > > > 5 files changed, 7 insertions(+), 3183 deletions(-)
> > > > > > > delete mode 100644 sysdeps/x86_64/multiarch/memcpy-ssse3.S
> > > > > > > delete mode 100644 sysdeps/x86_64/multiarch/memmove-ssse3.S
> > > > > >
> > > > > > On some platforms, such as Zhaoxin, the memcpy performance of SSSE3
> > > > > > is better than that of AVX2, and the current computer system has sufficient
> > > > > > disk capacity and memory capacity.
> > > > >
> > > > > How does the SSSE3 version compare against the SSE2 version?
> > > >
> > > > On some Zhaoxin processors, the overall performance of SSSE3 is about
> > > > 10% higher than that of SSE2.
> > > >
> > > >
> > > > Best Regards,
> > > > May Shao
> > >
> > > Any chance you can post the result from running `bench-memset` or some
> > > equivalent benchmark? Curious where the regressions are. Ideally we would
> > > fix the SSE2 version so its optimal.
> >
> > Bench-memcpy on Zhaoxin KX-6000 processor shows that, when length <=4 or
> > length >= 128, memcpy SSSE3 can achieve an average performance improvement
> > of 25% compared to SSSE2.
>
> Thanks
>
> The size <= 4 regression is expected as profiles of SPEC show the [5, 32] sized
> copies to significantly hotter.
>
> Regarding the large sizes, it seems to be because the SSSE3 version avoids
> unaligned loads/stores much more aggressively.

Agree.

> For now we will keep the function. Will look into a replacement that isn't so
> costly to code size.

Thanks very much for your support.

> Out of curiosity, is bench-memcpy-random performance also improved with
> SSSE3? The jump table / branches generally look really nice in micro-benchmarks
> but that may not be fully indicative of how it will performance in an
> application.

bench-memcpy-random shows about a 5% performance drop for SSSE3:

                __memcpy_sse2_unaligned  __memcpy_ssse3  Improvement (ssse3 over sse2)
length=32768                     805982          874585                         -8.51%
length=65536                     885317          940458                         -6.23%
length=131072                    929177          979173                         -5.38%
length=262144                    980083         1033130                         -5.41%
length=524288                   1042590         1095560                         -5.08%
length=1048576                  1078020         1127990                         -4.64%
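
For reference, the last column of the table above can be reproduced from the
two measurement columns. The sketch below assumes the measurements are
lower-is-better values (for example timings) and that the improvement is
computed relative to the SSE2 baseline; both assumptions are consistent
with the reported percentages. `improvement_pct` and `print_row` are
hypothetical helper names.

```c
#include <stdio.h>

/* Relative improvement of `candidate` over `baseline`, in percent, for a
   lower-is-better metric.  A negative result means the candidate is
   slower than the baseline.  */
static double
improvement_pct (double baseline, double candidate)
{
  return (baseline - candidate) / baseline * 100.0;
}

/* Print one row: length, SSE2 value, SSSE3 value, improvement.  */
static void
print_row (unsigned long length, double sse2, double ssse3)
{
  printf ("length=%lu %.0f %.0f %.2f%%\n", length, sse2, ssse3,
          improvement_pct (sse2, ssse3));
}
```

For the first row, `improvement_pct (805982, 874585)` evaluates to roughly
-8.51, matching the reported value.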


> >
> > I have attached the test results, hope this is what you want to see.
> >
> > > > > > It is strongly recommended to keep the SSSE3 version.
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > H.J.


* Re: Re: [PATCH v1 3/6] x86: Remove mem{move|cpy}-ssse3
  2022-03-31  3:34           ` Mayshao-oc
@ 2022-03-31  3:47             ` Noah Goldstein
  2022-03-31  4:54               ` Mayshao-oc
  0 siblings, 1 reply; 10+ messages in thread
From: Noah Goldstein @ 2022-03-31  3:47 UTC (permalink / raw)
  To: Mayshao-oc
  Cc: H.J. Lu, GNU C Library, Florian Weimer, Carlos O'Donell,
	Louis Qi(BJ-RD)

On Wed, Mar 30, 2022 at 10:34 PM Mayshao-oc <Mayshao-oc@zhaoxin.com> wrote:
>
> On Thur, Mar 31, 2022 at 12:45 AM Noah Goldstein<goldstein.w.n@gmail.com> wrote:
>
>
> > On Wed, Mar 30, 2022 at 4:57 AM Mayshao-oc <Mayshao-oc@zhaoxin.com> wrote:
> > >
> > > On Tue, Mar 29, 2022 at 10:57 AM Noah Goldstein<goldstein.w.n@gmail.com> wrote:
> > >
> > >
> > > >On Mon, Mar 28, 2022 at 9:51 PM Mayshao-oc <Mayshao-oc@zhaoxin.com> wrote:
> > > > >
> > > > > On Mon, Mar 28, 2022 at  9:07 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > >
> > > > >
> > > > > > On Mon, Mar 28, 2022 at 1:10 AM Mayshao-oc <Mayshao-oc@zhaoxin.com> wrote:
> > > > > > >
> > > > > > > On Fri, Mar 25, 2022 at 6:36 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > > > > >
> > > > > > > > With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
> > > > > > > > SSSE3. As a result its no longer with the code size cost.
> > > > > > > > ---
> > > > > > > > sysdeps/x86_64/multiarch/Makefile          |    2 -
> > > > > > > > sysdeps/x86_64/multiarch/ifunc-impl-list.c |   15 -
> > > > > > > > sysdeps/x86_64/multiarch/ifunc-memmove.h   |   18 +-
> > > > > > > > sysdeps/x86_64/multiarch/memcpy-ssse3.S    | 3151 --------------------
> > > > > > > > sysdeps/x86_64/multiarch/memmove-ssse3.S   |    4 -
> > > > > > > > 5 files changed, 7 insertions(+), 3183 deletions(-)
> > > > > > > > delete mode 100644 sysdeps/x86_64/multiarch/memcpy-ssse3.S
> > > > > > > > delete mode 100644 sysdeps/x86_64/multiarch/memmove-ssse3.S
> > > > > > >
> > > > > > > On some platforms, such as Zhaoxin, the memcpy performance of SSSE3
> > > > > > > is better than that of AVX2, and the current computer system has sufficient
> > > > > > > disk capacity and memory capacity.
> > > > > >
> > > > > > How does the SSSE3 version compare against the SSE2 version?
> > > > >
> > > > > On some Zhaoxin processors, the overall performance of SSSE3 is about
> > > > > 10% higher than that of SSE2.
> > > > >
> > > > >
> > > > > Best Regards,
> > > > > May Shao
> > > >
> > > > Any chance you can post the result from running `bench-memset` or some
> > > > equivalent benchmark? Curious where the regressions are. Ideally we would
> > > > fix the SSE2 version so its optimal.
> > >
> > > Bench-memcpy on Zhaoxin KX-6000 processor shows that, when length <=4 or
> > > length >= 128, memcpy SSSE3 can achieve an average performance improvement
> > > of 25% compared to SSSE2.
> >
> > Thanks
> >
> > The size <= 4 regression is expected as profiles of SPEC show the [5, 32] sized
> > copies to significantly hotter.
> >
> > Regarding the large sizes, it seems to be because the SSSE3 version avoids
> > unaligned loads/stores much more aggressively.
>
> Agree.
>
> > For now we will keep the function. Will look into a replacement that isn't so
> > costly to code size.
>
> Thanks very much for your support.

Will SSE4.1 be an issue for you? I think the only reasonable way to fix this is
with `pshufb`.
>
> > Out of curiosity, is bench-memcpy-random performance also improved with
> > SSSE3? The jump table / branches generally look really nice in micro-benchmarks
> > but that may not be fully indicative of how it will performance in an
> > application.
>
> Bench-memcpy-random shows about a 5% performance drop for SSSE3:

Thanks.

>         __memcpy_sse2_unaligned   __memcpy_ssse3  Improvement(ssse3 over sse2)
> length=32768 805982          874585         -8.51%
> length=65536 885317         940458        -6.23%
> length=131072 929177         979173         -5.38%
> length=262144 980083        1033130         -5.41%
> length=524288 1042590 1095560 -5.08%
> length=1048576 1078020 1127990 -4.64%
>
>
> > >
> > > I have attached the test results, hope this is what you want to see.
> > >
> > > > > > > It is strongly recommended to keep the SSSE3 version.
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > H.J.


* Re: Re: [PATCH v1 3/6] x86: Remove mem{move|cpy}-ssse3
  2022-03-31  3:47             ` Noah Goldstein
@ 2022-03-31  4:54               ` Mayshao-oc
  0 siblings, 0 replies; 10+ messages in thread
From: Mayshao-oc @ 2022-03-31  4:54 UTC (permalink / raw)
  To: Noah Goldstein
  Cc: H.J. Lu, GNU C Library, Florian Weimer, Carlos O'Donell,
	Louis Qi(BJ-RD)

On Thur, Mar 31, 2022 at 11:47 AM  Noah Goldstein<goldstein.w.n@gmail.com> wrote:


> On Wed, Mar 30, 2022 at 10:34 PM Mayshao-oc <Mayshao-oc@zhaoxin.com> wrote:
> >
> > On Thur, Mar 31, 2022 at 12:45 AM Noah Goldstein<goldstein.w.n@gmail.com> wrote:
> >
> >
> > > On Wed, Mar 30, 2022 at 4:57 AM Mayshao-oc <Mayshao-oc@zhaoxin.com> wrote:
> > > >
> > > > On Tue, Mar 29, 2022 at 10:57 AM Noah Goldstein<goldstein.w.n@gmail.com> wrote:
> > > >
> > > >
> > > > >On Mon, Mar 28, 2022 at 9:51 PM Mayshao-oc <Mayshao-oc@zhaoxin.com> wrote:
> > > > > >
> > > > > > On Mon, Mar 28, 2022 at  9:07 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > > >
> > > > > >
> > > > > > > On Mon, Mar 28, 2022 at 1:10 AM Mayshao-oc <Mayshao-oc@zhaoxin.com> wrote:
> > > > > > > >
> > > > > > > > On Fri, Mar 25, 2022 at 6:36 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
> > > > > > > > > SSSE3. As a result its no longer with the code size cost.
> > > > > > > > > ---
> > > > > > > > > sysdeps/x86_64/multiarch/Makefile          |    2 -
> > > > > > > > > sysdeps/x86_64/multiarch/ifunc-impl-list.c |   15 -
> > > > > > > > > sysdeps/x86_64/multiarch/ifunc-memmove.h   |   18 +-
> > > > > > > > > sysdeps/x86_64/multiarch/memcpy-ssse3.S    | 3151 --------------------
> > > > > > > > > sysdeps/x86_64/multiarch/memmove-ssse3.S   |    4 -
> > > > > > > > > 5 files changed, 7 insertions(+), 3183 deletions(-)
> > > > > > > > > delete mode 100644 sysdeps/x86_64/multiarch/memcpy-ssse3.S
> > > > > > > > > delete mode 100644 sysdeps/x86_64/multiarch/memmove-ssse3.S
> > > > > > > >
> > > > > > > > On some platforms, such as Zhaoxin, the memcpy performance of SSSE3
> > > > > > > > is better than that of AVX2, and the current computer system has sufficient
> > > > > > > > disk capacity and memory capacity.
> > > > > > >
> > > > > > > How does the SSSE3 version compare against the SSE2 version?
> > > > > >
> > > > > > On some Zhaoxin processors, the overall performance of SSSE3 is about
> > > > > > 10% higher than that of SSE2.
> > > > > >
> > > > > >
> > > > > > Best Regards,
> > > > > > May Shao
> > > > >
> > > > > Any chance you can post the result from running `bench-memset` or some
> > > > > equivalent benchmark? Curious where the regressions are. Ideally we would
> > > > > fix the SSE2 version so its optimal.
> > > >
> > > > Bench-memcpy on Zhaoxin KX-6000 processor shows that, when length <=4 or
> > > > length >= 128, memcpy SSSE3 can achieve an average performance improvement
> > > > of 25% compared to SSSE2.
> > >
> > > Thanks
> > >
> > > The size <= 4 regression is expected as profiles of SPEC show the [5, 32] sized
> > > copies to significantly hotter.
> > >
> > > Regarding the large sizes, it seems to be because the SSSE3 version avoids
> > > unaligned loads/stores much more aggressively.
> >
> > Agree.
> >
> > > For now we will keep the function. Will look into a replacement that isn't so
> > > costly to code size.
> >
> > Thanks very much for your support.
>
> Will SSE4.1 be an issue for you? I think the only reasonable way to fix this is
> with `pshufb`.

Zhaoxin supports SSE4.1, so I think there should be no problem. If you have a
patch ready, I'd love to try it soon.

Thanks again.

> >
> > > Out of curiosity, is bench-memcpy-random performance also improved with
> > > SSSE3? The jump table / branches generally look really nice in micro-benchmarks
> > > but that may not be fully indicative of how it will performance in an
> > > application.
> >
> > Bench-memcpy-random shows about a 5% performance drop for SSSE3:
>
> Thanks.
>
> >         __memcpy_sse2_unaligned   __memcpy_ssse3  Improvement(ssse3 over sse2)
> > length=32768 805982          874585         -8.51%
> > length=65536 885317         940458        -6.23%
> > length=131072 929177         979173         -5.38%
> > length=262144 980083        1033130         -5.41%
> > length=524288 1042590 1095560 -5.08%
> > length=1048576 1078020 1127990 -4.64%
> >
> >
> > > >
> > > > I have attached the test results, hope this is what you want to see.
> > > >
> > > > > > > > It is strongly recommended to keep the SSSE3 version.
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > H.J.


Thread overview: 10+ messages
2022-03-28  8:10 [PATCH v1 3/6] x86: Remove mem{move|cpy}-ssse3 Mayshao-oc
2022-03-28 13:07 ` H.J. Lu
2022-03-29  2:51   ` Mayshao-oc
2022-03-29  2:57     ` Noah Goldstein
2022-03-30  9:56       ` Mayshao-oc
2022-03-30 16:45         ` Noah Goldstein
2022-03-30 16:54           ` Noah Goldstein
2022-03-31  3:34           ` Mayshao-oc
2022-03-31  3:47             ` Noah Goldstein
2022-03-31  4:54               ` Mayshao-oc
