public inbox for libc-alpha@sourceware.org
From: "H.J. Lu" <hjl.tools@gmail.com>
To: Noah Goldstein <goldstein.w.n@gmail.com>
Cc: GNU C Library <libc-alpha@sourceware.org>,
	"Carlos O'Donell" <carlos@systemhalted.org>
Subject: Re: [PATCH v1] x86: Improve memset-vec-unaligned-erms.S
Date: Sun, 6 Jun 2021 19:47:27 -0700
Message-ID: <CAMe9rOoq2VExN6E88ous3KxZG7i6LZvhZtQV-=zPYUUfnC_fKA@mail.gmail.com>
In-Reply-To: <CAFUsyfJni5xEwCWqDym0_+zez2KeDf2_1eDZPw2Diqu2uzFMbg@mail.gmail.com>

On Sun, Jun 6, 2021 at 7:35 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
>
>
> On Thu, May 20, 2021 at 4:45 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>
>> On Thu, May 20, 2021 at 1:03 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>> >
>> > On Thu, May 20, 2021 at 3:40 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>> > >
>> > > On Thu, May 20, 2021 at 11:45 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>> > > >
>> > > > No bug. This commit makes a few small improvements to
>> > > > memset-vec-unaligned-erms.S. The changes are: 1) only aligning to 64
>> > > > bytes instead of 128. Either alignment performs equally well in the
>> > > > loop, and 128 just increases the odds of needing an extra iteration,
>> > > > which can be significant overhead for small values. 2) Align some
>> > > > branch targets and the loop. 3) Remove an ALU instruction from the
>> > > > alignment process. 4) Reorder the last 4x VEC stores so that they
>> > > > happen after the loop. 5) Move the check for length <= 8x VEC to
>> > > > before the alignment process. test-memset and test-wmemset are both
>> > > > passing.
>> > > >
>> > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
>> > > > ---
>> > > > Tests were run on the following CPUs:
>> > > >
>> > > > Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html
>> > > >
>> > > > Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html
>> > > >
>> > > > Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html
>> > > >
>> > > > All times are the geometric mean of N=50. The unit of time is
>> > > > seconds.
>> > > >
>> > > > "Cur" refers to the current implementation
>> > > > "New" refers to this patch's implementation
>> > > >
>> > > > Performance data attached in memset-data.pdf
>> > > >
>> > > > Some notes on the numbers:
>> > > >
>> > > > I only included numbers for the proper VEC_SIZE for the corresponding
>> > > > CPU.
>> > > >
>> > > > skl -> avx2
>> > > > icl -> evex
>> > > > tgl -> evex
>> > > >
>> > > > The changes only affect sizes > 2 * VEC_SIZE. The performance
>> > > > differences for sizes <= 2 * VEC_SIZE come from changes in alignment
>> > > > after linking (i.e., ENTRY aligns to 16, but performance can be
>> > > > affected by alignment % 64 or alignment % 4096) and generally affect
>> > > > throughput only, not latency (i.e., with an lfence in the benchmark
>> > > > loop the deviations go away). Generally I think they can be ignored
>> > > > (both positive and negative effects).
>> > > >
>> > > > The interesting part of the data is in the medium size range [128,
>> > > > 1024], where the new implementation has a reasonable speedup. This is
>> > > > especially pronounced when the more conservative alignment saves a
>> > > > full loop iteration. The only significant exception is the
>> > > > skylake-avx2-erms case for size = 416, alignment = 416, where the
>> > > > current implementation is meaningfully faster. I am unsure of the
>> > > > root cause. The skylake-avx2 case only performs a bit worse here,
>> > > > which makes me think part of it is code-alignment related, though
>> > > > compared to the speedup in other size/alignment configurations it
>> > > > is still a trough. Despite this, I still think the numbers are
>> > > > overall an improvement.
>> > > >
>> > > > Additionally, due to aligning the loop (and possibly slightly more
>> > > > efficient DSB behavior from replacing addq $(VEC_SIZE * 4) in the
>> > > > loop with subq $-(VEC_SIZE * 4); for VEC_SIZE = 32 the constant -128
>> > > > fits in a sign-extended imm8 while 128 needs an imm32, so the subq
>> > > > form encodes shorter), the non-erms cases often show a slight
>> > > > improvement in the main loop for large sizes. A rough C sketch of
>> > > > the new control flow follows.
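>> > > >
>> > > > For reference, a rough C sketch of the control flow after this
>> > > > patch (illustration only, not the literal implementation;
>> > > > store_vec and sketch_memset are hypothetical names, with store_vec
>> > > > standing in for a single VMOVU/VMOVA store):
>> > > >
>> > > > #include <stdint.h>
>> > > > #include <string.h>
>> > > >
>> > > > #define VEC 32  /* VEC_SIZE for avx2 */
>> > > >
>> > > > /* One vector store of the fill byte.  */
>> > > > static void store_vec (char *p, int c) { memset (p, c, VEC); }
>> > > >
>> > > > /* Assumes the caller already handled len <= 2 * VEC.  */
>> > > > static void sketch_memset (char *ptr, int c, size_t len)
>> > > > {
>> > > >   /* First 2x VEC stored before the cmp: every path needs them.  */
>> > > >   store_vec (ptr, c);
>> > > >   store_vec (ptr + VEC, c);
>> > > >   if (len <= 4 * VEC)
>> > > >     {
>> > > >       store_vec (ptr + len - 2 * VEC, c);
>> > > >       store_vec (ptr + len - VEC, c);
>> > > >       return;
>> > > >     }
>> > > >   store_vec (ptr + 2 * VEC, c);
>> > > >   store_vec (ptr + 3 * VEC, c);
>> > > >   if (len > 8 * VEC)
>> > > >     {
>> > > >       /* Align down to 2 * VEC, then step past the 4x VEC already
>> > > >          stored (andq $-(VEC_SIZE * 2); subq $-(VEC_SIZE * 4)).  */
>> > > >       char *p = (char *) (((uintptr_t) ptr & -(2 * (uintptr_t) VEC))
>> > > >                           + 4 * VEC);
>> > > >       /* rcx in the diff; wraps to a small value if ptr + len
>> > > >          overflows 2^64, making the loop exit immediately.  */
>> > > >       char *end = (char *) ((uintptr_t) ptr + len - 4 * VEC);
>> > > >       do
>> > > >         {
>> > > >           store_vec (p, c);
>> > > >           store_vec (p + VEC, c);
>> > > >           store_vec (p + 2 * VEC, c);
>> > > >           store_vec (p + 3 * VEC, c);
>> > > >           p += 4 * VEC;
>> > > >         }
>> > > >       while (p < end);
>> > > >     }
>> > > >   /* Last 4x VEC stored relative to the end, after the loop (these
>> > > >      may overlap stores from the loop).  */
>> > > >   store_vec (ptr + len - 4 * VEC, c);
>> > > >   store_vec (ptr + len - 3 * VEC, c);
>> > > >   store_vec (ptr + len - 2 * VEC, c);
>> > > >   store_vec (ptr + len - VEC, c);
>> > > > }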
>> > > >
>> > > >  .../multiarch/memset-vec-unaligned-erms.S     | 50 +++++++++++--------
>> > > >  1 file changed, 28 insertions(+), 22 deletions(-)
>> > > >
>> > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
>> > > > index 08cfa49bd1..ff196844a0 100644
>> > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
>> > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
>> > > > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))
>> > > >         VMOVU   %VEC(0), (%rdi)
>> > > >         VZEROUPPER_RETURN
>> > > >
>> > > > +       .p2align 4
>> > > >  L(stosb_more_2x_vec):
>> > > >         cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP
>> > > >         ja      L(stosb)
>> > > > +#else
>> > > > +       .p2align 4
>> > > >  #endif
>> > > >  L(more_2x_vec):
>> > > > -       cmpq  $(VEC_SIZE * 4), %rdx
>> > > > -       ja      L(loop_start)
>> > > > +       /* Stores to first 2x VEC before cmp as any path forward will
>> > > > +          require it.  */
>> > > >         VMOVU   %VEC(0), (%rdi)
>> > > >         VMOVU   %VEC(0), VEC_SIZE(%rdi)
>> > > > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)
>> > > > +       cmpq    $(VEC_SIZE * 4), %rdx
>> > > > +       ja      L(loop_start)
>> > > >         VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)
>> > > > +       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)
>> > > >  L(return):
>> > > >  #if VEC_SIZE > 16
>> > > >         ZERO_UPPER_VEC_REGISTERS_RETURN
>> > > > @@ -192,28 +197,29 @@ L(return):
>> > > >  #endif
>> > > >
>> > > >  L(loop_start):
>> > > > -       leaq    (VEC_SIZE * 4)(%rdi), %rcx
>> > > > -       VMOVU   %VEC(0), (%rdi)
>> > > > -       andq    $-(VEC_SIZE * 4), %rcx
>> > > > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)
>> > > > -       VMOVU   %VEC(0), VEC_SIZE(%rdi)
>> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)
>> > > >         VMOVU   %VEC(0), (VEC_SIZE * 2)(%rdi)
>> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx)
>> > > >         VMOVU   %VEC(0), (VEC_SIZE * 3)(%rdi)
>> > > > -       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx)
>> > > > -       addq    %rdi, %rdx
>> > > > -       andq    $-(VEC_SIZE * 4), %rdx
>> > > > -       cmpq    %rdx, %rcx
>> > > > -       je      L(return)
>> > > > +       cmpq    $(VEC_SIZE * 8), %rdx
>> > > > +       jbe     L(loop_end)
>> > > > +       andq    $-(VEC_SIZE * 2), %rdi
>> > > > +       subq    $-(VEC_SIZE * 4), %rdi
>> > > > +       leaq    -(VEC_SIZE * 4)(%rax, %rdx), %rcx
>
> If this overflows, the loop will exit on the first iteration. Is that an issue?

Please do the following:

1.  Update the memset assembly code with checks for underwrite/overwrite
    conditions; if one is detected, branch to the HLT instruction (a
    minimal sketch of the check follows this list).
2.  Update and run the memset tests to verify test coverage for the
    condition.
3.  Update the memset assembly code to handle such conditions.
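
For 1, a minimal C sketch of the condition (illustration only; in the
assembly this would be a cmp/jb branching to HLT, and check_end_pointer
is a hypothetical name):

  #include <stdint.h>
  #include <stddef.h>

  #define VEC_SIZE 32

  /* On the loop path (len > 8 * VEC_SIZE), the end pointer
     ptr + len - 4 * VEC_SIZE ends up below ptr exactly when the
     computation wrapped around 2^64.  */
  static void
  check_end_pointer (const char *ptr, size_t len)
  {
    uintptr_t end = (uintptr_t) ptr + len - 4 * VEC_SIZE;
    if (end < (uintptr_t) ptr)
      __builtin_trap ();  /* stands in for branching to HLT */
  }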

Thanks.

> If so, the following commits from me have the same bug:
> https://sourceware.org/git/?p=glibc.git;a=commit;h=6abf27980a947f9b6e514d6b33b83059d39566ae
> https://sourceware.org/git/?p=glibc.git;a=commit;h=4ad473e97acdc5f6d811755b67c09f2128a644ce
> https://sourceware.org/git/?p=glibc.git;a=commit;h=16d12015c57701b08d7bbed6ec536641bcafb428
>>
>>
>>
>> > > > +       .p2align 4
>> > > >  L(loop):
>> > > > -       VMOVA   %VEC(0), (%rcx)
>> > > > -       VMOVA   %VEC(0), VEC_SIZE(%rcx)
>> > > > -       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rcx)
>> > > > -       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rcx)
>> > > > -       addq    $(VEC_SIZE * 4), %rcx
>> > > > -       cmpq    %rcx, %rdx
>> > > > -       jne     L(loop)
>> > > > +       VMOVA   %VEC(0), (%rdi)
>> > > > +       VMOVA   %VEC(0), VEC_SIZE(%rdi)
>> > > > +       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rdi)
>> > > > +       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rdi)
>> > > > +       subq    $-(VEC_SIZE * 4), %rdi
>> > > > +       cmpq    %rcx, %rdi
>> > > > +       jb      L(loop)
>
> It is an issue, because %rdi will not be below %rcx here.
>>
>> > > > +L(loop_end):
>> > > > +       /* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN.
>> > > > +              rdx as length is also unchanged.  */
>> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx)
>> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx)
>> > > > +       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx)
>> > > > +       VMOVU   %VEC(0), -VEC_SIZE(%rax, %rdx)
>> > > >         VZEROUPPER_SHORT_RETURN
>> > > >
>> > > >         .p2align 4
>> > > > --
>> > > > 2.25.1
>> > > >
>> > >
>> > > LGTM.
>> >
>> > Awesome!
>> >
>> > For future patches, do you prefer performance numbers like this or
>> > raw text? Or some other alternative?
>>
>> The current data format is fine.   Thanks.
>>
>> > >
>> > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>> > >
>> > > Thanks.
>> > >
>> > > --
>> > > H.J.
>>
>>
>>
>> --
>> H.J.



-- 
H.J.
