From: "Paul A. Clarke" <pc@us.ibm.com>
To: Raphael Moreira Zinsly <rzinsly@linux.ibm.com>
Cc: libc-alpha@sourceware.org
Subject: Re: [PATCH 1/2] powerpc: Optimized strncpy for POWER9
Date: Fri, 28 Aug 2020 14:12:41 -0500 [thread overview]
Message-ID: <20200828191241.GA280591@li-24c3614c-2adc-11b2-a85c-85f334518bdb.ibm.com> (raw)
In-Reply-To: <20200820182917.12602-1-rzinsly@linux.ibm.com>
On Thu, Aug 20, 2020 at 03:29:16PM -0300, Raphael Moreira Zinsly via Libc-alpha wrote:
> Similar to the strcpy P9 optimization, this version uses VSX to improve
> performance.
> ---
> diff --git a/sysdeps/powerpc/powerpc64/le/power9/strncpy.S b/sysdeps/powerpc/powerpc64/le/power9/strncpy.S
> new file mode 100644
> index 0000000000..cde68384d4
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/le/power9/strncpy.S
> @@ -0,0 +1,276 @@
> +/* Optimized strncpy implementation for PowerPC64/POWER9.
sysdeps/powerpc/powerpc64/multiarch/strncpy-power9.S below, has
"POWER9/PPC64". Can we make these consistent? Can we just say
"POWER9"? Do we need to indicate little-endian only?
> + Copyright (C) 2020 Free Software Foundation, Inc.
> + This file is part of the GNU C Library.
> +
> + The GNU C Library is free software; you can redistribute it and/or
> + modify it under the terms of the GNU Lesser General Public
> + License as published by the Free Software Foundation; either
> + version 2.1 of the License, or (at your option) any later version.
> +
> + The GNU C Library is distributed in the hope that it will be useful,
> + but WITHOUT ANY WARRANTY; without even the implied warranty of
> + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + Lesser General Public License for more details.
> +
> + You should have received a copy of the GNU Lesser General Public
> + License along with the GNU C Library; if not, see
> + <https://www.gnu.org/licenses/>. */
> +
> +#include <sysdep.h>
> +
> +# ifndef STRNCPY
> +# define FUNC_NAME strncpy
> +# else
> +# define FUNC_NAME STRNCPY
> +# endif
> +
> +/* Implements the function
> +
> + char * [r3] strncpy (char *dest [r3], const char *src [r4], size_t n [r5])
> +
> + The implementation can load bytes past a null terminator, but only
> + up to the next 16B boundary, so it never crosses a page. */
nit, subjective: "up to the next 16-byte aligned address"
> +
> +.machine power9
> +ENTRY_TOCLESS (FUNC_NAME, 4)
> + CALL_MCOUNT 2
> +
> + cmpwi r5, 0
This should be "cmpdi".
> + beqlr
> + /* NULL string optimisation */
This comment would make more sense above the "cmpdi", above.
> + lbz r0,0(r4)
> + stb r0,0(r3)
> + addi r11,r3,1
> + addi r5,r5,-1
> + vspltisb v18,0 /* Zeroes in v18 */
> + cmpwi r0,0
This should be "cmpdi".
> + beq L(zero_padding_loop)
> +
Given the above "NULL string" comment, you could
put an "empty string optimization" comment here.
> + cmpwi r5,0
This should be "cmpdi".
> + beqlr
The "addi r11,r3,1" and "vspltisb v18,0" above aren't needed until
a bit later, which penalizes the empty string case. I think you
can move the empty string test up. Some experiments seemed to move
the lbz and dependent stb apart. Something like this:
/* NULL string optimisation */
cmpdi r5,0
beqlr
lbz r0,0(r4)
/* empty/1-byte string optimisation */
cmpdi r5,1
stb r0,0(r3)
beqlr
cmpdi r0,0
addi r11,r3,1
addi r5,r5,-1
vspltisb v18,0 /* Zeroes in v18 */
beq L(zero_padding_loop)
(But, I didn't see significant performance difference in
some light experimentation. It might be worth another look.)
> +
> +L(cont):
This label isn't used.
> + addi r4,r4,1
> + neg r7,r4
> + rldicl r9,r7,0,60 /* How many bytes to get source 16B aligned? */
> +
> + /* Get source 16B aligned */
> + lvx v0,0,r4
> + lvsr v1,0,r4
> + vperm v0,v18,v0,v1
> +
> + vcmpequb v6,v0,v18 /* 0xff if byte is NULL, 0x00 otherwise */
> + vctzlsbb r7,v6 /* Number of trailing zeroes */
> + addi r8,r7,1 /* Add null terminator */
> +
> + /* r8 = bytes including null
> + r9 = bytes to get source 16B aligned
> + if r8 > r9
> + no null, copy r9 bytes
> + else
> + there is a null, copy r8 bytes and return. */
> + cmpd r8,r9
This should probably be "cmpld".
> + bgt L(no_null)
> +
> + cmpd r8,r5 /* r8 <= n? */
This should probably be "cmpld".
> + ble L(null)
> + sldi r10,r5,56 /* stxvl wants size in top 8 bits */
> + stxvl 32+v0,r11,r10 /* Partial store */
Do we still need this "32+v0" syntax? Is that due to a minimum supported
level of binutils which isn't VSX-aware?
> +
> + blr
> +
> +L(null):
> + sldi r10,r8,56 /* stxvl wants size in top 8 bits */
> + stxvl 32+v0,r11,r10 /* Partial store */
> +
> + add r11,r11,r8
> + sub r5,r5,r8
> + b L(zero_padding_loop)
> +
> +L(no_null):
> + cmpd r9,r5 /* Check if length was reached. */
This should probably be "cmpld".
> + bge L(n_tail1)
> + sldi r10,r9,56 /* stxvl wants size in top 8 bits */
> + stxvl 32+v0,r11,r10 /* Partial store */
> +
> + add r4,r4,r9
> + add r11,r11,r9
> + sub r5,r5,r9
> +
> +L(loop):
> + cmpldi cr6,r5,64 /* Check if length was reached. */
> + ble cr6,L(final_loop)
> +
> + lxv 32+v0,0(r4)
> + vcmpequb. v6,v0,v18 /* Any zero bytes? */
> + bne cr6,L(prep_tail1)
> +
> + lxv 32+v1,16(r4)
> + vcmpequb. v6,v1,v18 /* Any zero bytes? */
> + bne cr6,L(prep_tail2)
> +
> + lxv 32+v2,32(r4)
> + vcmpequb. v6,v2,v18 /* Any zero bytes? */
> + bne cr6,L(prep_tail3)
> +
> + lxv 32+v3,48(r4)
> + vcmpequb. v6,v3,v18 /* Any zero bytes? */
> + bne cr6,L(prep_tail4)
> +
> + stxv 32+v0,0(r11)
> + stxv 32+v1,16(r11)
> + stxv 32+v2,32(r11)
> + stxv 32+v3,48(r11)
> +
> + addi r4,r4,64
> + addi r11,r11,64
> + addi r5,r5,-64
> +
> + b L(loop)
> +
> +L(final_loop):
> + cmpldi cr5,r5,16
> + lxv 32+v0,0(r4)
> + vcmpequb. v6,v0,v18 /* Any zero bytes? */
> + ble cr5,L(prep_n_tail1)
> + bne cr6,L(count_tail1)
> + addi r5,r5,-16
> +
> + cmpldi cr5,r5,16
> + lxv 32+v1,16(r4)
> + vcmpequb. v6,v1,v18 /* Any zero bytes? */
> + ble cr5,L(prep_n_tail2)
> + bne cr6,L(count_tail2)
> + addi r5,r5,-16
> +
> + cmpldi cr5,r5,16
> + lxv 32+v2,32(r4)
> + vcmpequb. v6,v2,v18 /* Any zero bytes? */
> + ble cr5,L(prep_n_tail3)
> + bne cr6,L(count_tail3)
> + addi r5,r5,-16
> +
> + lxv 32+v3,48(r4)
> + vcmpequb. v6,v3,v18 /* Any zero bytes? */
> + beq cr6,L(n_tail4)
> +
> + vctzlsbb r8,v6 /* Number of trailing zeroes */
> + cmpd r8,r5 /* r8 < n? */
This should probably be "cmpld".
> + blt L(tail4)
> +L(n_tail4):
> + stxv 32+v0,0(r11)
> + stxv 32+v1,16(r11)
> + stxv 32+v2,32(r11)
> + sldi r10,r5,56 /* stxvl wants size in top 8 bits */
> + addi r11,r11,48 /* Offset */
> + stxvl 32+v3,r11,r10 /* Partial store */
> + blr
> +
> +L(prep_n_tail1):
> + beq cr6,L(n_tail1) /* Any zero bytes? */
> + vctzlsbb r8,v6 /* Number of trailing zeroes */
> + cmpd r8,r5 /* r8 < n? */
This should probably be "cmpld".
> + blt L(tail1)
> +L(n_tail1):
> + sldi r10,r5,56 /* stxvl wants size in top 8 bits */
> + stxvl 32+v0,r11,r10 /* Partial store */
> + blr
> +
> +L(prep_n_tail2):
> + beq cr6,L(n_tail2) /* Any zero bytes? */
> + vctzlsbb r8,v6 /* Number of trailing zeroes */
> + cmpd r8,r5 /* r8 < n? */
This should probably be "cmpld".
> + blt L(tail2)
> +L(n_tail2):
> + stxv 32+v0,0(r11)
> + sldi r10,r5,56 /* stxvl wants size in top 8 bits */
> + addi r11,r11,16 /* offset */
> + stxvl 32+v1,r11,r10 /* Partial store */
> + blr
> +
> +L(prep_n_tail3):
> + beq cr6,L(n_tail3) /* Any zero bytes? */
> + vctzlsbb r8,v6 /* Number of trailing zeroes */
> + cmpd r8,r5 /* r8 < n? */
This should probably be "cmpld".
> + blt L(tail3)
> +L(n_tail3):
> + stxv 32+v0,0(r11)
> + stxv 32+v1,16(r11)
> + sldi r10,r5,56 /* stxvl wants size in top 8 bits */
> + addi r11,r11,32 /* Offset */
> + stxvl 32+v2,r11,r10 /* Partial store */
> + blr
> +
> +L(prep_tail1):
> +L(count_tail1):
> + vctzlsbb r8,v6 /* Number of trailing zeroes */
> +L(tail1):
> + addi r9,r8,1 /* Add null terminator */
> + sldi r10,r9,56 /* stxvl wants size in top 8 bits */
> + stxvl 32+v0,r11,r10 /* Partial store */
> + add r11,r11,r9
> + sub r5,r5,r9
> + b L(zero_padding_loop)
> +
> +L(prep_tail2):
> + addi r5,r5,-16
> +L(count_tail2):
> + vctzlsbb r8,v6 /* Number of trailing zeroes */
> +L(tail2):
> + addi r9,r8,1 /* Add null terminator */
> + stxv 32+v0,0(r11)
> + sldi r10,r9,56 /* stxvl wants size in top 8 bits */
> + addi r11,r11,16 /* offset */
> + stxvl 32+v1,r11,r10 /* Partial store */
> + add r11,r11,r9
> + sub r5,r5,r9
> + b L(zero_padding_loop)
> +
> +L(prep_tail3):
> + addi r5,r5,-32
> +L(count_tail3):
> + vctzlsbb r8,v6 /* Number of trailing zeroes */
> +L(tail3):
> + addi r9,r8,1 /* Add null terminator */
> + stxv 32+v0,0(r11)
> + stxv 32+v1,16(r11)
> + sldi r10,r9,56 /* stxvl wants size in top 8 bits */
> + addi r11,r11,32 /* offset */
> + stxvl 32+v2,r11,r10 /* Partial store */
> + add r11,r11,r9
> + sub r5,r5,r9
> + b L(zero_padding_loop)
> +
> +L(prep_tail4):
> + addi r5,r5,-48
> + vctzlsbb r8,v6 /* Number of trailing zeroes */
> +L(tail4):
> + addi r9,r8,1 /* Add null terminator */
> + stxv 32+v0,0(r11)
> + stxv 32+v1,16(r11)
> + stxv 32+v2,32(r11)
> + sldi r10,r9,56 /* stxvl wants size in top 8 bits */
> + addi r11,r11,48 /* offset */
> + stxvl 32+v3,r11,r10 /* Partial store */
> + add r11,r11,r9
> + sub r5,r5,r9
> +
> +/* This code pads the remainder of dest with NULL bytes. */
> +L(zero_padding_loop):
> + cmpldi cr6,r5,16 /* Check if length was reached. */
> + ble cr6,L(zero_padding_end)
> +
> + stxv v18,0(r11)
> + addi r11,r11,16
> + addi r5,r5,-16
> +
> + b L(zero_padding_loop)
> +
> +L(zero_padding_end):
> + sldi r10,r5,56 /* stxvl wants size in top 8 bits */
> + stxvl v18,r11,r10 /* Partial store */
> + blr
> +
> +L(n_tail):
> +
> +END (FUNC_NAME)
PC
next prev parent reply other threads:[~2020-08-28 19:12 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-08-20 18:29 Raphael Moreira Zinsly
2020-08-20 18:29 ` [PATCH 2/2] powerpc: Optimzed stpncpy " Raphael Moreira Zinsly
2020-08-20 18:31 ` Raphael M Zinsly
2020-08-28 17:04 ` Paul E Murphy
2020-08-20 18:31 ` [PATCH 1/2] powerpc: Optimized strncpy " Raphael M Zinsly
2020-08-28 14:25 ` Paul E Murphy
2020-08-28 19:12 ` Paul A. Clarke [this message]
2020-09-02 13:20 ` Tulio Magno Quites Machado Filho
2020-09-02 14:00 ` Paul E Murphy
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200828191241.GA280591@li-24c3614c-2adc-11b2-a85c-85f334518bdb.ibm.com \
--to=pc@us.ibm.com \
--cc=libc-alpha@sourceware.org \
--cc=rzinsly@linux.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).