From: "Paul A. Clarke" <pc@us.ibm.com>
To: Raphael Moreira Zinsly <rzinsly@linux.ibm.com>
Cc: libc-alpha@sourceware.org
Subject: Re: [PATCH 1/2] powerpc: Optimized strncpy for POWER9
Date: Fri, 28 Aug 2020 14:12:41 -0500	[thread overview]
Message-ID: <20200828191241.GA280591@li-24c3614c-2adc-11b2-a85c-85f334518bdb.ibm.com> (raw)
In-Reply-To: <20200820182917.12602-1-rzinsly@linux.ibm.com>

On Thu, Aug 20, 2020 at 03:29:16PM -0300, Raphael Moreira Zinsly via Libc-alpha wrote:
> Similar to the strcpy P9 optimization, this version uses VSX to improve
> performance.
> ---

> diff --git a/sysdeps/powerpc/powerpc64/le/power9/strncpy.S b/sysdeps/powerpc/powerpc64/le/power9/strncpy.S
> new file mode 100644
> index 0000000000..cde68384d4
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/le/power9/strncpy.S
> @@ -0,0 +1,276 @@
> +/* Optimized strncpy implementation for PowerPC64/POWER9.

sysdeps/powerpc/powerpc64/multiarch/strncpy-power9.S, below, has
"POWER9/PPC64".  Can we make these consistent?  Can we just say
"POWER9"?  Do we need to indicate little-endian only?

> +   Copyright (C) 2020 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include <sysdep.h>
> +
> +# ifndef STRNCPY
> +#  define FUNC_NAME strncpy
> +# else
> +#  define FUNC_NAME STRNCPY
> +# endif
> +
> +/* Implements the function
> +
> +   char * [r3] strncpy (char *dest [r3], const char *src [r4], size_t n [r5])
> +
> +   The implementation can load bytes past a null terminator, but only
> +   up to the next 16B boundary, so it never crosses a page.  */

nit, subjective: "up to the next 16-byte aligned address"
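
(The page-safety argument, for anyone reading along, is just that page
boundaries are themselves 16-byte aligned, so a read confined to a
16-byte aligned block can't straddle one.  And purely as a reference
point -- a naive C sketch of the semantics being implemented here, not
a suggestion for the implementation:)

	#include <stddef.h>

	char *
	strncpy (char *dest, const char *src, size_t n)
	{
	  size_t i = 0;
	  /* Copy the bytes before the null terminator, at most n.  */
	  for (; i < n && src[i] != '\0'; i++)
	    dest[i] = src[i];
	  /* Pad the rest of dest (terminator included) with null bytes.  */
	  for (; i < n; i++)
	    dest[i] = '\0';
	  return dest;
	}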

> +
> +.machine power9
> +ENTRY_TOCLESS (FUNC_NAME, 4)
> +	CALL_MCOUNT 2
> +
> +	cmpwi   r5, 0

This should be "cmpdi".
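
(To spell out why, since the same issue repeats below: cmpwi compares
only the sign-extended low word, and r5 holds a 64-bit size_t.  A
hypothetical C illustration, not from the patch:)

	#include <stdint.h>
	#include <stddef.h>

	/* What each compare effectively "sees" when testing n against 0.  */
	int cmpwi_sees_zero (size_t n) { return (int32_t) n == 0; } /* also true for n = 1ULL << 32 */
	int cmpdi_sees_zero (size_t n) { return n == 0; }            /* true only for n = 0 */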

> +	beqlr
> +	/* NULL string optimisation  */

This comment would make more sense placed above the preceding "cmpdi".

> +	lbz	r0,0(r4)
> +	stb	r0,0(r3)
> +	addi	r11,r3,1
> +	addi	r5,r5,-1
> +	vspltisb v18,0		/* Zeroes in v18  */
> +	cmpwi	r0,0

This should be "cmpdi".

> +	beq	L(zero_padding_loop)
> +

Given the above "NULL string" comment, you could
put an "empty string optimization" comment here.

> +	cmpwi	r5,0

This should be "cmpdi".

> +	beqlr

The "addi r11,r3,1" and "vspltisb v18,0" above aren't needed until
a bit later, which penalizes the empty string case.  I think you
can move the empty string test up.  In some experiments it also seemed
worthwhile to move the lbz and the dependent stb apart.  Something like
this:
	/* NULL string optimisation  */
	cmpdi	r5,0
	beqlr

	lbz	r0,0(r4)
	/* empty/1-byte string optimisation  */
	cmpdi	r5,1
	stb	r0,0(r3)
	beqlr

	cmpdi	r0,0
	addi	r11,r3,1
	addi	r5,r5,-1
	vspltisb v18,0		/* Zeroes in v18  */
	beq	L(zero_padding_loop)

(But I didn't see a significant performance difference in some light
experimentation; it might be worth another look.)

> +
> +L(cont):

This label isn't used.

> +	addi	r4,r4,1
> +	neg	r7,r4
> +	rldicl	r9,r7,0,60	/* How many bytes to get source 16B aligned?  */
> +
> +	/* Get source 16B aligned  */
> +	lvx	v0,0,r4
> +	lvsr	v1,0,r4
> +	vperm	v0,v18,v0,v1
> +
> +	vcmpequb v6,v0,v18	/* 0xff if byte is NULL, 0x00 otherwise  */
> +	vctzlsbb r7,v6		/* Number of trailing zeroes  */
> +	addi	r8,r7,1	/* Add null terminator  */
> +
> +	/* r8 = bytes including null
> +	   r9 = bytes to get source 16B aligned
> +	   if r8 > r9
> +	      no null, copy r9 bytes
> +	   else
> +	      there is a null, copy r8 bytes and return.  */
> +	cmpd	r8,r9

This should probably be "cmpld".

> +	bgt	L(no_null)
> +
> +	cmpd	r8,r5		/* r8 <= n?  */

This should probably be "cmpld".
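
(Same reasoning applies to the other cmpd's below: r8 and r5 hold
unsigned sizes and counts, and while r8 is always small here, n in r5
can be arbitrarily large, so a signed compare misreads it once the high
bit is set.  A hypothetical C illustration, not from the patch:)

	#include <stdint.h>

	/* r8 = bytes including the null terminator, r5 = remaining n.  */
	int cmpd_le  (uint64_t r8, uint64_t r5) { return (int64_t) r8 <= (int64_t) r5; }
	int cmpld_le (uint64_t r8, uint64_t r5) { return r8 <= r5; }
	/* For r8 = 5 and r5 = 0x8000000000000000 (huge but valid size_t),
	   cmpd_le is false (r5 looks negative) while cmpld_le is true.  */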

> +	ble	L(null)
> +	sldi	r10,r5,56	/* stxvl wants size in top 8 bits  */
> +	stxvl	32+v0,r11,r10	/* Partial store  */

Do we still need this "32+v0" syntax?  Is that due to a minimum supported
level of binutils which isn't VSX-aware?  (As I understand it, VSX
instructions encode a 6-bit VSR number, and vector register v(n) aliases
VSR 32+n, so "32+v0" just spells out the VSR that overlays v0.)

> +
> +	blr
> +
> +L(null):
> +	sldi	r10,r8,56	/* stxvl wants size in top 8 bits  */
> +	stxvl	32+v0,r11,r10	/* Partial store  */
> +
> +	add	r11,r11,r8
> +	sub	r5,r5,r8
> +	b L(zero_padding_loop)
> +
> +L(no_null):
> +	cmpd	r9,r5		/* Check if length was reached.  */

This should probably be "cmpld".

> +	bge	L(n_tail1)
> +	sldi	r10,r9,56	/* stxvl wants size in top 8 bits  */
> +	stxvl	32+v0,r11,r10	/* Partial store  */
> +
> +	add	r4,r4,r9
> +	add	r11,r11,r9
> +	sub	r5,r5,r9
> +
> +L(loop):
> +	cmpldi	cr6,r5,64	/* Check if length was reached.  */
> +	ble	cr6,L(final_loop)
> +
> +	lxv	32+v0,0(r4)
> +	vcmpequb. v6,v0,v18	/* Any zero bytes?  */
> +	bne	cr6,L(prep_tail1)
> +
> +	lxv	32+v1,16(r4)
> +	vcmpequb. v6,v1,v18	/* Any zero bytes?  */
> +	bne	cr6,L(prep_tail2)
> +
> +	lxv	32+v2,32(r4)
> +	vcmpequb. v6,v2,v18	/* Any zero bytes?  */
> +	bne	cr6,L(prep_tail3)
> +
> +	lxv	32+v3,48(r4)
> +	vcmpequb. v6,v3,v18	/* Any zero bytes?  */
> +	bne	cr6,L(prep_tail4)
> +
> +	stxv	32+v0,0(r11)
> +	stxv	32+v1,16(r11)
> +	stxv	32+v2,32(r11)
> +	stxv	32+v3,48(r11)
> +
> +	addi	r4,r4,64
> +	addi	r11,r11,64
> +	addi	r5,r5,-64
> +
> +	b	L(loop)
> +
> +L(final_loop):
> +	cmpldi	cr5,r5,16
> +	lxv	32+v0,0(r4)
> +	vcmpequb. v6,v0,v18	/* Any zero bytes?  */
> +	ble	cr5,L(prep_n_tail1)
> +	bne	cr6,L(count_tail1)
> +	addi	r5,r5,-16
> +
> +	cmpldi	cr5,r5,16
> +	lxv	32+v1,16(r4)
> +	vcmpequb. v6,v1,v18	/* Any zero bytes?  */
> +	ble	cr5,L(prep_n_tail2)
> +	bne	cr6,L(count_tail2)
> +	addi	r5,r5,-16
> +
> +	cmpldi	cr5,r5,16
> +	lxv	32+v2,32(r4)
> +	vcmpequb. v6,v2,v18	/* Any zero bytes?  */
> +	ble	cr5,L(prep_n_tail3)
> +	bne	cr6,L(count_tail3)
> +	addi	r5,r5,-16
> +
> +	lxv	32+v3,48(r4)
> +	vcmpequb. v6,v3,v18	/* Any zero bytes?  */
> +	beq	cr6,L(n_tail4)
> +
> +	vctzlsbb r8,v6		/* Number of trailing zeroes  */
> +	cmpd	r8,r5		/* r8 < n?  */

This should probably be "cmpld".

> +	blt	L(tail4)
> +L(n_tail4):
> +	stxv	32+v0,0(r11)
> +	stxv	32+v1,16(r11)
> +	stxv	32+v2,32(r11)
> +	sldi	r10,r5,56	/* stxvl wants size in top 8 bits  */
> +	addi	r11,r11,48	/* Offset */
> +	stxvl	32+v3,r11,r10	/* Partial store  */
> +	blr
> +
> +L(prep_n_tail1):
> +	beq	cr6,L(n_tail1)	/* Any zero bytes?  */
> +	vctzlsbb r8,v6		/* Number of trailing zeroes  */
> +	cmpd	r8,r5		/* r8 < n?  */

This should probably be "cmpld".

> +	blt	L(tail1)
> +L(n_tail1):
> +	sldi	r10,r5,56	/* stxvl wants size in top 8 bits  */
> +	stxvl	32+v0,r11,r10	/* Partial store  */
> +	blr
> +
> +L(prep_n_tail2):
> +	beq	cr6,L(n_tail2)	/* Any zero bytes?  */
> +	vctzlsbb r8,v6		/* Number of trailing zeroes  */
> +	cmpd	r8,r5		/* r8 < n?  */

This should probably be "cmpld".

> +	blt	L(tail2)
> +L(n_tail2):
> +	stxv	32+v0,0(r11)
> +	sldi	r10,r5,56	/* stxvl wants size in top 8 bits  */
> +	addi	r11,r11,16	/* offset */
> +	stxvl	32+v1,r11,r10	/* Partial store  */
> +	blr
> +
> +L(prep_n_tail3):
> +	beq	cr6,L(n_tail3)	/* Any zero bytes?  */
> +	vctzlsbb r8,v6		/* Number of trailing zeroes  */
> +	cmpd	r8,r5		/* r8 < n?  */

This should probably be "cmpld".

> +	blt	L(tail3)
> +L(n_tail3):
> +	stxv	32+v0,0(r11)
> +	stxv	32+v1,16(r11)
> +	sldi	r10,r5,56	/* stxvl wants size in top 8 bits  */
> +	addi	r11,r11,32	/* Offset */
> +	stxvl	32+v2,r11,r10	/* Partial store  */
> +	blr
> +
> +L(prep_tail1):
> +L(count_tail1):
> +	vctzlsbb r8,v6		/* Number of trailing zeroes  */
> +L(tail1):
> +	addi	r9,r8,1	/* Add null terminator  */
> +	sldi	r10,r9,56	/* stxvl wants size in top 8 bits  */
> +	stxvl	32+v0,r11,r10	/* Partial store  */
> +	add	r11,r11,r9
> +	sub	r5,r5,r9
> +	b L(zero_padding_loop)
> +
> +L(prep_tail2):
> +	addi	r5,r5,-16
> +L(count_tail2):
> +	vctzlsbb r8,v6		/* Number of trailing zeroes  */
> +L(tail2):
> +	addi	r9,r8,1	/* Add null terminator  */
> +	stxv	32+v0,0(r11)
> +	sldi	r10,r9,56	/* stxvl wants size in top 8 bits  */
> +	addi	r11,r11,16	/* offset */
> +	stxvl	32+v1,r11,r10	/* Partial store  */
> +	add	r11,r11,r9
> +	sub	r5,r5,r9
> +	b L(zero_padding_loop)
> +
> +L(prep_tail3):
> +	addi	r5,r5,-32
> +L(count_tail3):
> +	vctzlsbb r8,v6		/* Number of trailing zeroes  */
> +L(tail3):
> +	addi	r9,r8,1	/* Add null terminator  */
> +	stxv	32+v0,0(r11)
> +	stxv	32+v1,16(r11)
> +	sldi	r10,r9,56	/* stxvl wants size in top 8 bits  */
> +	addi	r11,r11,32	/* offset */
> +	stxvl	32+v2,r11,r10	/* Partial store  */
> +	add	r11,r11,r9
> +	sub	r5,r5,r9
> +	b L(zero_padding_loop)
> +
> +L(prep_tail4):
> +	addi	r5,r5,-48
> +	vctzlsbb r8,v6		/* Number of trailing zeroes  */
> +L(tail4):
> +	addi	r9,r8,1	/* Add null terminator  */
> +	stxv	32+v0,0(r11)
> +	stxv	32+v1,16(r11)
> +	stxv	32+v2,32(r11)
> +	sldi	r10,r9,56	/* stxvl wants size in top 8 bits  */
> +	addi	r11,r11,48	/* offset */
> +	stxvl	32+v3,r11,r10	/* Partial store  */
> +	add	r11,r11,r9
> +	sub	r5,r5,r9
> +
> +/* This code pads the remainder of dest with NULL bytes.  */
> +L(zero_padding_loop):
> +	cmpldi	cr6,r5,16	/* Check if length was reached.  */
> +	ble	cr6,L(zero_padding_end)
> +
> +	stxv	v18,0(r11)
> +	addi	r11,r11,16
> +	addi	r5,r5,-16
> +
> +	b	L(zero_padding_loop)
> +
> +L(zero_padding_end):
> +	sldi	r10,r5,56	/* stxvl wants size in top 8 bits  */
> +	stxvl	v18,r11,r10	/* Partial store  */
> +	blr
> +
> +L(n_tail):
> +
> +END (FUNC_NAME)

PC

