Date: Wed, 27 May 2020 11:45:54 -0500
From: "Paul A. Clarke"
To: "Paul E. Murphy"
Cc: libc-alpha@sourceware.org, anton@ozlabs.org
Subject: Re: [PATCH] powerpc64le: add optimized strlen for P9
Message-ID: <20200527164554.GA13085@oc3272150783.ibm.com>
References: <20200521191048.1566568-1-murphyp@linux.vnet.ibm.com>
In-Reply-To: <20200521191048.1566568-1-murphyp@linux.vnet.ibm.com>

On Thu, May 21, 2020 at 02:10:48PM -0500, Paul E. Murphy via Libc-alpha wrote:
> This is a followup to rawmemchr/strlen from Anton. I missed
> his original strlen patch, and likewise I wasn't happy with
> the 3-4% performance drop for larger strings which occurs
> around 2.5kB as the P8 vector loop is a bit faster. As noted,
> this is up to 50% faster for small strings, and about 1% faster
> for larger strings (I hazard to guess this is some uarch difference
> between lxv and lvx).
>
> I guess this is a semi-V2 of the patch. Likewise, I need to
> double check binutils 2.26 supports the P9 insn used here.
>
> ---8<---
>
> This started as a trivial change to Anton's rawmemchr. I got
> carried away. This is a hybrid between P8's asymptotically
> faster 64B checks with extremely efficient small string checks
> e.g. <64B (and sometimes a little bit more depending on alignment).
>
> The second trick is to align to 64B by running a 48B checking loop
> 16B at a time until we naturally align to 64B (i.e. checking 48/96/144
> bytes/iteration based on the alignment after the first 5 comparisons).
> This alleviates the need to check page boundaries.
>
> Finally, explicitly use the P7 strlen with the runtime loader when building
> P9. We need to be cautious about vector/vsx extensions here on P9-only
> builds.
> ---
>  .../powerpc/powerpc64/le/power9/rtld-strlen.S |   1 +
>  sysdeps/powerpc/powerpc64/le/power9/strlen.S  | 215 ++++++++++++++++++
>  sysdeps/powerpc/powerpc64/multiarch/Makefile  |   2 +-
>  .../powerpc64/multiarch/ifunc-impl-list.c     |   4 +
>  .../powerpc64/multiarch/strlen-power9.S       |   2 +
>  sysdeps/powerpc/powerpc64/multiarch/strlen.c  |   5 +
>  6 files changed, 228 insertions(+), 1 deletion(-)
>  create mode 100644 sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S
>  create mode 100644 sysdeps/powerpc/powerpc64/le/power9/strlen.S
>  create mode 100644 sysdeps/powerpc/powerpc64/multiarch/strlen-power9.S
>
> diff --git a/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S b/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S
> new file mode 100644
> index 0000000000..e9d83323ac
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S
> @@ -0,0 +1 @@
> +#include 
> diff --git a/sysdeps/powerpc/powerpc64/le/power9/strlen.S b/sysdeps/powerpc/powerpc64/le/power9/strlen.S
> new file mode 100644
> index 0000000000..084d6e31a8
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/le/power9/strlen.S
> @@ -0,0 +1,215 @@
> +
> +/* Optimized rawmemchr implementation for PowerPC64/POWER9.
> +   Copyright (C) 2020 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   . */
> +
> +#include 
> +
> +#ifndef STRLEN
> +# define STRLEN __strlen
> +# define DEFINE_STRLEN_HIDDEN_DEF 1
> +#endif
> +
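As an aside, here is how I read the overall flow, as a rough scalar C
model (my own sketch under that reading, not code from the patch; it
ignores the vector details and the 48B unrolling of the alignment loop):

    #include <stddef.h>
    #include <stdint.h>

    /* Return the index of the first NUL in p[0..n-1], or n if none.  */
    static size_t
    first_nul (const char *p, size_t n)
    {
      size_t i;
      for (i = 0; i < n; i++)
        if (p[i] == '\0')
          break;
      return i;
    }

    size_t
    strlen_model (const char *s)
    {
      /* 1. Check the bytes from s up to the next 16B boundary.  The asm
         does this with one aligned lvx plus vperm, filling the leading
         bytes with 0xff so they can never compare equal to zero.  */
      size_t head = 16 - ((uintptr_t) s & 15);
      size_t i = first_nul (s, head);
      if (i < head)
        return i;

      /* 2. Check 16B blocks until the address is 64B aligned.  The asm
         checks four such blocks unconditionally, then finishes the
         alignment with 48B-per-iteration passes of "loop".  */
      const char *p = s + head;
      while ((uintptr_t) p & 63)
        {
          i = first_nul (p, 16);
          if (i < 16)
            return (size_t) (p - s) + i;
          p += 16;
        }

      /* 3. Main loop: 64B per iteration, so loads never cross a page.  */
      for (;;)
        {
          i = first_nul (p, 64);
          if (i < 64)
            return (size_t) (p - s) + i;
          p += 64;
        }
    }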
> +/* Implements the function
> +
> +   int [r3] strlen (void *s [r3])

const void *s?

> +
> +   The implementation can load bytes past a matching byte, but only
> +   up to the next 16B or 64B boundary, so it never crosses a page. */
> +
> +.machine power9
> +ENTRY_TOCLESS (STRLEN, 4)
> +	CALL_MCOUNT 2
> +
> +	mr r4, r3

This can be moved later, and folded into the "add" below. In my
experiments, it helped performance for tiny strings.

Extra space after comma.

> +	vspltisb v18, 0
> +	vspltisb v19, -1

Extra spaces after commas.

> +
> +	neg r5,r3
> +	rldicl r9,r5,0,60	/* How many bytes to get source 16B aligned? */
> +
> +
> +	/* Align data and fill bytes not loaded with non matching char */

Missing '.' after 'char', but I suggest some different comments
(subjective)...

Consider:
  /* Load cache line containing beginning of string. */

> +	lvx v0,0,r4

Consider:
  /* Create permute vector to shift into alignment. */

> +	lvsr v1,0,r4

To move the "mr" above later, both of the above instructions would thus
need to use "r3" instead of "r4".

Consider:
  /* Shift into alignment, filling with 0xff. */

> +	vperm v0,v19,v0,v1
> +
> +	vcmpequb. v6,v0,v18	/* 0xff if byte matches, 0x00 otherwise */

Need a '.' after 'otherwise'.

> +	beq cr6,L(aligned)
> +

Consider:
  /* String ends within first cache line. Compute length. */

> +	vctzlsbb r3,v6
> +	blr
> +
> +	/* Test 64B 16B at a time. The vector loop is costly for small strings. */

Consider:
  /* Test 64B 16B at a time. Postpone the vector loop ("loop", below),
     which is costly for small strings. */

> +L(aligned):
> +	add r4,r4,r9

And this can change to "add r4,r3,r9".

> +
> +	rldicl. r5, r4, 60, 62	/* Determine how many 48B loops we should run */

Consider:
  /* Determine how many 48B loops we should run in "loop" below.
     48B loops perform better than simpler 16B loops. */

Extra spaces after commas.

Should this calculation be moved down, just before its use at "beq", or
does it schedule better if left here? Since the result is not used until
after the next 14 instructions, strings of these lengths are penalized.
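By the way, the iteration count here took me a second to verify, so for
the record (my arithmetic, not anything from the patch): r4 is 16B
aligned at this point, so the rldicl. is just r5 = (r4 >> 4) & 3, the
index of the 16B block within the current 64B block. Each pass of
"loop" advances the address by 48, which is -16 (mod 64), so exactly r5
passes land the address on a 64B boundary (the intervening addi of 64
below does not change the alignment). A throwaway C check of that claim:

    #include <assert.h>
    #include <stdint.h>

    int main (void)
    {
      /* addr stands for the 16B-aligned address at L(aligned).  */
      for (uint64_t addr = 0; addr < 4096; addr += 16)
        {
          uint64_t n = (addr >> 4) & 3;   /* What the rldicl. computes.  */
          /* 64B are checked first, then n passes of the 48B loop.  */
          assert (((addr + 64 + 48 * n) & 63) == 0);
        }
      return 0;
    }

That also matches the "checking 48/96/144 bytes" remark in the
description.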
> +
> +	lxv v0+32,0(r4)

Is the "+32" needed to accommodate a binutils that doesn't support VSX
registers?

> +	vcmpequb. v6,v0,v18	/* 0xff if byte matches, 0x00 otherwise */

Need a '.' after 'otherwise'.

> +	bne cr6,L(tail1)
> +
> +	lxv v0+32,16(r4)
> +	vcmpequb. v6,v0,v18	/* 0xff if byte matches, 0x00 otherwise */

Need a '.' after 'otherwise'.

> +	bne cr6,L(tail2)
> +
> +	lxv v0+32,32(r4)
> +	vcmpequb. v6,v0,v18	/* 0xff if byte matches, 0x00 otherwise */

Need a '.' after 'otherwise'.

> +	bne cr6,L(tail3)
> +
> +	lxv v0+32,48(r4)
> +	vcmpequb. v6,v0,v18	/* 0xff if byte matches, 0x00 otherwise */

Need a '.' after 'otherwise'.

> +	bne cr6,L(tail4)
> +	addi r4, r4, 64

Extra spaces after commas.

> +
> +	/* prep for weird constant generation of reduction */

Need leading capitalization ("Prep..."). But, maybe a better comment
instead...

  /* Load a dummy aligned address (0) so that 'lvsl' produces a shift vector
   * of 0..15. */

> +	li r0, 0

Extra space after ','.

Would it be bad to move this just a little closer to where it is used
much later?

> +
> +	/* Skip the alignment if not needed */

Need a '.' after 'needed'.

(Above "rldicl." could be moved as late as here.)

> +	beq L(loop_64b)
> +	mtctr r5
> +
> +	/* Test 48B per iteration until 64B aligned */

Need a '.' after 'aligned'.

> +	.p2align 5
> +L(loop):
> +	lxv v0+32,0(r4)
> +	vcmpequb. v6,v0,v18	/* 0xff if byte matches, 0x00 otherwise */

Need a '.' after 'otherwise'.

> +	bne cr6,L(tail1)
> +
> +	lxv v0+32,16(r4)
> +	vcmpequb. v6,v0,v18	/* 0xff if byte matches, 0x00 otherwise */

Here, too.

> +	bne cr6,L(tail2)
> +
> +	lxv v0+32,32(r4)
> +	vcmpequb. v6,v0,v18	/* 0xff if byte matches, 0x00 otherwise */

And here.

> +	bne cr6,L(tail3)
> +
> +	addi r4,r4,48
> +	bdnz L(loop)
> +
> +	.p2align 5
> +L(loop_64b):
> +	lxv v1+32, 0(r4)	/* Load 4 quadwords. */
> +	lxv v2+32, 16(r4)
> +	lxv v3+32, 32(r4)
> +	lxv v4+32, 48(r4)
> +	vminub v5,v1,v2	/* Compare and merge into one VR for speed. */
> +	vminub v6,v3,v4
> +	vminub v7,v5,v6
> +	vcmpequb. v7,v7,v18	/* Check for NULLs. */
> +	addi r4,r4,64	/* Adjust address for the next iteration. */
> +	bne cr6,L(vmx_zero)
> +
> +	lxv v1+32, 0(r4)	/* Load 4 quadwords. */
> +	lxv v2+32, 16(r4)
> +	lxv v3+32, 32(r4)
> +	lxv v4+32, 48(r4)
> +	vminub v5,v1,v2	/* Compare and merge into one VR for speed. */
> +	vminub v6,v3,v4
> +	vminub v7,v5,v6
> +	vcmpequb. v7,v7,v18	/* Check for NULLs. */
> +	addi r4,r4,64	/* Adjust address for the next iteration. */
> +	bne cr6,L(vmx_zero)
> +
> +	lxv v1+32, 0(r4)	/* Load 4 quadwords. */
> +	lxv v2+32, 16(r4)
> +	lxv v3+32, 32(r4)
> +	lxv v4+32, 48(r4)
> +	vminub v5,v1,v2	/* Compare and merge into one VR for speed. */
> +	vminub v6,v3,v4
> +	vminub v7,v5,v6
> +	vcmpequb. v7,v7,v18	/* Check for NULLs. */
> +	addi r4,r4,64	/* Adjust address for the next iteration. */
> +	beq cr6,L(loop_64b)

Curious how much this loop unrolling helps, since it adds a fair bit of
redundant code?

> +
> +L(vmx_zero):
> +	/* OK, we found a null byte. Let's look for it in the current 64-byte
> +	   block and mark it in its corresponding VR. */
> +	vcmpequb v1,v1,v18
> +	vcmpequb v2,v2,v18
> +	vcmpequb v3,v3,v18
> +	vcmpequb v4,v4,v18
> +
> +	/* We will now 'compress' the result into a single doubleword, so it
> +	   can be moved to a GPR for the final calculation. First, we
> +	   generate an appropriate mask for vbpermq, so we can permute bits into
> +	   the first halfword. */

I'm wondering (without having verified) if you can do something here akin
to what's done in the "tail" sections below, using "vctzlsbb".

> +	vspltisb v10,3
> +	lvsl v11,r0,r0

Second field should probably be "0" instead of "r0" ("v11,0,r0").

> +	vslb v10,v11,v10
> +
> +	/* Permute the first bit of each byte into bits 48-63. */
> +	vbpermq v1,v1,v10
> +	vbpermq v2,v2,v10
> +	vbpermq v3,v3,v10
> +	vbpermq v4,v4,v10
> +
> +	/* Shift each component into its correct position for merging. */
> +	vsldoi v2,v2,v2,2
> +	vsldoi v3,v3,v3,4
> +	vsldoi v4,v4,v4,6
> +
> +	/* Merge the results and move to a GPR. */
> +	vor v1,v2,v1
> +	vor v2,v3,v4
> +	vor v4,v1,v2
> +	mfvrd r10,v4
> +
> +	/* Adjust address to the beginning of the current 64-byte block. */
> +	addi r4,r4,-64
> +
> +	addi r9, r10,-1	/* Form a mask from trailing zeros. */
> +	andc r9, r9,r10
> +	popcntd r0, r9	/* Count the bits in the mask. */

Extra spaces after the first comma in the above 3 lines.

> +	subf r5,r3,r4
> +	add r3,r5,r0	/* Compute final length. */
> +	blr
> +
> +L(tail1):
> +	vctzlsbb r0,v6
> +	add r4,r4,r0
> +	subf r3,r3,r4
> +	blr
> +
> +L(tail2):
> +	vctzlsbb r0,v6
> +	add r4,r4,r0
> +	addi r4,r4,16
> +	subf r3,r3,r4
> +	blr
> +
> +L(tail3):
> +	vctzlsbb r0,v6
> +	add r4,r4,r0
> +	addi r4,r4,32
> +	subf r3,r3,r4
> +	blr
> +
> +L(tail4):
> +	vctzlsbb r0,v6
> +	add r4,r4,r0
> +	addi r4,r4,48
> +	subf r3,r3,r4
> +	blr
> +
> +END (STRLEN)

(snip)

PC