Date: Wed, 27 May 2020 11:45:54 -0500
From: "Paul A. Clarke"
To: "Paul E. Murphy"
Cc: libc-alpha@sourceware.org, anton@ozlabs.org
Subject: Re: [PATCH] powerpc64le: add optimized strlen for P9
Message-ID: <20200527164554.GA13085@oc3272150783.ibm.com>
References: <20200521191048.1566568-1-murphyp@linux.vnet.ibm.com>
In-Reply-To: <20200521191048.1566568-1-murphyp@linux.vnet.ibm.com>

On Thu, May 21, 2020 at 02:10:48PM -0500, Paul E. Murphy via Libc-alpha wrote:
> This is a followup to rawmemchr/strlen from Anton. I missed
> his original strlen patch, and likewise I wasn't happy with
> the 3-4% performance drop for larger strings which occurs
> around 2.5kB as the P8 vector loop is a bit faster. As noted,
> this is up to 50% faster for small strings, and about 1% faster
> for larger strings (I hazard to guess this is some uarch difference
> between lxv and lvx).
>
> I guess this is a semi-V2 of the patch. Likewise, I need to
> double check binutils 2.26 supports the P9 insn used here.
>
> ---8<---
>
> This started as a trivial change to Anton's rawmemchr. I got
> carried away. This is a hybrid between P8's asymptotically
> faster 64B checks with extremely efficient small string checks
> e.g. <64B (and sometimes a little bit more depending on alignment).
>
> The second trick is to align to 64B by running a 48B checking loop
> 16B at a time until we naturally align to 64B (i.e. checking 48/96/144
> bytes/iteration based on the alignment after the first 5 comparisons).
> This alleviates the need to check page boundaries.
>
> Finally, explicitly use the P7 strlen with the runtime loader when building
> P9. We need to be cautious about vector/vsx extensions here on P9-only
> builds.
> ---
>  .../powerpc/powerpc64/le/power9/rtld-strlen.S |   1 +
>  sysdeps/powerpc/powerpc64/le/power9/strlen.S  | 215 ++++++++++++++++++
>  sysdeps/powerpc/powerpc64/multiarch/Makefile  |   2 +-
>  .../powerpc64/multiarch/ifunc-impl-list.c     |   4 +
>  .../powerpc64/multiarch/strlen-power9.S       |   2 +
>  sysdeps/powerpc/powerpc64/multiarch/strlen.c  |   5 +
>  6 files changed, 228 insertions(+), 1 deletion(-)
>  create mode 100644 sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S
>  create mode 100644 sysdeps/powerpc/powerpc64/le/power9/strlen.S
>  create mode 100644 sysdeps/powerpc/powerpc64/multiarch/strlen-power9.S
>
> diff --git a/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S b/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S
> new file mode 100644
> index 0000000000..e9d83323ac
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S
> @@ -0,0 +1 @@
> +#include 
> diff --git a/sysdeps/powerpc/powerpc64/le/power9/strlen.S b/sysdeps/powerpc/powerpc64/le/power9/strlen.S
> new file mode 100644
> index 0000000000..084d6e31a8
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/le/power9/strlen.S
> @@ -0,0 +1,215 @@
> +
> +/* Optimized rawmemchr implementation for PowerPC64/POWER9.
> +   Copyright (C) 2020 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   . */
> +
> +#include 
> +
> +#ifndef STRLEN
> +# define STRLEN __strlen
> +# define DEFINE_STRLEN_HIDDEN_DEF 1
> +#endif
> +
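As an aside, here is how I read the overall flow, as a rough scalar C
model (my own sketch under that reading, not code from the patch; it
ignores the vector details and the 48B unrolling of the alignment loop):

    #include <stddef.h>
    #include <stdint.h>

    /* Return the index of the first NUL in p[0..n-1], or n if none.  */
    static size_t
    first_nul (const char *p, size_t n)
    {
      size_t i;
      for (i = 0; i < n; i++)
        if (p[i] == '\0')
          break;
      return i;
    }

    size_t
    strlen_model (const char *s)
    {
      /* 1. Check the bytes from s up to the next 16B boundary.  The asm
         does this with one aligned lvx plus vperm, filling the leading
         bytes with 0xff so they can never compare equal to zero.  */
      size_t head = 16 - ((uintptr_t) s & 15);
      size_t i = first_nul (s, head);
      if (i < head)
        return i;

      /* 2. Check 16B blocks until the address is 64B aligned.  The asm
         checks four such blocks unconditionally, then finishes the
         alignment with 48B-per-iteration passes of "loop".  */
      const char *p = s + head;
      while ((uintptr_t) p & 63)
        {
          i = first_nul (p, 16);
          if (i < 16)
            return (size_t) (p - s) + i;
          p += 16;
        }

      /* 3. Main loop: 64B per iteration, so loads never cross a page.  */
      for (;;)
        {
          i = first_nul (p, 64);
          if (i < 64)
            return (size_t) (p - s) + i;
          p += 64;
        }
    }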
> +/* Implements the function
> +
> +   int [r3] strlen (void *s [r3])

const void *s?

> +
> +   The implementation can load bytes past a matching byte, but only
> +   up to the next 16B or 64B boundary, so it never crosses a page. */
> +
> +.machine power9
> +ENTRY_TOCLESS (STRLEN, 4)
> +	CALL_MCOUNT 2
> +
> +	mr r4, r3

This can be moved later, and folded into the "add" below. In my
experiments, it helped performance for tiny strings.

Extra space after comma.

> +	vspltisb v18, 0
> +	vspltisb v19, -1

Extra spaces after commas.

> +
> +	neg r5,r3
> +	rldicl r9,r5,0,60	/* How many bytes to get source 16B aligned? */
> +
> +
> +	/* Align data and fill bytes not loaded with non matching char */

Missing '.' after 'char', but I suggest some different comments
(subjective)...

Consider:
  /* Load cache line containing beginning of string. */

> +	lvx v0,0,r4

Consider:
  /* Create permute vector to shift into alignment. */

> +	lvsr v1,0,r4

To move the "mr" above later, both of the above instructions would thus
need to use "r3" instead of "r4".

Consider:
  /* Shift into alignment, filling with 0xff. */

> +	vperm v0,v19,v0,v1
> +
> +	vcmpequb. v6,v0,v18	/* 0xff if byte matches, 0x00 otherwise */

Need a '.' after 'otherwise'.

> +	beq cr6,L(aligned)
> +

Consider:
  /* String ends within first cache line. Compute length. */

> +	vctzlsbb r3,v6
> +	blr
> +
> +	/* Test 64B 16B at a time. The vector loop is costly for small strings. */

Consider:
  /* Test 64B 16B at a time. Postpone the vector loop ("loop", below),
     which is costly for small strings. */

> +L(aligned):
> +	add r4,r4,r9

And this can change to "add r4,r3,r9".

> +
> +	rldicl. r5, r4, 60, 62	/* Determine how many 48B loops we should run */

Consider:
  /* Determine how many 48B loops we should run in "loop" below.
     48B loops perform better than simpler 16B loops. */

Extra spaces after commas.

Should this calculation be moved down, just before its use at "beq", or
does it schedule better if left here? Since the result is not used until
after the next 14 instructions, strings of these lengths are penalized.
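By the way, the iteration count here took me a second to verify, so for
the record (my arithmetic, not anything from the patch): r4 is 16B
aligned at this point, so the rldicl. is just r5 = (r4 >> 4) & 3, the
index of the 16B block within the current 64B block. Each pass of
"loop" advances the address by 48, which is -16 (mod 64), so exactly r5
passes land the address on a 64B boundary (the intervening addi of 64
below does not change the alignment). A throwaway C check of that claim:

    #include <assert.h>
    #include <stdint.h>

    int main (void)
    {
      /* addr stands for the 16B-aligned address at L(aligned).  */
      for (uint64_t addr = 0; addr < 4096; addr += 16)
        {
          uint64_t n = (addr >> 4) & 3;   /* What the rldicl. computes.  */
          /* 64B are checked first, then n passes of the 48B loop.  */
          assert (((addr + 64 + 48 * n) & 63) == 0);
        }
      return 0;
    }

That also matches the "checking 48/96/144 bytes" remark in the
description.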
> +
> +	lxv v0+32,0(r4)

Is the "+32" needed to accommodate a binutils that doesn't support VSX
registers?

> +	vcmpequb. v6,v0,v18	/* 0xff if byte matches, 0x00 otherwise */

Need a '.' after 'otherwise'.

> +	bne cr6,L(tail1)
> +
> +	lxv v0+32,16(r4)
> +	vcmpequb. v6,v0,v18	/* 0xff if byte matches, 0x00 otherwise */

Need a '.' after 'otherwise'.

> +	bne cr6,L(tail2)
> +
> +	lxv v0+32,32(r4)
> +	vcmpequb. v6,v0,v18	/* 0xff if byte matches, 0x00 otherwise */

Need a '.' after 'otherwise'.

> +	bne cr6,L(tail3)
> +
> +	lxv v0+32,48(r4)
> +	vcmpequb. v6,v0,v18	/* 0xff if byte matches, 0x00 otherwise */

Need a '.' after 'otherwise'.

> +	bne cr6,L(tail4)
> +	addi r4, r4, 64

Extra spaces after commas.

> +
> +	/* prep for weird constant generation of reduction */

Need leading capitalization ("Prep..."). But, maybe a better comment
instead...

  /* Load a dummy aligned address (0) so that 'lvsl' produces a shift vector
   * of 0..15. */

> +	li r0, 0

Extra space after ','.

Would it be bad to move this just a little closer to where it is used
much later?

> +
> +	/* Skip the alignment if not needed */

Need a '.' after 'needed'.

(Above "rldicl." could be moved as late as here.)

> +	beq L(loop_64b)
> +	mtctr r5
> +
> +	/* Test 48B per iteration until 64B aligned */

Need a '.' after 'aligned'.

> +	.p2align 5
> +L(loop):
> +	lxv v0+32,0(r4)
> +	vcmpequb. v6,v0,v18	/* 0xff if byte matches, 0x00 otherwise */

Need a '.' after 'otherwise'.

> +	bne cr6,L(tail1)
> +
> +	lxv v0+32,16(r4)
> +	vcmpequb. v6,v0,v18	/* 0xff if byte matches, 0x00 otherwise */

Here, too.

> +	bne cr6,L(tail2)
> +
> +	lxv v0+32,32(r4)
> +	vcmpequb. v6,v0,v18	/* 0xff if byte matches, 0x00 otherwise */

And here.

> +	bne cr6,L(tail3)
> +
> +	addi r4,r4,48
> +	bdnz L(loop)
> +
> +	.p2align 5
> +L(loop_64b):
> +	lxv v1+32, 0(r4)	/* Load 4 quadwords. */
> +	lxv v2+32, 16(r4)
> +	lxv v3+32, 32(r4)
> +	lxv v4+32, 48(r4)
> +	vminub v5,v1,v2	/* Compare and merge into one VR for speed. */
> +	vminub v6,v3,v4
> +	vminub v7,v5,v6
> +	vcmpequb. v7,v7,v18	/* Check for NULLs. */
> +	addi r4,r4,64	/* Adjust address for the next iteration. */
> +	bne cr6,L(vmx_zero)
> +
> +	lxv v1+32, 0(r4)	/* Load 4 quadwords. */
> +	lxv v2+32, 16(r4)
> +	lxv v3+32, 32(r4)
> +	lxv v4+32, 48(r4)
> +	vminub v5,v1,v2	/* Compare and merge into one VR for speed. */
> +	vminub v6,v3,v4
> +	vminub v7,v5,v6
> +	vcmpequb. v7,v7,v18	/* Check for NULLs. */
> +	addi r4,r4,64	/* Adjust address for the next iteration. */
> +	bne cr6,L(vmx_zero)
> +
> +	lxv v1+32, 0(r4)	/* Load 4 quadwords. */
> +	lxv v2+32, 16(r4)
> +	lxv v3+32, 32(r4)
> +	lxv v4+32, 48(r4)
> +	vminub v5,v1,v2	/* Compare and merge into one VR for speed. */
> +	vminub v6,v3,v4
> +	vminub v7,v5,v6
> +	vcmpequb. v7,v7,v18	/* Check for NULLs. */
> +	addi r4,r4,64	/* Adjust address for the next iteration. */
> +	beq cr6,L(loop_64b)

Curious how much this loop unrolling helps, since it adds a fair bit of
redundant code?

> +
> +L(vmx_zero):
> +	/* OK, we found a null byte. Let's look for it in the current 64-byte
> +	   block and mark it in its corresponding VR. */
> +	vcmpequb v1,v1,v18
> +	vcmpequb v2,v2,v18
> +	vcmpequb v3,v3,v18
> +	vcmpequb v4,v4,v18
> +
> +	/* We will now 'compress' the result into a single doubleword, so it
> +	   can be moved to a GPR for the final calculation. First, we
> +	   generate an appropriate mask for vbpermq, so we can permute bits into
> +	   the first halfword. */

I'm wondering (without having verified) if you can do something here akin
to what's done in the "tail" sections below, using "vctzlsbb".

> +	vspltisb v10,3
> +	lvsl v11,r0,r0

Second field should probably be "0" instead of "r0" ("v11,0,r0").

> +	vslb v10,v11,v10
> +
> +	/* Permute the first bit of each byte into bits 48-63. */
> +	vbpermq v1,v1,v10
> +	vbpermq v2,v2,v10
> +	vbpermq v3,v3,v10
> +	vbpermq v4,v4,v10
> +
> +	/* Shift each component into its correct position for merging. */
> +	vsldoi v2,v2,v2,2
> +	vsldoi v3,v3,v3,4
> +	vsldoi v4,v4,v4,6
> +
> +	/* Merge the results and move to a GPR. */
> +	vor v1,v2,v1
> +	vor v2,v3,v4
> +	vor v4,v1,v2
> +	mfvrd r10,v4
> +
> +	/* Adjust address to the beginning of the current 64-byte block. */
> +	addi r4,r4,-64
> +
> +	addi r9, r10,-1	/* Form a mask from trailing zeros. */
> +	andc r9, r9,r10
> +	popcntd r0, r9	/* Count the bits in the mask. */

Extra spaces after the first comma in the above 3 lines.

> +	subf r5,r3,r4
> +	add r3,r5,r0	/* Compute final length. */
> +	blr
> +
> +L(tail1):
> +	vctzlsbb r0,v6
> +	add r4,r4,r0
> +	subf r3,r3,r4
> +	blr
> +
> +L(tail2):
> +	vctzlsbb r0,v6
> +	add r4,r4,r0
> +	addi r4,r4,16
> +	subf r3,r3,r4
> +	blr
> +
> +L(tail3):
> +	vctzlsbb r0,v6
> +	add r4,r4,r0
> +	addi r4,r4,32
> +	subf r3,r3,r4
> +	blr
> +
> +L(tail4):
> +	vctzlsbb r0,v6
> +	add r4,r4,r0
> +	addi r4,r4,48
> +	subf r3,r3,r4
> +	blr
> +
> +END (STRLEN)

(snip)

PC