Message-ID: <29863c0d1eb285c1eb62336c0e1592e317f89349.camel@xry111.site>
Subject: Re: [PATCH 2/2] Loongarch: Add ifunc support and add different versions of strlen
From: Xi Ruoyao
To: Adhemerval Zanella Netto, dengjianbo, caiyinyu, libc-alpha@sourceware.org
Cc: xuchenghua@loongson.cn, huangpei@loongson.cn
Date: Thu, 03 Aug 2023 22:53:36 +0800
In-Reply-To: <2aed2087-c44e-8fed-83d1-5e60343c8f47@linaro.org>
References: <20230801070902.1385953-1-dengjianbo@loongson.cn>
 <20230801070902.1385953-3-dengjianbo@loongson.cn>
 <8367cd72-e458-7a1c-63af-01280d8ccc7a@linaro.org>
 <60b7843c-a32c-a1b5-ffc1-1ad769fb2d57@linaro.org>
 <2aed2087-c44e-8fed-83d1-5e60343c8f47@linaro.org>

On Thu, 2023-08-03 at 10:48 -0300, Adhemerval Zanella Netto wrote:
> On 03/08/23 10:27, dengjianbo wrote:
> > On 2023-08-02 20:59, Adhemerval Zanella Netto wrote:
> > > > On 2023-08-02 10:31, Adhemerval Zanella Netto wrote:
> > > > > > > +#if IS_IN (libc)
> > > > > > > +# define STRLEN __strlen_aligned
> > > > > > > +#else
> > > > > > > +# define STRLEN strlen
> > > > > > > +#endif
> > > > > > Is this really an improvement over the generic implementation?  It
> > > > > > seems to use a quite similar strategy.
> > > > Compared with the code generated by the compiler, the assembly code
> > > > unrolls the loop to 16 bytes per iteration and handles ASCII and
> > > > non-ASCII data separately, which takes fewer instructions to
> > > > calculate the length of ASCII data.  Besides, the assembly code uses
> > > > fewer instructions to start the loop.  I think the performance
> > > > improvement comes from this.  Please kindly check the bench results
> > > > also from:
> > > > https://github.com/jiadengx/glibc_test/blob/main/strlen/bench-strlen.out
> > > From the summarized results [1], it seems that the initial start to mask
> > > off unaligned inputs is slightly better.  The __strlen_aligned only seems
> > > better for sizes larger than 32 (the 16 length results seem strange).
> > > Maybe you could improve shift_find/find_zero_all/index_first on loongarch.
> > >
> > > Does it improve by explicitly instructing the compiler to unroll the loop?
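
For concreteness, a minimal sketch of the kind of experiment being asked
about here: a word-at-a-time strlen whose main loop carries an explicit
unroll hint.  The names sketch_strlen, has_zero_byte and load_word are
hypothetical and only for illustration; this is not glibc's generic strlen
nor its has_zero/find_zero_all helpers.  The unroll factor of 2 (16 bytes
per iteration) is an assumption chosen to mirror the 16-byte unrolling of
the assembly version, and whether the hint actually helps depends on the
compiler and hardware.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero iff at least one byte of x is 0x00 (classic SWAR test). */
static inline uint64_t has_zero_byte(uint64_t x)
{
    return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
}

/* Load 8 bytes through memcpy to stay clear of strict-aliasing issues. */
static inline uint64_t load_word(const char *p)
{
    uint64_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

/* Hypothetical name; an illustration, not a drop-in replacement. */
size_t sketch_strlen(const char *s)
{
    const char *p = s;

    /* Head: advance byte-wise until p is 8-byte aligned. */
    while (((uintptr_t)p & 7) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Main loop: one aligned 8-byte word per step, with a hint asking
       GCC to unroll twice (16 bytes per iteration).  Like most
       word-at-a-time code, this may read up to 7 bytes past the
       terminator inside the same aligned word; that cannot cross a
       page boundary, but it is outside strict ISO C.  */
    uint64_t w = load_word(p);
#pragma GCC unroll 2
    while (!has_zero_byte(w)) {
        p += sizeof w;
        w = load_word(p);
    }

    /* Tail: the terminator is within the next 8 bytes; locate it
       byte-wise so the sketch is endian-independent.  */
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

Comparing a build of this with and without the pragma (or with
-funroll-all-loops) on the same machine would be one way to check where
the difference in the benchmark comes from.
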
> > As you know, the assembly versions of strlen use the same strategy to
> > calculate string length.  If the assembly code only calculated 8 bytes
> > in the loop and did not separate ASCII and non-ASCII data, the loop and
> > loop-end code should be the same as the compiler-generated code based on
> > the generic strlen.  Loongarch doesn't provide instructions like alpha
> > cmpbge, so not much optimization can be done on
> > find_zero_all/index_first/has_zero except that we can remove some
> > BIG_ENDIAN code.

Removing it will not make any difference because the compiler will
optimize the BIG_ENDIAN paths away.

> > Refer to the latest test results in the chart: the assembly
> > implementation vs. the generic strlen implementation (compiled using
> > CFLAGS-strlen.c += -funroll-all-loops --param
> > max-variable-expansions-in-unroller=2).  The performance
> > improvement of the assembly implementation is evident (30% ~ 40%),
> > especially in cases where the length is greater than 64 bytes.
> > Please kindly see the results via:
> > https://github.com/jiadengx/glibc_test/blob/main/strlen2/bench1/generic_strlen_with_loop_unrolling.png
>
> So maybe use the generic implementation plus the compiler flags for loop
> unrolling instead of asm optimization?

This is strange... I remember I'd attempted to add #pragma GCC unroll
for the main loop of strlen and I observed no performance gain on my
Loongson-3A5000-HV, at all.  Maybe a different test environment
(hardware, compiler version, or something)?

> > > > > > This implementation fails to assemble with binutils 2.40.0.20230525:
> > > > > > ../sysdeps/loongarch/lp64/multiarch/strlen-lsx.S: Assembler messages:
> > > > > > ../sysdeps/loongarch/lp64/multiarch/strlen-lsx.S:30: Error: no match insn: vld  $vr0,$r4,0
> > > > > > ../sysdeps/loongarch/lp64/multiarch/strlen-lsx.S:31: Error: no match insn: vld  $vr1,$r4,16
> > > > > >
> > > > Sorry, it's my mistake for using the wrong version of binutils.
> > > > Could you please try the latest release
> > > > version 2.41?
> > > Although it should work, it is unexpected that, depending on the
> > > assembler used, some optimized routines are not enabled.
> >
> > In patch v2, a new configuration variable has been added to control
> > whether the LASX/LSX versions are compiled, according to whether the
> > assembler supports LASX/LSX, so it can be built with old versions of
> > binutils.
>
> Yes I am aware and this seems odd, albeit not really wrong.  It means that
> you will get less code coverage and fewer optimizations depending on the
> binutils used.
>
> I would advise following what other architectures did to provide
> arch-specific optimizations, which is either to set up a minimum
> gcc/binutils version (for instance aarch64 libmvec), or to encode the
> instructions in a binutils-neutral way (as in the powerpc implementation
> I pointed out).

Hmm, this policy seems different from $OTHER_PROJECTS.

-- 
Xi Ruoyao
School of Aerospace Science and Technology, Xidian University