From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <29863c0d1eb285c1eb62336c0e1592e317f89349.camel@xry111.site>
Subject: Re: [PATCH 2/2] Loongarch: Add ifunc support and add different
 versions of strlen
From: Xi Ruoyao <xry111@xry111.site>
To: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>, dengjianbo
	 <dengjianbo@loongson.cn>, caiyinyu <caiyinyu@loongson.cn>, 
	libc-alpha@sourceware.org
Cc: xuchenghua@loongson.cn, huangpei@loongson.cn
Date: Thu, 03 Aug 2023 22:53:36 +0800
In-Reply-To: <2aed2087-c44e-8fed-83d1-5e60343c8f47@linaro.org>
References: <20230801070902.1385953-1-dengjianbo@loongson.cn>
	 <20230801070902.1385953-3-dengjianbo@loongson.cn>
	 <8367cd72-e458-7a1c-63af-01280d8ccc7a@linaro.org>
	 <aafa831a-ca62-b110-451c-1ec417e76b0e@loongson.cn>
	 <d99e7f37-2201-54d4-3677-c2e35197ab51@loongson.cn>
	 <60b7843c-a32c-a1b5-ffc1-1ad769fb2d57@linaro.org>
	 <c5e923cd-f014-636d-441e-5fd06400fdc0@loongson.cn>
	 <2aed2087-c44e-8fed-83d1-5e60343c8f47@linaro.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
User-Agent: Evolution 3.48.4 
MIME-Version: 1.0
List-Id: <libc-alpha.sourceware.org>

On Thu, 2023-08-03 at 10:48 -0300, Adhemerval Zanella Netto wrote:
> On 03/08/23 10:27, dengjianbo wrote:
> > On 2023-08-02 20:59, Adhemerval Zanella Netto wrote:
> > > > > > > On 2023-08-02 10:31, Adhemerval Zanella Netto wrote:
> > > > > > > +#if IS_IN (libc)
> > > > > > > +# define STRLEN __strlen_aligned
> > > > > > > +#else
> > > > > > > +# define STRLEN strlen
> > > > > > > +#endif
> > > > > > > Is this really an improvement over the generic implementation? It seems to
> > > > > > use a quite similar strategy.
> > > > Compared with the compiler-generated code, the assembly code unrolls the loop to
> > > > 16 bytes per iteration and handles ASCII and non-ASCII data separately, which takes
> > > > fewer instructions to compute the length of ASCII data. It also uses fewer
> > > > instructions to enter the loop. I think the performance improvement comes from
> > > > this. Please also check the bench results at:
> > > > https://github.com/jiadengx/glibc_test/blob/main/strlen/bench-strlen.out
> > > From the summarized results [1], it seems that the initial step of masking
> > > off unaligned inputs is slightly better.  __strlen_aligned only seems
> > > better for sizes larger than 32 (the results for length 16 seem strange).
> > > Maybe you could improve shift_find/find_zero_all/index_first on loongarch.
> > >
> > > Does it improve by explicitly instructing the compiler to unroll the loop?
> > As you know, the assembly versions of strlen use the same strategy to
> > calculate the string length. If the assembly code only processed 8 bytes
> > per loop iteration and did not separate ASCII and non-ASCII data, the loop
> > and loop-end code would be the same as the compiler-generated code based on
> > the generic strlen. LoongArch doesn't provide instructions like Alpha's
> > cmpbge, so not much optimization can be done on
> > find_zero_all/index_first/has_zero, except that we can remove some
> > BIG_ENDIAN code.

Removing them will not make any difference because the compiler will
optimize the BIG_ENDIAN paths away.
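
To illustrate the point, here is a minimal sketch (not the glibc source;
sketch_index_first is a made-up name, and it assumes GCC's predefined
__BYTE_ORDER__ macros and the __builtin_ctzll/__builtin_clzll builtins).
The byte order test is a compile-time constant, so only one branch should
survive in the generated code:

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper: index of the first byte (in memory order) that is
   flagged in MASK.  MASK must be nonzero, because ctz/clz of 0 is
   undefined.  The condition below is a constant expression, so the
   compiler can discard the branch for the other byte order.  */
static inline unsigned int
sketch_index_first (uint64_t mask)
{
  if (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
    return __builtin_ctzll (mask) / CHAR_BIT;
  else
    return __builtin_clzll (mask) / CHAR_BIT;
}
```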

> > Please refer to the latest test results in the chart, which compares the
> > assembly implementation with the generic strlen implementation (compiled
> > with CFLAGS-strlen.c += -funroll-all-loops --param
> > max-variable-expansions-in-unroller=2). The performance
> > improvement of the assembly implementation is evident (30% ~ 40%),
> > especially when the length is greater than 64 bytes.
> > Please see the results at:
> > https://github.com/jiadengx/glibc_test/blob/main/strlen2/bench1/generic_strlen_with_loop_unrolling.png
>
> So maybe use the generic implementation plus the compiler flags for loop
> unrolling instead of asm optimization?

This is strange... I remember I'd attempted to add #pragma GCC unroll
for the main loop of strlen and I observed no performance gain on my
Loongson-3A5000-HV, at all.  Maybe a different test environment
(hardware, compiler version, or something)?

> > > > > > This implementation fails to assemble with binutils 2.40.0.20230525:
> > > > > > ../sysdeps/loongarch/lp64/multiarch/strlen-lsx.S: Assembler messages:
> > > > > > ../sysdeps/loongarch/lp64/multiarch/strlen-lsx.S:30: Error: no match insn: vld  $vr0,$r4,0
> > > > > > ../sysdeps/loongarch/lp64/multiarch/strlen-lsx.S:31: Error: no match insn: vld  $vr1,$r4,16
> > > > > >
> > > > Sorry, it was my mistake to use the wrong version of binutils. Could you please try the latest release
> > > > version 2.41?
> > > Although it should work, it is unexpected that, depending on the assembler used,
> > > some optimized routines are not enabled.
> >
> > In patch v2, a new configuration variable has been added to control
> > whether the LASX/LSX versions are compiled, depending on whether the
> > assembler supports LASX/LSX, so it can still be built with old versions
> > of binutils.
>
> Yes I am aware and this seems odd, albeit not really wrong.  It means that
> you will get less code coverage and fewer optimizations depending on the
> binutils used.
>
> I would advise following what other architectures did to provide arch-specific
> optimizations, which is either to set up a minimum gcc/binutils version (for
> instance aarch64 libmvec), or to encode the instructions in a binutils-neutral
> mode (as in the powerpc implementation I pointed out).

Hmm, this policy seems different from $OTHER_PROJECTS.

-- 
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University