From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=dsZ1=DV=loongson.cn=caiyinyu@sourceware.org>
Received: from mail.loongson.cn (mail.loongson.cn [114.242.206.163])
	by sourceware.org (Postfix) with ESMTP id 7CD953857715
	for <libc-alpha@sourceware.org>; Fri,  4 Aug 2023 01:50:21 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 7CD953857715
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=loongson.cn
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=loongson.cn
Received: from loongson.cn (unknown [10.20.4.187])
	by gateway (Coremail) with SMTP id _____8Bxd+hbWcxkUfsPAA--.1147S3;
	Fri, 04 Aug 2023 09:50:19 +0800 (CST)
Received: from [10.20.4.187] (unknown [10.20.4.187])
	by localhost.localdomain (Coremail) with SMTP id AQAAf8CxWMxaWcxkVL1HAA--.19000S2;
	Fri, 04 Aug 2023 09:50:18 +0800 (CST)
Subject: Re: [PATCH 2/2] Loongarch: Add ifunc support and add different
 versions of strlen
To: Xi Ruoyao <xry111@xry111.site>,
 Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>,
 dengjianbo <dengjianbo@loongson.cn>, libc-alpha@sourceware.org
Cc: xuchenghua@loongson.cn, huangpei@loongson.cn
References: <20230801070902.1385953-1-dengjianbo@loongson.cn>
 <20230801070902.1385953-3-dengjianbo@loongson.cn>
 <8367cd72-e458-7a1c-63af-01280d8ccc7a@linaro.org>
 <aafa831a-ca62-b110-451c-1ec417e76b0e@loongson.cn>
 <d99e7f37-2201-54d4-3677-c2e35197ab51@loongson.cn>
 <60b7843c-a32c-a1b5-ffc1-1ad769fb2d57@linaro.org>
 <c5e923cd-f014-636d-441e-5fd06400fdc0@loongson.cn>
 <2aed2087-c44e-8fed-83d1-5e60343c8f47@linaro.org>
 <29863c0d1eb285c1eb62336c0e1592e317f89349.camel@xry111.site>
From: caiyinyu <caiyinyu@loongson.cn>
Message-ID: <1fe274bb-bd6b-33ae-1cab-f8951e17c85b@loongson.cn>
Date: Fri, 4 Aug 2023 09:50:18 +0800
User-Agent: Mozilla/5.0 (X11; Linux mips64; rv:68.0) Gecko/20100101
 Thunderbird/68.7.0
MIME-Version: 1.0
In-Reply-To: <29863c0d1eb285c1eb62336c0e1592e317f89349.camel@xry111.site>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Content-Language: en-US
X-CM-TRANSID:AQAAf8CxWMxaWcxkVL1HAA--.19000S2
X-CM-SenderInfo: 5fdl5xhq1xqz5rrqw2lrqou0/
X-Spam-Status: No, score=-6.5 required=5.0 tests=BAYES_00,KAM_DMARC_STATUS,NICE_REPLY_A,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <libc-alpha.sourceware.org>


.....
>>> Refer to the latest test results in the chart: The assembly
>>> implementation vs. generic strlen implementation(compiled by using
>>> CFLAGS-strlen.c += -funroll-all-loops --param
>>> max-variable-expandsions-in-unroller=2) the performance
>>> improvement of the assembly implementation is evident(30% ~ 40%),
>>> especially in cases when the length is greater than 64 bytes.
>>> Please kindly see the results via:
>>> https://github.com/jiadengx/glibc_test/blob/main/strlen2/bench1/generic_strlen_with_loop_unrolling.png
>> So maybe use the generic implementation plus the compiler flags to loop
>> unrolling instead of asm optimization?
> This is strange... I remember I'd attempted to add #pragma GCC unroll
> for the main loop of strlen and I observed no performance gain on my
> Loongson-3A5000-HV, at all.  Maybe a different test environment
> (hardware, compiler version, or something)?

The name of his graph is ambiguous. What he means is that our assembly
implementation performs better than the generic implementation, even
when the generic code is compiled with the loop-unrolling flags. The
assembly implementation is 30% to 40% faster, especially when the
length is greater than 64 bytes.

https://github.com/jiadengx/glibc_test/blob/main/strlen2/bench1/generic_strlen_with_loop_unrolling.png
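
To make the comparison concrete, the sketch below shows one way to
request unrolling of a scan loop from C, using the #pragma GCC unroll
mentioned in the quoted text. It is only an illustration under stated
assumptions: unrolled_strlen is a hypothetical name, the factor 4 is an
arbitrary choice, and this byte-at-a-time loop is much simpler than
glibc's generic strlen. It is also not the same mechanism as the
per-file CFLAGS used for the benchmark.

```c
#include <stddef.h>

/* Hypothetical helper for illustration only; this is not glibc's
   generic strlen, which scans a word at a time.  */
size_t
unrolled_strlen (const char *s)
{
  const char *p = s;

  /* Ask GCC to unroll the scan loop four times (the factor is an
     assumption).  A compiler that does not recognise the pragma
     ignores it, possibly with a warning.  */
#pragma GCC unroll 4
  while (*p != '\0')
    ++p;

  return (size_t) (p - s);
}
```

Whether the compiler actually unrolls such a loop, and whether that
helps, depends on the compiler version and the hardware, so any gain
should be measured on the target machine.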

>
>>>>>>> This implementation fails to assembler with binutils 2.40.0.20230525:
>>>>>>> ../sysdeps/loongarch/lp64/multiarch/strlen-lsx.S: Assembler messages:
>>>>>>> ../sysdeps/loongarch/lp64/multiarch/strlen-lsx.S:30: Error: no match insn: vld  $vr0,$r4,0
>>>>>>> ../sysdeps/loongarch/lp64/multiarch/strlen-lsx.S:31: Error: no match insn: vld  $vr1,$r4,16
>>>>>>>
>>>>> Sorry, it's my mistake for the wrong version of binutils. Could you please try the latest release
>>>>> version 2.41?
>>>> Although it should work, it is unexpected that depending of the assembler used
>>>> some optimized routines are not enabled.
>>> In patch v2, an new configuration variable has been added to control
>>> whether the LASX/LSX will be compiled according to assembler support
>>> LASX/LSX or not, so it can be compiled with old versions of binutils.
>> Yes I am aware and this seems odd, albeit not really wrong.  It means that
>> you will get less code coverage and optimizations depending of the used
>> binutils.
>>
>> I would advise to follow what other architecture did to provide arch-specific
>> optimization, which is either setup a minimum gcc/binutils version (for
>> instance aarch64 libmvec), or encode the instructions in a binutils neutral
>> mode (as the powerpc implementation I pointed out).
> Hmm, this policy seems different from $OTHER_PROJECTS.

I prefer the first option: setting a minimum required version of gcc/binutils.


>