We have changed the strcpy source to implement both strcpy and stpcpy,
using the USE_AS_STPCPY macro to distinguish the two functions. The
stpcpy source defines the related macros and includes the strcpy source.
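
For readers who have not seen this pattern, here is a rough single-file
C sketch of the idea: one shared copy body, with a compile-time switch
selecting only the return value. It is not the patch itself. In the
patch the sharing is done by defining USE_AS_STPCPY and including the
strcpy source; in this sketch a function-generating macro stands in for
that include, and the names DEFINE_STRING_COPY, sketch_strcpy and
sketch_stpcpy are hypothetical.

```c
/* Hypothetical names throughout; a plain C stand-in for the idea of
   sharing one copy body between strcpy and stpcpy.  */

/* Shared body.  RETURN_END plays the role of USE_AS_STPCPY: it is a
   compile-time constant, so only one return path survives in each
   generated function.  The loop copies bytes up to and including the
   terminating NUL and leaves d pointing at that NUL in dest.  */
#define DEFINE_STRING_COPY(NAME, RETURN_END)    \
  char *NAME (char *dest, const char *src)      \
  {                                             \
    char *d = dest;                             \
    while ((*d = *src++) != '\0')               \
      d++;                                      \
    return (RETURN_END) ? d : dest;             \
  }

/* strcpy flavour: returns the start of dest.  */
DEFINE_STRING_COPY (sketch_strcpy, 0)

/* stpcpy flavour: returns a pointer to the terminating NUL in dest.  */
DEFINE_STRING_COPY (sketch_stpcpy, 1)
```

With this layout strcpy no longer goes through stpcpy, so the strcpy
path carries no extra call, and the copy logic still lives in one place.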

See patch v2:
https://sourceware.org/pipermail/libc-alpha/2023-September/151531.html

On 2023-09-11 17:53, dengjianbo wrote:
> Tested strcpy-lasx against strcpy (calling stpcpy-lasx): the
> difference between the two timings is 0.28, with strcpy-lasx taking
> less time. When the data length is less than 32, it can reduce the
> runtime by more than 30%.
>
> See:
> https://github.com/jiadengx/glibc_test/blob/main/bench/strcpy_lasx_compare_generic_strcpy.out
>
> There is some duplicated code between strcpy and stpcpy, since the
> main part is almost the same. Maybe we can use one source file with
> the macro USE_AS_STPCPY to distinguish strcpy and stpcpy, like x86_64?
> That would avoid the performance degradation.
>
> On 2023-09-08 22:22, Xi Ruoyao wrote:
>> On Fri, 2023-09-08 at 17:33 +0800, dengjianbo wrote:
>>> According to the glibc strcpy microbenchmark results (changed to use
>>> generic_strcpy instead of strlen + memcpy), compared with generic_strcpy,
>>> this implementation reduces the runtime as follows:
>>>
>>> Name              Percent of runtime reduced
>>> strcpy-aligned    10%-45%
>>> strcpy-unaligned  10%-49%; compared with the aligned version, the
>>>                   unaligned version performs better when src and dest
>>>                   cannot both be 8-byte aligned
>>> strcpy-lsx        20%-80%
>>> strcpy-lasx       15%-86%
>> Generic strcpy calls stpcpy, so if we've optimized stpcpy maybe it's not
>> necessary to duplicate everything in strcpy.  Is there a benchmark
>> result comparing the timing with and without this patch, but both with
>> the second patch (optimized stpcpy)?
>>