public inbox for libc-stable@sourceware.org
 help / color / mirror / Atom feed
* Re: [PATCH v2] x86-64: Optimize bzero
       [not found]                                 ` <CAMe9rOot8YEAE1Qvc-LowW-gggfusYzRhcePN4+as1q639dieQ@mail.gmail.com>
@ 2022-05-04  6:35                                   ` Sunil Pandey
  2022-05-04 12:52                                     ` Adhemerval Zanella
  0 siblings, 1 reply; 4+ messages in thread
From: Sunil Pandey @ 2022-05-04  6:35 UTC (permalink / raw)
  To: H.J. Lu, Libc-stable Mailing List; +Cc: Adhemerval Zanella, GNU C Library

On Mon, Feb 14, 2022 at 7:04 AM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Mon, Feb 14, 2022 at 6:07 AM Adhemerval Zanella via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
> >
> >
> >
> > On 14/02/2022 09:41, Noah Goldstein wrote:
> > > On Mon, Feb 14, 2022 at 6:07 AM Adhemerval Zanella
> > > <adhemerval.zanella@linaro.org> wrote:
> > >>
> > >>
> > >>
> > >> On 12/02/2022 20:46, Noah Goldstein wrote:
> > >>> On Fri, Feb 11, 2022 at 7:01 AM Adhemerval Zanella via Libc-alpha
> > >>> <libc-alpha@sourceware.org> wrote:
> > >>>>
> > >>>>
> > >>>>
> > >>>> On 10/02/2022 18:07, Patrick McGehearty via Libc-alpha wrote:
> > >>>>> Just as another point of information, Solaris libc implemented
> > >>>>> bzero by moving arguments around appropriately and then jumping to
> > >>>>> memset. No one noticed enough to file a complaint. Of course,
> > >>>>> short fixed-length bzero was handled with inline stores of zero
> > >>>>> by the compiler. For long vector bzeroing, the overhead was
> > >>>>> negligible.
> > >>>>>
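A minimal C-level sketch of the Solaris approach described above (the name `bzero_compat` is hypothetical; the real implementation shuffled registers and jumped to memset directly):

```c
#include <string.h>

/* Hypothetical sketch: bzero implemented only by rearranging arguments
   and delegating to memset.  A tail call like this typically compiles
   to a register shuffle followed by a plain jump to memset.  */
void bzero_compat(void *s, size_t n)
{
    memset(s, 0, n);   /* value argument pinned to 0 */
}
```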
> > >>>>> When certain Sparc hardware implementations provided faster methods
> > >>>>> for zeroing a cache line at a time on cache line boundaries,
> > >>>>> memset added a single test for zero, taken if and only if the length
> > >>>>> passed to memset was over a threshold that seemed likely to make it
> > >>>>> worthwhile to use the faster method. The principal advantage
> > >>>>> of the fast zeroing operation is that it did not require data
> > >>>>> to move from memory to cache before writing zeros to memory,
> > >>>>> protecting cache locality in the face of large block zeroing.
> > >>>>> I was responsible for much of that optimization effort.
> > >>>>> Whether that optimization was really worth it is open for debate
> > >>>>> for a variety of reasons that I won't go into just now.
> > >>>>
> > >>>> Afaik this is pretty much what optimized memset implementations
> > >>>> do, if the architecture allows it. For instance, aarch64 uses
> > >>>> 'dc zva' for sizes larger than 256 and powerpc uses dcbz with a
> > >>>> similar strategy.
> > >>>>
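The "single test for zero" plus size threshold can be sketched as the following dispatch (the names and the 256-byte cutoff are illustrative, not glibc's or Solaris's actual values):

```c
#include <string.h>

#define BLOCK_ZERO_THRESHOLD 256  /* illustrative cutoff */

/* Hypothetical dispatch sketch: only a zero fill value over a size
   threshold takes the cache-line-zeroing path (aarch64 'dc zva' or
   powerpc 'dcbz' in real implementations); everything else uses the
   ordinary store loop.  Here we model only the control flow.  */
void *memset_dispatch(void *s, int c, size_t n)
{
    if (c == 0 && n >= BLOCK_ZERO_THRESHOLD) {
        /* A real implementation would align to the cache line and zero
           whole lines without first reading them into cache.  */
        return memset(s, 0, n);
    }
    return memset(s, c, n);
}
```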
> > >>>>>
> > >>>>> Apps still used bzero or memset(target,zero,length) according to
> > >>>>> their preferences, but the code was unified under memset.
> > >>>>>
> > >>>>> I am inclined to agree with keeping bzero in the API for
> > >>>>> compatibility with old code/old binaries/old programmers. :-)
> > >>>>
> > >>>> The main driver to remove the bzero internal implementation is that
> > >>>> *currently* gcc just does not generate bzero calls by default
> > >>>> (I couldn't find a single binary that calls bzero on my system).
> > >>>
> > >>> Does it make sense then to add '__memsetzero' so that we can have
> > >>> a function optimized for setting zero?
> > >>
> > >> Will it really be a huge gain, or just a micro-optimization that
> > >> adds a bunch more ifunc variants along with the maintenance cost
> > >> associated with them?
> > > Is there any way it can be set up so that one C implementation can
> > > cover all the arches that want to just leave `__memsetzero` as an
> > > alias to `memset`? I know they have incompatible interfaces, which
> > > makes that hard, but would a weak static inline in string.h work?
> > >
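A header-level fallback along the lines suggested here might look like the following (purely a sketch; `__memsetzero` is the proposed name, and nothing below is actual glibc code):

```c
#include <string.h>

/* Sketch of a generic string.h-style fallback: an architecture without
   a tuned __memsetzero gets an inline that forwards to memset, so
   callers can use one name everywhere at no ABI cost, while targets
   with an optimized version can override it.  */
static inline void *__memsetzero(void *s, size_t n)
{
    return memset(s, 0, n);
}
```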
> > > For some of the shorter control flows (which are generally small sizes
> > > and very hot) we saw reasonable benefits on x86_64.
> > >
> > > The most significant was the EVEX/AVX2 [32, 64] case, where it netted
> > > us ~25% throughput. This is a pretty hot set value, so it may be worth it.
> >
> > With different prototypes and semantics we won't be able to define an
> > alias. What we used to do, but moved away from in recent versions, was
> > to define a static inline function that glues the two functions together
> > when optimization is enabled.
>
> I have
>
> /* NB: bzero returns void and __memsetzero returns void *.  */
> asm (".weak bzero");
> asm ("bzero = __memsetzero");
> asm (".global __bzero");
> asm ("__bzero = __memsetzero");
>
> > >
> > >>
> > >> My understanding is __memsetzero would maybe yield some gain in
> > >> store mask generation (some architectures might have a zero register
> > >> or some instruction to generate one); however, it would need to use
> > >> the same strategy as memset of using architecture-specific
> > >> instructions that optimize cache utilization (dc zva, dcbz).
> > >>
> > >> So it would mostly require a lot of arch-specific code to share
> > >> the memset code with __memsetzero (to avoid increasing code size),
> > >> so I am not sure if this is really a gain in the long term.
> > >
> > > It's worth noting that, between the two, `memset` is the cold function
> > > and `__memsetzero` is the hot one. Based on profiles of GCC11 and
> > > Python 3.7.7, setting zero covers 99%+ of cases.
> >
> > This is very workload specific, and I think with more advanced compiler
> > optimizations like LTO and PGO such calls could most likely be
> > optimized by the compiler itself (either by inlining or by creating a
> > synthetic function to handle them).
> >
> > What I worry about is that such symbols might end up like the AEABI
> > memcpy variants, which were added as a way to optimize when alignment
> > is known to be a multiple of the word size, but ended up not being
> > implemented and also not being generated by the compiler (at least
> > not by gcc).
>
>
>
> --
> H.J.

I would like to backport this patch to release branches.
Any comments or objections?

--Sunil

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] x86-64: Optimize bzero
  2022-05-04  6:35                                   ` [PATCH v2] x86-64: Optimize bzero Sunil Pandey
@ 2022-05-04 12:52                                     ` Adhemerval Zanella
  2022-05-04 14:50                                       ` H.J. Lu
  0 siblings, 1 reply; 4+ messages in thread
From: Adhemerval Zanella @ 2022-05-04 12:52 UTC (permalink / raw)
  To: Sunil Pandey, H.J. Lu, Libc-stable Mailing List; +Cc: GNU C Library



On 04/05/2022 03:35, Sunil Pandey wrote:
> [...]
> 
> I would like to backport this patch to release branches.
> Any comments or objections?

Nothing really against it, but as in previous discussions we had on this mailing
list, optimizing bzero does not yield much gain compared to memset (the compiler
won't generate the libcall for loop transformations, among other shortcomings).
My idea is to follow other architectures and just remove all the x86_64 optimizations.


* Re: [PATCH v2] x86-64: Optimize bzero
  2022-05-04 12:52                                     ` Adhemerval Zanella
@ 2022-05-04 14:50                                       ` H.J. Lu
  2022-05-04 14:54                                         ` Adhemerval Zanella
  0 siblings, 1 reply; 4+ messages in thread
From: H.J. Lu @ 2022-05-04 14:50 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Sunil Pandey, Libc-stable Mailing List, GNU C Library

On Wed, May 4, 2022 at 5:52 AM Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
> > [...]
> > I would like to backport this patch to release branches.
> > Any comments or objections?
>
> Nothing really against it, but as in previous discussions we had on this mailing
> list, optimizing bzero does not yield much gain compared to memset (the compiler
> won't generate the libcall for loop transformations, among other shortcomings).
> My idea is to follow other architectures and just remove all the x86_64 optimizations.

We'd like to reduce the differences between master and the release branches to
make future backports easier.

-- 
H.J.


* Re: [PATCH v2] x86-64: Optimize bzero
  2022-05-04 14:50                                       ` H.J. Lu
@ 2022-05-04 14:54                                         ` Adhemerval Zanella
  0 siblings, 0 replies; 4+ messages in thread
From: Adhemerval Zanella @ 2022-05-04 14:54 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Sunil Pandey, Libc-stable Mailing List, GNU C Library



On 04/05/2022 11:50, H.J. Lu wrote:
> On Wed, May 4, 2022 at 5:52 AM Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>> [...]
>> Nothing really against it, but as in previous discussions we had on this mailing
>> list, optimizing bzero does not yield much gain compared to memset (the compiler
>> won't generate the libcall for loop transformations, among other shortcomings).
>> My idea is to follow other architectures and just remove all the x86_64 optimizations.
> 
> We'd like to reduce the differences between master and the release branches to
> make future backports easier.
> 

Ok, fair enough. 


end of thread, other threads:[~2022-05-04 14:54 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20220208224319.40271-1-hjl.tools@gmail.com>
     [not found] ` <CAFUsyfJDpMcKkGaVB45b0D+qD=wTzCQ1owvy3ZBz=4=h7MiJ=w@mail.gmail.com>
     [not found]   ` <adf90ef5-25fb-aefb-d234-a25212173920@linaro.org>
     [not found]     ` <AS8PR08MB6534A0F2FCCDD5487CAE3F45832F9@AS8PR08MB6534.eurprd08.prod.outlook.com>
     [not found]       ` <c54303a0-d492-a1c7-30cb-e31c63271bf8@linaro.org>
     [not found]         ` <AS8PR08MB65344203AFB6B1D4FBE29941832F9@AS8PR08MB6534.eurprd08.prod.outlook.com>
     [not found]           ` <1f75bda3-9e89-6860-a042-ef0406b072c1@linaro.org>
     [not found]             ` <78cdba88-9e00-798a-846b-f0f77559bfd5@gmail.com>
     [not found]               ` <AS8PR08MB65348371E9789DAC4B5D5E20832F9@AS8PR08MB6534.eurprd08.prod.outlook.com>
     [not found]                 ` <a8513ca6-7ed7-281d-9162-a4dd7a63e9f6@gmail.com>
     [not found]                   ` <c4560acb-8c0d-4062-efc5-39fc87dc2229@linaro.org>
     [not found]                     ` <0efdd4fe-4e35-cf1d-5731-13ed1c046cc6@oracle.com>
     [not found]                       ` <1ea64f9f-6ce8-5409-8b56-02f7481526d9@linaro.org>
     [not found]                         ` <CAFUsyfLLM-3x8-Yve5GiHe5hbpgtFCiS_ptZLRyPOdrmLLExmg@mail.gmail.com>
     [not found]                           ` <ab078f53-3014-6287-9cb1-27316b91f4c0@linaro.org>
     [not found]                             ` <CAFUsyfJbQsVbKMg+Qgc4PanuZpkd6yB084KGKiZiy0pGGVNYXw@mail.gmail.com>
     [not found]                               ` <1f5d5e63-f79b-9fc6-0f35-77d4abed7480@linaro.org>
     [not found]                                 ` <CAMe9rOot8YEAE1Qvc-LowW-gggfusYzRhcePN4+as1q639dieQ@mail.gmail.com>
2022-05-04  6:35                                   ` [PATCH v2] x86-64: Optimize bzero Sunil Pandey
2022-05-04 12:52                                     ` Adhemerval Zanella
2022-05-04 14:50                                       ` H.J. Lu
2022-05-04 14:54                                         ` Adhemerval Zanella

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).