public inbox for libstdc++@gcc.gnu.org
* __lower_bound improvement for arithmetical types
@ 2023-03-09 19:57 Александр Шитов
  2023-03-10 10:26 ` Jonathan Wakely
  0 siblings, 1 reply; 5+ messages in thread
From: Александр Шитов @ 2023-03-09 19:57 UTC
  To: libstdc++

I want to propose an improvement to std::__lower_bound for arithmetic types
with the standard comparators.


The main idea is to use linear search on a small number of elements to aid
the branch predictor and CPU caches, but only when this is not observable by
the user, that is, when a standard comparator (std::less or std::greater)
is used with an arithmetic type.
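
As a rough illustration of the idea (not the code from the gist or the PR), the
sketch below narrows the range with binary search and finishes with a linear
scan. The names hybrid_lower_bound, is_unobservable_compare and
linear_threshold, the pointer-only interface, and the 64-byte threshold are all
assumptions made for this example. The restriction to standard comparators
matters because std::lower_bound is specified to perform at most
log2(last - first) + O(1) comparisons, and a user-supplied comparator could
count them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <type_traits>
#include <vector>

// Assumption: switch to a linear scan once the remaining range is no larger
// than the number of elements that fit in 64 bytes. The threshold used by the
// proposed patch may differ.
template <typename T>
constexpr std::ptrdiff_t linear_threshold()
{
  return sizeof(T) >= 64 ? 1 : static_cast<std::ptrdiff_t>(64 / sizeof(T));
}

// Hypothetical trait: true only for std::less<T> or std::greater<T> applied
// to an arithmetic T, where extra comparisons cannot be observed by the user.
template <typename T, typename Compare>
struct is_unobservable_compare : std::false_type { };

template <typename T>
struct is_unobservable_compare<T, std::less<T>> : std::is_arithmetic<T> { };

template <typename T>
struct is_unobservable_compare<T, std::greater<T>> : std::is_arithmetic<T> { };

template <typename T, typename Compare = std::less<T>>
const T*
hybrid_lower_bound(const T* first, const T* last, const T& value,
                   Compare comp = Compare())
{
  std::ptrdiff_t len = last - first;
  assert(len >= 0);

  // With any other comparator keep halving until the range is empty, which
  // is the classic binary search and performs no extra comparisons.
  const std::ptrdiff_t stop
    = is_unobservable_compare<T, Compare>::value ? linear_threshold<T>() : 0;

  while (len > stop)
    {
      const std::ptrdiff_t half = len / 2;
      const T* mid = first + half;
      if (comp(*mid, value))
        {
          first = mid + 1;
          len -= half + 1;
        }
      else
        len = half;
    }

  // Linear scan over the remaining elements (an empty range when stop == 0).
  for (const T* end = first + len; first != end && comp(*first, value); ++first)
    { }
  return first;
}
```

For a std::vector<int> v, hybrid_lower_bound(v.data(), v.data() + v.size(), x)
should return the same position as std::lower_bound(v.begin(), v.end(), x).
The sketch does not handle the transparent comparators std::less<> and
std::greater<>, which a real implementation would probably want to cover.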


In benchmarks I measured a twofold speedup for small vectors (16 elements)
and a 10-20% speedup for large vectors (1'000 and 100'000 elements).


Code: https://gist.github.com/ATGsan/8a1fdec92371d5778a65b01321c43604

PR: https://github.com/ATGsan/gcc/pull/1

* Re: __lower_bound improvement for arithmetical types
  2023-03-09 19:57 __lower_bound improvement for arithmetical types Александр Шитов
@ 2023-03-10 10:26 ` Jonathan Wakely
       [not found]   ` <CAP=JgFpBesxfVvcnWqE2ZmGt0s-H+4L6F1bBP2NbJiAnziCE7g@mail.gmail.com>
  0 siblings, 1 reply; 5+ messages in thread
From: Jonathan Wakely @ 2023-03-10 10:26 UTC
  To: Александр
	Шитов
  Cc: libstdc++

On Thu, 9 Mar 2023 at 19:58, Александр Шитов via Libstdc++
<libstdc++@gcc.gnu.org> wrote:
>
> I want to propose an improvement to std::__lower_bound for arithmetic types
> with the standard comparators.
>
>
> The main idea is to use linear search on a small number of elements to aid
> the branch predictor and CPU caches, but only when this is not observable by
> the user, that is, when a standard comparator (std::less or std::greater)
> is used with an arithmetic type.
>
>
> In benchmarks I measured a twofold speedup for small vectors (16 elements)
> and a 10-20% speedup for large vectors (1'000 and 100'000 elements).
>
>
> Code: https://gist.github.com/ATGsan/8a1fdec92371d5778a65b01321c43604
>
> PR: https://github.com/ATGsan/gcc/pull/1

This is an interesting idea, thanks.

You limit the linear search to a cacheline's worth of elements, but you
don't ensure that the range being searched doesn't straddle two
cachelines, right? Maybe it doesn't matter in practice, but I wonder if
restricting the linear search to a range that lies within a single
cacheline would be even better.
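
To make the question concrete, a check of this kind could look like the sketch
below. It is only an illustration: the helper range_in_one_cache_line and the
constant kCacheLineSize (assumed to be 64 bytes) are made up for this example
and do not come from the patch. It also relies on pointers converting to
integer addresses in the usual flat way, which is implementation-defined but
holds on mainstream targets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size in bytes; real hardware may differ.
constexpr std::uintptr_t kCacheLineSize = 64;

// Returns true if every byte of [first, last) lies in the same cache line.
template <typename T>
bool
range_in_one_cache_line(const T* first, const T* last)
{
  assert(first <= last);
  if (first == last)
    return true; // an empty range trivially fits

  const std::uintptr_t begin_addr = reinterpret_cast<std::uintptr_t>(first);
  // Address of the last byte that belongs to the range.
  const std::uintptr_t last_byte = reinterpret_cast<std::uintptr_t>(last) - 1;
  return begin_addr / kCacheLineSize == last_byte / kCacheLineSize;
}
```

With a 64-byte-aligned array of 4-byte elements, elements 0 to 15 share one
line, while a 16-element range starting at element 8 straddles two. The check
costs two integer divisions (shifts, for a power-of-two line size) and a
comparison each time it runs.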

* Re: __lower_bound improvement for arithmetical types
       [not found]   ` <CAP=JgFpBesxfVvcnWqE2ZmGt0s-H+4L6F1bBP2NbJiAnziCE7g@mail.gmail.com>
@ 2023-03-29  6:26     ` Александр Шитов
  2023-03-29  8:00       ` Jonathan Wakely
  0 siblings, 1 reply; 5+ messages in thread
From: Александр Шитов @ 2023-03-29  6:26 UTC
  To: Jonathan Wakely, libstdc++

Benchmarks show that adding an extra check to determine whether two
iterators are in the same cache line takes noticeable CPU time. The cost of
these checks outweighs the benefit of searching within a single cache line.
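
The benchmark itself is not included in this message. For anyone who wants to
repeat the comparison, a minimal timing harness might look like the sketch
below. The names time_search and make_sorted_input, the input distribution,
and the use of std::chrono::steady_clock are assumptions for this example, and
a dedicated benchmarking library would give more reliable numbers.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Builds a sorted vector of n pseudo-random ints from a fixed seed.
inline std::vector<int>
make_sorted_input(std::size_t n, unsigned seed)
{
  std::mt19937 gen(seed);
  std::uniform_int_distribution<int> dist(0, 1000000);
  std::vector<int> v(n);
  for (int& x : v)
    x = dist(gen);
  std::sort(v.begin(), v.end());
  return v;
}

// Runs search(first, last, query) for every query and returns the elapsed
// time. The found positions are added to checksum so that the compiler
// cannot discard the searches as dead code.
template <typename Search>
std::chrono::nanoseconds
time_search(const std::vector<int>& sorted, const std::vector<int>& queries,
            Search search, std::uint64_t& checksum)
{
  assert(std::is_sorted(sorted.begin(), sorted.end()));
  const int* first = sorted.data();
  const int* last = first + sorted.size();

  const auto start = std::chrono::steady_clock::now();
  for (int q : queries)
    checksum += static_cast<std::uint64_t>(search(first, last, q) - first);
  const auto stop = std::chrono::steady_clock::now();

  return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
}
```

Each candidate (std::lower_bound, the linear-scan variant, and the variant
with the cache line check) would be passed as the search callable on the same
inputs, and matching checksums confirm that they return the same positions.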


Given these results, I'd rather stick with the previously proposed version.
Do I need to change the code in any way so that it can be merged into
libstdc++?

* Re: __lower_bound improvement for arithmetical types
  2023-03-29  6:26     ` Александр Шитов
@ 2023-03-29  8:00       ` Jonathan Wakely
  2023-05-13 12:02         ` Александр Шитов
  0 siblings, 1 reply; 5+ messages in thread
From: Jonathan Wakely @ 2023-03-29  8:00 UTC
  To: Александр
	Шитов
  Cc: libstdc++

On Wed, 29 Mar 2023, 07:26 Александр Шитов, <alex.shitov1237@gmail.com>
wrote:

> Benchmarks show that adding an extra check to determine whether two
> iterators are in the same cache line takes noticeable CPU time. The cost of
> these checks outweighs the benefit of searching within a single cache line.
>

OK, thanks for checking.



> Given these results, I'd rather stick with the previously proposed version.
> Do I need to change the code in any way so that it can be merged into
> libstdc++?
>

It will have to wait until after the GCC 13 release, so I'll review it
properly in a few weeks.

It looks like we can probably get it merged though, thanks for the
contribution.

* Re: __lower_bound improvement for arithmetical types
  2023-03-29  8:00       ` Jonathan Wakely
@ 2023-05-13 12:02         ` Александр Шитов
  0 siblings, 0 replies; 5+ messages in thread
From: Александр Шитов @ 2023-05-13 12:02 UTC
  To: Jonathan Wakely; +Cc: libstdc++

Hello!

This is a reminder about my patch.
