* Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR c++/83938)
[not found] <CALnYPH-iONCpF595=MpWP=w_+4xGqU11=Br4Yh=g=wdf2Ken3A@mail.gmail.com>
@ 2018-01-19 20:05 ` Jason Merrill
2018-01-25 22:37 ` chang jc
0 siblings, 1 reply; 12+ messages in thread
From: Jason Merrill @ 2018-01-19 20:05 UTC (permalink / raw)
To: chang jc; +Cc: gcc-patches List, libstdc++
This is a libstdc++ bug and patch, not the C++ front end. So I'm
adding the libstdc++ list to CC.
On Fri, Jan 19, 2018 at 3:02 AM, chang jc <r97922153@gmail.com> wrote:
> Current std::inplace_merge() suffers from performance issue by inefficient
> logic under limited memory,
>
> It leads to performance downgrade.
>
> Please help to review it.
>
> Index: include/bits/stl_algo.h
> ===================================================================
> --- include/bits/stl_algo.h (revision 256871)
> +++ include/bits/stl_algo.h (working copy)
> @@ -2437,7 +2437,7 @@
> _BidirectionalIterator __second_cut = __middle;
> _Distance __len11 = 0;
> _Distance __len22 = 0;
> - if (__len1 > __len2)
> + if (__len1 < __len2)
> {
> __len11 = __len1 / 2;
> std::advance(__first_cut, __len11);
> @@ -2539,9 +2539,15 @@
> const _DistanceType __len1 = std::distance(__first, __middle);
> const _DistanceType __len2 = std::distance(__middle, __last);
>
> +
> typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> _TmpBuf;
> - _TmpBuf __buf(__first, __last);
> -
> + _BidirectionalIterator __start, __end;
> + if (__len1 < __len2) {
> + __start = __first; __end = __middle;
> + } else {
> + __start = __middle; __end = __last;
> + }
> + _TmpBuf __buf(__start, ___end);
> if (__buf.begin() == 0)
> std::__merge_without_buffer
> (__first, __middle, __last, __len1, __len2, __comp);
> Index: include/bits/stl_tempbuf.h
> ===================================================================
> --- include/bits/stl_tempbuf.h (revision 256871)
> +++ include/bits/stl_tempbuf.h (working copy)
> @@ -95,7 +95,7 @@
> std::nothrow));
> if (__tmp != 0)
> return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
> - __len /= 2;
> + __len = (__len + 1) / 2;
> }
> return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
> }
>
>
>
>
> Thanks
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR c++/83938)
2018-01-19 20:05 ` [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR c++/83938) Jason Merrill
@ 2018-01-25 22:37 ` chang jc
2018-01-30 21:42 ` François Dumont
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: chang jc @ 2018-01-25 22:37 UTC (permalink / raw)
To: Jason Merrill; +Cc: gcc-patches List, libstdc++
Hi:
1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
__len = (__len == 1) ? 0 : ((__len + 1) / 2);
2. The coding gain has been shown PR c++/83938; I re-post here
21
20
19
18
17
16
0.471136
0.625695
0.767262
0.907461
1.04838
1.19508
0.340845
0.48651
0.639139
0.770133
0.898454
1.04632
it means when Merge [0, 4325376, 16777216); A is a sorted integer with
4325376 & B with 12451840 elements, total with 16M integers
The proposed method has the speed up under given buffer size =, ex
2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
given sizof(int)*64K bytes.
3. As your suggestion, _TmpBuf __buf should be rewrite.
4. It represents a fact that the intuitive idea to split from larger
part is wrong.
For example, if you have an input sorted array A & B, A has 8 integers
& B has 24 integers. Given tmp buffer whose capacity = 4 integers.
Current it tries to split from B, right?
Then we have:
A1 | A2 B1 | B2
B1 & B2 has 12 integers each, right?
Current algorithm selects pivot as 13th integer from B, right?
If the corresponding upper bound of A is 6th integer.
Then it split in
A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12
After rotate, we have two arrays to merge
[A1 = 5 | B1 = 12] & [A2 = 3 | B2 = 12]
Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.
Sadly, [A1 = 5 | B1 = 12] CANNOT.
So we do rotate again, split & merge the two split arrays from [A1 = 5
| B1 = 12] again.
But wait, if we always split from the smaller one instead of larger one.
After rotate, it promises two split arrays both contain ceiling[small/2].
Since tmp buffer also allocate by starting from sizeof(small) &
recursively downgrade by ceiling[small/2^(# of fail allocate)].
It means the allocated tmp buffer promises to be sufficient at the
level of (# of fail allocate).
Instead, you can see if split from large at level (# of fail allocate)
several split array still CANNOT use tmp buffer to do buffered merge.
As you know, buffered merge is far speed then (split, rotate, and
merge two sub-arrays) (PR c++/83938 gives the profiling figures),
the way should provide speedup.
Thanks.
On 24/01/2018 18:23, François Dumont wrote:
Hi
It sounds like a very sensitive change to make but nothing worth figures.
Do you have any bench showing the importance of the gain ?
At least the memory usage optimization is obvious.
On 19/01/2018 10:43, chang jc wrote:
Current std::inplace_merge() suffers from performance issue by inefficient
logic under limited memory,
It leads to performance downgrade.
Please help to review it.
Index: include/bits/stl_algo.h
===================================================================
--- include/bits/stl_algo.h (revision 256871)
+++ include/bits/stl_algo.h (working copy)
@@ -2437,7 +2437,7 @@
_BidirectionalIterator __second_cut = __middle;
_Distance __len11 = 0;
_Distance __len22 = 0;
- if (__len1 > __len2)
+ if (__len1 < __len2)
{
__len11 = __len1 / 2;
std::advance(__first_cut, __len11);
@@ -2539,9 +2539,15 @@
const _DistanceType __len1 = std::distance(__first, __middle);
const _DistanceType __len2 = std::distance(__middle, __last);
+
typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
_TmpBuf;
- _TmpBuf __buf(__first, __last);
-
+ _BidirectionalIterator __start, __end;
+ if (__len1 < __len2) {
+ __start = __first; __end = __middle;
+ } else {
+ __start = __middle; __end = __last;
+ }
+ _TmpBuf __buf(__start, ___end);
Note another optimization, we could introduce a _Temporary_buffer<> constructor
in order to write:
_TmpBuf __buf(std::min(__len1, __len2), __first);
So that std::distance do not need to be called again.
if (__buf.begin() == 0)
std::__merge_without_buffer
(__first, __middle, __last, __len1, __len2, __comp);
Index: include/bits/stl_tempbuf.h
===================================================================
--- include/bits/stl_tempbuf.h (revision 256871)
+++ include/bits/stl_tempbuf.h (working copy)
@@ -95,7 +95,7 @@
std::nothrow));
if (__tmp != 0)
return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
- __len /= 2;
+ __len = (__len + 1) / 2;
This part is problematic, if __len is 1 and allocation fails it will loop
forever.
It doesn't seem really necessary for your patch.
2018-01-20 4:05 GMT+08:00 Jason Merrill <jason@redhat.com>:
> This is a libstdc++ bug and patch, not the C++ front end. So I'm
> adding the libstdc++ list to CC.
>
> On Fri, Jan 19, 2018 at 3:02 AM, chang jc <r97922153@gmail.com> wrote:
> > Current std::inplace_merge() suffers from performance issue by
> inefficient
> > logic under limited memory,
> >
> > It leads to performance downgrade.
> >
> > Please help to review it.
> >
> > Index: include/bits/stl_algo.h
> > ===================================================================
> > --- include/bits/stl_algo.h (revision 256871)
> > +++ include/bits/stl_algo.h (working copy)
> > @@ -2437,7 +2437,7 @@
> > _BidirectionalIterator __second_cut = __middle;
> > _Distance __len11 = 0;
> > _Distance __len22 = 0;
> > - if (__len1 > __len2)
> > + if (__len1 < __len2)
> > {
> > __len11 = __len1 / 2;
> > std::advance(__first_cut, __len11);
> > @@ -2539,9 +2539,15 @@
> > const _DistanceType __len1 = std::distance(__first, __middle);
> > const _DistanceType __len2 = std::distance(__middle, __last);
> >
> > +
> > typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
> _TmpBuf;
> > - _TmpBuf __buf(__first, __last);
> > -
> > + _BidirectionalIterator __start, __end;
> > + if (__len1 < __len2) {
> > + __start = __first; __end = __middle;
> > + } else {
> > + __start = __middle; __end = __last;
> > + }
> > + _TmpBuf __buf(__start, ___end);
> > if (__buf.begin() == 0)
> > std::__merge_without_buffer
> > (__first, __middle, __last, __len1, __len2, __comp);
> > Index: include/bits/stl_tempbuf.h
> > ===================================================================
> > --- include/bits/stl_tempbuf.h (revision 256871)
> > +++ include/bits/stl_tempbuf.h (working copy)
> > @@ -95,7 +95,7 @@
> > std::nothrow));
> > if (__tmp != 0)
> > return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
> > - __len /= 2;
> > + __len = (__len + 1) / 2;
> > }
> > return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
> > }
> >
> >
> >
> >
> > Thanks
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR c++/83938)
2018-01-25 22:37 ` chang jc
@ 2018-01-30 21:42 ` François Dumont
2018-02-02 17:38 ` François Dumont
2018-06-06 16:39 ` François Dumont
2 siblings, 0 replies; 12+ messages in thread
From: François Dumont @ 2018-01-30 21:42 UTC (permalink / raw)
To: libstdc++, gcc-patches
Your change looks good, don't hesitate to re-submit a proper patch with
the point 1 below fix. You should add gcc-patches@gcc.gnu.org when you
do so.
Remaining questions will be:
Copyright assignment, your change seems limited enough to avoid it but
you will need an official maintainer confirmation.
Stage 1: We are in a Fix-Only development phase. Your change brings such
an important improvement that it could be considered as a bug fix but
once again an official maintainer will have to decide. Otherwise you
will have to wait for a couple of month to see you patch committed.
François
On 25/01/2018 23:37, chang jc wrote:
> Hi:
>
> 1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
> __len = (__len == 1) ? 0 : ((__len + 1) / 2);
>
> 2. The coding gain has been shown PR c++/83938; I re-post here
>
>
>
>
> 21
> 20
> 19
> 18
> 17
> 16
>
>
> 0.471136
> 0.625695
> 0.767262
> 0.907461
> 1.04838
> 1.19508
>
>
> 0.340845
> 0.48651
> 0.639139
> 0.770133
> 0.898454
> 1.04632
>
> it means when Merge [0, 4325376, 16777216); A is a sorted integer with
> 4325376 & B with 12451840 elements, total with 16M integers
>
> The proposed method has the speed up under given buffer size =, ex
> 2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
> given sizof(int)*64K bytes.
>
> 3. As your suggestion, _TmpBuf __buf should be rewrite.
>
> 4. It represents a fact that the intuitive idea to split from larger
> part is wrong.
>
> For example, if you have an input sorted array A & B, A has 8 integers
> & B has 24 integers. Given tmp buffer whose capacity = 4 integers.
>
> Current it tries to split from B, right?
>
> Then we have:
>
> A1 | A2 B1 | B2
>
> B1 & B2 has 12 integers each, right?
>
> Current algorithm selects pivot as 13th integer from B, right?
>
> If the corresponding upper bound of A is 6th integer.
>
> Then it split in
>
> A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12
>
> After rotate, we have two arrays to merge
>
> [A1 = 5 | B1 = 12] & [A2 = 3 | B2 = 12]
>
> Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.
>
> Sadly, [A1 = 5 | B1 = 12] CANNOT.
>
> So we do rotate again, split & merge the two split arrays from [A1 = 5
> | B1 = 12] again.
>
>
> But wait, if we always split from the smaller one instead of larger one.
>
> After rotate, it promises two split arrays both contain ceiling[small/2].
>
> Since tmp buffer also allocate by starting from sizeof(small) &
> recursively downgrade by ceiling[small/2^(# of fail allocate)].
>
> It means the allocated tmp buffer promises to be sufficient at the
> level of (# of fail allocate).
>
> Instead, you can see if split from large at level (# of fail allocate)
> several split array still CANNOT use tmp buffer to do buffered merge.
>
>
> As you know, buffered merge is far speed then (split, rotate, and
> merge two sub-arrays) (PR c++/83938 gives the profiling figures),
>
> the way should provide speedup.
>
>
> Thanks.
>
>
>
>
>
>
>
>
>
>
> On 24/01/2018 18:23, François Dumont wrote:
>
> Hi
>
>
> It sounds like a very sensitive change to make but nothing worth figures.
> Do you have any bench showing the importance of the gain ?
>
> At least the memory usage optimization is obvious.
>
> On 19/01/2018 10:43, chang jc wrote:
>
> Current std::inplace_merge() suffers from performance issue by inefficient
>
> logic under limited memory,
>
> It leads to performance downgrade.
>
> Please help to review it.
>
> Index: include/bits/stl_algo.h
> ===================================================================
> --- include/bits/stl_algo.h (revision 256871)
> +++ include/bits/stl_algo.h (working copy)
> @@ -2437,7 +2437,7 @@
> _BidirectionalIterator __second_cut = __middle;
> _Distance __len11 = 0;
> _Distance __len22 = 0;
> - if (__len1 > __len2)
> + if (__len1 < __len2)
> {
> __len11 = __len1 / 2;
> std::advance(__first_cut, __len11);
> @@ -2539,9 +2539,15 @@
> const _DistanceType __len1 = std::distance(__first, __middle);
> const _DistanceType __len2 = std::distance(__middle, __last);
>
> +
>
> typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
> _TmpBuf;
>
> - _TmpBuf __buf(__first, __last);
> -
> + _BidirectionalIterator __start, __end;
> + if (__len1 < __len2) {
> + __start = __first; __end = __middle;
> + } else {
> + __start = __middle; __end = __last;
> + }
> + _TmpBuf __buf(__start, ___end);
>
> Note another optimization, we could introduce a _Temporary_buffer<> constructor
> in order to write:
>
> _TmpBuf __buf(std::min(__len1, __len2), __first);
>
> So that std::distance do not need to be called again.
>
> if (__buf.begin() == 0)
> std::__merge_without_buffer
> (__first, __middle, __last, __len1, __len2, __comp);
> Index: include/bits/stl_tempbuf.h
> ===================================================================
> --- include/bits/stl_tempbuf.h (revision 256871)
> +++ include/bits/stl_tempbuf.h (working copy)
> @@ -95,7 +95,7 @@
> std::nothrow));
> if (__tmp != 0)
> return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
> - __len /= 2;
> + __len = (__len + 1) / 2;
>
> This part is problematic, if __len is 1 and allocation fails it will loop
> forever.
>
> It doesn't seem really necessary for your patch.
>
>
> 2018-01-20 4:05 GMT+08:00 Jason Merrill <jason@redhat.com>:
>
>> This is a libstdc++ bug and patch, not the C++ front end. So I'm
>> adding the libstdc++ list to CC.
>>
>> On Fri, Jan 19, 2018 at 3:02 AM, chang jc <r97922153@gmail.com> wrote:
>>> Current std::inplace_merge() suffers from performance issue by
>> inefficient
>>> logic under limited memory,
>>>
>>> It leads to performance downgrade.
>>>
>>> Please help to review it.
>>>
>>> Index: include/bits/stl_algo.h
>>> ===================================================================
>>> --- include/bits/stl_algo.h (revision 256871)
>>> +++ include/bits/stl_algo.h (working copy)
>>> @@ -2437,7 +2437,7 @@
>>> _BidirectionalIterator __second_cut = __middle;
>>> _Distance __len11 = 0;
>>> _Distance __len22 = 0;
>>> - if (__len1 > __len2)
>>> + if (__len1 < __len2)
>>> {
>>> __len11 = __len1 / 2;
>>> std::advance(__first_cut, __len11);
>>> @@ -2539,9 +2539,15 @@
>>> const _DistanceType __len1 = std::distance(__first, __middle);
>>> const _DistanceType __len2 = std::distance(__middle, __last);
>>>
>>> +
>>> typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
>> _TmpBuf;
>>> - _TmpBuf __buf(__first, __last);
>>> -
>>> + _BidirectionalIterator __start, __end;
>>> + if (__len1 < __len2) {
>>> + __start = __first; __end = __middle;
>>> + } else {
>>> + __start = __middle; __end = __last;
>>> + }
>>> + _TmpBuf __buf(__start, ___end);
>>> if (__buf.begin() == 0)
>>> std::__merge_without_buffer
>>> (__first, __middle, __last, __len1, __len2, __comp);
>>> Index: include/bits/stl_tempbuf.h
>>> ===================================================================
>>> --- include/bits/stl_tempbuf.h (revision 256871)
>>> +++ include/bits/stl_tempbuf.h (working copy)
>>> @@ -95,7 +95,7 @@
>>> std::nothrow));
>>> if (__tmp != 0)
>>> return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>> - __len /= 2;
>>> + __len = (__len + 1) / 2;
>>> }
>>> return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
>>> }
>>>
>>>
>>>
>>>
>>> Thanks
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR c++/83938)
2018-01-25 22:37 ` chang jc
2018-01-30 21:42 ` François Dumont
@ 2018-02-02 17:38 ` François Dumont
2018-06-06 16:39 ` François Dumont
2 siblings, 0 replies; 12+ messages in thread
From: François Dumont @ 2018-02-02 17:38 UTC (permalink / raw)
To: chang jc, Jason Merrill; +Cc: libstdc++
Hi
   If I read you well the main advantage of your approach is that it
limits the number of calls to __rotate_adaptive. So did you try to use
the __buffer_size as the __len11 size when buffer is too small rather
than __len1 / 2 ? That is to say:
    if (__len1 < __len2)
      {
-Â Â Â Â Â Â Â Â __len11 = __len1 / 2;
+Â Â Â Â Â Â Â Â __len11 = __buffer_size;
   I wonder what your bench is showing with this approach.
    Maybe you can post your bench on bugzilla too, we could use it to
create a performance test.
   I also wonder why when heap allocation fails we don't use the stack
? Is that forbidden in our algo implementations ?
François
On 25/01/2018 23:37, chang jc wrote:
> Hi:
>
> 1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
> __len = (__len == 1) ? 0 : ((__len + 1) / 2);
>
> 2. The coding gain has been shown PR c++/83938; I re-post here
>
>
>
>
> 21
> 20
> 19
> 18
> 17
> 16
>
>
> 0.471136
> 0.625695
> 0.767262
> 0.907461
> 1.04838
> 1.19508
>
>
> 0.340845
> 0.48651
> 0.639139
> 0.770133
> 0.898454
> 1.04632
>
> it means when Merge [0, 4325376, 16777216); A is a sorted integer with
> 4325376 & B with 12451840 elements, total with 16M integers
>
> The proposed method has the speed up under given buffer size =, ex
> 2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
> given sizof(int)*64K bytes.
>
> 3. As your suggestion, _TmpBuf __buf should be rewrite.
>
> 4. It represents a fact that the intuitive idea to split from larger
> part is wrong.
>
> For example, if you have an input sorted array A & B, A has 8 integers
> & B has 24 integers. Given tmp buffer whose capacity = 4 integers.
>
> Current it tries to split from B, right?
>
> Then we have:
>
> A1 | A2 B1 | B2
>
> B1 & B2 has 12 integers each, right?
>
> Current algorithm selects pivot as 13th integer from B, right?
>
> If the corresponding upper bound of A is 6th integer.
>
> Then it split in
>
> A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12
>
> After rotate, we have two arrays to merge
>
> [A1 = 5 | B1 = 12] & [A2 = 3 | B2 = 12]
>
> Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.
>
> Sadly, [A1 = 5 | B1 = 12] CANNOT.
>
> So we do rotate again, split & merge the two split arrays from [A1 = 5
> | B1 = 12] again.
>
>
> But wait, if we always split from the smaller one instead of larger one.
>
> After rotate, it promises two split arrays both contain ceiling[small/2].
>
> Since tmp buffer also allocate by starting from sizeof(small) &
> recursively downgrade by ceiling[small/2^(# of fail allocate)].
>
> It means the allocated tmp buffer promises to be sufficient at the
> level of (# of fail allocate).
>
> Instead, you can see if split from large at level (# of fail allocate)
> several split array still CANNOT use tmp buffer to do buffered merge.
>
>
> As you know, buffered merge is far speed then (split, rotate, and
> merge two sub-arrays) (PR c++/83938 gives the profiling figures),
>
> the way should provide speedup.
>
>
> Thanks.
>
>
>
>
>
>
>
>
>
>
> On 24/01/2018 18:23, François Dumont wrote:
>
> Hi
>
>
> It sounds like a very sensitive change to make but nothing worth figures.
> Do you have any bench showing the importance of the gain ?
>
> At least the memory usage optimization is obvious.
>
> On 19/01/2018 10:43, chang jc wrote:
>
> Current std::inplace_merge() suffers from performance issue by inefficient
>
> logic under limited memory,
>
> It leads to performance downgrade.
>
> Please help to review it.
>
> Index: include/bits/stl_algo.h
> ===================================================================
> --- include/bits/stl_algo.h (revision 256871)
> +++ include/bits/stl_algo.h (working copy)
> @@ -2437,7 +2437,7 @@
> _BidirectionalIterator __second_cut = __middle;
> _Distance __len11 = 0;
> _Distance __len22 = 0;
> - if (__len1 > __len2)
> + if (__len1 < __len2)
> {
> __len11 = __len1 / 2;
> std::advance(__first_cut, __len11);
> @@ -2539,9 +2539,15 @@
> const _DistanceType __len1 = std::distance(__first, __middle);
> const _DistanceType __len2 = std::distance(__middle, __last);
>
> +
>
> typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
> _TmpBuf;
>
> - _TmpBuf __buf(__first, __last);
> -
> + _BidirectionalIterator __start, __end;
> + if (__len1 < __len2) {
> + __start = __first; __end = __middle;
> + } else {
> + __start = __middle; __end = __last;
> + }
> + _TmpBuf __buf(__start, ___end);
>
> Note another optimization, we could introduce a _Temporary_buffer<> constructor
> in order to write:
>
> _TmpBuf __buf(std::min(__len1, __len2), __first);
>
> So that std::distance do not need to be called again.
>
> if (__buf.begin() == 0)
> std::__merge_without_buffer
> (__first, __middle, __last, __len1, __len2, __comp);
> Index: include/bits/stl_tempbuf.h
> ===================================================================
> --- include/bits/stl_tempbuf.h (revision 256871)
> +++ include/bits/stl_tempbuf.h (working copy)
> @@ -95,7 +95,7 @@
> std::nothrow));
> if (__tmp != 0)
> return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
> - __len /= 2;
> + __len = (__len + 1) / 2;
>
> This part is problematic, if __len is 1 and allocation fails it will loop
> forever.
>
> It doesn't seem really necessary for your patch.
>
>
> 2018-01-20 4:05 GMT+08:00 Jason Merrill <jason@redhat.com>:
>
>> This is a libstdc++ bug and patch, not the C++ front end. So I'm
>> adding the libstdc++ list to CC.
>>
>> On Fri, Jan 19, 2018 at 3:02 AM, chang jc <r97922153@gmail.com> wrote:
>>> Current std::inplace_merge() suffers from performance issue by
>> inefficient
>>> logic under limited memory,
>>>
>>> It leads to performance downgrade.
>>>
>>> Please help to review it.
>>>
>>> Index: include/bits/stl_algo.h
>>> ===================================================================
>>> --- include/bits/stl_algo.h (revision 256871)
>>> +++ include/bits/stl_algo.h (working copy)
>>> @@ -2437,7 +2437,7 @@
>>> _BidirectionalIterator __second_cut = __middle;
>>> _Distance __len11 = 0;
>>> _Distance __len22 = 0;
>>> - if (__len1 > __len2)
>>> + if (__len1 < __len2)
>>> {
>>> __len11 = __len1 / 2;
>>> std::advance(__first_cut, __len11);
>>> @@ -2539,9 +2539,15 @@
>>> const _DistanceType __len1 = std::distance(__first, __middle);
>>> const _DistanceType __len2 = std::distance(__middle, __last);
>>>
>>> +
>>> typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
>> _TmpBuf;
>>> - _TmpBuf __buf(__first, __last);
>>> -
>>> + _BidirectionalIterator __start, __end;
>>> + if (__len1 < __len2) {
>>> + __start = __first; __end = __middle;
>>> + } else {
>>> + __start = __middle; __end = __last;
>>> + }
>>> + _TmpBuf __buf(__start, ___end);
>>> if (__buf.begin() == 0)
>>> std::__merge_without_buffer
>>> (__first, __middle, __last, __len1, __len2, __comp);
>>> Index: include/bits/stl_tempbuf.h
>>> ===================================================================
>>> --- include/bits/stl_tempbuf.h (revision 256871)
>>> +++ include/bits/stl_tempbuf.h (working copy)
>>> @@ -95,7 +95,7 @@
>>> std::nothrow));
>>> if (__tmp != 0)
>>> return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>> - __len /= 2;
>>> + __len = (__len + 1) / 2;
>>> }
>>> return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
>>> }
>>>
>>>
>>>
>>>
>>> Thanks
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR c++/83938)
2018-01-25 22:37 ` chang jc
2018-01-30 21:42 ` François Dumont
2018-02-02 17:38 ` François Dumont
@ 2018-06-06 16:39 ` François Dumont
2018-07-24 10:22 ` [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938) François Dumont
2 siblings, 1 reply; 12+ messages in thread
From: François Dumont @ 2018-06-06 16:39 UTC (permalink / raw)
To: libstdc++, gcc-patches
[-- Attachment #1: Type: text/plain, Size: 8429 bytes --]
Hi
   I review and rework this proposal. I noticed that the same idea to
limit buffer size within inplace_merge also apply to stable_sort.
   I also change the decision when buffer is too small to consider the
buffer size rather than going through successive cuts of the original
ranges. This way we won't cut the range more than necessary. Note that
if you prefer this second part could be committed seperatly.
   PR libstdc++/83938
   * include/bits/stl_algo.h:
   (__stable_partition_adaptive): When buffer to small cut range using
   buffer size.
   (__inplace_merge): Take temporary buffer length from smallest range.
   (__merge_adaptive): When buffer too small consider smallest range first
   and cut based on buffer size.
   (__stable_sort_adaptive): When buffer too small cut based on buffer
   size.
   (__stable_sort): Limit temporary buffer length.
   * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
   to reduce requested buffer length on allocation failure.
Tested under Linux x86_64.
Ok to commit ?
François
On 25/01/2018 23:37, chang jc wrote:
> Hi:
>
> 1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
> __len = (__len == 1) ? 0 : ((__len + 1) / 2);
>
> 2. The coding gain has been shown PR c++/83938; I re-post here
>
>
>
>
> 21
> 20
> 19
> 18
> 17
> 16
>
>
> 0.471136
> 0.625695
> 0.767262
> 0.907461
> 1.04838
> 1.19508
>
>
> 0.340845
> 0.48651
> 0.639139
> 0.770133
> 0.898454
> 1.04632
>
> it means when Merge [0, 4325376, 16777216); A is a sorted integer with
> 4325376 & B with 12451840 elements, total with 16M integers
>
> The proposed method has the speed up under given buffer size =, ex
> 2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
> given sizof(int)*64K bytes.
>
> 3. As your suggestion, _TmpBuf __buf should be rewrite.
>
> 4. It represents a fact that the intuitive idea to split from larger
> part is wrong.
>
> For example, if you have an input sorted array A & B, A has 8 integers
> & B has 24 integers. Given tmp buffer whose capacity = 4 integers.
>
> Current it tries to split from B, right?
>
> Then we have:
>
> A1 | A2 B1 | B2
>
> B1 & B2 has 12 integers each, right?
>
> Current algorithm selects pivot as 13th integer from B, right?
>
> If the corresponding upper bound of A is 6th integer.
>
> Then it split in
>
> A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12
>
> After rotate, we have two arrays to merge
>
> [A1 = 5 | B1 = 12] & [A2 = 3 | B2 = 12]
>
> Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.
>
> Sadly, [A1 = 5 | B1 = 12] CANNOT.
>
> So we do rotate again, split & merge the two split arrays from [A1 = 5
> | B1 = 12] again.
>
>
> But wait, if we always split from the smaller one instead of larger one.
>
> After rotate, it promises two split arrays both contain ceiling[small/2].
>
> Since tmp buffer also allocate by starting from sizeof(small) &
> recursively downgrade by ceiling[small/2^(# of fail allocate)].
>
> It means the allocated tmp buffer promises to be sufficient at the
> level of (# of fail allocate).
>
> Instead, you can see if split from large at level (# of fail allocate)
> several split array still CANNOT use tmp buffer to do buffered merge.
>
>
> As you know, buffered merge is far speed then (split, rotate, and
> merge two sub-arrays) (PR c++/83938 gives the profiling figures),
>
> the way should provide speedup.
>
>
> Thanks.
>
>
>
>
>
>
>
>
>
>
> On 24/01/2018 18:23, François Dumont wrote:
>
> Hi
>
>
> It sounds like a very sensitive change to make but nothing worth figures.
> Do you have any bench showing the importance of the gain ?
>
> At least the memory usage optimization is obvious.
>
> On 19/01/2018 10:43, chang jc wrote:
>
> Current std::inplace_merge() suffers from performance issue by inefficient
>
> logic under limited memory,
>
> It leads to performance downgrade.
>
> Please help to review it.
>
> Index: include/bits/stl_algo.h
> ===================================================================
> --- include/bits/stl_algo.h (revision 256871)
> +++ include/bits/stl_algo.h (working copy)
> @@ -2437,7 +2437,7 @@
> _BidirectionalIterator __second_cut = __middle;
> _Distance __len11 = 0;
> _Distance __len22 = 0;
> - if (__len1 > __len2)
> + if (__len1 < __len2)
> {
> __len11 = __len1 / 2;
> std::advance(__first_cut, __len11);
> @@ -2539,9 +2539,15 @@
> const _DistanceType __len1 = std::distance(__first, __middle);
> const _DistanceType __len2 = std::distance(__middle, __last);
>
> +
>
> typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
> _TmpBuf;
>
> - _TmpBuf __buf(__first, __last);
> -
> + _BidirectionalIterator __start, __end;
> + if (__len1 < __len2) {
> + __start = __first; __end = __middle;
> + } else {
> + __start = __middle; __end = __last;
> + }
> + _TmpBuf __buf(__start, ___end);
>
> Note another optimization, we could introduce a _Temporary_buffer<> constructor
> in order to write:
>
> _TmpBuf __buf(std::min(__len1, __len2), __first);
>
> So that std::distance do not need to be called again.
>
> if (__buf.begin() == 0)
> std::__merge_without_buffer
> (__first, __middle, __last, __len1, __len2, __comp);
> Index: include/bits/stl_tempbuf.h
> ===================================================================
> --- include/bits/stl_tempbuf.h (revision 256871)
> +++ include/bits/stl_tempbuf.h (working copy)
> @@ -95,7 +95,7 @@
> std::nothrow));
> if (__tmp != 0)
> return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
> - __len /= 2;
> + __len = (__len + 1) / 2;
>
> This part is problematic, if __len is 1 and allocation fails it will loop
> forever.
>
> It doesn't seem really necessary for your patch.
>
>
> 2018-01-20 4:05 GMT+08:00 Jason Merrill<jason@redhat.com>:
>
>> This is a libstdc++ bug and patch, not the C++ front end. So I'm
>> adding the libstdc++ list to CC.
>>
>> On Fri, Jan 19, 2018 at 3:02 AM, chang jc<r97922153@gmail.com> wrote:
>>> Current std::inplace_merge() suffers from performance issue by
>> inefficient
>>> logic under limited memory,
>>>
>>> It leads to performance downgrade.
>>>
>>> Please help to review it.
>>>
>>> Index: include/bits/stl_algo.h
>>> ===================================================================
>>> --- include/bits/stl_algo.h (revision 256871)
>>> +++ include/bits/stl_algo.h (working copy)
>>> @@ -2437,7 +2437,7 @@
>>> _BidirectionalIterator __second_cut = __middle;
>>> _Distance __len11 = 0;
>>> _Distance __len22 = 0;
>>> - if (__len1 > __len2)
>>> + if (__len1 < __len2)
>>> {
>>> __len11 = __len1 / 2;
>>> std::advance(__first_cut, __len11);
>>> @@ -2539,9 +2539,15 @@
>>> const _DistanceType __len1 = std::distance(__first, __middle);
>>> const _DistanceType __len2 = std::distance(__middle, __last);
>>>
>>> +
>>> typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
>> _TmpBuf;
>>> - _TmpBuf __buf(__first, __last);
>>> -
>>> + _BidirectionalIterator __start, __end;
>>> + if (__len1 < __len2) {
>>> + __start = __first; __end = __middle;
>>> + } else {
>>> + __start = __middle; __end = __last;
>>> + }
>>> + _TmpBuf __buf(__start, ___end);
>>> if (__buf.begin() == 0)
>>> std::__merge_without_buffer
>>> (__first, __middle, __last, __len1, __len2, __comp);
>>> Index: include/bits/stl_tempbuf.h
>>> ===================================================================
>>> --- include/bits/stl_tempbuf.h (revision 256871)
>>> +++ include/bits/stl_tempbuf.h (working copy)
>>> @@ -95,7 +95,7 @@
>>> std::nothrow));
>>> if (__tmp != 0)
>>> return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>> - __len /= 2;
>>> + __len = (__len + 1) / 2;
>>> }
>>> return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
>>> }
>>>
>>>
>>>
>>>
>>> Thanks
[-- Attachment #2: pr83938.patch --]
[-- Type: text/x-patch, Size: 4684 bytes --]
diff --git a/libstdc++-v3/include/bits/stl_algo.h b/libstdc++-v3/include/bits/stl_algo.h
index d8488a7..bf5b25f 100644
--- a/libstdc++-v3/include/bits/stl_algo.h
+++ b/libstdc++-v3/include/bits/stl_algo.h
@@ -1577,15 +1577,15 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
}
_ForwardIterator __middle = __first;
- std::advance(__middle, __len / 2);
+ std::advance(__middle, __buffer_size);
_ForwardIterator __left_split =
- std::__stable_partition_adaptive(__first, __middle, __pred,
- __len / 2, __buffer,
- __buffer_size);
+ std::__stable_partition_adaptive(__first, __middle,
+ __pred, __buffer_size,
+ __buffer, __buffer_size);
// Advance past true-predicate values to satisfy this
// function's preconditions.
- _Distance __right_len = __len - __len / 2;
+ _Distance __right_len = __len - __buffer_size;
_ForwardIterator __right_split =
std::__find_if_not_n(__middle, __right_len, __pred);
@@ -2432,9 +2432,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
_BidirectionalIterator __second_cut = __middle;
_Distance __len11 = 0;
_Distance __len22 = 0;
- if (__len1 > __len2)
+ if (__len1 < __len2)
{
- __len11 = __len1 / 2;
+ __len11 = __buffer_size;
std::advance(__first_cut, __len11);
__second_cut
= std::__lower_bound(__middle, __last, *__first_cut,
@@ -2443,7 +2443,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
}
else
{
- __len22 = __len2 / 2;
+ __len22 = __buffer_size;
std::advance(__second_cut, __len22);
__first_cut
= std::__upper_bound(__first, __middle, *__second_cut,
@@ -2453,14 +2453,14 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
_BidirectionalIterator __new_middle
= std::__rotate_adaptive(__first_cut, __middle, __second_cut,
- __len1 - __len11, __len22, __buffer,
- __buffer_size);
- std::__merge_adaptive(__first, __first_cut, __new_middle, __len11,
- __len22, __buffer, __buffer_size, __comp);
+ __len1 - __len11, __len22,
+ __buffer, __buffer_size);
+ std::__merge_adaptive(__first, __first_cut, __new_middle,
+ __len11, __len22,
+ __buffer, __buffer_size, __comp);
std::__merge_adaptive(__new_middle, __second_cut, __last,
- __len1 - __len11,
- __len2 - __len22, __buffer,
- __buffer_size, __comp);
+ __len1 - __len11, __len2 - __len22,
+ __buffer, __buffer_size, __comp);
}
}
@@ -2535,7 +2535,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
const _DistanceType __len2 = std::distance(__middle, __last);
typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> _TmpBuf;
- _TmpBuf __buf(__first, __len1 + __len2);
+ _TmpBuf __buf(__first, std::min(__len1, __len2));
if (__buf.begin() == 0)
std::__merge_without_buffer
@@ -2730,9 +2730,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
_Compare __comp)
{
const _Distance __len = (__last - __first + 1) / 2;
- const _RandomAccessIterator __middle = __first + __len;
+ _RandomAccessIterator __middle;
if (__len > __buffer_size)
{
+ __middle = __first + __buffer_size + __buffer_size;
std::__stable_sort_adaptive(__first, __middle, __buffer,
__buffer_size, __comp);
std::__stable_sort_adaptive(__middle, __last, __buffer,
@@ -2740,9 +2741,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
}
else
{
+ __middle = __first + __len;
std::__merge_sort_with_buffer(__first, __middle, __buffer, __comp);
std::__merge_sort_with_buffer(__middle, __last, __buffer, __comp);
}
+
std::__merge_adaptive(__first, __middle, __last,
_Distance(__middle - __first),
_Distance(__last - __middle),
@@ -4992,8 +4995,11 @@ _GLIBCXX_BEGIN_NAMESPACE_ALGO
typedef typename iterator_traits<_RandomAccessIterator>::difference_type
_DistanceType;
+ if (__first == __last)
+ return;
+
typedef _Temporary_buffer<_RandomAccessIterator, _ValueType> _TmpBuf;
- _TmpBuf __buf(__first, std::distance(__first, __last));
+ _TmpBuf __buf(__first, (__last - __first + 1) / 2);
if (__buf.begin() == 0)
std::__inplace_stable_sort(__first, __last, __comp);
diff --git a/libstdc++-v3/include/bits/stl_tempbuf.h b/libstdc++-v3/include/bits/stl_tempbuf.h
index 159ee27..7fb97d0 100644
--- a/libstdc++-v3/include/bits/stl_tempbuf.h
+++ b/libstdc++-v3/include/bits/stl_tempbuf.h
@@ -95,7 +95,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
std::nothrow));
if (__tmp != 0)
return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
- __len /= 2;
+ __len = __len == 1 ? 0 : ((__len + 1) / 2);
}
return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
}
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)
2018-06-06 16:39 ` François Dumont
@ 2018-07-24 10:22 ` François Dumont
2018-08-21 20:34 ` François Dumont
0 siblings, 1 reply; 12+ messages in thread
From: François Dumont @ 2018-07-24 10:22 UTC (permalink / raw)
To: libstdc++, gcc-patches
Hi
   Any chance to review this patch ?
François
On 06/06/2018 18:39, François Dumont wrote:
> Hi
>
> Â Â Â I review and rework this proposal. I noticed that the same idea to
> limit buffer size within inplace_merge also apply to stable_sort.
>
> Â Â Â I also change the decision when buffer is too small to consider
> the buffer size rather than going through successive cuts of the
> original ranges. This way we won't cut the range more than necessary.
> Note that if you prefer this second part could be committed seperatly.
>
> Â Â Â PR libstdc++/83938
> Â Â Â * include/bits/stl_algo.h:
> Â Â Â (__stable_partition_adaptive): When buffer to small cut range using
> Â Â Â buffer size.
> Â Â Â (__inplace_merge): Take temporary buffer length from smallest range.
> Â Â Â (__merge_adaptive): When buffer too small consider smallest range
> first
> Â Â Â and cut based on buffer size.
> Â Â Â (__stable_sort_adaptive): When buffer too small cut based on buffer
> Â Â Â size.
> Â Â Â (__stable_sort): Limit temporary buffer length.
> Â Â Â * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
> Â Â Â to reduce requested buffer length on allocation failure.
>
> Tested under Linux x86_64.
>
> Ok to commit ?
>
> François
>
>
> On 25/01/2018 23:37, chang jc wrote:
>> Hi:
>>
>> 1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
>> __len = (__len == 1) ? 0 : ((__len + 1) / 2);
>>
>> 2. The coding gain has been shown PR c++/83938; I re-post here
>>
>>
>>
>>
>> Â Â 21
>> Â Â 20
>> Â Â 19
>> Â Â 18
>> Â Â 17
>> Â Â 16
>>
>>
>> Â Â 0.471136
>> Â Â 0.625695
>> Â Â 0.767262
>> Â Â 0.907461
>> Â Â 1.04838
>> Â Â 1.19508
>>
>>
>> Â Â 0.340845
>> Â Â 0.48651
>> Â Â 0.639139
>> Â Â 0.770133
>> Â Â 0.898454
>> Â Â 1.04632
>>
>> it means when Merge [0, 4325376, 16777216); A is a sorted integer with
>> 4325376 & B with 12451840 elements, total with 16M integers
>>
>> The proposed method has the speed up under given buffer size =, ex
>> 2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
>> given sizof(int)*64K bytes.
>>
>> 3. As your suggestion, _TmpBuf __buf should be rewrite.
>>
>> 4. It represents a fact that the intuitive idea to split from larger
>> part is wrong.
>>
>> For example, if you have an input sorted array A & B, A has 8 integers
>> & B has 24 integers. Given tmp buffer whose capacity = 4 integers.
>>
>> Current it tries to split from B, right?
>>
>> Then we have:
>>
>> A1 | A2Â B1 | B2
>>
>> B1 & B2 has 12 integers each, right?
>>
>> Current algorithm selects pivot as 13th integer from B, right?
>>
>> If the corresponding upper bound of A is 6th integer.
>>
>> Then it split in
>>
>> A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12
>>
>> After rotate, we have two arrays to merge
>>
>> [A1 = 5 | B1 = 12]Â & [A2 = 3 | B2 = 12]
>>
>> Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.
>>
>> Sadly, [A1 = 5 | B1 = 12] CANNOT.
>>
>> So we do rotate again, split & merge the two split arrays from [A1 = 5
>> | B1 = 12] again.
>>
>>
>> But wait, if we always split from the smaller one instead of larger one.
>>
>> After rotate, it promises two split arrays both contain
>> ceiling[small/2].
>>
>> Since tmp buffer also allocate by starting from sizeof(small) &
>> recursively downgrade by ceiling[small/2^(# of fail allocate)].
>>
>> It means the allocated tmp buffer promises to be sufficient at the
>> level of (# of fail allocate).
>>
>> Instead, you can see if split from large at level (# of fail allocate)
>> several split array still CANNOT use tmp buffer to do buffered merge.
>>
>>
>> As you know, buffered merge is far speed then (split, rotate, and
>> merge two sub-arrays) (PR c++/83938 gives the profiling figures),
>>
>> the way should provide speedup.
>>
>>
>> Thanks.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On 24/01/2018 18:23, François Dumont wrote:
>>
>> Hi
>>
>>
>> Â Â Â Â It sounds like a very sensitive change to make but nothing worth
>> figures.
>> Do you have any bench showing the importance of the gain ?
>>
>> Â Â Â Â At least the memory usage optimization is obvious.
>>
>> On 19/01/2018 10:43, chang jc wrote:
>>
>> Current std::inplace_merge() suffers from performance issue by
>> inefficient
>>
>> logic under limited memory,
>>
>> It leads to performance downgrade.
>>
>> Please help to review it.
>>
>> Index: include/bits/stl_algo.h
>> ===================================================================
>> --- include/bits/stl_algo.h   (revision 256871)
>> +++ include/bits/stl_algo.h   (working copy)
>> @@ -2437,7 +2437,7 @@
>> Â Â Â Â Â Â Â Â _BidirectionalIterator __second_cut = __middle;
>> Â Â Â Â Â Â Â Â _Distance __len11 = 0;
>> Â Â Â Â Â Â Â Â _Distance __len22 = 0;
>> -Â Â Â Â Â if (__len1 > __len2)
>> +Â Â Â Â Â if (__len1 < __len2)
>> Â Â Â Â Â Â Â Â Â Â {
>> Â Â Â Â Â Â Â Â Â Â Â Â __len11 = __len1 / 2;
>> Â Â Â Â Â Â Â Â Â Â Â Â std::advance(__first_cut, __len11);
>> @@ -2539,9 +2539,15 @@
>> Â Â Â Â Â Â Â Â const _DistanceType __len1 = std::distance(__first, __middle);
>> Â Â Â Â Â Â Â Â const _DistanceType __len2 = std::distance(__middle, __last);
>>
>> +
>>
>> Â Â Â Â Â Â Â Â typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
>> _TmpBuf;
>>
>> -Â Â Â Â Â _TmpBuf __buf(__first, __last);
>> -
>> +Â Â Â Â Â _BidirectionalIterator __start, __end;
>> +Â Â Â Â Â if (__len1 < __len2) {
>> +Â Â Â __start = __first; __end = __middle;
>> +Â Â Â Â Â } else {
>> +Â Â Â __start = __middle; __end = __last;
>> +Â Â Â Â Â }
>> +Â Â Â Â Â _TmpBuf __buf(__start, ___end);
>>
>> Note another optimization, we could introduce a _Temporary_buffer<>
>> constructor
>> in order to write:
>>
>> _TmpBuf __buf(std::min(__len1, __len2), __first);
>>
>> So that std::distance do not need to be called again.
>>
>> Â Â Â Â Â Â Â Â if (__buf.begin() == 0)
>> Â Â Â Â Â Â std::__merge_without_buffer
>> Â Â Â Â Â Â Â Â (__first, __middle, __last, __len1, __len2, __comp);
>> Index: include/bits/stl_tempbuf.h
>> ===================================================================
>> --- include/bits/stl_tempbuf.h   (revision 256871)
>> +++ include/bits/stl_tempbuf.h   (working copy)
>> @@ -95,7 +95,7 @@
>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â std::nothrow));
>> Â Â Â Â Â Â Â Â if (__tmp != 0)
>> Â Â Â Â Â Â Â Â Â Â return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>> -Â Â Â Â Â __len /= 2;
>> +Â Â Â Â Â __len = (__len + 1) / 2;
>>
>> This part is problematic, if __len is 1 and allocation fails it will
>> loop
>> forever.
>>
>> It doesn't seem really necessary for your patch.
>>
>>
>> 2018-01-20 4:05 GMT+08:00 Jason Merrill<jason@redhat.com>:
>>
>>> This is a libstdc++ bug and patch, not the C++ front end. So I'm
>>> adding the libstdc++ list to CC.
>>>
>>> On Fri, Jan 19, 2018 at 3:02 AM, chang jc<r97922153@gmail.com>Â wrote:
>>>> Current std::inplace_merge() suffers from performance issue by
>>> inefficient
>>>> logic under limited memory,
>>>>
>>>> It leads to performance downgrade.
>>>>
>>>> Please help to review it.
>>>>
>>>> Index: include/bits/stl_algo.h
>>>> ===================================================================
>>>> --- include/bits/stl_algo.h    (revision 256871)
>>>> +++ include/bits/stl_algo.h    (working copy)
>>>> @@ -2437,7 +2437,7 @@
>>>> Â Â Â Â Â Â Â Â Â Â _BidirectionalIterator __second_cut = __middle;
>>>> Â Â Â Â Â Â Â Â Â Â _Distance __len11 = 0;
>>>> Â Â Â Â Â Â Â Â Â Â _Distance __len22 = 0;
>>>> -Â Â Â Â Â Â Â Â if (__len1 > __len2)
>>>> +Â Â Â Â Â Â Â Â if (__len1 < __len2)
>>>> Â Â Â Â Â Â Â Â Â Â Â Â {
>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â __len11 = __len1 / 2;
>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â std::advance(__first_cut, __len11);
>>>> @@ -2539,9 +2539,15 @@
>>>> Â Â Â Â Â Â Â const _DistanceType __len1 = std::distance(__first, __middle);
>>>> Â Â Â Â Â Â Â const _DistanceType __len2 = std::distance(__middle, __last);
>>>>
>>>> +
>>>> Â Â Â Â Â Â Â typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
>>> _TmpBuf;
>>>> -Â Â Â Â Â _TmpBuf __buf(__first, __last);
>>>> -
>>>> +Â Â Â Â Â _BidirectionalIterator __start, __end;
>>>> +Â Â Â Â Â if (__len1 < __len2) {
>>>> +Â Â Â Â Â Â __start = __first; __end = __middle;
>>>> +Â Â Â Â Â } else {
>>>> +Â Â Â Â Â Â __start = __middle; __end = __last;
>>>> +Â Â Â Â Â }
>>>> +Â Â Â Â Â _TmpBuf __buf(__start, ___end);
>>>> Â Â Â Â Â Â Â if (__buf.begin() == 0)
>>>> Â Â Â Â Â Â Â Â std::__merge_without_buffer
>>>> Â Â Â Â Â Â Â Â Â Â (__first, __middle, __last, __len1, __len2, __comp);
>>>> Index: include/bits/stl_tempbuf.h
>>>> ===================================================================
>>>> --- include/bits/stl_tempbuf.h (revision 256871)
>>>> +++ include/bits/stl_tempbuf.h (working copy)
>>>> @@ -95,7 +95,7 @@
>>>> std::nothrow));
>>>> Â Â Â Â Â Â Â Â Â Â if (__tmp != 0)
>>>> Â Â Â Â Â Â Â Â Â Â Â Â return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>>> -Â Â Â Â Â Â Â Â __len /= 2;
>>>> +Â Â Â Â Â Â Â Â __len = (__len + 1) / 2;
>>>> Â Â Â Â Â Â Â Â }
>>>> Â Â Â Â Â Â Â return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
>>>> Â Â Â Â Â }
>>>>
>>>>
>>>>
>>>>
>>>> Thanks
>
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)
2018-07-24 10:22 ` [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938) François Dumont
@ 2018-08-21 20:34 ` François Dumont
2018-10-29 6:07 ` François Dumont
0 siblings, 1 reply; 12+ messages in thread
From: François Dumont @ 2018-08-21 20:34 UTC (permalink / raw)
To: libstdc++, gcc-patches
[-- Attachment #1: Type: text/plain, Size: 10875 bytes --]
I missed a test that was failing because of this patch. So I revert a
small part of it and here is the new proposal.
Tested under Linux x86_64, ok to commit ?
François
On 24/07/2018 12:22, François Dumont wrote:
> Hi
>
> Â Â Â Any chance to review this patch ?
>
> François
>
>
> On 06/06/2018 18:39, François Dumont wrote:
>> Hi
>>
>> Â Â Â I review and rework this proposal. I noticed that the same idea
>> to limit buffer size within inplace_merge also apply to stable_sort.
>>
>> Â Â Â I also change the decision when buffer is too small to consider
>> the buffer size rather than going through successive cuts of the
>> original ranges. This way we won't cut the range more than necessary.
>> Note that if you prefer this second part could be committed seperatly.
>>
>> Â Â Â PR libstdc++/83938
>> Â Â Â * include/bits/stl_algo.h:
>> Â Â Â (__stable_partition_adaptive): When buffer to small cut range using
>> Â Â Â buffer size.
>> Â Â Â (__inplace_merge): Take temporary buffer length from smallest range.
>> Â Â Â (__merge_adaptive): When buffer too small consider smallest range
>> first
>> Â Â Â and cut based on buffer size.
>> Â Â Â (__stable_sort_adaptive): When buffer too small cut based on buffer
>> Â Â Â size.
>> Â Â Â (__stable_sort): Limit temporary buffer length.
>> Â Â Â * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
>> Â Â Â to reduce requested buffer length on allocation failure.
>>
>> Tested under Linux x86_64.
>>
>> Ok to commit ?
>>
>> François
>>
>>
>> On 25/01/2018 23:37, chang jc wrote:
>>> Hi:
>>>
>>> 1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
>>> __len = (__len == 1) ? 0 : ((__len + 1) / 2);
>>>
>>> 2. The coding gain has been shown PR c++/83938; I re-post here
>>>
>>>
>>>
>>>
>>> Â Â 21
>>> Â Â 20
>>> Â Â 19
>>> Â Â 18
>>> Â Â 17
>>> Â Â 16
>>>
>>>
>>> Â Â 0.471136
>>> Â Â 0.625695
>>> Â Â 0.767262
>>> Â Â 0.907461
>>> Â Â 1.04838
>>> Â Â 1.19508
>>>
>>>
>>> Â Â 0.340845
>>> Â Â 0.48651
>>> Â Â 0.639139
>>> Â Â 0.770133
>>> Â Â 0.898454
>>> Â Â 1.04632
>>>
>>> it means when Merge [0, 4325376, 16777216); A is a sorted integer with
>>> 4325376 & B with 12451840 elements, total with 16M integers
>>>
>>> The proposed method has the speed up under given buffer size =, ex
>>> 2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
>>> given sizof(int)*64K bytes.
>>>
>>> 3. As your suggestion, _TmpBuf __buf should be rewrite.
>>>
>>> 4. It represents a fact that the intuitive idea to split from larger
>>> part is wrong.
>>>
>>> For example, if you have an input sorted array A & B, A has 8 integers
>>> & B has 24 integers. Given tmp buffer whose capacity = 4 integers.
>>>
>>> Current it tries to split from B, right?
>>>
>>> Then we have:
>>>
>>> A1 | A2Â B1 | B2
>>>
>>> B1 & B2 has 12 integers each, right?
>>>
>>> Current algorithm selects pivot as 13th integer from B, right?
>>>
>>> If the corresponding upper bound of A is 6th integer.
>>>
>>> Then it split in
>>>
>>> A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12
>>>
>>> After rotate, we have two arrays to merge
>>>
>>> [A1 = 5 | B1 = 12]Â & [A2 = 3 | B2 = 12]
>>>
>>> Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.
>>>
>>> Sadly, [A1 = 5 | B1 = 12] CANNOT.
>>>
>>> So we do rotate again, split & merge the two split arrays from [A1 = 5
>>> | B1 = 12] again.
>>>
>>>
>>> But wait, if we always split from the smaller one instead of larger
>>> one.
>>>
>>> After rotate, it promises two split arrays both contain
>>> ceiling[small/2].
>>>
>>> Since tmp buffer also allocate by starting from sizeof(small) &
>>> recursively downgrade by ceiling[small/2^(# of fail allocate)].
>>>
>>> It means the allocated tmp buffer promises to be sufficient at the
>>> level of (# of fail allocate).
>>>
>>> Instead, you can see if split from large at level (# of fail allocate)
>>> several split array still CANNOT use tmp buffer to do buffered merge.
>>>
>>>
>>> As you know, buffered merge is far speed then (split, rotate, and
>>> merge two sub-arrays) (PR c++/83938 gives the profiling figures),
>>>
>>> the way should provide speedup.
>>>
>>>
>>> Thanks.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 24/01/2018 18:23, François Dumont wrote:
>>>
>>> Hi
>>>
>>>
>>> Â Â Â Â It sounds like a very sensitive change to make but nothing
>>> worth figures.
>>> Do you have any bench showing the importance of the gain ?
>>>
>>> Â Â Â Â At least the memory usage optimization is obvious.
>>>
>>> On 19/01/2018 10:43, chang jc wrote:
>>>
>>> Current std::inplace_merge() suffers from performance issue by
>>> inefficient
>>>
>>> logic under limited memory,
>>>
>>> It leads to performance downgrade.
>>>
>>> Please help to review it.
>>>
>>> Index: include/bits/stl_algo.h
>>> ===================================================================
>>> --- include/bits/stl_algo.h   (revision 256871)
>>> +++ include/bits/stl_algo.h   (working copy)
>>> @@ -2437,7 +2437,7 @@
>>> Â Â Â Â Â Â Â Â _BidirectionalIterator __second_cut = __middle;
>>> Â Â Â Â Â Â Â Â _Distance __len11 = 0;
>>> Â Â Â Â Â Â Â Â _Distance __len22 = 0;
>>> -Â Â Â Â Â if (__len1 > __len2)
>>> +Â Â Â Â Â if (__len1 < __len2)
>>> Â Â Â Â Â Â Â Â Â Â {
>>> Â Â Â Â Â Â Â Â Â Â Â Â __len11 = __len1 / 2;
>>> Â Â Â Â Â Â Â Â Â Â Â Â std::advance(__first_cut, __len11);
>>> @@ -2539,9 +2539,15 @@
>>> Â Â Â Â Â Â Â Â const _DistanceType __len1 = std::distance(__first, __middle);
>>> Â Â Â Â Â Â Â Â const _DistanceType __len2 = std::distance(__middle, __last);
>>>
>>> +
>>>
>>> Â Â Â Â Â Â Â Â typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
>>> _TmpBuf;
>>>
>>> -Â Â Â Â Â _TmpBuf __buf(__first, __last);
>>> -
>>> +Â Â Â Â Â _BidirectionalIterator __start, __end;
>>> +Â Â Â Â Â if (__len1 < __len2) {
>>> +Â Â Â __start = __first; __end = __middle;
>>> +Â Â Â Â Â } else {
>>> +Â Â Â __start = __middle; __end = __last;
>>> +Â Â Â Â Â }
>>> +Â Â Â Â Â _TmpBuf __buf(__start, ___end);
>>>
>>> Note another optimization, we could introduce a _Temporary_buffer<>
>>> constructor
>>> in order to write:
>>>
>>> _TmpBuf __buf(std::min(__len1, __len2), __first);
>>>
>>> So that std::distance do not need to be called again.
>>>
>>> Â Â Â Â Â Â Â Â if (__buf.begin() == 0)
>>> Â Â Â Â Â Â std::__merge_without_buffer
>>> Â Â Â Â Â Â Â Â (__first, __middle, __last, __len1, __len2, __comp);
>>> Index: include/bits/stl_tempbuf.h
>>> ===================================================================
>>> --- include/bits/stl_tempbuf.h   (revision 256871)
>>> +++ include/bits/stl_tempbuf.h   (working copy)
>>> @@ -95,7 +95,7 @@
>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â std::nothrow));
>>> Â Â Â Â Â Â Â Â if (__tmp != 0)
>>> Â Â Â Â Â Â Â Â Â Â return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>> -Â Â Â Â Â __len /= 2;
>>> +Â Â Â Â Â __len = (__len + 1) / 2;
>>>
>>> This part is problematic, if __len is 1 and allocation fails it will
>>> loop
>>> forever.
>>>
>>> It doesn't seem really necessary for your patch.
>>>
>>>
>>> 2018-01-20 4:05 GMT+08:00 Jason Merrill<jason@redhat.com>:
>>>
>>>> This is a libstdc++ bug and patch, not the C++ front end. So I'm
>>>> adding the libstdc++ list to CC.
>>>>
>>>> On Fri, Jan 19, 2018 at 3:02 AM, chang jc<r97922153@gmail.com>Â wrote:
>>>>> Current std::inplace_merge() suffers from performance issue by
>>>> inefficient
>>>>> logic under limited memory,
>>>>>
>>>>> It leads to performance downgrade.
>>>>>
>>>>> Please help to review it.
>>>>>
>>>>> Index: include/bits/stl_algo.h
>>>>> ===================================================================
>>>>> --- include/bits/stl_algo.h    (revision 256871)
>>>>> +++ include/bits/stl_algo.h    (working copy)
>>>>> @@ -2437,7 +2437,7 @@
>>>>> Â Â Â Â Â Â Â Â Â Â _BidirectionalIterator __second_cut = __middle;
>>>>> Â Â Â Â Â Â Â Â Â Â _Distance __len11 = 0;
>>>>> Â Â Â Â Â Â Â Â Â Â _Distance __len22 = 0;
>>>>> -Â Â Â Â Â Â Â Â if (__len1 > __len2)
>>>>> +Â Â Â Â Â Â Â Â if (__len1 < __len2)
>>>>> Â Â Â Â Â Â Â Â Â Â Â Â {
>>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â __len11 = __len1 / 2;
>>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â std::advance(__first_cut, __len11);
>>>>> @@ -2539,9 +2539,15 @@
>>>>> Â Â Â Â Â Â Â const _DistanceType __len1 = std::distance(__first,
>>>>> __middle);
>>>>> Â Â Â Â Â Â Â const _DistanceType __len2 = std::distance(__middle, __last);
>>>>>
>>>>> +
>>>>> Â Â Â Â Â Â Â typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
>>>> _TmpBuf;
>>>>> -Â Â Â Â Â _TmpBuf __buf(__first, __last);
>>>>> -
>>>>> +Â Â Â Â Â _BidirectionalIterator __start, __end;
>>>>> +Â Â Â Â Â if (__len1 < __len2) {
>>>>> +Â Â Â Â Â Â __start = __first; __end = __middle;
>>>>> +Â Â Â Â Â } else {
>>>>> +Â Â Â Â Â Â __start = __middle; __end = __last;
>>>>> +Â Â Â Â Â }
>>>>> +Â Â Â Â Â _TmpBuf __buf(__start, ___end);
>>>>> Â Â Â Â Â Â Â if (__buf.begin() == 0)
>>>>> Â Â Â Â Â Â Â Â std::__merge_without_buffer
>>>>> Â Â Â Â Â Â Â Â Â Â (__first, __middle, __last, __len1, __len2, __comp);
>>>>> Index: include/bits/stl_tempbuf.h
>>>>> ===================================================================
>>>>> --- include/bits/stl_tempbuf.h (revision 256871)
>>>>> +++ include/bits/stl_tempbuf.h (working copy)
>>>>> @@ -95,7 +95,7 @@
>>>>> std::nothrow));
>>>>> Â Â Â Â Â Â Â Â Â Â if (__tmp != 0)
>>>>> Â Â Â Â Â Â Â Â Â Â Â Â return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>>>> -Â Â Â Â Â Â Â Â __len /= 2;
>>>>> +Â Â Â Â Â Â Â Â __len = (__len + 1) / 2;
>>>>> Â Â Â Â Â Â Â Â }
>>>>> Â Â Â Â Â Â Â return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
>>>>> Â Â Â Â Â }
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thanks
>>
>>
>
[-- Attachment #2: pr83938.patch --]
[-- Type: text/x-patch, Size: 3921 bytes --]
diff --git a/libstdc++-v3/include/bits/stl_algo.h b/libstdc++-v3/include/bits/stl_algo.h
index 04e152a..eb58efe 100644
--- a/libstdc++-v3/include/bits/stl_algo.h
+++ b/libstdc++-v3/include/bits/stl_algo.h
@@ -2416,9 +2416,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
_BidirectionalIterator __second_cut = __middle;
_Distance __len11 = 0;
_Distance __len22 = 0;
- if (__len1 > __len2)
+ if (__len1 < __len2)
{
- __len11 = __len1 / 2;
+ __len11 = __buffer_size;
std::advance(__first_cut, __len11);
__second_cut
= std::__lower_bound(__middle, __last, *__first_cut,
@@ -2427,7 +2427,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
}
else
{
- __len22 = __len2 / 2;
+ __len22 = __buffer_size;
std::advance(__second_cut, __len22);
__first_cut
= std::__upper_bound(__first, __middle, *__second_cut,
@@ -2437,14 +2437,14 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
_BidirectionalIterator __new_middle
= std::__rotate_adaptive(__first_cut, __middle, __second_cut,
- __len1 - __len11, __len22, __buffer,
- __buffer_size);
- std::__merge_adaptive(__first, __first_cut, __new_middle, __len11,
- __len22, __buffer, __buffer_size, __comp);
+ __len1 - __len11, __len22,
+ __buffer, __buffer_size);
+ std::__merge_adaptive(__first, __first_cut, __new_middle,
+ __len11, __len22,
+ __buffer, __buffer_size, __comp);
std::__merge_adaptive(__new_middle, __second_cut, __last,
- __len1 - __len11,
- __len2 - __len22, __buffer,
- __buffer_size, __comp);
+ __len1 - __len11, __len2 - __len22,
+ __buffer, __buffer_size, __comp);
}
}
@@ -2518,7 +2518,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
const _DistanceType __len2 = std::distance(__middle, __last);
typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> _TmpBuf;
- _TmpBuf __buf(__first, __len1 + __len2);
+ _TmpBuf __buf(__first, std::min(__len1, __len2));
if (__buf.begin() == 0)
std::__merge_without_buffer
@@ -2713,9 +2713,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
_Compare __comp)
{
const _Distance __len = (__last - __first + 1) / 2;
- const _RandomAccessIterator __middle = __first + __len;
+ _RandomAccessIterator __middle;
if (__len > __buffer_size)
{
+ __middle = __first + __buffer_size + __buffer_size;
std::__stable_sort_adaptive(__first, __middle, __buffer,
__buffer_size, __comp);
std::__stable_sort_adaptive(__middle, __last, __buffer,
@@ -2723,9 +2724,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
}
else
{
+ __middle = __first + __len;
std::__merge_sort_with_buffer(__first, __middle, __buffer, __comp);
std::__merge_sort_with_buffer(__middle, __last, __buffer, __comp);
}
+
std::__merge_adaptive(__first, __middle, __last,
_Distance(__middle - __first),
_Distance(__last - __middle),
@@ -4975,8 +4978,11 @@ _GLIBCXX_BEGIN_NAMESPACE_ALGO
typedef typename iterator_traits<_RandomAccessIterator>::difference_type
_DistanceType;
+ if (__first == __last)
+ return;
+
typedef _Temporary_buffer<_RandomAccessIterator, _ValueType> _TmpBuf;
- _TmpBuf __buf(__first, std::distance(__first, __last));
+ _TmpBuf __buf(__first, (__last - __first + 1) / 2);
if (__buf.begin() == 0)
std::__inplace_stable_sort(__first, __last, __comp);
diff --git a/libstdc++-v3/include/bits/stl_tempbuf.h b/libstdc++-v3/include/bits/stl_tempbuf.h
index 0abd3c12..e177272 100644
--- a/libstdc++-v3/include/bits/stl_tempbuf.h
+++ b/libstdc++-v3/include/bits/stl_tempbuf.h
@@ -95,7 +95,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
std::nothrow));
if (__tmp != 0)
return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
- __len /= 2;
+ __len = __len == 1 ? 0 : ((__len + 1) / 2);
}
return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
}
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)
2018-08-21 20:34 ` François Dumont
@ 2018-10-29 6:07 ` François Dumont
2018-12-21 20:57 ` Jonathan Wakely
0 siblings, 1 reply; 12+ messages in thread
From: François Dumont @ 2018-10-29 6:07 UTC (permalink / raw)
To: libstdc++, gcc-patches
Hi
   Some feedback regarding this patch ?
Thanks,
François
On 8/21/18 10:34 PM, François Dumont wrote:
> I missed a test that was failing because of this patch. So I revert a
> small part of it and here is the new proposal.
>
> Tested under Linux x86_64, ok to commit ?
>
> François
>
>
> On 24/07/2018 12:22, François Dumont wrote:
>> Hi
>>
>> Â Â Â Any chance to review this patch ?
>>
>> François
>>
>>
>> On 06/06/2018 18:39, François Dumont wrote:
>>> Hi
>>>
>>> Â Â Â I review and rework this proposal. I noticed that the same idea
>>> to limit buffer size within inplace_merge also apply to stable_sort.
>>>
>>> Â Â Â I also change the decision when buffer is too small to consider
>>> the buffer size rather than going through successive cuts of the
>>> original ranges. This way we won't cut the range more than
>>> necessary. Note that if you prefer this second part could be
>>> committed seperatly.
>>>
>>> Â Â Â PR libstdc++/83938
>>> Â Â Â * include/bits/stl_algo.h:
>>> Â Â Â (__stable_partition_adaptive): When buffer to small cut range using
>>> Â Â Â buffer size.
>>> Â Â Â (__inplace_merge): Take temporary buffer length from smallest
>>> range.
>>> Â Â Â (__merge_adaptive): When buffer too small consider smallest
>>> range first
>>> Â Â Â and cut based on buffer size.
>>> Â Â Â (__stable_sort_adaptive): When buffer too small cut based on buffer
>>> Â Â Â size.
>>> Â Â Â (__stable_sort): Limit temporary buffer length.
>>> Â Â Â * include/bits/stl_tempbuf.h (get_temporary_buffer): Change
>>> function
>>> Â Â Â to reduce requested buffer length on allocation failure.
>>>
>>> Tested under Linux x86_64.
>>>
>>> Ok to commit ?
>>>
>>> François
>>>
>>>
>>> On 25/01/2018 23:37, chang jc wrote:
>>>> Hi:
>>>>
>>>> 1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
>>>> __len = (__len == 1) ? 0 : ((__len + 1) / 2);
>>>>
>>>> 2. The coding gain has been shown PR c++/83938; I re-post here
>>>>
>>>>
>>>>
>>>>
>>>> Â Â 21
>>>> Â Â 20
>>>> Â Â 19
>>>> Â Â 18
>>>> Â Â 17
>>>> Â Â 16
>>>>
>>>>
>>>> Â Â 0.471136
>>>> Â Â 0.625695
>>>> Â Â 0.767262
>>>> Â Â 0.907461
>>>> Â Â 1.04838
>>>> Â Â 1.19508
>>>>
>>>>
>>>> Â Â 0.340845
>>>> Â Â 0.48651
>>>> Â Â 0.639139
>>>> Â Â 0.770133
>>>> Â Â 0.898454
>>>> Â Â 1.04632
>>>>
>>>> it means when Merge [0, 4325376, 16777216); A is a sorted integer with
>>>> 4325376 & B with 12451840 elements, total with 16M integers
>>>>
>>>> The proposed method has the speed up under given buffer size =, ex
>>>> 2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
>>>> given sizof(int)*64K bytes.
>>>>
>>>> 3. As your suggestion, _TmpBuf __buf should be rewrite.
>>>>
>>>> 4. It represents a fact that the intuitive idea to split from larger
>>>> part is wrong.
>>>>
>>>> For example, if you have an input sorted array A & B, A has 8 integers
>>>> & B has 24 integers. Given tmp buffer whose capacity = 4 integers.
>>>>
>>>> Current it tries to split from B, right?
>>>>
>>>> Then we have:
>>>>
>>>> A1 | A2Â B1 | B2
>>>>
>>>> B1 & B2 has 12 integers each, right?
>>>>
>>>> Current algorithm selects pivot as 13th integer from B, right?
>>>>
>>>> If the corresponding upper bound of A is 6th integer.
>>>>
>>>> Then it split in
>>>>
>>>> A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12
>>>>
>>>> After rotate, we have two arrays to merge
>>>>
>>>> [A1 = 5 | B1 = 12]Â & [A2 = 3 | B2 = 12]
>>>>
>>>> Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.
>>>>
>>>> Sadly, [A1 = 5 | B1 = 12] CANNOT.
>>>>
>>>> So we do rotate again, split & merge the two split arrays from [A1 = 5
>>>> | B1 = 12] again.
>>>>
>>>>
>>>> But wait, if we always split from the smaller one instead of larger
>>>> one.
>>>>
>>>> After rotate, it promises two split arrays both contain
>>>> ceiling[small/2].
>>>>
>>>> Since tmp buffer also allocate by starting from sizeof(small) &
>>>> recursively downgrade by ceiling[small/2^(# of fail allocate)].
>>>>
>>>> It means the allocated tmp buffer promises to be sufficient at the
>>>> level of (# of fail allocate).
>>>>
>>>> Instead, you can see if split from large at level (# of fail allocate)
>>>> several split array still CANNOT use tmp buffer to do buffered merge.
>>>>
>>>>
>>>> As you know, buffered merge is far speed then (split, rotate, and
>>>> merge two sub-arrays) (PR c++/83938 gives the profiling figures),
>>>>
>>>> the way should provide speedup.
>>>>
>>>>
>>>> Thanks.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 24/01/2018 18:23, François Dumont wrote:
>>>>
>>>> Hi
>>>>
>>>>
>>>> Â Â Â Â It sounds like a very sensitive change to make but nothing
>>>> worth figures.
>>>> Do you have any bench showing the importance of the gain ?
>>>>
>>>> Â Â Â Â At least the memory usage optimization is obvious.
>>>>
>>>> On 19/01/2018 10:43, chang jc wrote:
>>>>
>>>> Current std::inplace_merge() suffers from performance issue by
>>>> inefficient
>>>>
>>>> logic under limited memory,
>>>>
>>>> It leads to performance downgrade.
>>>>
>>>> Please help to review it.
>>>>
>>>> Index: include/bits/stl_algo.h
>>>> ===================================================================
>>>> --- include/bits/stl_algo.h   (revision 256871)
>>>> +++ include/bits/stl_algo.h   (working copy)
>>>> @@ -2437,7 +2437,7 @@
>>>> Â Â Â Â Â Â Â Â _BidirectionalIterator __second_cut = __middle;
>>>> Â Â Â Â Â Â Â Â _Distance __len11 = 0;
>>>> Â Â Â Â Â Â Â Â _Distance __len22 = 0;
>>>> -Â Â Â Â Â if (__len1 > __len2)
>>>> +Â Â Â Â Â if (__len1 < __len2)
>>>> Â Â Â Â Â Â Â Â Â Â {
>>>> Â Â Â Â Â Â Â Â Â Â Â Â __len11 = __len1 / 2;
>>>> Â Â Â Â Â Â Â Â Â Â Â Â std::advance(__first_cut, __len11);
>>>> @@ -2539,9 +2539,15 @@
>>>> Â Â Â Â Â Â Â Â const _DistanceType __len1 = std::distance(__first,
>>>> __middle);
>>>> Â Â Â Â Â Â Â Â const _DistanceType __len2 = std::distance(__middle, __last);
>>>>
>>>> +
>>>>
>>>> Â Â Â Â Â Â Â Â typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
>>>> _TmpBuf;
>>>>
>>>> -Â Â Â Â Â _TmpBuf __buf(__first, __last);
>>>> -
>>>> +Â Â Â Â Â _BidirectionalIterator __start, __end;
>>>> +Â Â Â Â Â if (__len1 < __len2) {
>>>> +Â Â Â __start = __first; __end = __middle;
>>>> +Â Â Â Â Â } else {
>>>> +Â Â Â __start = __middle; __end = __last;
>>>> +Â Â Â Â Â }
>>>> +Â Â Â Â Â _TmpBuf __buf(__start, ___end);
>>>>
>>>> Note another optimization, we could introduce a _Temporary_buffer<>
>>>> constructor
>>>> in order to write:
>>>>
>>>> _TmpBuf __buf(std::min(__len1, __len2), __first);
>>>>
>>>> So that std::distance do not need to be called again.
>>>>
>>>> Â Â Â Â Â Â Â Â if (__buf.begin() == 0)
>>>> Â Â Â Â Â Â std::__merge_without_buffer
>>>> Â Â Â Â Â Â Â Â (__first, __middle, __last, __len1, __len2, __comp);
>>>> Index: include/bits/stl_tempbuf.h
>>>> ===================================================================
>>>> --- include/bits/stl_tempbuf.h   (revision 256871)
>>>> +++ include/bits/stl_tempbuf.h   (working copy)
>>>> @@ -95,7 +95,7 @@
>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â std::nothrow));
>>>> Â Â Â Â Â Â Â Â if (__tmp != 0)
>>>> Â Â Â Â Â Â Â Â Â Â return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>>> -Â Â Â Â Â __len /= 2;
>>>> +Â Â Â Â Â __len = (__len + 1) / 2;
>>>>
>>>> This part is problematic, if __len is 1 and allocation fails it
>>>> will loop
>>>> forever.
>>>>
>>>> It doesn't seem really necessary for your patch.
>>>>
>>>>
>>>> 2018-01-20 4:05 GMT+08:00 Jason Merrill<jason@redhat.com>:
>>>>
>>>>> This is a libstdc++ bug and patch, not the C++ front end. So I'm
>>>>> adding the libstdc++ list to CC.
>>>>>
>>>>> On Fri, Jan 19, 2018 at 3:02 AM, chang jc<r97922153@gmail.com>Â
>>>>> wrote:
>>>>>> Current std::inplace_merge() suffers from performance issue by
>>>>> inefficient
>>>>>> logic under limited memory,
>>>>>>
>>>>>> It leads to performance downgrade.
>>>>>>
>>>>>> Please help to review it.
>>>>>>
>>>>>> Index: include/bits/stl_algo.h
>>>>>> ===================================================================
>>>>>> --- include/bits/stl_algo.h    (revision 256871)
>>>>>> +++ include/bits/stl_algo.h    (working copy)
>>>>>> @@ -2437,7 +2437,7 @@
>>>>>> Â Â Â Â Â Â Â Â Â Â _BidirectionalIterator __second_cut = __middle;
>>>>>> Â Â Â Â Â Â Â Â Â Â _Distance __len11 = 0;
>>>>>> Â Â Â Â Â Â Â Â Â Â _Distance __len22 = 0;
>>>>>> -Â Â Â Â Â Â Â Â if (__len1 > __len2)
>>>>>> +Â Â Â Â Â Â Â Â if (__len1 < __len2)
>>>>>> Â Â Â Â Â Â Â Â Â Â Â Â {
>>>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â __len11 = __len1 / 2;
>>>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â std::advance(__first_cut, __len11);
>>>>>> @@ -2539,9 +2539,15 @@
>>>>>> Â Â Â Â Â Â Â const _DistanceType __len1 = std::distance(__first,
>>>>>> __middle);
>>>>>> Â Â Â Â Â Â Â const _DistanceType __len2 = std::distance(__middle,
>>>>>> __last);
>>>>>>
>>>>>> +
>>>>>> Â Â Â Â Â Â Â typedef _Temporary_buffer<_BidirectionalIterator,
>>>>>> _ValueType>
>>>>> _TmpBuf;
>>>>>> -Â Â Â Â Â _TmpBuf __buf(__first, __last);
>>>>>> -
>>>>>> +Â Â Â Â Â _BidirectionalIterator __start, __end;
>>>>>> +Â Â Â Â Â if (__len1 < __len2) {
>>>>>> +Â Â Â Â Â Â __start = __first; __end = __middle;
>>>>>> +Â Â Â Â Â } else {
>>>>>> +Â Â Â Â Â Â __start = __middle; __end = __last;
>>>>>> +Â Â Â Â Â }
>>>>>> +Â Â Â Â Â _TmpBuf __buf(__start, ___end);
>>>>>> Â Â Â Â Â Â Â if (__buf.begin() == 0)
>>>>>> Â Â Â Â Â Â Â Â std::__merge_without_buffer
>>>>>> Â Â Â Â Â Â Â Â Â Â (__first, __middle, __last, __len1, __len2, __comp);
>>>>>> Index: include/bits/stl_tempbuf.h
>>>>>> ===================================================================
>>>>>> --- include/bits/stl_tempbuf.h (revision 256871)
>>>>>> +++ include/bits/stl_tempbuf.h (working copy)
>>>>>> @@ -95,7 +95,7 @@
>>>>>> std::nothrow));
>>>>>> Â Â Â Â Â Â Â Â Â Â if (__tmp != 0)
>>>>>> Â Â Â Â Â Â Â Â Â Â Â Â return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>>>>> -Â Â Â Â Â Â Â Â __len /= 2;
>>>>>> +Â Â Â Â Â Â Â Â __len = (__len + 1) / 2;
>>>>>> Â Â Â Â Â Â Â Â }
>>>>>> Â Â Â Â Â Â Â return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
>>>>>> Â Â Â Â Â }
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks
>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)
2018-10-29 6:07 ` François Dumont
@ 2018-12-21 20:57 ` Jonathan Wakely
2019-06-09 14:27 ` François Dumont
0 siblings, 1 reply; 12+ messages in thread
From: Jonathan Wakely @ 2018-12-21 20:57 UTC (permalink / raw)
To: François Dumont; +Cc: libstdc++, gcc-patches, chang jc
On 29/10/18 07:06 +0100, François Dumont wrote:
>Hi
>
>Â Â Â Some feedback regarding this patch ?
Sorry this got missed, please resubmit during stage 1.
You haven't CC'd the original patch author (chang jc) to give them a
chance to comment on your proposed changes to the patch.
The attached PDF on PR libstdc++/83938 has extensive discussion of the
performance issue, but I don't see any for your version. Some
performance benchmarks for your version would be helpful.
>
>Thanks,
>François
>
>On 8/21/18 10:34 PM, François Dumont wrote:
>>I missed a test that was failing because of this patch. So I revert
>>a small part of it and here is the new proposal.
>>
>>Tested under Linux x86_64, ok to commit ?
>>
>>François
>>
>>
>>On 24/07/2018 12:22, François Dumont wrote:
>>>Hi
>>>
>>>Â Â Â Any chance to review this patch ?
>>>
>>>François
>>>
>>>
>>>On 06/06/2018 18:39, François Dumont wrote:
>>>>Hi
>>>>
>>>>Â Â Â I review and rework this proposal. I noticed that the same
>>>>idea to limit buffer size within inplace_merge also apply to
>>>>stable_sort.
>>>>
>>>>Â Â Â I also change the decision when buffer is too small to
>>>>consider the buffer size rather than going through successive
>>>>cuts of the original ranges. This way we won't cut the range
>>>>more than necessary. Note that if you prefer this second part
>>>>could be committed seperatly.
>>>>
>>>>Â Â Â PR libstdc++/83938
>>>>Â Â Â * include/bits/stl_algo.h:
>>>>Â Â Â (__stable_partition_adaptive): When buffer to small cut range using
>>>>Â Â Â buffer size.
>>>>Â Â Â (__inplace_merge): Take temporary buffer length from
>>>>smallest range.
>>>>Â Â Â (__merge_adaptive): When buffer too small consider smallest
>>>>range first
>>>>Â Â Â and cut based on buffer size.
>>>>Â Â Â (__stable_sort_adaptive): When buffer too small cut based on buffer
>>>>Â Â Â size.
>>>>Â Â Â (__stable_sort): Limit temporary buffer length.
>>>>Â Â Â * include/bits/stl_tempbuf.h (get_temporary_buffer): Change
>>>>function
>>>>Â Â Â to reduce requested buffer length on allocation failure.
>>>>
>>>>Tested under Linux x86_64.
>>>>
>>>>Ok to commit ?
>>>>
>>>>François
>>>>
>>>>
>>>>On 25/01/2018 23:37, chang jc wrote:
>>>>>Hi:
>>>>>
>>>>>1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
>>>>>__len = (__len == 1) ? 0 : ((__len + 1) / 2);
>>>>>
>>>>>2. The coding gain has been shown PR c++/83938; I re-post here
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>Â Â 21
>>>>>Â Â 20
>>>>>Â Â 19
>>>>>Â Â 18
>>>>>Â Â 17
>>>>>Â Â 16
>>>>>
>>>>>
>>>>>Â Â 0.471136
>>>>>Â Â 0.625695
>>>>>Â Â 0.767262
>>>>>Â Â 0.907461
>>>>>Â Â 1.04838
>>>>>Â Â 1.19508
>>>>>
>>>>>
>>>>>Â Â 0.340845
>>>>>Â Â 0.48651
>>>>>Â Â 0.639139
>>>>>Â Â 0.770133
>>>>>Â Â 0.898454
>>>>>Â Â 1.04632
>>>>>
>>>>>it means when Merge [0, 4325376, 16777216); A is a sorted integer with
>>>>>4325376 & B with 12451840 elements, total with 16M integers
>>>>>
>>>>>The proposed method has the speed up under given buffer size =, ex
>>>>>2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
>>>>>given sizof(int)*64K bytes.
>>>>>
>>>>>3. As your suggestion, _TmpBuf __buf should be rewrite.
>>>>>
>>>>>4. It represents a fact that the intuitive idea to split from larger
>>>>>part is wrong.
>>>>>
>>>>>For example, if you have an input sorted array A & B, A has 8 integers
>>>>>& B has 24 integers. Given tmp buffer whose capacity = 4 integers.
>>>>>
>>>>>Current it tries to split from B, right?
>>>>>
>>>>>Then we have:
>>>>>
>>>>>A1 | A2Â B1 | B2
>>>>>
>>>>>B1 & B2 has 12 integers each, right?
>>>>>
>>>>>Current algorithm selects pivot as 13th integer from B, right?
>>>>>
>>>>>If the corresponding upper bound of A is 6th integer.
>>>>>
>>>>>Then it split in
>>>>>
>>>>>A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12
>>>>>
>>>>>After rotate, we have two arrays to merge
>>>>>
>>>>>[A1 = 5 | B1 = 12]Â & [A2 = 3 | B2 = 12]
>>>>>
>>>>>Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.
>>>>>
>>>>>Sadly, [A1 = 5 | B1 = 12] CANNOT.
>>>>>
>>>>>So we do rotate again, split & merge the two split arrays from [A1 = 5
>>>>>| B1 = 12] again.
>>>>>
>>>>>
>>>>>But wait, if we always split from the smaller one instead of
>>>>>larger one.
>>>>>
>>>>>After rotate, it promises two split arrays both contain
>>>>>ceiling[small/2].
>>>>>
>>>>>Since tmp buffer also allocate by starting from sizeof(small) &
>>>>>recursively downgrade by ceiling[small/2^(# of fail allocate)].
>>>>>
>>>>>It means the allocated tmp buffer promises to be sufficient at the
>>>>>level of (# of fail allocate).
>>>>>
>>>>>Instead, you can see if split from large at level (# of fail allocate)
>>>>>several split array still CANNOT use tmp buffer to do buffered merge.
>>>>>
>>>>>
>>>>>As you know, buffered merge is far speed then (split, rotate, and
>>>>>merge two sub-arrays) (PR c++/83938 gives the profiling figures),
>>>>>
>>>>>the way should provide speedup.
>>>>>
>>>>>
>>>>>Thanks.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>On 24/01/2018 18:23, François Dumont wrote:
>>>>>
>>>>>Hi
>>>>>
>>>>>
>>>>>Â Â Â Â It sounds like a very sensitive change to make but
>>>>>nothing worth figures.
>>>>>Do you have any bench showing the importance of the gain ?
>>>>>
>>>>>Â Â Â Â At least the memory usage optimization is obvious.
>>>>>
>>>>>On 19/01/2018 10:43, chang jc wrote:
>>>>>
>>>>>Current std::inplace_merge() suffers from performance issue by
>>>>>inefficient
>>>>>
>>>>>logic under limited memory,
>>>>>
>>>>>It leads to performance downgrade.
>>>>>
>>>>>Please help to review it.
>>>>>
>>>>>Index: include/bits/stl_algo.h
>>>>>===================================================================
>>>>>--- include/bits/stl_algo.h   (revision 256871)
>>>>>+++ include/bits/stl_algo.h   (working copy)
>>>>>@@ -2437,7 +2437,7 @@
>>>>>Â Â Â Â Â Â Â Â _BidirectionalIterator __second_cut = __middle;
>>>>>Â Â Â Â Â Â Â Â _Distance __len11 = 0;
>>>>>Â Â Â Â Â Â Â Â _Distance __len22 = 0;
>>>>>-Â Â Â Â Â if (__len1 > __len2)
>>>>>+Â Â Â Â Â if (__len1 < __len2)
>>>>>Â Â Â Â Â Â Â Â Â Â {
>>>>>Â Â Â Â Â Â Â Â Â Â Â Â __len11 = __len1 / 2;
>>>>>Â Â Â Â Â Â Â Â Â Â Â Â std::advance(__first_cut, __len11);
>>>>>@@ -2539,9 +2539,15 @@
>>>>>Â Â Â Â Â Â Â Â const _DistanceType __len1 = std::distance(__first,
>>>>>__middle);
>>>>>Â Â Â Â Â Â Â Â const _DistanceType __len2 = std::distance(__middle, __last);
>>>>>
>>>>>+
>>>>>
>>>>>Â Â Â Â Â Â Â Â typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
>>>>>_TmpBuf;
>>>>>
>>>>>-Â Â Â Â Â _TmpBuf __buf(__first, __last);
>>>>>-
>>>>>+Â Â Â Â Â _BidirectionalIterator __start, __end;
>>>>>+Â Â Â Â Â if (__len1 < __len2) {
>>>>>+Â Â Â __start = __first; __end = __middle;
>>>>>+Â Â Â Â Â } else {
>>>>>+Â Â Â __start = __middle; __end = __last;
>>>>>+Â Â Â Â Â }
>>>>>+Â Â Â Â Â _TmpBuf __buf(__start, ___end);
>>>>>
>>>>>Note another optimization, we could introduce a
>>>>>_Temporary_buffer<> constructor
>>>>>in order to write:
>>>>>
>>>>>_TmpBuf __buf(std::min(__len1, __len2), __first);
>>>>>
>>>>>So that std::distance do not need to be called again.
>>>>>
>>>>>Â Â Â Â Â Â Â Â if (__buf.begin() == 0)
>>>>>Â Â Â Â Â Â std::__merge_without_buffer
>>>>>Â Â Â Â Â Â Â Â (__first, __middle, __last, __len1, __len2, __comp);
>>>>>Index: include/bits/stl_tempbuf.h
>>>>>===================================================================
>>>>>--- include/bits/stl_tempbuf.h   (revision 256871)
>>>>>+++ include/bits/stl_tempbuf.h   (working copy)
>>>>>@@ -95,7 +95,7 @@
>>>>>Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â std::nothrow));
>>>>>Â Â Â Â Â Â Â Â if (__tmp != 0)
>>>>>Â Â Â Â Â Â Â Â Â Â return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>>>>-Â Â Â Â Â __len /= 2;
>>>>>+Â Â Â Â Â __len = (__len + 1) / 2;
>>>>>
>>>>>This part is problematic, if __len is 1 and allocation fails
>>>>>it will loop
>>>>>forever.
>>>>>
>>>>>It doesn't seem really necessary for your patch.
>>>>>
>>>>>
>>>>>2018-01-20 4:05 GMT+08:00 Jason Merrill<jason@redhat.com>:
>>>>>
>>>>>>This is a libstdc++ bug and patch, not the C++ front end. So I'm
>>>>>>adding the libstdc++ list to CC.
>>>>>>
>>>>>>On Fri, Jan 19, 2018 at 3:02 AM, chang
>>>>>>jc<r97922153@gmail.com>Â wrote:
>>>>>>>Current std::inplace_merge() suffers from performance issue by
>>>>>>inefficient
>>>>>>>logic under limited memory,
>>>>>>>
>>>>>>>It leads to performance downgrade.
>>>>>>>
>>>>>>>Please help to review it.
>>>>>>>
>>>>>>>Index: include/bits/stl_algo.h
>>>>>>>===================================================================
>>>>>>>--- include/bits/stl_algo.h    (revision 256871)
>>>>>>>+++ include/bits/stl_algo.h    (working copy)
>>>>>>>@@ -2437,7 +2437,7 @@
>>>>>>>Â Â Â Â Â Â Â Â Â Â _BidirectionalIterator __second_cut = __middle;
>>>>>>>Â Â Â Â Â Â Â Â Â Â _Distance __len11 = 0;
>>>>>>>Â Â Â Â Â Â Â Â Â Â _Distance __len22 = 0;
>>>>>>>-Â Â Â Â Â Â Â Â if (__len1 > __len2)
>>>>>>>+Â Â Â Â Â Â Â Â if (__len1 < __len2)
>>>>>>>Â Â Â Â Â Â Â Â Â Â Â Â {
>>>>>>>Â Â Â Â Â Â Â Â Â Â Â Â Â Â __len11 = __len1 / 2;
>>>>>>>Â Â Â Â Â Â Â Â Â Â Â Â Â Â std::advance(__first_cut, __len11);
>>>>>>>@@ -2539,9 +2539,15 @@
>>>>>>>Â Â Â Â Â Â Â const _DistanceType __len1 =
>>>>>>>std::distance(__first, __middle);
>>>>>>>Â Â Â Â Â Â Â const _DistanceType __len2 =
>>>>>>>std::distance(__middle, __last);
>>>>>>>
>>>>>>>+
>>>>>>>Â Â Â Â Â Â Â typedef _Temporary_buffer<_BidirectionalIterator,
>>>>>>>_ValueType>
>>>>>>_TmpBuf;
>>>>>>>-Â Â Â Â Â _TmpBuf __buf(__first, __last);
>>>>>>>-
>>>>>>>+Â Â Â Â Â _BidirectionalIterator __start, __end;
>>>>>>>+Â Â Â Â Â if (__len1 < __len2) {
>>>>>>>+Â Â Â Â Â Â __start = __first; __end = __middle;
>>>>>>>+Â Â Â Â Â } else {
>>>>>>>+Â Â Â Â Â Â __start = __middle; __end = __last;
>>>>>>>+Â Â Â Â Â }
>>>>>>>+Â Â Â Â Â _TmpBuf __buf(__start, ___end);
>>>>>>>Â Â Â Â Â Â Â if (__buf.begin() == 0)
>>>>>>>Â Â Â Â Â Â Â Â std::__merge_without_buffer
>>>>>>>Â Â Â Â Â Â Â Â Â Â (__first, __middle, __last, __len1, __len2, __comp);
>>>>>>>Index: include/bits/stl_tempbuf.h
>>>>>>>===================================================================
>>>>>>>--- include/bits/stl_tempbuf.h (revision 256871)
>>>>>>>+++ include/bits/stl_tempbuf.h (working copy)
>>>>>>>@@ -95,7 +95,7 @@
>>>>>>>std::nothrow));
>>>>>>>Â Â Â Â Â Â Â Â Â Â if (__tmp != 0)
>>>>>>>Â Â Â Â Â Â Â Â Â Â Â Â return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>>>>>>-Â Â Â Â Â Â Â Â __len /= 2;
>>>>>>>+Â Â Â Â Â Â Â Â __len = (__len + 1) / 2;
>>>>>>>Â Â Â Â Â Â Â Â }
>>>>>>>Â Â Â Â Â Â Â return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
>>>>>>>Â Â Â Â Â }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>Thanks
>>>>
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)
2018-12-21 20:57 ` Jonathan Wakely
@ 2019-06-09 14:27 ` François Dumont
2019-07-16 16:40 ` François Dumont
0 siblings, 1 reply; 12+ messages in thread
From: François Dumont @ 2019-06-09 14:27 UTC (permalink / raw)
To: Jonathan Wakely; +Cc: libstdc++, gcc-patches, chang jc
[-- Attachment #1: Type: text/plain, Size: 2167 bytes --]
On 12/21/18 9:57 PM, Jonathan Wakely wrote:
> On 29/10/18 07:06 +0100, François Dumont wrote:
>> Hi
>>
>> Â Â Â Some feedback regarding this patch ?
>
> Sorry this got missed, please resubmit during stage 1.
>
> You haven't CC'd the original patch author (chang jc) to give them a
> chance to comment on your proposed changes to the patch.
>
> The attached PDF on PR libstdc++/83938 has extensive discussion of the
> performance issue, but I don't see any for your version. Some
> performance benchmarks for your version would be helpful.
Here is this patch again.
This time it is much closer to John's original one, I just kept my
change on the size of the temporary buffer which doesn't need to be as
large as it is currently allocated, especially with John's patch.
The performance tests are showing the good improvements, attached are
the before/after. Surprisingly I do not see any change regarding
allocated memory, I guess measures are not accurate enough. However
working on them also show that current implementation is fragile. If you
reduce the allowed allocated memory too much you end up with a stack
overflow because of the recursive implementation.
I already have a patch to keep on trying to allocate memory as long as
above a given threshold rather than 0 but I am also working on making
implementation non-recursive. We'll see if even a buffer of size 1 is
better than no buffer at all then.
   PR libstdc++/83938
   * include/bits/stl_algo.h:
   (__merge_adaptive): When buffer too small consider smallest range first
   and cut based on buffer size.
   (__inplace_merge): Take temporary buffer length from smallest range.
   (__stable_sort): Limit temporary buffer length.
   * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
   to reduce requested buffer length on allocation failure.
   * testsuite/performance/25_algorithms/inplace_merge.cc: New.
   * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
   execution with different level of memory.
François
[-- Attachment #2: after_libstdc++-performance.sum --]
[-- Type: text/plain, Size: 3367 bytes --]
inplace_merge.cc reverse 19r 19u 0s 48mem 0pf
inplace_merge.cc forwards 0r 0u 0s 0mem 0pf
inplace_merge.cc random 18r 18u 0s 0mem 0pf
inplace_merge.cc reverse 10r 10u 0s 0mem 0pf
inplace_merge.cc forwards 8r 8u 0s 0mem 0pf
inplace_merge.cc random 22r 22u 0s 0mem 0pf
inplace_merge.cc bench 1/2 memory 1891r 1884u 7s 928mem 0pf
inplace_merge.cc reverse 19r 18u 0s 0mem 0pf
inplace_merge.cc forwards 0r 0u 0s 0mem 0pf
inplace_merge.cc random 18r 18u 0s 0mem 0pf
inplace_merge.cc reverse 10r 10u 0s 0mem 0pf
inplace_merge.cc forwards 3r 3u 0s 0mem 0pf
inplace_merge.cc random 22r 22u 0s 0mem 0pf
inplace_merge.cc bench 1/4 memory 1867r 1861u 6s 192mem 0pf
inplace_merge.cc reverse 17r 17u 0s 0mem 0pf
inplace_merge.cc forwards 0r 0u 0s 0mem 0pf
inplace_merge.cc random 25r 24u 0s 0mem 0pf
inplace_merge.cc reverse 28r 28u 0s 0mem 0pf
inplace_merge.cc forwards 0r 0u 0s 0mem 0pf
inplace_merge.cc random 289r 289u 0s 0mem 0pf
inplace_merge.cc bench no memory 2155r 2149u 6s 192mem 0pf
stable_sort.cc reverse 240r 240u 1s 0mem 0pf
stable_sort.cc forwards 205r 205u 0s 0mem 0pf
stable_sort.cc random 493r 493u 0s 0mem 0pf
stable_sort.cc bench full memory 945r 939u 5s 752mem 0pf
stable_sort.cc reverse 236r 236u 0s 0mem 0pf
stable_sort.cc forwards 199r 199u 0s 0mem 0pf
stable_sort.cc random 487r 487u 0s 0mem 0pf
stable_sort.cc bench 1/2 memory 928r 925u 3s 96mem 0pf
stable_sort.cc reverse 240r 240u 0s 0mem 0pf
stable_sort.cc forwards 203r 203u 0s 0mem 0pf
stable_sort.cc random 486r 487u 0s 0mem 0pf
stable_sort.cc bench 1/4 memory 935r 932u 5s 96mem 0pf
stable_sort.cc reverse 829r 829u 0s 0mem 0pf
stable_sort.cc forwards 143r 143u 0s 0mem 0pf
stable_sort.cc random 487r 486u 0s 0mem 0pf
stable_sort.cc bench no memory 1465r 1461u 2s 96mem 0pf
[-- Attachment #3: before_libstdc++-performance.sum --]
[-- Type: text/plain, Size: 3367 bytes --]
inplace_merge.cc reverse 19r 19u 0s 0mem 0pf
inplace_merge.cc forwards 0r 0u 0s 0mem 0pf
inplace_merge.cc random 19r 19u 0s 0mem 0pf
inplace_merge.cc reverse 11r 10u 0s 0mem 0pf
inplace_merge.cc forwards 8r 8u 0s 0mem 0pf
inplace_merge.cc random 22r 22u 0s 0mem 0pf
inplace_merge.cc bench 1/2 memory 1938r 1925u 12s 928mem 0pf
inplace_merge.cc reverse 19r 19u 0s 0mem 0pf
inplace_merge.cc forwards 0r 0u 0s 0mem 0pf
inplace_merge.cc random 18r 19u 0s 0mem 0pf
inplace_merge.cc reverse 10r 11u 0s 0mem 0pf
inplace_merge.cc forwards 3r 2u 0s 0mem 0pf
inplace_merge.cc random 22r 22u 0s 0mem 0pf
inplace_merge.cc bench 1/4 memory 1922r 1915u 7s 192mem 0pf
inplace_merge.cc reverse 18r 17u 0s 0mem 0pf
inplace_merge.cc forwards 0r 0u 0s 0mem 0pf
inplace_merge.cc random 23r 24u 0s 0mem 0pf
inplace_merge.cc reverse 28r 27u 0s 0mem 0pf
inplace_merge.cc forwards 0r 0u 0s 0mem 0pf
inplace_merge.cc random 288r 289u 0s 0mem 0pf
inplace_merge.cc bench no memory 2204r 2196u 9s 192mem 0pf
stable_sort.cc reverse 234r 234u 0s 0mem 0pf
stable_sort.cc forwards 201r 200u 1s 0mem 0pf
stable_sort.cc random 485r 484u 1s 0mem 0pf
stable_sort.cc bench full memory 927r 920u 5s 752mem 0pf
stable_sort.cc reverse 233r 231u 1s 0mem 0pf
stable_sort.cc forwards 198r 198u 0s 0mem 0pf
stable_sort.cc random 478r 478u 0s 0mem 0pf
stable_sort.cc bench 1/2 memory 915r 910u 4s 96mem 0pf
stable_sort.cc reverse 236r 236u 0s 0mem 0pf
stable_sort.cc forwards 201r 200u 0s 0mem 0pf
stable_sort.cc random 478r 478u 1s 0mem 0pf
stable_sort.cc bench 1/4 memory 921r 916u 4s 96mem 0pf
stable_sort.cc reverse 830r 830u 0s 0mem 0pf
stable_sort.cc forwards 146r 146u 0s 0mem 0pf
stable_sort.cc random 479r 478u 0s 0mem 0pf
stable_sort.cc bench no memory 1461r 1456u 5s 96mem 0pf
[-- Attachment #4: pr83938.patch --]
[-- Type: text/x-patch, Size: 10790 bytes --]
diff --git a/libstdc++-v3/include/bits/stl_algo.h b/libstdc++-v3/include/bits/stl_algo.h
index 1ba447bcb6e..e36efb515e8 100644
--- a/libstdc++-v3/include/bits/stl_algo.h
+++ b/libstdc++-v3/include/bits/stl_algo.h
@@ -2418,7 +2418,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
_BidirectionalIterator __second_cut = __middle;
_Distance __len11 = 0;
_Distance __len22 = 0;
- if (__len1 > __len2)
+ if (__len1 < __len2)
{
__len11 = __len1 / 2;
std::advance(__first_cut, __len11);
@@ -2439,14 +2439,14 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
_BidirectionalIterator __new_middle
= std::__rotate_adaptive(__first_cut, __middle, __second_cut,
- __len1 - __len11, __len22, __buffer,
- __buffer_size);
- std::__merge_adaptive(__first, __first_cut, __new_middle, __len11,
- __len22, __buffer, __buffer_size, __comp);
+ __len1 - __len11, __len22,
+ __buffer, __buffer_size);
+ std::__merge_adaptive(__first, __first_cut, __new_middle,
+ __len11, __len22,
+ __buffer, __buffer_size, __comp);
std::__merge_adaptive(__new_middle, __second_cut, __last,
- __len1 - __len11,
- __len2 - __len22, __buffer,
- __buffer_size, __comp);
+ __len1 - __len11, __len2 - __len22,
+ __buffer, __buffer_size, __comp);
}
}
@@ -2520,7 +2520,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
const _DistanceType __len2 = std::distance(__middle, __last);
typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> _TmpBuf;
- _TmpBuf __buf(__first, __len1 + __len2);
+ _TmpBuf __buf(__first, std::min(__len1, __len2));
if (__buf.begin() == 0)
std::__merge_without_buffer
@@ -2728,6 +2728,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
std::__merge_sort_with_buffer(__first, __middle, __buffer, __comp);
std::__merge_sort_with_buffer(__middle, __last, __buffer, __comp);
}
+
std::__merge_adaptive(__first, __middle, __last,
_Distance(__middle - __first),
_Distance(__last - __middle),
@@ -4980,8 +4981,11 @@ _GLIBCXX_BEGIN_NAMESPACE_ALGO
typedef typename iterator_traits<_RandomAccessIterator>::difference_type
_DistanceType;
+ if (__first == __last)
+ return;
+
typedef _Temporary_buffer<_RandomAccessIterator, _ValueType> _TmpBuf;
- _TmpBuf __buf(__first, std::distance(__first, __last));
+ _TmpBuf __buf(__first, (__last - __first + 1) / 2);
if (__buf.begin() == 0)
std::__inplace_stable_sort(__first, __last, __comp);
diff --git a/libstdc++-v3/include/bits/stl_tempbuf.h b/libstdc++-v3/include/bits/stl_tempbuf.h
index fa78ee53eb4..b6ad9ee6a46 100644
--- a/libstdc++-v3/include/bits/stl_tempbuf.h
+++ b/libstdc++-v3/include/bits/stl_tempbuf.h
@@ -95,7 +95,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
std::nothrow));
if (__tmp != 0)
return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
- __len /= 2;
+ __len = __len == 1 ? 0 : ((__len + 1) / 2);
}
return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
}
diff --git a/libstdc++-v3/testsuite/performance/25_algorithms/inplace_merge.cc b/libstdc++-v3/testsuite/performance/25_algorithms/inplace_merge.cc
new file mode 100644
index 00000000000..8522fcc8044
--- /dev/null
+++ b/libstdc++-v3/testsuite/performance/25_algorithms/inplace_merge.cc
@@ -0,0 +1,130 @@
+// Copyright (C) 2019 Free Software Foundation, Inc.
+//
+// This file is part of the GNU ISO C++ Library. This library is free
+// software; you can redistribute it and/or modify it under the
+// terms of the GNU General Public License as published by the
+// Free Software Foundation; either version 3, or (at your option)
+// any later version.
+
+// This library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+// GNU General Public License for more details.
+
+// You should have received a copy of the GNU General Public License along
+// with this library; see the file COPYING3. If not see
+// <http://www.gnu.org/licenses/>.
+
+#include <vector>
+#include <algorithm>
+
+#include <testsuite_new_operators.h>
+#include <testsuite_performance.h>
+
+const int max_size = 10000000;
+
+void bench(int mem_threshold, int pivot_index,
+ std::vector<int> revv,
+ std::vector<int> fwdv,
+ std::vector<int> rndv)
+{
+ using namespace __gnu_test;
+
+ time_counter time;
+ resource_counter resource;
+
+ std::vector<int>::iterator pivot = revv.begin() + pivot_index;
+ std::stable_sort(revv.begin(), pivot);
+ std::stable_sort(pivot, revv.end());
+
+ set_new_limit(mem_threshold);
+
+ start_counters(time, resource);
+ std::inplace_merge(revv.begin(), pivot, revv.end());
+ stop_counters(time, resource);
+
+ set_new_limit(~size_t(0));
+
+ report_performance(__FILE__, "reverse", time, resource);
+ clear_counters(time, resource);
+
+ pivot = fwdv.begin() + pivot_index;
+ std::stable_sort(fwdv.begin(), pivot);
+ std::stable_sort(pivot, fwdv.end());
+
+ set_new_limit(mem_threshold);
+
+ start_counters(time, resource);
+ std::inplace_merge(fwdv.begin(), pivot, fwdv.end());
+ stop_counters(time, resource);
+
+ set_new_limit(~size_t(0));
+
+ report_performance(__FILE__, "forwards", time, resource);
+ clear_counters(time, resource);
+
+ pivot = rndv.begin() + pivot_index;
+ std::stable_sort(rndv.begin(), pivot);
+ std::stable_sort(pivot, rndv.end());
+
+ set_new_limit(mem_threshold);
+
+ start_counters(time, resource);
+ std::inplace_merge(rndv.begin(), pivot, rndv.end());
+ stop_counters(time, resource);
+
+ set_new_limit(~size_t(0));
+ report_performance(__FILE__, "random", time, resource);
+}
+
+int main()
+{
+ using namespace __gnu_test;
+
+ // No constraint to build vectors.
+ set_new_limit(~size_t(0));
+
+ std::vector<int> revv(max_size);
+ for (int i = 0; i < max_size; ++i)
+ revv[i] = -i;
+
+ std::vector<int> fwdv(max_size);
+ for (int i = 0; i < max_size; ++i)
+ fwdv[i] = i;
+
+ // a simple pseudo-random series which does not rely on rand() and friends
+ std::vector<int> rndv(max_size);
+ rndv[0] = 0;
+ for (int i = 1; i < max_size; ++i)
+ rndv[i] = (rndv[i-1] + 110211473) * 745988807;
+
+ time_counter time;
+ resource_counter resource;
+
+ start_counters(time, resource);
+ // Limit to half the expected size of the sorted array.
+ bench(max_size * sizeof(int) / 2, 10, revv, fwdv, rndv);
+ bench(max_size * sizeof(int) / 2, max_size / 2, revv, fwdv, rndv);
+ stop_counters(time, resource);
+
+ report_performance(__FILE__, "bench 1/2 memory", time, resource);
+ clear_counters(time, resource);
+
+ start_counters(time, resource);
+ // Limit to the fourth.
+ bench(max_size * sizeof(int) / 4, 10, revv, fwdv, rndv);
+ bench(max_size * sizeof(int) / 4, max_size / 2, revv, fwdv, rndv);
+ stop_counters(time, resource);
+
+ report_performance(__FILE__, "bench 1/4 memory", time, resource);
+ clear_counters(time, resource);
+
+ start_counters(time, resource);
+ // Forbid any allocation.
+ bench(0, 10, revv, fwdv, rndv);
+ bench(0, max_size / 2, revv, fwdv, rndv);
+ stop_counters(time, resource);
+
+ report_performance(__FILE__, "bench no memory", time, resource);
+ return 0;
+}
diff --git a/libstdc++-v3/testsuite/performance/25_algorithms/stable_sort.cc b/libstdc++-v3/testsuite/performance/25_algorithms/stable_sort.cc
index e79e0a7f6b2..f997b2a72ce 100644
--- a/libstdc++-v3/testsuite/performance/25_algorithms/stable_sort.cc
+++ b/libstdc++-v3/testsuite/performance/25_algorithms/stable_sort.cc
@@ -17,49 +17,102 @@
#include <vector>
#include <algorithm>
+
+#include <testsuite_new_operators.h>
#include <testsuite_performance.h>
-int main()
+const int max_size = 10000000;
+
+void bench(size_t mem_threshold,
+ std::vector<int> revv,
+ std::vector<int> fwdv,
+ std::vector<int> rndv)
{
using namespace __gnu_test;
time_counter time;
resource_counter resource;
- const int max_size = 10000000;
-
- std::vector<int> v(max_size);
-
- for (int i = 0; i < max_size; ++i)
- v[i] = -i;
+ set_new_limit(mem_threshold);
start_counters(time, resource);
- std::stable_sort(v.begin(), v.end());
+ std::stable_sort(revv.begin(), revv.end());
stop_counters(time, resource);
+ set_new_limit(~size_t(0));
report_performance(__FILE__, "reverse", time, resource);
clear_counters(time, resource);
- for (int i = 0; i < max_size; ++i)
- v[i] = i;
+ set_new_limit(mem_threshold);
start_counters(time, resource);
- std::stable_sort(v.begin(), v.end());
+ std::stable_sort(fwdv.begin(), fwdv.end());
stop_counters(time, resource);
+ set_new_limit(~size_t(0));
report_performance(__FILE__, "forwards", time, resource);
clear_counters(time, resource);
- // a simple psuedo-random series which does not rely on rand() and friends
- v[0] = 0;
+ start_counters(time, resource);
+ std::stable_sort(rndv.begin(), rndv.end());
+ stop_counters(time, resource);
+
+ set_new_limit(~size_t(0));
+ report_performance(__FILE__, "random", time, resource);
+}
+
+int main()
+{
+ using namespace __gnu_test;
+
+ // No memory constraint.
+ set_new_limit(~size_t(0));
+
+ std::vector<int> revv(max_size);
+ for (int i = 0; i < max_size; ++i)
+ revv[i] = -i;
+
+ std::vector<int> fwdv(max_size);
+ for (int i = 0; i < max_size; ++i)
+ fwdv[i] = i;
+
+ // a simple pseudo-random series which does not rely on rand() and friends
+ std::vector<int> rndv(max_size);
+ rndv[0] = 0;
for (int i = 1; i < max_size; ++i)
- v[i] = (v[i-1] + 110211473) * 745988807;
+ rndv[i] = (rndv[i-1] + 110211473) * 745988807;
+
+ time_counter time;
+ resource_counter resource;
start_counters(time, resource);
- std::stable_sort(v.begin(), v.end());
+ bench(~size_t(0), revv, fwdv, rndv);
stop_counters(time, resource);
- report_performance(__FILE__, "random", time, resource);
+ report_performance(__FILE__, "bench full memory", time, resource);
+ clear_counters(time, resource);
+
+ start_counters(time, resource);
+ // Limit to half the expected size of the sorted array.
+ bench(max_size * sizeof(int) / 2, revv, fwdv, rndv);
+ stop_counters(time, resource);
+
+ report_performance(__FILE__, "bench 1/2 memory", time, resource);
+ clear_counters(time, resource);
+
+ start_counters(time, resource);
+ // Limit to the fourth.
+ bench(max_size * sizeof(int) / 4, revv, fwdv, rndv);
+ stop_counters(time, resource);
+
+ report_performance(__FILE__, "bench 1/4 memory", time, resource);
+ clear_counters(time, resource);
+
+ start_counters(time, resource);
+ // Forbid any allocation.
+ bench(0, revv, fwdv, rndv);
+ stop_counters(time, resource);
+ report_performance(__FILE__, "bench no memory", time, resource);
return 0;
}
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)
2019-06-09 14:27 ` François Dumont
@ 2019-07-16 16:40 ` François Dumont
2020-11-19 12:08 ` [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)# Jonathan Wakely
0 siblings, 1 reply; 12+ messages in thread
From: François Dumont @ 2019-07-16 16:40 UTC (permalink / raw)
To: Jonathan Wakely; +Cc: libstdc++, gcc-patches, chang jc
[-- Attachment #1: Type: text/plain, Size: 6706 bytes --]
Hi
   I eventually spent much more time working on the inplace_merge
performance bench.
   And the results do not confirm the theory:
Before patch:
inplace_merge.cc           bench 1 / 1 memory          243r 227u Â
17s     1216mem   5pf
inplace_merge.cc           bench 1 / 4 memory          297r 278u Â
18s      480mem   0pf
inplace_merge.cc           bench 1 /64 memory         373r 354u Â
18s      480mem   0pf
inplace_merge.cc           bench 0 / 1 memory          12r 11u   Â
0s      480mem   0pf
After the patch to reduce memory allocation:
inplace_merge.cc           bench 1 / 1 memory          245r 227u Â
18s     1216mem   0pf
inplace_merge.cc           bench 1 / 4 memory          292r 273u Â
19s      480mem   0pf
inplace_merge.cc           bench 1 /64 memory         373r 356u Â
17s      480mem   0pf
inplace_merge.cc           bench 0 / 1 memory          11r 11u   Â
0s      480mem   0pf
With the __len1 > __len2 condition change:
inplace_merge.cc           bench 1 / 1 memory          244r 225u Â
20s     1216mem   0pf
inplace_merge.cc           bench 1 / 4 memory          379r 361u Â
17s      480mem   0pf
inplace_merge.cc           bench 1 /64 memory          640r 625u Â
16s      480mem   0pf
inplace_merge.cc           bench 0 / 1 memory          11r 11u  Â
0s      480mem   0pf
When there is no memory limit the results are identical of course.
Otherwise as soon as memory is limited performance starts to decrease
with the condition change on __len1 vs __len2.
Could you give the bench you use to demonstrate the enhancement ? I also
wonder why your patch doesn't change consistently the same condition in
__merge_without_buffer ?
For the moment I'd like to propose the attached patch that is to say the
reduction on the amount of allocated memory and the new/modified benches.
Note that as soon as I forbid any memory allocation I also reduce the
size of the range to merge cause the implementation rely on recursivity
and so could end-up in a stack overflow. Maybe I need to check for
simulator targets like several other tests ? Unless simulators do not
run the performance tests ?
Regarding this stack overflow issue, is there some routine to find out
how many levels of function calls can be added before reaching the stack
overflow ? I could perhaps call __builtin_alloca and check the result
but that smells. If I could find out this we could fallback on an
iterative approach to complete the merge.
   PR libstdc++/83938
   * include/bits/stl_algo.h:
   (__inplace_merge): Take temporary buffer length from smallest range.
   (__stable_sort): Limit temporary buffer length.
   * include/bits/stl_tempbuf.h (get_temporary_buffer): Change __len
   computation in the loop.
   * testsuite/25_algorithms/inplace_merge/1.cc (test3): Test all possible
   pivot positions.
   * testsuite/performance/25_algorithms/inplace_merge.cc: New.
   * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
   execution with different memory constraints.
Ok to commit ?
François
On 6/9/19 4:27 PM, François Dumont wrote:
> On 12/21/18 9:57 PM, Jonathan Wakely wrote:
>> On 29/10/18 07:06 +0100, François Dumont wrote:
>>> Hi
>>>
>>> Â Â Â Some feedback regarding this patch ?
>>
>> Sorry this got missed, please resubmit during stage 1.
>>
>> You haven't CC'd the original patch author (chang jc) to give them a
>> chance to comment on your proposed changes to the patch.
>>
>> The attached PDF on PR libstdc++/83938 has extensive discussion of the
>> performance issue, but I don't see any for your version. Some
>> performance benchmarks for your version would be helpful.
>
> Here is this patch again.
>
> This time it is much closer to John's original one, I just kept my
> change on the size of the temporary buffer which doesn't need to be as
> large as it is currently allocated, especially with John's patch.
>
> The performance tests are showing the good improvements, attached are
> the before/after. Surprisingly I do not see any change regarding
> allocated memory, I guess measures are not accurate enough. However
> working on them also show that current implementation is fragile. If
> you reduce the allowed allocated memory too much you end up with a
> stack overflow because of the recursive implementation.
>
> I already have a patch to keep on trying to allocate memory as long as
> above a given threshold rather than 0 but I am also working on making
> implementation non-recursive. We'll see if even a buffer of size 1 is
> better than no buffer at all then.
>
> Â Â Â PR libstdc++/83938
> Â Â Â * include/bits/stl_algo.h:
> Â Â Â (__merge_adaptive): When buffer too small consider smallest range
> first
> Â Â Â and cut based on buffer size.
> Â Â Â (__inplace_merge): Take temporary buffer length from smallest range.
> Â Â Â (__stable_sort): Limit temporary buffer length.
> Â Â Â * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
> Â Â Â to reduce requested buffer length on allocation failure.
> Â Â Â * testsuite/performance/25_algorithms/inplace_merge.cc: New.
> Â Â Â * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
> Â Â Â execution with different level of memory.
>
> François
>
[-- Attachment #2: pr83938.patch --]
[-- Type: text/x-patch, Size: 15992 bytes --]
diff --git a/libstdc++-v3/include/bits/stl_algo.h b/libstdc++-v3/include/bits/stl_algo.h
index ec651e2cc45..b396437443f 100644
--- a/libstdc++-v3/include/bits/stl_algo.h
+++ b/libstdc++-v3/include/bits/stl_algo.h
@@ -2530,7 +2530,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
const _DistanceType __len2 = std::distance(__middle, __last);
typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> _TmpBuf;
- _TmpBuf __buf(__first, __len1 + __len2);
+ _TmpBuf __buf(__first, std::min(__len1, __len2));
if (__buf.begin() == 0)
std::__merge_without_buffer
@@ -2738,6 +2738,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
std::__merge_sort_with_buffer(__first, __middle, __buffer, __comp);
std::__merge_sort_with_buffer(__middle, __last, __buffer, __comp);
}
+
std::__merge_adaptive(__first, __middle, __last,
_Distance(__middle - __first),
_Distance(__last - __middle),
@@ -5023,8 +5024,11 @@ _GLIBCXX_BEGIN_NAMESPACE_ALGO
typedef typename iterator_traits<_RandomAccessIterator>::difference_type
_DistanceType;
+ if (__first == __last)
+ return;
+
typedef _Temporary_buffer<_RandomAccessIterator, _ValueType> _TmpBuf;
- _TmpBuf __buf(__first, std::distance(__first, __last));
+ _TmpBuf __buf(__first, (__last - __first + 1) / 2);
if (__buf.begin() == 0)
std::__inplace_stable_sort(__first, __last, __comp);
diff --git a/libstdc++-v3/include/bits/stl_tempbuf.h b/libstdc++-v3/include/bits/stl_tempbuf.h
index fa78ee53eb4..b6ad9ee6a46 100644
--- a/libstdc++-v3/include/bits/stl_tempbuf.h
+++ b/libstdc++-v3/include/bits/stl_tempbuf.h
@@ -95,7 +95,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
std::nothrow));
if (__tmp != 0)
return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
- __len /= 2;
+ __len = __len == 1 ? 0 : ((__len + 1) / 2);
}
return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
}
diff --git a/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc b/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc
index 0b0c9c59640..9c393ce8fe3 100644
--- a/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc
+++ b/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc
@@ -28,7 +28,7 @@ using std::inplace_merge;
typedef test_container<int, bidirectional_iterator_wrapper> container;
-void
+void
test1()
{
int array[] = { 1 };
@@ -39,7 +39,7 @@ test1()
inplace_merge(con2.begin(), con2.begin(), con2.end());
}
-void
+void
test2()
{
int array[] = { 0, 2, 4, 1, 3, 5 };
@@ -60,30 +60,34 @@ struct S
{ return a < _s.a; }
};
-void
+void
test3()
{
S s[8];
- s[0].a = 0;
- s[1].a = 1;
- s[2].a = 2;
- s[3].a = 3;
- s[4].a = 0;
- s[5].a = 1;
- s[6].a = 2;
- s[7].a = 3;
-
- s[0].b = 0;
- s[1].b = 1;
- s[2].b = 2;
- s[3].b = 3;
- s[4].b = 4;
- s[5].b = 5;
- s[6].b = 6;
- s[7].b = 7;
-
- inplace_merge(s, s + 4, s + 8);
- VERIFY( s[0].b == 0 && s[1].b == 4 && s[2].b == 1 && s[3].b == 5 );
+ for (int pivot_idx = 0; pivot_idx < 8; ++pivot_idx)
+ {
+ int bval = 0;
+ for (int i = 0; i != pivot_idx; ++i)
+ {
+ s[i].a = i;
+ s[i].b = bval++;
+ }
+
+ for (int i = pivot_idx; i != 8; ++i)
+ {
+ s[i].a = i - pivot_idx;
+ s[i].b = bval++;
+ }
+
+ inplace_merge(s, s + pivot_idx, s + 8);
+
+ for (int i = 1; i < 8; ++i)
+ {
+ VERIFY( !(s[i] < s[i - 1]) );
+ if (s[i - 1].a == s[i].a)
+ VERIFY( s[i - 1].b < s[i].b );
+ }
+ }
}
int
diff --git a/libstdc++-v3/testsuite/performance/25_algorithms/inplace_merge.cc b/libstdc++-v3/testsuite/performance/25_algorithms/inplace_merge.cc
new file mode 100644
index 00000000000..99328b6ea4c
--- /dev/null
+++ b/libstdc++-v3/testsuite/performance/25_algorithms/inplace_merge.cc
@@ -0,0 +1,289 @@
+// Copyright (C) 2019 Free Software Foundation, Inc.
+//
+// This file is part of the GNU ISO C++ Library. This library is free
+// software; you can redistribute it and/or modify it under the
+// terms of the GNU General Public License as published by the
+// Free Software Foundation; either version 3, or (at your option)
+// any later version.
+
+// This library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+// GNU General Public License for more details.
+
+// You should have received a copy of the GNU General Public License along
+// with this library; see the file COPYING3. If not see
+// <http://www.gnu.org/licenses/>.
+
+#include <vector>
+#include <algorithm>
+#include <cmath>
+
+#include <testsuite_new_operators.h>
+#include <testsuite_performance.h>
+
+const int max_size = 10000000;
+const int small_size = 200000;
+const int front_pivot_idx = 10;
+int middle_pivot_idx = max_size / 2;
+int back_pivot_idx = max_size - front_pivot_idx;
+
+void bench(int mem_threshold, int pivot_index,
+ std::vector<int> revv,
+ std::vector<int> fwdv,
+ std::vector<int> wstv,
+ std::vector<int> rndv)
+{
+ using namespace __gnu_test;
+
+ time_counter time;
+ resource_counter resource;
+
+ set_new_limit(mem_threshold);
+
+ start_counters(time, resource);
+ std::inplace_merge(revv.begin(), revv.begin() + pivot_index, revv.end());
+ stop_counters(time, resource);
+
+ set_new_limit(~size_t(0));
+
+ report_performance(__FILE__, "reverse", time, resource);
+ clear_counters(time, resource);
+
+ set_new_limit(mem_threshold);
+
+ start_counters(time, resource);
+ std::inplace_merge(fwdv.begin(), fwdv.begin() + pivot_index, fwdv.end());
+ stop_counters(time, resource);
+
+ set_new_limit(~size_t(0));
+
+ report_performance(__FILE__, "forward", time, resource);
+ clear_counters(time, resource);
+
+ set_new_limit(mem_threshold);
+
+ start_counters(time, resource);
+ std::inplace_merge(wstv.begin(), wstv.begin() + pivot_index, wstv.end());
+ stop_counters(time, resource);
+
+ set_new_limit(~size_t(0));
+
+ report_performance(__FILE__, "worst", time, resource);
+ clear_counters(time, resource);
+
+ set_new_limit(mem_threshold);
+
+ start_counters(time, resource);
+ std::inplace_merge(rndv.begin(), rndv.begin() + pivot_index, rndv.end());
+ stop_counters(time, resource);
+
+ set_new_limit(~size_t(0));
+ report_performance(__FILE__, "random", time, resource);
+}
+
+void mem_bench(double mem_ratio,
+ const std::vector<int>& front_revv,
+ const std::vector<int>& middle_revv,
+ const std::vector<int>& back_revv,
+ const std::vector<int>& fwdv,
+ const std::vector<int>& front_wstv,
+ const std::vector<int>& middle_wstv,
+ const std::vector<int>& back_wstv,
+ const std::vector<int>& front_rndv,
+ const std::vector<int>& middle_rndv,
+ const std::vector<int>& back_rndv)
+{
+ using namespace __gnu_test;
+
+ time_counter time;
+ resource_counter resource;
+
+ int max_mem = (int)std::ceil(front_pivot_idx * mem_ratio) * sizeof(int);
+ start_counters(time, resource);
+ bench(max_mem, front_pivot_idx, front_revv, fwdv, front_wstv, front_rndv);
+ stop_counters(time, resource);
+ report_performance(__FILE__, "front pivot", time, resource);
+ clear_counters(time, resource);
+
+ max_mem = (int)std::ceil(middle_pivot_idx * mem_ratio) * sizeof(int);
+ start_counters(time, resource);
+ bench(max_mem, middle_pivot_idx, middle_revv, fwdv, middle_wstv, middle_rndv);
+ stop_counters(time, resource);
+ report_performance(__FILE__, "middle pivot", time, resource);
+ clear_counters(time, resource);
+
+ max_mem = (int)std::ceil(front_pivot_idx * mem_ratio) * sizeof(int);
+ start_counters(time, resource);
+ bench(max_mem, back_pivot_idx, back_revv, fwdv, back_wstv, back_rndv);
+ stop_counters(time, resource);
+ report_performance(__FILE__, "back pivot", time, resource);
+}
+
+void init_reverse(std::vector<int>& v, size_t pivot_index)
+{
+ int val = 0;
+ for (size_t i = pivot_index; i != v.size(); ++i)
+ v[i] = val++;
+ for (size_t i = 0; i != pivot_index; ++i)
+ v[i] = val++;
+}
+
+void init_forward(std::vector<int>& v)
+{
+ int val = 0;
+ for (size_t i = 0; i != v.size(); ++i)
+ v[i] = val++;
+}
+
+void init_worst(std::vector<int>& v, size_t pivot_index)
+{
+ int val = 0;
+ if (pivot_index > v.size() / 2)
+ {
+ for (size_t i = 0; i != pivot_index; val += 2, ++i)
+ v[i] = val;
+ val = 1;
+ }
+ else
+ {
+ for (size_t i = pivot_index; i != v.size(); val += 2, ++i)
+ v[i] = val;
+ ++val;
+ }
+
+ if (pivot_index > v.size() / 2)
+ for (size_t i = pivot_index; i != v.size(); val += 2, ++i)
+ v[i] = val;
+ else
+ for (size_t i = 0; i != pivot_index; val += 2, ++i)
+ v[i] = val;
+}
+
+void init_random(std::vector<int>& v)
+{
+ // a simple pseudo-random series which does not rely on rand() and friends
+ v[0] = 0;
+ for (size_t i = 1; i != v.size(); ++i)
+ v[i] = (v[i-1] + 110211473) * 745988807;
+}
+
+void reduce_size(std::vector<int>& front_v,
+ std::vector<int>& middle_v,
+ std::vector<int>& back_v)
+{
+ front_v.resize(small_size);
+ middle_v.erase(middle_v.begin() + small_size / 2,
+ middle_v.begin() + max_size / 2);
+ middle_v.resize(small_size);
+ back_v.erase(back_v.begin(), back_v.end() - small_size);
+}
+
+int main()
+{
+ using namespace __gnu_test;
+
+ // No constraint to build vectors.
+ set_new_limit(~size_t(0));
+
+ std::vector<int> front_revv(max_size);
+ init_reverse(front_revv, front_pivot_idx);
+
+ std::vector<int> middle_revv(max_size);
+ init_reverse(middle_revv, middle_pivot_idx);
+
+ std::vector<int> back_revv(max_size);
+ init_reverse(back_revv, back_pivot_idx);
+
+ std::vector<int> fwdv(max_size);
+ init_forward(fwdv);
+
+ std::vector<int> front_wstv(max_size);
+ init_worst(front_wstv, front_pivot_idx);
+
+ std::vector<int> middle_wstv(max_size);
+ init_worst(middle_wstv, middle_pivot_idx);
+
+ std::vector<int> back_wstv(max_size);
+ init_worst(back_wstv, back_pivot_idx);
+
+ std::vector<int> front_rndv(max_size);
+ init_random(front_rndv);
+ std::vector<int> middle_rndv(front_rndv);
+ std::vector<int> back_rndv(front_rndv);
+
+ sort(front_rndv.begin(), front_rndv.begin() + front_pivot_idx);
+ sort(front_rndv.begin() + front_pivot_idx, front_rndv.end());
+
+ sort(middle_rndv.begin(), middle_rndv.begin() + middle_pivot_idx);
+ sort(middle_rndv.begin() + middle_pivot_idx, middle_rndv.end());
+
+ sort(back_rndv.begin(), back_rndv.begin() + back_pivot_idx);
+ sort(back_rndv.begin() + back_pivot_idx, back_rndv.end());
+
+ time_counter time;
+ resource_counter resource;
+
+ start_counters(time, resource);
+
+ // No limit.
+ mem_bench(1.0,
+ front_revv, middle_revv, back_revv,
+ fwdv,
+ front_wstv, middle_wstv, back_wstv,
+ front_rndv, middle_rndv, back_rndv);
+
+ stop_counters(time, resource);
+
+ report_performance(__FILE__, "bench 1 / 1 memory", time, resource);
+ clear_counters(time, resource);
+
+ start_counters(time, resource);
+
+ // Limit to the fourth.
+ mem_bench(1.0 / 4,
+ front_revv, middle_revv, back_revv,
+ fwdv,
+ front_wstv, middle_wstv, back_wstv,
+ front_rndv, middle_rndv, back_rndv);
+
+ stop_counters(time, resource);
+
+ report_performance(__FILE__, "bench 1 / 4 memory", time, resource);
+ clear_counters(time, resource);
+
+ start_counters(time, resource);
+
+ // Really limit allocation.
+ mem_bench(1.0 / 64,
+ front_revv, middle_revv, back_revv,
+ fwdv,
+ front_wstv, middle_wstv, back_wstv,
+ front_rndv, middle_rndv, back_rndv);
+
+ stop_counters(time, resource);
+
+ report_performance(__FILE__, "bench 1 /64 memory", time, resource);
+ clear_counters(time, resource);
+
+ middle_pivot_idx = small_size / 2;
+ back_pivot_idx = small_size - front_pivot_idx;
+ reduce_size(front_revv, middle_revv, back_revv);
+ fwdv.resize(small_size);
+ reduce_size(front_wstv, middle_wstv, back_wstv);
+ reduce_size(front_rndv, middle_rndv, back_rndv);
+
+ start_counters(time, resource);
+
+ // No memory.
+ mem_bench(0.0,
+ front_revv, middle_revv, back_revv,
+ fwdv,
+ front_wstv, middle_wstv, back_wstv,
+ front_rndv, middle_rndv, back_rndv);
+
+ stop_counters(time, resource);
+
+ report_performance(__FILE__, "bench 0 / 1 memory", time, resource);
+ return 0;
+}
diff --git a/libstdc++-v3/testsuite/performance/25_algorithms/stable_sort.cc b/libstdc++-v3/testsuite/performance/25_algorithms/stable_sort.cc
index e79e0a7f6b2..fa961d6eb35 100644
--- a/libstdc++-v3/testsuite/performance/25_algorithms/stable_sort.cc
+++ b/libstdc++-v3/testsuite/performance/25_algorithms/stable_sort.cc
@@ -17,49 +17,107 @@
#include <vector>
#include <algorithm>
+
+#include <testsuite_new_operators.h>
#include <testsuite_performance.h>
-int main()
+const int max_size = 10000000;
+const int small_size = 200000;
+
+void bench(size_t mem_threshold,
+ std::vector<int> revv,
+ std::vector<int> fwdv,
+ std::vector<int> rndv)
{
using namespace __gnu_test;
time_counter time;
resource_counter resource;
- const int max_size = 10000000;
-
- std::vector<int> v(max_size);
-
- for (int i = 0; i < max_size; ++i)
- v[i] = -i;
+ set_new_limit(mem_threshold);
start_counters(time, resource);
- std::stable_sort(v.begin(), v.end());
+ std::stable_sort(revv.begin(), revv.end());
stop_counters(time, resource);
+ set_new_limit(~size_t(0));
report_performance(__FILE__, "reverse", time, resource);
clear_counters(time, resource);
- for (int i = 0; i < max_size; ++i)
- v[i] = i;
+ set_new_limit(mem_threshold);
start_counters(time, resource);
- std::stable_sort(v.begin(), v.end());
+ std::stable_sort(fwdv.begin(), fwdv.end());
stop_counters(time, resource);
+ set_new_limit(~size_t(0));
report_performance(__FILE__, "forwards", time, resource);
clear_counters(time, resource);
- // a simple psuedo-random series which does not rely on rand() and friends
- v[0] = 0;
+ start_counters(time, resource);
+ std::stable_sort(rndv.begin(), rndv.end());
+ stop_counters(time, resource);
+
+ set_new_limit(~size_t(0));
+ report_performance(__FILE__, "random", time, resource);
+}
+
+int main()
+{
+ using namespace __gnu_test;
+
+ // No memory constraint.
+ set_new_limit(~size_t(0));
+
+ std::vector<int> revv(max_size);
+ for (int i = 0; i < max_size; ++i)
+ revv[i] = -i;
+
+ std::vector<int> fwdv(max_size);
+ for (int i = 0; i < max_size; ++i)
+ fwdv[i] = i;
+
+ // a simple pseudo-random series which does not rely on rand() and friends
+ std::vector<int> rndv(max_size);
+ rndv[0] = 0;
for (int i = 1; i < max_size; ++i)
- v[i] = (v[i-1] + 110211473) * 745988807;
+ rndv[i] = (rndv[i-1] + 110211473) * 745988807;
+
+ time_counter time;
+ resource_counter resource;
start_counters(time, resource);
- std::stable_sort(v.begin(), v.end());
+ bench(~size_t(0), revv, fwdv, rndv);
stop_counters(time, resource);
- report_performance(__FILE__, "random", time, resource);
+ report_performance(__FILE__, "bench 1 / 1 memory", time, resource);
+ clear_counters(time, resource);
+
+ start_counters(time, resource);
+ // Limit to fourth the expected size of the sorted array.
+ bench(max_size * sizeof(int) / 4, revv, fwdv, rndv);
+ stop_counters(time, resource);
+
+ report_performance(__FILE__, "bench 1 / 4 memory", time, resource);
+ clear_counters(time, resource);
+
+ start_counters(time, resource);
+ // Limit to 1/64 of range size.
+ bench(max_size * sizeof(int) / 64, revv, fwdv, rndv);
+ stop_counters(time, resource);
+
+ report_performance(__FILE__, "bench 1 /64 memory", time, resource);
+ clear_counters(time, resource);
+
+ revv.resize(small_size);
+ fwdv.resize(small_size);
+ rndv.resize(small_size);
+
+ start_counters(time, resource);
+ // Forbid any allocation.
+ bench(0, revv, fwdv, rndv);
+ stop_counters(time, resource);
+ report_performance(__FILE__, "bench 0 / 1 memory", time, resource);
return 0;
}
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)#
2019-07-16 16:40 ` François Dumont
@ 2020-11-19 12:08 ` Jonathan Wakely
0 siblings, 0 replies; 12+ messages in thread
From: Jonathan Wakely @ 2020-11-19 12:08 UTC (permalink / raw)
To: François Dumont; +Cc: libstdc++, gcc-patches, chang jc
On 16/07/19 18:40 +0200, François Dumont wrote:
>Hi
>
>Â Â Â I eventually spent much more time working on the inplace_merge
>performance bench.
>
>Â Â Â And the results do not confirm the theory:
>
>Before patch:
>inplace_merge.cc           bench 1 / 1 memory          243rÂ
>227u  17s     1216mem   5pf
>inplace_merge.cc           bench 1 / 4 memory          297r 278u Â
>18s      480mem   0pf
>inplace_merge.cc           bench 1 /64 memory         373r 354u Â
>18s      480mem   0pf
>inplace_merge.cc           bench 0 / 1 memory          12r 11u   Â
>0s      480mem   0pf
>
>After the patch to reduce memory allocation:
>inplace_merge.cc           bench 1 / 1 memory          245rÂ
>227u  18s     1216mem   0pf
>inplace_merge.cc           bench 1 / 4 memory          292r 273u Â
>19s      480mem   0pf
>inplace_merge.cc           bench 1 /64 memory         373r 356u Â
>17s      480mem   0pf
>inplace_merge.cc           bench 0 / 1 memory          11r 11u   Â
>0s      480mem   0pf
>
>With the __len1 > __len2 condition change:
>inplace_merge.cc           bench 1 / 1 memory          244rÂ
>225u  20s     1216mem   0pf
>inplace_merge.cc           bench 1 / 4 memory          379r 361u Â
>17s      480mem   0pf
>inplace_merge.cc           bench 1 /64 memory          640r 625u Â
>16s      480mem   0pf
>inplace_merge.cc           bench 0 / 1 memory          11r 11u  Â
>0s      480mem   0pf
>
>When there is no memory limit the results are identical of course.
>Otherwise as soon as memory is limited performance starts to decrease
>with the condition change on __len1 vs __len2.
>
>Could you give the bench you use to demonstrate the enhancement ? I
>also wonder why your patch doesn't change consistently the same
>condition in __merge_without_buffer ?
>
>For the moment I'd like to propose the attached patch that is to say
>the reduction on the amount of allocated memory and the new/modified
>benches.
>
>Note that as soon as I forbid any memory allocation I also reduce the
>size of the range to merge cause the implementation rely on
>recursivity and so could end-up in a stack overflow. Maybe I need to
>check for simulator targets like several other tests ? Unless
>simulators do not run the performance tests ?
The performance tests are never run by default.
I don't think we should spend too much time caring about performance
of sorting close to Out Of Memory conditions. We don't try to optimize
std::vector or other cases to work when close to OOM.
So I think just reducing the memory usage is the right approach here.
>Regarding this stack overflow issue, is there some routine to find out
>how many levels of function calls can be added before reaching the
>stack overflow ? I could perhaps call __builtin_alloca and check the
>result but that smells. If I could find out this we could fallback on
>an iterative approach to complete the merge.
No, alloca is inherently unsafe. Please don't even consider it :-)
>Â Â Â PR libstdc++/83938
>Â Â Â * include/bits/stl_algo.h:
>Â Â Â (__inplace_merge): Take temporary buffer length from smallest range.
>Â Â Â (__stable_sort): Limit temporary buffer length.
>Â Â Â * include/bits/stl_tempbuf.h (get_temporary_buffer): Change __len
>Â Â Â computation in the loop.
Please use "Change __len computation in the loop to avoid truncation."
It's a bit more descriptive.
>Â Â Â * testsuite/25_algorithms/inplace_merge/1.cc (test3): Test all possible
>Â Â Â pivot positions.
>Â Â Â * testsuite/performance/25_algorithms/inplace_merge.cc: New.
>Â Â Â * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
>Â Â Â execution with different memory constraints.
>
>Ok to commit ?
>
>François
>
>On 6/9/19 4:27 PM, François Dumont wrote:
>>On 12/21/18 9:57 PM, Jonathan Wakely wrote:
>>>On 29/10/18 07:06 +0100, François Dumont wrote:
>>>>Hi
>>>>
>>>>Â Â Â Some feedback regarding this patch ?
>>>
>>>Sorry this got missed, please resubmit during stage 1.
>>>
>>>You haven't CC'd the original patch author (chang jc) to give them a
>>>chance to comment on your proposed changes to the patch.
>>>
>>>The attached PDF on PR libstdc++/83938 has extensive discussion of the
>>>performance issue, but I don't see any for your version. Some
>>>performance benchmarks for your version would be helpful.
>>
>>Here is this patch again.
>>
>>This time it is much closer to John's original one, I just kept my
>>change on the size of the temporary buffer which doesn't need to be
>>as large as it is currently allocated, especially with John's patch.
>>
>>The performance tests are showing the good improvements, attached
>>are the before/after. Surprisingly I do not see any change regarding
>>allocated memory, I guess measures are not accurate enough. However
>>working on them also show that current implementation is fragile. If
>>you reduce the allowed allocated memory too much you end up with a
>>stack overflow because of the recursive implementation.
>>
>>I already have a patch to keep on trying to allocate memory as long
>>as above a given threshold rather than 0 but I am also working on
>>making implementation non-recursive. We'll see if even a buffer of
>>size 1 is better than no buffer at all then.
>>
>>Â Â Â PR libstdc++/83938
>>Â Â Â * include/bits/stl_algo.h:
>>Â Â Â (__merge_adaptive): When buffer too small consider smallest
>>range first
>>Â Â Â and cut based on buffer size.
>>Â Â Â (__inplace_merge): Take temporary buffer length from smallest range.
>>Â Â Â (__stable_sort): Limit temporary buffer length.
>>Â Â Â * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
>>Â Â Â to reduce requested buffer length on allocation failure.
>>Â Â Â * testsuite/performance/25_algorithms/inplace_merge.cc: New.
>>Â Â Â * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
>>Â Â Â execution with different level of memory.
>>
>>François
>>
>
>diff --git a/libstdc++-v3/include/bits/stl_algo.h b/libstdc++-v3/include/bits/stl_algo.h
>index ec651e2cc45..b396437443f 100644
>--- a/libstdc++-v3/include/bits/stl_algo.h
>+++ b/libstdc++-v3/include/bits/stl_algo.h
>@@ -2530,7 +2530,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> const _DistanceType __len2 = std::distance(__middle, __last);
>
> typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> _TmpBuf;
>- _TmpBuf __buf(__first, __len1 + __len2);
>+ _TmpBuf __buf(__first, std::min(__len1, __len2));
Please add a comment before this saying something like:
// __merge_adaptive will use a buffer for the smaller of
// [first,middle) and [middle,last).
> if (__buf.begin() == 0)
> std::__merge_without_buffer
>@@ -2738,6 +2738,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> std::__merge_sort_with_buffer(__first, __middle, __buffer, __comp);
> std::__merge_sort_with_buffer(__middle, __last, __buffer, __comp);
> }
>+
> std::__merge_adaptive(__first, __middle, __last,
> _Distance(__middle - __first),
> _Distance(__last - __middle),
>@@ -5023,8 +5024,11 @@ _GLIBCXX_BEGIN_NAMESPACE_ALGO
> typedef typename iterator_traits<_RandomAccessIterator>::difference_type
> _DistanceType;
>
>+ if (__first == __last)
>+ return;
>+
> typedef _Temporary_buffer<_RandomAccessIterator, _ValueType> _TmpBuf;
>- _TmpBuf __buf(__first, std::distance(__first, __last));
>+ _TmpBuf __buf(__first, (__last - __first + 1) / 2);
Please add a comment before this saying something like:
// __stable_sort_adaptive sorts the range in two halves,
// so the buffer only needs to fit half the range at once.
That explains why (last - first + 1)/2 is correct, and if we ever
change __stable_sort_adaptive we'll know to change this too.
>
> if (__buf.begin() == 0)
> std::__inplace_stable_sort(__first, __last, __comp);
>diff --git a/libstdc++-v3/include/bits/stl_tempbuf.h b/libstdc++-v3/include/bits/stl_tempbuf.h
>index fa78ee53eb4..b6ad9ee6a46 100644
>--- a/libstdc++-v3/include/bits/stl_tempbuf.h
>+++ b/libstdc++-v3/include/bits/stl_tempbuf.h
>@@ -95,7 +95,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> std::nothrow));
> if (__tmp != 0)
> return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>- __len /= 2;
>+ __len = __len == 1 ? 0 : ((__len + 1) / 2);
> }
> return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
> }
>diff --git a/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc b/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc
>index 0b0c9c59640..9c393ce8fe3 100644
>--- a/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc
>+++ b/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc
>@@ -28,7 +28,7 @@ using std::inplace_merge;
>
> typedef test_container<int, bidirectional_iterator_wrapper> container;
>
>-void
>+void
> test1()
> {
> int array[] = { 1 };
>@@ -39,7 +39,7 @@ test1()
> inplace_merge(con2.begin(), con2.begin(), con2.end());
> }
>
>-void
>+void
> test2()
> {
> int array[] = { 0, 2, 4, 1, 3, 5 };
>@@ -60,30 +60,34 @@ struct S
> { return a < _s.a; }
> };
>
>-void
>+void
> test3()
> {
> S s[8];
>- s[0].a = 0;
>- s[1].a = 1;
>- s[2].a = 2;
>- s[3].a = 3;
>- s[4].a = 0;
>- s[5].a = 1;
>- s[6].a = 2;
>- s[7].a = 3;
>-
>- s[0].b = 0;
>- s[1].b = 1;
>- s[2].b = 2;
>- s[3].b = 3;
>- s[4].b = 4;
>- s[5].b = 5;
>- s[6].b = 6;
>- s[7].b = 7;
>-
>- inplace_merge(s, s + 4, s + 8);
>- VERIFY( s[0].b == 0 && s[1].b == 4 && s[2].b == 1 && s[3].b == 5 );
>+ for (int pivot_idx = 0; pivot_idx < 8; ++pivot_idx)
>+ {
>+ int bval = 0;
>+ for (int i = 0; i != pivot_idx; ++i)
>+ {
>+ s[i].a = i;
>+ s[i].b = bval++;
>+ }
>+
>+ for (int i = pivot_idx; i != 8; ++i)
>+ {
>+ s[i].a = i - pivot_idx;
>+ s[i].b = bval++;
>+ }
>+
>+ inplace_merge(s, s + pivot_idx, s + 8);
>+
>+ for (int i = 1; i < 8; ++i)
>+ {
>+ VERIFY( !(s[i] < s[i - 1]) );
>+ if (s[i - 1].a == s[i].a)
>+ VERIFY( s[i - 1].b < s[i].b );
>+ }
>+ }
> }
I know it's redundant, but I think I'd prefer to keep test3 unchanged
and add test4 for the more general test that uses different pivots.
OK for trunk with those changes.
Thanks for persisting with this.
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2020-11-19 12:08 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <CALnYPH-iONCpF595=MpWP=w_+4xGqU11=Br4Yh=g=wdf2Ken3A@mail.gmail.com>
2018-01-19 20:05 ` [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR c++/83938) Jason Merrill
2018-01-25 22:37 ` chang jc
2018-01-30 21:42 ` François Dumont
2018-02-02 17:38 ` François Dumont
2018-06-06 16:39 ` François Dumont
2018-07-24 10:22 ` [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938) François Dumont
2018-08-21 20:34 ` François Dumont
2018-10-29 6:07 ` François Dumont
2018-12-21 20:57 ` Jonathan Wakely
2019-06-09 14:27 ` François Dumont
2019-07-16 16:40 ` François Dumont
2020-11-19 12:08 ` [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)# Jonathan Wakely
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).