I missed a test that was failing because of this patch. So I revert a small part of it and here is the new proposal. Tested under Linux x86_64, ok to commit ? François On 24/07/2018 12:22, François Dumont wrote: > Hi > >     Any chance to review this patch ? > > François > > > On 06/06/2018 18:39, François Dumont wrote: >> Hi >> >>     I review and rework this proposal. I noticed that the same idea >> to limit buffer size within inplace_merge also apply to stable_sort. >> >>     I also change the decision when buffer is too small to consider >> the buffer size rather than going through successive cuts of the >> original ranges. This way we won't cut the range more than necessary. >> Note that if you prefer this second part could be committed seperatly. >> >>     PR libstdc++/83938 >>     * include/bits/stl_algo.h: >>     (__stable_partition_adaptive): When buffer to small cut range using >>     buffer size. >>     (__inplace_merge): Take temporary buffer length from smallest range. >>     (__merge_adaptive): When buffer too small consider smallest range >> first >>     and cut based on buffer size. >>     (__stable_sort_adaptive): When buffer too small cut based on buffer >>     size. >>     (__stable_sort): Limit temporary buffer length. >>     * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function >>     to reduce requested buffer length on allocation failure. >> >> Tested under Linux x86_64. >> >> Ok to commit ? >> >> François >> >> >> On 25/01/2018 23:37, chang jc wrote: >>> Hi: >>> >>> 1. The __len = (__len + 1) / 2; is as you suggested, need to modify as >>> __len = (__len == 1) ? 0 : ((__len + 1) / 2); >>> >>> 2. The coding gain has been shown  PR c++/83938; I re-post here >>> >>> >>> >>> >>>    21 >>>    20 >>>    19 >>>    18 >>>    17 >>>    16 >>> >>> >>>    0.471136 >>>    0.625695 >>>    0.767262 >>>    0.907461 >>>    1.04838 >>>    1.19508 >>> >>> >>>    0.340845 >>>    0.48651 >>>    0.639139 >>>    0.770133 >>>    0.898454 >>>    1.04632 >>> >>> it means when Merge [0, 4325376, 16777216); A is a sorted integer with >>> 4325376 & B with 12451840 elements, total with 16M integers >>> >>> The proposed method has the speed up under given buffer size =, ex >>> 2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means >>> given sizof(int)*64K bytes. >>> >>> 3. As your suggestion, _TmpBuf __buf should be rewrite. >>> >>> 4. It represents a fact that the intuitive idea to split from larger >>> part is wrong. >>> >>> For example, if you have an input sorted array A & B, A has 8 integers >>> & B has 24 integers. Given tmp buffer whose capacity = 4 integers. >>> >>> Current it tries to split from B, right? >>> >>> Then we have: >>> >>> A1 | A2  B1 | B2 >>> >>> B1 & B2 has 12 integers each, right? >>> >>> Current algorithm selects pivot as 13th integer from B, right? >>> >>> If the corresponding upper bound of A is 6th integer. >>> >>> Then it split in >>> >>> A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12 >>> >>> After rotate, we have two arrays to merge >>> >>> [A1 = 5 | B1 = 12]  & [A2 = 3 | B2 = 12] >>> >>> Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge. >>> >>> Sadly, [A1 = 5 | B1 = 12] CANNOT. >>> >>> So we do rotate again, split & merge the two split arrays from [A1 = 5 >>> | B1 = 12] again. >>> >>> >>> But wait, if we always split from the smaller one instead of larger >>> one. >>> >>> After rotate, it promises two split arrays both contain >>> ceiling[small/2]. >>> >>> Since tmp buffer also allocate by starting from sizeof(small) & >>> recursively downgrade by ceiling[small/2^(# of fail allocate)]. >>> >>> It means the allocated tmp buffer promises to be sufficient at the >>> level of (# of fail allocate). >>> >>> Instead, you can see if split from large at level (# of fail allocate) >>> several split array still CANNOT use  tmp buffer to do buffered merge. >>> >>> >>> As you know, buffered merge is far speed then (split, rotate, and >>> merge two sub-arrays) (PR c++/83938 gives the profiling figures), >>> >>> the way should provide speedup. >>> >>> >>> Thanks. >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> On 24/01/2018 18:23, François Dumont wrote: >>> >>> Hi >>> >>> >>>      It sounds like a very sensitive change to make but nothing >>> worth figures. >>> Do you have any bench showing the importance of the gain ? >>> >>>      At least the memory usage optimization is obvious. >>> >>> On 19/01/2018 10:43, chang jc wrote: >>> >>> Current std::inplace_merge() suffers from performance issue by >>> inefficient >>> >>> logic under limited memory, >>> >>> It leads to performance downgrade. >>> >>> Please help to review it. >>> >>> Index: include/bits/stl_algo.h >>> =================================================================== >>> --- include/bits/stl_algo.h    (revision 256871) >>> +++ include/bits/stl_algo.h    (working copy) >>> @@ -2437,7 +2437,7 @@ >>>          _BidirectionalIterator __second_cut = __middle; >>>          _Distance __len11 = 0; >>>          _Distance __len22 = 0; >>> -      if (__len1 > __len2) >>> +      if (__len1 < __len2) >>>            { >>>              __len11 = __len1 / 2; >>>              std::advance(__first_cut, __len11); >>> @@ -2539,9 +2539,15 @@ >>>          const _DistanceType __len1 = std::distance(__first, __middle); >>>          const _DistanceType __len2 = std::distance(__middle, __last); >>> >>> + >>> >>>          typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> >>> _TmpBuf; >>> >>> -      _TmpBuf __buf(__first, __last); >>> - >>> +      _BidirectionalIterator __start, __end; >>> +      if (__len1 < __len2) { >>> +    __start = __first; __end = __middle; >>> +      } else { >>> +    __start = __middle; __end = __last; >>> +      } >>> +      _TmpBuf __buf(__start, ___end); >>> >>> Note another optimization, we could introduce a _Temporary_buffer<> >>> constructor >>> in order to write: >>> >>> _TmpBuf __buf(std::min(__len1, __len2), __first); >>> >>> So that std::distance do not need to be called again. >>> >>>          if (__buf.begin() == 0) >>>        std::__merge_without_buffer >>>          (__first, __middle, __last, __len1, __len2, __comp); >>> Index: include/bits/stl_tempbuf.h >>> =================================================================== >>> --- include/bits/stl_tempbuf.h    (revision 256871) >>> +++ include/bits/stl_tempbuf.h    (working copy) >>> @@ -95,7 +95,7 @@ >>>                                std::nothrow)); >>>          if (__tmp != 0) >>>            return std::pair<_Tp*, ptrdiff_t>(__tmp, __len); >>> -      __len /= 2; >>> +      __len = (__len + 1) / 2; >>> >>> This part is problematic, if __len is 1 and allocation fails it will >>> loop >>> forever. >>> >>> It doesn't seem really necessary for your patch. >>> >>> >>> 2018-01-20 4:05 GMT+08:00 Jason Merrill: >>> >>>> This is a libstdc++ bug and patch, not the C++ front end.  So I'm >>>> adding the libstdc++ list to CC. >>>> >>>> On Fri, Jan 19, 2018 at 3:02 AM, chang jc  wrote: >>>>> Current std::inplace_merge() suffers from performance issue by >>>> inefficient >>>>> logic under limited memory, >>>>> >>>>> It leads to performance downgrade. >>>>> >>>>> Please help to review it. >>>>> >>>>> Index: include/bits/stl_algo.h >>>>> =================================================================== >>>>> --- include/bits/stl_algo.h     (revision 256871) >>>>> +++ include/bits/stl_algo.h     (working copy) >>>>> @@ -2437,7 +2437,7 @@ >>>>>            _BidirectionalIterator __second_cut = __middle; >>>>>            _Distance __len11 = 0; >>>>>            _Distance __len22 = 0; >>>>> -         if (__len1 > __len2) >>>>> +         if (__len1 < __len2) >>>>>              { >>>>>                __len11 = __len1 / 2; >>>>>                std::advance(__first_cut, __len11); >>>>> @@ -2539,9 +2539,15 @@ >>>>>         const _DistanceType __len1 = std::distance(__first, >>>>> __middle); >>>>>         const _DistanceType __len2 = std::distance(__middle, __last); >>>>> >>>>> + >>>>>         typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> >>>> _TmpBuf; >>>>> -      _TmpBuf __buf(__first, __last); >>>>> - >>>>> +      _BidirectionalIterator __start, __end; >>>>> +      if (__len1 < __len2) { >>>>> +       __start = __first; __end = __middle; >>>>> +      } else { >>>>> +       __start = __middle; __end = __last; >>>>> +      } >>>>> +      _TmpBuf __buf(__start, ___end); >>>>>         if (__buf.begin() == 0) >>>>>          std::__merge_without_buffer >>>>>            (__first, __middle, __last, __len1, __len2, __comp); >>>>> Index: include/bits/stl_tempbuf.h >>>>> =================================================================== >>>>> --- include/bits/stl_tempbuf.h  (revision 256871) >>>>> +++ include/bits/stl_tempbuf.h  (working copy) >>>>> @@ -95,7 +95,7 @@ >>>>> std::nothrow)); >>>>>            if (__tmp != 0) >>>>>              return std::pair<_Tp*, ptrdiff_t>(__tmp, __len); >>>>> -         __len /= 2; >>>>> +         __len = (__len + 1) / 2; >>>>>          } >>>>>         return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0); >>>>>       } >>>>> >>>>> >>>>> >>>>> >>>>> Thanks >> >> >