From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 111423 invoked by alias); 21 Dec 2018 20:57:39 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 111400 invoked by uid 89); 21 Dec 2018 20:57:39 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=-11.1 required=5.0 tests=BAYES_00,GIT_PATCH_2,GIT_PATCH_3,KAM_ASCII_DIVIDERS,SPF_HELO_PASS autolearn=ham version=3.3.2 spammy=extensive, importance, Speed, suffers X-HELO: mx1.redhat.com Received: from mx1.redhat.com (HELO mx1.redhat.com) (209.132.183.28) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Fri, 21 Dec 2018 20:57:36 +0000 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id CC64A396A; Fri, 21 Dec 2018 20:57:34 +0000 (UTC) Received: from localhost (unknown [10.33.36.14]) by smtp.corp.redhat.com (Postfix) with ESMTP id 6D5EF60C54; Fri, 21 Dec 2018 20:57:34 +0000 (UTC) Date: Fri, 21 Dec 2018 21:03:00 -0000 From: Jonathan Wakely To: =?iso-8859-1?Q?Fran=E7ois?= Dumont Cc: libstdc++@gcc.gnu.org, gcc-patches , chang jc Subject: Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938) Message-ID: <20181221205733.GA21151@redhat.com> References: <42f45044-ab29-c64d-e6e6-6a3ab64fc051@gmail.com> <0aef49ef-f20a-b5ac-2903-0d43c7e8226a@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1; format=flowed Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <0aef49ef-f20a-b5ac-2903-0d43c7e8226a@gmail.com> X-Clacks-Overhead: GNU Terry Pratchett User-Agent: Mutt/1.10.1 (2018-07-13) X-SW-Source: 2018-12/txt/msg01611.txt.bz2 On 29/10/18 07:06 +0100, François Dumont wrote: >Hi > >    Some feedback regarding this patch ? Sorry this got missed, please resubmit during stage 1. You haven't CC'd the original patch author (chang jc) to give them a chance to comment on your proposed changes to the patch. The attached PDF on PR libstdc++/83938 has extensive discussion of the performance issue, but I don't see any for your version. Some performance benchmarks for your version would be helpful. > >Thanks, >François > >On 8/21/18 10:34 PM, François Dumont wrote: >>I missed a test that was failing because of this patch. So I revert >>a small part of it and here is the new proposal. >> >>Tested under Linux x86_64, ok to commit ? >> >>François >> >> >>On 24/07/2018 12:22, François Dumont wrote: >>>Hi >>> >>>    Any chance to review this patch ? >>> >>>François >>> >>> >>>On 06/06/2018 18:39, François Dumont wrote: >>>>Hi >>>> >>>>    I review and rework this proposal. I noticed that the same >>>>idea to limit buffer size within inplace_merge also apply to >>>>stable_sort. >>>> >>>>    I also change the decision when buffer is too small to >>>>consider the buffer size rather than going through successive >>>>cuts of the original ranges. This way we won't cut the range >>>>more than necessary. Note that if you prefer this second part >>>>could be committed seperatly. >>>> >>>>    PR libstdc++/83938 >>>>    * include/bits/stl_algo.h: >>>>    (__stable_partition_adaptive): When buffer to small cut range using >>>>    buffer size. >>>>    (__inplace_merge): Take temporary buffer length from >>>>smallest range. >>>>    (__merge_adaptive): When buffer too small consider smallest >>>>range first >>>>    and cut based on buffer size. >>>>    (__stable_sort_adaptive): When buffer too small cut based on buffer >>>>    size. >>>>    (__stable_sort): Limit temporary buffer length. >>>>    * include/bits/stl_tempbuf.h (get_temporary_buffer): Change >>>>function >>>>    to reduce requested buffer length on allocation failure. >>>> >>>>Tested under Linux x86_64. >>>> >>>>Ok to commit ? >>>> >>>>François >>>> >>>> >>>>On 25/01/2018 23:37, chang jc wrote: >>>>>Hi: >>>>> >>>>>1. The __len = (__len + 1) / 2; is as you suggested, need to modify as >>>>>__len = (__len == 1) ? 0 : ((__len + 1) / 2); >>>>> >>>>>2. The coding gain has been shown  PR c++/83938; I re-post here >>>>> >>>>> >>>>> >>>>> >>>>>   21 >>>>>   20 >>>>>   19 >>>>>   18 >>>>>   17 >>>>>   16 >>>>> >>>>> >>>>>   0.471136 >>>>>   0.625695 >>>>>   0.767262 >>>>>   0.907461 >>>>>   1.04838 >>>>>   1.19508 >>>>> >>>>> >>>>>   0.340845 >>>>>   0.48651 >>>>>   0.639139 >>>>>   0.770133 >>>>>   0.898454 >>>>>   1.04632 >>>>> >>>>>it means when Merge [0, 4325376, 16777216); A is a sorted integer with >>>>>4325376 & B with 12451840 elements, total with 16M integers >>>>> >>>>>The proposed method has the speed up under given buffer size =, ex >>>>>2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means >>>>>given sizof(int)*64K bytes. >>>>> >>>>>3. As your suggestion, _TmpBuf __buf should be rewrite. >>>>> >>>>>4. It represents a fact that the intuitive idea to split from larger >>>>>part is wrong. >>>>> >>>>>For example, if you have an input sorted array A & B, A has 8 integers >>>>>& B has 24 integers. Given tmp buffer whose capacity = 4 integers. >>>>> >>>>>Current it tries to split from B, right? >>>>> >>>>>Then we have: >>>>> >>>>>A1 | A2  B1 | B2 >>>>> >>>>>B1 & B2 has 12 integers each, right? >>>>> >>>>>Current algorithm selects pivot as 13th integer from B, right? >>>>> >>>>>If the corresponding upper bound of A is 6th integer. >>>>> >>>>>Then it split in >>>>> >>>>>A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12 >>>>> >>>>>After rotate, we have two arrays to merge >>>>> >>>>>[A1 = 5 | B1 = 12]  & [A2 = 3 | B2 = 12] >>>>> >>>>>Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge. >>>>> >>>>>Sadly, [A1 = 5 | B1 = 12] CANNOT. >>>>> >>>>>So we do rotate again, split & merge the two split arrays from [A1 = 5 >>>>>| B1 = 12] again. >>>>> >>>>> >>>>>But wait, if we always split from the smaller one instead of >>>>>larger one. >>>>> >>>>>After rotate, it promises two split arrays both contain >>>>>ceiling[small/2]. >>>>> >>>>>Since tmp buffer also allocate by starting from sizeof(small) & >>>>>recursively downgrade by ceiling[small/2^(# of fail allocate)]. >>>>> >>>>>It means the allocated tmp buffer promises to be sufficient at the >>>>>level of (# of fail allocate). >>>>> >>>>>Instead, you can see if split from large at level (# of fail allocate) >>>>>several split array still CANNOT use  tmp buffer to do buffered merge. >>>>> >>>>> >>>>>As you know, buffered merge is far speed then (split, rotate, and >>>>>merge two sub-arrays) (PR c++/83938 gives the profiling figures), >>>>> >>>>>the way should provide speedup. >>>>> >>>>> >>>>>Thanks. >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>>On 24/01/2018 18:23, François Dumont wrote: >>>>> >>>>>Hi >>>>> >>>>> >>>>>     It sounds like a very sensitive change to make but >>>>>nothing worth figures. >>>>>Do you have any bench showing the importance of the gain ? >>>>> >>>>>     At least the memory usage optimization is obvious. >>>>> >>>>>On 19/01/2018 10:43, chang jc wrote: >>>>> >>>>>Current std::inplace_merge() suffers from performance issue by >>>>>inefficient >>>>> >>>>>logic under limited memory, >>>>> >>>>>It leads to performance downgrade. >>>>> >>>>>Please help to review it. >>>>> >>>>>Index: include/bits/stl_algo.h >>>>>=================================================================== >>>>>--- include/bits/stl_algo.h    (revision 256871) >>>>>+++ include/bits/stl_algo.h    (working copy) >>>>>@@ -2437,7 +2437,7 @@ >>>>>         _BidirectionalIterator __second_cut = __middle; >>>>>         _Distance __len11 = 0; >>>>>         _Distance __len22 = 0; >>>>>-      if (__len1 > __len2) >>>>>+      if (__len1 < __len2) >>>>>           { >>>>>             __len11 = __len1 / 2; >>>>>             std::advance(__first_cut, __len11); >>>>>@@ -2539,9 +2539,15 @@ >>>>>         const _DistanceType __len1 = std::distance(__first, >>>>>__middle); >>>>>         const _DistanceType __len2 = std::distance(__middle, __last); >>>>> >>>>>+ >>>>> >>>>>         typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> >>>>>_TmpBuf; >>>>> >>>>>-      _TmpBuf __buf(__first, __last); >>>>>- >>>>>+      _BidirectionalIterator __start, __end; >>>>>+      if (__len1 < __len2) { >>>>>+    __start = __first; __end = __middle; >>>>>+      } else { >>>>>+    __start = __middle; __end = __last; >>>>>+      } >>>>>+      _TmpBuf __buf(__start, ___end); >>>>> >>>>>Note another optimization, we could introduce a >>>>>_Temporary_buffer<> constructor >>>>>in order to write: >>>>> >>>>>_TmpBuf __buf(std::min(__len1, __len2), __first); >>>>> >>>>>So that std::distance do not need to be called again. >>>>> >>>>>         if (__buf.begin() == 0) >>>>>       std::__merge_without_buffer >>>>>         (__first, __middle, __last, __len1, __len2, __comp); >>>>>Index: include/bits/stl_tempbuf.h >>>>>=================================================================== >>>>>--- include/bits/stl_tempbuf.h    (revision 256871) >>>>>+++ include/bits/stl_tempbuf.h    (working copy) >>>>>@@ -95,7 +95,7 @@ >>>>>                               std::nothrow)); >>>>>         if (__tmp != 0) >>>>>           return std::pair<_Tp*, ptrdiff_t>(__tmp, __len); >>>>>-      __len /= 2; >>>>>+      __len = (__len + 1) / 2; >>>>> >>>>>This part is problematic, if __len is 1 and allocation fails >>>>>it will loop >>>>>forever. >>>>> >>>>>It doesn't seem really necessary for your patch. >>>>> >>>>> >>>>>2018-01-20 4:05 GMT+08:00 Jason Merrill: >>>>> >>>>>>This is a libstdc++ bug and patch, not the C++ front end.  So I'm >>>>>>adding the libstdc++ list to CC. >>>>>> >>>>>>On Fri, Jan 19, 2018 at 3:02 AM, chang >>>>>>jc  wrote: >>>>>>>Current std::inplace_merge() suffers from performance issue by >>>>>>inefficient >>>>>>>logic under limited memory, >>>>>>> >>>>>>>It leads to performance downgrade. >>>>>>> >>>>>>>Please help to review it. >>>>>>> >>>>>>>Index: include/bits/stl_algo.h >>>>>>>=================================================================== >>>>>>>--- include/bits/stl_algo.h     (revision 256871) >>>>>>>+++ include/bits/stl_algo.h     (working copy) >>>>>>>@@ -2437,7 +2437,7 @@ >>>>>>>           _BidirectionalIterator __second_cut = __middle; >>>>>>>           _Distance __len11 = 0; >>>>>>>           _Distance __len22 = 0; >>>>>>>-         if (__len1 > __len2) >>>>>>>+         if (__len1 < __len2) >>>>>>>             { >>>>>>>               __len11 = __len1 / 2; >>>>>>>               std::advance(__first_cut, __len11); >>>>>>>@@ -2539,9 +2539,15 @@ >>>>>>>        const _DistanceType __len1 = >>>>>>>std::distance(__first, __middle); >>>>>>>        const _DistanceType __len2 = >>>>>>>std::distance(__middle, __last); >>>>>>> >>>>>>>+ >>>>>>>        typedef _Temporary_buffer<_BidirectionalIterator, >>>>>>>_ValueType> >>>>>>_TmpBuf; >>>>>>>-      _TmpBuf __buf(__first, __last); >>>>>>>- >>>>>>>+      _BidirectionalIterator __start, __end; >>>>>>>+      if (__len1 < __len2) { >>>>>>>+       __start = __first; __end = __middle; >>>>>>>+      } else { >>>>>>>+       __start = __middle; __end = __last; >>>>>>>+      } >>>>>>>+      _TmpBuf __buf(__start, ___end); >>>>>>>        if (__buf.begin() == 0) >>>>>>>         std::__merge_without_buffer >>>>>>>           (__first, __middle, __last, __len1, __len2, __comp); >>>>>>>Index: include/bits/stl_tempbuf.h >>>>>>>=================================================================== >>>>>>>--- include/bits/stl_tempbuf.h  (revision 256871) >>>>>>>+++ include/bits/stl_tempbuf.h  (working copy) >>>>>>>@@ -95,7 +95,7 @@ >>>>>>>std::nothrow)); >>>>>>>           if (__tmp != 0) >>>>>>>             return std::pair<_Tp*, ptrdiff_t>(__tmp, __len); >>>>>>>-         __len /= 2; >>>>>>>+         __len = (__len + 1) / 2; >>>>>>>         } >>>>>>>        return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0); >>>>>>>      } >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>Thanks >>>> >>>> >>> >> >