From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-493037-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 111423 invoked by alias); 21 Dec 2018 20:57:39 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 111400 invoked by uid 89); 21 Dec 2018 20:57:39 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-11.1 required=5.0 tests=BAYES_00,GIT_PATCH_2,GIT_PATCH_3,KAM_ASCII_DIVIDERS,SPF_HELO_PASS autolearn=ham version=3.3.2 spammy=extensive, importance, Speed, suffers
X-HELO: mx1.redhat.com
Received: from mx1.redhat.com (HELO mx1.redhat.com) (209.132.183.28) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Fri, 21 Dec 2018 20:57:36 +0000
Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12])	(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))	(No client certificate requested)	by mx1.redhat.com (Postfix) with ESMTPS id CC64A396A;	Fri, 21 Dec 2018 20:57:34 +0000 (UTC)
Received: from localhost (unknown [10.33.36.14])	by smtp.corp.redhat.com (Postfix) with ESMTP id 6D5EF60C54;	Fri, 21 Dec 2018 20:57:34 +0000 (UTC)
Date: Fri, 21 Dec 2018 21:03:00 -0000
From: Jonathan Wakely <jwakely@redhat.com>
To: =?iso-8859-1?Q?Fran=E7ois?= Dumont <frs.dumont@gmail.com>
Cc: libstdc++@gcc.gnu.org, gcc-patches <gcc-patches@gcc.gnu.org>,	chang jc <r97922153@gmail.com>
Subject: Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)
Message-ID: <20181221205733.GA21151@redhat.com>
References: <CALnYPH-iONCpF595=MpWP=w_+4xGqU11=Br4Yh=g=wdf2Ken3A@mail.gmail.com> <CADzB+2kDq7A_h_+YdgLHQsqO+rhZabndHzxLmqQ5+bf3jHobxw@mail.gmail.com> <CALnYPH__r8ragj4j0nAY2DKOUFZafYoEEbsjJmMp0Cw-3x=-2w@mail.gmail.com> <ac9ee48a-8252-a750-5a27-9414463e375b@gmail.com> <fe4b246f-c06f-40b8-5c7d-34e89037cfb9@gmail.com> <42f45044-ab29-c64d-e6e6-6a3ab64fc051@gmail.com> <0aef49ef-f20a-b5ac-2903-0d43c7e8226a@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1; format=flowed
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <0aef49ef-f20a-b5ac-2903-0d43c7e8226a@gmail.com>
X-Clacks-Overhead: GNU Terry Pratchett
User-Agent: Mutt/1.10.1 (2018-07-13)
X-SW-Source: 2018-12/txt/msg01611.txt.bz2

On 29/10/18 07:06 +0100, François Dumont wrote:
>Hi
>
>    Some feedback regarding this patch ?

Sorry this got missed, please resubmit during stage 1.

You haven't CC'd the original patch author (chang jc) to give them a
chance to comment on your proposed changes to the patch.

The attached PDF on PR libstdc++/83938 has extensive discussion of the
performance issue, but I don't see any for your version. Some
performance benchmarks for your version would be helpful.


>
>Thanks,
>François
>
>On 8/21/18 10:34 PM, François Dumont wrote:
>>I missed a test that was failing because of this patch. So I revert 
>>a small part of it and here is the new proposal.
>>
>>Tested under Linux x86_64, ok to commit ?
>>
>>François
>>
>>
>>On 24/07/2018 12:22, François Dumont wrote:
>>>Hi
>>>
>>>    Any chance to review this patch ?
>>>
>>>François
>>>
>>>
>>>On 06/06/2018 18:39, François Dumont wrote:
>>>>Hi
>>>>
>>>>    I review and rework this proposal. I noticed that the same 
>>>>idea to limit buffer size within inplace_merge also apply to 
>>>>stable_sort.
>>>>
>>>>    I also change the decision when buffer is too small to 
>>>>consider the buffer size rather than going through successive 
>>>>cuts of the original ranges. This way we won't cut the range 
>>>>more than necessary. Note that if you prefer this second part 
>>>>could be committed seperatly.
>>>>
>>>>    PR libstdc++/83938
>>>>    * include/bits/stl_algo.h:
>>>>    (__stable_partition_adaptive): When buffer to small cut range using
>>>>    buffer size.
>>>>    (__inplace_merge): Take temporary buffer length from 
>>>>smallest range.
>>>>    (__merge_adaptive): When buffer too small consider smallest 
>>>>range first
>>>>    and cut based on buffer size.
>>>>    (__stable_sort_adaptive): When buffer too small cut based on buffer
>>>>    size.
>>>>    (__stable_sort): Limit temporary buffer length.
>>>>    * include/bits/stl_tempbuf.h (get_temporary_buffer): Change 
>>>>function
>>>>    to reduce requested buffer length on allocation failure.
>>>>
>>>>Tested under Linux x86_64.
>>>>
>>>>Ok to commit ?
>>>>
>>>>François
>>>>
>>>>
>>>>On 25/01/2018 23:37, chang jc wrote:
>>>>>Hi:
>>>>>
>>>>>1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
>>>>>__len = (__len == 1) ? 0 : ((__len + 1) / 2);
>>>>>
>>>>>2. The coding gain has been shown  PR c++/83938; I re-post here
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>   21
>>>>>   20
>>>>>   19
>>>>>   18
>>>>>   17
>>>>>   16
>>>>>
>>>>>
>>>>>   0.471136
>>>>>   0.625695
>>>>>   0.767262
>>>>>   0.907461
>>>>>   1.04838
>>>>>   1.19508
>>>>>
>>>>>
>>>>>   0.340845
>>>>>   0.48651
>>>>>   0.639139
>>>>>   0.770133
>>>>>   0.898454
>>>>>   1.04632
>>>>>
>>>>>it means when Merge [0, 4325376, 16777216); A is a sorted integer with
>>>>>4325376 & B with 12451840 elements, total with 16M integers
>>>>>
>>>>>The proposed method has the speed up under given buffer size =, ex
>>>>>2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
>>>>>given sizof(int)*64K bytes.
>>>>>
>>>>>3. As your suggestion, _TmpBuf __buf should be rewrite.
>>>>>
>>>>>4. It represents a fact that the intuitive idea to split from larger
>>>>>part is wrong.
>>>>>
>>>>>For example, if you have an input sorted array A & B, A has 8 integers
>>>>>& B has 24 integers. Given tmp buffer whose capacity = 4 integers.
>>>>>
>>>>>Current it tries to split from B, right?
>>>>>
>>>>>Then we have:
>>>>>
>>>>>A1 | A2  B1 | B2
>>>>>
>>>>>B1 & B2 has 12 integers each, right?
>>>>>
>>>>>Current algorithm selects pivot as 13th integer from B, right?
>>>>>
>>>>>If the corresponding upper bound of A is 6th integer.
>>>>>
>>>>>Then it split in
>>>>>
>>>>>A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12
>>>>>
>>>>>After rotate, we have two arrays to merge
>>>>>
>>>>>[A1 = 5 | B1 = 12]  & [A2 = 3 | B2 = 12]
>>>>>
>>>>>Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.
>>>>>
>>>>>Sadly, [A1 = 5 | B1 = 12] CANNOT.
>>>>>
>>>>>So we do rotate again, split & merge the two split arrays from [A1 = 5
>>>>>| B1 = 12] again.
>>>>>
>>>>>
>>>>>But wait, if we always split from the smaller one instead of 
>>>>>larger one.
>>>>>
>>>>>After rotate, it promises two split arrays both contain 
>>>>>ceiling[small/2].
>>>>>
>>>>>Since tmp buffer also allocate by starting from sizeof(small) &
>>>>>recursively downgrade by ceiling[small/2^(# of fail allocate)].
>>>>>
>>>>>It means the allocated tmp buffer promises to be sufficient at the
>>>>>level of (# of fail allocate).
>>>>>
>>>>>Instead, you can see if split from large at level (# of fail allocate)
>>>>>several split array still CANNOT use  tmp buffer to do buffered merge.
>>>>>
>>>>>
>>>>>As you know, buffered merge is far speed then (split, rotate, and
>>>>>merge two sub-arrays) (PR c++/83938 gives the profiling figures),
>>>>>
>>>>>the way should provide speedup.
>>>>>
>>>>>
>>>>>Thanks.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>On 24/01/2018 18:23, François Dumont wrote:
>>>>>
>>>>>Hi
>>>>>
>>>>>
>>>>>     It sounds like a very sensitive change to make but 
>>>>>nothing worth figures.
>>>>>Do you have any bench showing the importance of the gain ?
>>>>>
>>>>>     At least the memory usage optimization is obvious.
>>>>>
>>>>>On 19/01/2018 10:43, chang jc wrote:
>>>>>
>>>>>Current std::inplace_merge() suffers from performance issue by 
>>>>>inefficient
>>>>>
>>>>>logic under limited memory,
>>>>>
>>>>>It leads to performance downgrade.
>>>>>
>>>>>Please help to review it.
>>>>>
>>>>>Index: include/bits/stl_algo.h
>>>>>===================================================================
>>>>>--- include/bits/stl_algo.h    (revision 256871)
>>>>>+++ include/bits/stl_algo.h    (working copy)
>>>>>@@ -2437,7 +2437,7 @@
>>>>>         _BidirectionalIterator __second_cut = __middle;
>>>>>         _Distance __len11 = 0;
>>>>>         _Distance __len22 = 0;
>>>>>-      if (__len1 > __len2)
>>>>>+      if (__len1 < __len2)
>>>>>           {
>>>>>             __len11 = __len1 / 2;
>>>>>             std::advance(__first_cut, __len11);
>>>>>@@ -2539,9 +2539,15 @@
>>>>>         const _DistanceType __len1 = std::distance(__first, 
>>>>>__middle);
>>>>>         const _DistanceType __len2 = std::distance(__middle, __last);
>>>>>
>>>>>+
>>>>>
>>>>>         typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
>>>>>_TmpBuf;
>>>>>
>>>>>-      _TmpBuf __buf(__first, __last);
>>>>>-
>>>>>+      _BidirectionalIterator __start, __end;
>>>>>+      if (__len1 < __len2) {
>>>>>+    __start = __first; __end = __middle;
>>>>>+      } else {
>>>>>+    __start = __middle; __end = __last;
>>>>>+      }
>>>>>+      _TmpBuf __buf(__start, ___end);
>>>>>
>>>>>Note another optimization, we could introduce a 
>>>>>_Temporary_buffer<> constructor
>>>>>in order to write:
>>>>>
>>>>>_TmpBuf __buf(std::min(__len1, __len2), __first);
>>>>>
>>>>>So that std::distance do not need to be called again.
>>>>>
>>>>>         if (__buf.begin() == 0)
>>>>>       std::__merge_without_buffer
>>>>>         (__first, __middle, __last, __len1, __len2, __comp);
>>>>>Index: include/bits/stl_tempbuf.h
>>>>>===================================================================
>>>>>--- include/bits/stl_tempbuf.h    (revision 256871)
>>>>>+++ include/bits/stl_tempbuf.h    (working copy)
>>>>>@@ -95,7 +95,7 @@
>>>>>                               std::nothrow));
>>>>>         if (__tmp != 0)
>>>>>           return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>>>>-      __len /= 2;
>>>>>+      __len = (__len + 1) / 2;
>>>>>
>>>>>This part is problematic, if __len is 1 and allocation fails 
>>>>>it will loop
>>>>>forever.
>>>>>
>>>>>It doesn't seem really necessary for your patch.
>>>>>
>>>>>
>>>>>2018-01-20 4:05 GMT+08:00 Jason Merrill<jason@redhat.com>:
>>>>>
>>>>>>This is a libstdc++ bug and patch, not the C++ front end.  So I'm
>>>>>>adding the libstdc++ list to CC.
>>>>>>
>>>>>>On Fri, Jan 19, 2018 at 3:02 AM, chang 
>>>>>>jc<r97922153@gmail.com>  wrote:
>>>>>>>Current std::inplace_merge() suffers from performance issue by
>>>>>>inefficient
>>>>>>>logic under limited memory,
>>>>>>>
>>>>>>>It leads to performance downgrade.
>>>>>>>
>>>>>>>Please help to review it.
>>>>>>>
>>>>>>>Index: include/bits/stl_algo.h
>>>>>>>===================================================================
>>>>>>>--- include/bits/stl_algo.h     (revision 256871)
>>>>>>>+++ include/bits/stl_algo.h     (working copy)
>>>>>>>@@ -2437,7 +2437,7 @@
>>>>>>>           _BidirectionalIterator __second_cut = __middle;
>>>>>>>           _Distance __len11 = 0;
>>>>>>>           _Distance __len22 = 0;
>>>>>>>-         if (__len1 > __len2)
>>>>>>>+         if (__len1 < __len2)
>>>>>>>             {
>>>>>>>               __len11 = __len1 / 2;
>>>>>>>               std::advance(__first_cut, __len11);
>>>>>>>@@ -2539,9 +2539,15 @@
>>>>>>>        const _DistanceType __len1 = 
>>>>>>>std::distance(__first, __middle);
>>>>>>>        const _DistanceType __len2 = 
>>>>>>>std::distance(__middle, __last);
>>>>>>>
>>>>>>>+
>>>>>>>        typedef _Temporary_buffer<_BidirectionalIterator, 
>>>>>>>_ValueType>
>>>>>>_TmpBuf;
>>>>>>>-      _TmpBuf __buf(__first, __last);
>>>>>>>-
>>>>>>>+      _BidirectionalIterator __start, __end;
>>>>>>>+      if (__len1 < __len2) {
>>>>>>>+       __start = __first; __end = __middle;
>>>>>>>+      } else {
>>>>>>>+       __start = __middle; __end = __last;
>>>>>>>+      }
>>>>>>>+      _TmpBuf __buf(__start, ___end);
>>>>>>>        if (__buf.begin() == 0)
>>>>>>>         std::__merge_without_buffer
>>>>>>>           (__first, __middle, __last, __len1, __len2, __comp);
>>>>>>>Index: include/bits/stl_tempbuf.h
>>>>>>>===================================================================
>>>>>>>--- include/bits/stl_tempbuf.h  (revision 256871)
>>>>>>>+++ include/bits/stl_tempbuf.h  (working copy)
>>>>>>>@@ -95,7 +95,7 @@
>>>>>>>std::nothrow));
>>>>>>>           if (__tmp != 0)
>>>>>>>             return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>>>>>>-         __len /= 2;
>>>>>>>+         __len = (__len + 1) / 2;
>>>>>>>         }
>>>>>>>        return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
>>>>>>>      }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>Thanks
>>>>
>>>>
>>>
>>
>