From: Mark Geisert
To: cygwin-developers@cygwin.com
Subject: Cygwin malloc tune-up status
Date: Thu, 24 Sep 2020 23:01:09 -0700

Hi folks,

I've been looking into two potential enhancements to Cygwin's malloc operation for Cygwin 3.2. The first is to replace the existing function-level locking with something more fine-grained to help threaded processes; the second is to implement thread-specific memory pools with the aim of lessening lock activity even further.

Although I've investigated several alternative malloc packages, including ptmalloc[23], nedalloc, and Windows Heaps, only the last seems to improve on the performance of Cygwin's malloc. Unfortunately, using Windows Heaps would require fiddling with undocumented heap structures to enable use with fork(). I also looked at BSD's jemalloc and Google's tcmalloc. Those two would require much more work to port to Cygwin, so I've back-burnered them for the time being.

I decided to concentrate on Cygwin's malloc, which is actually the most recent version of Doug Lea's dlmalloc package. Both of the desired enhancements can largely be achieved by changing a couple of #defines in Cygwin's malloc_wrapper.cc (a rough sketch of the kind of knobs involved appears just before the table).

The following table shows some results of my investigation into various tunings of Cygwin's malloc implementation. Each column below is a form of the existing dlmalloc-based code, but making use of certain #defines to tune the allocator's behavior. The "legacy" implementation (first data column) is the one used up to Cygwin 3.1.x. See the NOTES at the end of this message for table details.

One thing that stands out in the profiling data is that Windows overhead is on the order of 90% of total profiling counts (on a malloc torture test program). So changes made within the Cygwin DLL are unlikely to speed up Cygwin's malloc unless they lead to less reliance on Windows calls.
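For the curious, the knobs involved look roughly like the following. This is only an illustrative sketch using standard dlmalloc configuration macros, not a copy of what malloc_wrapper.cc actually ends up doing; MSPACE_SIZE is just a stand-in for the 64K/512K/8M settings tested below.

    /* Illustrative sketch only -- standard dlmalloc configuration macros,
       not the literal contents of malloc_wrapper.cc.  */

    /* Enhancement 1: let dlmalloc manage its own internal locking instead
       of wrapping every malloc/free entry point in one function-level lock.  */
    #define USE_LOCKS 1

    /* Enhancement 2: build with independent allocation regions ("mspaces")
       so each thread can be given its own pool.  */
    #define MSPACES 1
    #define ONLY_MSPACES 1

    /* A thread's pool would then be created with something like
         mspace msp = create_mspace (MSPACE_SIZE, 0);
       where MSPACE_SIZE stands in for the 64K/512K/8M capacities in the
       table and the 0 means the mspace carries no lock of its own.  */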
(I think this data boils down to "mmaps are slow on Windows", just as we already know "file I/O is slow on Windows".) I also have Cygwin DLL hot-spot profiling data for the "no mspaces" case and for all the mspace sizes shown here that justifies blaming mmaps.

locking strategy:         func locks   data lock   lockless*   lockless*   lockless*
                          ----------  ----------  ----------  ----------  ----------
malloc strategy:              legacy  no mspaces  mspace=64K mspace=512K   mspace=8M
                          ----------  ----------  ----------  ----------  ----------

*** MALTEST PROFILING DATA ***

profile count subtotals...
  KernelBase.dll                  94          97          45          43          36
  kernel32.dll                    19          29          27          27          18
  ntdll.dll                    46950       34389       74997       75765       82826
  cygwin1.dll                   1172        1545        4727        4906        4519
  maltest.exe                   1873        2472        3249        3573        3075

profile count totals           50108       38532       83045       84314       90474
perf vs legacy                  1.00        1.30        0.60        0.59        0.55

profile counts as percentage of totals...
  Windows dlls                 93.9%       89.6%       90.4%       89.9%       91.6%
  cygwin1.dll                   2.3%        4.0%        5.7%        5.8%        5.0%
  maltest.exe                   3.7%        6.4%        3.9%        4.2%        3.4%

*** OTHER TEST DATA ***

raw data...
  maltest, ops/sec             13863       19500        8087        8417        9197
  cygwin config, secs          39.82       38.45       40.92       41.29       46.04
  cygwin make -j4, secs         1600        1555        1611        1589        1611

perf vs legacy...
  maltest                       1.00        1.41        0.58        0.61        0.66
  cygwin config                 1.00        1.04        0.97        0.96        0.86
  cygwin make -j4               1.00        1.03        0.99        1.01        0.99

*** NOTES ***

- "lockless*" means no lock is needed if a request can be satisfied from the thread's own mspace; otherwise a lock+unlock is needed on the global mspace. (A rough sketch of this path is in the P.S. below.)
- Each profile "count" equals 0.01 CPU seconds.
- Under OTHER TEST DATA, the maltest and cygwin config figures are averages of 5 runs; the cygwin make figures are from single runs.
- "maltest" is a threaded malloc stress tester. In my testing it's set up to use 4 threads that allocate and later release random-sized blocks <= 512kB. A subset of block sizes is skewed somewhat smaller than truly random to simulate frequent C++ class instantiation. Threads also touch each page of a block on return from malloc(); this simulates actual app behavior better than just doing mallocs+frees. A subset of mallocs (~6%) is morphed into reallocs just to exercise that path.
- The profile counts were obtained with cygmon, a tool I ought to release.
- All investigation was done on a 2C/4T 2.3GHz Windows 10 machine using an SSD.

This email is basically a state dump of where I'm currently at. Comments or questions are welcome. I'm inclined to release what implements the first enhancement (maybe a 3% speedup, more for multi-threaded processes) but leave mspaces for the future, if at all. Maybe the less-than-satisfying mspace performance argues for trying harder to get jemalloc or tcmalloc investigated in the Cygwin environment.

Thanks for reading.

..mark
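P.S. For anyone wondering what the "lockless" per-thread mspace path in the NOTES looks like in rough terms, here's a sketch. It's purely illustrative: create_mspace() and mspace_malloc() are real dlmalloc calls, but global_msp, thread_msp, MSPACE_SIZE, and the lock_global()/unlock_global() helpers are stand-ins, not the actual Cygwin code.

    static mspace global_msp;            /* shared pool, protected by a lock */
    static __thread mspace thread_msp;   /* per-thread pool, never locked    */

    void *
    sketch_malloc (size_t nbytes)
    {
      if (!thread_msp)
        /* First allocation on this thread: create its private pool.
           MSPACE_SIZE corresponds to the 64K/512K/8M capacities tested
           above; the 0 means the mspace has no lock of its own.  */
        thread_msp = create_mspace (MSPACE_SIZE, 0);

      void *p = mspace_malloc (thread_msp, nbytes);
      if (p)
        return p;                        /* common case: no locking at all */

      /* The request couldn't be satisfied from the thread's own mspace,
         so fall back to the shared mspace, which needs a lock+unlock.  */
      lock_global ();
      p = mspace_malloc (global_msp, nbytes);
      unlock_global ();
      return p;
    }

Naturally free() then has to route each block back to whichever mspace it came from, which is part of what makes the real thing more involved than the sketch suggests.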