From: Mark Geisert
To: Cygwin-Apps
Subject: Re: Extreme slowdown due to malloc?
Date: Sun, 17 Jan 2021 23:07:24 -0800

Hi Achim,
Thank you very much for the detailed instructions and also the comparison data
Linux vs Cygwin for all those testcases.

Achim Gratz wrote:
> ASSI writes:
>>> I have a Cygwin malloc speedup patch that *might* help the m-t part.
>>> I'll prepare and submit that to cygwin-patches shortly.
>>
>> Well, if you want to test it with the new ZStandard, give it a spin…
>> I'll check how far I can strip that test down so you can use the Cygwin
>> source tree for testing.

I've now done this.  And I don't see any improvement.  Reasons below...

> OK, it's actually pretty simple, do this inside a checkout of
> newlib-cygwin:
>
> $ find newlib winsup texinfo -type f > flist
> $ zstd --train-cover --ultra -22 -T0 -vv --filelist=flist -o dict-cover
>
> On Linux, it reads in all the files in about two seconds, while it takes
> quite a while longer on Cygwin.  But the real bummer is that
> constructing the partial suffix arrays (which is single-threaded) will
> seemingly take forever, while it's done much faster on Linux.  You can
> pare down the number of files like that:
>
> $ shuf -n 320 flist > slist

I've settled on '-n 1600' for testing.  I'm running these Cygwin tests on a
2C/4T i3-something with 8GB memory and an SSD used for filesystem and page
file.  Not a dog but clearly not a dire-wolf either.

The page fault numbers are comparable to what you've shown for Cygwin on your
system.  The long pause after zstd prints "Constructing partial suffix array"
is because zstd is CPU-bound in qsort() for a long time.  No paging during
that time.  Then when the statistics start being printed out, that's when the
paging insanity starts.

What I discovered is that zstd is repeatedly asking malloc() for large memory
blocks, presumably to mmap files in, then free()ing them.  Any malloc request
256K or larger is fulfilled by mmap() rather than enlarging the heap for it.
But crucially, there is no mechanism for our malloc to hang on to freed
mmap()ed pages for future use.  If you free an mmap()ed block, it is
munmap()ed immediately.
So for zstd's usage pattern you get an incredible number of page faults to
satisfy the mmap()s, and Windows seems to take a non-trivial amount of time
for each mmap().

I will be looking at our malloc implementation to see if tuning something can
fix this behavior.  Adding code is the last resort.

Thanks again for the great testcase.

..mark
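
For illustration only, here is a rough C sketch of the allocate/touch/free
churn described above, together with the sort of "hang on to a freed block"
mechanism that is missing.  Everything in it is an assumption made for the
example: the names big_alloc(), big_free(), churn() and BIG_THRESHOLD are
hypothetical, the 256K value is simply the figure quoted in this message, and
the cache sits at the application level.  It is neither Cygwin's malloc nor
zstd's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cutoff, taken from the 256K figure in the message. */
#define BIG_THRESHOLD ((size_t)256 * 1024)

static void *cached_block = NULL;      /* the one large block kept for reuse */
static size_t cached_size = 0;         /* size recorded when it was released */
static unsigned long real_allocs = 0;  /* requests that reached malloc() */
static unsigned long cache_hits = 0;   /* requests served from the slot */

/* Return at least `size` bytes, reusing the cached block if it is big enough. */
static void *big_alloc(size_t size)
{
    if (size >= BIG_THRESHOLD && cached_block != NULL && cached_size >= size) {
        void *p = cached_block;
        cached_block = NULL;
        cached_size = 0;
        cache_hits++;
        return p;
    }
    real_allocs++;
    return malloc(size);
}

/* Release a block.  One large block is parked in the slot instead of being
   handed back to free(); anything else goes straight to free(). */
static void big_free(void *p, size_t size)
{
    if (p == NULL)
        return;
    assert(p != cached_block);  /* guard against releasing the same block twice */
    if (size >= BIG_THRESHOLD && cached_block == NULL) {
        cached_block = p;
        cached_size = size;
        return;
    }
    free(p);
}

/* Mimic the usage pattern: allocate, write to the whole buffer, release. */
static int churn(size_t size, int rounds)
{
    for (int i = 0; i < rounds; i++) {
        char *buf = big_alloc(size);
        if (buf == NULL)
            return -1;
        memset(buf, 0, size);  /* touch every page, as a real consumer would */
        big_free(buf, size);
    }
    return 0;
}
```

With this in place, churn(1024 * 1024, 100) reaches malloc() once and serves
the remaining 99 rounds from the slot.  Replacing big_alloc()/big_free() with
plain malloc()/free() gives the pattern described in the message, where every
round requests and releases a fresh large block.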