public inbox for cygwin-developers@cygwin.com
* Cygwin malloc tune-up status
@ 2020-09-25  6:01 Mark Geisert
  2020-09-27 16:54 ` Johannes Schindelin
  0 siblings, 1 reply; 17+ messages in thread
From: Mark Geisert @ 2020-09-25  6:01 UTC (permalink / raw)
  To: cygwin-developers

Hi folks,
I've been looking into two potential enhancements of Cygwin's malloc operation
for Cygwin 3.2.  The first is to replace the existing function-level locking
with something more fine-grained to help threaded processes; the second is to
implement thread-specific memory pools with the aim of lessening lock activity
even further.

Although I've investigated several alternative malloc packages, including
ptmalloc[23], nedmalloc, and Windows Heaps, only the last seems to improve on
the performance of Cygwin's malloc.  Unfortunately, using Windows Heaps would
require fiddling with undocumented heap structures to enable use with fork().
I also looked at BSD's jemalloc and Google's tcmalloc.  Those two require much
more work to port to Cygwin so I've back-burnered them for the time being.

I decided to concentrate on Cygwin's malloc, which is actually the most recent
version of Doug Lea's dlmalloc package.  Both of the desired enhancements can
largely be achieved by changing a couple of #defines in Cygwin's malloc_wrapper.cc.

The following table shows some results of my investigation into various tunings
of Cygwin's malloc implementation.  Each column below is a form of the existing
dlmalloc-based code, but making use of certain #defines to tune the allocator's
behavior.  The "legacy" implementation (first data column) is the one used up
to Cygwin 3.1.x.  See the NOTES at end of doc for table details.

One thing that stands out in the profiling data is that Windows overhead is on
the order of 90% of total profiling counts (on a malloc torture test program).
So changes made within the Cygwin DLL are unlikely to speed up Cygwin's malloc
unless the changes lead to less reliance on Windows calls.  (I think this data
boils down to "mmaps are slow on Windows", just as we already know "file I/O is
slow on Windows".)  I also have Cygwin DLL hot spot profiling data for the "no
mspaces" and all mspace sizes shown here that justifies blaming mmaps.
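
As a back-of-envelope check on that claim (an illustration only, not part of
the profiling runs), Amdahl's law can be applied to the cygwin1.dll shares
listed in the table below.  The function name and the dictionary are made up
for this sketch.

```python
def max_speedup(fraction_improved, factor=float("inf")):
    """Amdahl's law: overall speedup when `fraction_improved` of the run time
    becomes `factor` times faster and everything else stays the same."""
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / factor)


# Share of profile counts attributed to cygwin1.dll, one entry per table column.
cygwin_dll_share = {
    "legacy": 0.023,
    "no mspaces": 0.040,
    "mspace=64K": 0.057,
    "mspace=512K": 0.058,
    "mspace=8M": 0.050,
}

for name, share in cygwin_dll_share.items():
    # Upper bound: pretend the code inside the Cygwin DLL took no time at all.
    print(f"{name:12s} best case {max_speedup(share):.3f}x")
```

Even with the Cygwin DLL portion reduced to zero, every column stays below
about 1.07x, which is why only changes that cut the time spent in Windows
calls can move the totals much.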

locking strategy: func locks    data lock    lockless*    lockless*    lockless*
                   ----------   ----------   ----------   ----------   ----------
malloc strategy:      legacy   no mspaces   mspace=64K  mspace=512K    mspace=8M
                   ----------   ----------   ----------   ----------   ----------
*** MALTEST PROFILING DATA ***
profile count subtotals...
KernelBase.dll            94           97           45           43           36
kernel32.dll              19           29           27           27           18
ntdll.dll              46950        34389        74997        75765        82826
cygwin1.dll             1172         1545         4727         4906         4519
maltest.exe             1873         2472         3249         3573         3075

profile count totals   50108        38532        83045        84314        90474
perf vs legacy          1.00         1.30         0.60         0.59         0.55

profile counts as percentage of totals...
Windows dlls           93.9%        89.6%        90.4%        89.9%        91.6%
cygwin1.dll             2.3%         4.0%         5.7%         5.8%         5.0%
maltest.exe             3.7%         6.4%         3.9%         4.2%         3.4%

*** OTHER TEST DATA ***
raw data...
maltest, ops/sec       13863        19500         8087         8417         9197
cygwin config, secs    39.82        38.45        40.92        41.29        46.04
cygwin make -j4, secs   1600         1555         1611         1589         1611

perf vs legacy...
maltest                 1.00         1.41         0.58         0.61         0.66
cygwin config           1.00         1.04         0.97         0.96         0.86
cygwin make -j4         1.00         1.03         0.99         1.01         0.99
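
For anyone who wants to re-derive the "perf vs legacy" rows, the arithmetic
can be sketched in Python as follows.  The raw numbers are copied from the
table above; the helper name is invented for this sketch.  For ops/sec a
higher value is better, so the ratio is value/legacy; for seconds a lower
value is better, so the ratio is legacy/value.

```python
# Column order matches the table header.
COLUMNS = ["legacy", "no mspaces", "mspace=64K", "mspace=512K", "mspace=8M"]

maltest_ops_per_sec = [13863, 19500, 8087, 8417, 9197]  # higher is better
config_secs = [39.82, 38.45, 40.92, 41.29, 46.04]       # lower is better
make_secs = [1600, 1555, 1611, 1589, 1611]              # lower is better


def perf_vs_legacy(values, higher_is_better):
    """Return each column's performance relative to the first (legacy) column."""
    base = values[0]
    if higher_is_better:
        return [round(v / base, 2) for v in values]
    return [round(base / v, 2) for v in values]


for label, values, higher in [
    ("maltest", maltest_ops_per_sec, True),
    ("cygwin config", config_secs, False),
    ("cygwin make -j4", make_secs, False),
]:
    ratios = perf_vs_legacy(values, higher)
    print(f"{label:16s}", "  ".join(f"{r:5.2f}" for r in ratios))
```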

*** NOTES ***
- "lockless*" means no lock needed if request can be satisfied from thread's
   own mspace, else a lock+unlock is needed on the global mspace.

- Each profile "count" equals 0.01 CPU seconds.

- Under OTHER TEST DATA, maltest and cygwin config data are averages of 5 runs,
   cygwin make data are from single runs.

- "maltest" is a threaded malloc stress tester.  In my testing it's set up to
   use 4 threads that allocate and later release random-sized blocks <= 512kB.
   A subset of block sizes are skewed somewhat smaller than truly random to
   simulate frequent C++ class instantiation.  Threads also touch each page of a
   block on return from malloc(); this simulates actual app behavior better than
   just doing mallocs+frees.  A subset of mallocs (~6%) are morphed into
   reallocs just to exercise that path.

- The profile counts were obtained with cygmon, a tool I ought to release.

- All investigation done on a 2C/4T 2.3GHz Windows 10 machine using an SSD.
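
To make the "lockless*" note concrete, here is a toy model in Python.  It is
not dlmalloc or Cygwin code: it hands out block ids instead of memory, and
the class, method, and counter names are all invented for illustration.  The
only point is the control flow: a request is served from the calling
thread's own pool without any lock when possible, and otherwise falls back
to a lock+unlock on the shared global pool.

```python
import threading


class ToyAllocator:
    """Toy model only: per-thread pools with a locked global fallback."""

    def __init__(self, global_blocks, per_thread_blocks):
        self._global_lock = threading.Lock()
        self._global_free = list(range(global_blocks))
        self._per_thread_blocks = per_thread_blocks
        self._local = threading.local()
        self.global_lock_count = 0  # number of times the slow path was taken

    def _pool(self):
        # Lazily create the calling thread's private pool on first use.
        pool = getattr(self._local, "pool", None)
        if pool is None:
            pool = [("local", i) for i in range(self._per_thread_blocks)]
            self._local.pool = pool
        return pool

    def alloc(self):
        pool = self._pool()
        if pool:
            return pool.pop()  # fast path: no lock needed
        with self._global_lock:  # slow path: lock + unlock on the global pool
            self.global_lock_count += 1
            # Raises IndexError if the global pool is exhausted.
            return ("global", self._global_free.pop())


def worker(allocator, count, out):
    """Allocate `count` blocks and record them."""
    out.append([allocator.alloc() for _ in range(count)])
```

With two threads each requesting six blocks from an allocator created as
ToyAllocator(global_blocks=100, per_thread_blocks=4), each thread's first
four requests take the fast path and the remaining two take the lock.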

This email is basically a state dump of where I'm currently at.  Comments or
questions are welcome.  I'm inclined to release what implements the first
enhancement (maybe 3% speedup, more for multi-thread processes) but leave
mspaces for the future, if at all.  Maybe the less-than-satisfying mspace
performance argues for trying harder to get jemalloc or tcmalloc investigated
in the Cygwin environment.

Thanks for reading.

..mark

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Cygwin malloc tune-up status
  2020-09-25  6:01 Cygwin malloc tune-up status Mark Geisert
@ 2020-09-27 16:54 ` Johannes Schindelin
  2020-09-29  2:22   ` Mark Geisert
  0 siblings, 1 reply; 17+ messages in thread
From: Johannes Schindelin @ 2020-09-27 16:54 UTC (permalink / raw)
  To: Mark Geisert; +Cc: cygwin-developers

Hi Mark,

On Thu, 24 Sep 2020, Mark Geisert wrote:

> I've been looking into two potential enhancements of Cygwin's malloc operation
> for Cygwin 3.2.  The first is to replace the existing function-level locking
> with something more fine-grained to help threaded processes; the second is to
> implement thread-specific memory pools with the aim of lessening lock activity
> even further.
>
> Although I've investigated several alternative malloc packages, including
> ptmalloc[23], nedalloc, and Windows Heaps, only the latter seems to improve on
> the performance of Cygwin's malloc.  Unfortunately using Windows Heaps would
> require fiddling with undocumented heap structures to enable use with fork().
> I also looked at BSD's jemalloc and Google's tcmalloc.  Those two require much
> more work to port to Cygwin so I've back-burnered them for the time being.

I am just a lurker when it comes to your project, but I wonder whether you
had any chance to look into mimalloc
(https://github.com/microsoft/mimalloc)? I had investigated it in Git for
Windows' context for a while (because nedmalloc, which is used by Git for
Windows, is no longer actively maintained).

Ciao,
Dscho


* Re: Cygwin malloc tune-up status
  2020-09-27 16:54 ` Johannes Schindelin
@ 2020-09-29  2:22   ` Mark Geisert
  2021-04-01  9:19     ` Teemu Nätkinniemi
  0 siblings, 1 reply; 17+ messages in thread
From: Mark Geisert @ 2020-09-29  2:22 UTC (permalink / raw)
  To: cygwin-developers

Johannes Schindelin wrote:
> Hi Mark,
> On Thu, 24 Sep 2020, Mark Geisert wrote:
>> I've been looking into two potential enhancements of Cygwin's malloc operation
>> for Cygwin 3.2.  The first is to replace the existing function-level locking
>> with something more fine-grained to help threaded processes; the second is to
>> implement thread-specific memory pools with the aim of lessening lock activity
>> even further.
>>
>> Although I've investigated several alternative malloc packages, including
>> ptmalloc[23], nedalloc, and Windows Heaps, only the latter seems to improve on
>> the performance of Cygwin's malloc.  Unfortunately using Windows Heaps would
>> require fiddling with undocumented heap structures to enable use with fork().
>> I also looked at BSD's jemalloc and Google's tcmalloc.  Those two require much
>> more work to port to Cygwin so I've back-burnered them for the time being.
> 
> I am just a lurker when it comes to your project, but I wonder whether you
> had any chance to look into mimalloc
> (https://github.com/microsoft/mimalloc)? I had investigated it in Git for
> Windows' context for a while (because nedmalloc, which is used by Git for
> Windows, is no longer actively maintained).

Hi Johannes,
Great minds think alike!  Yours is the 3rd pointer I've received on- and off-list 
towards mimalloc.  I had not heard of it before.  I've looked into it and have now 
added it to my back-burnered list.

mimalloc looks promising.  It's fairly small.  It has at least one issue it shares 
with jemalloc and tcmalloc: initialization code that needs to run before the 
Cygwin DLL has completely set up a new process' environment.  A chicken-and-egg 
problem, if you will.  A solution to that (which I'm pondering) will allow me to 
test all three malloc alternatives in the future.

Thank you to all our users for sharing helpful pointers!

..mark


* Re: Cygwin malloc tune-up status
  2020-09-29  2:22   ` Mark Geisert
@ 2021-04-01  9:19     ` Teemu Nätkinniemi
  2021-04-02  5:45       ` Maybe consider rpmalloc -- Was: " Mark Geisert
  0 siblings, 1 reply; 17+ messages in thread
From: Teemu Nätkinniemi @ 2021-04-01  9:19 UTC (permalink / raw)
  To: Mark Geisert; +Cc: cygwin-developers

On Tue, 29 Sep 2020 at 5:23, Mark Geisert (mark@maxrnd.com) wrote:
>
> Johannes Schindelin wrote:
> > Hi Mark,
> > On Thu, 24 Sep 2020, Mark Geisert wrote:
> >> I've been looking into two potential enhancements of Cygwin's malloc operation
> >> for Cygwin 3.2.  The first is to replace the existing function-level locking
> >> with something more fine-grained to help threaded processes; the second is to
> >> implement thread-specific memory pools with the aim of lessening lock activity
> >> even further.
> >>
> >> Although I've investigated several alternative malloc packages, including
> >> ptmalloc[23], nedalloc, and Windows Heaps, only the latter seems to improve on
> >> the performance of Cygwin's malloc.  Unfortunately using Windows Heaps would
> >> require fiddling with undocumented heap structures to enable use with fork().
> >> I also looked at BSD's jemalloc and Google's tcmalloc.  Those two require much
> >> more work to port to Cygwin so I've back-burnered them for the time being.
> >
> > I am just a lurker when it comes to your project, but I wonder whether you
> > had any chance to look into mimalloc
> > (https://github.com/microsoft/mimalloc)? I had investigated it in Git for
> > Windows' context for a while (because nedmalloc, which is used by Git for
> > Windows, is no longer actively maintained).
>
> Hi Johannes,
> Great minds think alike!  Yours is the 3rd pointer I've received on- and off-list
> towards mimalloc.  I had not heard of it before.  I've looked into it and have now
> added it to my back-burnered list.
>
> mimalloc looks promising.  It's fairly small.  It has at least one issue it shares
> with jemalloc and tcmalloc: initialization code that needs to run before the
> Cygwin DLL has completely set up a new process' environment.  A chicken-and-egg
> problem, if you will.  A solution to that (which I'm pondering) will allow me to
> test all three malloc alternatives in the future.
>

Hi!

I encountered a problem with Cygwin's malloc and remembered this thread.
Sorry if this is off-topic.

I have been trying to port bwa aligner to Cygwin. Initially everything
seemed to work but for some reason in some cases threading didn't seem
to work properly. I got a fix recently from a third party which was to
force bwa to use rpmalloc.

This got me wondering whether there is a problem with Cygwin's malloc in
some cases, and whether there are people on this list who might be
interested in knowing that the problem exists.

Here's a link to the rpmalloc fix.

https://github.com/WGSExtract/bwa/commit/3087fa876b079fcb6a0a58f1e01757f4820094a8

Here's a test case:

bwa_original mem -t 10 bwa_reference/hs37d5.fa ERS4238880_1.fastq > test.sam

bwa_working mem -t 10 bwa_reference/hs37d5.fa ERS4238880_1.fastq > test.sam

Files:

https://drive.google.com/drive/folders/1waICih51f4mHZEyWY1onyEcKqm0kj3Yt?usp=sharing

Thanks,
Teemu


* Maybe consider rpmalloc -- Was: Re: Cygwin malloc tune-up status
  2021-04-01  9:19     ` Teemu Nätkinniemi
@ 2021-04-02  5:45       ` Mark Geisert
  2021-04-03  2:53         ` Maybe consider rpmalloc Mark Geisert
  0 siblings, 1 reply; 17+ messages in thread
From: Mark Geisert @ 2021-04-02  5:45 UTC (permalink / raw)
  To: cygwin-developers

Hi Teemu,

Teemu Nätkinniemi via Cygwin-developers wrote:
> Hi!
> 
> I encounter a problem with Cygwin's malloc and remembered this thread.
> Sorry if this is off-topic.

New topic, so new thread :-)

> I have been trying to port bwa aligner to Cygwin. Initially everything
> seemed to work but for some reason in some cases threading didn't seem
> to work properly. I got a fix recently from a third party which was to
> force bwa to use rpmalloc.
> 
> This got me thinking if there is a problem with Cygwin's malloc in
> some cases and if there were people in this list who might be
> interested in knowing that the problem exists.
>
> Here's a link to the rpmalloc fix.
> 
> https://github.com/WGSExtract/bwa/commit/3087fa876b079fcb6a0a58f1e01757f4820094a8
> 
> Here's a test case:
> 
> bwa_original mem -t 10 bwa_reference/hs37d5.fa ERS4238880_1.fastq > test.sam
> 
> bwa_working mem -t 10 bwa_reference/hs37d5.fa ERS4238880_1.fastq > test.sam
> 
> Files:
> 
> https://drive.google.com/drive/folders/1waICih51f4mHZEyWY1onyEcKqm0kj3Yt?usp=sharing

I'm not aware of anything actually being broken in the current Cygwin malloc 
(which is just the most recent dlmalloc), even in multi-threaded workloads.  It's 
only that there has long been interest around possibilities of making it faster. 
So far I've been unsuccessful: though I can get the Cygwin-level malloc operation 
faster by using other malloc's, it has always been overshadowed by a lot more time 
being spent in ntdll.dll underneath.  There's no point releasing something 
cosmetically better but slower in practice.

Thanks for mentioning rpmalloc; that's a new one to me.  It does appear to be 
coded nicely, like most of the other malloc's I've looked at.  I'll see if I can 
get that to work with Cygwin; the big issue tends to be fork() having to replicate 
the parent malloc layout, book-keeping as well as app data, in the child.
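
The contract at stake can be shown with a small Python sketch on a system
that has a native fork().  It says nothing about how Cygwin implements
this; it only demonstrates what a replacement allocator must preserve:
data allocated in the parent before fork() is present, unchanged, in the
child.  The function name is invented for this sketch.

```python
import os


def child_sees_parent_heap():
    """Fork and report whether the child saw the parent's heap data intact."""
    data = bytearray(b"allocated before fork")
    pid = os.fork()
    if pid == 0:
        # Child: the buffer allocated by the parent must still hold the same bytes.
        os._exit(0 if data == b"allocated before fork" else 1)
    # Parent: wait for the child and decode its exit status.
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    print("child saw parent's data:", child_sees_parent_heap())
```

On Windows there is no native fork(), so the Cygwin DLL has to arrange for
this itself, which is why the allocator's own book-keeping matters as much
as the application data.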

I will point out that you're not using rpmalloc as it would work on Cygwin, you've 
got a Windows-flavored rpmalloc because you've #defined _WIN32 and such via the 
Makefile.cygwin.  But it appears to be a clean break with Cygwin's malloc in your 
application, so congrats on that.

I understand that you've solved what you think was a Cygwin malloc issue by using 
rpmalloc, but I don't see how you came to the conclusion that it was a malloc 
issue, as opposed to something else with threads or beyond that.
Thanks again,

..mark


* Re: Maybe consider rpmalloc
  2021-04-02  5:45       ` Maybe consider rpmalloc -- Was: " Mark Geisert
@ 2021-04-03  2:53         ` Mark Geisert
  2021-04-03  6:46           ` Teemu Nätkinniemi
  0 siblings, 1 reply; 17+ messages in thread
From: Mark Geisert @ 2021-04-03  2:53 UTC (permalink / raw)
  To: cygwin-developers

Hi Teemu,
Regarding your test case...

>> Here's a test case:
>>
>> bwa_original mem -t 10 bwa_reference/hs37d5.fa ERS4238880_1.fastq > test.sam
>>
>> bwa_working mem -t 10 bwa_reference/hs37d5.fa ERS4238880_1.fastq > test.sam
>>
>> Files:
>>
>> https://drive.google.com/drive/folders/1waICih51f4mHZEyWY1onyEcKqm0kj3Yt?usp=sharing 

Could you please put a copy of the file hs37d5.fa into your Google drive folder? 
If that file can be built with the given fastq file and the bwa program, please 
show me how to do that.  A copy+paste from your terminal session would be 
preferable over the video captures you've supplied already, if possible.
Thank you,

..mark


* Re: Maybe consider rpmalloc
  2021-04-03  2:53         ` Maybe consider rpmalloc Mark Geisert
@ 2021-04-03  6:46           ` Teemu Nätkinniemi
  2021-04-03  6:48             ` Teemu Nätkinniemi
  0 siblings, 1 reply; 17+ messages in thread
From: Teemu Nätkinniemi @ 2021-04-03  6:46 UTC (permalink / raw)
  To: Mark Geisert; +Cc: cygwin-developers

The necessary files are in the bwa_reference subfolder. The hs37d5.fa in
the command is what bwa calls a prefix. The original hs37d5.fa file is not
needed.

The command is exactly like I wrote earlier.

./bwa mem -t 10 bwa_reference/hs37d5.fa test_1.fastq.gz test_2.fastq.gz > test1.sam

Let me know if you have any problems executing the commands.

Teemu


* Re: Maybe consider rpmalloc
  2021-04-03  6:46           ` Teemu Nätkinniemi
@ 2021-04-03  6:48             ` Teemu Nätkinniemi
  2021-04-11  9:28               ` Mark Geisert
  0 siblings, 1 reply; 17+ messages in thread
From: Teemu Nätkinniemi @ 2021-04-03  6:48 UTC (permalink / raw)
  To: Mark Geisert; +Cc: cygwin-developers

Sorry, hurt my back yesterday and looks like I am not thinking clearly.

./bwa mem -t 10 bwa_reference/hs37d5.fa ERS4238880_1.fastq > test1.sam


* Re: Maybe consider rpmalloc
  2021-04-03  6:48             ` Teemu Nätkinniemi
@ 2021-04-11  9:28               ` Mark Geisert
  2021-04-12  8:48                 ` Teemu Nätkinniemi
  0 siblings, 1 reply; 17+ messages in thread
From: Mark Geisert @ 2021-04-11  9:28 UTC (permalink / raw)
  To: cygwin-developers

Hi Teemu,

Teemu Nätkinniemi via Cygwin-developers wrote:
> Sorry, hurt my back yesterday and looks like I am not thinking clearly.

Hope you are feeling better by this time.

> ./bwa mem -t 10 bwa_reference/hs37d5.fa ERS4238880_1.fastq > test1.sam

Thanks.  It was my unfamiliarity with Google Drive which prevented my finding all 
the data files you had stored there.  After a while I did find all I needed.

I rebuilt bwa.exe alternately using the provided Makefile and Makefile.cygwin. 
When building with the latter I made sure your #ifdef patches were enabled so that 
rpmalloc was pulled in for the build.  When building with the former I made sure 
your patches were disabled, so the Cygwin malloc would be used for this case.

I had no difficulty running either version of bwa to completion.  On one smallish 
test machine the rpmalloc version finished in a bit less elapsed time but with the 
same CPU time as the Cygwin malloc version.

I also ran on a larger system; here both versions ran with similar elapsed and CPU 
times.  I also ran the Cygwin malloc version with '-t 32' to add some stress but 
still your test case ran to successful completion.

So I'm afraid I can't explain the results you were seeing.  Is it possible that 
you might have given up too soon running the Cygwin malloc version, thinking you 
should be seeing output as quickly as you would on Linux?  You won't, unfortunately.

You might try backing out your changes or, I think, building again from your main 
branch, to see if waiting longer proves successful.  If you have any other 
suggestions, please let us know.
Thanks & Regards,

..mark

P.S. Here's Cygwin malloc version's output from my smallish system
(i5, 2.3GHz, 2C/4T)...
./bwa mem -t 10 bwa_reference/hs37d5.fa /tmp/ERS4238880_1.fastq > test1.sam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 1712342 sequences (100000087 bp)...
[M::process] read 1103688 sequences (64503600 bp)...
[M::mem_process_seqs] Processed 1712342 reads in 2157.077 CPU sec, 2214.871 real sec
[M::mem_process_seqs] Processed 1103688 reads in 1541.766 CPU sec, 1591.704 real sec
[main] Version: 0.7.17-r1198-dirty
[main] CMD: ./bwa mem -t 10 bwa_reference/hs37d5.fa /tmp/ERS4238880_1.fastq
[main] Real time: 3831.624 sec; CPU: 3713.937 sec


* Re: Maybe consider rpmalloc
  2021-04-11  9:28               ` Mark Geisert
@ 2021-04-12  8:48                 ` Teemu Nätkinniemi
  2021-04-13  8:24                   ` Mark Geisert
  0 siblings, 1 reply; 17+ messages in thread
From: Teemu Nätkinniemi @ 2021-04-12  8:48 UTC (permalink / raw)
  To: Mark Geisert; +Cc: cygwin-developers

Hello,

Thanks for testing! I found a better test case with smaller files
which should clearly show the issue.

https://drive.google.com/drive/folders/1jOilHtKrr6CHn7zg__DE93RCDyseoTB1?usp=sharing

Here are the results. bwa_working is the build with rpmalloc and
bwa_original is unpatched. As you can see, the unpatched version running
with several threads takes a whole lot longer to finish, even compared
with the same unpatched exe running with a single thread. I am not the
only one experiencing the issue, so I doubt it is my system.

$ ../bwa-working/bwa_working.exe mem chr19_KI270866v1_alt.fasta 7859_GPI.read1.fq 7859_GPI.read2.fq > test1working.sam
(cut)
[main] Real time: 1.744 sec; CPU: 1.624 sec

$ ../bwa-working/bwa_working.exe mem -t 10 chr19_KI270866v1_alt.fasta 7859_GPI.read1.fq 7859_GPI.read2.fq > test1workingt10.sam
(cut)
[main] Real time: 0.354 sec; CPU: 2.218 sec

$ ../bwa-test/bwa_original.exe mem chr19_KI270866v1_alt.fasta 7859_GPI.read1.fq 7859_GPI.read2.fq > test1orig.sam
(cut)
[main] Real time: 1.733 sec; CPU: 1.608 sec

$ ../bwa-test/bwa_original.exe mem -t 10 chr19_KI270866v1_alt.fasta 7859_GPI.read1.fq 7859_GPI.read2.fq > test1origt10.sam
(cut)
[main] Real time: 8.131 sec; CPU: 5.265 sec
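
For a quick read of those four runs, the wall-clock speedup of each build
when going from one thread to -t 10 can be worked out with a few lines of
Python.  The timings are copied from the output above; the dictionary and
function names are made up for this sketch, and the labels assume that
bwa_original is the build using Cygwin's malloc.

```python
# (real seconds, CPU seconds) for each build and thread count
runs = {
    ("rpmalloc", 1): (1.744, 1.624),
    ("rpmalloc", 10): (0.354, 2.218),
    ("cygwin malloc", 1): (1.733, 1.608),
    ("cygwin malloc", 10): (8.131, 5.265),
}


def speedup(build):
    """Wall-clock speedup of the -t 10 run over the single-threaded run."""
    return runs[(build, 1)][0] / runs[(build, 10)][0]


for build in ("rpmalloc", "cygwin malloc"):
    print(f"{build:14s} -t 10 speedup: {speedup(build):.2f}x")
```

A value above 1 means the threaded run finished sooner; a value below 1
means adding threads made the run slower than the single-threaded one.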

Teemu



* Re: Maybe consider rpmalloc
  2021-04-12  8:48                 ` Teemu Nätkinniemi
@ 2021-04-13  8:24                   ` Mark Geisert
  2021-04-13 13:05                     ` Teemu Nätkinniemi
  0 siblings, 1 reply; 17+ messages in thread
From: Mark Geisert @ 2021-04-13  8:24 UTC (permalink / raw)
  To: cygwin-developers

Hi Teemu,

Teemu Nätkinniemi wrote:
> Hello,
> 
> Thanks for testing! I found a better test case with smaller files
> which should clearly show the issue.
> 
> https://drive.google.com/drive/folders/1jOilHtKrr6CHn7zg__DE93RCDyseoTB1?usp=sharing
> 
> Here's the results. Bwa_working is the one with rpmalloc and
> bwa_original is unpatched. As you can see the unpatched version with
> several threads takes a  whole lot more time to finish even when
> compared with unpatched exe running with a single thread. I am not the
> only one experiencing the issue so I doubt it is my system.
[...]

Well this certainly does show the issue(s) you're seeing.  Short runs but long 
enough for me to get decent profiling.  Yes, malloc code shows up a lot in the 
profiles.

I'm going to research some stuff and ask some questions of the Cygwin gurus to see 
if we can do something about this.
Thanks a bunch,

..mark


* Re: Maybe consider rpmalloc
  2021-04-13  8:24                   ` Mark Geisert
@ 2021-04-13 13:05                     ` Teemu Nätkinniemi
  2021-04-14  8:19                       ` Mark Geisert
  0 siblings, 1 reply; 17+ messages in thread
From: Teemu Nätkinniemi @ 2021-04-13 13:05 UTC (permalink / raw)
  To: cygwin-developers

Hi Mark,

Thanks a lot for looking into this issue! I wonder if there are any
other applications affected by this?

Teemu


* Re: Maybe consider rpmalloc
  2021-04-13 13:05                     ` Teemu Nätkinniemi
@ 2021-04-14  8:19                       ` Mark Geisert
  2021-04-14 18:36                         ` Teemu Nätkinniemi
  2021-04-14 18:53                         ` Jon Turney
  0 siblings, 2 replies; 17+ messages in thread
From: Mark Geisert @ 2021-04-14  8:19 UTC (permalink / raw)
  To: cygwin-developers

Teemu Nätkinniemi wrote:
> Thanks a lot for looking into this issue! I wonder if there are any
> other applications affected by this?

We have several examples by now.  All are (relatively) long-lasting apps, with 
high to very high memory allocation churn, often multi-threaded.  I believe some 
specific rsync operations hit this.  Achim reported a zstd operation that 
exhibited the symptoms.  And I've been attempting to get a working replacement for 
Cygwin's malloc for some time but every malloc I've tested, several of them, 
exhibits similar symptoms: excessive time being spent in ntdll.dll presumably 
supporting the memory operations.

Your rpmalloc "hack" is interesting in that you aren't using Cygwin's mmap() 
underneath the malloc routines; you're calling Windows VM ops directly.  Not sure 
yet what all the implications are.

I need to identify what's being hit within ntdll.dll.  Is it one or two routines, 
or just hot locks?  So that means getting the correct PDB file from the MS Symbol 
Server and working with Windows tools I'm unfamiliar with.  Sigh, in an earlier 
life I had a gdb that we'd taught how to work with PDB files; dunno if I could 
resurrect that.  Profiling the Cygwin DLL itself, call profiling I mean, might 
lead somewhere as well.

Lots of plausible directions to go...
Cheers,

..mark


* Re: Maybe consider rpmalloc
  2021-04-14  8:19                       ` Mark Geisert
@ 2021-04-14 18:36                         ` Teemu Nätkinniemi
  2021-04-14 18:53                         ` Jon Turney
  1 sibling, 0 replies; 17+ messages in thread
From: Teemu Nätkinniemi @ 2021-04-14 18:36 UTC (permalink / raw)
  To: cygwin-developers

On Wed, 14 Apr 2021 at 12:20, Mark Geisert (mark@maxrnd.com) wrote:
>
> Teemu Nätkinniemi wrote:
> > Thanks a lot for looking into this issue! I wonder if there are any
> > other applications affected by this?
>
> We have several examples by now.  All are (relatively) long-lasting apps, with
> high to very high memory allocation churn, often multi-threaded.  I believe some
> specific rsync operations hit this.  Achim reported a zstd operation that
> exhibited the symptoms.  And I've been attempting to get a working replacement for
> Cygwin's malloc for some time but every malloc I've tested, several of them,
> exhibits similar symptoms: excessive time being spent in ntdll.dll presumably
> supporting the memory operations.

Bwa's author has a more recent program called bwa-mem2, which seems to
have the exact same problem as bwa. I tried the rpmalloc trick, but
either it did not work or I could not identify the correct routine(s).

https://github.com/bwa-mem2/bwa-mem2

> Your rpmalloc "hack" is interesting in that you aren't using Cygwin's mmap()
> underneath the malloc routines; you're calling Windows VM ops directly.  Not sure
> yet what all the implications are.

It was not my hack. A third party managed to fix the problem that way,
since bwa is a genuinely useful program and there was a need to get it
running on Windows/Cygwin.

> Lots of plausible directions to go...

I am thankful that you are looking into this. Does anyone else have
any input that could be useful? Corinna?

Teemu

* Re: Maybe consider rpmalloc
  2021-04-14  8:19                       ` Mark Geisert
  2021-04-14 18:36                         ` Teemu Nätkinniemi
@ 2021-04-14 18:53                         ` Jon Turney
  2021-04-19  5:16                           ` Mark Geisert
  1 sibling, 1 reply; 17+ messages in thread
From: Jon Turney @ 2021-04-14 18:53 UTC (permalink / raw)
  To: cygwin-developers

On 14/04/2021 09:19, Mark Geisert wrote:
> Teemu Nätkinniemi wrote:
>> Thanks a lot for looking into this issue! I wonder if there are any
>> other applications affected by this?
> 
> We have several examples by now.  All are (relatively) long-lasting 
> apps, with high to very high memory allocation churn, often 
> multi-threaded.  I believe some specific rsync operations hit this.  
> Achim reported a zstd operation that exhibited the symptoms.  And I've 
> been attempting to get a working replacement for Cygwin's malloc for 
> some time but every malloc I've tested, several of them, exhibits 
> similar symptoms: excessive time being spent in ntdll.dll presumably 
> supporting the memory operations.
> 
> Your rpmalloc "hack" is interesting in that you aren't using Cygwin's 
> mmap() underneath the malloc routines; you're calling Windows VM ops 
> directly.  Not sure yet what all the implications are.
> 
> I need to identify what's being hit within ntdll.dll.  Is it one or two 
> routines, or just hot locks?  So that means getting the correct PDB file 
> from the MS Symbol Server and working with Windows tools I'm unfamiliar 
> with.  Sigh, in an earlier life I had a gdb that we'd taught how to work 

Yes, this would indeed be a very useful thing to have in gdb.

I'm not aware of any public work in that direction, though.

> with PDB files; dunno if I could resurrect that.  Profiling the Cygwin 
> DLL itself, call profiling I mean, might lead somewhere as well.

In the past I've had some success with using the Very Sleepy profiler 
([1]), which can use both PDB and DWARF symbols, on cygwin executables.

[1] https://github.com/VerySleepy/verysleepy

* Re: Maybe consider rpmalloc
  2021-04-14 18:53                         ` Jon Turney
@ 2021-04-19  5:16                           ` Mark Geisert
  2021-04-20 19:34                             ` Jon Turney
  0 siblings, 1 reply; 17+ messages in thread
From: Mark Geisert @ 2021-04-19  5:16 UTC (permalink / raw)
  To: cygwin-developers

Jon Turney wrote:
> On 14/04/2021 09:19, Mark Geisert wrote:
>> I need to identify what's being hit within ntdll.dll.  Is it one or two 
>> routines, or just hot locks?  So that means getting the correct PDB file from 
>> the MS Symbol Server and working with Windows tools I'm unfamiliar with.  Sigh, 
>> in an earlier life I had a gdb that we'd taught how to work 
> 
> Yes, this would indeed be a very useful thing to have in gdb.
> 
> I'm not aware of any public work in that direction, though.
> 
>> with PDB files; dunno if I could resurrect that.  Profiling the Cygwin DLL 
>> itself, call profiling I mean, might lead somewhere as well.
> 
> In the past I've had some success with using the Very Sleepy profiler ([1]), which 
> can use both PDB and DWARF symbols, on cygwin executables.
> 
> [1] https://github.com/VerySleepy/verysleepy

Thanks for that link, Jon.  That tool is potentially very useful.  Are you sure it 
understands DWARF though?  It seems to show only a subset of cygwin1.dll symbols 
but I can't immediately tell why those and not others.  Perhaps they're just the 
unmangled names present in the COFF symbol table?

Did you do anything in particular to assist it with debugging Cygwin exes?  Like 
adding to the Symbol Cache it builds?  I only see PDB files in its cache so far.

I think building a "fake" PDB file for cygwin1.dll might be good enough, but if 
there's an easier way I'd love to hear it.
Thanks again,

..mark

* Re: Maybe consider rpmalloc
  2021-04-19  5:16                           ` Mark Geisert
@ 2021-04-20 19:34                             ` Jon Turney
  0 siblings, 0 replies; 17+ messages in thread
From: Jon Turney @ 2021-04-20 19:34 UTC (permalink / raw)
  To: cygwin-developers

On 19/04/2021 06:16, Mark Geisert wrote:
> Jon Turney wrote:
>> On 14/04/2021 09:19, Mark Geisert wrote:
>>> I need to identify what's being hit within ntdll.dll.  Is it one or 
>>> two routines, or just hot locks?  So that means getting the correct 
>>> PDB file from the MS Symbol Server and working with Windows tools I'm 
>>> unfamiliar with.  Sigh, in an earlier life I had a gdb that we'd 
>>> taught how to work 
>>
>> Yes, this would indeed be a very useful thing to have in gdb.
>>
>> I'm not aware of any public work in that direction, though.
>>
>>> with PDB files; dunno if I could resurrect that.  Profiling the 
>>> Cygwin DLL itself, call profiling I mean, might lead somewhere as well.
>>
>> In the past I've had some success with using the Very Sleepy profiler 
>> ([1]), which can use both PDB and DWARF symbols, on cygwin executables.
>>
>> [1] https://github.com/VerySleepy/verysleepy
> 
> Thanks for that link, Jon.  That tool is potentially very useful.  Are 
> you sure it understands DWARF though?  It seems to show only a subset of 
> cygwin1.dll symbols but I can't immediately tell why those and not 
> others.  Perhaps they're just the unmangled names present in the COFF 
> symbol table?
> 
> Did you do anything in particular to assist it with debugging Cygwin 
> exes?  Like adding to the Symbol Cache it builds?  I only see PDB files 
> in its cache so far.

I think some wrestling with it was required, but I don't recall 
the details anymore.

> I think building a "fake" PDB file for cygwin1.dll might be good enough, 
> but if there's an easier way I'd love to hear it.
> Thanks again,
