public inbox for cygwin@cygwin.com
* Cygwin triggers integrity scrubbing on ReFS filesystems, making searching files impossible on large datasets
@ 2022-10-11  7:53 Matt D.
From: Matt D. @ 2022-10-11  7:53 UTC
  To: cygwin

Today I formatted a Storage Pool mirror as ReFS with integrity streams
enabled, then copied data over from a backup. The data includes
several million files, which I search often with tools like find and
grep. After the copy finished, I tried a simple find:

time find . -iname file.png

The search took much longer than expected, and I gave up after waiting
for over 20 minutes. The same search over the same data on an external
USB3 drive formatted as NTFS finishes in 1 to 1.5 minutes.
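
For anyone who wants to reproduce the comparison, here is a minimal
sketch of a timing helper. The function name time_find and the
/cygdrive paths in the comments are assumptions for illustration, not
the exact commands I ran; it only wraps the same find invocation and
reports whole seconds.

```shell
#!/usr/bin/env bash
# Hypothetical helper: time a case-insensitive name search under a
# directory. Prints "<elapsed seconds> <match count> <directory>".
time_find() {
    local dir=$1 pattern=$2 start end count
    start=$(date +%s)
    # Count matching paths; errors such as unreadable directories are ignored.
    count=$(find "$dir" -iname "$pattern" 2>/dev/null | wc -l)
    end=$(date +%s)
    printf '%d %d %s\n' "$((end - start))" "$((count + 0))" "$dir"
}

# Example paths are assumptions; substitute the real mount points.
# time_find /cygdrive/d file.png   # ReFS volume with integrity streams
# time_find /cygdrive/e file.png   # NTFS volume on the external drive
```

Run both on a freshly booted system if possible, since cached
directory data from an earlier pass can make the second run look
faster than it would be otherwise.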

To verify that this is in fact an incompatibility with ReFS's
integrity streams, I reformatted the same pool as ReFS with the
feature disabled and copied the files back over. Without integrity
streams, the find operation took about 30 seconds. Formatting the pool
as NTFS gave a similar result. Finally, I reformatted the pool as ReFS
with integrity streams enabled, and the problem returned.

Although the behavior looks like a program hang, find is not actually
frozen. It continues to respond to Ctrl-C and, if a more permissive
pattern is used, output appears during the search, only very slowly. I
believe the issue has to do with how Cygwin or find accesses these
files as it searches, triggering the integrity scrubber on each visit
and making the search unbearably slow. Using Windows search on the
same disk does not have this problem.

I haven't done any performance comparison with grep, but I would
expect the experience to be similarly poor or worse. It's interesting
that the scrubber is triggered in this example with find, as I'm only
examining file names and not reading file contents.
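
One way to narrow down whether the slowdown comes from the Cygwin
layer rather than from directory traversal itself would be to time a
native listing against find from the same shell. The sketch below is
untested on the affected volume; the helper name elapsed is
hypothetical, and it assumes cmd.exe is reachable from the Cygwin
PATH.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run any command, discard its output, and print
# the wall-clock time it took in whole seconds.
elapsed() {
    local start end
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo "$((end - start))"
}

# Assumed usage, run from the root of the affected volume:
# elapsed find . -iname file.png          # traversal through Cygwin
# elapsed cmd /c "dir /s /b file.png"     # traversal by native cmd.exe
```

If the native listing is fast while find is slow, that would point at
how Cygwin queries each file rather than at ReFS directory enumeration
as such.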

See here for more information on ReFS integrity streams:

https://learn.microsoft.com/en-us/windows-server/storage/refs/integrity-streams

To format a disk with this feature, PowerShell must be used, as it's
not enabled by default or accessible from the GUI:

Format-Volume -DriveLetter D -FileSystem REFS -SetIntegrityStreams $true

The hardware I used was two Crucial MX500 2TB SSDs, recently trimmed,
in a RAID1 mirror configuration in Storage Spaces on Windows 10
Professional for Workstations. The system was just formatted and is
fully updated. Cygwin was also a fresh download and fully updated. The
system is otherwise very fast, with a Ryzen 1800X and 64GB of memory.

At this point, Cygwin is unusable for me on any disk formatted as ReFS
with integrity streams enabled whenever the workload involves I/O on a
large number of files.
