From: Corinna Vinschen <corinna-cygwin@cygwin.com>
To: cygwin@cygwin.com
Subject: Re: rfe: CYGWIN fslinktypes option? Re: Catastrophic Cygwin find . -ls, grep performance on samba share compared to WSL&Linux
Date: Mon, 8 Jan 2024 15:53:02 +0100
Message-ID: <ZZwMTtV5_kdgH0Yr@calimero.vinschen.de>
In-Reply-To: <CAKAoaQnQ2eL9JJfn=CeJ06WujqgLdLVeXS7ojf7GvvmkB-KYoA@mail.gmail.com>

On Dec 24 01:47, Roland Mainz via Cygwin wrote:
> On Thu, Dec 21, 2023 at 9:32 PM Kaz Kylheku via Cygwin
> <cygwin@cygwin.com> wrote:
> > On 2023-12-21 04:16, Martin Wege via Cygwin wrote:
> > > On Wed, Dec 20, 2023 at 6:21 PM Kaz Kylheku via Cygwin
> > > <cygwin@cygwin.com> wrote:
> [snip]
> > > The root cause is IMO the extra Win32 syscalls (>= 3 per file lookup,
> > > compared to 1 on Linux) to lookup the *.lnk and *.exe.lnk files on
> > > filesystems which have native link support (NTFS, ReFS, SMBFS, NFS).
> > > On SMBFS and NFS it hurts the most, because access latency is the
> > > highest for networked filesystems.
> >
> > Could some intelligent caching be added there? (Discussion of
> > associated invalidation problem in 3... 2.... 1... )
> 
> See below, basically a short-lived cache which is only valid for the
> lifetime of the one POSIX function call would be OK...
> 
> > Can you discuss more details, so people don't have to dive into code
> > to understand it? If we are accessing some file "foo", the application
> > or user may actually be referring to a "foo.lnk" link. But in the
> > happy case that "foo" exists, why would we bother looking for "foo.lnk"?
> >
> > If "foo" does not exist, but "foo.lnk" does, that could probably be
> > cached, so that next time "foo" is accessed, we go straight for "foo.lnk",
> > and keep using that while it exists.
> >
> > If someone has both "foo" and "foo.lnk" in the same directory,
> > that's a bit of a degenerate case; how important is it to be "correct",
> > anyway.
> 
> Question, mainly for Corinna:
> Could the code be modified to use one |NtQueryDirectoryFile()| call
> with a SINGLE pattern testing for { "foo", "foo.lnk", "foo.lnk.exe",
> ... } (instead of calling the kernel for each suffix independently)
> and cache that information for the lifetime of the matching POSIX
> function call ?

Yes and no.  This could certainly be made to work, but it has a couple
of caveats which are not trivial, and there's *no* guarantee that
you will get faster code by doing that.  At all.

First of all, in contrast to calling NtOpenFile on the file,
NtQueryDirectoryFile always needs two calls, because you have to open
the directory first.  If you then find the file, you have to open the
file to fetch information.  So you always have one more call than when
opening the file directly succeeds immediately.  It's more or less
equivalent if the file is a *.exe file, and it's one less hit if it's
a *.lnk file.
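
A rough way to check the arithmetic above is to count calls in a toy
model.  The sketch below is illustration only and is not taken from the
Cygwin sources: the probe order ("foo", then "foo.exe", then "foo.lnk")
and the fixed cost of two calls for the directory approach are
assumptions, chosen because they reproduce the comparison in the
paragraph above.  The function names are hypothetical.

```python
def direct_open_calls(existing, name, suffixes=("", ".exe", ".lnk")):
    """Count open attempts until the first candidate name exists.

    `existing` is a set of file names standing in for a directory.
    The suffix order is an assumption made for this sketch.
    """
    for attempts, suffix in enumerate(suffixes, start=1):
        if name + suffix in existing:
            return attempts
    return len(suffixes)  # every probe failed


def directory_query_calls(fixed_cost=2):
    """Assumed fixed cost of the directory-query approach.

    If opening the matched file is counted as a separate third call,
    pass fixed_cost=3; every comparison then shifts by one.
    """
    return fixed_cost


for files in ({"foo"}, {"foo.exe"}, {"foo.lnk"}):
    print(sorted(files),
          "direct:", direct_open_calls(files, "foo"),
          "directory query:", directory_query_calls())
```

Under these assumptions the directory approach only wins when the
direct probes have to fall through to the last suffix.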

Which pattern would you like to use?  Let's assume we carefully try to
get rid of .exe.lnk, we still have to check for "foo", "foo.exe" and
"foo.lnk".  Even if we get rid of .lnk, we have two patterns which
can *not* be expressed in a single call to NtQueryDirectoryFile.
We only have Windows' simplest globbing, i.e., we have '*' and '?'.
The only pattern matching both "foo" and "foo.exe" is "foo*".  "foo.*"
does not hit on "foo".  So "foo*".  As you know, the NtQueryDirectoryFile
call can return a buffer with multiple hits.  But the buffer has a
finite size, so if somebody is looking for the file "a", we'd have to
look for "a*", which may have more hits than fit into the buffer.
So the code has to be prepared not only to scan a 64K buffer for
(potentially) hundreds of entries, but also to repeat the call to
NtQueryDirectoryFile to load more matching file entries.
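
To make the pattern problem concrete, here is a small simulation.  It is
a sketch only: simple_glob is a hypothetical, case-sensitive matcher that
knows just '*' and '?', not a reimplementation of the kernel's matching
rules, and the batch size stands in for the finite result buffer.  None
of these names exist in Cygwin or Windows.

```python
def simple_glob(pattern, name):
    """Match `name` against `pattern`; '*' is any run, '?' any one char."""
    p = n = 0
    star, mark = -1, 0  # last '*' in pattern, and where it began matching
    while n < len(name):
        if p < len(pattern) and pattern[p] == "*":
            star, mark = p, n
            p += 1
        elif p < len(pattern) and pattern[p] in ("?", name[n]):
            p += 1
            n += 1
        elif star != -1:
            # Backtrack: let the last '*' swallow one more character.
            mark += 1
            n = mark
            p = star + 1
        else:
            return False
    while p < len(pattern) and pattern[p] == "*":
        p += 1
    return p == len(pattern)


def query_directory(entries, pattern, batch_size):
    """Yield matching names in batches, one batch per simulated query."""
    matches = [e for e in entries if simple_glob(pattern, e)]
    for i in range(0, len(matches), batch_size):
        yield matches[i:i + batch_size]


def find_candidates(entries, name, suffixes=("", ".exe", ".lnk"),
                    batch_size=2):
    """Return (names actually wanted, number of simulated queries)."""
    wanted = {name + s for s in suffixes}
    found, calls = set(), 0
    for batch in query_directory(entries, name + "*", batch_size):
        calls += 1
        found.update(wanted.intersection(batch))
    # An empty result would still have cost one query.
    return found, max(calls, 1)


entries = ["a", "a.exe", "abc", "apple", "archive.tar", "b"]
print(simple_glob("foo*", "foo"), simple_glob("foo.*", "foo"))
print(find_candidates(entries, "a"))
```

Looking for "a" forces the pattern "a*", which drags in every unrelated
name starting with "a"; the caller has to filter them out and keep
querying until the batches run dry.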

Next problem, NFS.  The current call, which just opens the file, passes
the flags necessary to access symlinks.  Without these flags, NFS
symlinks are invisible or not handled as symlinks.  So, right now, we
have a single call on NFS to open a file, if it exists without suffix.
If you use NtQueryDirectoryFile, you have another subtle problem.  If it
happens to be an NFS dir, you have to use another FILE_INFORMATION_CLASS,
otherwise symlinks don't show up at all.  This information class isn't
even sufficient for the most basic information we need in the
symlink_info::check method.  So you need to open the file here, too,
and extract the information.

There's probably more to it, but that's just what came to mind for
a start.

> The idea is to reduce the number of userland<--->kernel roundtrips
> from <n> to <1>, and filesystem drivers could be optimized even
> further (for example if the network filesystem protocol supports file
> name globbing...)

I have a hard time seeing how you can really avoid a lot of calls.
You may find that you won't save a lot of them, and another lot
of them don't matter because the OS has already cached the information.

Also, as exciting as it might be to do extensive caching (and, as I
wrote in an earlier reply today, we do some caching), keep in mind that
we are only a user-space DLL.  The only caching of file information you
can rely upon is that of the kernel.
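
The invalidation problem can be shown with a toy model.  This is a
sketch under stated assumptions: the "directory" is just a Python set,
the probe order is assumed, and CrossCallCache is a hypothetical class
invented for the example, not something Cygwin implements.

```python
def probe(directory, name, suffixes=("", ".exe", ".lnk")):
    """Return the first existing candidate name, or None."""
    for suffix in suffixes:
        if name + suffix in directory:
            return name + suffix
    return None


class CrossCallCache:
    """User-space cache that remembers resolutions across calls."""

    def __init__(self, directory):
        self.directory = directory  # stands in for the real filesystem
        self.resolved = {}

    def resolve(self, name):
        # Only the first lookup touches the "filesystem".
        if name not in self.resolved:
            self.resolved[name] = probe(self.directory, name)
        return self.resolved[name]


directory = {"foo.lnk"}
cache = CrossCallCache(directory)
first = cache.resolve("foo")    # resolves via the .lnk file

# Another process replaces the link with a plain file.  A user-space
# DLL in our process gets no notification of this change.
directory.remove("foo.lnk")
directory.add("foo")

stale = cache.resolve("foo")    # cached answer, no longer exists
fresh = probe(directory, "foo")  # what a real lookup would find now
print(first, stale, fresh)
```

A cache that lives only for the duration of one POSIX call, as suggested
in the quoted text, narrows this window but saves correspondingly fewer
lookups.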


Corinna