From: Doug Henderson
Date: Fri, 25 Jun 2021 23:33:12 -0600
Subject: Re: Cygwin, Unicode and "long" path names
To: cygwin

On Fri, 25 Jun 2021 at 19:55, Vadim wrote:
>
> Ah, this beautiful topic. Windows 7 x64.
>
> This is the summary written as post-scriptum, tests and findings below:
>
> 1) Cygwin limits individual names to 255 bytes, Windows seems to follow
> UTF-16 chars and work fine: 256 bytes in 108 characters works.
>
> Basically, this becomes a bytes vs characters story.
>
> 2) Bash file name auto-expansion detects the file of that name, but it
> gets truncated to 255 bytes. find's behaviour is the same ("No such file
> or directory" due to trying to access a non-existing truncated name)
>
> 2.1) If you try to correct the above mistake by adding truncated
> characters, then the program (cat) will complain about "File name too long"
>
> 2.2) If there exists a folder with a 255-byte name, equal to the
> truncated name, then "find ." will do a listing on that folder twice
> (effectively hiding the long-named folder from tools without leaving an
> error message)
>
> 3) UNC Paths get the same treatment: File name too long.
>
> I expected Cygwin to handle these names without problems just like
> Windows, Explorer, cmd etc. do. Is this particular problem new or known?
> All I could find on the mailing list is around the time when Cygwin
> hadn't yet implemented Unicode support (UTF-8?), ~2004-2008.
>
> These names were created by youtube-dl.exe executed from within Cygwin.
>
> - Vadim

I believe this is the result of the difference between Pascal-type
strings, which have a length-byte followed by data-bytes, and C-type
strings, which have data-bytes followed by a zero-byte or, worse, in the
case of two-byte characters, data-words followed by a zero-word.
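
To make the "256 bytes in 108 characters" case from point 1 of the quoted
summary concrete, here is a small Python 3 sketch. The file name is made
up for illustration (74 CJK characters plus 34 ASCII characters), and the
255 limit is simply the per-name figure reported in the quoted message. It
only counts lengths and is not a claim about what Cygwin does internally.

```python
# Hypothetical file name: 74 CJK characters (3 UTF-8 bytes each)
# followed by 34 ASCII characters (1 UTF-8 byte each).
name = "\u4e2d" * 74 + "a" * 30 + ".mp4"

LIMIT = 255  # per-name limit reported in the quoted message

chars = len(name)                                 # Unicode code points
utf8_bytes = len(name.encode("utf-8"))            # length when counted in UTF-8 bytes
utf16_units = len(name.encode("utf-16-le")) // 2  # length in 16-bit UTF-16 units

print(chars, utf8_bytes, utf16_units)  # prints: 108 256 108
print(utf8_bytes > LIMIT)              # True: one byte over a 255-byte limit
print(utf16_units <= LIMIT)            # True: well inside 255 UTF-16 units
```

The same name is comfortably short when counted in UTF-16 units, yet one
byte too long when counted in UTF-8 bytes, which is the bytes vs
characters story in miniature.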

For single-byte characters, both the Pascal and C styles use 256 bytes
for a 255-character string. Applying a 255-byte length limit without
accounting for the trailing zero-byte could explain some of the observed
problem.

More likely, the problem relates to multi-byte character sets. 255 bytes
of UTF-16 characters, or more likely 255 bytes of MBCS (multi-byte
character set) or DBCS (double-byte character set) characters, can encode
to more or fewer than 255 UTF-8 bytes, depending on the average number of
bytes per character in the UTF-8 encoding. The same applies to 510 bytes
containing 255 words of DBCS characters. This could account for the
failure to handle all bytes of the NTFS filename when it is converted to
UTF-8: Linux programs ported to Cygwin may fail to allocate a large enough
encoding buffer, leading to the observed truncation.

Youtube-dl.exe is basically a Windows Python 3 program with C extensions.
Python 3 properly handles Unicode and the encoding and decoding of the
aforementioned character encodings.

I would look for the library functions that decode NTFS file names into
UTF-8 names, verify their correctness, and follow how their output is
used through the system. I think this will mean that the Windows 255-byte
limit cannot be used at all in any Cygwin program that handles
international file names. Unfortunately, that sounds like a lot of work.

In theory, if all 255 characters in a filename component required 4-byte
UTF-8 encodings, the name would need about 1024 bytes (255 * 4 = 1020).
However, this does not even touch on emojis, where a one-character emoji
can expand to as much as 35 or so bytes! That basically means the end of
static allocation for file and directory names and name component
buffers. That may be a major job in the Cygwin kernel, not to mention all
the available packages!

HTH
Doug

-- 
Doug Henderson, Calgary, Alberta, Canada - from gmail.com