public inbox for cygwin@cygwin.com
* Can not stat file with utf char U+F020
@ 2023-04-14 17:53 Gionatan Danti
  2023-04-14 19:00 ` Corinna Vinschen
  0 siblings, 1 reply; 19+ messages in thread
From: Gionatan Danti @ 2023-04-14 17:53 UTC (permalink / raw)
  To: cygwin

Dear list,
I have an issue with unreadable files which contain the UTF char U+F020
(which appears as "middle dot with some space after") in their name.

stat on such a file results in "no such file or directory"

From here [1] it seems that a patch was contemplated many years ago, but
I don't know its status now.

Any ideas or workaround?
Thanks.

[1] https://sourceware.org/legacy-ml/cygwin/2009-11/msg00043.html

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Can not stat file with utf char U+F020
  2023-04-14 17:53 Can not stat file with utf char U+F020 Gionatan Danti
@ 2023-04-14 19:00 ` Corinna Vinschen
  2023-04-14 19:54   ` Brian Inglis
  2023-04-14 20:17   ` Gionatan Danti
  0 siblings, 2 replies; 19+ messages in thread
From: Corinna Vinschen @ 2023-04-14 19:00 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: cygwin

On Apr 14 19:53, Gionatan Danti via Cygwin wrote:
> Dear list,
> I have an issue with unreadable files which contain the UTF char U+F020 (which
> appears as "middle dot with some space after") in their name.
> 
> stat on such a file results in "no such file or directory"
> 
> From here [1] it seems that a patch was contemplated many years ago, but I
> don't know its status now.
> 
> Any ideas or workaround?

There's no (good) solution from inside Cygwin.

Keep in mind that the Unicode area from U+E000 up to U+F8FF is called
"Private Use Area".  So none of the chars are mapped into any
singlebyte, doublebyte, or multibyte charset.  Typically we don't expect
that filenames contain any of these chars, and we're only using a very
small subset of them for our own, dubious purposes anyway:

https://cygwin.com/cygwin-ug-net/using-specialnames.html#pathnames-specialchars
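
As a rough illustration of the range described above, the following Python
sketch flags Private Use Area code points in a file name.  The helper name
`pua_chars` and the sample name are made up for the example; only the
U+E000..U+F8FF bounds come from the text.

```python
# Bounds of the BMP Private Use Area as cited above: U+E000..U+F8FF.
PUA_START, PUA_END = 0xE000, 0xF8FF


def pua_chars(name: str) -> list[str]:
    """Return the code points in `name` that fall inside the BMP PUA."""
    return [f"U+{ord(ch):04X}" for ch in name if PUA_START <= ord(ch) <= PUA_END]


# Hypothetical file name containing U+F020 before the extension.
print(pua_chars("report\uf020.doc"))
```

A check like this could be used to find the affected files before deciding
whether to rename them.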

> [1] https://sourceware.org/legacy-ml/cygwin/2009-11/msg00043.html

While this patch would have fixed your problem, a later followup patch
broke your usage of U+F020 (space replacement) and, FWIW, of U+F02E
(dot replacement) again:

https://cygwin.com/cgit/newlib-cygwin/commit/?id=8802178fddfd

This was done to accommodate filesystems implementing the idiotic
approach to support only DOS filenames, i.e., not allowing leading or
trailing spaces and not allowing trailing dots.  These are NetApp and
Novell NetWare filesystems.  See the last paragraph of

https://cygwin.com/cygwin-ug-net/using-specialnames.html#pathnames-specialchars

Any chance you can just rename the files?


Corinna


* Re: Can not stat file with utf char U+F020
  2023-04-14 19:00 ` Corinna Vinschen
@ 2023-04-14 19:54   ` Brian Inglis
  2023-04-14 20:20     ` Corinna Vinschen
  2023-04-14 20:21     ` Gionatan Danti
  2023-04-14 20:17   ` Gionatan Danti
  1 sibling, 2 replies; 19+ messages in thread
From: Brian Inglis @ 2023-04-14 19:54 UTC (permalink / raw)
  To: cygwin; +Cc: Gionatan Danti

On 2023-04-14 13:00, Corinna Vinschen via Cygwin wrote:
> On Apr 14 19:53, Gionatan Danti via Cygwin wrote:
>> I have an issue with unreadable files which contain the UTF char U+F020 (which
>> appears as "middle dot with some space after") in their name.
>> stat on such a file results in "no such file or directory"
>> From here [1] it seems that a patch was contemplated many years ago, but I
>> don't know its status now.
>> Any ideas or workaround?

> There's no (good) solution from inside Cygwin.
> Keep in mind that the Unicode area from U+E000 up to U+F8FF is called
> "Private Use Area".  So none of the chars are mapped into any
> singlebyte, doublebyte, or multibyte charset.  Typically we don't expect
> that filenames contain any of these chars, and we're only using a very
> small subset of them for our own, dubious purposes anyway:
> https://cygwin.com/cygwin-ug-net/using-specialnames.html#pathnames-specialchars

>> [1] https://sourceware.org/legacy-ml/cygwin/2009-11/msg00043.html

> While this patch would have fixed your problem, a later followup patch
> broke your usage of U+F020 (space replacement) and, FWIW, of U+F02E
> (dot replacement) again:
> 	https://cygwin.com/cgit/newlib-cygwin/commit/?id=8802178fddfd
> This was done to accommodate filesystems implementing the idiotic
> approach to support only DOS filenames, i.e. not allowing leading or
> trailing spaces and not allowing trailing dots. These are Netapp and
> Novell Netware filesystems. See the last paragraph of
> https://cygwin.com/cygwin-ug-net/using-specialnames.html#pathnames-specialchars
> Any chance you can just rename the files?

The Under-ConScript Unicode Registry (UCSUR) and its predecessor, the ConScript
Unicode Registry (CSUR),

	https://www.kreativekorp.com/ucsur/

	http://www.evertype.com/standards/csur/

unofficially register Unicode PUA glyphs for academic, artificial, constructed, 
historical, invented, and minority language scripts, some of which have made it 
into Unicode e.g.

	Script		CSUR		Unicode
	PHAISTOS DISC	U+E6D0-U+E6FF	U+101D0-U+101FF
	SHAVIAN		U+E700-U+E72F	U+10450-U+1047F
	DESERET		U+E830-U+E88F	U+10400-U+1044F

and maintain their own Unidata e.g.

	https://www.kreativekorp.com/ucsur/UNIDATA/Blocks.txt

and some Unicode fonts have -CSUR addition files (like -Italic etc.) that 
support BMP and SMP PUA glyphs.

For Cygwin purposes:

F000−F7FF	unassigned	Reserved for hacks and corporate use

so Cygwin's special Windows file name character mappings are clear:

	F022	"
	F02A	*
	F03A	:
	F03C	<
	F03E	>
	F03F	?
	F07C	|

-- 
Take care. Thanks, Brian Inglis              Calgary, Alberta, Canada

La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer     but when there is no more to cut
                                 -- Antoine de Saint-Exupéry


* Re: Can not stat file with utf char U+F020
  2023-04-14 19:00 ` Corinna Vinschen
  2023-04-14 19:54   ` Brian Inglis
@ 2023-04-14 20:17   ` Gionatan Danti
  2023-04-14 20:40     ` Corinna Vinschen
  2023-04-15  5:10     ` Brian Inglis
  1 sibling, 2 replies; 19+ messages in thread
From: Gionatan Danti @ 2023-04-14 20:17 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: cygwin

On 2023-04-14 21:00 Corinna Vinschen wrote:
> There's no (good) solution from inside Cygwin.
> [snip]

Yeah, I can only imagine how difficult it is to be compatible with POSIX,
Win32 and the like.

> Any chance you can just rename the files?

I renamed the files, in fact.

However, it seems that users working with (older?) Office for Mac use
U+F020 more frequently than I expected, maybe because of this [1]:

"Microsoft's defunct Services For Macintosh feature used U+F001 through 
U+F029 as replacements for special characters allowed in HFS but 
forbidden in NTFS, and U+F02A for the Apple logo."

Any chance of enabling a "bypass" for these characters (excluding the ones
you reserved for compatibility, as detailed in "Forbidden
characters in filenames")? Maybe hidden behind a configurable option
(even disabled by default), so as not to interfere with the current
behavior?

Thanks.

[1] https://en.wikipedia.org/wiki/Private_Use_Areas#Vendor_use


-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Can not stat file with utf char U+F020
  2023-04-14 19:54   ` Brian Inglis
@ 2023-04-14 20:20     ` Corinna Vinschen
  2023-04-14 20:21     ` Gionatan Danti
  1 sibling, 0 replies; 19+ messages in thread
From: Corinna Vinschen @ 2023-04-14 20:20 UTC (permalink / raw)
  To: Brian Inglis via Cygwin; +Cc: Brian Inglis, Gionatan Danti

On Apr 14 13:54, Brian Inglis via Cygwin wrote:
> On 2023-04-14 13:00, Corinna Vinschen via Cygwin wrote:
> > On Apr 14 19:53, Gionatan Danti via Cygwin wrote:
> > > [1] https://sourceware.org/legacy-ml/cygwin/2009-11/msg00043.html
> 
> > While this patch would have fixed your problem, a later followup patch
> > broke your usage of U+F020 (space replacement) and, FWIW, of U+F02E
> > (dot replacement) again:
> > 	https://cygwin.com/cgit/newlib-cygwin/commit/?id=8802178fddfd
> > This was done to accommodate filesystems implementing the idiotic
> > approach to support only DOS filenames, i.e. not allowing leading or
> > trailing spaces and not allowing trailing dots. These are Netapp and
> > Novell Netware filesystems. See the last paragraph of
> > https://cygwin.com/cygwin-ug-net/using-specialnames.html#pathnames-specialchars
> > Any chance you can just rename the files?
> 
> UCSUR Under-ConScript Unicode Registry and its predecessor ConScript Unicode
> Registry CSUR
> 
> 	https://www.kreativekorp.com/ucsur/
> 
> 	http://www.evertype.com/standards/csur/
> 
> unofficially register Unicode PUA glyphs for academic, artificial,
> constructed, historical, invented, and minority language scripts, some of
> which have made it into Unicode e.g.
> 
> 	Script		CSUR		Unicode
> 	PHAISTOS DISC	U+E6D0-U+E6FF	U+101D0-U+101FF
> 	SHAVIAN		U+E700-U+E72F	U+10450-U+1047F
> 	DESERET		U+E830-U+E88F	U+10400-U+1044F
> 
> and maintain their own Unidata e.g.
> 
> 	https://www.kreativekorp.com/ucsur/UNIDATA/Blocks.txt
> 
> and some Unicode fonts have -CSUR addition files (like -Italic etc.) that
> support BMP and SMP PUA glyphs.
> 
> For Cygwin purposes:
> 
> F000−F7FF	unassigned	Reserved for hacks and corporate use
> 
> so Cygwin's special Windows file name characters mappings are clear:
> 
For completeness' sake, starting with commit 8802178fddfd:

        F020    <space>
> 	F022	"
> 	F02A	*
        F02E    .
> 	F03A	:
> 	F03C	<
> 	F03E	>
> 	F03F	?
> 	F07C	|
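
The list above follows a simple pattern: each replacement code point is the
ASCII code of the character plus 0xF000.  A small Python sketch of that
arithmetic, assuming nothing beyond the nine entries listed (the names
`SPECIAL` and `TO_PUA` are illustrative, not Cygwin identifiers):

```python
# The nine ASCII characters from the list above: space, ", *, ., :, <, >, ?, |
SPECIAL = ' "*.:<>?|'

# Each one is moved into the Private Use Area by adding 0xF000 to its code.
TO_PUA = {ch: chr(0xF000 + ord(ch)) for ch in SPECIAL}

for ch, pua in TO_PUA.items():
    print(f"{ord(pua):04X}  {ch!r}")
```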


Corinna


* Re: Can not stat file with utf char U+F020
  2023-04-14 19:54   ` Brian Inglis
  2023-04-14 20:20     ` Corinna Vinschen
@ 2023-04-14 20:21     ` Gionatan Danti
  2023-04-14 20:25       ` Corinna Vinschen
  1 sibling, 1 reply; 19+ messages in thread
From: Gionatan Danti @ 2023-04-14 20:21 UTC (permalink / raw)
  To: cygwin; +Cc: Brian Inglis

On 2023-04-14 21:54 Brian Inglis wrote:
> UCSUR Under-ConScript Unicode Registry and its predecessor ConScript
> Unicode Registry CSUR
> 
> 	https://www.kreativekorp.com/ucsur/
> 
> 	http://www.evertype.com/standards/csur/
> 
> unofficially register Unicode PUA glyphs for academic, artificial,
> constructed, historical, invented, and minority language scripts, some
> of which have made it into Unicode e.g.
> 
> 	Script		CSUR		Unicode
> 	PHAISTOS DISC	U+E6D0-U+E6FF	U+101D0-U+101FF
> 	SHAVIAN		U+E700-U+E72F	U+10450-U+1047F
> 	DESERET		U+E830-U+E88F	U+10400-U+1044F
> 
> and maintain their own Unidata e.g.
> 
> 	https://www.kreativekorp.com/ucsur/UNIDATA/Blocks.txt
> 
> and some Unicode fonts have -CSUR addition files (like -Italic etc.)
> that support BMP and SMP PUA glyphs.

So they are actively using PUA? I did not know that, thanks.

> For Cygwin purposes:
> 
> F000−F7FF	unassigned	Reserved for hacks and corporate use
> 
> so Cygwin's special Windows file name characters mappings are clear:
> 
> 	F022	"
> 	F02A	*
> 	F03A	:
> 	F03C	<
> 	F03E	>
> 	F03F	?
> 	F07C	|

Would it be possible to "bypass" the chars in the range F000−F7FF that 
are not used/reserved by cygwin?

Regards.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Can not stat file with utf char U+F020
  2023-04-14 20:21     ` Gionatan Danti
@ 2023-04-14 20:25       ` Corinna Vinschen
  2023-04-14 21:01         ` Gionatan Danti
  0 siblings, 1 reply; 19+ messages in thread
From: Corinna Vinschen @ 2023-04-14 20:25 UTC (permalink / raw)
  To: cygwin

On Apr 14 22:21, Gionatan Danti via Cygwin wrote:
> On 2023-04-14 21:54 Brian Inglis wrote:
> > UCSUR Under-ConScript Unicode Registry and its predecessor ConScript
> > Unicode Registry CSUR
> > 
> > 	https://www.kreativekorp.com/ucsur/
> > 
> > 	http://www.evertype.com/standards/csur/
> > 
> > unofficially register Unicode PUA glyphs for academic, artificial,
> > constructed, historical, invented, and minority language scripts, some
> > of which have made it into Unicode e.g.
> > 
> > 	Script		CSUR		Unicode
> > 	PHAISTOS DISC	U+E6D0-U+E6FF	U+101D0-U+101FF
> > 	SHAVIAN		U+E700-U+E72F	U+10450-U+1047F
> > 	DESERET		U+E830-U+E88F	U+10400-U+1044F
> > 
> > and maintain their own Unidata e.g.
> > 
> > 	https://www.kreativekorp.com/ucsur/UNIDATA/Blocks.txt
> > 
> > and some Unicode fonts have -CSUR addition files (like -Italic etc.)
> > that support BMP and SMP PUA glyphs.
> 
> So they are actively using PUA? I did not know that, thanks.
> 
> > For Cygwin purposes:
> > 
> > F000−F7FF	unassigned	Reserved for hacks and corporate use
> > 
> > so Cygwin's special Windows file name characters mappings are clear:
> > 
> > 	F022	"
> > 	F02A	*
> > 	F03A	:
> > 	F03C	<
> > 	F03E	>
> > 	F03F	?
> > 	F07C	|
> 
> Would it be possible to "bypass" the chars in the range F000−F7FF that are
> not used/reserved by cygwin?

We do that.  You're just stumbling over the fact that U+F020 is also
used as outlined in 
https://cygwin.com/cygwin-ug-net/using-specialnames.html#pathnames-specialchars
and https://cygwin.com/pipermail/cygwin/2023-April/253478.html


Corinna


* Re: Can not stat file with utf char U+F020
  2023-04-14 20:17   ` Gionatan Danti
@ 2023-04-14 20:40     ` Corinna Vinschen
  2023-04-14 20:51       ` Gionatan Danti
  2023-04-15  5:10     ` Brian Inglis
  1 sibling, 1 reply; 19+ messages in thread
From: Corinna Vinschen @ 2023-04-14 20:40 UTC (permalink / raw)
  To: Gionatan Danti, Brian Inglis, cygwin

On Apr 14 22:17, Gionatan Danti via Cygwin wrote:
> On 2023-04-14 21:00 Corinna Vinschen wrote:
> > There's no (good) solution from inside Cygwin.
> > [snip]
> 
> Yeah, I can only imagine how difficult is to be compatible with posix, win32
> and the likes.
> 
> > Any chance you can just rename the files?
> 
> I renamed the files, in fact.
> 
> However, it seems that users working with (older?) Office for MAC use U+F020
> more frequently than I expected, maybe because of that [1]:
> 
> "Microsoft's defunct Services For Macintosh feature used U+F001 through
> U+F029 as replacements for special characters allowed in HFS but forbidden
> in NTFS, and U+F02A for the Apple logo."

Drat.  This is kind of sick.  At the same time, Interix used the
U+F0xx area as we do.  That's why I chose this area, to be filename
compatible with Interix.

> Any chances to enable a "bypass" for these characters (excluding the one you
> reserved for compatibility as explained detailed in the "Forbidden
> characters in filenames")? Maybe hidden behind a configurable option (even
> disabled by default), so to not interfere with the current behavior?

This is really tricky.  A new mount point flag could be used to override
this behaviour on a per path basis.  One problem is, the unicode ->
multibyte conversion when evaluating a symlink is done before it's clear
where the symlink target is.  Only the string is converted and it might
be a relative path, so the code doesn't know where the target ends up.
And that's probably not all.

Is it really worth adding code to support a long deprecated Windows
service?


Corinna


* Re: Can not stat file with utf char U+F020
  2023-04-14 20:40     ` Corinna Vinschen
@ 2023-04-14 20:51       ` Gionatan Danti
  0 siblings, 0 replies; 19+ messages in thread
From: Gionatan Danti @ 2023-04-14 20:51 UTC (permalink / raw)
  To: Gionatan Danti, Brian Inglis, cygwin

On 2023-04-14 22:40 Corinna Vinschen wrote:
> This is really tricky.  A new mount point flag could be used to 
> override
> this behaviour on a per path basis.  One problem is, the unicode ->
> multibyte conversion when evaluating a symlink is done before it's 
> clear
> where the symlink target is.  Only the string is converted and it might
> be a relative path, so the code doesn't know where the target ends up.
> And that's probably not all.

To tell the truth, it is such a corner (and unfortunate) case that I
would not care if the workaround does not work for symlinks.

> Is it really worth to add code to support a long deprecated Windows
> service?

Yeah, I understand your point. I am not in a position to evaluate whether
it would be worth it.
Maybe a special case for only U+F020 (the most common "strange" char I 
see) can be considered?

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Can not stat file with utf char U+F020
  2023-04-14 20:25       ` Corinna Vinschen
@ 2023-04-14 21:01         ` Gionatan Danti
  2023-04-17  5:36           ` Gionatan Danti
  0 siblings, 1 reply; 19+ messages in thread
From: Gionatan Danti @ 2023-04-14 21:01 UTC (permalink / raw)
  To: cygwin; +Cc: Corinna Vinschen

On 2023-04-14 22:25 Corinna Vinschen via Cygwin wrote:
> We do that.  You're just stumbling over the fact that U+F020 is also
> used as outlined in
> https://cygwin.com/cygwin-ug-net/using-specialnames.html#pathnames-specialchars
> and https://cygwin.com/pipermail/cygwin/2023-April/253478.html

Ah, so spaces and dots are replaced respectively by U+F020 and U+F02E 
even without the "dos" mount option?
Because I can not see it in my case of an NTFS filesystem with the 
following mount options: binary,posix=0,user,noumount,auto

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Can not stat file with utf char U+F020
  2023-04-14 20:17   ` Gionatan Danti
  2023-04-14 20:40     ` Corinna Vinschen
@ 2023-04-15  5:10     ` Brian Inglis
  2023-04-17  9:10       ` Corinna Vinschen
  1 sibling, 1 reply; 19+ messages in thread
From: Brian Inglis @ 2023-04-15  5:10 UTC (permalink / raw)
  To: cygwin; +Cc: Gionatan Danti

On 2023-04-14 14:17, Gionatan Danti via Cygwin wrote:
> On 2023-04-14 21:00 Corinna Vinschen wrote:
>> There's no (good) solution from inside Cygwin.

> Yeah, I can only imagine how difficult is to be compatible with posix, win32 and 
> the likes.

>> Any chance you can just rename the files?

> I renamed the files, in fact.
> However, it seems that users working with (older?) Office for MAC use U+F020 
> more frequently than I expected, maybe because of that [1]:
> "Microsoft's defunct Services For Macintosh feature used U+F001 through U+F029 
> as replacements for special characters allowed in HFS but forbidden in NTFS, and 
> U+F02A for the Apple logo."
> Any chances to enable a "bypass" for these characters (excluding the one you 
> reserved for compatibility as explained detailed in the "Forbidden characters in 
> filenames")? Maybe hidden behind a configurable option (even disabled by 
> default), so to not interfere with the current behavior?

> [1] https://en.wikipedia.org/wiki/Private_Use_Areas#Vendor_use

Now if MS SfM and Cygwin had both registered with U/CSUR, they would not be 
fighting over Unicode code points, although it looks like there is a lot of 
competition for the code points! ;^>

Would it make more sense to add custom file name character filters into some 
utility, such as unix2dos/mac2unix, cygpath, or some other, and add (Cyg)win, or 
create such a utility, so those could be added to processes?

-- 
Take care. Thanks, Brian Inglis              Calgary, Alberta, Canada

La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer     but when there is no more to cut
                                 -- Antoine de Saint-Exupéry


* Re: Can not stat file with utf char U+F020
  2023-04-14 21:01         ` Gionatan Danti
@ 2023-04-17  5:36           ` Gionatan Danti
  2023-04-17  9:05             ` Corinna Vinschen
  0 siblings, 1 reply; 19+ messages in thread
From: Gionatan Danti @ 2023-04-17  5:36 UTC (permalink / raw)
  To: cygwin; +Cc: Corinna Vinschen

On 2023-04-14 23:01 Gionatan Danti via Cygwin wrote:
> On 2023-04-14 22:25 Corinna Vinschen via Cygwin wrote:
>> We do that.  You're just stumbling over the fact that U+F020 is also
>> used as outlined in
>> https://cygwin.com/cygwin-ug-net/using-specialnames.html#pathnames-specialchars
>> and https://cygwin.com/pipermail/cygwin/2023-April/253478.html
> 
> Ah, so spaces and dots are replaced respectively by U+F020 and U+F02E
> even without the "dos" mount option?
> Because I can not see it in my case of an NTFS filesystem with the
> following mount options: binary,posix=0,user,noumount,auto

Hi all,
it's not clear to me why even without the "dos" mount option both space 
and dot are replaced by U+F020 and U+F02E, preventing U+F020 
passthrough.

Am I missing something?

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Can not stat file with utf char U+F020
  2023-04-17  5:36           ` Gionatan Danti
@ 2023-04-17  9:05             ` Corinna Vinschen
  2023-04-17 10:58               ` Andrey Repin
  2023-04-17 13:46               ` Gionatan Danti
  0 siblings, 2 replies; 19+ messages in thread
From: Corinna Vinschen @ 2023-04-17  9:05 UTC (permalink / raw)
  To: Gionatan Danti, cygwin

On Apr 17 07:36, Gionatan Danti via Cygwin wrote:
> On 2023-04-14 23:01 Gionatan Danti via Cygwin wrote:
> > On 2023-04-14 22:25 Corinna Vinschen via Cygwin wrote:
> > > We do that.  You're just stumbling over the fact that U+F020 is also
> > > used as outlined in
> > > https://cygwin.com/cygwin-ug-net/using-specialnames.html#pathnames-specialchars
> > > and https://cygwin.com/pipermail/cygwin/2023-April/253478.html
> > 
> > Ah, so spaces and dots are replaced respectively by U+F020 and U+F02E
> > even without the "dos" mount option?
> > Because I can not see it in my case of an NTFS filesystem with the
> > following mount options: binary,posix=0,user,noumount,auto
> 
> Hi all,
> it's not clear to me why even without the "dos" mount option both space and
> dot are replaced by U+F020 and U+F02E, preventing U+F020 passthrough.
> 
> Am I missing something?

It's actually not the "dos" mount option but specific filesystems
which trigger the conversion from U+0020 to U+F020.

However, the conversion back is handled in a piece of code which has
no information about the underlying filesystem, so the F0xx -> 00xx
conversion is done all the time.  Adding filesystem info in this
place is really tricky.
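
The asymmetry can be modelled in a few lines of Python.  This is a
simplified sketch, not Cygwin's actual code: `name_from_disk` is a
hypothetical stand-in for the unconditional F0xx -> 00xx step, restricted
to the two characters discussed here, and the sample file name is invented.

```python
# Hypothetical model of the unconditional F0xx -> 00xx step for the two
# characters discussed in this thread; not Cygwin's real implementation.
REVERSE = {"\uf020": " ", "\uf02e": "."}


def name_from_disk(stored: str) -> str:
    """Map U+F020/U+F02E back to space/dot, wherever they occur."""
    return "".join(REVERSE.get(ch, ch) for ch in stored)


# A file created by other software on plain NTFS, with U+F020 in its name.
on_disk = {"budget\uf020.xls"}

listed = name_from_disk("budget\uf020.xls")
print(repr(listed))
# On a filesystem where no forward conversion happens, the listed name is
# what gets looked up again, and it no longer matches the stored name.
print(listed in on_disk)
```

In this model the directory listing yields a name with a plain space, which
does not exist on disk, so a later lookup of that name fails.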


Corinna


* Re: Can not stat file with utf char U+F020
  2023-04-15  5:10     ` Brian Inglis
@ 2023-04-17  9:10       ` Corinna Vinschen
  0 siblings, 0 replies; 19+ messages in thread
From: Corinna Vinschen @ 2023-04-17  9:10 UTC (permalink / raw)
  To: cygwin; +Cc: Brian Inglis, Gionatan Danti

On Apr 14 23:10, Brian Inglis via Cygwin wrote:
> On 2023-04-14 14:17, Gionatan Danti via Cygwin wrote:
> > On 2023-04-14 21:00 Corinna Vinschen wrote:
> > > There's no (good) solution from inside Cygwin.
> 
> > Yeah, I can only imagine how difficult is to be compatible with posix,
> > win32 and the likes.
> 
> > > Any chance you can just rename the files?
> 
> > I renamed the files, in fact.
> > However, it seems that users working with (older?) Office for MAC use
> > U+F020 more frequently than I expected, maybe because of that [1]:
> > "Microsoft's defunct Services For Macintosh feature used U+F001 through
> > U+F029 as replacements for special characters allowed in HFS but
> > forbidden in NTFS, and U+F02A for the Apple logo."
> > Any chances to enable a "bypass" for these characters (excluding the one
> > you reserved for compatibility as explained detailed in the "Forbidden
> > characters in filenames")? Maybe hidden behind a configurable option
> > (even disabled by default), so to not interfere with the current
> > behavior?
> 
> > [1] https://en.wikipedia.org/wiki/Private_Use_Areas#Vendor_use
> 
> Now if MS SfM and Cygwin had both registered with U/CSUR, they would not be
> fighting over Unicode code points, although it looks like there is a lot of
> competition for the code points! ;^>
> 
> Would it make more sense to add custom file name character filters into some
> utility, such as unix2dos/mac2unix, cygpath, or some other, and add
> (Cyg)win, or create such a utility, so those could be added to processes?

Adding this to some utility would make more sense than adding another
complication into the Cygwin codebase to support really old stuff.
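
Such a utility could be as small as the following Python sketch, which
replaces U+F020 in file names below a directory.  Everything here is an
assumption for illustration: the function names, the `_` replacement, and
the expectation that it is run with a native Windows Python, since the
thread indicates that Cygwin itself cannot stat these files.  It defaults
to a dry run.

```python
import os

PUA_SPACE = "\uf020"  # the character reported in this thread


def plan_renames(root, replacement="_"):
    """Yield (old_path, new_path) for entries whose name contains U+F020."""
    # Walk bottom-up so that children are handled before their parent
    # directory is renamed and their paths stay valid.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            if PUA_SPACE in name:
                new_name = name.replace(PUA_SPACE, replacement)
                yield os.path.join(dirpath, name), os.path.join(dirpath, new_name)


def apply_renames(root, replacement="_", dry_run=True):
    """Print the planned renames; perform them only if dry_run is False."""
    count = 0
    for old, new in plan_renames(root, replacement):
        print(f"{old!r} -> {new!r}")
        if not dry_run:
            if os.path.exists(new):
                print("  skipped: target already exists")
                continue
            os.rename(old, new)
        count += 1
    return count
```

A call such as `apply_renames(r"C:\some\share")` (a placeholder path) would
only list the candidates; passing `dry_run=False` performs the renames.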


Corinna


* Re: Can not stat file with utf char U+F020
  2023-04-17  9:05             ` Corinna Vinschen
@ 2023-04-17 10:58               ` Andrey Repin
  2023-04-17 13:46               ` Gionatan Danti
  1 sibling, 0 replies; 19+ messages in thread
From: Andrey Repin @ 2023-04-17 10:58 UTC (permalink / raw)
  To: Corinna Vinschen via Cygwin, cygwin

Greetings, Corinna Vinschen via Cygwin!

> On Apr 17 07:36, Gionatan Danti via Cygwin wrote:
>> On 2023-04-14 23:01 Gionatan Danti via Cygwin wrote:
>> > On 2023-04-14 22:25 Corinna Vinschen via Cygwin wrote:
>> > > We do that.  You're just stumbling over the fact that U+F020 is also
>> > > used as outlined in
>> > > https://cygwin.com/cygwin-ug-net/using-specialnames.html#pathnames-specialchars
>> > > and https://cygwin.com/pipermail/cygwin/2023-April/253478.html
>> > 
>> > Ah, so spaces and dots are replaced respectively by U+F020 and U+F02E
>> > even without the "dos" mount option?
>> > Because I can not see it in my case of an NTFS filesystem with the
>> > following mount options: binary,posix=0,user,noumount,auto
>> 
>> Hi all,
>> it's not clear to me why even without the "dos" mount option both space and
>> dot are replaced by U+F020 and U+F02E, preventing U+F020 passthrough.
>> 
>> Am I missing something?

> It's actually not the "dos" mount option but specific filesystems
> which trigger the conversion from U+0020 to U+F020.

> However, the conversion back is handled in a piece of code which has
> no information about the underlying filesystem, so the F0xx -> 00xx
> conversion is done all the time.  Adding filesystem info in this
> place is really tricky.

My understanding is that on Windows, a regular file name can't start or end
with a space, and can't end with a dot.
There are ways to game this rule, but in simple cases this is how it works for
the most part.
If a similar rule can be crafted for the filesystems under discussion, that could
simplify the problem.


-- 
With best regards,
Andrey Repin
Monday, April 17, 2023 13:53:57

Sorry for my terrible english...



* Re: Can not stat file with utf char U+F020
  2023-04-17  9:05             ` Corinna Vinschen
  2023-04-17 10:58               ` Andrey Repin
@ 2023-04-17 13:46               ` Gionatan Danti
  2023-04-18 21:09                 ` Gionatan Danti
  1 sibling, 1 reply; 19+ messages in thread
From: Gionatan Danti @ 2023-04-17 13:46 UTC (permalink / raw)
  To: cygwin; +Cc: Corinna Vinschen

On 2023-04-17 11:05 Corinna Vinschen wrote:
> It's actually not the "dos" mount option but specific filesystems
> which trigger the conversion from U+0020 to U+F020.

OK.

> However, the conversion back is handled in a piece of code which has
> no information about the underlying filesystem, so the F0xx -> 00xx
> conversion is done all the time.  Adding filesystem info in this
> place is really tricky.

Ah, I missed it, thanks! With this new information, I made some
progress.

First, I used the "dos" mount option to always trigger conversion of a
space or dot at the end of a filename into U+F0xx chars. Now I am able to
create such a strange-looking file (as seen in Explorer) from within cygwin
itself. For example, touch "zzs " now results in "zzs+strangechar" in
Explorer. Both cygwin and Windows are able to read/write such a file.

But if I edit the filename via Explorer, adding an extension (i.e. from
"zzs+strangechar" to "zzs+strangechar.txt"), cygwin is suddenly
unable to read/write the file.

It seems to me that the appended chars prevent cygwin from translating
F0xx back to 00xx (as the PUA char is not at the end of the filename
anymore).

So, two paths should be available:
- always translate back F0xx to 00xx even if not at the end of filename;
- otherwise, if that is too invasive to do unconditionally, add an option such as
"always_translate_pua" (default: off) to enable such behavior based on
user needs.

I would (naively?) think that option 1 (always translate back PUA) 
should be the preferred approach, as cygwin is at the moment effectively 
unable to access some files.
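
The experiment above can be reproduced in a small Python model.  It is a
sketch under stated assumptions, not Cygwin's code: `to_disk` converts only
trailing spaces and dots (leading spaces and the "." / ".." entries are
ignored for brevity), and `from_disk` converts U+F020/U+F02E back wherever
they occur, which is one reading consistent with the explanation given
earlier in the thread.

```python
PUA = {" ": "\uf020", ".": "\uf02e"}
ASCII = {pua: ch for ch, pua in PUA.items()}


def to_disk(name: str) -> str:
    """Forward mapping (simplified): only trailing spaces/dots are replaced."""
    core = name.rstrip(" .")
    tail = name[len(core):]
    return core + "".join(PUA[ch] for ch in tail)


def from_disk(stored: str) -> str:
    """Reverse mapping: applied to every U+F020/U+F02E, at any position."""
    return "".join(ASCII.get(ch, ch) for ch in stored)


results = {}
for stored in ("zzs\uf020", "zzs\uf020.txt"):
    seen = from_disk(stored)
    results[stored] = (seen, to_disk(seen) == stored)
    print(repr(stored), "->", repr(seen), "round-trips:", results[stored][1])
```

In this model the name ending in U+F020 maps back to itself, while the
renamed one with the ".txt" extension does not, because the space is no
longer trailing when the name is converted for the lookup.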

Regards.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Can not stat file with utf char U+F020
  2023-04-17 13:46               ` Gionatan Danti
@ 2023-04-18 21:09                 ` Gionatan Danti
  2023-04-19  1:10                   ` L A Walsh
  0 siblings, 1 reply; 19+ messages in thread
From: Gionatan Danti @ 2023-04-18 21:09 UTC (permalink / raw)
  To: cygwin

On 2023-04-17 15:46 Gionatan Danti via Cygwin wrote:
> First, I use the "dos" mount option to always trigger conversion of
> space and dot at filename end into U+F0xx chars. Now I am able to
> create such strange-looking file (in Explorer) within cygwin itself.
> For example, touch "zzs " now results in "zzs+strangechar" in
> Explorer. Both cygwin and windows are able to read/write such file.
> 
> But if I edit the filename via Explorer adding an extension (ie: from
> "zzs+strangechar" to "zzs+strangechar.txt") now cygwin is suddenly
> unable to read/write the file.
> 
> It seems to me that the appended chars prevent cygwin to translate
> back F0xx to 00xx (as the PUA char is not at the end of the filename
> anymore).
> 
> So, two paths should be available:
> - always translate back F0xx to 00xx even if not at the end of 
> filename;
> - otherwise, if too invasive to do it unconditionally, add an option
> as "always_translate_pua" (default: off) to enable such behavior based
> on user needs.
> 
> I would (naively?) think that option 1 (always translate back PUA)
> should be the preferred approach, as cygwin is at the moment
> effectively unable to access some files.

Hi all,
any thoughts on the matter? Am I missing something?

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Can not stat file with utf char U+F020
  2023-04-18 21:09                 ` Gionatan Danti
@ 2023-04-19  1:10                   ` L A Walsh
  2023-04-19 11:56                     ` Gionatan Danti
  0 siblings, 1 reply; 19+ messages in thread
From: L A Walsh @ 2023-04-19  1:10 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: cygwin

I'm a bit confused as to which character you are trying to access or
use, since U+F020 is in the Private Use Area (PUA).

Since it's in the PUA, its meaning could differ by application, OS, or
user, no? That is, it has no set definition.

I mean, you can use it in Cygwin to represent some character not usually
permitted in a DOS/Win filename (like :/\, etc.), but it wouldn't have
the same meaning in Windows. Isn't the Private Use Area
application-specific, so that an application can create and use its own
symbol set, even though it wouldn't be portable to another application?

So if you create a character in Cygwin that maps to that area, how would
you expect Windows to know what the character is and how to treat it?

I think characters in the PUA range are used to allow Cygwin filenames
to contain colons, slashes, and quotes, so one wouldn't want Windows to
understand the Cygwin intent, or it would defeat the purpose of using
custom characters to represent filenames that are legal under POSIX but
not under Windows.
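
A minimal sketch of the kind of mapping described above, in Python. It
assumes that each remapped character is stored as the code point 0xF000
plus its ASCII value. The set of remapped characters is illustrative
only, and to_pua/from_pua are hypothetical names, not Cygwin functions.

```python
# Toy model of a PUA filename mapping; not Cygwin's actual code.
PUA_BASE = 0xF000          # assumption: stored code point = 0xF000 + ASCII value
REMAPPED = set('":*<>?|')  # assumption: illustrative set of remapped characters


def to_pua(name: str) -> str:
    """Replace each remapped character with its private-use counterpart."""
    return "".join(chr(PUA_BASE + ord(c)) if c in REMAPPED else c for c in name)


def from_pua(name: str) -> str:
    """Undo to_pua: map private-use counterparts back to ASCII characters."""
    out = []
    for c in name:
        plain = chr(ord(c) - PUA_BASE) if ord(c) >= PUA_BASE else ""
        out.append(plain if plain in REMAPPED else c)
    return "".join(out)


stored = to_pua('a:b?"c"')
print([hex(ord(c)) for c in stored])  # code points of the stored name
print(from_pua(stored))               # round-trips to the original name
```

The stored name only means something to software that knows the same
convention, which is the portability concern raised above.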






* Re: Can not stat file with utf char U+F020
  2023-04-19  1:10                   ` L A Walsh
@ 2023-04-19 11:56                     ` Gionatan Danti
  0 siblings, 0 replies; 19+ messages in thread
From: Gionatan Danti @ 2023-04-19 11:56 UTC (permalink / raw)
  To: L A Walsh; +Cc: cygwin

On 2023-04-19 03:10 L A Walsh wrote:
> I'm a bit confused as to what char you are trying to access/use, as
> U+F020 is in the Private Use area (PUA)
> 
> Since it's in the PUA, it seems its meaning could differ by
> application/OS/User, no?
> I.e. have no set definition
> 
> I mean you can use it in Cygwin to represent some character not
> usually permitted in a DOS/Win filename (like :/\, etc.), but it
> wouldn't have the same meaning in Windows. Isn't the Private Use Area
> application-specific, so that an application can create and use its
> own symbol set, even though it wouldn't be portable to another
> application?

The issue is with any client or application (even cygwin) creating a
filename ending with a dot (or other such chars), which is then replaced
with U+F020. If this file is later renamed by adding some other
characters *after* the replaced dot, it becomes unreadable by cygwin.

Something like this:
- a user creates a file named "project.", forgetting the extension, on a
Windows share;
- the client replaces the dot with U+F020;
- at this point all is good: the file can be read by the client, Windows
and cygwin;
- the user notices the missing extension and renames the file to
"project.txt";
- cygwin now does *not* translate U+F020 back to a dot and is unable to
read the file.
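
The steps above can be sketched as a toy model in Python. This is not
Cygwin's code: it only mimics the behaviour as described in this message.
The code point chosen for the stand-in is an assumption (0xF000 plus the
ASCII value), and all function names are hypothetical.

```python
# Toy model of the behaviour described above; not Cygwin's actual code.
PUA_BASE = 0xF000   # assumption: stored code point = 0xF000 + ASCII value
TRAILING = ". "     # characters that are remapped only at the end of a name


def is_remapped(c: str) -> bool:
    """True if c is the private-use stand-in for a dot or a space."""
    return ord(c) >= PUA_BASE and chr(ord(c) - PUA_BASE) in TRAILING


def encode_trailing(name: str) -> str:
    """Replace a run of trailing dots/spaces with private-use stand-ins."""
    stem = name.rstrip(TRAILING)
    return stem + "".join(chr(PUA_BASE + ord(c)) for c in name[len(stem):])


def decode_trailing_only(name: str) -> str:
    """Translate stand-ins back only when they sit at the end of the name."""
    end = len(name)
    while end > 0 and is_remapped(name[end - 1]):
        end -= 1
    return name[:end] + "".join(chr(ord(c) - PUA_BASE) for c in name[end:])


def decode_always(name: str) -> str:
    """Translate stand-ins back wherever they appear in the name."""
    return "".join(chr(ord(c) - PUA_BASE) if is_remapped(c) else c for c in name)


stored = encode_trailing("project.")   # the client stores a stand-in for the dot
renamed = stored + "txt"               # the user appends the extension
print(decode_trailing_only(stored))              # project.
print(decode_trailing_only(renamed) == renamed)  # True: stand-in left as-is
print(decode_always(renamed))                    # project.txt
```

In this model the renamed file keeps its private-use character under the
trailing-only rule, which is the case where the file becomes unreachable.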

> I think characters in the PUA range are used to allow Cygwin filenames
> to contain colons, slashes, and quotes, so one wouldn't want Windows to
> understand the Cygwin intent, or it would defeat the purpose of using
> custom characters to represent filenames that are legal under POSIX but
> not under Windows.

True, but dots and spaces are somewhat different from the other reserved
chars. While backslashes, colons, etc. are rejected by NTFS itself (or by
a lower-layer API), trailing dots and spaces are ignored/stripped by
Win32. This means that Linux clients accessing an SMB share *can*
successfully create such filenames without any issue and without
replacing them with PUA chars.

For example, I created a file called "zzz." from a Linux+Mate client.
Cygwin correctly sees the filename as:
$ ls "zzz." | od -x --endian=big
0000000 7a7a 7a2e 0a00
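
For reference, the dump above can be reproduced with a short Python
sketch. The helper od_x_big_endian is a hypothetical name and only
handles input shorter than one od row. The second case is an assumption
for contrast: it shows the bytes that would appear if the dot had been
stored as U+F02E (0xF000 + 0x2E) instead.

```python
def od_x_big_endian(data: bytes) -> str:
    """Format bytes as 2-byte hex words, like one `od -x --endian=big` row
    without the leading offset column."""
    if len(data) % 2:
        data += b"\x00"          # an odd final byte is padded with zero
    return " ".join(data[i:i + 2].hex() for i in range(0, len(data), 2))


plain = "zzz.\n".encode("utf-8")      # what `ls` prints for the plain name
print(od_x_big_endian(plain))         # 7a7a 7a2e 0a00

# Assumption: a remapped dot would be U+F02E, i.e. bytes ef 80 ae in UTF-8.
remapped = "zzz\uf02e\n".encode("utf-8")
print(od_x_big_endian(remapped))      # 7a7a 7aef 80ae 0a00
```

The 2e byte in the real dump is a plain ASCII dot, so no PUA character is
involved in that name.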

True, Windows cannot access this file, but this is fine because such a
filename should never be understood by Windows. Since they cannot open
the file from Windows, the users themselves will find and correct the
issue by renaming the file.

As things are now, we have the opposite issue: should a file with a name
such as "zzz[U+F020]txt" exist (for whatever reason), cygwin will not be
able to access it. This means that anyone using cygwin+rsync to back up
a Windows server will now have a file that is inaccessible and
impossible to back up.

Thinking about that: how would you feel about having an option to
exclude trailing dots and spaces from PUA translation (effectively
reverting them to the status of "normal" characters)?
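
As a rough illustration of what such an option could look like, here is
a Python sketch. Everything in it is hypothetical: the function name, the
remap_trailing flag, the set of always-remapped characters, and the
0xF000 + ASCII convention are assumptions, not existing Cygwin code or
mount options.

```python
# Sketch of the proposed option; names and defaults are hypothetical.
PUA_BASE = 0xF000           # assumption: stored code point = 0xF000 + ASCII value
FORBIDDEN = set('":*<>?|')  # assumption: illustrative set, always remapped
TRAILING = ". "             # dots and spaces, remapped only at the end of a name


def to_windows_name(name: str, remap_trailing: bool = True) -> str:
    """Map a POSIX-style name to the name that would be stored.

    With remap_trailing=False, trailing dots and spaces are passed
    through untouched, i.e. treated as "normal" characters.
    """
    stem = name.rstrip(TRAILING) if remap_trailing else name
    tail = name[len(stem):]
    mapped_stem = "".join(
        chr(PUA_BASE + ord(c)) if c in FORBIDDEN else c for c in stem
    )
    mapped_tail = "".join(chr(PUA_BASE + ord(c)) for c in tail)
    return mapped_stem + mapped_tail


print(ascii(to_windows_name("zzz.")))                        # 'zzz\uf02e'
print(ascii(to_windows_name("zzz.", remap_trailing=False)))  # 'zzz.'
print(ascii(to_windows_name("a:b.", remap_trailing=False)))  # 'a\uf03ab.'
```

With the flag off, only the characters that Windows actually rejects are
remapped, so no private-use stand-in for a dot or space is ever written.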

Regards.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
2023-04-14 17:53 Can not stat file with utf char U+F020 Gionatan Danti
2023-04-14 19:00 ` Corinna Vinschen
2023-04-14 19:54   ` Brian Inglis
2023-04-14 20:20     ` Corinna Vinschen
2023-04-14 20:21     ` Gionatan Danti
2023-04-14 20:25       ` Corinna Vinschen
2023-04-14 21:01         ` Gionatan Danti
2023-04-17  5:36           ` Gionatan Danti
2023-04-17  9:05             ` Corinna Vinschen
2023-04-17 10:58               ` Andrey Repin
2023-04-17 13:46               ` Gionatan Danti
2023-04-18 21:09                 ` Gionatan Danti
2023-04-19  1:10                   ` L A Walsh
2023-04-19 11:56                     ` Gionatan Danti
2023-04-14 20:17   ` Gionatan Danti
2023-04-14 20:40     ` Corinna Vinschen
2023-04-14 20:51       ` Gionatan Danti
2023-04-15  5:10     ` Brian Inglis
2023-04-17  9:10       ` Corinna Vinschen
