* With bad UTF-8, cygwin can create files it can't read
From: Kyzer @ 2015-03-25 15:26 UTC
To: cygwin
Hello,
I've found that if you use cygwin to create a file with badly-encoded
UTF-8, readdir() gives out an entry with a name that cygwin won't
subsequently accept.
* create a file using filename with hex bytes F4 8F BF BF
* readdir() reports the filename as hex bytes E2 8E B3 ED BF BF
* attempting to open or unlink the filename E2 8E B3 ED BF BF fails
* attempting to open or unlink the filename F4 8F BF BF succeeds
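For reference, here is a minimal sketch of decoding the four-byte sequence above by hand. `decode_utf8_4` is a hypothetical helper written for this illustration, not a Cygwin or libc function. Under the standard UTF-8 bit layout the bytes F4 8F BF BF appear to decode to U+10FFFF, the highest code point, so the sequence may be well-formed rather than badly encoded.

```c
#include <stdint.h>

/* Hypothetical helper: decode one 4-byte UTF-8 sequence.
   Returns the code point, or UINT32_MAX if the bytes do not form a
   well-formed 4-byte sequence. */
uint32_t decode_utf8_4(const unsigned char *s)
{
    /* Lead byte must look like 11110xxx. */
    if ((s[0] & 0xF8) != 0xF0)
        return UINT32_MAX;

    /* Each continuation byte must look like 10xxxxxx. */
    for (int i = 1; i < 4; i++)
        if ((s[i] & 0xC0) != 0x80)
            return UINT32_MAX;

    uint32_t cp = ((uint32_t)(s[0] & 0x07) << 18)
                | ((uint32_t)(s[1] & 0x3F) << 12)
                | ((uint32_t)(s[2] & 0x3F) << 6)
                |  (uint32_t)(s[3] & 0x3F);

    /* Reject overlong forms and values past the end of Unicode. */
    if (cp < 0x10000 || cp > 0x10FFFF)
        return UINT32_MAX;

    return cp;
}
```

Calling `decode_utf8_4((const unsigned char *)"\xF4\x8F\xBF\xBF")` yields 0x10FFFF.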
Here's a test case. Beware that it will delete everything in the
current directory.
#include <stdio.h>
#include <dirent.h>
#include <unistd.h>   /* unlink() */

int main(void) {
    DIR *d;
    struct dirent *de;
    const char *fname = "\xF4\x8F\xBF\xBF";

    // touch file
    FILE *f = fopen(fname, "wb");
    if (!f) { perror("fopen"); return 1; }
    fclose(f);

    // iterate through dir, trying to unlink every non-dot entry
    d = opendir(".");
    if (!d) { perror("opendir"); return 1; }
    while ((de = readdir(d))) {
        if (de->d_name[0] == '.') continue;
        printf("unlink(%s) = %d\n", de->d_name, unlink(de->d_name));
    }
    closedir(d);

    // show that unlink works if you know the real filename
    printf("unlink(%s) = %d\n", fname, unlink(fname));
    return 0;
}
This outputs (piped through hexdump -C):
00000000 75 6e 6c 69 6e 6b 28 e2 8e b3 ed bf bf 29 20 3d |unlink(......) =|
00000010 20 2d 31 0a 75 6e 6c 69 6e 6b 28 f4 8f bf bf 29 | -1.unlink(....)|
00000020 20 3d 20 30 0a | = 0.|
00000025
i.e.
unlink(\xe2\x8e\xb3\xed\xbf\xbf) = -1
unlink(\xf4\x8f\xbf\xbf) = 0
This is with cygwin package 1.7.35:
$ cygcheck -c cygwin
Cygwin Package Information
Package Version Status
cygwin 1.7.35-1 OK
Windows / DOS does not have the problem:
c:\test\t>dir
Volume in drive C has no label.
Volume Serial Number is ....-....
Directory of c:\test\t
25/03/2015 14:30 <DIR> .
25/03/2015 14:30 <DIR> ..
25/03/2015 14:30 0 ??
1 File(s) 0 bytes
2 Dir(s) 39,906,525,184 bytes free
c:\test\t>del *
c:\test\t\*, Are you sure (Y/N)? y
c:\test\t>dir
Volume in drive C has no label.
Volume Serial Number is ....-....
Directory of c:\test\t
25/03/2015 14:31 <DIR> .
25/03/2015 14:31 <DIR> ..
0 File(s) 0 bytes
2 Dir(s) 39,906,525,184 bytes free
Regards
Stuart
--
Problem reports: http://cygwin.com/problems.html
FAQ: http://cygwin.com/faq/
Documentation: http://cygwin.com/docs.html
Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple
* Re: With bad UTF-8, cygwin can create files it can't read
From: Corinna Vinschen @ 2015-03-30 11:16 UTC
To: cygwin
On Mar 25 14:34, Kyzer wrote:
> Hello,
>
> I've found that if you use cygwin to create a file with badly-encoded
> UTF-8, readdir() gives out an entry with a name that cygwin won't
> subsequently accept.
>
> * create a file using filename with hex bytes F4 8F BF BF
> * readdir() reports the filename as hex bytes E2 8E B3 ED BF BF
> * attempting to open or unlink the filename E2 8E B3 ED BF BF fails
> * attempting to open or unlink the filename F4 8F BF BF succeeds
Thanks for the testcase. I'll have a look later this week (I hope).
Corinna
--
Corinna Vinschen Please, send mails regarding Cygwin to
Cygwin Maintainer cygwin AT cygwin DOT com
Red Hat
* Re: With bad UTF-8, cygwin can create files it can't read
From: Corinna Vinschen @ 2015-04-01 13:34 UTC
To: cygwin
Hi Stuart,
On Mar 30 13:04, Corinna Vinschen wrote:
> On Mar 25 14:34, Kyzer wrote:
> > Hello,
> >
> > I've found that if you use cygwin to create a file with badly-encoded
> > UTF-8, readdir() gives out an entry with a name that cygwin won't
> > subsequently accept.
> >
> > * create a file using filename with hex bytes F4 8F BF BF
> > * readdir() reports the filename as hex bytes E2 8E B3 ED BF BF
> > * attempting to open or unlink the filename E2 8E B3 ED BF BF fails
> > * attempting to open or unlink the filename F4 8F BF BF succeeds
>
> Thanks for the testcase. I'll have a look later this week (I hope).
Wow. Just wow. You found a long-standing bug in the wctomb conversion
from UTF-16 to UTF-8.
As you probably know, Unicode values beyond the base plane (that is,
everything > 0xffff in UTF-32 and > ef bf bf in UTF-8 notation)
are represented as so-called surrogate pairs in UTF-16, two UTF-16
values in the 0xd800 - 0xdfff range.
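As an illustration of the surrogate-pair scheme just described, here is a minimal sketch of splitting a code point above 0xffff into its two UTF-16 values. `to_surrogates` is a hypothetical name chosen for this example; it is not the function Cygwin uses.

```c
#include <stdint.h>

/* Hypothetical helper: split a code point in 0x10000..0x10FFFF into a
   UTF-16 surrogate pair. Returns 0 on success, -1 if cp is outside
   that range and therefore has no surrogate-pair form. */
int to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return -1;

    cp -= 0x10000;                            /* 20 bits remain */
    *hi = (uint16_t)(0xD800 | (cp >> 10));    /* top 10 bits: 0xd800..0xdbff */
    *lo = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low 10 bits: 0xdc00..0xdfff */
    return 0;
}
```

For U+10FFFF this gives the pair dbff dfff, so the low surrogate sits exactly on the upper boundary of its range.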
While the conversion from UTF-8 f4 8f bf bf to UTF-16 dbff dfff
worked fine, the conversion back to UTF-8 has a subtle bug. There's
a test for a lone high surrogate in the underlying conversion
function. This tests the next UTF-16 value like this:
if (wchar < 0xdc00 || wchar >= 0xdfff)
/* Handle lone high surrogate */
Notice the >= 0xdfff? That should have been > 0xdfff. Duh. This
bug is only a bit over 5 years old...
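To make the off-by-one concrete, here is a small sketch that isolates the two comparisons. The function names are hypothetical stand-ins for this illustration and are not taken from the actual source; only the comparison expressions come from the message above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the check as quoted: returns true when
   wchar is NOT treated as a low surrogate, i.e. the preceding high
   surrogate would be handled as a lone one. */
bool not_low_surrogate_buggy(unsigned int wchar)
{
    return wchar < 0xdc00 || wchar >= 0xdfff;  /* wrongly rejects 0xdfff */
}

/* The same check with the corrected upper bound. */
bool not_low_surrogate_fixed(unsigned int wchar)
{
    return wchar < 0xdc00 || wchar > 0xdfff;
}
```

The two versions differ only for wchar == 0xdfff, which would explain why the problem shows up only for pairs whose low surrogate is exactly 0xdfff, such as dbff dfff.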
Fixed in the git repo. I'll regenerate today's fool..., erm, today's
developer snapshot on https://cygwin.com/snapshots/ later today.
Thanks, especially for the simple testcase,
Corinna
* Re: With bad UTF-8, cygwin can create files it can't read
From: Warren Young @ 2015-04-01 16:01 UTC
To: cygwin
On Apr 1, 2015, at 7:34 AM, Corinna Vinschen <corinna-cygwin@cygwin.com> wrote:
>
> As you probably know, Unicode values beyond the base plane (that is,
> everything > 0xffff in UTF-32 and > ef bf bf in UTF-8 notation)
> are represented as so-called surrogate pairs in UTF-16, two UTF-16
> values in the 0xd800 - 0xdfff range.
I happened to have run across a similar strangeness in Unicode earlier today. Does Cygwin cope with/care about Unicode normalization forms?
http://goo.gl/jnsqhC
For example, will open(2) cope with any UTF-8 form of a string that you could pass in UTF-16 encoding to CreateFile()?
You could imagine, say, a web app getting a string from a user, then using that to access a file on disk. A different browser given the “same” string could result in a different series of bytes passed to the Cygwin POSIX layer.
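A minimal sketch of the concern, using "é" as an assumed example (it does not appear in the thread): the precomposed form U+00E9 and the decomposed form U+0065 U+0301 render the same but are different byte strings in UTF-8, so a byte-wise comparison treats them as different names. The identifiers below are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/* "é" precomposed (NFC): U+00E9 -> C3 A9 */
static const char e_acute_nfc[] = "\xC3\xA9";

/* "é" decomposed (NFD): U+0065 U+0301 -> 65 CC 81 */
static const char e_acute_nfd[] = "e\xCC\x81";

/* Byte-wise comparison, with no normalization applied. */
bool same_bytes(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Here `same_bytes(e_acute_nfc, e_acute_nfd)` is false, so any layer that converts names bit for bit would see two distinct filenames.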
* Re: With bad UTF-8, cygwin can create files it can't read
From: Corinna Vinschen @ 2015-04-01 16:10 UTC
To: cygwin
On Apr 1 15:34, Corinna Vinschen wrote:
> Fixed in the git repo. I'll regenerate today's fool..., erm, today's
> developer snapshot on https://cygwin.com/snapshots/ later today.
Snapshot is up. Please give it a try.
Thanks,
Corinna
* Re: With bad UTF-8, cygwin can create files it can't read
From: Corinna Vinschen @ 2015-04-01 16:16 UTC
To: cygwin
On Apr 1 10:01, Warren Young wrote:
> On Apr 1, 2015, at 7:34 AM, Corinna Vinschen <corinna-cygwin@cygwin.com> wrote:
> >
> > As you probably know, Unicode values beyond the base plane (that is,
> > everything > 0xffff in UTF-32 and > ef bf bf in UTF-8 notation)
> > are represented as so-called surrogate pairs in UTF-16, two UTF-16
> > values in the 0xd800 - 0xdfff range.
>
> I happened to have run across a similar strangeness in Unicode earlier
> today. Does Cygwin cope with/care about Unicode normalization forms?
Not at all. UTF-8 string in, equivalent UTF-16 string out and vice versa,
on the bit level. Additionally there's a replacement for UTF-16 values
which can't be handled by the current (non-UTF-8) codeset, e.g. ISO8859-1:
ASCII CAN followed by the UTF-8 representation of the UTF-16 character.
Corinna