From: Andy Koppe
To: cygwin@cygwin.com
Date: Tue, 22 Sep 2009 16:12:00 -0000
Subject: Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
Message-ID: <416096c60909220912s5dd749bh5cfeb670b0e78c7a@mail.gmail.com>
In-Reply-To: <20090922094523.GR20981@calimero.vinschen.de>

2009/9/22 Corinna Vinschen:
>> > As you might know, invalid bytes >= 0x80 are translated to UTF-16 by
>> > transposing them into the 0xdc00 - 0xdcff range by just or'ing 0xdc00.
>> > The problem now is that readdir() will return the transposed characters
>> > as if they are the original characters.
>>
>> Yep, that's where the bug is. Those 0xDC??
words represent invalid
>> UTF-8 bytes. They do not represent CP1252 or ISO-8859-1 characters.
>>
>> Therefore, when converting a UTF-16 Windows filename to the current
>> charset, 0xDC?? words should be treated like any other UTF-16 word
>> that can't be represented in the current charset: it should be encoded
>> as a ^N sequence.
>
> How?  Just like the incoming multibyte character didn't represent a valid
> UTF-8 char, a single U+DCxx value does not represent a valid UTF-16 char.
> Therefore, the ^N conversion will fail since U+DCxx can't be converted
> to valid UTF-8.

True, but that's an implementation issue rather than a design issue,
i.e. the ^N conversion needs to do the UTF-8 conversion itself rather
than invoke the __utf8 functions. Shall I look into creating a patch?

>> > So it looks like the current mechanism to handle invalid multibyte
>> > sequences is too complicated for us.  As far as I can see, it would be
>> > much simpler and less error prone to translate the invalid bytes simply
>> > to the equivalent UTF-16 value.  That creates filenames with UTF-16
>> > values from the ISO-8859-1 range.
>>
>> This won't work correctly, because different POSIX filenames will map
>> to the same Windows filename. For example, the filenames "\xC3\xA4"
>> (valid UTF-8 for a-umlaut) and "\xC4" (invalid UTF-8 sequence that
>> represents a-umlaut in 8859-1), will both map to Windows filename
>> "U+00C4", i.e. a-umlaut in UTF-16. Furthermore, after creating a file
>> called "\xC4", a readdir() would show that file as "\xC3\xA4".
>
> Right, but using your above suggestion will also lead to another filename
> in readdir, it would just be \x0e\xsome\xthing.

I don't think the suggestion above is directly relevant to the problem
I tried to highlight here. Currently, with UTF-8 filename encodings,
"\xC3\xA4" turns into U+00C4 on disk, while "\xC4" turns into U+DCC4,
and converting back yields the original separate filenames.
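
To make the difference concrete, here is a rough Python sketch of the two
mappings. It is not Cygwin code: Python's "surrogateescape" error handler
merely happens to use the same trick of moving an invalid byte >= 0x80 to
0xDC00 | byte, so it stands in for the current behaviour. The function
names are made up for illustration, and the "proposed" variant is only my
reading of the suggestion quoted above. One caveat on the bytes: the UTF-8
form of U+00C4 is C3 84, so the sketch uses b"\xC3\x84" as the valid name.

```python
# Sketch only: models the two filename mappings discussed above.
# All function names here are hypothetical, not Cygwin APIs.

def to_utf16_current(name: bytes) -> str:
    """Current scheme: an invalid byte b >= 0x80 becomes U+DC00 | b."""
    return name.decode("utf-8", errors="surrogateescape")

def from_utf16_current(wname: str) -> bytes:
    """Inverse of the current scheme: U+DCxx turns back into the raw byte."""
    return wname.encode("utf-8", errors="surrogateescape")

def to_utf16_proposed(name: bytes) -> str:
    """Proposed scheme (as I understand it): an invalid byte b becomes U+00bb."""
    escaped = name.decode("utf-8", errors="surrogateescape")
    return "".join(
        chr(ord(c) - 0xDC00) if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in escaped
    )

def from_utf16_proposed(wname: str) -> bytes:
    """With the proposed scheme the on-disk name is ordinary text."""
    return wname.encode("utf-8")

valid = b"\xC3\x84"   # valid UTF-8 for U+00C4
invalid = b"\xC4"     # lone lead byte, not valid UTF-8

# Current scheme: two distinct on-disk names, both round-trip unchanged.
assert to_utf16_current(valid) != to_utf16_current(invalid)
assert from_utf16_current(to_utf16_current(invalid)) == invalid

# Proposed scheme: both names collide, and the invalid one changes.
assert to_utf16_proposed(valid) == to_utf16_proposed(invalid)
assert from_utf16_proposed(to_utf16_proposed(invalid)) == valid
```

The last two assertions are the clobbering and the open()/readdir()
mismatch in miniature.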
If I understand your proposal correctly, both "\xC3\xA4" and "\xC4" would
turn into U+00C4, hence converting back would yield "\xC3\xA4" for both.
This is wrong. Those filenames shouldn't be clobbering each other, and
a filename shouldn't change between open() and readdir(), certainly
not without switching charset in between.

Having said that, if you did switch charset from UTF-8 e.g. to
ISO-8859-1, the on-disk U+DCC4 would indeed turn into
"\x0E\xsome\xthing". However, that issue applies to any UTF-16
character not in the target charset, not just those funny U+DC?? codes
for representing invalid UTF-8 bytes.

The only way to avoid the POSIX filenames changing depending on locale
would be to assume UTF-8 for filenames no matter the locale charset.
That's an entirely different can of worms though, extending the
compatibility problems discussed on the "The C locale" thread to all
non-UTF-8 locales, and putting the onus for converting filenames on
applications.

Andy