From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <cygwin-return-154826-listarch-cygwin=sourceware.org@cygwin.com>
Received: (qmail 21576 invoked by alias); 22 Sep 2009 09:45:37 -0000
Received: (qmail 21568 invoked by uid 22791); 22 Sep 2009 09:45:37 -0000
X-Spam-Check-By: sourceware.org
Received: from aquarius.hirmke.de (HELO calimero.vinschen.de) (217.91.18.234)     by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Tue, 22 Sep 2009 09:45:34 +0000
Received: by calimero.vinschen.de (Postfix, from userid 500) 	id EDEEC6D434D; Tue, 22 Sep 2009 11:45:23 +0200 (CEST)
Date: Tue, 22 Sep 2009 09:45:00 -0000
From: Corinna Vinschen <corinna-cygwin@cygwin.com>
To: cygwin@cygwin.com
Subject: Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
Message-ID: <20090922094523.GR20981@calimero.vinschen.de>
Reply-To: cygwin@cygwin.com
Mail-Followup-To: cygwin@cygwin.com
References: <h8bk5a$big$1@ger.gmane.org> <416096c60909101512l6e42ab72l4ba5fd792363eefd@mail.gmail.com> <h8p50e$im8$1@ger.gmane.org> <20090921161014.GI20981@calimero.vinschen.de> <416096c60909211154u5ddd5869v986011aa4ee13d57@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <416096c60909211154u5ddd5869v986011aa4ee13d57@mail.gmail.com>
User-Agent: Mutt/1.5.19 (2009-02-20)
Mailing-List: contact cygwin-help@cygwin.com; run by ezmlm
Precedence: bulk
List-Id: <cygwin.cygwin.com>
List-Subscribe: <mailto:cygwin-subscribe@cygwin.com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin@cygwin.com>
List-Help: <mailto:cygwin-help@cygwin.com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner@cygwin.com
X-SW-Source: 2009-09/txt/msg00524.txt.bz2

On Sep 21 19:54, Andy Koppe wrote:
> 2009/9/21 Corinna Vinschen:
> > As you might know, invalid bytes >= 0x80 are translated to UTF-16 by
> > transposing them into the 0xdc00 - 0xdcff range by just or'ing 0xdc00.
> > The problem now is that readdir() will return the transposed characters
> > as if they are the original characters.
> 
> Yep, that's where the bug is. Those 0xDC?? words represent invalid
> UTF-8 bytes. They do not represent CP1252 or ISO-8859-1 characters.
> 
> Therefore, when converting a UTF-16 Windows filename to the current
> charset, 0xDC?? words should be treated like any other UTF-16 word
> that can't be represented in the current charset: it should be encoded
> as a ^N sequence.

How?  Just as the incoming byte did not form a valid UTF-8 sequence, a
lone U+DCxx value is an unpaired low surrogate and does not represent a
valid UTF-16 char either.  The ^N conversion would therefore fail, since
U+DCxx can't be converted to valid UTF-8.
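
To make the point concrete, here is a minimal sketch in Python rather
than Cygwin's own code.  It assumes that Python's "surrogateescape"
error handler is a fair stand-in for the transposition described above
(it maps each undecodable byte >= 0x80 to 0xDC00 | byte).  The helper
names to_utf16_units and can_encode_utf8 are made up for illustration.

```python
def to_utf16_units(name):
    """Decode a POSIX filename as UTF-8 and return its UTF-16 words.

    Assumption: "surrogateescape" models the transposition, i.e. every
    invalid byte b >= 0x80 becomes the single word 0xDC00 | b.
    """
    text = name.decode("utf-8", errors="surrogateescape")
    # "surrogatepass" lets the lone surrogate through so we can look at it.
    data = text.encode("utf-16-le", errors="surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def can_encode_utf8(unit):
    """Return True if the code point can be encoded as strict UTF-8."""
    try:
        chr(unit).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# A lone 0xC4 byte is not valid UTF-8, so it is transposed to 0xDCC4.
invalid_units = to_utf16_units(b"\xc4")
# That word is an unpaired low surrogate and has no strict UTF-8 form.
surrogate_ok = can_encode_utf8(invalid_units[0])
# An ordinary character such as U+00C4 converts without trouble.
latin_ok = can_encode_utf8(0x00C4)
```

Under these assumptions the strict UTF-8 encoder rejects the transposed
word, which is the failure I mean.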

> > So it looks like the current mechanism to handle invalid multibyte
> > sequences is too complicated for us.  As far as I can see, it would be
> > much simpler and less error prone to translate the invalid bytes simply
> > to the equivalent UTF-16 value.  That creates filenames with UTF-16
> > values from the ISO-8859-1 range.
> 
> This won't work correctly, because different POSIX filenames will map
> to the same Windows filename. For example, the filenames "\xC3\xA4"
> (valid UTF-8 for a-umlaut) and "\xC4" (invalid UTF-8 sequence that
> represents a-umlaut in 8859-1), will both map to Windows filename
> "U+00C4", i.e a-umlaut in UTF-16. Furthermore, after creating a file
> called "\xC4", a readdir() would show that file as "\xC3\xA4".

Right, but your suggestion above would also make readdir() return a
different filename than the one used at creation; it would just be
\x0e\xsome\xthing instead.
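
A rough sketch of both directions, again in Python and not taken from
the Cygwin sources.  All function names are hypothetical.  The bytes
following \x0e are an assumption too: Python's "surrogatepass" handler
is used only to produce some payload, and nothing here says what Cygwin
would actually emit.  The sketch uses C3 84, which is the UTF-8 encoding
of U+00C4, as the valid counterpart of the lone byte C4.

```python
def posix_to_windows_simple(name):
    """Simpler scheme: an invalid byte becomes the same-valued code point."""
    text = name.decode("utf-8", errors="surrogateescape")
    return "".join(
        chr(ord(c) & 0xFF) if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in text
    )


def windows_to_posix_utf8(winname):
    """What readdir() would hand back in a UTF-8 locale."""
    return winname.encode("utf-8")


def windows_to_posix_ctrl_n(winname):
    """Hypothetical ^N scheme: prefix each transposed word with SO (0x0e).

    The payload after 0x0e is illustrative only (assumption).
    """
    out = bytearray()
    for c in winname:
        if 0xDC80 <= ord(c) <= 0xDCFF:
            out += b"\x0e" + c.encode("utf-8", errors="surrogatepass")
        else:
            out += c.encode("utf-8")
    return bytes(out)


# Simpler scheme: two different POSIX names collide on one Windows name,
# and the file created as b"\xc4" is listed under another name.
same_name = (posix_to_windows_simple(b"\xc3\x84")
             == posix_to_windows_simple(b"\xc4"))
listed_simple = windows_to_posix_utf8(posix_to_windows_simple(b"\xc4"))

# ^N scheme on the transposed word 0xDCC4: no collision, but the listed
# name still differs from the b"\xc4" that was used to create the file.
listed_ctrl_n = windows_to_posix_ctrl_n("\udcc4")
```

Either way the name returned by readdir() is not the byte string the
file was created with; the two schemes differ only in whether distinct
POSIX names can collide.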


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
