public inbox for cygwin@cygwin.com
From: Andy Koppe <andy.koppe@gmail.com>
To: cygwin@cygwin.com
Subject: Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
Date: Wed, 23 Sep 2009 11:52:00 -0000
Message-ID: <416096c60909230452l42aa2210nf22b07c20cd2e697@mail.gmail.com>
In-Reply-To: <20090922170709.GV20981@calimero.vinschen.de>

2009/9/22 Corinna Vinschen:
>> >> Therefore, when converting a UTF-16 Windows filename to the current
>> >> charset, 0xDC?? words should be treated like any other UTF-16 word
>> >> that can't be represented in the current charset: it should be encoded
>> >> as a ^N sequence.

(I started writing this before seeing your patch to the singlebyte
codepage tables, which makes plenty of sense. Here goes anyway.)

Having actually looked at strfuncs.cc, my diagnosis was too
simplistic, because the U+DC?? codes are used not only for invalid
UTF-8 bytes, but for invalid bytes in any charset. This even includes
CP1252, which has a few holes in the 0x80..0x9F range.

Therefore, the complete solution would be something like this: when
sys_cp_wcstombs comes across a 0xDC?? code, it checks whether the byte
it encodes is indeed an invalid byte in the current charset. If it is,
it translates it into that invalid byte, because on the way back it
would once again be turned into the same 0xDC?? code. If the byte
would represent (part of) a valid character, however, it would need to
be encoded as a ^N sequence to ensure correct roundtripping.
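
A minimal sketch of that check for a singlebyte charset, using CP1252 as the example. The names classify_dc_code, dc_action and cp1252_is_hole are made up for illustration and are not taken from strfuncs.cc; the list of holes assumes the published CP1252 table, in which 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined.

```c
#include <stdbool.h>
#include <stdint.h>

/* What to do with one UTF-16 code unit (hypothetical type). */
typedef enum {
    NOT_A_DC_CODE,        /* ordinary character: convert as usual       */
    EMIT_RAW_BYTE,        /* byte is invalid in the charset: emit as-is */
    EMIT_CTRL_N_SEQUENCE  /* byte is valid in the charset: escape it    */
} dc_action;

/* Assumption: the bytes that the CP1252 table leaves undefined. */
static bool cp1252_is_hole(uint8_t b)
{
    return b == 0x81 || b == 0x8D || b == 0x8F || b == 0x90 || b == 0x9D;
}

/* Decide how a 0xDC?? code should be written in a singlebyte charset.
   is_invalid_byte reports whether a byte has no meaning in that charset. */
static dc_action classify_dc_code(uint16_t wc, bool (*is_invalid_byte)(uint8_t))
{
    if ((wc & 0xFF00) != 0xDC00)
        return NOT_A_DC_CODE;

    uint8_t b = (uint8_t)(wc & 0xFF);

    /* An invalid byte turns back into the same 0xDC?? code on the way
       back, so it can be stored raw.  A valid byte would come back as a
       real character, so it has to be escaped instead. */
    return is_invalid_byte(b) ? EMIT_RAW_BYTE : EMIT_CTRL_N_SEQUENCE;
}
```

With this, 0xDC81 would be written as the raw byte 0x81, whereas 0xDCE9 would be escaped, because 0xE9 is a valid CP1252 character.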

Now that shouldn't be too difficult to implement for singlebyte
charsets, but it gets somewhat hairy for multibyte charsets, including
UTF-8 itself. Here's how I think it could be done though:

In sys_cp_wcstombs:

* On encountering a DC?? code, extract the encoded byte, and feed it
into f_mbtowc. A private mbstate for this is needed, starting in the
initial state for each filename. Switch on the result of f_mbtowc:
** case -2 (incomplete sequence): add the byte to a buffer for this purpose
** case -1 (invalid sequence): copy anything already in the buffer
plus the current byte into the target filename, as we can be sure that
they'll turn back into U+DCbb again on the way back.
** case >0 (valid sequence): encode buffer contents and current byte
as ^N codes that don't represent valid UTF-8

* When encountering a non-DC?? code, copy any bytes left in the buffer
into the target filename.
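
The steps above can be sketched roughly as follows. This is not the strfuncs.cc code: u8_feed is a deliberately simplified UTF-8 stand-in for f_mbtowc (it ignores overlong forms and surrogates), dc_wcstombs_sketch is a hypothetical name, and the ^N escape is assumed to be Ctrl-N followed by the three-byte UTF-8-style encoding of U+DCbb. Resetting the state after an invalid byte and flushing the buffer at the end of the filename are also assumptions. Non-DC characters above ASCII are skipped to keep the sketch short.

```c
#include <stddef.h>
#include <stdint.h>

#define CTRL_N 0x0E

/* Private conversion state: continuation bytes still expected. */
typedef struct { int need; } u8state;

/* Simplified stand-in for f_mbtowc, fed one byte at a time.
   Returns -2 (incomplete), -1 (invalid) or 1 (sequence completed). */
static int u8_feed(u8state *st, uint8_t b)
{
    if (st->need == 0) {
        if (b < 0x80) return 1;
        if (b >= 0xC2 && b <= 0xDF) { st->need = 1; return -2; }
        if (b >= 0xE0 && b <= 0xEF) { st->need = 2; return -2; }
        if (b >= 0xF0 && b <= 0xF4) { st->need = 3; return -2; }
        return -1;
    }
    if ((b & 0xC0) != 0x80) { st->need = 0; return -1; }
    return --st->need > 0 ? -2 : 1;
}

/* Assumed escape format: ^N plus the UTF-8-style bytes of U+DCbb. */
static size_t put_escaped(uint8_t *out, size_t n, uint8_t b)
{
    out[n++] = CTRL_N;
    out[n++] = 0xED;
    out[n++] = (uint8_t)(0xB0 | (b >> 6));
    out[n++] = (uint8_t)(0x80 | (b & 0x3F));
    return n;
}

/* out must have room for 4 * len bytes.  Returns bytes written. */
static size_t dc_wcstombs_sketch(const uint16_t *src, size_t len, uint8_t *out)
{
    u8state st = { 0 };        /* starts in the initial state per filename */
    uint8_t pending[4];        /* bytes of an incomplete sequence */
    size_t npending = 0, n = 0;

    for (size_t i = 0; i < len; i++) {
        uint16_t wc = src[i];
        if ((wc & 0xFF00) == 0xDC00) {
            uint8_t b = (uint8_t)(wc & 0xFF);
            int r = u8_feed(&st, b);
            if (r == -2) {                    /* incomplete: hold back */
                pending[npending++] = b;
            } else if (r == -1) {             /* invalid: copy raw */
                for (size_t j = 0; j < npending; j++) out[n++] = pending[j];
                out[n++] = b;
                npending = 0;
            } else {                          /* valid: escape everything */
                for (size_t j = 0; j < npending; j++)
                    n = put_escaped(out, n, pending[j]);
                n = put_escaped(out, n, b);
                npending = 0;
            }
        } else {                              /* non-DC code: flush raw */
            for (size_t j = 0; j < npending; j++) out[n++] = pending[j];
            npending = 0;
            st.need = 0;
            if (wc < 0x80) out[n++] = (uint8_t)wc;  /* ASCII only here */
        }
    }
    for (size_t j = 0; j < npending; j++) out[n++] = pending[j];
    return n;
}
```

For example, the pair 0xDCC3 0xDCA9 encodes bytes that together form valid UTF-8, so both come out escaped, while a lone 0xDCFF comes out as the raw byte 0xFF.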

Unfortunately the latter point still leaves a loophole, in case the
incomplete sequence from the buffer and the subsequent bytes combine
into something valid. Singlebyte charsets aren't affected though,
because they don't have continuation bytes. Nor is UTF-8, because it
was designed such that continuation bytes are distinct from initial
bytes. Which leaves the DBCS charsets.
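
To illustrate the difference, here is a small sketch that compares lead and trail byte ranges. The function names are made up for this example, and the Shift-JIS ranges are the commonly documented CP932 ones (lead 0x81..0x9F and 0xE0..0xFC, trail 0x40..0x7E and 0x80..0xFC), taken as an assumption rather than from the Cygwin sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* UTF-8: initial bytes of multibyte sequences vs. continuation bytes. */
static bool utf8_is_lead(uint8_t b)         { return b >= 0xC2 && b <= 0xF4; }
static bool utf8_is_continuation(uint8_t b) { return (b & 0xC0) == 0x80; }

/* Shift-JIS (CP932), assumed ranges: first and second byte of a pair. */
static bool sjis_is_lead(uint8_t b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

static bool sjis_is_trail(uint8_t b)
{
    return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
}

/* Count the byte values that can act as both a lead and a trail byte. */
static int count_overlap(bool (*lead)(uint8_t), bool (*trail)(uint8_t))
{
    int n = 0;
    for (int b = 0; b < 256; b++)
        if (lead((uint8_t)b) && trail((uint8_t)b))
            n++;
    return n;
}
```

For UTF-8 the overlap is empty, so a held-back incomplete sequence cannot combine with whatever follows it. For Shift-JIS a byte such as 0x83 is both a possible lead and a possible trail byte, which is where the loophole comes from.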

However, it rather looks like DBCSs are an intractable problem here in
any case, because of issues like this:

http://support.microsoft.com/kb/170559: "There are some codes that are
not matched one-to-one between Shift-JIS (Japanese character set
supported by MS) and Unicode. When an application calls
MultiByteToWideChar() and WideCharToMultiByte() to perform code
conversion between Shift-JIS and Unicode, the function returns the
wrong code value in some cases."

Which leaves me scratching my head regarding the C locale. More later ...

Andy

