From: Corinna Vinschen <corinna-cygwin@cygwin.com>
To: cygwin@cygwin.com, bug-gnulib@gnu.org, bug-coreutils@gnu.org
Subject: Re: 16-bit wchar_t on Windows and Cygwin
Date: Mon, 31 Jan 2011 20:49:00 -0000
Message-ID: <20110131182210.GL1057@calimero.vinschen.de>
In-Reply-To: <4D46EA2B.1010307@redhat.com>
On Jan 31 09:58, Eric Blake wrote:
> > 2) Code that uses mbrtowc() or wcrtomb() is also likely to malfunction.
> > On Cygwin >= 1.7 mbrtowc() and wcrtomb() is implemented in an intelligent
> > but somewhat surprising way: wcrtomb() may return 0, that is, produce no
> > output bytes when it consumes a wchar_t.
>
> > Now with a chinese character outside the BMP:
> > $
> > 1 4
> > $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> > 3 6
> >
> > On Cygwin 1.7.5 (with LANG=C.UTF-8 and 'wc' from GNU coreutils 8.5):
> >
> > $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
> > 1 5
> > $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> > 2 7
> >
> > So both the number of characters and the number of words are counted
> > wrong as soon as non-BMP characters occur.
> >
>
> Does this represent a bug in cygwin's mbrtowc routines that could be
> fixed by cygwin?
>
> Or, does this represent a bug in coreutils for using mbrtowc one
> character at a time instead of something like mbsrtowcs to do bulk
> conversions?

Just to clarify a bit: this was discussed on the cygwin-developers
mailing list back in 2009. The original code which handled UTF-16
surrogates always wrote at least 1 byte to the destination UTF-8 string.
The problem, however, is that Windows filenames may contain lone
surrogates (unpaired surrogate halves), even though the filename is
usually interpreted as UTF-16.

So the current code returns 0 bytes for the first surrogate half and
only writes the full UTF-8 sequence after the second surrogate half has
been evaluated. If a high surrogate is still pending but no low
surrogate follows, we can simply write out the high surrogate in CESU-8
encoding. This would not have been possible if we had already written
the first byte of the UTF-8 sequence. Lone low surrogates are written as
a CESU-8 sequence immediately, so they are nothing to worry about.
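
To make the deferred-output scheme concrete, here is a minimal sketch in
C. It is a model of the behaviour described above, not Cygwin's actual
code: the names u16_state, put_upto3 and encode_utf16_unit are made up
for illustration, and the input is taken as raw 16-bit UTF-16 units
rather than wchar_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state: the pending high surrogate, or 0 if none. */
typedef struct { uint16_t pending_high; } u16_state;

/* Write a value below 0x10000 as 1 to 3 UTF-8 bytes.  Applied to a
   surrogate, this yields its 3-byte CESU-8 form. */
static size_t put_upto3(uint32_t c, unsigned char *out)
{
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (c >> 12));
    out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (c & 0x3F));
    return 3;
}

/* Feed one UTF-16 unit.  Returns the number of bytes stored in out
   (0 to 6); out must have room for 6 bytes. */
size_t encode_utf16_unit(u16_state *st, uint16_t unit, unsigned char *out)
{
    size_t n = 0;

    assert(st != NULL && out != NULL);

    if (unit >= 0xDC00 && unit <= 0xDFFF && st->pending_high != 0) {
        /* Pair complete: emit the 4-byte UTF-8 sequence in one go. */
        uint32_t cp = 0x10000
                      + ((uint32_t)(st->pending_high - 0xD800) << 10)
                      + (uint32_t)(unit - 0xDC00);
        st->pending_high = 0;
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    if (st->pending_high != 0) {
        /* No low surrogate followed: flush the lone high surrogate. */
        n = put_upto3(st->pending_high, out);
        st->pending_high = 0;
    }
    if (unit >= 0xD800 && unit <= 0xDBFF) {
        /* High surrogate: remember it and emit nothing for it yet. */
        st->pending_high = unit;
        return n;
    }
    /* BMP character or lone low surrogate: written immediately. */
    return n + put_upto3(unit, out + n);
}
```

With a zeroed state, feeding 0xD844 returns 0 and feeding 0xDE34 then
returns 4 with the bytes f0 a1 88 b4, which is the character from the
wc example quoted above. A real implementation would also have to
flush a high surrogate that is still pending at the end of the input;
the sketch leaves that out.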

As for wctomb/wcrtomb returning 0: even if this looks like a bit of a
stretch, it should not be a problem per POSIX. A return value of 0
from wctomb/wcrtomb has no special meaning(*). Even when the incoming
wide char is L'\0', the resulting \0 is written and 1 is returned.
Since 0 bytes have been written to the destination string, returning 0
is perfectly valid. If a calling function misinterprets a return value
of 0 as an error or EOF, that is not a bug in wctomb/wcrtomb.
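
For illustration, a caller that copes with this could look roughly like
the following sketch. The helper name wide_to_multibyte and its
convention of reporting lack of space as (size_t)-1 are assumptions
made for this example; only wcrtomb() itself and its (size_t)-1 error
value come from the standard.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: converts n wide chars from src into dst, which
   has room for cap bytes.  Returns the number of bytes written, or
   (size_t)-1 on a conversion error or if dst is too small. */
size_t wide_to_multibyte(const wchar_t *src, size_t n, char *dst, size_t cap)
{
    mbstate_t st;
    char tmp[MB_LEN_MAX];
    size_t used = 0;

    memset(&st, 0, sizeof st);          /* initial conversion state */
    for (size_t i = 0; i < n; i++) {
        size_t r = wcrtomb(tmp, src[i], &st);
        if (r == (size_t)-1)
            return (size_t)-1;          /* the only error value */
        /* r == 0 is legal: the wide char was consumed, but no bytes
           have been produced for it yet.  Just carry on. */
        if (r > cap - used)
            return (size_t)-1;          /* dst too small */
        memcpy(dst + used, tmp, r);
        used += r;
    }
    return used;
}
```

The loop never uses the return value as a termination test. The end of
the input is determined by the count n, so a 0 from wcrtomb() for the
first surrogate half simply contributes no bytes.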

For the original discussion, see
http://cygwin.com/ml/cygwin-developers/2009-09/msg00065.html

Corinna

(*) http://pubs.opengroup.org/onlinepubs/9699919799/functions/wcrtomb.html

--
Corinna Vinschen Please, send mails regarding Cygwin to
Cygwin Project Co-Leader cygwin AT cygwin DOT com
Red Hat
--
Problem reports: http://cygwin.com/problems.html
FAQ: http://cygwin.com/faq/
Documentation: http://cygwin.com/docs.html
Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple