From: Corinna Vinschen <corinna-cygwin@cygwin.com>
To: cygwin@cygwin.com, bug-gnulib@gnu.org, bug-coreutils@gnu.org
Subject: Re: 16-bit wchar_t on Windows and Cygwin
Date: Wed, 02 Feb 2011 12:21:00 -0000
Message-ID: <20110202122102.GD2675@calimero.vinschen.de>
In-Reply-To: <20110202121442.GC2675@calimero.vinschen.de>

On Feb 2 13:14, Corinna Vinschen wrote:
> On Feb 2 12:29, Bruno Haible wrote:
> > Hello Eric,
> >
> > > ... POSIX requires that 1 wchar_t corresponds to 1 character
> > > ...
> > > > What consequences does this have?
> > > >
> > > > 1) All code that uses the functions from <wctype.h> (wide character
> > > > classification and mapping) or wcwidth() malfunctions on strings that
> > > > contain Unicode characters outside the BMP, i.e. outside the range
> > > > U+0000..U+FFFF.
> > >
> > > Not necessarily. Such code falls outside of POSIX, but it may still be
> > > a well-behaved extension if given sane behavior for how to deal with
> > > surrogates.
> >
> > No. Code that uses <wctype.h> and wcwidth() is written precisely according
> > to POSIX. The problem is that this code cannot work correctly when wchar_t[]
> > is in UTF-16 encoding. There simply is no way to define these functions
> > in a reasonable way for surrogates.
> >
> > For example:
> > U+1031E = 0xD800 0xDF1E is a letter (iswalpha should be true)
> > U+10320 = 0xD800 0xDF20 is not a letter (iswalpha should be false)
> > U+1D31E = 0xD834 0xDF1E is not a letter (iswalpha should be false)
> > U+1D320 = 0xD834 0xDF20 is not a letter (iswalpha should be false)
> > U+1D71E = 0xD835 0xDF1E is a letter (iswalpha should be true)
> > U+1D720 = 0xD835 0xDF20 is a letter (iswalpha should be true)
> > There is no way that a system can provide this information through a
> > function 'iswalpha' that takes a single wchar_t argument.
>
> iswalpha takes wint_t, not wchar_t. Since sizeof (wint_t) is 4 bytes,
> the function can return the correct value, provided that the application
> converts the UTF-16 surrogate pair to UTF-32 before calling iswalpha.

And, please note the wording in SUSv4, for instance in
http://calimero.vinschen.de/susv4/functions/iswalpha.html

  The wc argument is a wint_t, the value of which the application shall
                       ^^^^^^                         ^^^^^^^^^^^
  ensure is a wide-character code corresponding to a valid character in
  the current locale, or equal to the value of the macro WEOF. If the
  argument has any other value, the behavior is undefined.

I don't see any wording there that would prohibit converting a UTF-16
wchar_t surrogate pair to a UTF-32 wint_t value before calling one of
the wctype functions. It's the same kind of care you have to take not
to call the ctype functions with a signed char that may be negative.
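
To illustrate the conversion described above, here is a minimal sketch.
It is not code from Cygwin, newlib, or gnulib. The helper names
utf16_decode and utf16_iswalpha are hypothetical, and uint16_t stands in
for a 16-bit wchar_t so that the sketch also compiles on systems where
wchar_t is 32 bits wide. It assumes wint_t is at least 32 bits, as stated
above for Cygwin.

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>   /* wint_t */
#include <wctype.h>  /* iswalpha */

/* Hypothetical helper: decode one code point from a UTF-16 buffer.
   Stores the number of code units consumed (0, 1 or 2) in *used.
   Returns U+FFFD for an empty buffer or an unpaired surrogate. */
uint32_t utf16_decode(const uint16_t *s, size_t len, size_t *used)
{
    if (len == 0) {
        *used = 0;
        return 0xFFFD;
    }

    uint16_t hi = s[0];

    /* High surrogate followed by a low surrogate: combine the pair. */
    if (hi >= 0xD800 && hi <= 0xDBFF && len >= 2) {
        uint16_t lo = s[1];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            *used = 2;
            return 0x10000u
                   + (((uint32_t)(hi - 0xD800) << 10)
                      | (uint32_t)(lo - 0xDC00));
        }
    }

    *used = 1;
    if (hi >= 0xD800 && hi <= 0xDFFF)
        return 0xFFFD;          /* unpaired surrogate */
    return hi;                  /* ordinary BMP code unit */
}

/* Hypothetical wrapper: classify the next character of a UTF-16 buffer.
   Assumes wint_t can hold a full UTF-32 value. */
int utf16_iswalpha(const uint16_t *s, size_t len, size_t *used)
{
    uint32_t cp = utf16_decode(s, len, used);
    return iswalpha((wint_t) cp);
}
```

With this, the pair 0xD800 0xDF1E from the example quoted above reaches
iswalpha as the single value 0x1031E. Whether iswalpha then returns the
right answer still depends on the C library's tables and on the locale
selected with setlocale.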
Corinna
--
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat