On 01/29/2011 09:01 AM, Corinna Vinschen wrote:
>> So, using UTF-16 surrogate encodings for characters outside the basic
>> plane violates POSIX, but it's the best we can do for those characters.
> 
> Right, and we discussed this already on this list.  Or the developer
> list, I don't remember.  Maybe we should have stick to the base plane
> and only use UCS-2 to be more POSIX compatible.

The burden is on the application, not on cygwin.  If the application
wants POSIX behavior, then they obey __STDC_ISO_10646__ and use ONLY
characters from the basic plane (no surrogates), at which point their
use of wchar_t fits the POSIX definition (one wchar_t per character).
The moment they pass a surrogate, they are no longer honoring the
restriction documented by __STDC_ISO_10646__ so they are no longer under
the rules of POSIX, and then cygwin can do whatever it wants (and in
this case, QoI demands that we honor surrogates to the best of our
ability for full UTF-16 support, and you can have multi-wchar_t
characters just as you already have multi-byte UTF-8 char characters).
In other words, cygwin IS being POSIX-compliant by advertising only the
Unicode 4.0 character set in the __STDC_ISO_10646__, while still
supporting Unicode 5.2 (should we upgrade to Unicode 6.0?) as an
extension when you no longer care about POSIX.

> However, the POSIX definition doesn't contradict what I said about the
> definition of __STDC_ISO_10646__ as far as I'm concerned.

Yep - I think we're in violent agreement :)

-- 
Eric Blake   eblake@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org