[adding cygwin and coreutils for a wc issue]

On 01/30/2011 07:04 PM, Bruno Haible wrote:
> Hi,
> 
> It is known for a long time that on native Windows, the wchar_t[] encoding on
> strings is UTF-16. [1] Now, Corinna Vinschen has confirmed that it is the same
> for Cygwin >= 1.7. [2]

POSIX requires that 1 wchar_t corresponds to 1 character; so any use of
surrogates to get the full benefit of UTF-16 falls outside the bounds of
POSIX.  At which point, the POSIX definition of those functions no
longer apply, and we can (try) to make the various wc* functions try to
behave as smartly as possible (as is the case with Cygwin); where those
smarts are only needed when you use surrogate pairs.  If cygwin's
approach is correct, then maybe the thing to do is codify those smarts
for all implementations with 16-bit wchar_t as an extension to POSIX
that all gnulib clients can rely on, and thus minimize the #ifdefs in
such clients.

> What consequences does this have?
> 
>   1) All code that uses the functions from <wctype.h> (wide character
>      classification and mapping) or wcwidth() malfunctions on strings that
>      contains Unicode characters outside the BMP, i.e. outside the range
>      U+0000..U+FFFF.

Not necessarily.  Such code falls outside of POSIX, but it may still be
a well-behaved extension if given sane behavior for how to deal with
surrogates.

>   2) Code that uses mbrtowc() or wcrtomb() is also likely to malfunction.
>      On Cygwin >= 1.7 mbrtowc() and wcrtomb() is implemented in an intelligent
>      but somewhat surprising way: wcrtomb() may return 0, that is, produce no
>      output bytes when it consumes a wchar_t.

>   Now with a chinese character outside the BMP:
>   $ 	
>         1       4
>   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
>         3       6
> 
>   On Cygwin 1.7.5 (with LANG=C.UTF-8 and 'wc' from GNU coreutils 8.5):
> 
>   $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
>         1       5
>   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
>         2       7
>
>   So both the number of characters and the number of words are counted
>   wrong as soon as non-BMP characters occur.
>

Does this represent a bug in cygwin's mbrtowc routines that could be
fixed by cygwin?

Or, does this represent a bug in coreutils for using mbrtowc one
character at a time instead of something like mbsrtowcs to do bulk
conversions?

And if we decide that cygwin's extensions are sane, how much harder is
it to characterize what a program must do to be portable to both 16-bit
and 32-bit wchar_t if they are guaranteed the same behavior for all
hosts of the same-size wchar_t?  In other words, would it really require
that many #ifdefs in coreutils to portably and simultaneously support
both sizes of wchar_t?

> I'm more in favour of overriding wchar_t and all functions that depend on it -
> like we did successfully for the socket functions.
> 
> In practice, this would mean that on Windows (both native Windows and
> Cygwin >= 1.7) the use of a 'wchar_t' module will
>   - override wchar_t to be 32 bits, like in glibc,
>   - cause functions from mbrtowc() to wcwidth() to be overridden. Since the
>     corresponding system functions are unusable, the replacements will use the
>     modules from libunistring (such as unictype/ctype-alnum and uniwidth/width).

That's a lot of overriding, for anything that uses wchar_t in its API,
and throws out a lot of what cygwin already provides.  It also means
that compiler primitives, like L"xyz", which result in 16-bit wchar_t
arrays, will be unusable with your 32-bit wchar_t override.  In other
words, I don't think it's a good idea to be doing that.

C1x will be adding compiler support for mandatory char16_t and char32_t
types for UTF-16 and UTF-32 data, independently of whether wchar_t is
16-bit or 32-bit; maybe the better thing is to proactively start
providing the new interfaces in <uchar.h> that will result from C1x
adoption (and convert GNU programs to use this rather than wchar_t for
character operations), although without compiler support for u"" and U""
(and even u8""), we are no better than ditching compiler support for L""
if you force a wchar_t size override.

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1516.pdf lists:

7.27 Unicode utilities <uchar.h>
1 The header <uchar.h> declares types and functions for manipulating Unicode
 characters.
2 The types declared are mbstate_t (described in 7.29.1) and size_t
(described in
 7.19);
char16_t
which is an unsigned integer type used for 16-bit characters and is the
same type as
uint_least16_t (described in 7.20.1.2); and
char32_t
which is an unsigned integer type used for 32-bit characters and is the
same type as
uint_least32_t (also described in 7.20.1.2).

mbrtoc16
c16rtomb
mbrtoc32
c32rtomb

but no variants for replacing wprintf and friends (convert to multibyte
and use printf and friends instead).

-- 
Eric Blake   eblake@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org