[adding cygwin and coreutils for a wc issue]

On 01/30/2011 07:04 PM, Bruno Haible wrote:
> Hi,
>
> It is known for a long time that on native Windows, the wchar_t[] encoding on
> strings is UTF-16. [1] Now, Corinna Vinschen has confirmed that it is the same
> for Cygwin >= 1.7. [2]

POSIX requires that 1 wchar_t corresponds to 1 character, so any use of
surrogates to get the full benefit of UTF-16 falls outside the bounds of
POSIX.  At that point, the POSIX definition of those functions no longer
applies, and we can try to make the various wc* functions behave as
smartly as possible (as is the case with Cygwin); those smarts are only
needed when you use surrogate pairs.

If cygwin's approach is correct, then maybe the thing to do is codify
those smarts for all implementations with 16-bit wchar_t as an extension
to POSIX that all gnulib clients can rely on, and thus minimize the
#ifdefs in such clients.

> What consequences does this have?
>
> 1) All code that uses the functions from <wctype.h> (wide character
> classification and mapping) or wcwidth() malfunctions on strings that
> contains Unicode characters outside the BMP, i.e. outside the range
> U+0000..U+FFFF.

Not necessarily.  Such code falls outside of POSIX, but it may still be a
well-behaved extension if given sane behavior for how to deal with
surrogates.

> 2) Code that uses mbrtowc() or wcrtomb() is also likely to malfunction.
> On Cygwin >= 1.7 mbrtowc() and wcrtomb() is implemented in an intelligent
> but somewhat surprising way: wcrtomb() may return 0, that is, produce no
> output bytes when it consumes a wchar_t.
> Now with a chinese character outside the BMP:
> $
> 1 4
> $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> 3 6
>
> On Cygwin 1.7.5 (with LANG=C.UTF-8 and 'wc' from GNU coreutils 8.5):
>
> $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
> 1 5
> $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> 2 7
>
> So both the number of characters and the number of words are counted
> wrong as soon as non-BMP characters occur.
>

Does this represent a bug in cygwin's mbrtowc routines that could be
fixed by cygwin?  Or, does this represent a bug in coreutils for using
mbrtowc one character at a time instead of something like mbsrtowcs to
do bulk conversions?

And if we decide that cygwin's extensions are sane, how much harder is it
to characterize what a program must do to be portable to both 16-bit and
32-bit wchar_t if they are guaranteed the same behavior for all hosts of
the same-size wchar_t?  In other words, would it really require that many
#ifdefs in coreutils to portably and simultaneously support both sizes of
wchar_t?

> I'm more in favour of overriding wchar_t and all functions that depend on it -
> like we did successfully for the socket functions.
>
> In practice, this would mean that on Windows (both native Windows and
> Cygwin >= 1.7) the use of a 'wchar_t' module will
> - override wchar_t to be 32 bits, like in glibc,
> - cause functions from mbrtowc() to wcwidth() to be overridden. Since the
> corresponding system functions are unusable, the replacements will use the
> modules from libunistring (such as unictype/ctype-alnum and uniwidth/width).

That's a lot of overriding, for anything that uses wchar_t in its API,
and throws out a lot of what cygwin already provides.  It also means that
compiler primitives, like L"xyz", which result in 16-bit wchar_t arrays,
will be unusable with your 32-bit wchar_t override.  In other words, I
don't think it's a good idea to be doing that.
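As a rough illustration of the portability question above, here is a
minimal sketch of counting characters in a wchar_t array without any
#ifdefs.  It assumes that a 16-bit wchar_t holds UTF-16 (surrogate pairs
for non-BMP characters) and that a wider wchar_t holds one code point per
unit.  The names wcs_next_code_point and wcs_count_code_points are
hypothetical; they are not part of gnulib, coreutils, or Cygwin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Decode one code point from s[0..n) and store it in *cp.
   Returns the number of wchar_t units consumed (1 or 2).
   Assumption: when WCHAR_MAX <= 0xFFFF the array is UTF-16 and wchar_t
   is unsigned; otherwise each unit is already a full code point. */
static size_t wcs_next_code_point(const wchar_t *s, size_t n, uint32_t *cp)
{
    assert(s != NULL && cp != NULL && n > 0);
    uint32_t u = (uint32_t)s[0];

    /* The width test is a compile-time constant, so no #ifdef is needed. */
    if (WCHAR_MAX <= 0xFFFF && n >= 2 && u >= 0xD800 && u <= 0xDBFF) {
        uint32_t lo = (uint32_t)s[1];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            /* High surrogate followed by low surrogate: combine them. */
            *cp = 0x10000u + (((u - 0xD800u) << 10) | (lo - 0xDC00u));
            return 2;
        }
    }
    /* BMP character, full-width unit, or a lone surrogate passed through. */
    *cp = u;
    return 1;
}

/* Count code points in s[0..n); a surrogate pair counts as one. */
static size_t wcs_count_code_points(const wchar_t *s, size_t n)
{
    size_t count = 0;
    size_t i = 0;
    while (i < n) {
        uint32_t cp;
        i += wcs_next_code_point(s + i, n - i, &cp);
        count++;
    }
    return count;
}
```

Because the compiler encodes a wide literal in whatever width wchar_t
has, wcs_count_code_points applied to L"a\U00021234b\n" should give 4
with either size; U+21234 is the character written as \xf0\xa1\x88\xb4
in the wc examples.  Whether the wc* classification functions could be
taught the same trick is a separate question.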
C1x will be adding compiler support for mandatory char16_t and char32_t
types for UTF-16 and UTF-32 data, independently of whether wchar_t is
16-bit or 32-bit.  Maybe the better thing is to proactively start
providing the new interfaces in <uchar.h> that will result from C1x
adoption (and convert GNU programs to use this rather than wchar_t for
character operations).  Without compiler support for u"" and U"" (and
even u8""), though, we are no better off than if we ditched compiler
support for L"" by forcing a wchar_t size override.

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1516.pdf lists:

  7.27 Unicode utilities <uchar.h>

  1 The header <uchar.h> declares types and functions for manipulating
  Unicode characters.

  2 The types declared are mbstate_t (described in 7.29.1) and size_t
  (described in 7.19); char16_t which is an unsigned integer type used
  for 16-bit characters and is the same type as uint_least16_t
  (described in 7.20.1.2); and char32_t which is an unsigned integer
  type used for 32-bit characters and is the same type as
  uint_least32_t (also described in 7.20.1.2).

  mbrtoc16
  c16rtomb
  mbrtoc32
  c32rtomb

but no variants for replacing wprintf and friends (convert to multibyte
and use printf and friends instead).

-- 
Eric Blake   eblake@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org
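
For comparison, a minimal sketch of per-character counting with the
mbrtoc32 interface from the draft, which yields one char32_t per
character regardless of the size of wchar_t.  It assumes a C library
that already provides <uchar.h>, which cannot be relied on everywhere.
mb_count_chars32 is a hypothetical helper name, and the input is decoded
according to the current locale.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>   /* mbstate_t */

/* Count the characters in the NUL-terminated multibyte string s, decoding
   it in the current locale's encoding.  Returns (size_t)-1 if the string
   contains an invalid or incomplete multibyte sequence. */
static size_t mb_count_chars32(const char *s)
{
    assert(s != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);   /* initial conversion state */

    size_t remaining = strlen(s);
    size_t count = 0;

    while (remaining > 0) {
        char32_t c;
        size_t r = mbrtoc32(&c, s, remaining, &state);

        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;         /* invalid or truncated sequence */
        if (r == (size_t)-3) {         /* extra output, no input consumed */
            count++;
            continue;
        }
        if (r == 0)                    /* NUL; not expected before the end */
            break;

        s += r;                        /* r bytes formed one character */
        remaining -= r;
        count++;
    }
    return count;
}
```

In a UTF-8 locale selected with setlocale(), a non-BMP character such as
the one in the wc examples should then come back as a single char32_t,
but that depends on the locale being available on the host.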