* Re: 16-bit wchar_t on Windows and Cygwin
From: Eric Blake @ 2011-01-31 19:16 UTC
To: Bruno Haible; +Cc: bug-gnulib, cygwin, bug-coreutils

[adding cygwin and coreutils for a wc issue]

On 01/30/2011 07:04 PM, Bruno Haible wrote:
> Hi,
>
> It has been known for a long time that on native Windows, the wchar_t[]
> encoding of strings is UTF-16. [1] Now, Corinna Vinschen has confirmed
> that it is the same for Cygwin >= 1.7. [2]

POSIX requires that 1 wchar_t corresponds to 1 character, so any use of
surrogates to get the full benefit of UTF-16 falls outside the bounds of
POSIX. At that point, the POSIX definition of those functions no longer
applies, and we can try to make the various wc* functions behave as
smartly as possible (as is the case with Cygwin); those smarts are only
needed when you use surrogate pairs.

If cygwin's approach is correct, then maybe the thing to do is codify
those smarts for all implementations with 16-bit wchar_t as an extension
to POSIX that all gnulib clients can rely on, and thus minimize the
#ifdefs in such clients.

> What consequences does this have?
>
> 1) All code that uses the functions from <wctype.h> (wide character
>    classification and mapping) or wcwidth() malfunctions on strings that
>    contain Unicode characters outside the BMP, i.e. outside the range
>    U+0000..U+FFFF.

Not necessarily. Such code falls outside of POSIX, but it may still be a
well-behaved extension if given sane behavior for how to deal with
surrogates.

> 2) Code that uses mbrtowc() or wcrtomb() is also likely to malfunction.
>    On Cygwin >= 1.7 mbrtowc() and wcrtomb() are implemented in an
>    intelligent but somewhat surprising way: wcrtomb() may return 0, that
>    is, produce no output bytes when it consumes a wchar_t.
> Now with a Chinese character outside the BMP:
>   $
>   1 4
>   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
>   3 6
>
> On Cygwin 1.7.5 (with LANG=C.UTF-8 and 'wc' from GNU coreutils 8.5):
>
>   $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
>   1 5
>   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
>   2 7
>
> So both the number of characters and the number of words are counted
> wrong as soon as non-BMP characters occur.
>

Does this represent a bug in cygwin's mbrtowc routines that could be
fixed by cygwin?

Or, does this represent a bug in coreutils for using mbrtowc one
character at a time instead of something like mbsrtowcs to do bulk
conversions?

And if we decide that cygwin's extensions are sane, how much harder is
it to characterize what a program must do to be portable to both 16-bit
and 32-bit wchar_t if they are guaranteed the same behavior for all
hosts of the same-size wchar_t? In other words, would it really require
that many #ifdefs in coreutils to portably and simultaneously support
both sizes of wchar_t?

> I'm more in favour of overriding wchar_t and all functions that depend
> on it - like we did successfully for the socket functions.
>
> In practice, this would mean that on Windows (both native Windows and
> Cygwin >= 1.7) the use of a 'wchar_t' module will
>   - override wchar_t to be 32 bits, like in glibc,
>   - cause functions from mbrtowc() to wcwidth() to be overridden. Since
>     the corresponding system functions are unusable, the replacements
>     will use the modules from libunistring (such as
>     unictype/ctype-alnum and uniwidth/width).

That's a lot of overriding, for anything that uses wchar_t in its API,
and throws out a lot of what cygwin already provides.
It also means that compiler primitives, like L"xyz", which result in
16-bit wchar_t arrays, will be unusable with your 32-bit wchar_t
override. In other words, I don't think it's a good idea to be doing
that.

C1x will be adding compiler support for mandatory char16_t and char32_t
types for UTF-16 and UTF-32 data, independently of whether wchar_t is
16-bit or 32-bit; maybe the better thing is to proactively start
providing the new interfaces in <uchar.h> that will result from C1x
adoption (and convert GNU programs to use this rather than wchar_t for
character operations), although without compiler support for u"" and
U"" (and even u8""), we are no better off than if we ditch compiler
support for L"" by forcing a wchar_t size override.

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1516.pdf lists:

7.27 Unicode utilities <uchar.h>

1 The header <uchar.h> declares types and functions for manipulating
  Unicode characters.
2 The types declared are mbstate_t (described in 7.29.1) and size_t
  (described in 7.19); char16_t which is an unsigned integer type used
  for 16-bit characters and is the same type as uint_least16_t
  (described in 7.20.1.2); and char32_t which is an unsigned integer
  type used for 32-bit characters and is the same type as
  uint_least32_t (also described in 7.20.1.2).

mbrtoc16
c16rtomb
mbrtoc32
c32rtomb

but no variants for replacing wprintf and friends (convert to multibyte
and use printf and friends instead).

--
Eric Blake   eblake@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org
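As a rough illustration of the <uchar.h> route mentioned above, the following sketch counts characters with mbrtoc32() instead of mbrtowc(). It assumes a C11 library that ships <uchar.h>; the helper name count_chars_c32 is hypothetical and not part of any of the projects discussed here.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>   /* char32_t, mbrtoc32 (C11) */
#include <wchar.h>   /* mbstate_t */

/* Hypothetical helper: count the characters in the n bytes at s, decoded
   according to the current locale. Returns (size_t) -1 if the input
   contains an invalid or truncated multibyte sequence. */
static size_t count_chars_c32(const char *s, size_t n)
{
    mbstate_t st;
    size_t count = 0;

    memset(&st, 0, sizeof st);       /* start in the initial shift state */
    while (n > 0) {
        char32_t c;
        size_t r = mbrtoc32(&c, s, n, &st);

        if (r == (size_t) -1 || r == (size_t) -2)
            return (size_t) -1;      /* invalid or incomplete sequence */
        if (r == (size_t) -3) {      /* a character was stored, no input used */
            count++;
            continue;
        }
        if (r == 0)
            r = 1;                   /* null character; assumed to take one byte */
        s += r;
        n -= r;
        count++;
    }
    return count;
}
```

Because each char32_t holds a whole code point, the loop needs no surrogate handling regardless of how wide wchar_t is on the host.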
* Re: 16-bit wchar_t on Windows and Cygwin
From: Corinna Vinschen @ 2011-01-31 20:49 UTC
To: cygwin, bug-gnulib, bug-coreutils

On Jan 31 09:58, Eric Blake wrote:
> > 2) Code that uses mbrtowc() or wcrtomb() is also likely to malfunction.
> >    On Cygwin >= 1.7 mbrtowc() and wcrtomb() are implemented in an
> >    intelligent but somewhat surprising way: wcrtomb() may return 0,
> >    that is, produce no output bytes when it consumes a wchar_t.
>
> > Now with a Chinese character outside the BMP:
> >   $
> >   1 4
> >   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> >   3 6
> >
> > On Cygwin 1.7.5 (with LANG=C.UTF-8 and 'wc' from GNU coreutils 8.5):
> >
> >   $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
> >   1 5
> >   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> >   2 7
> >
> > So both the number of characters and the number of words are counted
> > wrong as soon as non-BMP characters occur.
> >
>
> Does this represent a bug in cygwin's mbrtowc routines that could be
> fixed by cygwin?
>
> Or, does this represent a bug in coreutils for using mbrtowc one
> character at a time instead of something like mbsrtowcs to do bulk
> conversions?

Just to clarify a bit. This was discussed on the cygwin-developers
mailing list back in 2009. The original code which handled UTF-16
surrogates always wrote at least 1 byte to the destination UTF-8 string.
However, the problem is that Windows filenames may contain lone
surrogates, even though the filename is usually interpreted as UTF-16.
So the current code returns 0 bytes for the first surrogate half and
only writes the full UTF-8 sequence after the second surrogate half has
been evaluated. In the case where a lone high surrogate is still
pending, but the low surrogate is missing, we can just write out the
high surrogate in CESU-8 encoding.
This would not have been possible if we had already written the first
byte of the UTF-8 string. Lone low surrogates are written as a CESU-8
sequence immediately, so they are nothing to worry about.

As for wctomb/wcrtomb returning 0: Even if this looks like kind of a
stretch, this should not be a problem per POSIX. A return value of 0
from wctomb/wcrtomb has no special meaning(*). Even in the case where
the incoming wide char is L'\0', the resulting \0 is written and 1 is
returned. Since 0 bytes have been written to the destination string,
returning 0 is perfectly valid. If a calling function misinterprets the
return value of 0 as an error or EOF, it's not a bug in wctomb/wcrtomb.

For the original discussion, see
http://cygwin.com/ml/cygwin-developers/2009-09/msg00065.html

Corinna

(*) http://pubs.opengroup.org/onlinepubs/9699919799/functions/wcrtomb.html

--
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
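The behaviour described in this message can be sketched as a small state machine. This is only an illustration written from the description above, not Cygwin's actual code; the names u16_state and u16_to_u8_step are hypothetical, and the caller is assumed to provide room for at least 6 output bytes per call.

```c
#include <stddef.h>
#include <stdint.h>

/* Conversion state: a high surrogate that was consumed but not yet
   written, or 0 if nothing is pending. */
struct u16_state {
    uint16_t pending_high;
};

/* Encode a value below 0x10000 as 1 to 3 bytes. For a surrogate value
   this yields the 3-byte CESU-8 form. */
static size_t encode_bmp(unsigned char *out, uint32_t c)
{
    if (c < 0x80) {
        out[0] = (unsigned char) c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char) (0xC0 | (c >> 6));
        out[1] = (unsigned char) (0x80 | (c & 0x3F));
        return 2;
    }
    out[0] = (unsigned char) (0xE0 | (c >> 12));
    out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
    out[2] = (unsigned char) (0x80 | (c & 0x3F));
    return 3;
}

/* Feed one UTF-16 unit and return the number of bytes written to out.
   A high surrogate produces 0 bytes and is remembered in the state. */
static size_t u16_to_u8_step(struct u16_state *st, uint16_t u,
                             unsigned char *out)
{
    size_t n = 0;

    if (st->pending_high != 0) {
        uint16_t hi = st->pending_high;
        st->pending_high = 0;
        if (u >= 0xDC00 && u <= 0xDFFF) {
            /* Complete pair: emit the full 4-byte UTF-8 sequence. */
            uint32_t cp = 0x10000u + (((uint32_t) hi - 0xD800u) << 10)
                          + ((uint32_t) u - 0xDC00u);
            out[0] = (unsigned char) (0xF0 | (cp >> 18));
            out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
            out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            out[3] = (unsigned char) (0x80 | (cp & 0x3F));
            return 4;
        }
        /* The low half is missing: flush the lone high surrogate. */
        n = encode_bmp(out, hi);
    }
    if (u >= 0xD800 && u <= 0xDBFF) {
        st->pending_high = u;        /* wait for the second half */
        return n;
    }
    /* Ordinary unit, or a lone low surrogate written immediately. */
    return n + encode_bmp(out + n, u);
}
```

With the pair 0xD844 0xDE34 (the character used in the wc examples), the first call returns 0 and the second writes the bytes F0 A1 88 B4. A real implementation would also need a way to flush a high surrogate that is still pending at end of input.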
* Re: 16-bit wchar_t on Windows and Cygwin
From: Bruno Haible @ 2011-02-02 11:29 UTC
To: Eric Blake; +Cc: bug-gnulib, cygwin, bug-coreutils

Hello Eric,

> ... POSIX requires that 1 wchar_t corresponds to 1 character
> ...
> > What consequences does this have?
> >
> > 1) All code that uses the functions from <wctype.h> (wide character
> >    classification and mapping) or wcwidth() malfunctions on strings
> >    that contain Unicode characters outside the BMP, i.e. outside the
> >    range U+0000..U+FFFF.
>
> Not necessarily. Such code falls outside of POSIX, but it may still be
> a well-behaved extension if given sane behavior for how to deal with
> surrogates.

No. Code that uses <wctype.h> and wcwidth() is written precisely
according to POSIX. The problem is that this code cannot work correctly
when wchar_t[] is in UTF-16 encoding. There simply is no way to define
these functions in a reasonable way for surrogates.

For example:
  U+1031E = 0xD800 0xDF1E is a letter (iswalpha should be true)
  U+10320 = 0xD800 0xDF20 is not a letter (iswalpha should be false)
  U+1D31E = 0xD834 0xDF1E is not a letter (iswalpha should be false)
  U+1D320 = 0xD834 0xDF20 is not a letter (iswalpha should be false)
  U+1D71E = 0xD835 0xDF1E is a letter (iswalpha should be true)
  U+1D720 = 0xD835 0xDF20 is a letter (iswalpha should be true)
There is no way that a system can provide this information through a
function 'iswalpha' that takes a single wchar_t argument.
It would be possible to provide this information
  - either through a function iswalpha2 (wchar_t wc1, wchar_t wc2) that
    takes two wchar_t arguments,
  - or through a function uc_is_alpha (ucs4_t uc),
but that is not POSIX, and it would require rewriting each and every
piece of code that currently uses <wctype.h> in the POSIX way.

> we can try to make the various wc* functions behave as smartly as
> possible (as is the case with Cygwin); those smarts are only needed
> when you use surrogate pairs.

The point is that this approach can work fine for mbrtowc() and
wcrtomb(), but it cannot yield a working definition for the <wctype.h>
functions and wcwidth().

> > 2) Code that uses mbrtowc() or wcrtomb() is also likely to malfunction.
> >    On Cygwin >= 1.7 mbrtowc() and wcrtomb() are implemented in an
> >    intelligent but somewhat surprising way: wcrtomb() may return 0,
> >    that is, produce no output bytes when it consumes a wchar_t.
>
> > Now with a Chinese character outside the BMP:
> >   $
> >   1 4
> >   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> >   3 6
> >
> > On Cygwin 1.7.5 (with LANG=C.UTF-8 and 'wc' from GNU coreutils 8.5):
> >
> >   $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
> >   1 5
> >   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> >   2 7
> >
> > So both the number of characters and the number of words are counted
> > wrong as soon as non-BMP characters occur.
> >
>
> Does this represent a bug in cygwin's mbrtowc routines that could be
> fixed by cygwin?
>
> Or, does this represent a bug in coreutils for using mbrtowc one
> character at a time instead of something like mbsrtowcs to do bulk
> conversions?

We agree that it is a bug. And it is caused by
  - the fact that Cygwin's wchar_t[] encoding is UTF-16, and
  - there is no way to define the <wctype.h> POSIX functions sanely in
    this setting, and
  - coreutils and gnulib make use of the POSIX functions.
Even if coreutils were to use mbsrtowcs instead of repeated use of
mbrtowc, there would be no way for it to produce the correct result
without combining surrogates into entire characters.

> And if we decide that cygwin's extensions are sane, how much harder is
> it to characterize what a program must do to be portable to both 16-bit
> and 32-bit wchar_t if they are guaranteed the same behavior for all
> hosts of the same-size wchar_t? In other words, would it really require
> that many #ifdefs in coreutils to portably and simultaneously support
> both sizes of wchar_t?

It would require
  1. to change the conversions that use mbrtowc to either convert an
     entire string at once (use mbsrtowcs), or make a second call to
     mbrtowc once the first call to mbrtowc has returned a high
     surrogate.
  2. to change all uses of <wctype.h> and wcwidth() to use different
     functions, either functions that take 2 wchar_t arguments, or
     functions that require the caller to combine the surrogates.

This means lots of logic that goes against the spirit of wchar_t in
ANSI C Amd. 1 and POSIX.

> > I'm more in favour of overriding wchar_t and all functions that
> > depend on it - like we did successfully for the socket functions.
> >
> > In practice, this would mean that on Windows (both native Windows and
> > Cygwin >= 1.7) the use of a 'wchar_t' module will
> >   - override wchar_t to be 32 bits, like in glibc,
> >   - cause functions from mbrtowc() to wcwidth() to be overridden.
> >     Since the corresponding system functions are unusable, the
> >     replacements will use the modules from libunistring (such as
> >     unictype/ctype-alnum and uniwidth/width).
> ...
> compiler primitives, like L"xyz", which result in 16-bit wchar_t
> arrays, will be unusable

Good point. I agree then that overriding wchar_t had better not be done.
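The character counts quoted earlier in this message (4 versus 5, and 6 versus 7) follow directly from counting UTF-16 units instead of whole characters. The sketch below makes that visible for UTF-8 input; it assumes well-formed UTF-8, and the names u8_counts and count_utf8 are hypothetical.

```c
#include <stddef.h>

/* Result of scanning a UTF-8 buffer. */
struct u8_counts {
    size_t code_points;   /* whole characters */
    size_t utf16_units;   /* 16-bit units needed to hold the same text */
};

/* Hypothetical helper: assumes well-formed UTF-8 and looks only at the
   lead byte of each sequence. */
static struct u8_counts count_utf8(const unsigned char *s, size_t n)
{
    struct u8_counts c = { 0, 0 };
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned char b = s[i];

        if ((b & 0xC0) == 0x80)
            continue;                /* continuation byte, already counted */
        c.code_points++;
        /* Lead bytes from 0xF0 start 4-byte sequences, i.e. characters
           outside the BMP, which need a surrogate pair in UTF-16. */
        c.utf16_units += (b >= 0xF0) ? 2 : 1;
    }
    return c;
}
```

For the 7 bytes of 'a\xf0\xa1\x88\xb4b\n' this gives 4 code points and 5 UTF-16 units, matching the two wc -m results shown above; a loop that counts one character per wchar_t reports the second figure on a UTF-16 platform.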
> C1x will be adding compiler support for mandatory char16_t and char32_t
> types for UTF-16 and UTF-32 data, independently of whether wchar_t is
> 16-bit or 32-bit; maybe the better thing is to proactively start
> providing the new interfaces in <uchar.h> that will result from C1x
> adoption (and convert GNU programs to use this rather than wchar_t for
> character operations)
>
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1516.pdf lists:

A newer draft is at
https://www.opengroup.org/platform/single_unix_specification/uploads/40/23495/n1548.pdf

This is a good point, but would have two drawbacks:
  - It throws out the use of a POSIX API for a not-yet-standard API,
  - Performance: For the non-UTF-8 locales (ISO-8859-15, EUC-JP, and
    similar) on platforms like MacOS X, FreeBSD, Solaris, the 'wchar_t'
    representation is essentially a packed multibyte representation.
    This makes mbrtowc() fast, because it does not have to do a table
    lookup for the conversion from/to Unicode. If you use mbrtoc32
    instead of mbrtowc, you add extra runtime overhead for a conversion
    to Unicode that would not be necessary when using mbrtowc().

In other words, your proposal would solve the Windows wchar_t problem,
but at the price of a performance penalty on traditional Unix systems.

Here's a new proposal:
  - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t
    on Windows platforms and to 'wchar_t' otherwise.
  - Define functions 'mbrtowwc', 'iswwalpha', 'wwcwidth', and similar.
    Their definition will be a trivial redirection to 'mbrtowc',
    'iswalpha', 'wcwidth' on most platforms, and a use of libunistring
    modules on Windows platforms.

With this proposal,
  - The code that uses <wctype.h> has to be changed, but in a trivial
    way that introduces no complicated logic: Just change 'w' to 'ww'.
    Not more difficult than, say, using strtoll() instead of strtol().
  - The runtime penalty on non-Windows systems is minimal.
  - On Windows platforms, surrogates are handled correctly, and code
    that uses wchar_t or <windows.h> is left alone.

How does that sound? Comments?

Bruno
--
In memoriam Carl Friedrich Goerdeler <http://en.wikipedia.org/wiki/Carl_Friedrich_Goerdeler>
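A header implementing the proposal might start out roughly as follows. This is only a sketch of the idea: the names wwchar_t and iswwalpha come from the proposal, but the layout, the platform test, and the decision to merely declare the Windows variant (its implementation would consult Unicode tables, for instance from libunistring) are assumptions made for illustration.

```c
#include <stdint.h>
#include <wchar.h>
#include <wctype.h>

#if defined _WIN32 || defined __CYGWIN__
/* 16-bit wchar_t: use a 32-bit type that can hold a whole code point. */
typedef uint32_t wwchar_t;

/* Only declared in this sketch; a real replacement would be backed by
   Unicode character tables. */
int iswwalpha(wwchar_t wc);
#else
/* wchar_t already holds whole characters: redirect trivially. */
typedef wchar_t wwchar_t;

static inline int iswwalpha(wwchar_t wc)
{
    return iswalpha((wint_t) wc);
}
#endif
```

Calling code would then switch from iswalpha(wc) to iswwalpha(wc) without adding any surrogate logic of its own; mbrtowwc, wwcwidth and the other functions would follow the same pattern.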
* Re: 16-bit wchar_t on Windows and Cygwin
From: Corinna Vinschen @ 2011-02-02 12:15 UTC
To: cygwin, bug-gnulib, bug-coreutils

On Feb 2 12:29, Bruno Haible wrote:
> Hello Eric,
>
> > ... POSIX requires that 1 wchar_t corresponds to 1 character
> > ...
> > > What consequences does this have?
> > >
> > > 1) All code that uses the functions from <wctype.h> (wide character
> > >    classification and mapping) or wcwidth() malfunctions on strings
> > >    that contain Unicode characters outside the BMP, i.e. outside
> > >    the range U+0000..U+FFFF.
> >
> > Not necessarily. Such code falls outside of POSIX, but it may still be
> > a well-behaved extension if given sane behavior for how to deal with
> > surrogates.
>
> No. Code that uses <wctype.h> and wcwidth() is written precisely
> according to POSIX. The problem is that this code cannot work correctly
> when wchar_t[] is in UTF-16 encoding. There simply is no way to define
> these functions in a reasonable way for surrogates.
>
> For example:
>   U+1031E = 0xD800 0xDF1E is a letter (iswalpha should be true)
>   U+10320 = 0xD800 0xDF20 is not a letter (iswalpha should be false)
>   U+1D31E = 0xD834 0xDF1E is not a letter (iswalpha should be false)
>   U+1D320 = 0xD834 0xDF20 is not a letter (iswalpha should be false)
>   U+1D71E = 0xD835 0xDF1E is a letter (iswalpha should be true)
>   U+1D720 = 0xD835 0xDF20 is a letter (iswalpha should be true)
> There is no way that a system can provide this information through a
> function 'iswalpha' that takes a single wchar_t argument.

iswalpha takes wint_t, not wchar_t. Since sizeof (wint_t) is 4 bytes,
the function can return the correct value, provided that the application
converts the UTF-16 surrogate pair to UTF-32 before calling iswalpha.

> We agree that it is a bug.
> And it is caused by
>   - the fact that Cygwin's wchar_t[] encoding is UTF-16, and
>   - there is no way to define the <wctype.h> POSIX functions sanely in
>     this setting, and

See above.

Corinna
* Re: 16-bit wchar_t on Windows and Cygwin
From: Corinna Vinschen @ 2011-02-02 12:21 UTC
To: cygwin, bug-gnulib, bug-coreutils

On Feb 2 13:14, Corinna Vinschen wrote:
> On Feb 2 12:29, Bruno Haible wrote:
> > Hello Eric,
> >
> > > ... POSIX requires that 1 wchar_t corresponds to 1 character
> > > ...
> > > > What consequences does this have?
> > > >
> > > > 1) All code that uses the functions from <wctype.h> (wide
> > > >    character classification and mapping) or wcwidth() malfunctions
> > > >    on strings that contain Unicode characters outside the BMP,
> > > >    i.e. outside the range U+0000..U+FFFF.
> > >
> > > Not necessarily. Such code falls outside of POSIX, but it may still
> > > be a well-behaved extension if given sane behavior for how to deal
> > > with surrogates.
> >
> > No. Code that uses <wctype.h> and wcwidth() is written precisely
> > according to POSIX. The problem is that this code cannot work
> > correctly when wchar_t[] is in UTF-16 encoding. There simply is no
> > way to define these functions in a reasonable way for surrogates.
> >
> > For example:
> >   U+1031E = 0xD800 0xDF1E is a letter (iswalpha should be true)
> >   U+10320 = 0xD800 0xDF20 is not a letter (iswalpha should be false)
> >   U+1D31E = 0xD834 0xDF1E is not a letter (iswalpha should be false)
> >   U+1D320 = 0xD834 0xDF20 is not a letter (iswalpha should be false)
> >   U+1D71E = 0xD835 0xDF1E is a letter (iswalpha should be true)
> >   U+1D720 = 0xD835 0xDF20 is a letter (iswalpha should be true)
> > There is no way that a system can provide this information through a
> > function 'iswalpha' that takes a single wchar_t argument.
>
> iswalpha takes wint_t, not wchar_t. Since sizeof (wint_t) is 4 bytes,
> the function can return the correct value, provided that the application
> converts the UTF-16 surrogate pair to UTF-32 before calling iswalpha.
And, please note the wording in SUSv4, for instance in
http://calimero.vinschen.de/susv4/functions/iswalpha.html

  The wc argument is a wint_t, the value of which the application shall
                       ^^^^^^                         ^^^^^^^^^^^
  ensure is a wide-character code corresponding to a valid character in
  the current locale, or equal to the value of the macro WEOF. If the
  argument has any other value, the behavior is undefined.

I don't see any words in that which would disallow converting UTF-16
wchar_t surrogate pairs to a wint_t UTF-32 value before calling one of
the wctype functions. Just like you have to be careful not to call the
ctype functions with a signed char.

Corinna
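The conversion from a surrogate pair to a single UTF-32 value mentioned here is plain arithmetic. A minimal sketch follows; the function names are hypothetical, and whether the resulting value may be passed to iswalpha() depends on the platform, as discussed in this thread.

```c
#include <stdint.h>

/* High (leading) surrogates are 0xD800..0xDBFF. */
static int is_high_surrogate(uint16_t u)
{
    return u >= 0xD800 && u <= 0xDBFF;
}

/* Low (trailing) surrogates are 0xDC00..0xDFFF. */
static int is_low_surrogate(uint16_t u)
{
    return u >= 0xDC00 && u <= 0xDFFF;
}

/* Combine a valid high/low pair into the code point it encodes. */
static uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    return 0x10000u + (((uint32_t) hi - 0xD800u) << 10)
           + ((uint32_t) lo - 0xDC00u);
}
```

Applied to the pairs from the example table, 0xD800 0xDF1E yields 0x1031E and 0xD835 0xDF20 yields 0x1D720. On a platform whose iswalpha() accepts such values, the call would then look like iswalpha((wint_t) combine_surrogates(hi, lo)).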
* Re: 16-bit wchar_t on Windows and Cygwin
From: Bruno Haible @ 2011-02-02 16:03 UTC
To: bug-gnulib, cygwin, bug-coreutils, Eric Blake

Hello Corinna,

> And, please note the wording in SUSv4, for instance in
> http://calimero.vinschen.de/susv4/functions/iswalpha.html

Likewise in POSIX:2008, at the URL
http://www.opengroup.org/onlinepubs/9699919799/functions/iswalpha.html

>   The wc argument is a wint_t, the value of which the application shall
>                        ^^^^^^                         ^^^^^^^^^^^
>   ensure is a wide-character code corresponding to a valid character in
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   the current locale, or equal to the value of the macro WEOF. If the
>   argument has any other value, the behavior is undefined.

What this sentence means, as a formula, is that when an application
passes a 'wint_t x' to iswalpha(), it has to satisfy

   x == (wint_t) (wchar_t) x || x == WEOF

> iswalpha takes wint_t, not wchar_t. Since sizeof (wint_t) is 4 bytes,
> the function can return the correct value, provided that the application
> converts the UTF-16 surrogate pair to UTF-32 before calling iswalpha.

When an application does this, it passes an invalid wint_t value to
iswalpha(), according to the spec paragraph that you have just cited.
So the application uses an extension to POSIX functionality, not
POSIX itself.

I see that Cygwin 1.7.x iswalpha() works in this way you describe (but
mingw's iswalpha() doesn't). So this means that gnulib's proposed
iswwalpha(wwchar_t) function could be implemented using iswalpha()
on Cygwin 1.7.x and will not cause the Unicode based tables to be
included in the executable. This is good and nice.
But if you say that the application should convert UTF-16 surrogates
to UTF-32 before calling iswalpha: That's certainly a requirement
for Cygwin 1.7.x applications that want to support the entire Unicode
character set. But it's outside of POSIX, and many GNU programs will
not want to include this added complexity. Just try to apply this
suggestion to gnulib's quotearg.c, then estimate the time someone
would need to apply it also to regcomp.c, strftime.c, mbscasestr.c,
coreutils/src/wc.c, and so on.

For this reason I propose the wwchar_t type with an API that is similar
to POSIX <wctype.h> but includes the surrogate handling, rather than
pushing it into each application's code.

Bruno
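The validity condition stated in this message can be checked mechanically. The sketch below emulates a platform with a 16-bit wchar_t and a 32-bit wint_t using fixed-width stand-in types, so that it behaves the same everywhere; the names wchar16, wint32, WEOF32 and is_valid_wctype_arg are hypothetical, and the value chosen for WEOF32 is an assumption.

```c
#include <stdint.h>

typedef uint16_t wchar16;                /* stand-in for a 16-bit wchar_t */
typedef uint32_t wint32;                 /* stand-in for a 32-bit wint_t */
#define WEOF32 ((wint32) 0xFFFFFFFFu)    /* stand-in for WEOF (assumed value) */

/* Nonzero if x survives a round trip through the 16-bit wide character
   type, or is the end-of-file marker. */
static int is_valid_wctype_arg(wint32 x)
{
    return x == (wint32) (wchar16) x || x == WEOF32;
}
```

Under this reading, 0x41 and the single unit 0xDF1E pass the check, while the combined value 0x1031E does not, which is why passing it to iswalpha() counts as an extension rather than as POSIX usage.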
* Re: 16-bit wchar_t on Windows and Cygwin
From: Corinna Vinschen @ 2011-02-02 16:28 UTC
To: cygwin, bug-gnulib, bug-coreutils

Hi Bruno,

On Feb 2 17:02, Bruno Haible wrote:
> Hello Corinna,
>
> > And, please note the wording in SUSv4, for instance in
> > http://calimero.vinschen.de/susv4/functions/iswalpha.html
>
> Likewise in POSIX:2008, at the URL
> http://www.opengroup.org/onlinepubs/9699919799/functions/iswalpha.html

Oops, sorry for the wrong URL! I'm using a local copy of SUSv4 for
speed, but forgot that entirely when copy/pasting it.

> >   The wc argument is a wint_t, the value of which the application shall
> >                        ^^^^^^                         ^^^^^^^^^^^
> >   ensure is a wide-character code corresponding to a valid character in
>                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >   the current locale, or equal to the value of the macro WEOF. If the
> >   argument has any other value, the behavior is undefined.
>
> What this sentence means, as a formula, is that when an application
> passes a 'wint_t x' to iswalpha(), it has to satisfy
>
>    x == (wint_t) (wchar_t) x || x == WEOF

Sure, I agree. But it doesn't say this *exactly*, so I took the liberty
to stretch the limits a bit so that there is *some* way for applications
to use the wctype functions despite using UTF-16 and despite having a
surrogate value.

> > iswalpha takes wint_t, not wchar_t. Since sizeof (wint_t) is 4 bytes,
> > the function can return the correct value, provided that the
> > application converts the UTF-16 surrogate pair to UTF-32 before
> > calling iswalpha.
>
> When an application does this, it passes an invalid wint_t value to
> iswalpha(), according to the spec paragraph that you have just cited.
> So the application uses an extension to POSIX functionality, not
> POSIX itself.
Well, given that the description doesn't explicitly talk about a value
given as wchar_t, but instead about a "wide-character code corresponding
to a valid character", I saw some room for interpretation...

> I see that Cygwin 1.7.x iswalpha() works in this way you describe (but
> mingw's iswalpha() doesn't). So this means that gnulib's proposed
> iswwalpha(wwchar_t) function could be implemented using iswalpha()
> on Cygwin 1.7.x and will not cause the Unicode based tables to be
> included in the executable. This is good and nice.

I'm glad you see it that way.

> But if you say that the application should convert UTF-16 surrogates
> to UTF-32 before calling iswalpha: That's certainly a requirement
> for Cygwin 1.7.x applications that want to support the entire Unicode
> character set. But it's outside of POSIX, and many GNU programs will
> not want to include this added complexity. Just try to apply this
> suggestion to gnulib's quotearg.c, then estimate the time someone
> would need to apply it also to regcomp.c, strftime.c, mbscasestr.c,
> coreutils/src/wc.c, and so on.

Cygwin's regcomp is taken from FreeBSD and is UTF-16 capable, including
surrogate handling. It only required two changes in the code.

But I see what you mean. Another layer which abstracts this problem
looks like the right thing to do.

Corinna
* Re: 16-bit wchar_t on Windows and Cygwin
From: Corinna Vinschen @ 2011-02-02 16:35 UTC
To: cygwin, bug-gnulib, bug-coreutils

On Feb 2 17:28, Corinna Vinschen wrote:
> On Feb 2 17:02, Bruno Haible wrote:
> > But if you say that the application should convert UTF-16 surrogates
> > to UTF-32 before calling iswalpha: That's certainly a requirement
> > for Cygwin 1.7.x applications that want to support the entire Unicode
> > character set. But it's outside of POSIX, and many GNU programs will
> > not want to include this added complexity. Just try to apply this
> > suggestion to gnulib's quotearg.c, then estimate the time someone
> > would need to apply it also to regcomp.c, strftime.c, mbscasestr.c,
> > coreutils/src/wc.c, and so on.
>
> Cygwin's regcomp is taken from FreeBSD and is UTF-16 capable, including
> surrogate handling. It only required two changes in the code.

Btw., I sure would be glad if Cygwin used a 4-byte wchar_t as well. The
problem is that this requires too many changes at once to work right,
and it would introduce a lot of backward compatibility problems which
would have to be handled.

If only the ones who decided that wchar_t in Cygwin should have the
same size as WCHAR in the underlying Windows had thought twice about
the implications...

Corinna
* Re: 16-bit wchar_t on Windows and Cygwin
From: Andy Koppe @ 2011-02-02 20:28 UTC
To: cygwin, bug-gnulib, bug-coreutils

On 2 February 2011 16:35, Corinna Vinschen wrote:
> On Feb 2 17:28, Corinna Vinschen wrote:
>> On Feb 2 17:02, Bruno Haible wrote:
>> > But if you say that the application should convert UTF-16 surrogates
>> > to UTF-32 before calling iswalpha: That's certainly a requirement
>> > for Cygwin 1.7.x applications that want to support the entire Unicode
>> > character set. But it's outside of POSIX, and many GNU programs will
>> > not want to include this added complexity. Just try to apply this
>> > suggestion to gnulib's quotearg.c, then estimate the time someone
>> > would need to apply it also to regcomp.c, strftime.c, mbscasestr.c,
>> > coreutils/src/wc.c, and so on.
>>
>> Cygwin's regcomp is taken from FreeBSD and is UTF-16 capable, including
>> surrogate handling. It only required two changes in the code.
>
> Btw., I sure would be glad if Cygwin used a 4-byte wchar_t as well. The
> problem is that this requires too many changes at once to work right,
> and it would introduce a lot of backward compatibility problems which
> would have to be handled.

Cygwin 1.7 might have been a good point for that change, because the
lack of proper locale and charset support in previous versions meant
that backward compatibility was much less of a concern than it is now.
But it's a difficult change indeed, and it's not entirely clear that
it's worthwhile. I guess 64-bit Cygwin (if or when it happens) might be
the next opportunity.

> If only the ones who decided that wchar_t in Cygwin should have the
> same size as WCHAR in the underlying Windows had thought twice about
> the implications...
Windows Unicode support was introduced with Windows NT in 1993, whereas Unicode was only extended beyond 16 bits with version 2.0 in 1996. Cygwin was first released the year before. If the Unicode extension was a consideration at all (which I'd doubt), wchar_t != WCHAR probably seemed far more daunting than having to deal with surrogates at some point down the line. Andy
* Re: 16-bit wchar_t on Windows and Cygwin 2011-02-02 16:35 ` Corinna Vinschen 2011-02-02 20:28 ` Andy Koppe @ 2011-02-04 22:46 ` Warren Young 1 sibling, 0 replies; 21+ messages in thread From: Warren Young @ 2011-02-04 22:46 UTC (permalink / raw) To: cygwin, bug-gnulib, bug-coreutils On 2/2/2011 9:35 AM, Corinna Vinschen wrote: > > If only the one's who decided that wchar_t in Cygwin should have the > same size as WCHAR_T in the underlying Windows would have thought twice > about the implications... Cygwin 1.9? Or maybe 2.0, if it breaks ABIs?
* Re: bug#7948: 16-bit wchar_t on Windows and Cygwin 2011-02-02 11:29 ` Bruno Haible 2011-02-02 12:15 ` Corinna Vinschen 2011-02-02 16:03 ` Bruno Haible @ 2011-02-02 17:52 ` Paul Eggert 2011-02-02 18:57 ` Bruno Haible 2011-02-03 12:57 ` Ulf Zibis 2011-02-02 21:24 ` Eric Blake 3 siblings, 2 replies; 21+ messages in thread From: Paul Eggert @ 2011-02-02 17:52 UTC (permalink / raw) To: Bruno Haible; +Cc: Eric Blake, bug-gnulib, cygwin, bug-coreutils On 02/02/11 03:29, Bruno Haible wrote: > - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t > on Windows platforms and to 'wchar_t' otherwise. As a minor point, would it be OK to call this type 'xchar_t' instead? 'x' is the successor to 'w', after all, and it can be thought of as an abbreviation for 'eXtended'. A problem with the 'ww' prefix is that mentally I start thinking "World Wide ..."
* Re: bug#7948: 16-bit wchar_t on Windows and Cygwin 2011-02-02 17:52 ` bug#7948: " Paul Eggert @ 2011-02-02 18:57 ` Bruno Haible 2011-02-02 20:43 ` Andy Koppe 2011-02-03 12:57 ` Ulf Zibis 1 sibling, 1 reply; 21+ messages in thread From: Bruno Haible @ 2011-02-02 18:57 UTC (permalink / raw) To: Paul Eggert; +Cc: Eric Blake, bug-gnulib, cygwin, bug-coreutils Hi Paul, > > - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t > > on Windows platforms and to 'wchar_t' otherwise. > > As a minor point, would it be OK to call this type > 'xchar_t' instead? 'x' is the successor to 'w', after all, > and it can be thought of as an abbreviation for 'eXtended'. 'wwchar_t' means "wide wide character". In fact it's not really an "extended" character or "complex character". It's just what POSIX calls a 'wchar_t'. I like the analogy between strtol and strtoll. In the beginning, people thought a 'long int' would be enough for everything. Then they discovered a 'long long int' is needed. The same story repeats itself here with the "wide characters" which turn out to be not wide enough, and "wide wide characters" are needed. > A problem with the 'ww' prefix is that mentally I start thinking > "World Wide ..." Indeed this meaning can come to mind, but I think it's not dangerous since the term "world wide" has no meaning in a programming language. Bruno -- In memoriam Carl Friedrich Goerdeler <http://en.wikipedia.org/wiki/Carl_Friedrich_Goerdeler>
* Re: bug#7948: 16-bit wchar_t on Windows and Cygwin 2011-02-02 18:57 ` Bruno Haible @ 2011-02-02 20:43 ` Andy Koppe 0 siblings, 0 replies; 21+ messages in thread From: Andy Koppe @ 2011-02-02 20:43 UTC (permalink / raw) To: cygwin; +Cc: Paul Eggert, Eric Blake, bug-gnulib, bug-coreutils On 2 February 2011 18:57, Bruno Haible wrote: > Hi Paul, > >> > - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t >> > on Windows platforms and to 'wchar_t' otherwise. >> >> As a minor point, would it be OK to call this type >> 'xchar_t' instead? 'x' is the successor to 'w', after all, >> and it can be thought of as an abbreviation for 'eXtended'. > > 'wwchar_t' means "wide wide character". > > In fact it's not really an "extended" character or "complex character". > It's just what POSIX calls a 'wchar_t'. It's extended in the sense that the original Unicode was only 16 bits wide (which of course is why wchar_t on Windows is 16 bits). Also, I think 'xchar_t' is less prone to typos, in particular forgetting one of the dubyas. Andy
* Re: bug#7948: 16-bit wchar_t on Windows and Cygwin 2011-02-02 17:52 ` bug#7948: " Paul Eggert 2011-02-02 18:57 ` Bruno Haible @ 2011-02-03 12:57 ` Ulf Zibis 1 sibling, 0 replies; 21+ messages in thread From: Ulf Zibis @ 2011-02-03 12:57 UTC (permalink / raw) To: Paul Eggert; +Cc: Bruno Haible, bug-coreutils, cygwin, bug-gnulib, Eric Blake

Hi,

I think a similar bug is under discussion on GNU:
bug#7960: [PATCH] fmt: fix formatting multibyte text (bug #7372)

-Ulf

On 02.02.2011 18:51, Paul Eggert wrote:
> On 02/02/11 03:29, Bruno Haible wrote:
>> - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t
>> on Windows platforms and to 'wchar_t' otherwise.
> As a minor point, would it be OK to call this type
> 'xchar_t' instead? 'x' is the successor to 'w', after all,
> and it can be thought of as an abbreviation for 'eXtended'.
>
> A problem with the 'ww' prefix is that mentally I start thinking
> "World Wide ..."
* Re: 16-bit wchar_t on Windows and Cygwin 2011-02-02 11:29 ` Bruno Haible ` (2 preceding siblings ...) 2011-02-02 17:52 ` bug#7948: " Paul Eggert @ 2011-02-02 21:24 ` Eric Blake 2011-02-02 21:39 ` Corinna Vinschen 2011-02-02 23:03 ` Bruno Haible 3 siblings, 2 replies; 21+ messages in thread From: Eric Blake @ 2011-02-02 21:24 UTC (permalink / raw) To: cygwin, bug-gnulib [-- Attachment #1: Type: text/plain, Size: 1685 bytes --] [dropping coreutils at this point] On 02/02/2011 04:29 AM, Bruno Haible wrote: > Good point. I agree then that overriding wchar_t should better not be > done. > > Here's a new proposal: > - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t > on Windows platforms and to 'wchar_t' otherwise. > - Define functions 'mbrtowwc', 'iswwalpha', 'wwcwidth', and similar. > Their definition will be a trivial redirection to 'mbrtowc', 'iswalpha', > 'wcwidth' on most platforms, and a use of libunistring modules on > Windows platforms. I like the idea of making a new type wrapper. Are you thinking of making a sane wrapping around either 4-byte wchar_t or which maps to 2-byte wchar_t but sanely handles UTF-16 (which makes it a thin wrapper on both Linux and Cygwin, but needing more work on mingw), or are you thinking that it is always a 4-byte type (needing lots more memory manipulation on cygwin to convert between 2- and 4-byte representations when using cygwin's functions, or else reimplementing everything from scratch by completely bypassing cygwin)? As to the name: I agree the opinion of others that xchar_t is easier to type and easier to avoid typos of a missing 'w' than wwchar_t. On the other hand, I can see wwprintf that takes wide-wchar_t values, but gnulib already has xprintf as a counterpart to xmalloc (which calls exit() if the printf fails for memory allocation or other non-I/O related reasons), so we can't blindly use 'x' instead of 'ww' when replacing existing 'w' in POSIX APIs. 
-- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org
* Re: 16-bit wchar_t on Windows and Cygwin 2011-02-02 21:24 ` Eric Blake @ 2011-02-02 21:39 ` Corinna Vinschen 2011-02-02 23:03 ` Bruno Haible 1 sibling, 0 replies; 21+ messages in thread From: Corinna Vinschen @ 2011-02-02 21:39 UTC (permalink / raw) To: cygwin, bug-gnulib On Feb 2 14:24, Eric Blake wrote: > [dropping coreutils at this point] > > On 02/02/2011 04:29 AM, Bruno Haible wrote: > > Good point. I agree then that overriding wchar_t should better not be > > done. > > > > Here's a new proposal: > > - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t > > on Windows platforms and to 'wchar_t' otherwise. > > - Define functions 'mbrtowwc', 'iswwalpha', 'wwcwidth', and similar. > > Their definition will be a trivial redirection to 'mbrtowc', 'iswalpha', > > 'wcwidth' on most platforms, and a use of libunistring modules on > > Windows platforms. > > I like the idea of making a new type wrapper. > > Are you thinking of making a sane wrapping around either 4-byte wchar_t > or which maps to 2-byte wchar_t but sanely handles UTF-16 (which makes > it a thin wrapper on both Linux and Cygwin, but needing more work on > mingw), or are you thinking that it is always a 4-byte type (needing > lots more memory manipulation on cygwin to convert between 2- and 4-byte > representations when using cygwin's functions, or else reimplementing > everything from scratch by completely bypassing cygwin)? > > As to the name: I agree the opinion of others that xchar_t is easier to > type and easier to avoid typos of a missing 'w' than wwchar_t. On the > other hand, I can see wwprintf that takes wide-wchar_t values, but > gnulib already has xprintf as a counterpart to xmalloc (which calls > exit() if the printf fails for memory allocation or other non-I/O > related reasons), so we can't blindly use 'x' instead of 'ww' when > replacing existing 'w' in POSIX APIs. May I suggest a compromise? What about "xwchar_t"? 
It avoids the potential typo by accidentally dropping the second w. It still contains "wchar" which implies that it's a *wide* char type. And the x could be read as "extended".

Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
* Re: 16-bit wchar_t on Windows and Cygwin 2011-02-02 21:24 ` Eric Blake 2011-02-02 21:39 ` Corinna Vinschen @ 2011-02-02 23:03 ` Bruno Haible 2011-02-02 23:19 ` Eric Blake 1 sibling, 1 reply; 21+ messages in thread From: Bruno Haible @ 2011-02-02 23:03 UTC (permalink / raw) To: bug-gnulib; +Cc: Eric Blake, cygwin Hello Eric, > > Here's a new proposal: > > - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t > > on Windows platforms and to 'wchar_t' otherwise. > > - Define functions 'mbrtowwc', 'iswwalpha', 'wwcwidth', and similar. > > Their definition will be a trivial redirection to 'mbrtowc', 'iswalpha', > > 'wcwidth' on most platforms, and a use of libunistring modules on > > Windows platforms. > ... > Are you thinking of making a sane wrapping around either 4-byte wchar_t > or which maps to 2-byte wchar_t but sanely handles UTF-16 (which makes > it a thin wrapper on both Linux and Cygwin, but needing more work on > mingw), or are you thinking that it is always a 4-byte type (needing > lots more memory manipulation on cygwin to convert between 2- and 4-byte > representations when using cygwin's functions, or else reimplementing > everything from scratch by completely bypassing cygwin)? I'm not sure I understand your question. The plan is that - On platforms with a 32-bit wchar_t, like glibc, *BSD, and many others, 'wwchar_t' is identical to 'wchar_t', and the function wrappers are simple redirections. - On Cygwin and mingw, wwchar_t is 'uint32_t' (so as to accommodate all Unicode characters and WEOF and so that it plays well with 'wint_t'). mbrtowwc is implemented by 1 or 2 calls to mbrtowc. mbsrtowwcs may be implemented by a call to mbsrtowcs and an additional conversion loop, or it might be implemented on top of mbrtowwc; that's merely a speed vs. memory trade-off. The plan is not to "completely bypassing cygwin", but to use as much of Cygwin's built-ins as makes sense. 
- On platforms with a 16-bit wchar_t but where the wchar_t[] encoding
  in Unicode locales is merely UCS-2, like AIX, use the no-op thin
  wrappers as well. If the platform does not support more than the BMP,
  it makes not much sense for GNU programs to try to work around that.

> As to the name: I agree the opinion of others that xchar_t is easier to
> type and easier to avoid typos of a missing 'w' than wwchar_t.

If a developer makes a typo here, he's likely to get a gcc warning or a
link error. But yes, it's possible to pass a 'wwchar_t' to iswalpha(),
which will yield wrong results. I don't think this risk can be much reduced
through a different name.

> gnulib already has xprintf as a counterpart to xmalloc (which calls
> exit() if the printf fails for memory allocation or other non-I/O
> related reasons), so we can't blindly use 'x'

Good point. The 'x' prefix has already several meanings in gnulib:
  - checking against memory allocation failure,
  - checking against errors,
  - no size limitation,
  - a more convenient interface,
  - a wrapper that prints an error message.
It doesn't seem wise to add another meaning to it.

Thanks for the feedback.

-- 
In memoriam Carl Friedrich Goerdeler <http://en.wikipedia.org/wiki/Carl_Friedrich_Goerdeler>
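Editorial sketch of the "additional conversion loop" that the plan above mentions for an mbsrtowwcs built on top of mbsrtowcs: once the system has produced UTF-16 units, a second pass expands them into 32-bit elements. This is only an illustration under assumptions. The function name `utf16_to_utf32` is hypothetical, plain `uint16_t`/`uint32_t` stand in for the platform's `wchar_t` and the proposed `wwchar_t`, and replacing unpaired surrogates with U+FFFD is a choice made here, not something the thread specifies.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: expand n UTF-16 units from src into 32-bit code
   points in dst.  dst must have room for n elements, since the output
   is never longer than the input.  Unpaired surrogates become U+FFFD
   (an assumption of this sketch).  Returns the number of elements
   written to dst. */
size_t utf16_to_utf32(const uint16_t *src, size_t n, uint32_t *dst)
{
    size_t i = 0;
    size_t out = 0;

    while (i < n) {
        uint32_t u = src[i++];

        if (u >= 0xD800 && u <= 0xDBFF
            && i < n && src[i] >= 0xDC00 && src[i] <= 0xDFFF) {
            /* Valid pair: consume the low surrogate as well. */
            u = 0x10000u + ((u - 0xD800u) << 10) + (src[i++] - 0xDC00u);
        } else if (u >= 0xD800 && u <= 0xDFFF) {
            /* Lone surrogate: substitute the replacement character. */
            u = 0xFFFD;
        }
        dst[out++] = u;
    }
    return out;
}
```

With this shape, the five UTF-16 units for 'a', U+21234, 'b', and a newline become four elements, one per character, which is the property the 32-bit type is meant to restore for callers.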
* Re: 16-bit wchar_t on Windows and Cygwin 2011-02-02 23:03 ` Bruno Haible @ 2011-02-02 23:19 ` Eric Blake 2011-02-03 0:13 ` Bruno Haible 0 siblings, 1 reply; 21+ messages in thread From: Eric Blake @ 2011-02-02 23:19 UTC (permalink / raw) To: Bruno Haible; +Cc: bug-gnulib, cygwin [-- Attachment #1: Type: text/plain, Size: 2854 bytes --] On 02/02/2011 04:03 PM, Bruno Haible wrote: >> Are you thinking of making a sane wrapping around either 4-byte wchar_t >> or which maps to 2-byte wchar_t but sanely handles UTF-16 (which makes >> it a thin wrapper on both Linux and Cygwin, but needing more work on >> mingw), or are you thinking that it is always a 4-byte type (needing >> lots more memory manipulation on cygwin to convert between 2- and 4-byte >> representations when using cygwin's functions, or else reimplementing >> everything from scratch by completely bypassing cygwin)? > > I'm not sure I understand your question. The plan is that > > - On platforms with a 32-bit wchar_t, like glibc, *BSD, and many others, > 'wwchar_t' is identical to 'wchar_t', and the function wrappers are > simple redirections. > > - On Cygwin and mingw, wwchar_t is 'uint32_t' (so as to accommodate > all Unicode characters and WEOF and so that it plays well with 'wint_t'). > mbrtowwc is implemented by 1 or 2 calls to mbrtowc. mbsrtowwcs may be > implemented by a call to mbsrtowcs and an additional conversion loop, > or it might be implemented on top of mbrtowwc; that's merely a speed > vs. memory trade-off. > The plan is not to "completely bypassing cygwin", but to use as much > of Cygwin's built-ins as makes sense. You answered my question in spite of myself. 
I was asking:

should wwchar_t (or xwchar_t, but not xchar_t) be 2-bytes on cygwin, but
unlike the POSIX definition of wchar_t being always 1 character per
unit, the new type is explicitly documented as being multi-unit on some
platforms but with sane semantics

or should it always be 4-bytes, where conversion from wchar_t to
wwchar_t requires some efforts, and where the new type must be used
everywhere (which means wrapping a lot of APIs), but where you can once
again assume POSIX semantics of 1 character per unit, simplifying life
of callers at the expense of converting to the new type

And on asking the question in those more detailed words, I agree with
your conclusion - on cygwin, wwchar_t should be 4 bytes.

>
> - On platforms with a 16-bit wchar_t but where the wchar_t[] encoding
> in Unicode locales is merely UCS-2, like AIX, use the no-op thin
> wrappers as well. If the platform does not support more than the BMP,
> it makes not much sense for GNU programs to try to work around that.

Agreed.

Next question/thought. Gnulib should definitely tackle this first. But
if it works out, should we also add wwchar_t natively into cygwin? It
would certainly be easier to add new interfaces incrementally, in
preparation for a possible future ABI conversion to make wchar_t become
4 bytes.

-- 
Eric Blake eblake@redhat.com +1-801-349-2682
Libvirt virtualization library http://libvirt.org
* Re: 16-bit wchar_t on Windows and Cygwin 2011-02-02 23:19 ` Eric Blake @ 2011-02-03 0:13 ` Bruno Haible 2011-02-03 9:42 ` Corinna Vinschen 0 siblings, 1 reply; 21+ messages in thread From: Bruno Haible @ 2011-02-03 0:13 UTC (permalink / raw) To: Eric Blake; +Cc: bug-gnulib, cygwin

Hi Eric,

> I was asking:
>
> should wwchar_t (or xwchar_t, but not xchar_t) be 2-bytes on cygwin, but
> unlike the POSIX definition of wchar_t being always 1 character per
> unit, the new type is explicitly documented as being multi-unit on some
> platforms but with sane semantics
>
> or should it always be 4-bytes, where conversion from wchar_t to
> wwchar_t requires some efforts, and where the new type must be used
> everywhere (which means wrapping a lot of APIs), but where you can once
> again assume POSIX semantics of 1 character per unit, simplifying life
> of callers at the expense of converting to the new type

In the first case we wouldn't need a new type.

The plan is the second alternative. The goal is *not* to have to extend
each of quotearg.c, regcomp.c, mbchar.h, wc.c, etc. to handle UTF-16
explicitly with #ifdefs, more variables, and more logic.

> if it works out, should we also add wwchar_t natively into cygwin?

More and more Unix platforms offer only UTF-8 locales. One can predict
that in 10 years, all Unix platforms will offer only UTF-8 locales. At this
point wchar_t will be UCS-4 on all these platforms (except AIX).

The mbrtoc32 function from the C1X API that you pointed to will then be
equivalent to mbrtowwc.

So, you can view 'wwchar_t' as a temporary measure that will bridge the
gap between the ANSI C Amd. 1 API and the C1X API.

Bruno
-- 
In memoriam Carl Friedrich Goerdeler <http://en.wikipedia.org/wiki/Carl_Friedrich_Goerdeler>
* Re: 16-bit wchar_t on Windows and Cygwin 2011-02-03 0:13 ` Bruno Haible @ 2011-02-03 9:42 ` Corinna Vinschen 2011-02-03 10:48 ` Bruno Haible 0 siblings, 1 reply; 21+ messages in thread From: Corinna Vinschen @ 2011-02-03 9:42 UTC (permalink / raw) To: cygwin, bug-gnulib On Feb 3 01:12, Bruno Haible wrote: > Hi Eric, > > > I was asking: > > > > should wwchar_t (or xwchar_t, but not xchar_t) be 2-bytes on cygwin, but > > unlike the POSIX definition of wchar_t being always 1 character per > > unit, the new type is explicitly documented as being multi-unit on some > > platforms but with sane semantics > > > > or should it always be 4-bytes, where conversion from wchar_t to > > wwchar_t requires some efforts, and where the new type must be used > > everywhere (which means wrapping a lot of APIs), but where you can once > > again assume POSIX semantics of 1 character per unit, simplifying life > > of callers at the expense of converting to the new type > > In the first case we wouldn't need a new type. > > The plan is the second alternative. The goal is *not* to have to extend > each of quotearg.c, regcomp.c, mbchar.h, wc.c, etc. to handle UTF-16 > explicitly with #ifdefs, more variables, and more logic. > > > if it works out, should we also add wwchar_t natively into cygwin? > > More and more Unix platforms offer only UTF-8 locales. One can predict > that in 10 years, all Unix platforms will offer only UTF-8 locales. At this > point wchar_t will be UCS-4 on all these platforms (except AIX). > > The mbrtoc32 function from the C1X API that you pointed to will then be > equivalent to mbrtowwc. > > So, you can view 'wwchar_t' as a temporary measure that will bridge the > gap between the ANSI C Amd. 1 API and the C1X API. Maybe I'm just dense, but isn't wwchar_t equivalent to wint_t on all platforms? On UCS-4 platforms sizeof(wint_t) == sizeof(wchar_t) == 4 because there's no reason to make it bigger. 
On UCS-2 and UTF-16 platforms sizeof(wint_t) == 4 because it must be able to hold EOF as well. So, why not just use the wint_t type for the time being?

Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
* Re: 16-bit wchar_t on Windows and Cygwin 2011-02-03 9:42 ` Corinna Vinschen @ 2011-02-03 10:48 ` Bruno Haible 0 siblings, 0 replies; 21+ messages in thread From: Bruno Haible @ 2011-02-03 10:48 UTC (permalink / raw) To: bug-gnulib, cygwin

Corinna Vinschen wrote:
> isn't wwchar_t equivalent to wint_t on all
> platforms? On UCS-4 platforms sizeof(wint_t) == sizeof(wchar_t) == 4
> because there's no reason to make it bigger. On UCS-2 and UTF-16
> platforms sizeof(wint_t) == 4 because it must be able to hold EOF as
> well. So, why not just use the wint_t type for the time being?

The "must be able to hold WEOF as well" argument holds for the argument type
of iswwalpha. If we were to call it 'wwint_t', it would be the same as
'wint_t', yes. For this reason, we don't need a separate type 'wwint_t'.

But 'wwchar_t' is the base type for wide wide character _arrays_. Such arrays
don't need to hold the WEOF value. On AIX platforms, where wchar_t[] is the
UCS-2 encoding, wwchar_t[] can be synonymous to it. There is no need to make
wwchar_t 32 bits wide on these platforms.

So, my current code looks like this:

  # if (defined _WIN32 || defined __WIN32__) || defined __CYGWIN__
  /* Define 'wwchar_t' as a type that
       - can hold 32 bits, unlike wchar_t which can hold only 16 bits,
       - promotes to 'wint_t' under the default argument promotions.  */
  typedef wint_t wwchar_t; /* actually 'unsigned int' or 'uint32_t' */
  # else
  typedef wchar_t wwchar_t;
  # endif

Bruno
-- 
In memoriam Buddy Holly <http://en.wikipedia.org/wiki/Buddy_Holly>