Re: 16-bit wchar_t on Windows and Cygwin

public inbox for cygwin@cygwin.com

* Re: 16-bit wchar_t on Windows and Cygwin
       [not found] <201101310304.42975.bruno@clisp.org>
@ 2011-01-31 19:16 ` Eric Blake
  2011-01-31 20:49   ` Corinna Vinschen
  2011-02-02 11:29   ` Bruno Haible
  0 siblings, 2 replies; 21+ messages in thread
From: Eric Blake @ 2011-01-31 19:16 UTC (permalink / raw)
  To: Bruno Haible; +Cc: bug-gnulib, cygwin, bug-coreutils

[adding cygwin and coreutils for a wc issue]

On 01/30/2011 07:04 PM, Bruno Haible wrote:
> Hi,
> 
> It is known for a long time that on native Windows, the wchar_t[] encoding on
> strings is UTF-16. [1] Now, Corinna Vinschen has confirmed that it is the same
> for Cygwin >= 1.7. [2]

POSIX requires that 1 wchar_t corresponds to 1 character; so any use of
surrogates to get the full benefit of UTF-16 falls outside the bounds of
POSIX.  At that point, the POSIX definitions of those functions no
longer apply, and we can try to make the various wc* functions
behave as smartly as possible (as is the case with Cygwin); those
smarts are only needed when you use surrogate pairs.  If cygwin's
approach is correct, then maybe the thing to do is codify those smarts
for all implementations with 16-bit wchar_t as an extension to POSIX
that all gnulib clients can rely on, and thus minimize the #ifdefs in
such clients.

> What consequences does this have?
> 
>   1) All code that uses the functions from <wctype.h> (wide character
>      classification and mapping) or wcwidth() malfunctions on strings that
>      contains Unicode characters outside the BMP, i.e. outside the range
>      U+0000..U+FFFF.

Not necessarily.  Such code falls outside of POSIX, but it may still be
a well-behaved extension if given sane behavior for how to deal with
surrogates.

>   2) Code that uses mbrtowc() or wcrtomb() is also likely to malfunction.
>      On Cygwin >= 1.7 mbrtowc() and wcrtomb() is implemented in an intelligent
>      but somewhat surprising way: wcrtomb() may return 0, that is, produce no
>      output bytes when it consumes a wchar_t.

>   Now with a Chinese character outside the BMP:
>   $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
>         1       4
>   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
>         3       6
> 
>   On Cygwin 1.7.5 (with LANG=C.UTF-8 and 'wc' from GNU coreutils 8.5):
> 
>   $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
>         1       5
>   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
>         2       7
>
>   So both the number of characters and the number of words are counted
>   wrong as soon as non-BMP characters occur.
>

Does this represent a bug in cygwin's mbrtowc routines that could be
fixed by cygwin?

Or, does this represent a bug in coreutils for using mbrtowc one
character at a time instead of something like mbsrtowcs to do bulk
conversions?

And if we decide that cygwin's extensions are sane, how much harder is
it to characterize what a program must do to be portable to both 16-bit
and 32-bit wchar_t if they are guaranteed the same behavior for all
hosts of the same-size wchar_t?  In other words, would it really require
that many #ifdefs in coreutils to portably and simultaneously support
both sizes of wchar_t?

> I'm more in favour of overriding wchar_t and all functions that depend on it -
> like we did successfully for the socket functions.
> 
> In practice, this would mean that on Windows (both native Windows and
> Cygwin >= 1.7) the use of a 'wchar_t' module will
>   - override wchar_t to be 32 bits, like in glibc,
>   - cause functions from mbrtowc() to wcwidth() to be overridden. Since the
>     corresponding system functions are unusable, the replacements will use the
>     modules from libunistring (such as unictype/ctype-alnum and uniwidth/width).

That's a lot of overriding, for anything that uses wchar_t in its API,
and throws out a lot of what cygwin already provides.  It also means
that compiler primitives, like L"xyz", which result in 16-bit wchar_t
arrays, will be unusable with your 32-bit wchar_t override.  In other
words, I don't think it's a good idea to be doing that.

C1x will be adding compiler support for mandatory char16_t and char32_t
types for UTF-16 and UTF-32 data, independently of whether wchar_t is
16-bit or 32-bit; maybe the better thing is to proactively start
providing the new interfaces in <uchar.h> that will result from C1x
adoption (and convert GNU programs to use this rather than wchar_t for
character operations), although without compiler support for u"" and U""
(and even u8""), we are no better than ditching compiler support for L""
if you force a wchar_t size override.

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1516.pdf lists:

7.27 Unicode utilities <uchar.h>
1 The header <uchar.h> declares types and functions for manipulating Unicode
  characters.
2 The types declared are mbstate_t (described in 7.29.1) and size_t
  (described in 7.19);
    char16_t
  which is an unsigned integer type used for 16-bit characters and is the
  same type as uint_least16_t (described in 7.20.1.2); and
    char32_t
  which is an unsigned integer type used for 32-bit characters and is the
  same type as uint_least32_t (also described in 7.20.1.2).

mbrtoc16
c16rtomb
mbrtoc32
c32rtomb

but no variants for replacing wprintf and friends (convert to multibyte
and use printf and friends instead).
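
As a rough sketch of what using these interfaces could look like, assuming
a C library that already ships <uchar.h> with mbrtoc32 (not every platform
does), a character count needs one mbrtoc32 call per character regardless
of the size of wchar_t.  The name count_chars_c32 is made up for this
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>              /* mbstate_t */

/* Count the characters in the n-byte multibyte string s, decoded with
   the current locale's encoding.  Returns (size_t) -1 on an invalid or
   truncated sequence.  Hypothetical helper, not part of any library.  */
size_t
count_chars_c32 (const char *s, size_t n)
{
  mbstate_t state;
  size_t count = 0;

  assert (s != NULL || n == 0);
  memset (&state, 0, sizeof state);
  while (n > 0)
    {
      char32_t c;
      size_t ret = mbrtoc32 (&c, s, n, &state);

      if (ret == (size_t) -1 || ret == (size_t) -2)
        return (size_t) -1;     /* invalid or incomplete sequence */
      if (ret == (size_t) -3)
        {
          count++;              /* character stored, no input consumed */
          continue;
        }
      if (ret == 0)
        ret = 1;                /* a null character occupies one byte */
      s += ret;
      n -= ret;
      count++;
    }
  return count;
}
```

Because char32_t holds a whole code point, the loop never sees half a
character, which is the property the wchar_t-based code lacks on
platforms with a 16-bit wchar_t.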

-- 
Eric Blake   eblake@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org

* Re: 16-bit wchar_t on Windows and Cygwin
  2011-01-31 19:16 ` 16-bit wchar_t on Windows and Cygwin Eric Blake
@ 2011-01-31 20:49   ` Corinna Vinschen
  2011-02-02 11:29   ` Bruno Haible
  1 sibling, 0 replies; 21+ messages in thread
From: Corinna Vinschen @ 2011-01-31 20:49 UTC (permalink / raw)
  To: cygwin, bug-gnulib, bug-coreutils

On Jan 31 09:58, Eric Blake wrote:
> >   2) Code that uses mbrtowc() or wcrtomb() is also likely to malfunction.
> >      On Cygwin >= 1.7 mbrtowc() and wcrtomb() is implemented in an intelligent
> >      but somewhat surprising way: wcrtomb() may return 0, that is, produce no
> >      output bytes when it consumes a wchar_t.
> 
> >   Now with a Chinese character outside the BMP:
> >   $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
> >         1       4
> >   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> >         3       6
> > 
> >   On Cygwin 1.7.5 (with LANG=C.UTF-8 and 'wc' from GNU coreutils 8.5):
> > 
> >   $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
> >         1       5
> >   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> >         2       7
> >
> >   So both the number of characters and the number of words are counted
> >   wrong as soon as non-BMP characters occur.
> >
> 
> Does this represent a bug in cygwin's mbrtowc routines that could be
> fixed by cygwin?
> 
> Or, does this represent a bug in coreutils for using mbrtowc one
> character at a time instead of something like mbsrtowcs to do bulk
> conversions?

Just to clarify a bit.  This has been discussed on the cygwin-developer
mailing list back in 2009.  The original code which handled UTF-16
surrogates always wrote at least 1 byte to the destination UTF-8 string.
However, the problem is that Windows filenames may contain lone
surrogates (unpaired surrogate halves), even though the filename is
usually interpreted as UTF-16.

So the current code returns 0 bytes for the first surrogate half and
only writes the full UTF-8 sequence after the second surrogate half has
been evaluated.  In the case where a lone high surrogate is still
pending, but the low surrogate is missing, we can just write out the
high surrogate in CESU-8 encoding.  This would not have been possible if
we had already written the first byte of the UTF-8 string.  Lone low
surrogates are written as a CESU-8 sequence immediately, so they are
nothing to worry about.
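
The buffering scheme described above can be sketched as a small state
machine.  This is an illustration only, not Cygwin's actual code; the
function name utf16_unit_to_utf8 and the explicit 'pending' parameter
(standing in for what an mbstate_t would carry) are invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode a 16-bit value as a three-byte sequence.  For a surrogate
   this is the CESU-8 style encoding mentioned above.  */
static size_t
put3 (uint16_t u, unsigned char *out)
{
  out[0] = 0xE0 | (u >> 12);
  out[1] = 0x80 | ((u >> 6) & 0x3F);
  out[2] = 0x80 | (u & 0x3F);
  return 3;
}

/* Encode any 16-bit value (surrogates included) in 1 to 3 bytes.  */
static size_t
put_bmp (uint16_t u, unsigned char *out)
{
  if (u < 0x80)
    {
      out[0] = u;
      return 1;
    }
  if (u < 0x800)
    {
      out[0] = 0xC0 | (u >> 6);
      out[1] = 0x80 | (u & 0x3F);
      return 2;
    }
  return put3 (u, out);
}

/* Feed one UTF-16 unit.  *pending holds a buffered high surrogate, or 0.
   Returns the number of bytes written to out, which may be 0; out must
   have room for 6 bytes.  */
size_t
utf16_unit_to_utf8 (uint16_t unit, uint16_t *pending, unsigned char *out)
{
  size_t n = 0;

  assert (pending != NULL && out != NULL);
  if (*pending != 0)
    {
      uint16_t hi = *pending;

      *pending = 0;
      if (unit >= 0xDC00 && unit <= 0xDFFF)
        {
          /* Complete pair: emit the four-byte UTF-8 sequence.  */
          uint32_t cp = 0x10000 + (((uint32_t) hi - 0xD800) << 10)
                        + ((uint32_t) unit - 0xDC00);

          out[0] = 0xF0 | (cp >> 18);
          out[1] = 0x80 | ((cp >> 12) & 0x3F);
          out[2] = 0x80 | ((cp >> 6) & 0x3F);
          out[3] = 0x80 | (cp & 0x3F);
          return 4;
        }
      n = put3 (hi, out);       /* lone high surrogate: flush it alone */
    }
  if (unit >= 0xD800 && unit <= 0xDBFF)
    {
      *pending = unit;          /* wait for the low half */
      return n;                 /* 0 unless a lone high was flushed */
    }
  return n + put_bmp (unit, out + n);
}
```

Feeding 0xD844 returns 0 and feeding 0xDE34 afterwards returns 4, with
the bytes F0 A1 88 B4, the same sequence used in the wc examples.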

As for wctomb/wcrtomb returning 0:  Even if this looks like kind of a
stretch, this should not be a problem per POSIX.  A return value of 0
from wctomb/wcrtomb has no special meaning(*).  Even in the case where
the incoming wide char is L'\0', the resulting \0 is written and 1 is
returned.  Since 0 bytes have been written to the destination string,
returning 0 is perfectly valid.  If a calling function misinterprets the
return value of 0 as an error or EOF, it's not a bug in wctomb/wcrtomb.

For the original discussion, see
http://cygwin.com/ml/cygwin-developers/2009-09/msg00065.html

Corinna

(*) http://pubs.opengroup.org/onlinepubs/9699919799/functions/wcrtomb.html

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

* Re: 16-bit wchar_t on Windows and Cygwin
  2011-01-31 19:16 ` 16-bit wchar_t on Windows and Cygwin Eric Blake
  2011-01-31 20:49   ` Corinna Vinschen
@ 2011-02-02 11:29   ` Bruno Haible
  2011-02-02 12:15     ` Corinna Vinschen
                       ` (3 more replies)
  1 sibling, 4 replies; 21+ messages in thread
From: Bruno Haible @ 2011-02-02 11:29 UTC (permalink / raw)
  To: Eric Blake; +Cc: bug-gnulib, cygwin, bug-coreutils

Hello Eric,

> ... POSIX requires that 1 wchar_t corresponds to 1 character
> ...
> > What consequences does this have?
> > 
> >   1) All code that uses the functions from <wctype.h> (wide character
> >      classification and mapping) or wcwidth() malfunctions on strings that
> >      contains Unicode characters outside the BMP, i.e. outside the range
> >      U+0000..U+FFFF.
> 
> Not necessarily.  Such code falls outside of POSIX, but it may still be
> a well-behaved extension if given sane behavior for how to deal with
> surrogates.

No. Code that uses <wctype.h> and wcwidth() is written precisely according
to POSIX. The problem is that this code cannot work correctly when wchar_t[]
is in UTF-16 encoding. There simply is no way to define these functions
in a reasonable way for surrogates.

For example:
  U+1031E = 0xD800 0xDF1E   is a letter (iswalpha should be true)
  U+10320 = 0xD800 0xDF20   is not a letter (iswalpha should be false)
  U+1D31E = 0xD834 0xDF1E   is not a letter (iswalpha should be false)
  U+1D320 = 0xD834 0xDF20   is not a letter (iswalpha should be false)
  U+1D71E = 0xD835 0xDF1E   is a letter (iswalpha should be true)
  U+1D720 = 0xD835 0xDF20   is a letter (iswalpha should be true)
There is no way that a system can provide this information through a
function 'iswalpha' that takes a single wchar_t argument.

It would be possible to provide this information
  - either through a function iswalpha2 (wchar_t wc1, wchar_t wc2)
    that takes two wchar_t arguments,
  - or through a function uc_is_alpha (ucs4_t uc),
but that is not POSIX, and it would require rewriting each and every
piece of code that currently uses <wctype.h> in the POSIX way.
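
Either alternative has to combine the two halves before any property can
be looked up.  A minimal sketch of that step follows; the name
utf16_combine is made up here.  Note that the low surrogate 0xDF1E
appears in three of the code points listed above, with different
classifications, so it carries no usable information on its own.

```c
#include <assert.h>
#include <stdint.h>

/* Combine a high (0xD800..0xDBFF) and a low (0xDC00..0xDFFF) UTF-16
   surrogate into the code point they encode.  */
uint32_t
utf16_combine (uint16_t hi, uint16_t lo)
{
  assert (hi >= 0xD800 && hi <= 0xDBFF);
  assert (lo >= 0xDC00 && lo <= 0xDFFF);
  return 0x10000 + (((uint32_t) hi - 0xD800) << 10)
         + ((uint32_t) lo - 0xDC00);
}
```

A two-argument classifier in the style of iswalpha2 could then hand the
result to a code-point-based function such as uc_is_alpha.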

> we can (try) to make the various wc* functions try to
> behave as smartly as possible (as is the case with Cygwin); where those
> smarts are only needed when you use surrogate pairs.

The point is that this approach can work fine for mbrtowc() and wcrtomb(),
but it cannot yield a working definition for the <wctype.h> functions and
wcwidth().

> >   2) Code that uses mbrtowc() or wcrtomb() is also likely to malfunction.
> >      On Cygwin >= 1.7 mbrtowc() and wcrtomb() is implemented in an intelligent
> >      but somewhat surprising way: wcrtomb() may return 0, that is, produce no
> >      output bytes when it consumes a wchar_t.
> 
> >   Now with a Chinese character outside the BMP:
> >   $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
> >         1       4
> >   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> >         3       6
> > 
> >   On Cygwin 1.7.5 (with LANG=C.UTF-8 and 'wc' from GNU coreutils 8.5):
> > 
> >   $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
> >         1       5
> >   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> >         2       7
> >
> >   So both the number of characters and the number of words are counted
> >   wrong as soon as non-BMP characters occur.
> >
> 
> Does this represent a bug in cygwin's mbrtowc routines that could be
> fixed by cygwin?
> 
> Or, does this represent a bug in coreutils for using mbrtowc one
> character at a time instead of something like mbsrtowcs to do bulk
> conversions?

We agree that it is a bug. And it is caused by
  - the fact that Cygwin's wchar_t[] encoding is UTF-16, and
  - there is no way to define the <wctype.h> POSIX functions sanely in this
    setting, and
  - coreutils and gnulib make use of the POSIX functions.

Even if coreutils were to use mbsrtowcs instead of repeated use of
mbrtowc, there would be no way for it to produce the correct result
without combining surrogates into entire characters.

> And if we decide that cygwin's extensions are sane, how much harder is
> it to characterize what a program must do to be portable to both 16-bit
> and 32-bit wchar_t if they are guaranteed the same behavior for all
> hosts of the same-size wchar_t?  In other words, would it really require
> that many #ifdefs in coreutils to portably and simultaneously support
> both sizes of wchar_t?

It would require
  1. to change the conversions that use mbrtowc to either convert an
     entire string at once (use mbsrtowcs), or make a second call to
     mbrtowc once the first call to mbrtowc has returned a high
     (leading) surrogate.
  2. to change all uses of <wctype.h> and wcwidth() to use different
     functions, either functions that take 2 wchar_t arguments, or
     functions that require the caller to combine the surrogates.

This means a lot of logic that goes against the spirit of wchar_t
in ANSI C Amd. 1 and POSIX.
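
To give an idea of the extra logic, here is a sketch of a character count
over UTF-16 units that pairs surrogates by hand.  The function name is
invented, and uint16_t stands in for a 16-bit wchar_t so that the sketch
behaves the same everywhere.  The test character in the wc examples,
U+21234, is the pair 0xD844 0xDE34 in UTF-16, so "a", that character,
"b" and a newline make 5 units but 4 characters, which would be
consistent with the 5 versus 4 reported by wc -m.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count code points in n UTF-16 units.  A high surrogate followed by
   a low surrogate counts once; a lone surrogate counts as one.  */
size_t
utf16_count_code_points (const uint16_t *s, size_t n)
{
  size_t count = 0;
  size_t i = 0;

  assert (s != NULL || n == 0);
  while (i < n)
    {
      int is_high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
      int next_is_low = i + 1 < n
                        && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;

      i += (is_high && next_is_low) ? 2 : 1;
      count++;
    }
  return count;
}
```

Every loop that currently advances one wchar_t at a time would need the
same look-ahead.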

> > I'm more in favour of overriding wchar_t and all functions that depend on it -
> > like we did successfully for the socket functions.
> > 
> > In practice, this would mean that on Windows (both native Windows and
> > Cygwin >= 1.7) the use of a 'wchar_t' module will
> >   - override wchar_t to be 32 bits, like in glibc,
> >   - cause functions from mbrtowc() to wcwidth() to be overridden. Since the
> >     corresponding system functions are unusable, the replacements will use the
> >     modules from libunistring (such as unictype/ctype-alnum and uniwidth/width).
> ...
> compiler primitives, like L"xyz", which result in 16-bit wchar_t
> arrays, will be unusable

Good point. I agree then that overriding wchar_t should better not be
done.

> C1x will be adding compiler support for mandatory char16_t and char32_t
> types for UTF-16 and UTF-32 data, independently of whether wchar_t is
> 16-bit or 32-bit; maybe the better thing is to proactively start
> providing the new interfaces in <uchar.h> that will result from C1x
> adoption (and convert GNU programs to use this rather than wchar_t for
> character operations)
> 
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1516.pdf lists:

A newer draft is at
https://www.opengroup.org/platform/single_unix_specification/uploads/40/23495/n1548.pdf

This is a good point, but would have two drawbacks:

  - It throws out the use of a POSIX API for a not-yet-standard API,

  - Performance: For the non-UTF-8 locales (ISO-8859-15, EUC-JP, and
    similar) on platforms like MacOS X, FreeBSD, Solaris, the 'wchar_t'
    representation is essentially a packed multibyte representation.
    Which makes mbrtowc() fast, because it does not have to do a table
    lookup for the conversion from/to Unicode. If you use mbrtoc32
    instead of mbrtowc, you add extra runtime overhead for a conversion
    to Unicode, that would not be necessary when using mbrtowc().

In other words, your proposal would solve the Windows wchar_t problem,
but at the price of a performance penalty on traditional Unix systems.

Here's a new proposal:
  - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t
    on Windows platforms and to 'wchar_t' otherwise.
  - Define functions 'mbrtowwc', 'iswwalpha', 'wwcwidth', and similar.
    Their definition will be a trivial redirection to 'mbrtowc', 'iswalpha',
    'wcwidth' on most platforms, and a use of libunistring modules on
    Windows platforms.

With this proposal,

  - The code that uses <wctype.h> has to be changed, but in a trivial
    way that introduces no complicated logic: Just change 'w' to 'ww'.
    Not more difficult than, say, using strtoll() instead of strtol().

  - The runtime penalty on non-Windows systems is minimal.

  - On Windows platforms, surrogates are handled correctly, and
    code that uses wchar_t or <windows.h> is left alone.
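
A minimal sketch of what the proposal could look like for a single
function.  The platform test, the choice of libunistring's uc_is_alpha
for the Windows branch, and the caller count_letters are assumptions
made for illustration; nothing here is existing gnulib code.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

#if defined _WIN32 || defined __CYGWIN__
/* 16-bit wchar_t: use a 32-bit type and a code-point-based classifier.  */
# include <stdint.h>
# include <unictype.h>          /* libunistring */
typedef uint32_t wwchar_t;
static inline int
iswwalpha (wwchar_t wc)
{
  return uc_is_alpha (wc);
}
#else
/* wchar_t already holds a full character: trivial redirection.  */
typedef wchar_t wwchar_t;
static inline int
iswwalpha (wwchar_t wc)
{
  return iswalpha ((wint_t) wc);
}
#endif

/* Caller code: the same as the <wctype.h> version except for 'ww'.  */
size_t
count_letters (const wwchar_t *s, size_t n)
{
  size_t count = 0;

  assert (s != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    if (iswwalpha (s[i]))
      count++;
  return count;
}
```

The surrogate handling would live in the conversion function (mbrtowwc
in the proposal), so callers like count_letters never see half a
character.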

How does that sound? Comments?

Bruno
-- 
In memoriam Carl Friedrich Goerdeler <http://en.wikipedia.org/wiki/Carl_Friedrich_Goerdeler>

* Re: 16-bit wchar_t on Windows and Cygwin
  2011-02-02 11:29   ` Bruno Haible
@ 2011-02-02 12:15     ` Corinna Vinschen
  2011-02-02 12:21       ` Corinna Vinschen
  2011-02-02 16:03     ` Bruno Haible
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Corinna Vinschen @ 2011-02-02 12:15 UTC (permalink / raw)
  To: cygwin, bug-gnulib, bug-coreutils

On Feb  2 12:29, Bruno Haible wrote:
> Hello Eric,
> 
> > ... POSIX requires that 1 wchar_t corresponds to 1 character
> > ...
> > > What consequences does this have?
> > > 
> > >   1) All code that uses the functions from <wctype.h> (wide character
> > >      classification and mapping) or wcwidth() malfunctions on strings that
> > >      contains Unicode characters outside the BMP, i.e. outside the range
> > >      U+0000..U+FFFF.
> > 
> > Not necessarily.  Such code falls outside of POSIX, but it may still be
> > a well-behaved extension if given sane behavior for how to deal with
> > surrogates.
> 
> No. Code that uses <wctype.h> and wcwidth() is written precisely according
> to POSIX. The problem is that this code cannot work correctly when wchar_t[]
> is in UTF-16 encoding. There simply is no way to define these functions
> in a reasonable way for surrogates.
> 
> For example:
>   U+1031E = 0xD800 0xDF1E   is a letter (iswalpha should be true)
>   U+10320 = 0xD800 0xDF20   is not a letter (iswalpha should be false)
>   U+1D31E = 0xD834 0xDF1E   is not a letter (iswalpha should be false)
>   U+1D320 = 0xD834 0xDF20   is not a letter (iswalpha should be false)
>   U+1D71E = 0xD835 0xDF1E   is a letter (iswalpha should be true)
>   U+1D720 = 0xD835 0xDF20   is a letter (iswalpha should be true)
> There is no way that a system can provide this information through a
> function 'iswalpha' that takes a single wchar_t argument.

iswalpha takes wint_t, not wchar_t.  Since sizeof (wint_t) is 4 bytes,
the function can return the correct value, provided that the application
converts the UTF-16 surrogate pair to UTF-32 before calling iswalpha.

> We agree that it is a bug. And it is caused by
>   - the fact that Cygwin's wchar_t[] encoding is UTF-16, and
>   - there is no way to define the <wctype.h> POSIX functions sanely in this
>     setting, and

See above.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

* Re: 16-bit wchar_t on Windows and Cygwin
  2011-02-02 12:15     ` Corinna Vinschen
@ 2011-02-02 12:21       ` Corinna Vinschen
  0 siblings, 0 replies; 21+ messages in thread
From: Corinna Vinschen @ 2011-02-02 12:21 UTC (permalink / raw)
  To: cygwin, bug-gnulib, bug-coreutils

On Feb  2 13:14, Corinna Vinschen wrote:
> On Feb  2 12:29, Bruno Haible wrote:
> > Hello Eric,
> > 
> > > ... POSIX requires that 1 wchar_t corresponds to 1 character
> > > ...
> > > > What consequences does this have?
> > > > 
> > > >   1) All code that uses the functions from <wctype.h> (wide character
> > > >      classification and mapping) or wcwidth() malfunctions on strings that
> > > >      contains Unicode characters outside the BMP, i.e. outside the range
> > > >      U+0000..U+FFFF.
> > > 
> > > Not necessarily.  Such code falls outside of POSIX, but it may still be
> > > a well-behaved extension if given sane behavior for how to deal with
> > > surrogates.
> > 
> > No. Code that uses <wctype.h> and wcwidth() is written precisely according
> > to POSIX. The problem is that this code cannot work correctly when wchar_t[]
> > is in UTF-16 encoding. There simply is no way to define these functions
> > in a reasonable way for surrogates.
> > 
> > For example:
> >   U+1031E = 0xD800 0xDF1E   is a letter (iswalpha should be true)
> >   U+10320 = 0xD800 0xDF20   is not a letter (iswalpha should be false)
> >   U+1D31E = 0xD834 0xDF1E   is not a letter (iswalpha should be false)
> >   U+1D320 = 0xD834 0xDF20   is not a letter (iswalpha should be false)
> >   U+1D71E = 0xD835 0xDF1E   is a letter (iswalpha should be true)
> >   U+1D720 = 0xD835 0xDF20   is a letter (iswalpha should be true)
> > There is no way that a system can provide this information through a
> > function 'iswalpha' that takes a single wchar_t argument.
> 
> iswalpha takes wint_t, not wchar_t.  Since sizeof (wint_t) is 4 byte,
> the function can return the correct value, provided that the application
> converts the UTF-16 surrogate to UTF-32 before calling iswalpha.

And, please note the wording in SUSv4, for instance in
http://calimero.vinschen.de/susv4/functions/iswalpha.html

  The wc argument is a wint_t, the value of which the application shall
                       ^^^^^^                         ^^^^^^^^^^^
  ensure is a wide-character code corresponding to a valid character in
  the current locale, or equal to the value of the macro WEOF. If the
  argument has any other value, the behavior is undefined.

I don't see any words in that which would disallow converting UTF-16
wchar_t surrogate pairs to a wint_t UTF-32 value before calling one of
the wctype functions.  Just like you have to be careful not to call
the ctype functions with a negative signed char value.
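
For reference, the ctype precaution alluded to here is usually written
as a cast to unsigned char.  A small sketch, with an invented function
name:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Count alphabetic bytes in s.  The cast to unsigned char keeps bytes
   >= 0x80 from reaching isalpha() as negative values, which would be
   undefined behavior on platforms where plain char is signed.  */
size_t
count_alpha_bytes (const char *s)
{
  size_t count = 0;

  assert (s != NULL);
  for (; *s != '\0'; s++)
    if (isalpha ((unsigned char) *s))
      count++;
  return count;
}
```

In both cases the caller, not the library, is responsible for bringing
the argument into the range the function is specified for.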


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

* Re: 16-bit wchar_t on Windows and Cygwin
  2011-02-02 11:29   ` Bruno Haible
  2011-02-02 12:15     ` Corinna Vinschen
@ 2011-02-02 16:03     ` Bruno Haible
  2011-02-02 16:28       ` Corinna Vinschen
  2011-02-02 17:52     ` bug#7948: " Paul Eggert
  2011-02-02 21:24     ` Eric Blake
  3 siblings, 1 reply; 21+ messages in thread
From: Bruno Haible @ 2011-02-02 16:03 UTC (permalink / raw)
  To: bug-gnulib, cygwin, bug-coreutils, Eric Blake

Hello Corinna,

> And, please note the wording in SUSv4, for instance in
> http://calimero.vinschen.de/susv4/functions/iswalpha.html

Likewise in POSIX:2008, at the URL
http://www.opengroup.org/onlinepubs/9699919799/functions/iswalpha.html

>   The wc argument is a wint_t, the value of which the application shall
>                        ^^^^^^                         ^^^^^^^^^^^
>   ensure is a wide-character code corresponding to a valid character in
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   the current locale, or equal to the value of the macro WEOF. If the
>   argument has any other value, the behavior is undefined.

Expressed as a formula, this sentence means that when an application
passes a 'wint_t x' to iswalpha(), it has to satisfy

   x == (wint_t) (wchar_t) x || x == WEOF

> iswalpha takes wint_t, not wchar_t.  Since sizeof (wint_t) is 4 byte,
> the function can return the correct value, provided that the application
> converts the UTF-16 surrogate to UTF-32 before calling iswalpha.

When an application does this, it passes an invalid wint_t value to
iswalpha(), according to the spec paragraph that you have just cited.
So the application uses an extension to POSIX functionality, not
POSIX itself.
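
To make the round-trip condition concrete, here is a small simulation,
with WEOF as the terminating value as in the spec text.  On a platform
with a 32-bit wchar_t every value passes the check, so the sketch uses
stand-in types for a 16-bit wchar_t and a 32-bit wint_t; all sim_ names
are invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t sim_wchar_t;           /* stand-in for a 16-bit wchar_t */
typedef uint32_t sim_wint_t;            /* stand-in for a 32-bit wint_t */
#define SIM_WEOF ((sim_wint_t) -1)      /* stand-in for WEOF */

static_assert (sizeof (sim_wchar_t) < sizeof (sim_wint_t),
               "the simulation needs wint_t wider than wchar_t");

/* Nonzero if x survives a round trip through the 16-bit type, or is
   the end-of-file marker.  */
int
sim_is_valid_wctype_arg (sim_wint_t x)
{
  return x == (sim_wint_t) (sim_wchar_t) x || x == SIM_WEOF;
}
```

Under this reading 0x61 and even the lone surrogate 0xD800 pass, while
the combined value 0x1031E does not.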

I see that Cygwin 1.7.x iswalpha() works in the way you describe (but
mingw's iswalpha() doesn't). So this means that gnulib's proposed
iswwalpha(wwchar_t) function could be implemented using iswalpha()
on Cygwin 1.7.x and will not cause the Unicode based tables to be
included in the executable. This is good and nice.

But if you say that the application should convert UTF-16 surrogates
to UTF-32 before calling iswalpha: That's certainly a requirement
for Cygwin 1.7.x applications that want to support the entire Unicode
character set. But it's outside of POSIX, and many GNU programs will
not want to include this added complexity. Just try to apply this
suggestion to gnulib's quotearg.c, then estimate the time someone
would need to apply it also to regcomp.c, strftime.c, mbscasestr.c,
coreutils/src/wc.c, and so on.

For this reason I propose the wwchar_t type with an API that is similar
to POSIX <wctype.h> but includes the surrogate handling, rather than
pushing it into each application's code.

Bruno
-- 
In memoriam Carl Friedrich Goerdeler <http://en.wikipedia.org/wiki/Carl_Friedrich_Goerdeler>

* Re: 16-bit wchar_t on Windows and Cygwin
  2011-02-02 16:03     ` Bruno Haible
@ 2011-02-02 16:28       ` Corinna Vinschen
  2011-02-02 16:35         ` Corinna Vinschen
  0 siblings, 1 reply; 21+ messages in thread
From: Corinna Vinschen @ 2011-02-02 16:28 UTC (permalink / raw)
  To: cygwin, bug-gnulib, bug-coreutils

Hi Bruno,

On Feb  2 17:02, Bruno Haible wrote:
> Hello Corinna,
> 
> > And, please note the wording in SUSv4, for instance in
> > http://calimero.vinschen.de/susv4/functions/iswalpha.html
> 
> Likewise in POSIX:2008, at the URL
> http://www.opengroup.org/onlinepubs/9699919799/functions/iswalpha.html

Oops, sorry for the wrong URL!  I'm using a local copy of SUSv4 for
speed, but forgot that entirely when copy/pasting it.

> >   The wc argument is a wint_t, the value of which the application shall
> >                        ^^^^^^                         ^^^^^^^^^^^
> >   ensure is a wide-character code corresponding to a valid character in
>               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >   the current locale, or equal to the value of the macro WEOF. If the
> >   argument has any other value, the behavior is undefined.
> 
> What this sentence means in formulas, is that when an application passes
> a 'wint_t x' to iswalpha(), it has to satisfy
> 
>    x == (wint_t) (wchar_t) x || x == WEOF

Sure, I agree.  But it doesn't say this *exactly*, so I took the liberty
of stretching the limits a bit so that there is *some* way for applications
to use the wctype functions despite using UTF-16 and despite having a
surrogate value.

> > iswalpha takes wint_t, not wchar_t.  Since sizeof (wint_t) is 4 byte,
> > the function can return the correct value, provided that the application
> > converts the UTF-16 surrogate to UTF-32 before calling iswalpha.
> 
> When an application does this, is passes an invalid wint_t value to
> iswalpha(), according to the spec paragraph that you have just cited.
> So the application uses an extension to POSIX functionality, not
> POSIX itself.

Well, given that the description doesn't explicitly talk about a value
given as wchar_t, but instead about a "wide-character code corresponding
to a valid character" I saw some room for interpretation...

> I see that Cygwin 1.7.x iswalpha() works in this way you describe (but
> mingw's iswalpha() doesn't). So this means that gnulib's proposed
> iswwalpha(wwchar_t) function could be implemented using iswalpha()
> on Cygwin 1.7.x and will not cause the Unicode based tables to be
> included in the executable. This is good and nice.

I'm glad you see it that way.

> But if you say that the application should convert UTF-16 surrogates
> to UTF-32 before calling iswalpha: That's certainly a requirement
> for Cygwin 1.7.x application that want to support the entire Unicode
> character set. But it's outside of POSIX, and many GNU programs will
> not want to include this added complexity. Just try to apply this
> suggestion to gnulib's quotearg.c, then estimate the time someone
> would need to apply it also to regcomp.c, strftime.c, mbscasestr.c,
> coreutils/src/wc.c, and so on.

Cygwin's regcomp is taken from FreeBSD and is UTF-16 capable, including
surrogate handling.  It only required two changes in the code.

But I see what you mean.  Another layer which abstracts this problem
looks like the right thing to do.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

* Re: 16-bit wchar_t on Windows and Cygwin
  2011-02-02 16:28       ` Corinna Vinschen
@ 2011-02-02 16:35         ` Corinna Vinschen
  2011-02-02 20:28           ` Andy Koppe
  2011-02-04 22:46           ` Warren Young
  0 siblings, 2 replies; 21+ messages in thread
From: Corinna Vinschen @ 2011-02-02 16:35 UTC (permalink / raw)
  To: cygwin, bug-gnulib, bug-coreutils

On Feb  2 17:28, Corinna Vinschen wrote:
> On Feb  2 17:02, Bruno Haible wrote:
> > But if you say that the application should convert UTF-16 surrogates
> > to UTF-32 before calling iswalpha: That's certainly a requirement
> > for Cygwin 1.7.x application that want to support the entire Unicode
> > character set. But it's outside of POSIX, and many GNU programs will
> > not want to include this added complexity. Just try to apply this
> > suggestion to gnulib's quotearg.c, then estimate the time someone
> > would need to apply it also to regcomp.c, strftime.c, mbscasestr.c,
> > coreutils/src/wc.c, and so on.
> 
> Cygwin's regcomp is taken from FreeBSD and is UTF-16 capable, including
> surrogate handling.  It only required two changes in the code.

Btw., I sure would be glad if Cygwin used a 4-byte wchar_t as
well.  The problem is that this requires too many changes at once to
work right, and it would introduce a lot of backward compatibility
problems which would have to be handled.
If only the ones who decided that wchar_t in Cygwin should have the
same size as WCHAR_T in the underlying Windows had thought twice
about the implications...


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

* Re: bug#7948: 16-bit wchar_t on Windows and Cygwin
  2011-02-02 11:29   ` Bruno Haible
  2011-02-02 12:15     ` Corinna Vinschen
  2011-02-02 16:03     ` Bruno Haible
@ 2011-02-02 17:52     ` Paul Eggert
  2011-02-02 18:57       ` Bruno Haible
  2011-02-03 12:57       ` Ulf Zibis
  2011-02-02 21:24     ` Eric Blake
  3 siblings, 2 replies; 21+ messages in thread
From: Paul Eggert @ 2011-02-02 17:52 UTC (permalink / raw)
  To: Bruno Haible; +Cc: Eric Blake, bug-gnulib, cygwin, bug-coreutils

On 02/02/11 03:29, Bruno Haible wrote:
>   - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t
>     on Windows platforms and to 'wchar_t' otherwise.

As a minor point, would it be OK to call this type
'xchar_t' instead?  'x' is the successor to 'w', after all,
and it can be thought of as an abbreviation for 'eXtended'.

A problem with the 'ww' prefix is that mentally I start thinking
"World Wide ..."

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: bug#7948: 16-bit wchar_t on Windows and Cygwin
  2011-02-02 17:52     ` bug#7948: " Paul Eggert
@ 2011-02-02 18:57       ` Bruno Haible
  2011-02-02 20:43         ` Andy Koppe
  2011-02-03 12:57       ` Ulf Zibis
  1 sibling, 1 reply; 21+ messages in thread
From: Bruno Haible @ 2011-02-02 18:57 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Eric Blake, bug-gnulib, cygwin, bug-coreutils

Hi Paul,

> >   - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t
> >     on Windows platforms and to 'wchar_t' otherwise.
> 
> As a minor point, would it be OK to call this type
> 'xchar_t' instead?  'x' is the successor to 'w', after all,
> and it can be thought of as an abbreviation for 'eXtended'.

'wwchar_t' means "wide wide character".

In fact it's not really an "extended" character or "complex character".
It's just what POSIX calls a 'wchar_t'.

I like the analogy between strtol and strtoll. In the beginning, people
thought a 'long int' would be enough for everything. Then they discovered
a 'long long int' is needed. The same story repeats itself here with
the "wide characters" which turn out to be not wide enough, and
"wide wide characters" are needed.

> A problem with the 'ww' prefix is that mentally I start thinking
> "World Wide ..."

Indeed this meaning can come to mind, but I think it's not dangerous
since the term "world wide" has no meaning in a programming language.

Bruno
-- 
In memoriam Carl Friedrich Goerdeler <http://en.wikipedia.org/wiki/Carl_Friedrich_Goerdeler>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 16-bit wchar_t on Windows and Cygwin
  2011-02-02 16:35         ` Corinna Vinschen
@ 2011-02-02 20:28           ` Andy Koppe
  2011-02-04 22:46           ` Warren Young
  1 sibling, 0 replies; 21+ messages in thread
From: Andy Koppe @ 2011-02-02 20:28 UTC (permalink / raw)
  To: cygwin, bug-gnulib, bug-coreutils

On 2 February 2011 16:35, Corinna Vinschen wrote:
> On Feb  2 17:28, Corinna Vinschen wrote:
>> On Feb  2 17:02, Bruno Haible wrote:
>> > But if you say that the application should convert UTF-16 surrogates
>> > to UTF-32 before calling iswalpha: That's certainly a requirement
>> > for Cygwin 1.7.x applications that want to support the entire Unicode
>> > character set. But it's outside of POSIX, and many GNU programs will
>> > not want to include this added complexity. Just try to apply this
>> > suggestion to gnulib's quotearg.c, then estimate the time someone
>> > would need to apply it also to regcomp.c, strftime.c, mbscasestr.c,
>> > coreutils/src/wc.c, and so on.
>>
>> Cygwin's regcomp is taken from FreeBSD and is UTF-16 capable, including
>> surrogate handling.  It only required two changes in the code.
>
> Btw., I would sure be glad if Cygwin used a 4-byte wchar_t as
> well.  The problem is that this requires too many changes at once to
> work right, and it would introduce a lot of backward compatibility
> problems which would have to be handled.

Cygwin 1.7 might have been a good point for that change, because the
lack of proper locale and charset support in previous versions meant
that backward compatibility was much less of a concern than it is now.
But it's a difficult change indeed, and it's not entirely clear that
it's worthwhile. I guess 64-bit Cygwin (if or when it happens) might
be the next opportunity.

> If only the ones who decided that wchar_t in Cygwin should have the
> same size as WCHAR in the underlying Windows had thought twice
> about the implications...

Windows Unicode support was introduced with Windows NT in 1993,
whereas Unicode was only extended beyond 16 bits with version 2.0 in
1996. Cygwin was first released the year before. If the Unicode
extension was a consideration at all (which I'd doubt), wchar_t !=
WCHAR probably seemed far more daunting than having to deal with
surrogates at some point down the line.

Andy

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: bug#7948: 16-bit wchar_t on Windows and Cygwin
  2011-02-02 18:57       ` Bruno Haible
@ 2011-02-02 20:43         ` Andy Koppe
  0 siblings, 0 replies; 21+ messages in thread
From: Andy Koppe @ 2011-02-02 20:43 UTC (permalink / raw)
  To: cygwin; +Cc: Paul Eggert, Eric Blake, bug-gnulib, bug-coreutils

On 2 February 2011 18:57, Bruno Haible wrote:
> Hi Paul,
>
>> >   - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t
>> >     on Windows platforms and to 'wchar_t' otherwise.
>>
>> As a minor point, would it be OK to call this type
>> 'xchar_t' instead?  'x' is the successor to 'w', after all,
>> and it can be thought of as an abbreviation for 'eXtended'.
>
> 'wwchar_t' means "wide wide character".
>
> In fact it's not really an "extended" character or "complex character".
> It's just what POSIX calls a 'wchar_t'.

It's extended in the sense that the original Unicode was only 16 bits
wide (which of course is why wchar_t on Windows is 16 bits). Also, I
think 'xchar_t' is less prone to typos, in particular forgetting one
of the dubyas.

Andy

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 16-bit wchar_t on Windows and Cygwin
  2011-02-02 11:29   ` Bruno Haible
                       ` (2 preceding siblings ...)
  2011-02-02 17:52     ` bug#7948: " Paul Eggert
@ 2011-02-02 21:24     ` Eric Blake
  2011-02-02 21:39       ` Corinna Vinschen
  2011-02-02 23:03       ` Bruno Haible
  3 siblings, 2 replies; 21+ messages in thread
From: Eric Blake @ 2011-02-02 21:24 UTC (permalink / raw)
  To: cygwin, bug-gnulib

[-- Attachment #1: Type: text/plain, Size: 1685 bytes --]

[dropping coreutils at this point]

On 02/02/2011 04:29 AM, Bruno Haible wrote:
> Good point. I agree then that overriding wchar_t should better not be
> done.
> 
> Here's a new proposal:
>   - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t
>     on Windows platforms and to 'wchar_t' otherwise.
>   - Define functions 'mbrtowwc', 'iswwalpha', 'wwcwidth', and similar.
>     Their definition will be a trivial redirection to 'mbrtowc', 'iswalpha',
>     'wcwidth' on most platforms, and a use of libunistring modules on
>     Windows platforms.

I like the idea of making a new type wrapper.

Are you thinking of a wrapper that maps either to a 4-byte wchar_t or
to a 2-byte wchar_t that sanely handles UTF-16 (which makes it a thin
wrapper on both Linux and Cygwin, but needs more work on mingw), or are
you thinking of a type that is always 4 bytes (which needs a lot more
memory manipulation on cygwin to convert between 2- and 4-byte
representations when using cygwin's functions, or else a from-scratch
reimplementation that bypasses cygwin completely)?

As to the name: I agree with the opinion of others that xchar_t is easier
to type and less prone to the typo of a missing 'w' than wwchar_t.  On the
other hand, I can see wwprintf that takes wide-wchar_t values, but
gnulib already has xprintf as a counterpart to xmalloc (which calls
exit() if the printf fails for memory allocation or other non-I/O
related reasons), so we can't blindly use 'x' instead of 'ww' when
replacing existing 'w' in POSIX APIs.

-- 
Eric Blake   eblake@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 619 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 16-bit wchar_t on Windows and Cygwin
  2011-02-02 21:24     ` Eric Blake
@ 2011-02-02 21:39       ` Corinna Vinschen
  2011-02-02 23:03       ` Bruno Haible
  1 sibling, 0 replies; 21+ messages in thread
From: Corinna Vinschen @ 2011-02-02 21:39 UTC (permalink / raw)
  To: cygwin, bug-gnulib

On Feb  2 14:24, Eric Blake wrote:
> [dropping coreutils at this point]
> 
> On 02/02/2011 04:29 AM, Bruno Haible wrote:
> > Good point. I agree then that overriding wchar_t should better not be
> > done.
> > 
> > Here's a new proposal:
> >   - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t
> >     on Windows platforms and to 'wchar_t' otherwise.
> >   - Define functions 'mbrtowwc', 'iswwalpha', 'wwcwidth', and similar.
> >     Their definition will be a trivial redirection to 'mbrtowc', 'iswalpha',
> >     'wcwidth' on most platforms, and a use of libunistring modules on
> >     Windows platforms.
> 
> I like the idea of making a new type wrapper.
> 
> Are you thinking of making a sane wrapping around either 4-byte wchar_t
> or which maps to 2-byte wchar_t but sanely handles UTF-16 (which makes
> it a thin wrapper on both Linux and Cygwin, but needing more work on
> mingw), or are you thinking that it is always a 4-byte type (needing
> lots more memory manipulation on cygwin to convert between 2- and 4-byte
> representations when using cygwin's functions, or else reimplementing
> everything from scratch by completely bypassing cygwin)?
> 
> As to the name: I agree with the opinion of others that xchar_t is easier
> to type and less prone to the typo of a missing 'w' than wwchar_t.  On the
> other hand, I can see wwprintf that takes wide-wchar_t values, but
> gnulib already has xprintf as a counterpart to xmalloc (which calls
> exit() if the printf fails for memory allocation or other non-I/O
> related reasons), so we can't blindly use 'x' instead of 'ww' when
> replacing existing 'w' in POSIX APIs.

May I suggest a compromise?  What about "xwchar_t"?  It avoids the
potential typo by accidentally dropping the second w.  It still contains
"wchar" which implies that it's a *wide* char type.  And the x could be
read as "extended".


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 16-bit wchar_t on Windows and Cygwin
  2011-02-02 21:24     ` Eric Blake
  2011-02-02 21:39       ` Corinna Vinschen
@ 2011-02-02 23:03       ` Bruno Haible
  2011-02-02 23:19         ` Eric Blake
  1 sibling, 1 reply; 21+ messages in thread
From: Bruno Haible @ 2011-02-02 23:03 UTC (permalink / raw)
  To: bug-gnulib; +Cc: Eric Blake, cygwin

Hello Eric,

> > Here's a new proposal:
> >   - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t
> >     on Windows platforms and to 'wchar_t' otherwise.
> >   - Define functions 'mbrtowwc', 'iswwalpha', 'wwcwidth', and similar.
> >     Their definition will be a trivial redirection to 'mbrtowc', 'iswalpha',
> >     'wcwidth' on most platforms, and a use of libunistring modules on
> >     Windows platforms.
> ...
> Are you thinking of making a sane wrapping around either 4-byte wchar_t
> or which maps to 2-byte wchar_t but sanely handles UTF-16 (which makes
> it a thin wrapper on both Linux and Cygwin, but needing more work on
> mingw), or are you thinking that it is always a 4-byte type (needing
> lots more memory manipulation on cygwin to convert between 2- and 4-byte
> representations when using cygwin's functions, or else reimplementing
> everything from scratch by completely bypassing cygwin)?

I'm not sure I understand your question. The plan is that

  - On platforms with a 32-bit wchar_t, like glibc, *BSD, and many others,
    'wwchar_t' is identical to 'wchar_t', and the function wrappers are
    simple redirections.

  - On Cygwin and mingw, wwchar_t is 'uint32_t' (so as to accommodate
    all Unicode characters and WEOF and so that it plays well with 'wint_t').
    mbrtowwc is implemented by 1 or 2 calls to mbrtowc. mbsrtowwcs may be
    implemented by a call to mbsrtowcs and an additional conversion loop,
    or it might be implemented on top of mbrtowwc; that's merely a speed
    vs. memory trade-off.
    The plan is not to "completely bypass cygwin", but to use as many
    of Cygwin's built-ins as make sense.

  - On platforms with a 16-bit wchar_t but where the wchar_t[] encoding
    in Unicode locales is merely UCS-2, like AIX, use the no-op thin
    wrappers as well. If the platform does not support more than the BMP,
    it does not make much sense for GNU programs to try to work around that.
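
For illustration, the combining step that mbrtowwc would need on a
platform with a 16-bit wchar_t can be sketched as below. This is only a
sketch under assumptions, not the gnulib code: the helper name
utf16_decode is hypothetical, and it works on plain uint16_t units
instead of the results of real mbrtowc calls, so that it does not depend
on the locale.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of gnulib or Cygwin.
   Decode one Unicode character from the UTF-16 units s[0..n-1] into *pwc.
   Return the number of units consumed (1 or 2), or 0 if n is 0 or the
   input starts with an unpaired surrogate.  */
static size_t
utf16_decode (const uint16_t *s, size_t n, uint32_t *pwc)
{
  if (n == 0)
    return 0;
  if (s[0] < 0xD800 || s[0] > 0xDFFF)
    {
      /* A BMP character: one unit is one character.  */
      *pwc = s[0];
      return 1;
    }
  if (s[0] <= 0xDBFF && n >= 2 && s[1] >= 0xDC00 && s[1] <= 0xDFFF)
    {
      /* A high surrogate followed by a low surrogate: combine both.  */
      *pwc = 0x10000 + (((uint32_t) (s[0] - 0xD800) << 10)
                        | (uint32_t) (s[1] - 0xDC00));
      return 2;
    }
  /* Unpaired or truncated surrogate.  */
  return 0;
}
```

The "1 or 2 calls" correspond to the two return values 1 and 2 here: the
second unit is only looked at when the first one is a high surrogate.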

> As to the name: I agree with the opinion of others that xchar_t is easier
> to type and less prone to the typo of a missing 'w' than wwchar_t.

If a developer makes a typo here, he's likely to get a gcc warning or
a link error. But yes, it's possible to pass a 'wwchar_t' to
iswalpha(), which will yield wrong results. I don't think this risk
can be much reduced through a different name.

> gnulib already has xprintf as a counterpart to xmalloc (which calls
> exit() if the printf fails for memory allocation or other non-I/O
> related reasons), so we can't blindly use 'x'

Good point. The 'x' prefix already has several meanings in gnulib:
  - checking against memory allocation failure,
  - checking against errors,
  - no size limitation,
  - a more convenient interface,
  - a wrapper that prints an error message.
It doesn't seem wise to add another meaning to it.

Thanks for the feedback.

-- 
In memoriam Carl Friedrich Goerdeler <http://en.wikipedia.org/wiki/Carl_Friedrich_Goerdeler>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 16-bit wchar_t on Windows and Cygwin
  2011-02-02 23:03       ` Bruno Haible
@ 2011-02-02 23:19         ` Eric Blake
  2011-02-03  0:13           ` Bruno Haible
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Blake @ 2011-02-02 23:19 UTC (permalink / raw)
  To: Bruno Haible; +Cc: bug-gnulib, cygwin

[-- Attachment #1: Type: text/plain, Size: 2854 bytes --]

On 02/02/2011 04:03 PM, Bruno Haible wrote:
>> Are you thinking of making a sane wrapping around either 4-byte wchar_t
>> or which maps to 2-byte wchar_t but sanely handles UTF-16 (which makes
>> it a thin wrapper on both Linux and Cygwin, but needing more work on
>> mingw), or are you thinking that it is always a 4-byte type (needing
>> lots more memory manipulation on cygwin to convert between 2- and 4-byte
>> representations when using cygwin's functions, or else reimplementing
>> everything from scratch by completely bypassing cygwin)?
> 
> I'm not sure I understand your question. The plan is that
> 
>   - On platforms with a 32-bit wchar_t, like glibc, *BSD, and many others,
>     'wwchar_t' is identical to 'wchar_t', and the function wrappers are
>     simple redirections.
> 
>   - On Cygwin and mingw, wwchar_t is 'uint32_t' (so as to accommodate
>     all Unicode characters and WEOF and so that it plays well with 'wint_t').
>     mbrtowwc is implemented by 1 or 2 calls to mbrtowc. mbsrtowwcs may be
>     implemented by a call to mbsrtowcs and an additional conversion loop,
>     or it might be implemented on top of mbrtowwc; that's merely a speed
>     vs. memory trade-off.
>     The plan is not to "completely bypassing cygwin", but to use as much
>     of Cygwin's built-ins as makes sense.

You answered my question in spite of myself.  I was asking:

Should wwchar_t (or xwchar_t, but not xchar_t) be 2 bytes on cygwin, so
that, unlike the POSIX definition of wchar_t as always 1 character per
unit, the new type is explicitly documented as being multi-unit on some
platforms but with sane semantics?

Or should it always be 4 bytes, where conversion from wchar_t to
wwchar_t requires some effort, and where the new type must be used
everywhere (which means wrapping a lot of APIs), but where you can once
again assume POSIX semantics of 1 character per unit, simplifying life
for callers at the expense of converting to the new type?

And on asking the question in those more detailed words, I agree with
your conclusion - on cygwin, wwchar_t should be 4 bytes.
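
To make the difference between the two alternatives concrete: with a
2-byte, multi-unit type the number of array elements is no longer the
number of characters, so every caller needs logic like the following.
This is just a sketch; the function name utf16_count_chars is made up
for this mail, and it assumes well-formed UTF-16 input.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper.  Count the characters in the UTF-16 units
   s[0..n-1].  Assumes well-formed input: every unit that is not a low
   surrogate starts a new character.  */
static size_t
utf16_count_chars (const uint16_t *s, size_t n)
{
  size_t count = 0;
  for (size_t i = 0; i < n; i++)
    if (s[i] < 0xDC00 || s[i] > 0xDFFF)
      count++;
  return count;
}
```

With a 4-byte type the same question is answered by the array length
alone, which is the "1 character per unit" simplification for callers.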

> 
>   - On platforms with a 16-bit wchar_t but where the wchar_t[] encoding
>     in Unicode locales is merely UCS-2, like AIX, use the no-op thin
>     wrappers as well. If the platform does not support more than the BMP,
>     it makes not much sense for GNU programs to try to work around that.

Agreed.

Next question/thought.  Gnulib should definitely tackle this first.  But
if it works out, should we also add wwchar_t natively into cygwin?  It
would certainly be easier to add new interfaces incrementally, in
preparation for a possible future ABI conversion to make wchar_t become
4 bytes.

-- 
Eric Blake   eblake@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 619 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 16-bit wchar_t on Windows and Cygwin
  2011-02-02 23:19         ` Eric Blake
@ 2011-02-03  0:13           ` Bruno Haible
  2011-02-03  9:42             ` Corinna Vinschen
  0 siblings, 1 reply; 21+ messages in thread
From: Bruno Haible @ 2011-02-03  0:13 UTC (permalink / raw)
  To: Eric Blake; +Cc: bug-gnulib, cygwin

Hi Eric,

> I was asking:
> 
> should wwchar_t (or xwchar_t, but not xchar_t) be 2-bytes on cygwin, but
> unlike the POSIX definition of wchar_t being always 1 character per
> unit, the new type is explicitly documented as being multi-unit on some
> platforms but with sane semantics
> 
> or should it always be 4-bytes, where conversion from wchar_t to
> wwchar_t requires some efforts, and where the new type must be used
> everywhere (which means wrapping a lot of APIs), but where you can once
> again assume POSIX semantics of 1 character per unit, simplifying life
> of callers at the expense of converting to the new type

In the first case we wouldn't need a new type.

The plan is the second alternative. The goal is *not* to have to extend
each of quotearg.c, regcomp.c, mbchar.h, wc.c, etc. to handle UTF-16
explicitly with #ifdefs, more variables, and more logic.

> if it works out, should we also add wwchar_t natively into cygwin? 

More and more Unix platforms offer only UTF-8 locales. One can predict
that in 10 years, all Unix platforms will offer only UTF-8 locales. At that
point wchar_t will be UCS-4 on all these platforms (except AIX).

The mbrtoc32 function from the C1X API that you pointed to will then be
equivalent to mbrtowwc.

So, you can view 'wwchar_t' as a temporary measure that will bridge the
gap between the ANSI C Amd. 1 API and the C1X API.
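
As a rough sketch of what code written against that API looks like,
assuming a libc that already ships <uchar.h> (not all do): the helper
name decode_to_c32 is hypothetical, only mbrtoc32 itself comes from
the standard.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <uchar.h>

/* Hypothetical helper.  Decode the multibyte string s, in the encoding
   of the current locale, into at most max char32_t values.  Return the
   number of values stored, or (size_t) -1 on invalid or incomplete
   input.  */
static size_t
decode_to_c32 (const char *s, char32_t *out, size_t max)
{
  mbstate_t state;
  size_t len = strlen (s);
  size_t n = 0;

  memset (&state, 0, sizeof state);
  while (len > 0 && n < max)
    {
      size_t r = mbrtoc32 (&out[n], s, len, &state);
      if (r == (size_t) -1 || r == (size_t) -2)
        return (size_t) -1;
      if (r == (size_t) -3)
        {
          /* A further value was stored without consuming input.  */
          n++;
          continue;
        }
      if (r == 0)
        break;
      s += r;
      len -= r;
      n++;
    }
  return n;
}
```

Each stored value is one character, whatever the width of wchar_t on
the platform.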

Bruno
-- 
In memoriam Carl Friedrich Goerdeler <http://en.wikipedia.org/wiki/Carl_Friedrich_Goerdeler>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 16-bit wchar_t on Windows and Cygwin
  2011-02-03  0:13           ` Bruno Haible
@ 2011-02-03  9:42             ` Corinna Vinschen
  2011-02-03 10:48               ` Bruno Haible
  0 siblings, 1 reply; 21+ messages in thread
From: Corinna Vinschen @ 2011-02-03  9:42 UTC (permalink / raw)
  To: cygwin, bug-gnulib

On Feb  3 01:12, Bruno Haible wrote:
> Hi Eric,
> 
> > I was asking:
> > 
> > should wwchar_t (or xwchar_t, but not xchar_t) be 2-bytes on cygwin, but
> > unlike the POSIX definition of wchar_t being always 1 character per
> > unit, the new type is explicitly documented as being multi-unit on some
> > platforms but with sane semantics
> > 
> > or should it always be 4-bytes, where conversion from wchar_t to
> > wwchar_t requires some efforts, and where the new type must be used
> > everywhere (which means wrapping a lot of APIs), but where you can once
> > again assume POSIX semantics of 1 character per unit, simplifying life
> > of callers at the expense of converting to the new type
> 
> In the first case we wouldn't need a new type.
> 
> The plan is the second alternative. The goal is *not* to have to extend
> each of quotearg.c, regcomp.c, mbchar.h, wc.c, etc. to handle UTF-16
> explicitly with #ifdefs, more variables, and more logic.
> 
> > if it works out, should we also add wwchar_t natively into cygwin? 
> 
> More and more Unix platforms offer only UTF-8 locales. One can predict
> that in 10 years, all Unix platforms will offer only UTF-8 locales. At this
> point wchar_t will be UCS-4 on all these platforms (except AIX).
> 
> The mbrtoc32 function from the C1X API that you pointed to will then be
> equivalent to mbrtowwc.
> 
> So, you can view 'wwchar_t' as a temporary measure that will bridge the
> gap between the ANSI C Amd. 1 API and the C1X API.

Maybe I'm just dense, but isn't wwchar_t equivalent to wint_t on all
platforms?  On UCS-4 platforms sizeof(wint_t) == sizeof(wchar_t) == 4
because there's no reason to make it bigger.  On UCS-2 and UTF-16
platforms sizeof(wint_t) == 4 because it must be able to hold WEOF as
well.  So, why not just use the wint_t type for the time being?


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 16-bit wchar_t on Windows and Cygwin
  2011-02-03  9:42             ` Corinna Vinschen
@ 2011-02-03 10:48               ` Bruno Haible
  0 siblings, 0 replies; 21+ messages in thread
From: Bruno Haible @ 2011-02-03 10:48 UTC (permalink / raw)
  To: bug-gnulib, cygwin

Corinna Vinschen wrote:
> isn't wwchar_t equivalent to wint_t on all
> platforms?  On UCS-4 platforms sizeof(wint_t) == sizeof(wchar_t) == 4
> because there's no reason to make it bigger.  On UCS-2 and UTF-16
> platforms sizeof(wint_t) == 4 because it must be able to hold WEOF as
> well.  So, why not just use the wint_t type for the time being?

The "must be able to hold WEOF as well" argument holds for the argument
type of iswwalpha.  If we were to call it 'wwint_t', it would be the same as
'wint_t', yes. For this reason, we don't need a separate type 'wwint_t'.

But 'wwchar_t' is the base type for wide wide character _arrays_.
Such arrays don't need to hold the WEOF value. On AIX platforms, where
wchar_t[] is the UCS-2 encoding, wwchar_t[] can be synonymous to it.
There is no need to make wwchar_t 32 bits wide on these platforms.

So, my current code looks like this:

# if (defined _WIN32 || defined __WIN32__) || defined __CYGWIN__
/* Define 'wwchar_t' as a type that
     - can hold 32 bits, unlike wchar_t which can hold only 16 bits,
     - promotes to 'wint_t' under the default argument promotions.  */
typedef wint_t wwchar_t; /* actually 'unsigned int' or 'uint32_t' */
# else
typedef wchar_t wwchar_t;
# endif
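
With such a typedef, the "trivial redirection" on platforms with a
32-bit wchar_t could look roughly like the following. The name
iswwalpha_sketch is made up here, and the Windows/Cygwin branch is only
a placeholder of mine: the plan is to use libunistring there, which this
sketch does not attempt.

```c
#include <wchar.h>
#include <wctype.h>

#if defined _WIN32 || defined __WIN32__ || defined __CYGWIN__
typedef wint_t wwchar_t;
#else
typedef wchar_t wwchar_t;
#endif

/* Hypothetical sketch of an iswwalpha-like wrapper.  The argument type
   is wint_t so that WEOF can be passed as well.  */
static int
iswwalpha_sketch (wint_t wc)
{
#if defined _WIN32 || defined __WIN32__ || defined __CYGWIN__
  /* Placeholder only: forward BMP values, reject everything else.
     A real implementation would classify characters outside the BMP
     with libunistring instead.  */
  return wc <= 0xFFFF ? iswalpha (wc) : 0;
#else
  /* wchar_t is already wide enough: plain redirection.  */
  return iswalpha (wc);
#endif
}
```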

Bruno
-- 
In memoriam Buddy Holly <http://en.wikipedia.org/wiki/Buddy_Holly>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: bug#7948: 16-bit wchar_t on Windows and Cygwin
  2011-02-02 17:52     ` bug#7948: " Paul Eggert
  2011-02-02 18:57       ` Bruno Haible
@ 2011-02-03 12:57       ` Ulf Zibis
  1 sibling, 0 replies; 21+ messages in thread
From: Ulf Zibis @ 2011-02-03 12:57 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Bruno Haible, bug-coreutils, cygwin, bug-gnulib, Eric Blake

Hi,

I think a somewhat similar bug is currently under discussion at GNU:
bug#7960: [PATCH] fmt: fix formatting multibyte text (bug #7372)

-Ulf


On 02.02.2011 18:51, Paul Eggert wrote:
> On 02/02/11 03:29, Bruno Haible wrote:
>>    - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t
>>      on Windows platforms and to 'wchar_t' otherwise.
> As a minor point, would it be OK to call this type
> 'xchar_t' instead?  'x' is the successor to 'w', after all,
> and it can be thought of as an abbreviation for 'eXtended'.
>
> A problem with the 'ww' prefix is that mentally I start thinking
> "World Wide ..."
>
>
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 16-bit wchar_t on Windows and Cygwin
  2011-02-02 16:35         ` Corinna Vinschen
  2011-02-02 20:28           ` Andy Koppe
@ 2011-02-04 22:46           ` Warren Young
  1 sibling, 0 replies; 21+ messages in thread
From: Warren Young @ 2011-02-04 22:46 UTC (permalink / raw)
  To: cygwin, bug-gnulib, bug-coreutils

On 2/2/2011 9:35 AM, Corinna Vinschen wrote:
>
> If only the ones who decided that wchar_t in Cygwin should have the
> same size as WCHAR in the underlying Windows had thought twice
> about the implications...

Cygwin 1.9?

Or maybe 2.0, if it breaks ABIs?

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2011-02-04 22:46 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <201101310304.42975.bruno@clisp.org>
2011-01-31 19:16 ` 16-bit wchar_t on Windows and Cygwin Eric Blake
2011-01-31 20:49   ` Corinna Vinschen
2011-02-02 11:29   ` Bruno Haible
2011-02-02 12:15     ` Corinna Vinschen
2011-02-02 12:21       ` Corinna Vinschen
2011-02-02 16:03     ` Bruno Haible
2011-02-02 16:28       ` Corinna Vinschen
2011-02-02 16:35         ` Corinna Vinschen
2011-02-02 20:28           ` Andy Koppe
2011-02-04 22:46           ` Warren Young
2011-02-02 17:52     ` bug#7948: " Paul Eggert
2011-02-02 18:57       ` Bruno Haible
2011-02-02 20:43         ` Andy Koppe
2011-02-03 12:57       ` Ulf Zibis
2011-02-02 21:24     ` Eric Blake
2011-02-02 21:39       ` Corinna Vinschen
2011-02-02 23:03       ` Bruno Haible
2011-02-02 23:19         ` Eric Blake
2011-02-03  0:13           ` Bruno Haible
2011-02-03  9:42             ` Corinna Vinschen
2011-02-03 10:48               ` Bruno Haible

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).