From mboxrd@z Thu Jan  1 00:00:00 1970
Received: (qmail 12701 invoked by alias); 29 Jan 2011 18:12:48 -0000
Received: (qmail 12644 invoked by uid 22791); 29 Jan 2011 18:12:39 -0000
X-Spam-Check-By: sourceware.org
Received: from aquarius.hirmke.de (HELO calimero.vinschen.de) (217.91.18.234)
    by sourceware.org (qpsmtpd/0.83/v0.83-20-g38e4449) with ESMTP; Sat, 29 Jan 2011 18:12:34 +0000
Received: by calimero.vinschen.de (Postfix, from userid 500)
    id CEE952CA2D6; Sat, 29 Jan 2011 19:12:31 +0100 (CET)
Date: Sun, 30 Jan 2011 11:34:00 -0000
From: Corinna Vinschen
To: cygwin@cygwin.com, bug-gnu-libiconv@gnu.org
Subject: Re: Bug in libiconv?
Message-ID: <20110129181231.GC1057@calimero.vinschen.de>
Reply-To: cygwin@cygwin.com, bug-gnu-libiconv@gnu.org
Mail-Followup-To: cygwin@cygwin.com, bug-gnu-libiconv@gnu.org
References: <201101282312.50298.bruno@clisp.org> <20110129123014.GA8671@calimero.vinschen.de> <4D442DDA.4050807@redhat.com> <20110129160157.GA1057@calimero.vinschen.de> <4D444CAC.2010300@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <4D444CAC.2010300@redhat.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Mailing-List: contact cygwin-help@cygwin.com; run by ezmlm
Precedence: bulk
Sender: cygwin-owner@cygwin.com
Mail-Followup-To: cygwin@cygwin.com
X-SW-Source: 2011-01/txt/msg00423.txt.bz2

On Jan 29 10:21, Eric Blake wrote:
> On 01/29/2011 09:01 AM, Corinna Vinschen wrote:
> >> So, using UTF-16 surrogate encodings for characters outside the basic
> >> plane violates POSIX, but it's the best we can do for those characters.
> >
> > Right, and we discussed this already on this list.  Or the developer
> > list, I don't remember.  Maybe we should have stuck to the base plane
> > and only use UCS-2 to be more POSIX compatible.
>
> The burden is on the application, not on cygwin.
> If the application wants POSIX behavior, then they obey
> __STDC_ISO_10646__ and use ONLY characters from the basic plane (no
> surrogates), at which point their use of wchar_t fits the POSIX
> definition (one wchar_t per character).
> The moment they pass a surrogate, they are no longer honoring the
> restriction documented by __STDC_ISO_10646__ so they are no longer under
> the rules of POSIX, and then cygwin can do whatever it wants (and in

Erm... hang on.  __STDC_ISO_10646__ and the POSIX requirement are two
different beasts.  I still think that __STDC_ISO_10646__ does not
restrict a 2-byte wchar_t to UCS-2.  Per the definition, UTF-16 is a
valid coded representation of characters from ISO/IEC 10646.

So, to put it in your words, the moment applications pass a surrogate,
they are no longer under the rules of POSIX, but they still honor the
restriction documented by __STDC_ISO_10646__.

However, *usually* an application shouldn't really notice that a
surrogate has been used, at least as long as it only manipulates
entire strings.

> this case, QoI demands that we honor surrogates to the best of our
> ability for full UTF-16 support, and you can have multi-wchar_t
> characters just as you already have multi-byte UTF-8 char characters).
> In other words, cygwin IS being POSIX-compliant by advertising only the
> Unicode 4.0 character set in the __STDC_ISO_10646__, while still
> supporting Unicode 5.2 (should we upgrade to Unicode 6.0?) as an
> extension when you no longer care about POSIX.
>
> > However, the POSIX definition doesn't contradict what I said about the
> > definition of __STDC_ISO_10646__ as far as I'm concerned.
>
> Yep - I think we're in violent agreement :)

Hmm, I'm not quite sure, see above.
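
To make the "multi-wchar_t character" point concrete, here is a rough
sketch in C.  It is only an illustration, not code from Cygwin or
libiconv.  It assumes uint16_t as a stand-in for a 2-byte wchar_t, and
the helper names (utf16_encode, utf16_count_chars, iso10646_version)
are made up for this example.  It shows that a character outside the
basic plane takes two 16-bit units, so counting units and counting
characters give different answers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16.
   uint16_t stands in for a 2-byte wchar_t (an assumption for this sketch).
   Returns the number of 16-bit units written (1 or 2), or 0 if the
   value is not a valid Unicode scalar value. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* a lone surrogate is not a character */
    if (cp > 0x10FFFF)
        return 0;                       /* beyond the Unicode code space */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* basic plane: one unit per character */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain, split 10/10 */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return 2;
}

/* Hypothetical helper: count characters in n UTF-16 units by skipping
   low surrogates.  Assumes the input is well-formed UTF-16. */
static size_t utf16_count_chars(const uint16_t *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (s[i] < 0xDC00 || s[i] > 0xDFFF)
            count++;
    return count;
}

/* Returns the value of __STDC_ISO_10646__ (a yyyymmL date) if the
   implementation defines it, otherwise 0. */
static long iso10646_version(void)
{
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0;
#endif
}
```

With these helpers, U+20AC encodes to a single unit, while U+1F600
encodes to the pair 0xD83D 0xDE00.  Code that only copies or compares
entire strings never has to look inside that pair, which is why such
an application usually doesn't notice the surrogate.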

Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple