From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 02 Feb 2011 16:28:00 -0000
From: Corinna Vinschen
To: cygwin@cygwin.com, bug-gnulib@gnu.org, bug-coreutils@gnu.org
Subject: Re: 16-bit wchar_t on Windows and Cygwin
Message-ID: <20110202162801.GH2675@calimero.vinschen.de>
Reply-To: cygwin@cygwin.com, bug-gnulib@gnu.org, bug-coreutils@gnu.org
References: <20110202122102.GD2675@calimero.vinschen.de> <201102021229.04623.bruno@clisp.org> <201102021702.57387.bruno@clisp.org>
In-Reply-To: <201102021702.57387.bruno@clisp.org>
X-SW-Source: 2011-02/txt/msg00045.txt.bz2

Hi Bruno,

On Feb  2 17:02, Bruno Haible wrote:
> Hello Corinna,
>
> > And, please note the wording in SUSv4, for instance in
> > http://calimero.vinschen.de/susv4/functions/iswalpha.html
>
> Likewise in POSIX:2008, at the URL
> http://www.opengroup.org/onlinepubs/9699919799/functions/iswalpha.html

Oops, sorry for the wrong URL!  I'm using a local copy of SUSv4 for
speed, but forgot that entirely when copy/pasting it.

> > The wc argument is a wint_t, the value of which the application shall
> >                      ^^^^^^                         ^^^^^^^^^^^
> > ensure is a wide-character code corresponding to a valid character in
>             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > the current locale, or equal to the value of the macro WEOF. If the
> > argument has any other value, the behavior is undefined.
>
> What this sentence means in formulas, is that when an application passes
> a 'wint_t x' to iswalpha(), it has to satisfy
>
>    x == (wint_t) (wchar_t) x || x == WEOF

Sure, I agree.  But it doesn't say this *exactly*, so I took the liberty
to stretch the limits a bit so that there is *some* way for applications
to use the wctype functions despite using UTF-16 and despite having to
deal with surrogate values.

> > iswalpha takes wint_t, not wchar_t.  Since sizeof (wint_t) is 4 bytes,
> > the function can return the correct value, provided that the application
> > converts the UTF-16 surrogate to UTF-32 before calling iswalpha.
>
> When an application does this, it passes an invalid wint_t value to
> iswalpha(), according to the spec paragraph that you have just cited.
> So the application uses an extension to POSIX functionality, not
> POSIX itself.

Well, given that the description doesn't explicitly talk about a value
given as wchar_t, but instead about a "wide-character code corresponding
to a valid character", I saw some room for interpretation...

> I see that Cygwin 1.7.x iswalpha() works in this way you describe (but
> mingw's iswalpha() doesn't).  So this means that gnulib's proposed
> iswwalpha(wwchar_t) function could be implemented using iswalpha()
> on Cygwin 1.7.x and will not cause the Unicode based tables to be
> included in the executable.  This is good and nice.

I'm glad you see it that way.

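For illustration only, here is a minimal sketch in C of the conversion
discussed above: combine a UTF-16 surrogate pair into a UTF-32 value and
hand that to iswalpha() as a wint_t.  The helper names (utf16_decode,
combine_surrogates, iswalpha_utf16, fits_strict_reading) are hypothetical
and not part of Cygwin, newlib, or gnulib.  It assumes wint_t is at least
32 bits wide and that the implementation accepts such values, as described
above for Cygwin 1.7.x; under the strict reading of POSIX the call is
undefined where wchar_t is 16 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helpers, for illustration only. */
static bool is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static bool is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combine a valid surrogate pair into one code point (U+10000..U+10FFFF). */
static uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    assert(is_high_surrogate(hi) && is_low_surrogate(lo));
    return 0x10000u + ((uint32_t)(hi - 0xD800u) << 10) + (uint32_t)(lo - 0xDC00u);
}

/* Decode one code point from n UTF-16 units.  Returns the number of
   units consumed (1 or 2), or 0 on empty or malformed input. */
static size_t utf16_decode(const uint16_t *s, size_t n, uint32_t *cp)
{
    if (n == 0)
        return 0;
    if (is_high_surrogate(s[0])) {
        if (n < 2 || !is_low_surrogate(s[1]))
            return 0;               /* unpaired high surrogate */
        *cp = combine_surrogates(s[0], s[1]);
        return 2;
    }
    if (is_low_surrogate(s[0]))
        return 0;                   /* unpaired low surrogate */
    *cp = s[0];
    return 1;
}

/* The strict reading quoted above: x == (wint_t) (wchar_t) x || x == WEOF.
   With a 16-bit wchar_t this is false for every value above 0xFFFF. */
static bool fits_strict_reading(wint_t x)
{
    return x == (wint_t)(wchar_t)x || x == WEOF;
}

/* Classify the first character of a UTF-16 string.  Relies on the
   extension discussed in this thread when the value exceeds 0xFFFF. */
static int iswalpha_utf16(const uint16_t *s, size_t n, size_t *used)
{
    uint32_t cp = 0;
    *used = utf16_decode(s, n, &cp);
    if (*used == 0)
        return 0;
    return iswalpha((wint_t)cp);
}
```

A caller would walk the string by advancing *used units after each call.
Whether iswalpha() reports a non-BMP character as alphabetic still depends
on the C library and the current locale.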
> But if you say that the application should convert UTF-16 surrogates
> to UTF-32 before calling iswalpha: That's certainly a requirement
> for Cygwin 1.7.x applications that want to support the entire Unicode
> character set.  But it's outside of POSIX, and many GNU programs will
> not want to include this added complexity.  Just try to apply this
> suggestion to gnulib's quotearg.c, then estimate the time someone
> would need to apply it also to regcomp.c, strftime.c, mbscasestr.c,
> coreutils/src/wc.c, and so on.

Cygwin's regcomp is taken from FreeBSD and is UTF-16 capable, including
surrogate handling.  It only required two changes in the code.  But I
see what you mean.  Another layer which abstracts this problem looks
like the right thing to do.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple