From: Bruno Haible
To: bug-gnulib@gnu.org, cygwin, "bug-coreutils", Eric Blake
Subject: Re: 16-bit wchar_t on Windows and Cygwin
Date: Wed, 02 Feb 2011 16:03:00 -0000
Message-Id: <201102021702.57387.bruno@clisp.org>
In-Reply-To: <20110202122102.GD2675@calimero.vinschen.de>
References: <20110202122102.GD2675@calimero.vinschen.de> <201102021229.04623.bruno@clisp.org>

Hello Corinna,

> And, please note the wording in SUSv4, for instance in
> http://calimero.vinschen.de/susv4/functions/iswalpha.html

Likewise in POSIX:2008, at the URL
http://www.opengroup.org/onlinepubs/9699919799/functions/iswalpha.html

> The wc argument is a wint_t, the value of which the application shall
>                      ^^^^^^                         ^^^^^^^^^^^
> ensure is a wide-character code corresponding to a valid character in
>           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> the current locale, or equal to the value of the macro WEOF. If the
> argument has any other value, the behavior is undefined.

Expressed as a formula, this sentence means that when an application
passes a 'wint_t x' to iswalpha(), x has to satisfy

    x == (wint_t) (wchar_t) x || x == WEOF

> iswalpha takes wint_t, not wchar_t. Since sizeof (wint_t) is 4 byte,
> the function can return the correct value, provided that the application
> converts the UTF-16 surrogate to UTF-32 before calling iswalpha.

When an application does this, it passes an invalid wint_t value to
iswalpha(), according to the spec paragraph that you have just cited.
So the application uses an extension to POSIX functionality, not POSIX
itself.

I see that Cygwin 1.7.x iswalpha() works in the way you describe (but
mingw's iswalpha() doesn't). So this means that gnulib's proposed
iswwalpha(wwchar_t) function could be implemented using iswalpha() on
Cygwin 1.7.x and will not cause the Unicode-based tables to be included
in the executable. This is good and nice.

But if you say that the application should convert UTF-16 surrogates to
UTF-32 before calling iswalpha: That's certainly a requirement for
Cygwin 1.7.x applications that want to support the entire Unicode
character set. But it's outside of POSIX, and many GNU programs will not
want to include this added complexity. Just try to apply this suggestion
to gnulib's quotearg.c, then estimate the time someone would need to
apply it also to regcomp.c, strftime.c, mbscasestr.c,
coreutils/src/wc.c, and so on.

For this reason I propose the wwchar_t type with an API that is similar
to POSIX but includes the surrogate handling, rather than pushing it
into each application's code.

Bruno
-- 
In memoriam Carl Friedrich Goerdeler
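
For readers following the thread, the validity condition and the
surrogate conversion discussed above can be sketched in C roughly as
follows. This is only an illustration under stated assumptions, not code
from gnulib or Cygwin: the types wchar16_model_t and wint32_model_t, the
macro WEOF_MODEL, and both function names are hypothetical stand-ins
that model a platform with a 16-bit wchar_t and a 32-bit wint_t, so that
the sketch behaves the same on any host.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins (assumptions, not the real typedefs): they model
   a platform where wchar_t is 16 bits wide and wint_t is 32 bits wide. */
typedef uint16_t wchar16_model_t;
typedef uint32_t wint32_model_t;
#define WEOF_MODEL ((wint32_model_t) 0xFFFFFFFFu)

/* The precondition quoted from the spec: the argument must survive a
   round trip through the wide-character type, or be equal to WEOF. */
static bool
is_posix_valid_wint (wint32_model_t x)
{
  return x == (wint32_model_t) (wchar16_model_t) x || x == WEOF_MODEL;
}

/* Combine a UTF-16 surrogate pair into a UTF-32 code point.
   Returns WEOF_MODEL if (hi, lo) is not a high/low surrogate pair. */
static wint32_model_t
combine_surrogates (wchar16_model_t hi, wchar16_model_t lo)
{
  if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
    return WEOF_MODEL;
  return 0x10000u
         + (((wint32_model_t) hi - 0xD800u) << 10)
         + ((wint32_model_t) lo - 0xDC00u);
}
```

With these definitions, combine_surrogates (0xD801, 0xDC00) yields
0x10400, and is_posix_valid_wint (0x10400) is false: the combined code
point does not fit the 16-bit wide-character type, so passing it to
iswalpha() falls outside what the quoted paragraph guarantees.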