From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <528c7bd3-e39a-5b7a-5819-5a6b4e3c71c5@SystematicSw.ab.ca>
Date: Sat, 27 Nov 2021 00:24:02 -0700
MIME-Version: 1.0
Reply-To: cygwin@cygwin.com
Subject: Re: raise(-1) has stopped returning an error recently
Content-Language: en-CA
To: cygwin@cygwin.com
References: <42c9bb90-dd78-edfa-99ff-f65f7e000956@SystematicSw.ab.ca> <643c1cb7-9b18-25cf-62b0-8085c8fab137@Shaw.ca>
From: Brian Inglis
Organization: Systematic Software
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
List-Id: General Cygwin discussions and problem reports

On 2021-11-25 05:54, Corinna Vinschen via Cygwin wrote:
> On Nov 24 11:01, Brian Inglis via Cygwin wrote:
>> On 2021-11-24 02:25, Corinna Vinschen via Cygwin wrote:
>>>> On Tue, Nov 23, 2021 at 11:18:25AM -0700, Brian Inglis wrote:
>>>>> Do Cygwin and/or Windows support surrogate pairs in UTF-8?
>>>
>>> You mean UTF-16.  UTF-8 doesn't know surrogate pairs, UTF-16 does.
>>> Originally there was UCS-2, 16 bits, with only 65536 code points.
>>> However, Unicode left the BMP already with version 2.0 in 1996, so
>>> UTF-16 and surrogate pairs became necessary.  Windows as well as Cygwin
>>> support them.
>>
>> How does Cygwin support UTF-16 locales with surrogate pairs?
>
> UTF-16 locales?  There's no such thing.  UTF-16 is just the 16 bit
> representation for Unicode, and as such, is independent of the locale.
> On the user side, Cygwin only supports UTF-8 as Unicode representation.
> Internally you can then convert them to wchar_t which is UTF-16.
>
>> Are they the "native" locales inherited from Windows if others are not
>> specified e.g. UTF-8, some OEM SBCS or MBCS?
>
> Just try `locale -av' and you'll see all supported locales and their
> respective default codeset.  All of them can be used with .utf8
> specifier to use UTF-8 instead of the default codeset.  Some of them
> use UTF-8 as default codeset anyway, e. g., fa_IR or yo_NG.
>
>>>> There are 3 tests in surrogate-pair and only the 3rd one failed. So I guess
>>>> surrogate pairs in UTF-8 "mostly work".
>>>
>>> UTF-16.
>>> The surrogate stuff is evil at times.  Have a look at the
>>> __utf8_wctomb function in
>>> https://sourceware.org/git/?p=newlib-cygwin.git;a=blob;f=newlib/libc/stdlib/wctomb_r.c
>>> Lone surrogate halfs in an input stream are a problem, for instance.
>>
>> Thus the confusion with grep surrogate pair tests which appear to be running
>> under a UTF-8 locale: see attached surrogate pair extract from cygport
>> --debug grep.cygport check.
>
> An STC in plain C might be helpful.

I think I may finally have got the point of the test, not knowing much about
legacy UTF-16 UCS encoding or surrogate pairs. From what I can see:

	𐐅 U+010405 f0 90 90 85 DESERET CAPITAL LETTER LONG OO

fails to match itself, and presumably others do too.
Presumably this is converted internally on some platforms, including
Cygwin, to a UTF-16 surrogate pair, and a grep comparison fails, although
a bash comparison succeeds.

$ printf '\U10405\n' | iconv -f utf-8 -t utf-16be | xxd -g2
00000000: d801 dc05 000a
$ printf '\U10405\n' > t
$ grep -f t t; echo $?
1
$ oo=`printf '\U10405\n'`; [ $oo = $oo ] && echo same || echo diff
same

-- 
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in binary units and prefixes, physical quantities in SI.]
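
For reference, the arithmetic behind the byte values shown in the message above (f0 90 90 85 in UTF-8, d801 dc05 in UTF-16BE) can be sketched as follows. This is only an illustration in Python, not the plain C test case requested in the thread, and it does not show where the grep comparison goes wrong. The function names utf8_bytes, utf16_surrogates and from_surrogates are made up for this sketch and do not come from newlib, Cygwin or grep.

```python
# Hypothetical helpers for illustration only; they handle just the
# supplementary planes (U+10000..U+10FFFF), where surrogate pairs arise.

def utf8_bytes(cp):
    """Encode one supplementary-plane code point as four UTF-8 bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    return bytes([
        0xF0 | (cp >> 18),           # lead byte: top 3 bits
        0x80 | ((cp >> 12) & 0x3F),  # continuation: next 6 bits
        0x80 | ((cp >> 6) & 0x3F),   # continuation: next 6 bits
        0x80 | (cp & 0x3F),          # continuation: low 6 bits
    ])

def utf16_surrogates(cp):
    """Split one supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    v = cp - 0x10000                 # 20-bit offset into the planes
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)

def from_surrogates(high, low):
    """Recombine a valid surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

if __name__ == "__main__":
    cp = 0x10405                     # DESERET CAPITAL LETTER LONG OO
    print(utf8_bytes(cp).hex(" "))   # f0 90 90 85
    hi, lo = utf16_surrogates(cp)
    print(f"{hi:04x} {lo:04x}")      # d801 dc05
    print(f"U+{from_surrogates(hi, lo):06X}")  # U+010405
```

Both encodings round-trip to the same code point, so a mismatch would have to come from how a particular tool handles the pair after conversion to wchar_t, not from the encoding itself.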