Date: Thu, 16 Mar 2023 20:31:55 +0100
From: Corinna Vinschen
To: Brian Inglis
Cc: cygwin-apps@cygwin.com
Subject: Re: grep rebuild?
Reply-To: cygwin-apps@cygwin.com

On Mar 16 10:50, Brian Inglis via Cygwin-apps wrote:
> On 2023-03-16 06:08, Corinna Vinschen via Cygwin-apps wrote:
> > Hi Brian,
> >
> > there's a problem with the grep package.  It uses the internally
> > provided GNULIB regex library.
> >
> > Unfortunately, that's the default if the system doesn't provide a more
> > recent GLibc.  Which we'll never do.  The problem is this: Native
> > language support in GNULIB's regex is *only* available if it's built as
> > part of GLibc.
> >
> > I'd like to ask you to rebuild grep 3.9 with the
> > --without-included-regex option.
> >
> > That will allow grep to use Cygwin's own regex, which already comes with
> > basic native language support, and which I'm working on to better
> > support equivalence class and collation symbol expressions.
>
> Hi Corinna,
>
> We discussed this and I was going to release grep 3.8 test release 3, for
> testing with snapshots or when Cygwin 3.5.0 is released, then grep 3.9 came
> out, and I realized grep is updated every few months, so that went on the
> back burner.  I can do a test release for 3.9-2 with that configuration
> change.
>
> The current release passes all the class tests and works for me and Andrey.
> Are there any other implications of language support affecting grep?

As I wrote above, equivalence class and collation symbol expressions.
Character classes are easy and basically always supported, so they
don't really count.
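To check which of these bracket expression forms a given regex(3)
implementation accepts, a small probe along the following lines might
help.  This is only a sketch: the helper name probe_regex is made up for
illustration, and it uses nothing beyond the POSIX regcomp, regexec,
regerror and regfree calls.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/*
 * Hypothetical helper, not part of grep or Cygwin.
 * Compiles `pattern` as a POSIX basic regular expression and tries it
 * against `text` in whatever locale the caller has set.
 * Returns 1 on match, 0 on no match, -1 if regcomp rejects the pattern.
 */
static int probe_regex(const char *pattern, const char *text)
{
    regex_t re;
    char msg[128];
    int rc;

    assert(pattern != NULL && text != NULL);

    rc = regcomp(&re, pattern, REG_NOSUB);
    if (rc != 0) {
        /* E.g. an unsupported equivalence class or collating symbol. */
        regerror(rc, &re, msg, sizeof msg);
        printf("regcomp (%s): %s\n", pattern, msg);
        return -1;
    }

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);

    printf("regexec (%s, %s) = %s\n", pattern, text,
           rc == 0 ? "match" : "no match");
    return rc == 0 ? 1 : 0;
}
```

A caller would typically run setlocale (LC_ALL, "") first and then try
patterns such as '[[:alpha:]]', '[[=a=]]' and '[[.ch.]]' in turn.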
Here's what I expect to work:

First an example with an equivalence class.  "./fnmatch" is a simple
application calling fnmatch, with the first argument being the glob
expression and the second argument being the string to be matched.  The
locale is plain en_US.utf8.  Note the accented uppercase À!

  $ ./fnmatch '[[=a=]]' 'a'
  fnmatch ([[=a=]], a, 0) = 0 (en_US.utf8)
  $ ./fnmatch '[[=a=]]' 'b'
  fnmatch ([[=a=]], b, 0) = 1 (en_US.utf8)
  $ ./fnmatch '[[=a=]]' 'À'
  fnmatch ([[=a=]], À, 0) = 0 (en_US.utf8)
  $ ./fnmatch '[[=À=]]' 'a'
  fnmatch ([[=À=]], a, 0) = 0 (en_US.utf8)

As you can see, the non-accented a and the accented À belong to the same
equivalence class.

Now let's try grep on Cygwin:

  $ echo 'a' | LC_COLLATE=en_US.utf8 grep '[[=a=]]'
  a
  $ echo 'b' | LC_COLLATE=en_US.utf8 grep '[[=a=]]'
  $ echo 'À' | LC_COLLATE=en_US.utf8 grep '[[=a=]]'
  $ echo 'a' | LC_COLLATE=en_US.utf8 grep '[[=À=]]'
  grep: Invalid collation character

The first two results are expected, but not the third and fourth.

Let's try the same on Linux:

  $ echo 'a' | LC_COLLATE=en_US.utf8 grep '[[=a=]]'
  a
  $ echo 'b' | LC_COLLATE=en_US.utf8 grep '[[=a=]]'
  $ echo 'À' | LC_COLLATE=en_US.utf8 grep '[[=a=]]'
  À
  $ echo 'a' | LC_COLLATE=en_US.utf8 grep '[[=À=]]'
  a

See the difference?

Next, let's try a collating element.  "./glob" is a simple test app
calling glob and setting the locale to the second argument.  There's a
file called "chakref" in the CWD.

There's no collating element "ch" in English:

  $ ./glob '[[.ch.]]*' en_US.utf8
  glob ([[.ch.]]*) = -3

But in Czech:

  $ ./glob '[[.ch.]]*' cs_CZ.utf8
  chakref

Try this with current grep on Cygwin:

  $ ls -1 | LC_COLLATE=en_US.utf8 grep '^[[.ch.]].*'
  grep: Invalid collation character

Ok.

  $ ls -1 | LC_COLLATE=cs_CZ.utf8 grep '^[[.ch.]].*'
  grep: Invalid collation character

Not ok.

On Linux:

  $ ls -1 | LC_COLLATE=en_US.utf8 grep '^[[.ch.]].*'
  grep: Invalid collation character

Ok.

  $ ls -1 | LC_COLLATE=cs_CZ.utf8 grep '^[[.ch.]].*'
  chakref

Ok.
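The sources of the "./fnmatch" and "./glob" test apps aren't included in
this mail.  For anyone who wants to reproduce the tests, here is a
minimal sketch of what such helpers could look like.  The function names
show_fnmatch and show_glob and the exact output format are assumptions
modelled on the transcripts above, not the actual test code.

```c
#include <assert.h>
#include <fnmatch.h>
#include <glob.h>
#include <locale.h>
#include <stdio.h>

/*
 * Hypothetical helper: call fnmatch(3) with flags 0 in the current
 * locale and print the result.  Returns the raw fnmatch result
 * (0 on match, FNM_NOMATCH otherwise).
 */
static int show_fnmatch(const char *pattern, const char *string)
{
    const char *loc;
    int ret;

    assert(pattern != NULL && string != NULL);

    ret = fnmatch(pattern, string, 0);
    loc = setlocale(LC_COLLATE, NULL);   /* query only, no change */
    printf("fnmatch (%s, %s, 0) = %d (%s)\n",
           pattern, string, ret, loc ? loc : "unknown");
    return ret;
}

/*
 * Hypothetical helper: switch to `locale`, call glob(3) and print every
 * matching path.  Returns the number of matches, 0 if nothing matched,
 * or -1 if the locale is unavailable or glob failed otherwise.
 */
static int show_glob(const char *pattern, const char *locale)
{
    glob_t g;
    size_t i;
    int ret, count;

    assert(pattern != NULL && locale != NULL);

    if (setlocale(LC_ALL, locale) == NULL) {
        printf("setlocale (%s) failed\n", locale);
        return -1;
    }

    ret = glob(pattern, 0, NULL, &g);
    if (ret != 0) {
        /* The numeric value of GLOB_NOMATCH differs between systems. */
        printf("glob (%s) = %d\n", pattern, ret);
        return ret == GLOB_NOMATCH ? 0 : -1;
    }

    for (i = 0; i < g.gl_pathc; i++)
        puts(g.gl_pathv[i]);
    count = (int) g.gl_pathc;
    globfree(&g);
    return count;
}
```

A main function for the fnmatch variant would call
setlocale (LC_ALL, "") and pass argv[1] and argv[2] to show_fnmatch; the
glob variant would pass its two arguments straight to show_glob.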
Please note that, right now, collating symbols and equivalence classes
*only* work in the Cygwin main branch in glob(3) and fnmatch(3), but
NOT YET in regex(3).  That's what I'm planning to add in the next couple
of weeks (or months...)


Thanks,
Corinna