public inbox for cygwin-apps@cygwin.com
From: Corinna Vinschen <corinna-cygwin@cygwin.com>
To: Brian Inglis <Brian.Inglis@shaw.ca>
Cc: cygwin-apps@cygwin.com
Subject: Re: grep rebuild?
Date: Thu, 16 Mar 2023 20:31:55 +0100
Message-ID: <ZBNuq6tWhIoVu4Wd@calimero.vinschen.de>
In-Reply-To: <e4ee6ee2-28c9-1f2f-3683-9215a80089ec@Shaw.ca>

On Mar 16 10:50, Brian Inglis via Cygwin-apps wrote:
> On 2023-03-16 06:08, Corinna Vinschen via Cygwin-apps wrote:
> > Hi Brian,
> > 
> > there's a problem with the grep package.  It uses the internally
> > provided GNULIB regex library.
> > 
> > Unfortunately, that's the default if the system doesn't provide a more
> > recent GLibc.  Which we'll never do.  The problem is this: Native
> > language support in GNULIB's regex is *only* available, if it's built as
> > part of GLibc.
> > 
> > I'd like to ask you to rebuild grep 3.9 with the
> > --without-included-regex option.
> > 
> > That will allow grep to use Cygwin's own regex, which already comes with
> > basic native language support, and which I'm working on to better
> > support equivalence class and collation symbol expressions.
> 
> Hi Corinna,
> 
> We discussed this and I was going to release grep 3.8 test release 3, for
> testing with snapshots or when Cygwin 3.5.0 is released, then grep 3.9 came
> out, and I realized grep is updated every few months, so that went on the
> back burner. I can do a test release for 3.9-2 with that configuration
> change.
> 
> The current release passes all the class tests and works for me and Andrey.
> Are there any other implications of language support affecting grep?

As I wrote above, equivalence class and collation symbol expressions.
Character classes are easy and basically always supported, so they
don't really count.

Here's what I expect to work:

First, an example with an equivalence class.  "./fnmatch" is a simple
application calling fnmatch, with the 1st arg being the glob expression
and the 2nd arg being the string to match.  The locale is simply
en_US.utf8.  Note the accented uppercase À!

  $ ./fnmatch '[[=a=]]' 'a'
  fnmatch ([[=a=]], a, 0) = 0 (en_US.utf8)
  $ ./fnmatch '[[=a=]]' 'b'
  fnmatch ([[=a=]], b, 0) = 1 (en_US.utf8)
  $ ./fnmatch '[[=a=]]' 'À'
  fnmatch ([[=a=]], À, 0) = 0 (en_US.utf8)
  $ ./fnmatch '[[=À=]]' 'a'
  fnmatch ([[=À=]], a, 0) = 0 (en_US.utf8)

As you can see, the non-accented a and the accented À belong
to the same equivalence class.

Now let's try grep on Cygwin:

  $ echo 'a' | LC_COLLATE=en_US.utf8 grep '[[=a=]]'
  a
  $ echo 'b' | LC_COLLATE=en_US.utf8 grep '[[=a=]]'
  $ echo 'À' | LC_COLLATE=en_US.utf8 grep '[[=a=]]'
  $ echo 'a' | LC_COLLATE=en_US.utf8 grep '[[=À=]]'
  grep: Invalid collation character

The first two results are expected, but not the third and fourth results.

Let's try the same on Linux:

  $ echo 'a' | LC_COLLATE=en_US.utf8 grep '[[=a=]]'
  a
  $ echo 'b' | LC_COLLATE=en_US.utf8 grep '[[=a=]]'
  $ echo 'À' | LC_COLLATE=en_US.utf8 grep '[[=a=]]'
  À
  $ echo 'a' | LC_COLLATE=en_US.utf8 grep '[[=À=]]'
  a

See the difference?

Next, let's try a collating element:

"./glob" is a simple test app that calls glob on its first argument
after setting the locale to the second argument.  There's a file called
"chakref" in the CWD.

There's no collating element "ch" in English:

  $ ./glob '[[.ch.]]*' en_US.utf8
  glob ([[.ch.]]*) = -3

But there is one in Czech:

  $ ./glob '[[.ch.]]*' cs_CZ.utf8
  chakref

Try this with current grep:

  $ ls -1 | LC_COLLATE=en_US.utf8 grep '^[[.ch.]].*'
  grep: Invalid collation character

Ok.

  $ ls -1 | LC_COLLATE=cs_CZ.utf8 grep '^[[.ch.]].*'
  grep: Invalid collation character

Not ok.

On Linux:
  
  $ ls -1 | LC_COLLATE=en_US.utf8 grep '^[[.ch.]].*'
  grep: Invalid collation character

Ok.

  $ ls -1 | LC_COLLATE=cs_CZ.utf8 grep '^[[.ch.]].*'
  chakref

Ok.

Please note that, right now, collating symbols and equivalence classes
*only* work in the Cygwin main branch in glob(3) and fnmatch(3), but NOT
YET in regex(3).  That's what I'm planning to add in the next couple of
weeks (or months...)


Thanks,
Corinna
