fnmatch improvements

public inbox for cygwin@cygwin.com

* fnmatch improvements
@ 2023-07-27 10:15 Bruno Haible
  2023-07-27 18:24 ` Corinna Vinschen
  0 siblings, 1 reply; 32+ messages in thread
From: Bruno Haible @ 2023-07-27 10:15 UTC (permalink / raw)
  To: cygwin

Hi,

Gnulib has, for the first time, an fnmatch() implementation that supports
characters outside the Unicode Basic Multilingual Plane (BMP), even on Cygwin
with its 16-bit wchar_t type. That is, in a UTF-8 locale, e.g.
  fnmatch ("x?y", "x\360\237\230\213y", 0)
now returns 0.
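
To make the example concrete, here is a minimal sketch that decodes the four
bytes \360\237\230\213 and counts the UTF-16 units they need. It is not code
from gnulib or Cygwin; the helper names decode_utf8_4 and utf16_units_needed
are made up for illustration, and the decoder assumes a well-formed four-byte
sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: decode one 4-byte UTF-8 sequence whose lead byte
   has the form 11110xxx into a code point.  Returns 0 if the bytes do
   not have the expected bit patterns.  It does not reject overlong
   forms or values above U+10FFFF.  */
uint32_t
decode_utf8_4 (const unsigned char *s)
{
  if ((s[0] & 0xF8) != 0xF0
      || (s[1] & 0xC0) != 0x80
      || (s[2] & 0xC0) != 0x80
      || (s[3] & 0xC0) != 0x80)
    return 0;
  return ((uint32_t) (s[0] & 0x07) << 18)
         | ((uint32_t) (s[1] & 0x3F) << 12)
         | ((uint32_t) (s[2] & 0x3F) << 6)
         | (uint32_t) (s[3] & 0x3F);
}

/* Number of 16-bit units needed to represent CP in UTF-16.  */
int
utf16_units_needed (uint32_t cp)
{
  assert (cp <= 0x10FFFF);
  return cp > 0xFFFF ? 2 : 1;
}
```

Applied to the bytes above, decode_utf8_4 yields U+1F60B, which needs two
UTF-16 units. A matcher that treats every 16-bit wchar_t as one character
would therefore see two characters where "?" expects one.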

This implementation also implements GNU extensions, as documented in
https://www.gnu.org/software/libc/manual/html_node/Wildcard-Matching.html

Now, I see that in the Cygwin master branch the fnmatch implementation has
been improved, supposedly handling non-BMP characters and character classes
as well.

Therefore I would find it interesting to know whether the Cygwin 3.5.0 fnmatch()
now still gets overridden by the gnulib one and, if not, whether it passes the
gnulib test suite.

I can't easily install a Cygwin 3.5.0 snapshot. If one of you would like to
help, here's how to:
  1. Create an environment for working with a Cygwin 3.5.0 snapshot (from
     March 2023 or newer).
  2. wget https://haible.de/bruno/gnu/testdir-fnmatch.tar.gz
  3. tar xfz testdir-fnmatch.tar.gz
  4. cd testdir-fnmatch-posix
     ./configure 2>&1 | tee log1
     make
     make check
     grep fnmatch log1
     grep REPLACE_FNMATCH config.status
     cd ..
  5. cd testdir-fnmatch-gnu
     ./configure 2>&1 | tee log1
     make
     make check
     grep fnmatch log1
     grep REPLACE_FNMATCH config.status
     cd ..
and provide the build and grep results.

Thanks!

         Bruno


* Re: fnmatch improvements
  2023-07-27 10:15 fnmatch improvements Bruno Haible
@ 2023-07-27 18:24 ` Corinna Vinschen
  2023-07-27 19:05   ` Corinna Vinschen
  0 siblings, 1 reply; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-27 18:24 UTC (permalink / raw)
  To: Bruno Haible; +Cc: cygwin

Hi Bruno,

On Jul 27 12:15, Bruno Haible via Cygwin wrote:
> Hi,
> 
> Gnulib has, for the first time, an fnmatch() implementation that supports
> characters outside the Unicode Basic Multilingual Plane (BMP), even on Cygwin
> with its 16-bit wchar_t type. That is, in a UTF-8 locale, e.g.
>   fnmatch ("x?y", "x\360\237\230\213y", 0)
> now returns 0.
> 
> This implementation also implements GNU extensions, as documented in
> https://www.gnu.org/software/libc/manual/html_node/Wildcard-Matching.html
> 
> Now, I see that in the Cygwin master branch the fnmatch implementation has
> been improved, supposedly handling non-BMP characters and character classes
> as well.

The major changes are using 32-bit Unicode values internally and
implementing collating symbols and equivalence class expressions.
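
As a rough illustration of what working on 32-bit values means when wchar_t
holds only 16 bits, the following sketch converts between one code point and
a UTF-16 surrogate pair. The function names are hypothetical and this is not
the Cygwin implementation; it only shows the standard UTF-16 arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Split a code point above U+FFFF into a UTF-16 surrogate pair.  */
void
to_surrogates (uint32_t cp, uint16_t *hi, uint16_t *lo)
{
  assert (cp >= 0x10000 && cp <= 0x10FFFF);
  cp -= 0x10000;
  *hi = (uint16_t) (0xD800 + (cp >> 10));
  *lo = (uint16_t) (0xDC00 + (cp & 0x3FF));
}

/* Combine a high and a low surrogate into one 32-bit code point.  */
uint32_t
from_surrogates (uint16_t hi, uint16_t lo)
{
  assert (hi >= 0xD800 && hi <= 0xDBFF);
  assert (lo >= 0xDC00 && lo <= 0xDFFF);
  return 0x10000 + (((uint32_t) (hi - 0xD800) << 10)
                    | (uint32_t) (lo - 0xDC00));
}
```

For U+1F60B the pair is 0xD83D 0xDE0B. Once the pair is combined, a pattern
element such as "?" can be compared against a single 32-bit value.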

> Therefore I would find it interesting to know whether the Cygwin 3.5.0 fnmatch()
> now still gets overridden by the gnulib one and, if not, whether it passes the
> gnulib test suite.

I'm looking into that.  First thing, your testsuite uncovered a bug in
the latest fnmatch in the C locale. Comparing pointers instead of
comparing characters was never a good idea for pattern matching...
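
The faulty code is not shown in this thread, so the following is only a
generic illustration of this class of bug, with made-up function names. It
contrasts comparing two pointers with comparing the characters they point to.

```c
#include <assert.h>

/* Buggy variant: compares the addresses, so it is true only when both
   pointers refer to the very same object.  */
int
same_char_buggy (const char *pat, const char *str)
{
  return pat == str;
}

/* Intended variant: compares the characters the pointers refer to.  */
int
same_char (const char *pat, const char *str)
{
  assert (pat != 0 && str != 0);
  return *pat == *str;
}
```

With a pattern and a string held in separate buffers, the buggy variant
reports a mismatch even when both contain the same character.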

When I'm done I hope that our 3.5 fnmatch won't be overridden by the
gnulib version :}

> I can't easily install a Cygwin 3.5.0 snapshot. If one of you would like to
> help, here's how to:
>   1. Create an environment for working with a Cygwin 3.5.0 snapshot (from
>      March 2023 or newer).
>   2. wget https://haible.de/bruno/gnu/testdir-fnmatch.tar.gz
>   3. tar xfz testdir-fnmatch.tar.gz
>   4. cd testdir-fnmatch-posix
>      ./configure 2>&1 | tee log1
>      make
>      make check
>      grep fnmatch log1
>      grep REPLACE_FNMATCH config.status
>      cd ..
>   5. cd testdir-fnmatch-gnu
>      ./configure 2>&1 | tee log1
>      make
>      make check
>      grep fnmatch log1
>      grep REPLACE_FNMATCH config.status
>      cd ..
> and provide the build and grep results.
> 
> Thanks!
> 
>          Bruno

No worries, thanks for the testcases. I think I will have some results
tomorrow.


Corinna


* Re: fnmatch improvements
  2023-07-27 18:24 ` Corinna Vinschen
@ 2023-07-27 19:05   ` Corinna Vinschen
  2023-07-27 20:25     ` Brian Inglis
  2023-07-27 21:40     ` Bruno Haible
  0 siblings, 2 replies; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-27 19:05 UTC (permalink / raw)
  To: Corinna Vinschen via Cygwin; +Cc: Bruno Haible

On Jul 27 20:24, Corinna Vinschen via Cygwin wrote:
> On Jul 27 12:15, Bruno Haible via Cygwin wrote:
> I'm looking into that.  First thing, your testsuite uncovered a bug in
> the latest fnmatch in the C locale. Comparing pointers instead of
> comparing characters was never a good idea for pattern matching...
> 
> When I'm done I hope that our 3.5 fnmatch won't be overridden by the
> gnulib version :}
> 
> > I can't easily install a Cygwin 3.5.0 snapshot. If one of you would like to
> > help, here's how to:
> >   1. Create an environment for working with a Cygwin 3.5.0 snapshot (from
> >      March 2023 or newer).
> >   2. wget https://haible.de/bruno/gnu/testdir-fnmatch.tar.gz
> >   3. tar xfz testdir-fnmatch.tar.gz
> >   4. cd testdir-fnmatch-posix
> >      ./configure 2>&1 | tee log1
> >      make
> >      make check

I fixed the above problem and the POSIX check now works fine:

> >      grep fnmatch log1

    checking for fnmatch.h... yes
    checking for fnmatch... yes
    checking for working POSIX fnmatch... yes

I also extracted the fnmatch configure testcase and ran it manually.
It returns 0 now.  But:

> >      grep REPLACE_FNMATCH config.status

    S["REPLACE_FNMATCH"]="1"

Looks like the reason is that we don't have a uchar.h file?  Seems 
like this is of interest for AIX, but why should this be of
interest for fnmatch on other systems?


Thanks,
Corinna


* Re: fnmatch improvements
  2023-07-27 19:05   ` Corinna Vinschen
@ 2023-07-27 20:25     ` Brian Inglis
  2023-07-27 21:22       ` Bruno Haible
  2023-07-27 21:40     ` Bruno Haible
  1 sibling, 1 reply; 32+ messages in thread
From: Brian Inglis @ 2023-07-27 20:25 UTC (permalink / raw)
  To: Corinna Vinschen via Cygwin, Bruno Haible

On 2023-07-27 13:05, Corinna Vinschen via Cygwin wrote:
> On Jul 27 20:24, Corinna Vinschen via Cygwin wrote:
>> On Jul 27 12:15, Bruno Haible via Cygwin wrote:
>> I'm looking into that.  First thing, your testsuite uncovered a bug in
>> the latest fnmatch in the C locale. Comparing pointers instead of
>> comparing characters was never a good idea for pattern matching...
>>
>> When I'm done I hope that our 3.5 fnmatch won't be overridden by the
>> gnulib version :}
>>
>>> I can't easily install a Cygwin 3.5.0 snapshot. If one of you would like to
>>> help, here's how to:
>>>    1. Create an environment for working with a Cygwin 3.5.0 snapshot (from
>>>       March 2023 or newer).
>>>    2. wget https://haible.de/bruno/gnu/testdir-fnmatch.tar.gz
>>>    3. tar xfz testdir-fnmatch.tar.gz
>>>    4. cd testdir-fnmatch-posix
>>>       ./configure 2>&1 | tee log1
>>>       make
>>>       make check
> 
> I fixed the above problem and the POSIX check now works fine:
> 
>>>       grep fnmatch log1
> 
>      checking for fnmatch.h... yes
>      checking for fnmatch... yes
>      checking for working POSIX fnmatch... yes
> 
> I also extracted the fnmatch configure testcase and ran it manually.
> It returns 0 now.  But:
> 
>>>       grep REPLACE_FNMATCH config.status
> 
>      S["REPLACE_FNMATCH"]="1"
> 
> Looks like the reason is that we don't have a uchar.h file?  Seems
> like this is of interest for AIX, but why should this be of
> interest for fnmatch on other systems?

It was added in TR 19769 for C99 and integrated into C11/C++11; a uchar.h is
available in libicu-devel:

	https://cplusplus.com/reference/cuchar/

	https://open-std.org/jtc1/sc22/open/n3579.pdf

	https://open-std.org/jtc1/sc22/wg14/www/docs/n1326.pdf

	https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf#page=416

	$ find /usr/include/ -name uchar.h
	/usr/include/unicode/uchar.h

	$ cygcheck -f /usr/include/unicode/uchar.h
	libicu-devel-72.1-1

-- 
Take care. Thanks, Brian Inglis              Calgary, Alberta, Canada

La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer     but when there is no more to cut
                                 -- Antoine de Saint-Exupéry


* Re: fnmatch improvements
  2023-07-27 20:25     ` Brian Inglis
@ 2023-07-27 21:22       ` Bruno Haible
  2023-07-27 22:17         ` Brian Inglis
  0 siblings, 1 reply; 32+ messages in thread
From: Bruno Haible @ 2023-07-27 21:22 UTC (permalink / raw)
  To: Brian.Inglis; +Cc: cygwin

Brian Inglis wrote:
> It was added in C99 TR19769, integrated in C/++11

Yes.

> available in libicu-devel:
> 
> 	https://cplusplus.com/reference/cuchar/
> 
> 	https://open-std.org/jtc1/sc22/open/n3579.pdf
> 
> 	https://open-std.org/jtc1/sc22/wg14/www/docs/n1326.pdf
> 
> 	https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf#page=416
> 
> 	$ find /usr/include/ -name uchar.h
> 	/usr/include/unicode/uchar.h
> 
> 	$ cygcheck -f /usr/include/unicode/uchar.h
> 	libicu-devel-72.1-1

This file, <unicode/uchar.h> from ICU4C, is something completely different from
ISO C's <uchar.h>.

Bruno





* Re: fnmatch improvements
  2023-07-27 19:05   ` Corinna Vinschen
  2023-07-27 20:25     ` Brian Inglis
@ 2023-07-27 21:40     ` Bruno Haible
  2023-07-28  8:53       ` Corinna Vinschen
  1 sibling, 1 reply; 32+ messages in thread
From: Bruno Haible @ 2023-07-27 21:40 UTC (permalink / raw)
  To: Corinna Vinschen, Bruno Haible

Corinna Vinschen wrote:
> > >   4. cd testdir-fnmatch-posix
> > >      ./configure 2>&1 | tee log1
> > >      make
> > >      make check
> 
> I fixed the above problem and the POSIX check now works fine:

Glad that the test suite was helpful (and that you fixed it before 3.5.0 —
so, no additional configure tests needed on the gnulib side).

> > >      grep fnmatch log1
> 
>     checking for fnmatch.h... yes
>     checking for fnmatch... yes
>     checking for working POSIX fnmatch... yes
>
> I also extracted the fnmatch configure testcase and ran it manually.
> It returns 0 now.  But:
> 
> > >      grep REPLACE_FNMATCH config.status
> 
>     S["REPLACE_FNMATCH"]="1"
>
> Looks like the reason is that we don't have a uchar.h file?  Seems 
> like this is of interest for AIX, but why should this be of
> interest for fnmatch on other systems?

Ah, that's because I assumed that if wchar_t is only 16 bits wide,
fnmatch() can't be correct. That is true for AIX (and on this platform,
I prefer not to test the available locales), but it is no longer true
with your implementation.

What are the test suite results if you do

  - Replace S["REPLACE_FNMATCH"]="1" with S["REPLACE_FNMATCH"]="0"
    in config.status,
  - make clean
  - ./config.status
  - make
  - make check

Then the tests will be run against Cygwin's fnmatch() function.
If all tests pass, I will add the following patch to gnulib.

diff --git a/m4/fnmatch.m4 b/m4/fnmatch.m4
index 2e1442eff7..e99737a476 100644
--- a/m4/fnmatch.m4
+++ b/m4/fnmatch.m4
@@ -1,4 +1,4 @@
-# Check for fnmatch - serial 18  -*- coding: utf-8 -*-
+# Check for fnmatch - serial 19  -*- coding: utf-8 -*-
 
 # Copyright (C) 2000-2007, 2009-2023 Free Software Foundation, Inc.
 # This file is free software; the Free Software Foundation
@@ -14,7 +14,7 @@ AC_DEFUN([gl_FUNC_FNMATCH_POSIX]
   m4_divert_text([DEFAULTS], [gl_fnmatch_required=POSIX])
 
   AC_REQUIRE([gl_FNMATCH_H])
-  AC_REQUIRE([AC_CANONICAL_HOST]) dnl for cross-compiles
+  AC_REQUIRE([AC_CANONICAL_HOST])
   gl_fnmatch_required_lowercase=`
     echo $gl_fnmatch_required | LC_ALL=C tr '[[A-Z]]' '[[a-z]]'
   `
@@ -164,7 +164,17 @@ AC_DEFUN([gl_FUNC_FNMATCH_POSIX]
     dnl This is due to wchar_t being only 16 bits wide.
     AC_REQUIRE([gl_UCHAR_H])
     if test $SMALL_WCHAR_T = 1; then
-      REPLACE_FNMATCH=1
+      case "$host_os" in
+        cygwin*)
+          dnl On Cygwin < 3.5.0, the above $gl_fnmatch_result came out as 'no',
+          dnl On Cygwin >= 3.5.0, fnmatch supports all Unicode characters,
+          dnl despite wchar_t being only 16 bits wide (because internally it
+          dnl works on wint_t values).
+          ;;
+        *)
+          REPLACE_FNMATCH=1
+          ;;
+      esac
     fi
   fi
   if test $HAVE_FNMATCH = 0 || test $REPLACE_FNMATCH = 1; then





* Re: fnmatch improvements
  2023-07-27 21:22       ` Bruno Haible
@ 2023-07-27 22:17         ` Brian Inglis
  2023-07-28  9:00           ` Corinna Vinschen
  0 siblings, 1 reply; 32+ messages in thread
From: Brian Inglis @ 2023-07-27 22:17 UTC (permalink / raw)
  To: cygwin; +Cc: Bruno Haible

On 2023-07-27 15:22, Bruno Haible wrote:
> Brian Inglis wrote:
>> It was added in C99 TR19769, integrated in C/++11
> 
> Yes.
> 
>> available in libicu-devel:
>>
>> 	https://cplusplus.com/reference/cuchar/
>>
>> 	https://open-std.org/jtc1/sc22/open/n3579.pdf
>>
>> 	https://open-std.org/jtc1/sc22/wg14/www/docs/n1326.pdf
>>
>> 	https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf#page=416
>>
>> 	$ find /usr/include/ -name uchar.h
>> 	/usr/include/unicode/uchar.h
>>
>> 	$ cygcheck -f /usr/include/unicode/uchar.h
>> 	libicu-devel-72.1-1
> 
> This file, <unicode/uchar.h> from ICU4C, is something completely different than
> ISO C's <uchar.h>.

This would then be an addition for *newlib* AT sourceware DOT org, so we could
use FreeBSD's:

https://cgit.freebsd.org/src/blame/include/uchar.h?id=9f9d157d82e2332b74d9c45b596748e3e4691f2d

plus consideration of:

gnulib:

https://git.savannah.gnu.org/gitweb/?p=gnulib.git&a=search&h=HEAD&st=commit&s=uchar.h

and C2023 CD2:

	https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3096.pdf

There are only symbol formatting changes in the N3148 comments, and N3149 is a
zip with a password-protected PDF, so it is likely the FDIS!



* Re: fnmatch improvements
  2023-07-27 21:40     ` Bruno Haible
@ 2023-07-28  8:53       ` Corinna Vinschen
  2023-07-28 10:56         ` Bruno Haible
  2023-07-28 11:12         ` fnmatch improvements Corinna Vinschen
  0 siblings, 2 replies; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-28  8:53 UTC (permalink / raw)
  To: Bruno Haible; +Cc: cygwin

On Jul 27 23:40, Bruno Haible via Cygwin wrote:
> Corinna Vinschen wrote:
> > > >   4. cd testdir-fnmatch-posix
> > > >      ./configure 2>&1 | tee log1
> > > >      make
> > > >      make check
> > 
> > I fixed the above problem and the POSIX check now works fine:
> 
> Glad that the test suite was helpful (and that you fixed it before 3.5.0 —
> so, no additional configure tests needed on the gnulib side).
> 
> > > >      grep fnmatch log1
> > 
> >     checking for fnmatch.h... yes
> >     checking for fnmatch... yes
> >     checking for working POSIX fnmatch... yes
> >
> > I also extracted the fnmatch configure testcase and ran it manually.
> > It returns 0 now.  But:
> > 
> > > >      grep REPLACE_FNMATCH config.status
> > 
> >     S["REPLACE_FNMATCH"]="1"
> >
> > Looks like the reason is that we don't have a uchar.h file?  Seems 
> > like this is of interest for AIX, but why should this be of
> > interest for fnmatch on other systems?
> 
> Ah, that's because I assumed that if wchar_t is only 16 bits wide,
> fnmatch() can't be correct. That is true for AIX (and on this platform,
> I prefer not to test the available locales), but it is no longer true
> with your implementation.
> 
> What are the test suite results if you do
> 
>   - Replace S["REPLACE_FNMATCH"]="1" with S["REPLACE_FNMATCH"]="0"
>     in config.status,
>   - make clean
>   - ./config.status
>   - make

The build fails here.  The reason is that the GNU extension FNM_EXTMATCH
is not supported by the FreeBSD code base of fnmatch, so it's not
defined in our fnmatch.h system header.  Gnulib still tries to build
fnmatch_loop.c which uses FNM_EXTMATCH, but apparently it now relies on
using the system header?
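
To illustrate the header issue, here is a small sketch of code that uses
FNM_EXTMATCH only when the system <fnmatch.h> defines it. The wrapper name
ext_glob_match is hypothetical and this is not gnulib's code. Note also that
glibc typically exposes the macro only when _GNU_SOURCE is defined.

```c
#include <assert.h>
#include <fnmatch.h>

/* Hypothetical wrapper around fnmatch() with ksh-style extended
   patterns such as "@(a|b)".
   Returns 1 on match, 0 on no match, and -1 on error or when the
   system header does not provide FNM_EXTMATCH.  */
int
ext_glob_match (const char *pattern, const char *string)
{
  assert (pattern != 0 && string != 0);
#ifdef FNM_EXTMATCH
  {
    int ret = fnmatch (pattern, string, FNM_EXTMATCH);
    if (ret == 0)
      return 1;
    return ret == FNM_NOMATCH ? 0 : -1;
  }
#else
  /* The flag is a GNU extension; without it there is nothing to pass.  */
  return -1;
#endif
}
```

Code that refers to FNM_EXTMATCH without such a guard fails to compile
against a header that lacks the macro, which is the kind of build failure
described above.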

>   - make check
> 
> Then the tests will be run against Cygwin's fnmatch() function.
> If all tests pass, I will add the following patch to gnulib.

After the above failure, I tried from scratch with your patch below,
and I still get

  $ grep REPLACE_FNMATCH ./config.status
  S["REPLACE_FNMATCH"]="1"

Even though

  $ grep fnmatch log1
  checking for fnmatch.h... yes
  checking for fnmatch... yes
  checking for working POSIX fnmatch... yes

I'm quite puzzled.


Corinna


> 
> diff --git a/m4/fnmatch.m4 b/m4/fnmatch.m4
> index 2e1442eff7..e99737a476 100644
> --- a/m4/fnmatch.m4
> +++ b/m4/fnmatch.m4
> @@ -1,4 +1,4 @@
> -# Check for fnmatch - serial 18  -*- coding: utf-8 -*-
> +# Check for fnmatch - serial 19  -*- coding: utf-8 -*-
>  
>  # Copyright (C) 2000-2007, 2009-2023 Free Software Foundation, Inc.
>  # This file is free software; the Free Software Foundation
> @@ -14,7 +14,7 @@ AC_DEFUN([gl_FUNC_FNMATCH_POSIX]
>    m4_divert_text([DEFAULTS], [gl_fnmatch_required=POSIX])
>  
>    AC_REQUIRE([gl_FNMATCH_H])
> -  AC_REQUIRE([AC_CANONICAL_HOST]) dnl for cross-compiles
> +  AC_REQUIRE([AC_CANONICAL_HOST])
>    gl_fnmatch_required_lowercase=`
>      echo $gl_fnmatch_required | LC_ALL=C tr '[[A-Z]]' '[[a-z]]'
>    `
> @@ -164,7 +164,17 @@ AC_DEFUN([gl_FUNC_FNMATCH_POSIX]
>      dnl This is due to wchar_t being only 16 bits wide.
>      AC_REQUIRE([gl_UCHAR_H])
>      if test $SMALL_WCHAR_T = 1; then
> -      REPLACE_FNMATCH=1
> +      case "$host_os" in
> +        cygwin*)
> +          dnl On Cygwin < 3.5.0, the above $gl_fnmatch_result came out as 'no',
> +          dnl On Cygwin >= 3.5.0, fnmatch supports all Unicode characters,
> +          dnl despite wchar_t being only 16 bits wide (because internally it
> +          dnl works on wint_t values).
> +          ;;
> +        *)
> +          REPLACE_FNMATCH=1
> +          ;;
> +      esac
>      fi
>    fi
>    if test $HAVE_FNMATCH = 0 || test $REPLACE_FNMATCH = 1; then


* Re: fnmatch improvements
  2023-07-27 22:17         ` Brian Inglis
@ 2023-07-28  9:00           ` Corinna Vinschen
  2023-07-28  9:53             ` Corinna Vinschen
  0 siblings, 1 reply; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-28  9:00 UTC (permalink / raw)
  To: cygwin

On Jul 27 16:17, Brian Inglis via Cygwin wrote:
> On 2023-07-27 15:22, Bruno Haible wrote:
> > Brian Inglis wrote:
> > > It was added in C99 TR19769, integrated in C/++11
> > 
> > Yes.
> > 
> > > available in libicu-devel:
> > > 
> > > 	https://cplusplus.com/reference/cuchar/
> > > 
> > > 	https://open-std.org/jtc1/sc22/open/n3579.pdf
> > > 
> > > 	https://open-std.org/jtc1/sc22/wg14/www/docs/n1326.pdf
> > > 
> > > 	https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf#page=416
> > > 
> > > 	$ find /usr/include/ -name uchar.h
> > > 	/usr/include/unicode/uchar.h
> > > 
> > > 	$ cygcheck -f /usr/include/unicode/uchar.h
> > > 	libicu-devel-72.1-1
> > 
> > This file, <unicode/uchar.h> from ICU4C, is something completely different than
> > ISO C's <uchar.h>.
> 
> This would then be a *newlib* AT sourceware DOT org addition so we could use
> FreeBSD's:

We can use FreeBSD's version as a role model, but we can't use the code
verbatim, given FreeBSD assumes sizeof(wchar_t) == 4.

Since that's a Cygwin-only issue (2 byte wchar_t, that is), I guess we
should merge the code into the Cygwin code base, rather than newlib.

For mbrtoc32/c32rtomb, we can use the wirtomb/mbrtowi functions I
introduced for the globbing code.  If we do that, I think the functions
should actually be renamed accordingly and the globbing code should use
uchar32_t rather than wint_t.

Also, it might be helpful to add the mbrtoc8/c8rtomb extensions at some
point, which are missing in FreeBSD.

Either way, I'd be grateful for patches in this area.
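
As a starting point for discussion, here is a locale-independent sketch of
what a c32rtomb-style conversion has to produce in a UTF-8 locale. The name
encode_utf8 is hypothetical, and the sketch ignores the mbstate_t handling and
codeset dispatch a real implementation needs. It uses uint_least32_t, which
C11 defines as the underlying type of char32_t, so that it does not depend on
a <uchar.h> being installed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the UTF-8 case of c32rtomb().
   Writes up to 4 bytes to OUT and returns the byte count, or 0 for
   surrogates and for values above U+10FFFF.  */
size_t
encode_utf8 (uint_least32_t c, unsigned char out[4])
{
  assert (out != NULL);
  if (c < 0x80)
    {
      out[0] = (unsigned char) c;
      return 1;
    }
  if (c < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (c >> 6));
      out[1] = (unsigned char) (0x80 | (c & 0x3F));
      return 2;
    }
  if (c >= 0xD800 && c <= 0xDFFF)
    return 0;                   /* a lone surrogate is not a character */
  if (c < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (c >> 12));
      out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (c & 0x3F));
      return 3;
    }
  if (c <= 0x10FFFF)
    {
      out[0] = (unsigned char) (0xF0 | (c >> 18));
      out[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (c & 0x3F));
      return 4;
    }
  return 0;
}
```

The point of the 32-bit interface is that one call handles one character. A
wchar_t-based function on a 16-bit wchar_t would need two calls, plus state
carried between them, for any character above U+FFFF.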

Corinna


* Re: fnmatch improvements
  2023-07-28  9:00           ` Corinna Vinschen
@ 2023-07-28  9:53             ` Corinna Vinschen
  0 siblings, 0 replies; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-28  9:53 UTC (permalink / raw)
  To: cygwin

On Jul 28 11:00, Corinna Vinschen via Cygwin wrote:
>   If we do that, I think the functions
> should actually be renamed accordingly and the globbing code should use
> uchar32_t rather than wint_t.

s/uchar32_t/char32_t/


* Re: fnmatch improvements
  2023-07-28  8:53       ` Corinna Vinschen
@ 2023-07-28 10:56         ` Bruno Haible
  2023-07-28 11:14           ` Corinna Vinschen
  2023-07-28 18:59           ` Corinna Vinschen
  2023-07-28 11:12         ` fnmatch improvements Corinna Vinschen
  1 sibling, 2 replies; 32+ messages in thread
From: Bruno Haible @ 2023-07-28 10:56 UTC (permalink / raw)
  To: cygwin

Corinna Vinschen wrote:
> After the above fail, I tried from scratch with your below patch,
> and I still get
> 
>   $ grep REPLACE_FNMATCH ./config.status
>   S["REPLACE_FNMATCH"]="1"
> 
> Even though
> 
>   $ grep fnmatch log1
>   checking for fnmatch.h... yes
>   checking for fnmatch... yes
>   checking for working POSIX fnmatch... yes
> 
> I'm quite puzzled.

It's sometimes hard to make incremental changes to generated files of the
GNU Build System plus Gnulib. I've therefore recreated a new tarball for you,
at https://haible.de/bruno/gnu/testdir-fnmatch.tar.gz .

The expected result is:
  1. cd testdir-fnmatch-posix
     ./configure
     grep REPLACE_FNMATCH config.status
     (Expected: REPLACE_FNMATCH is 0)
     make
     make check
     (Expected: No test failures)
     cd ..
  2. cd testdir-fnmatch-gnu
     ./configure
     grep REPLACE_FNMATCH config.status
     (Expected: REPLACE_FNMATCH is 1, because of FNM_EXTMATCH)
     make
     make check
     (Expected: No test failures)
     cd ..

Bruno





* Re: fnmatch improvements
  2023-07-28  8:53       ` Corinna Vinschen
  2023-07-28 10:56         ` Bruno Haible
@ 2023-07-28 11:12         ` Corinna Vinschen
  2023-07-28 11:22           ` Bruno Haible
  2023-07-28 21:42           ` Bill Stewart
  1 sibling, 2 replies; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-28 11:12 UTC (permalink / raw)
  To: Corinna Vinschen via Cygwin; +Cc: Bruno Haible

On Jul 28 10:53, Corinna Vinschen via Cygwin wrote:
> On Jul 27 23:40, Bruno Haible via Cygwin wrote:
> > Corinna Vinschen wrote:
> > >     S["REPLACE_FNMATCH"]="1"
> > >
> > > Looks like the reason is that we don't have a uchar.h file?  Seems 
> > > like this is of interest for AIX, but why should this be of
> > > interest for fnmatch on other systems?
> > 
> > Ah, that's because I assumed that if wchar_t is only 16 bits wide,
> > fnmatch() can't be correct. That is true for AIX (and on this platform,
> > I prefer not to test the available locales), but it is no longer true
> > with your implementation.
> > 
> > What are the test suite results if you do
> > 
> >   - Replace S["REPLACE_FNMATCH"]="1" with S["REPLACE_FNMATCH"]="0"
> >     in config.status,
> >   - make clean
> >   - ./config.status
> >   - make
> 
> The build fails here.  The reason is that the GNU extension FNM_EXTMATCH
> is not supported by the FreeBSD code base of fnmatch, so it's not
> defined in our fnmatch.h system header.  Gnulib still tries to build
> fnmatch_loop.c which uses FNM_EXTMATCH, but apparently it now relies on
> using the system header?
> [...]
> After the above fail, I tried from scratch with your below patch,
> and I still get
> 
>   $ grep REPLACE_FNMATCH ./config.status
>   S["REPLACE_FNMATCH"]="1"
> 
> Even though
> 
>   $ grep fnmatch log1
>   checking for fnmatch.h... yes
>   checking for fnmatch... yes
>   checking for working POSIX fnmatch... yes
> 
> I'm quite puzzled.

I'm puzzled because I'm an idiot.  I forgot autoreconf.  After that and
another configure run, REPLACE_FNMATCH is correctly set to 0 *and* the
build runs fine.

I'll do the rest of the test later today.

Sorry,
Corinna


* Re: fnmatch improvements
  2023-07-28 10:56         ` Bruno Haible
@ 2023-07-28 11:14           ` Corinna Vinschen
  2023-07-28 18:59           ` Corinna Vinschen
  1 sibling, 0 replies; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-28 11:14 UTC (permalink / raw)
  To: Bruno Haible; +Cc: cygwin

On Jul 28 12:56, Bruno Haible via Cygwin wrote:
> Corinna Vinschen wrote:
> > After the above fail, I tried from scratch with your below patch,
> > and I still get
> > 
> >   $ grep REPLACE_FNMATCH ./config.status
> >   S["REPLACE_FNMATCH"]="1"
> > 
> > Even though
> > 
> >   $ grep fnmatch log1
> >   checking for fnmatch.h... yes
> >   checking for fnmatch... yes
> >   checking for working POSIX fnmatch... yes
> > 
> > I'm quite puzzled.
> 
> It's sometimes hard to make incremental changes to generated files of the
> GNU Build System plus Gnulib. I've therefore recreated a new tarball for you,
> at https://haible.de/bruno/gnu/testdir-fnmatch.tar.gz .

Thanks, I'll use that for testing later today.


Corinna


* Re: fnmatch improvements
  2023-07-28 11:12         ` fnmatch improvements Corinna Vinschen
@ 2023-07-28 11:22           ` Bruno Haible
  2023-07-28 21:42           ` Bill Stewart
  1 sibling, 0 replies; 32+ messages in thread
From: Bruno Haible @ 2023-07-28 11:22 UTC (permalink / raw)
  To: Corinna Vinschen, Bruno Haible

Corinna Vinschen wrote:
> I'm puzzled because I'm an idiot.  I forgot autoreconf.

Things like that happen to me as well. There are so many
generation phases (collect *.m4 files; autoconf; configure;
make) that it's easy to forget one when making incremental changes.
It's more reliable to regenerate the testdir from scratch.

Bruno


* Re: fnmatch improvements
  2023-07-28 10:56         ` Bruno Haible
  2023-07-28 11:14           ` Corinna Vinschen
@ 2023-07-28 18:59           ` Corinna Vinschen
  2023-07-28 19:33             ` Bruno Haible
  2023-07-28 19:54             ` GB18030 locale Bruno Haible
  1 sibling, 2 replies; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-28 18:59 UTC (permalink / raw)
  To: Bruno Haible; +Cc: cygwin

On Jul 28 12:56, Bruno Haible via Cygwin wrote:
> It's sometimes hard to make incremental changes to generated files of the
> GNU Build System plus Gnulib. I've therefore recreated a new tarball for you,
> at https://haible.de/bruno/gnu/testdir-fnmatch.tar.gz .
> 
> The expected result is:
>   1. cd testdir-fnmatch-posix
>      ./configure
>      grep REPLACE_FNMATCH config.status
>      (Expected: REPLACE_FNMATCH is 0)

  $ grep REPLACE_FNMATCH config.status
  S["REPLACE_FNMATCH"]="0"

>      make
>      make check
>      (Expected: No test failures)

  # TOTAL: 218
  # PASS:  178
  # SKIP:  40
  # XFAIL: 0
  # FAIL:  0
  # XPASS: 0
  # ERROR: 0

  test-fnmatch-5.sh is SKIPped because we don't support zh_CN.GB18030.

>      cd ..
>   2. cd testdir-fnmatch-gnu
>      ./configure
>      grep REPLACE_FNMATCH config.status
>      (Expected: REPLACE_FNMATCH is 1, because of FNM_EXTMATCH)

  $ grep REPLACE_FNMATCH config.status
  S["REPLACE_FNMATCH"]="1"

>      make
>      make check
>      (Expected: No test failures)

  # TOTAL: 218
  # PASS:  178
  # SKIP:  40
  # XFAIL: 0
  # FAIL:  0
  # XPASS: 0
  # ERROR: 0

  Same SKIP of test-fnmatch-5.sh.

Does that look ok?


Thanks,
Corinna


* Re: fnmatch improvements
  2023-07-28 18:59           ` Corinna Vinschen
@ 2023-07-28 19:33             ` Bruno Haible
  2023-07-28 19:54             ` GB18030 locale Bruno Haible
  1 sibling, 0 replies; 32+ messages in thread
From: Bruno Haible @ 2023-07-28 19:33 UTC (permalink / raw)
  To: cygwin

Corinna Vinschen wrote:
> >   1. cd testdir-fnmatch-posix
> >      ./configure
> >      grep REPLACE_FNMATCH config.status
> >      (Expected: REPLACE_FNMATCH is 0)
> 
>   $ grep REPLACE_FNMATCH config.status
>   S["REPLACE_FNMATCH"]="0"
> 
> >      make
> >      make check
> >      (Expected: No test failures)
> 
>   # TOTAL: 218
>   # PASS:  178
>   # SKIP:  40
>   # XFAIL: 0
>   # FAIL:  0
>   # XPASS: 0
>   # ERROR: 0
> 
>   test-fnmatch-5.sh is SKIPped because we don't support zh_CN.GB18030.
> 
> >      cd ..
> >   2. cd testdir-fnmatch-gnu
> >      ./configure
> >      grep REPLACE_FNMATCH config.status
> >      (Expected: REPLACE_FNMATCH is 1, because of FNM_EXTMATCH)
> 
>   $ grep REPLACE_FNMATCH config.status
>   S["REPLACE_FNMATCH"]="1"
> 
> >      make
> >      make check
> >      (Expected: No test failures)
> 
>   # TOTAL: 218
>   # PASS:  178
>   # SKIP:  40
>   # XFAIL: 0
>   # FAIL:  0
>   # XPASS: 0
>   # ERROR: 0
> 
>   Same SKIP of test-fnmatch-5.sh.
> 
> Does that look ok?

Yes, that's all OK and as expected. I'll commit the fnmatch.m4 patch today.

When the user asks for an fnmatch() with FNM_EXTMATCH support, they will get
the Gnulib fnmatch(), as it supports these GNU extensions. I'll think about
how to make [=X=] and [.X.] work in this case too...

Thank you for your constructive cooperation!

Bruno





* Re: GB18030 locale
  2023-07-28 18:59           ` Corinna Vinschen
  2023-07-28 19:33             ` Bruno Haible
@ 2023-07-28 19:54             ` Bruno Haible
  2023-07-29  9:23               ` Corinna Vinschen
  1 sibling, 1 reply; 32+ messages in thread
From: Bruno Haible @ 2023-07-28 19:54 UTC (permalink / raw)
  To: cygwin

Corinna Vinschen wrote:
>   test-fnmatch-5.sh is SKIPped because we don't support zh_CN.GB18030.

Hmm? When I read winsup/cygwin/release/3.5.0 and the commit
5da71b6059956a8f20a6be02e82867aa28aa3880, it seems the zh_CN.GB18030
locale (which on native Windows is called "Chinese_China.54936")
should be supported.

The Gnulib code which determines whether this locale is supported
is in m4/locale-zh.m4. Why does the
"checking for a transitional chinese locale..." test fail on
your system, when you call it as
  LC_ALL=zh_CN.GB18030 LC_TIME= LC_CTYPE= ./conftest
?
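
For anyone who wants to check the first step by hand, here is a minimal
sketch that asks setlocale() whether it accepts a given locale name. The
function name locale_is_available is made up. It passes the name explicitly,
whereas the conftest invocation above takes it from the environment, and
gnulib's actual check may do more than this.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if setlocale() accepts NAME for LC_ALL, else 0.
   The previous locale is restored before returning.  */
int
locale_is_available (const char *name)
{
  const char *current;
  char *saved = NULL;
  int ok;

  assert (name != NULL);
  /* Copy the current setting, since later setlocale() calls may
     overwrite the string it points to.  */
  current = setlocale (LC_ALL, NULL);
  if (current != NULL)
    {
      size_t len = strlen (current) + 1;
      saved = malloc (len);
      if (saved != NULL)
        memcpy (saved, current, len);
    }
  ok = setlocale (LC_ALL, name) != NULL;
  if (saved != NULL)
    {
      setlocale (LC_ALL, saved);
      free (saved);
    }
  return ok;
}
```

A call such as locale_is_available ("zh_CN.GB18030") shows whether the
locale name is accepted at all. It says nothing about whether the multibyte
functions then behave correctly in that locale.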

Bruno


* Re: fnmatch improvements
  2023-07-28 11:12         ` fnmatch improvements Corinna Vinschen
  2023-07-28 11:22           ` Bruno Haible
@ 2023-07-28 21:42           ` Bill Stewart
  1 sibling, 0 replies; 32+ messages in thread
From: Bill Stewart @ 2023-07-28 21:42 UTC (permalink / raw)
  To: cygwin


On Fri, Jul 28, 2023 at 5:12 AM Corinna Vinschen wrote:

> I'm puzzled because I'm an idiot.

That's one thing you certainly are not.

Bill


* Re: GB18030 locale
  2023-07-28 19:54             ` GB18030 locale Bruno Haible
@ 2023-07-29  9:23               ` Corinna Vinschen
  2023-07-29  9:53                 ` Bruno Haible
  0 siblings, 1 reply; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-29  9:23 UTC (permalink / raw)
  To: Bruno Haible; +Cc: cygwin

On Jul 28 21:54, Bruno Haible via Cygwin wrote:
> Corinna Vinschen wrote:
> >   test-fnmatch-5.sh is SKIPped because we don't support zh_CN.GB18030.
> 
> Hmm? When I read winsup/cygwin/release/3.5.0 and the commit
> 5da71b6059956a8f20a6be02e82867aa28aa3880, it seems the zh_CN.GB18030
> locale (which on native Windows is called "Chinese_China.54936")
> should be supported.

You're right, I always had the idea to add GB18030 support and forgot
that I supposedly did that in 5da71b605995 ("Cygwin: add support for
GB18030 codeset"), sorry.

However, on debugging this, I see it's totally broken.  Trying to fix
this in the existing functions is futile.  We need dedicated
support functions for GB18030, kind of like the FreeBSD functions,
just with extra support for surrogate pairs, as with our UTF8 stuff.

Corinna

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: GB18030 locale
  2023-07-29  9:23               ` Corinna Vinschen
@ 2023-07-29  9:53                 ` Bruno Haible
  2023-07-31 10:07                   ` Corinna Vinschen
  0 siblings, 1 reply; 32+ messages in thread
From: Bruno Haible @ 2023-07-29  9:53 UTC (permalink / raw)
  To: cygwin

Corinna Vinschen wrote:
> However, on debugging this, I see it's totally broken.  Trying to fix
> this in the existing functions is futile.  We need dedicated
> support functions for GB18030, kind of like the FreeBSD functions,
> just with extra support for surrogate pairs, as with our UTF8 stuff.

In case it helps: Find here a test suite for the various multibyte
functions with GB18030 specific test cases. (Extracted from gnulib.)
https://haible.de/bruno/gnu/testdir-gb18030.tar.gz

Bruno




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: GB18030 locale
  2023-07-29  9:53                 ` Bruno Haible
@ 2023-07-31 10:07                   ` Corinna Vinschen
  2023-07-31 13:38                     ` Corinna Vinschen
  0 siblings, 1 reply; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-31 10:07 UTC (permalink / raw)
  To: Bruno Haible; +Cc: cygwin

On Jul 29 11:53, Bruno Haible via Cygwin wrote:
> Corinna Vinschen wrote:
> > However, on debugging this, I see it's totally broken.  Trying to fix
> > this in the existing functions is futile.  We need dedicated
> > support functions for GB18030, kind of like the FreeBSD functions,
> > just with extra support for surrogate pairs, as with our UTF8 stuff.
> 
> In case it helps: Find here a test suite for the various multibyte
> functions with GB18030 specific test cases. (Extracted from gnulib.)
> https://haible.de/bruno/gnu/testdir-gb18030.tar.gz

Thank you, I'm already hacking and testing :)


Corinna

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: GB18030 locale
  2023-07-31 10:07                   ` Corinna Vinschen
@ 2023-07-31 13:38                     ` Corinna Vinschen
  2023-07-31 14:06                       ` character class "alpha" Bruno Haible
  0 siblings, 1 reply; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-31 13:38 UTC (permalink / raw)
  To: Bruno Haible; +Cc: cygwin

Hi Bruno,

On Jul 31 12:07, Corinna Vinschen via Cygwin wrote:
> On Jul 29 11:53, Bruno Haible via Cygwin wrote:
> > Corinna Vinschen wrote:
> > > However, on debugging this, I see it's totally broken.  Trying to fix
> > > this in the existing functions is futile.  We need dedicated
> > > support functions for GB18030, kind of like the FreeBSD functions,
> > > just with extra support for surrogate pairs, as with our UTF8 stuff.
> > 
> > In case it helps: Find here a test suite for the various multibyte
> > functions with GB18030 specific test cases. (Extracted from gnulib.)
> > https://haible.de/bruno/gnu/testdir-gb18030.tar.gz
> 
> Thank you, I'm already hacking and testing :)

I have a problem with the c32isalpha function.

c32isalpha fails for the character U+FF11 FULLWIDTH DIGIT ONE,
because it expects the character to be an alphabetic character.

The Cygwin unicode information is automatically generated from the
Unicode data file UnicodeData.txt, fresh from their homepage.  iswalpha
in newlib is checking for the Unicode categories, using the expression:

    return cat == CAT_LC || cat == CAT_Lu || cat == CAT_Ll || cat == CAT_Lt
          || cat == CAT_Lm || cat == CAT_Lo
	  || cat == CAT_Nl // Letter_Number
	  ;

with CAT_foo being equivalent to Unicode category foo.

Per UnicodeData.txt, ff11 is of category Nd, so it's a digit, not an
alphabetic character.

I see that Glibc returns 1 from c32isalpha for U+FF11, but I don't see
where it takes that info and why this is correct.  Can you point me to
some info on this?

Thanks,
Corinna

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: character class "alpha"
  2023-07-31 13:38                     ` Corinna Vinschen
@ 2023-07-31 14:06                       ` Bruno Haible
  2023-07-31 17:46                         ` Corinna Vinschen
  0 siblings, 1 reply; 32+ messages in thread
From: Bruno Haible @ 2023-07-31 14:06 UTC (permalink / raw)
  To: cygwin

Corinna Vinschen wrote:
> I have a problem with the c32isalpha function.
> 
> c32isalpha fails for the character U+FF11 FULLWIDTH DIGIT ONE,
> because it expects the character to be an alphabetic character.

This is not a big problem. You can see in the test-c32isalpha.c file
that this test is disabled for many platforms, in particular glibc.
There's no problem with disabling it on Cygwin as well.

> The Cygwin unicode information is automatically generated from the
> Unicode data file UnicodeData.txt, fresh from their homepage.  iswalpha
> in newlib is checking for the Unicode categories, using the expression:
> 
>     return cat == CAT_LC || cat == CAT_Lu || cat == CAT_Ll || cat == CAT_Lt
>           || cat == CAT_Lm || cat == CAT_Lo
> 	  || cat == CAT_Nl // Letter_Number
> 	  ;
> 
> with CAT_foo being equivalent to Unicode category foo.
> 
> Per UnicodeData.txt, ff11 is of category Nd, so it's a digit, not an
> alphabetic character.

This is not wrong. However, see the comments in the generator of the
gnulib tables:

https://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/gen-uni-tables.c;h=0dceedc06cd72f886807fd575a2c4dba99cd147a;hb=HEAD#l5789

   /* Consider all the non-ASCII digits as alphabetic.
      ISO C 99 forbids us to have them in category "digit",
      but we want iswalnum to return true on them.  */

Likewise in the generator of the glibc tables:

https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/unicode-gen/unicode_utils.py;h=5af03113a2f1f063769752ea426fcaf6f6ba9e95;hb=HEAD#l274

The original comment (from 2000) was:

  /* SUSV2 gives us some freedom for the "digit" category, but ISO C 99
     takes it away:
     7.25.2.1.5:
        The iswdigit function tests for any wide character that corresponds
        to a decimal-digit character (as defined in 5.2.1).
     5.2.1:
        the 10 decimal digits 0 1 2 3 4 5 6 7 8 9
   */
  return (ch >= 0x0030 && ch <= 0x0039);

The question is: In which category do you put these non-ASCII digits?
"print" and "graph", sure. But other than that? "punct" or "alnum"?
"punct" seems wrong. If you, like me, decide to put them in "alnum",
then they need to be in "alpha" or "digit" (per POSIX
https://pubs.opengroup.org/onlinepubs/9699919799/functions/iswalnum.html ).
But ISO C 23 § 7.4.1.5 + § 5.2.1 does not allow them in category "digit".

Bruno
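
The constraint described above can be sketched in a few lines: "digit" is restricted to ASCII 0-9, "alnum" must equal the union of "alpha" and "digit", so non-ASCII decimal digits have to go into "alpha" if they are to count as "alnum". The enum and function names below are hypothetical and cover only a handful of Unicode general categories; this is not the gnulib, glibc, or newlib code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of Unicode general categories, for illustration. */
enum sketch_cat { SK_Lu, SK_Ll, SK_Lt, SK_Lm, SK_Lo, SK_Nl, SK_Nd, SK_Po };

/* ISO C: "digit" is exactly the ten ASCII decimal digits. */
bool sketch_isdigit(unsigned long c)
{
    return c >= 0x30 && c <= 0x39;
}

/* Letters and letter numbers are alphabetic; so are all decimal digits
   (category Nd) that are not allowed in the "digit" class.  */
bool sketch_isalpha(enum sketch_cat cat, unsigned long c)
{
    switch (cat) {
    case SK_Lu: case SK_Ll: case SK_Lt: case SK_Lm: case SK_Lo: case SK_Nl:
        return true;
    case SK_Nd:
        return !sketch_isdigit(c);
    default:
        return false;
    }
}

/* POSIX: "alnum" is the union of "alpha" and "digit". */
bool sketch_isalnum(enum sketch_cat cat, unsigned long c)
{
    return sketch_isalpha(cat, c) || sketch_isdigit(c);
}
```

With this mapping, U+FF11 (category Nd) is "alpha" and "alnum" but not "digit", while ASCII '1' is "digit" and "alnum" but not "alpha".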




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: character class "alpha"
  2023-07-31 14:06                       ` character class "alpha" Bruno Haible
@ 2023-07-31 17:46                         ` Corinna Vinschen
  2023-07-31 18:20                           ` Corinna Vinschen
  0 siblings, 1 reply; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-31 17:46 UTC (permalink / raw)
  To: Bruno Haible; +Cc: cygwin

On Jul 31 16:06, Bruno Haible via Cygwin wrote:
> Corinna Vinschen wrote:
> > I have a problem with the c32isalpha function.
> > 
> > c32isalpha fails for the character U+FF11 FULLWIDTH DIGIT ONE,
> > because it expects the character to be an alphabetic character.
> 
> This is not a big problem. You can see in the test-c32isalpha.c file
> that this test is disabled for many platforms, in particular glibc.

Which is interesting, because I actually tried that today on glibc, and
for iswalpha (0xff11) it returns 1.  So it actually behaves as the
testcase expects.

> There's no problem with disabling it on Cygwin as well.

I'd rather make Cygwin do the same as glibc.

> > The Cygwin unicode information is automatically generated from the
> > Unicode data file UnicodeData.txt, fresh from their homepage.  iswalpha
> > in newlib is checking for the Unicode categories, using the expression:
> > 
> >     return cat == CAT_LC || cat == CAT_Lu || cat == CAT_Ll || cat == CAT_Lt
> >           || cat == CAT_Lm || cat == CAT_Lo
> > 	  || cat == CAT_Nl // Letter_Number
> > 	  ;
> > 
> > with CAT_foo being equivalent to Unicode category foo.
> > 
> > Per UnicodeData.txt, ff11 is of category Nd, so it's a digit, not an
> > alphabetic character.
> 
> This is not wrong. However, see the comments in the generator of the
> gnulib tables:
> 
> https://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/gen-uni-tables.c;h=0dceedc06cd72f886807fd575a2c4dba99cd147a;hb=HEAD#l5789
> 
>    /* Consider all the non-ASCII digits as alphabetic.
>       ISO C 99 forbids us to have them in category "digit",
>       but we want iswalnum to return true on them.  */
> 
> Likewise in the generator of the glibc tables:
> 
> https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/unicode-gen/unicode_utils.py;h=5af03113a2f1f063769752ea426fcaf6f6ba9e95;hb=HEAD#l274
> 
> The original comment (from 2000) was:
> 
>   /* SUSV2 gives us some freedom for the "digit" category, but ISO C 99
>      takes it away:
>      7.25.2.1.5:
>         The iswdigit function tests for any wide character that corresponds
>         to a decimal-digit character (as defined in 5.2.1).
>      5.2.1:
>         the 10 decimal digits 0 1 2 3 4 5 6 7 8 9
>    */
>   return (ch >= 0x0030 && ch <= 0x0039);
> 
> The question is: In which category do you put these non-ASCII digits?
> "print" and "graph", sure. But other than that? "punct" or "alnum"?
> "punct" seems wrong. If you, like me, decide to put them in "alnum",
> then they need to be in "alpha" or "digit" (per POSIX
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/iswalnum.html ).
> But ISO C 23 § 7.4.1.5 + § 5.2.1 does not allow them in category "digit".

Thanks for the description.  It was clear to me that they don't belong
into the ISO C digit category, but other than that...

So, if we change the expression in iswalpha_l to something like

  return cat == CAT_LC || cat == CAT_Lu || cat == CAT_Ll || cat == CAT_Lt
      || cat == CAT_Lm || cat == CAT_Lo
      || cat == CAT_Nl // Letter_Number
      /* Also all digits not allowed to be called digits per ISO C 99 */
      || (cat == CAT_Nd && !(c >= (wint_t)'0' && c <= (wint_t)'9'))
      ;

we're good?


Thanks,
Corinna

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: character class "alpha"
  2023-07-31 17:46                         ` Corinna Vinschen
@ 2023-07-31 18:20                           ` Corinna Vinschen
  2023-07-31 18:43                             ` Bruno Haible
  0 siblings, 1 reply; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-31 18:20 UTC (permalink / raw)
  To: Bruno Haible; +Cc: cygwin

On Jul 31 19:46, Corinna Vinschen via Cygwin wrote:
> On Jul 31 16:06, Bruno Haible via Cygwin wrote:
> > Corinna Vinschen wrote:
> > > I have a problem with the c32isalpha function.
> > > 
> > > c32isalpha fails for the character U+FF11 FULLWIDTH DIGIT ONE,
> > > because it expects the character to be an alphabetic character.
> > 
> > This is not a big problem. You can see in the test-c32isalpha.c file
> > that this test is disabled for many platforms, in particular glibc.
> 
> Which is interesting, because I actually tried that today on glibc, and
> for iswalpha (0xff11) it returns 1.  So it actually behaves as the
> testcase expects.
> 
> > There's no problem with disabling it on Cygwin as well.
> 
> I'd rather make Cygwin do the same as glibc.

Hmm, there are more of those expressions which are disabled on glibc and
fail on Cygwin, for instance in test-c32iscntrl.c.  Maybe it's actually
the better idea to disable them on Cygwin, too, rather than to change
a working system...


Corinna

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: character class "alpha"
  2023-07-31 18:20                           ` Corinna Vinschen
@ 2023-07-31 18:43                             ` Bruno Haible
  2023-07-31 21:12                               ` Corinna Vinschen
  2023-07-31 21:13                               ` Brian Inglis
  0 siblings, 2 replies; 32+ messages in thread
From: Bruno Haible @ 2023-07-31 18:43 UTC (permalink / raw)
  To: cygwin

Corinna Vinschen wrote:
> there are more of those expressions which are disabled on glibc and
> fail on Cygwin, for instance in test-c32iscntrl.c.  Maybe it's actually
> the better idea to disable them on Cygwin, too, rather than to change
> a working system...

Sure. There is no standard for how to map the Unicode properties to POSIX
character classes. Other than the mentioned ISO C constraints for
'digit' and 'xdigit' and a few POSIX constraints, you are free to
map them as you like. For glibc and gnulib, I mapped them in a way
that seemed to make most sense for applications. But different
people might come to different meanings of "make sense".

Bruno

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: character class "alpha"
  2023-07-31 18:43                             ` Bruno Haible
@ 2023-07-31 21:12                               ` Corinna Vinschen
  2023-08-01 16:29                                 ` Brian Inglis
  2023-07-31 21:13                               ` Brian Inglis
  1 sibling, 1 reply; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-31 21:12 UTC (permalink / raw)
  To: Bruno Haible; +Cc: cygwin

Hi Bruno,

On Jul 31 20:43, Bruno Haible via Cygwin wrote:
> Corinna Vinschen wrote:
> > there are more of those expressions which are disabled on glibc and
> > fail on Cygwin, for instance in test-c32iscntrl.c.  Maybe it's actually
> > the better idea to disable them on Cygwin, too, rather than to change
> > a working system...
> 
> Sure. There is no standard how to map the Unicode properties to POSIX
> character classes. Other than the mentioned ISO C constraints for
> 'digit' and 'xdigit' and a few POSIX constraints, you are free to
> map them as you like. For glibc and gnulib, I mapped them in a way
> that seemed to make most sense for applications. But different
> people might come to different meanings of "make sense".

Ok, so I just pushed a patchset to Cygwin git, which should make GB18030
support actually work.

Also, the C11 functions c16rtomb, c32rtomb, mbrtoc16, mbrtoc32 are now
implemented in Cygwin and a uchar.h header exists now, too.

Assuming all gnulib tests disabled for GLibc in

  test-c32isalpha.c
  test-c32iscntrl.c
  test-c32isprint.c
  test-c32isgraph.c
  test-c32ispunct.c
  test-c32islower.c

will be disabled for Cygwin as well, all gb18030 and c32 tests in gnulib
work as desired now.


Thanks for your input and help!
Corinna
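
As an illustration of the C11 <uchar.h> interface mentioned above, here is a minimal sketch that walks a multibyte string with mbrtoc32() and counts the resulting char32_t values. The function name count_code_points is hypothetical. The result depends on the current LC_CTYPE locale, so a caller would normally call setlocale() first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: counts the char32_t values that mbrtoc32() produces
   for the multibyte string S in the current locale.  Returns (size_t)-1 if
   the input is invalid or ends in the middle of a character.  */
size_t count_code_points(const char *s)
{
    assert(s != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);    /* initial conversion state */

    size_t remaining = strlen(s);
    size_t count = 0;

    while (remaining > 0) {
        char32_t c32;
        size_t n = mbrtoc32(&c32, s, remaining, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;          /* encoding error or incomplete */
        if (n == (size_t)-3) {          /* value stored, no input consumed */
            count++;
            continue;
        }
        if (n == 0)                     /* null character reached */
            break;

        s += n;
        remaining -= n;
        count++;
    }
    return count;
}
```

In a UTF-8 locale, the string "x\360\237\230\213y" from the start of this thread would be expected to count as three code points, since mbrtoc32() returns the full non-BMP character regardless of the width of wchar_t.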

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: character class "alpha"
  2023-07-31 18:43                             ` Bruno Haible
  2023-07-31 21:12                               ` Corinna Vinschen
@ 2023-07-31 21:13                               ` Brian Inglis
  2023-07-31 21:37                                 ` Bruno Haible
  1 sibling, 1 reply; 32+ messages in thread
From: Brian Inglis @ 2023-07-31 21:13 UTC (permalink / raw)
  To: cygwin; +Cc: Bruno Haible

On 2023-07-31 12:43, Bruno Haible via Cygwin wrote:
> Corinna Vinschen wrote:
>> there are more of those expressions which are disabled on glibc and
>> fail on Cygwin, for instance in test-c32iscntrl.c.  Maybe it's actually
>> the better idea to disable them on Cygwin, too, rather than to change
>> a working system...
> 
> Sure. There is no standard how to map the Unicode properties to POSIX
> character classes. Other than the mentioned ISO C constraints for
> 'digit' and 'xdigit' and a few POSIX constraints, you are free to
> map them as you like. For glibc and gnulib, I mapped them in a way
> that seemed to make most sense for applications. But different
> people might come to different meanings of "make sense".

It seems to me that most application developers needing to support 
non-Western-European languages might want a non-POSIX interpretation of digits.

Are the Unicode character attribute classes supported for those application use 
cases that need more than POSIX limitations allow?

I know that I sometimes want to see some alternative numeric digit forms and 
expect to be able to find those with an appropriate grep expression.

-- 
Take care. Thanks, Brian Inglis              Calgary, Alberta, Canada

La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer     but when there is no more to cut
                                 -- Antoine de Saint-Exupéry

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: character class "alpha"
  2023-07-31 21:13                               ` Brian Inglis
@ 2023-07-31 21:37                                 ` Bruno Haible
  0 siblings, 0 replies; 32+ messages in thread
From: Bruno Haible @ 2023-07-31 21:37 UTC (permalink / raw)
  To: cygwin, Brian Inglis

Brian Inglis wrote:
> It seems to me that most application developers needing to support 
> non-Western-European languages might want a non-POSIX interpretation of digits.

Sure. GNU libunistring has dedicated API for this:
  - https://www.gnu.org/software/libunistring/manual/html_node/Object-oriented-API.html
    UC_DECIMAL_DIGIT_NUMBER.
  - https://www.gnu.org/software/libunistring/manual/html_node/Decimal-digit-value.html
  - https://www.gnu.org/software/libunistring/manual/html_node/Digit-value.html
  - https://www.gnu.org/software/libunistring/manual/html_node/Properties-as-objects.html
    UC_PROPERTY_DECIMAL_DIGIT
  - https://www.gnu.org/software/libunistring/manual/html_node/Properties-as-functions.html
    uc_is_property_decimal_digit

I'm sure ICU4C has similar APIs too.

> Are the Unicode character attribute classes supported for those application use 
> cases that need more than POSIX limitations allow?

POSIX allows the libc to define additional character classes. But these will be
platform and locale dependent, and I don't know of any application which makes
use of such additional character classes via wctype() and iswctype().

> I know that I sometimes want to see some alternative numeric digit forms and 
> expect to be able to find those with an appropriate grep expression.

I think you can do so with GNU 'grep', when it was built with PCRE support.
PCRE includes support for Unicode character classes.
<https://www.pcre.org/current/doc/html/pcre2pattern.html>

Bruno
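
The wctype()/iswctype() route mentioned above looks roughly like this. It is a minimal sketch using only the standard <wctype.h> calls; the helper name in_named_class is hypothetical, and which class names exist beyond the standard ones depends on the platform and locale.

```c
#include <assert.h>
#include <stddef.h>
#include <wctype.h>

/* Hypothetical helper: returns 1 if WC belongs to the character class
   named CLASS_NAME in the current locale, 0 if it does not or if the
   class name is unknown.  */
int in_named_class(wint_t wc, const char *class_name)
{
    assert(class_name != NULL);

    wctype_t desc = wctype(class_name);
    if (desc == (wctype_t)0)
        return 0;                       /* class not defined in this locale */
    return iswctype(wc, desc) != 0;
}
```

The standard names such as "alpha" and "digit" are always available. A locale-specific class would be queried the same way, but portable code cannot rely on it being present.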




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: character class "alpha"
  2023-07-31 21:12                               ` Corinna Vinschen
@ 2023-08-01 16:29                                 ` Brian Inglis
  2023-08-02  7:56                                   ` Corinna Vinschen
  0 siblings, 1 reply; 32+ messages in thread
From: Brian Inglis @ 2023-08-01 16:29 UTC (permalink / raw)
  To: cygwin; +Cc: Bruno Haible

On 2023-07-31 15:12, Corinna Vinschen via Cygwin wrote:
> Hi Bruno,
> 
> On Jul 31 20:43, Bruno Haible via Cygwin wrote:
>> Corinna Vinschen wrote:
>>> there are more of those expressions which are disabled on glibc and
>>> fail on Cygwin, for instance in test-c32iscntrl.c.  Maybe it's actually
>>> the better idea to disable them on Cygwin, too, rather than to change
>>> a working system...
>>
>> Sure. There is no standard how to map the Unicode properties to POSIX
>> character classes. Other than the mentioned ISO C constraints for
>> 'digit' and 'xdigit' and a few POSIX constraints, you are free to
>> map them as you like. For glibc and gnulib, I mapped them in a way
>> that seemed to make most sense for applications. But different
>> people might come to different meanings of "make sense".
> 
> Ok, so I just pushed a patchset to Cygwin git, which should make GB18030
> support actually work.
> 
> Also, the C11 functions c16rtomb, c32rtomb, mbrtoc16, mbrtoc32 are now
> implemented in Cygwin and a uchar.h header exists now, too.
> 
> Assuming all gnulib tests disabled for GLibc in
> 
>    test-c32isalpha.c
>    test-c32iscntrl.c
>    test-c32isprint.c
>    test-c32isgraph.c
>    test-c32ispunct.c
>    test-c32islower.c
> 
> will be disabled for Cygwin as well, all gb18030 and c32 tests in gnulib
> work as desired now.

	https://www.iso.org/standard/86539.html		[ISO/IEC/IEEE 9945 CD]

Draft POSIX 2023 SUS V5 Issue 8 D3 CB2.1 proposes the following POSIX 
Subprofiling Option Group: POSIX_C_LANG_UCHAR: ISO C Unicode Utilities.

	https://www.iso.org/standard/82075.html		[ISO/IEC 9899 DIS]

Draft Standard C 2023 is being voted on as of 2023-07-14, and if no technical 
issues arise requiring tweaks, will become the new standard, in which Unicode 
utilities <uchar.h> has some additions which you may wish to add; from:

	https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3096.pdf#page=426

also:

	https://en.cppreference.com/w/c/string/multibyte

	https://en.cppreference.com/w/c/language/arithmetic_types

major additions (note November official standard publication date):

"7.30 Unicode utilities <uchar.h>
1 The header <uchar.h> declares one macro, a few types, and several functions 
for manipulating Unicode characters.
2 The macro
			__STDC_VERSION_UCHAR_H__
is an integer constant expression with a value equivalent to 202311L.

3 The types declared are mbstate_t (described in 7.31.1) and size_t (described 
in 7.21);
			char8_t
which is an unsigned integer type used for 8-bit characters and is the same type 
as unsigned char;
...

7.30.1 Restartable multibyte/wide character conversion functions
...
2 When used in the functions in this subclause, the encoding of char8_t,
char16_t, and char32_t objects, and sequences of such objects, is UTF-8, UTF-16, 
and UTF-32, respectively. Similarly, the encoding of char and wchar_t, and 
sequences of such objects, is the execution and wide execution encodings 
(6.2.9), respectively.

7.30.1.1 The mbrtoc8 function
Synopsis
1	#include <uchar.h>
	size_t mbrtoc8(char8_t * restrict pc8, const char * restrict s,
			size_t n, mbstate_t * restrict ps);
Description
2 If s is a null pointer, the mbrtoc8 function is equivalent to the call:
			mbrtoc8(NULL, "", 1, ps)
In this case, the values of the parameters pc8 and n are ignored.
3 If s is not a null pointer, the mbrtoc8 function inspects at most n 
bytes beginning with the byte pointed to by s to determine the number of bytes 
needed to complete the next multibyte character (including any shift sequences). 
If the function determines that the next multibyte character is complete and 
valid, it determines the values of the corresponding characters and then, if pc8 
is not a null pointer, stores the value of the first (or only) such character in 
the object pointed to by pc8.
Subsequent calls will store successive characters without consuming any 
additional input until all the characters have been stored. If the corresponding 
character is the null character, the resulting state described is the initial 
conversion state.
Returns
4 The mbrtoc8 function returns the first of the following that applies (given 
the current conversion state):
0	if the next n or fewer bytes complete the multibyte character that corresponds 
to the null character (which is the value stored).
between 1 and n inclusive	if the next n or fewer bytes complete a valid 
multibyte character (which is the value stored); the value returned is the 
number of bytes that complete the multibyte character.
(size_t)(-3)	if the next character resulting from a previous call has been 
stored (no bytes from the input have been consumed by this call).
(size_t)(-2)	if the next n bytes contribute to an incomplete (but potentially 
valid) multibyte character, and all n bytes have been processed (no value is 
stored). 398)
(size_t)(-1)	if an encoding error occurs, in which case the next n or fewer 
bytes do not contribute to a complete and valid multibyte character (no value is 
stored); the value of the macro EILSEQ is stored in errno, and the conversion 
state is unspecified.

398) When n has at least the value of the MB_CUR_MAX macro, this case can only
occur if s points at a sequence of redundant shift sequences (for
implementations with state-dependent encodings).

7.30.1.2 The c8rtomb function
Synopsis
1	#include <uchar.h>
	size_t c8rtomb(char * restrict s, char8_t c8, mbstate_t * restrict ps);
Description
2 If s is a null pointer, the c8rtomb function is equivalent to the call
			c8rtomb(buf, u8'\0', ps)
where buf is an internal buffer.
3 If s is not a null pointer, the c8rtomb function determines the number of 
bytes needed to represent the multibyte character that corresponds to the 
character given or completed by c8 (including any shift sequences), and stores 
the multibyte character representation in the array whose first element is 
pointed to by s, or stores nothing if c8 does not represent a complete 
character. At most MB_CUR_MAX bytes are stored. If c8 is a null character, a 
null byte is stored, preceded by any shift sequence needed to restore the 
initial shift state; the resulting state described is the initial conversion state.
Returns
4 The c8rtomb function returns the number of bytes stored in the array object 
(including any shift sequences). When c8 is not a valid character, an encoding 
error occurs: the function stores the value of the macro EILSEQ in errno and 
returns (size_t)(-1); the conversion state is unspecified.
..."
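
The "Returns" clause quoted above describes five distinct outcomes. A caller typically dispatches on them in a fixed order, checking the three special values before treating the result as a byte count. The sketch below only classifies a return value; it does not call mbrtoc8() itself, since that function is not yet available everywhere. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the five outcomes in the quoted "Returns" clause. */
enum mbr_status {
    MBR_NUL,             /* 0: the null character was decoded            */
    MBR_CONSUMED,        /* 1..n: that many bytes completed a character  */
    MBR_PENDING_OUTPUT,  /* (size_t)-3: stored a value, consumed nothing */
    MBR_INCOMPLETE,      /* (size_t)-2: input ends mid-character         */
    MBR_ENCODING_ERROR   /* (size_t)-1: invalid input, errno is EILSEQ   */
};

/* Classifies the return value R of an mbrtoc8/mbrtoc16/mbrtoc32 call.
   The special values are tested first because, as size_t, they are
   very large numbers and would otherwise look like byte counts.  */
enum mbr_status classify_mbr_result(size_t r)
{
    if (r == (size_t)-1)
        return MBR_ENCODING_ERROR;
    if (r == (size_t)-2)
        return MBR_INCOMPLETE;
    if (r == (size_t)-3)
        return MBR_PENDING_OUTPUT;
    if (r == 0)
        return MBR_NUL;
    return MBR_CONSUMED;
}
```

The (size_t)-3 case matters most for mbrtoc8() and mbrtoc16(), where one multibyte character can produce several output units that are delivered on successive calls.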

-- 
Take care. Thanks, Brian Inglis              Calgary, Alberta, Canada

La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer     but when there is no more to cut
                                 -- Antoine de Saint-Exupéry

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: character class "alpha"
  2023-08-01 16:29                                 ` Brian Inglis
@ 2023-08-02  7:56                                   ` Corinna Vinschen
  2023-08-02 15:06                                     ` Corinna Vinschen
  0 siblings, 1 reply; 32+ messages in thread
From: Corinna Vinschen @ 2023-08-02  7:56 UTC (permalink / raw)
  To: cygwin; +Cc: Brian Inglis, Bruno Haible

On Aug  1 10:29, Brian Inglis via Cygwin wrote:
> On 2023-07-31 15:12, Corinna Vinschen via Cygwin wrote:
> > Hi Bruno,
> > 
> > On Jul 31 20:43, Bruno Haible via Cygwin wrote:
> > > Corinna Vinschen wrote:
> > > > there are more of those expressions which are disabled on glibc and
> > > > fail on Cygwin, for instance in test-c32iscntrl.c.  Maybe it's actually
> > > > the better idea to disable them on Cygwin, too, rather than to change
> > > > a working system...
> > > 
> > > Sure. There is no standard how to map the Unicode properties to POSIX
> > > character classes. Other than the mentioned ISO C constraints for
> > > 'digit' and 'xdigit' and a few POSIX constraints, you are free to
> > > map them as you like. For glibc and gnulib, I mapped them in a way
> > > that seemed to make most sense for applications. But different
> > > people might come to different meanings of "make sense".
> > 
> > Ok, so I just pushed a patchset to Cygwin git, which should make GB18030
> > support actually work.
> > 
> > Also, the C11 functions c16rtomb, c32rtomb, mbrtoc16, mbrtoc32 are now
> > implemented in Cygwin and a uchar.h header exists now, too.
> > 
> > Assuming all gnulib tests disabled for GLibc in
> > 
> >    test-c32isalpha.c
> >    test-c32iscntrl.c
> >    test-c32isprint.c
> >    test-c32isgraph.c
> >    test-c32ispunct.c
> >    test-c32islower.c
> > 
> > will be disabled for Cygwin as well, all gb18030 and c32 tests in gnulib
> > work as desired now.
> 
> 	https://www.iso.org/standard/86539.html		[ISO/IEC/IEEE 9945 CD]
> 
> Draft POSIX 2023 SUS V5 Issue 8 D3 CB2.1 proposes the following POSIX
> Subprofiling Option Group: POSIX_C_LANG_UCHAR: ISO C Unicode Utilities.
> 
> 	https://www.iso.org/standard/82075.html		[ISO/IEC 9899 DIS]
> 
> Draft Standard C 2023 is being voted on as of 2023-07-14, and if no
> technical issues arise requiring tweaks, will become the new standard, in
> which Unicode utilities <uchar.h> has some additions which you may wish to
> add; from:

Maybe at one point, but nobody keeps you from sending patches :)


Corinna

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: character class "alpha"
  2023-08-02  7:56                                   ` Corinna Vinschen
@ 2023-08-02 15:06                                     ` Corinna Vinschen
  0 siblings, 0 replies; 32+ messages in thread
From: Corinna Vinschen @ 2023-08-02 15:06 UTC (permalink / raw)
  To: Corinna Vinschen via Cygwin; +Cc: Brian Inglis, Bruno Haible

On Aug  2 09:56, Corinna Vinschen via Cygwin wrote:
> On Aug  1 10:29, Brian Inglis via Cygwin wrote:
> > On 2023-07-31 15:12, Corinna Vinschen via Cygwin wrote:
> > > Hi Bruno,
> > > 
> > > On Jul 31 20:43, Bruno Haible via Cygwin wrote:
> > > > Corinna Vinschen wrote:
> > > > > there are more of those expressions which are disabled on glibc and
> > > > > fail on Cygwin, for instance in test-c32iscntrl.c.  Maybe it's actually
> > > > > the better idea to disable them on Cygwin, too, rather than to change
> > > > > a working system...
> > > > 
> > > > Sure. There is no standard how to map the Unicode properties to POSIX
> > > > character classes. Other than the mentioned ISO C constraints for
> > > > 'digit' and 'xdigit' and a few POSIX constraints, you are free to
> > > > map them as you like. For glibc and gnulib, I mapped them in a way
> > > > that seemed to make most sense for applications. But different
> > > > people might come to different meanings of "make sense".
> > > 
> > > Ok, so I just pushed a patchset to Cygwin git, which should make GB18030
> > > support actually work.
> > > 
> > > Also, the C11 functions c16rtomb, c32rtomb, mbrtoc16, mbrtoc32 are now
> > > implemented in Cygwin and a uchar.h header exists now, too.
> > > 
> > > Assuming all gnulib tests disabled for GLibc in
> > > 
> > >    test-c32isalpha.c
> > >    test-c32iscntrl.c
> > >    test-c32isprint.c
> > >    test-c32isgraph.c
> > >    test-c32ispunct.c
> > >    test-c32islower.c
> > > 
> > > will be disabled for Cygwin as well, all gb18030 and c32 tests in gnulib
> > > work as desired now.
> > 
> > 	https://www.iso.org/standard/86539.html		[ISO/IEC/IEEE 9945 CD]
> > 
> > Draft POSIX 2023 SUS V5 Issue 8 D3 CB2.1 proposes the following POSIX
> > Subprofiling Option Group: POSIX_C_LANG_UCHAR: ISO C Unicode Utilities.
> > 
> > 	https://www.iso.org/standard/82075.html		[ISO/IEC 9899 DIS]
> > 
> > Draft Standard C 2023 is being voted on as of 2023-07-14, and if no
> > technical issues arise requiring tweaks, will become the new standard, in
> > which Unicode utilities <uchar.h> has some additions which you may wish to
> > add; from:
> 
> Maybe at one point, but nobody keeps you from sending patches :)

Never mind, had a bit of time.

I fixed the uchar.h header and implemented c8rtomb and mbrtoc8.
Still needs testing.  Does anybody know of an easily accessible
testsuite testing these functions?

However, I did not define __STDC_VERSION_UCHAR_H__ yet.  I wasn't sure
my uchar.h is compliant, and Glibc doesn't define that macro yet,
either.


Corinna
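
In the absence of a ready-made test suite, a check along the following
lines could serve as a first smoke test. It is only a sketch, not code
from Cygwin or gnulib: the helper names uchar_h_version and
c32_roundtrip_ok are hypothetical, and it exercises only the C11
functions mbrtoc32 and c32rtomb mentioned earlier in the thread, so that
it also builds where c8rtomb and mbrtoc8 are not yet available.

```c
#include <assert.h>
#include <limits.h>   /* MB_LEN_MAX */
#include <string.h>
#include <uchar.h>    /* char32_t, mbrtoc32, c32rtomb */
#include <wchar.h>    /* mbstate_t */

/* Hypothetical helper: report the value of __STDC_VERSION_UCHAR_H__,
   or 0 if <uchar.h> does not define it (as described in the message
   above).  C23 specifies the value 202311L for this macro.  */
long uchar_h_version(void)
{
#ifdef __STDC_VERSION_UCHAR_H__
  return (long) __STDC_VERSION_UCHAR_H__;
#else
  return 0L;
#endif
}

/* Hypothetical helper: decode the first character of S with mbrtoc32,
   re-encode it with c32rtomb, and compare the bytes with the input.
   Returns 1 if the round trip reproduces the input bytes, 0 otherwise.
   The result depends on the current LC_CTYPE locale.  */
int c32_roundtrip_ok(const char *s)
{
  mbstate_t st;
  char32_t c;
  char out[MB_LEN_MAX];
  size_t n, m;

  assert(s != NULL);

  memset(&st, 0, sizeof st);
  n = mbrtoc32(&c, s, strlen(s), &st);
  /* 0 means a NUL character; (size_t)-1, -2 and -3 mean invalid input,
     incomplete input, and "no input consumed" respectively.  */
  if (n == 0 || n >= (size_t) -3)
    return 0;

  memset(&st, 0, sizeof st);
  m = c32rtomb(out, c, &st);
  if (m == (size_t) -1)
    return 0;

  return m == n && memcmp(out, s, n) == 0;
}
```

To cover the non-BMP case from the start of the thread, one would first
select a UTF-8 locale with setlocale and then pass a string such as
"\360\237\230\213" to c32_roundtrip_ok; which locale names are available
depends on the system. An analogous helper for mbrtoc8 and c8rtomb would
additionally have to loop over the several char8_t units that one
character can produce.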


end of thread, other threads:[~2023-08-02 15:06 UTC | newest]

Thread overview: 32+ messages
2023-07-27 10:15 fnmatch improvements Bruno Haible
2023-07-27 18:24 ` Corinna Vinschen
2023-07-27 19:05   ` Corinna Vinschen
2023-07-27 20:25     ` Brian Inglis
2023-07-27 21:22       ` Bruno Haible
2023-07-27 22:17         ` Brian Inglis
2023-07-28  9:00           ` Corinna Vinschen
2023-07-28  9:53             ` Corinna Vinschen
2023-07-27 21:40     ` Bruno Haible
2023-07-28  8:53       ` Corinna Vinschen
2023-07-28 10:56         ` Bruno Haible
2023-07-28 11:14           ` Corinna Vinschen
2023-07-28 18:59           ` Corinna Vinschen
2023-07-28 19:33             ` Bruno Haible
2023-07-28 19:54             ` GB18030 locale Bruno Haible
2023-07-29  9:23               ` Corinna Vinschen
2023-07-29  9:53                 ` Bruno Haible
2023-07-31 10:07                   ` Corinna Vinschen
2023-07-31 13:38                     ` Corinna Vinschen
2023-07-31 14:06                       ` character class "alpha" Bruno Haible
2023-07-31 17:46                         ` Corinna Vinschen
2023-07-31 18:20                           ` Corinna Vinschen
2023-07-31 18:43                             ` Bruno Haible
2023-07-31 21:12                               ` Corinna Vinschen
2023-08-01 16:29                                 ` Brian Inglis
2023-08-02  7:56                                   ` Corinna Vinschen
2023-08-02 15:06                                     ` Corinna Vinschen
2023-07-31 21:13                               ` Brian Inglis
2023-07-31 21:37                                 ` Bruno Haible
2023-07-28 11:12         ` fnmatch improvements Corinna Vinschen
2023-07-28 11:22           ` Bruno Haible
2023-07-28 21:42           ` Bill Stewart
