* fnmatch improvements
@ 2023-07-27 10:15 Bruno Haible
2023-07-27 18:24 ` Corinna Vinschen
0 siblings, 1 reply; 32+ messages in thread
From: Bruno Haible @ 2023-07-27 10:15 UTC (permalink / raw)
To: cygwin
Hi,
Gnulib has, for the first time, an fnmatch() implementation that supports
characters outside the Unicode Basic Multilingual Plane (BMP), even on Cygwin
with its 16-bit wchar_t type. That is, in a UTF-8 locale, e.g.
fnmatch ("x?y", "x\360\237\230\213y", 0)
now returns 0.
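For anyone who wants to try the call by hand, here is a minimal sketch in C. The helper name match_in_locale and the LOCALE_UNAVAILABLE marker are made up for illustration and are not part of Gnulib or its test suite.

```c
#include <fnmatch.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical marker for "locale could not be set". This assumes that
   FNM_NOMATCH is not -1; it is 1 on glibc, musl and BSD-derived libcs. */
#define LOCALE_UNAVAILABLE (-1)

/* Hypothetical helper: run fnmatch() with LC_ALL switched to `locale`.
   Returns the fnmatch() result (0 on match, FNM_NOMATCH on mismatch),
   or LOCALE_UNAVAILABLE if setlocale() rejects the locale name. */
int match_in_locale(const char *locale, const char *pattern, const char *string)
{
    if (setlocale(LC_ALL, locale) == NULL)
        return LOCALE_UNAVAILABLE;
    int result = fnmatch(pattern, string, 0);
    setlocale(LC_ALL, "C");   /* go back to the C locale afterwards */
    return result;
}
```

Assuming a locale named en_US.UTF-8 is installed, match_in_locale ("en_US.UTF-8", "x?y", "x\360\237\230\213y") is the call described above. In a single-byte locale the same string has four bytes between x and y, so '?' would not be expected to match it.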
This implementation also implements GNU extensions, as documented in
https://www.gnu.org/software/libc/manual/html_node/Wildcard-Matching.html
Now, I see that in the Cygwin master branch the fnmatch implementation has
been improved, supposedly handling non-BMP characters and character classes
as well.
Therefore I would find it interesting to know whether the Cygwin 3.5.0 fnmatch()
still gets overridden by the gnulib one and, if not, whether it passes the
gnulib test suite.
I can't easily install a Cygwin 3.5.0 snapshot. If one of you would like to
help, here's how to:
1. Create an environment for working with a Cygwin 3.5.0 snapshot (from
   March 2023 or newer).
2. wget https://haible.de/bruno/gnu/testdir-fnmatch.tar.gz
3. tar xfz testdir-fnmatch.tar.gz
4. cd testdir-fnmatch-posix
   ./configure 2>&1 | tee log1
   make
   make check
   grep fnmatch log1
   grep REPLACE_FNMATCH config.status
   cd ..
5. cd testdir-fnmatch-gnu
   ./configure 2>&1 | tee log1
   make
   make check
   grep fnmatch log1
   grep REPLACE_FNMATCH config.status
   cd ..
and provide the build and grep results.
Thanks!
Bruno
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements
2023-07-27 10:15 fnmatch improvements Bruno Haible
@ 2023-07-27 18:24 ` Corinna Vinschen
2023-07-27 19:05 ` Corinna Vinschen
0 siblings, 1 reply; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-27 18:24 UTC (permalink / raw)
To: Bruno Haible; +Cc: cygwin
Hi Bruno,
On Jul 27 12:15, Bruno Haible via Cygwin wrote:
> Hi,
>
> Gnulib has, for the first time, an fnmatch() implementation that supports
> characters outside the Unicode Basic Multilingual Plane (BMP), even on Cygwin
> with its 16-bits wchar_t type. That is, in an UTF-8 locale, e.g.
> fnmatch ("x?y", "x\360\237\230\213y", 0)
> now returns 0.
>
> This implementation also implements GNU extensions, as documented in
> https://www.gnu.org/software/libc/manual/html_node/Wildcard-Matching.html
>
> Now, I see that in the Cygwin master branch the fnmatch implementation has
> been improved, supposedly handling non-BMP characters and character classes
> as well.
The major changes are using 32-bit Unicode values internally and
implementing collating symbols and equivalence class expressions.
> Therefore I would find it interesting to know whether the Cygwin 3.5.0 fnmatch()
> now still gets overridden by the gnulib one and, if no, whether it passes the
> gnulib test suite.
I'm looking into that. First thing, your testsuite uncovered a bug in
the latest fnmatch in the C locale. Comparing pointers instead of
comparing characters was never a good idea for pattern matching...
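The class of bug mentioned here can be illustrated with a small sketch. This is not the Cygwin code; both function names are hypothetical and exist only to contrast the two comparisons.

```c
/* Buggy variant: compares the addresses held in p and s, so it only
   reports "equal" when both pointers refer to the same memory location. */
int same_char_buggy(const char *p, const char *s)
{
    return p == s;
}

/* Intended variant: dereferences both pointers and compares the
   characters they point to. */
int same_char(const char *p, const char *s)
{
    return *p == *s;
}
```

With a pattern "abc" and a string "xbz", the second characters are both 'b', so same_char() reports a match at that position, while the pointer comparison does not, because the two characters live in different objects.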
When I'm done I hope that our 3.5 fnmatch won't be overridden by the
gnulib version :}
> I can't easily install a Cygwin 3.5.0 snapshot. If one of you would like to
> help, here's how to:
> 1. Create an environment for working with a Cygwin 3.5.0 snapshot (from
> March 2023 or newer).
> 2. wget https://haible.de/bruno/gnu/testdir-fnmatch.tar.gz
> 3. tar xfz testdir-fnmatch.tar.gz
> 4. cd testdir-fnmatch-posix
> ./configure 2>&1 | tee log1
> make
> make check
> grep fnmatch log1
> grep REPLACE_FNMATCH config.status
> cd ..
> 5. cd testdir-fnmatch-gnu
> ./configure 2>&1 | tee log1
> make
> make check
> grep fnmatch log1
> grep REPLACE_FNMATCH config.status
> cd ..
> and provide the build and grep results.
>
> Thanks!
>
> Bruno
No worries, thanks for the testcases, I think I'll have some results
tomorrow.
Corinna
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements
2023-07-27 18:24 ` Corinna Vinschen
@ 2023-07-27 19:05 ` Corinna Vinschen
2023-07-27 20:25 ` Brian Inglis
2023-07-27 21:40 ` Bruno Haible
0 siblings, 2 replies; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-27 19:05 UTC (permalink / raw)
To: Corinna Vinschen via Cygwin; +Cc: Bruno Haible
On Jul 27 20:24, Corinna Vinschen via Cygwin wrote:
> On Jul 27 12:15, Bruno Haible via Cygwin wrote:
> I'm looking into that. First thing, your testsuite uncovered a bug in
> the latest fnmatch in the C locale. Comparing pointers instead of
> comparing characters was never a good idea for pattern matching...
>
> When I'm done I hope that our 3.5 fnmatch won't be overridden by the
> gnulib version :}
>
> > I can't easily install a Cygwin 3.5.0 snapshot. If one of you would like to
> > help, here's how to:
> > 1. Create an environment for working with a Cygwin 3.5.0 snapshot (from
> > March 2023 or newer).
> > 2. wget https://haible.de/bruno/gnu/testdir-fnmatch.tar.gz
> > 3. tar xfz testdir-fnmatch.tar.gz
> > 4. cd testdir-fnmatch-posix
> > ./configure 2>&1 | tee log1
> > make
> > make check
I fixed the above problem and the POSIX check now works fine:
> > grep fnmatch log1
checking for fnmatch.h... yes
checking for fnmatch... yes
checking for working POSIX fnmatch... yes
I also extracted the fnmatch configure testcase and ran it manually.
It returns 0 now. But:
> > grep REPLACE_FNMATCH config.status
S["REPLACE_FNMATCH"]="1"
Looks like the reason is that we don't have a uchar.h file? Seems
like this is of interest for AIX, but why should this be of
interest for fnmatch on other systems?
Thanks,
Corinna
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements
2023-07-27 19:05 ` Corinna Vinschen
@ 2023-07-27 20:25 ` Brian Inglis
2023-07-27 21:22 ` Bruno Haible
2023-07-27 21:40 ` Bruno Haible
1 sibling, 1 reply; 32+ messages in thread
From: Brian Inglis @ 2023-07-27 20:25 UTC (permalink / raw)
To: Corinna Vinschen via Cygwin, Bruno Haible
On 2023-07-27 13:05, Corinna Vinschen via Cygwin wrote:
> On Jul 27 20:24, Corinna Vinschen via Cygwin wrote:
>> On Jul 27 12:15, Bruno Haible via Cygwin wrote:
>> I'm looking into that. First thing, your testsuite uncovered a bug in
>> the latest fnmatch in the C locale. Comparing pointers instead of
>> comparing characters was never a good idea for pattern matching...
>>
>> When I'm done I hope that our 3.5 fnmatch won't be overridden by the
>> gnulib version :}
>>
>>> I can't easily install a Cygwin 3.5.0 snapshot. If one of you would like to
>>> help, here's how to:
>>> 1. Create an environment for working with a Cygwin 3.5.0 snapshot (from
>>> March 2023 or newer).
>>> 2. wget https://haible.de/bruno/gnu/testdir-fnmatch.tar.gz
>>> 3. tar xfz testdir-fnmatch.tar.gz
>>> 4. cd testdir-fnmatch-posix
>>> ./configure 2>&1 | tee log1
>>> make
>>> make check
>
> I fixed the above problem and the POSIX check now works fine:
>
>>> grep fnmatch log1
>
> checking for fnmatch.h... yes
> checking for fnmatch... yes
> checking for working POSIX fnmatch... yes
>
> I also extracted the fnmatch configure testcase and ran it manually.
> It returns 0 now. But:
>
>>> grep REPLACE_FNMATCH config.status
>
> S["REPLACE_FNMATCH"]="1"
>
> Looks like the reason is that we don't have a uchar.h file? Seems
> like this is of interest for AIX, but why should this be of
> interest for fnmatch on other systems?
It was added in C99 TR19769, integrated in C/++11, available in libicu-devel:
https://cplusplus.com/reference/cuchar/
https://open-std.org/jtc1/sc22/open/n3579.pdf
https://open-std.org/jtc1/sc22/wg14/www/docs/n1326.pdf
https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf#page=416
$ find /usr/include/ -name uchar.h
/usr/include/unicode/uchar.h
$ cygcheck -f /usr/include/unicode/uchar.h
libicu-devel-72.1-1
--
Take care. Thanks, Brian Inglis Calgary, Alberta, Canada
La perfection est atteinte Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer but when there is no more to cut
-- Antoine de Saint-Exupéry
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements
2023-07-27 20:25 ` Brian Inglis
@ 2023-07-27 21:22 ` Bruno Haible
2023-07-27 22:17 ` Brian Inglis
0 siblings, 1 reply; 32+ messages in thread
From: Bruno Haible @ 2023-07-27 21:22 UTC (permalink / raw)
To: Brian.Inglis; +Cc: cygwin
Brian Inglis wrote:
> It was added in C99 TR19769, integrated in C/++11
Yes.
> available in libicu-devel:
>
> https://cplusplus.com/reference/cuchar/
>
> https://open-std.org/jtc1/sc22/open/n3579.pdf
>
> https://open-std.org/jtc1/sc22/wg14/www/docs/n1326.pdf
>
> https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf#page=416
>
> $ find /usr/include/ -name uchar.h
> /usr/include/unicode/uchar.h
>
> $ cygcheck -f /usr/include/unicode/uchar.h
> libicu-devel-72.1-1
This file, <unicode/uchar.h> from ICU4C, is something completely different from
ISO C's <uchar.h>.
Bruno
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements
2023-07-27 19:05 ` Corinna Vinschen
2023-07-27 20:25 ` Brian Inglis
@ 2023-07-27 21:40 ` Bruno Haible
2023-07-28 8:53 ` Corinna Vinschen
1 sibling, 1 reply; 32+ messages in thread
From: Bruno Haible @ 2023-07-27 21:40 UTC (permalink / raw)
To: Corinna Vinschen, Bruno Haible
Corinna Vinschen wrote:
> > > 4. cd testdir-fnmatch-posix
> > > ./configure 2>&1 | tee log1
> > > make
> > > make check
>
> I fixed the above problem and the POSIX check now works fine:
Glad that the test suite was helpful (and that you fixed it before 3.5.0 —
so, no additional configure tests needed on the gnulib side).
> > > grep fnmatch log1
>
> checking for fnmatch.h... yes
> checking for fnmatch... yes
> checking for working POSIX fnmatch... yes
>
> I also extracted the fnmatch configure testcase and ran it manually.
> It returns 0 now. But:
>
> > > grep REPLACE_FNMATCH config.status
>
> S["REPLACE_FNMATCH"]="1"
>
> Looks like the reason is that we don't have a uchar.h file? Seems
> like this is of interest for AIX, but why should this be of
> interest for fnmatch on other systems?
Ah, that's because I assumed that if wchar_t is only 16 bits wide,
fnmatch() can't be correct. That is true for AIX (and on this
platform, I prefer not to test the available locales), but it is no
longer true with your implementation.
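To make the 16-bit limitation concrete: when wchar_t holds UTF-16 code units, any character outside the BMP needs two units (a surrogate pair). A minimal sketch of that split follows; the helper names are hypothetical.

```c
#include <stdint.h>

/* Number of UTF-16 code units needed for code point cp (1 or 2). */
int utf16_units(uint32_t cp)
{
    return cp < 0x10000 ? 1 : 2;
}

/* High (first) surrogate; only meaningful for cp in 0x10000..0x10FFFF. */
uint16_t high_surrogate(uint32_t cp)
{
    return (uint16_t) (0xD800 + ((cp - 0x10000) >> 10));
}

/* Low (second) surrogate; only meaningful for cp in 0x10000..0x10FFFF. */
uint16_t low_surrogate(uint32_t cp)
{
    return (uint16_t) (0xDC00 + ((cp - 0x10000) & 0x3FF));
}
```

For U+1F60B, the character encoded as "\360\237\230\213" in the example at the start of the thread, this gives the pair 0xD83D 0xDE0B. A matcher in which '?' consumes exactly one 16-bit wchar_t therefore cannot match it.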
What are the test suite results if you do
- Replace S["REPLACE_FNMATCH"]="1" with S["REPLACE_FNMATCH"]="0"
in config.status,
- make clean
- ./config.status
- make
- make check
Then the tests will be run against Cygwin's fnmatch() function.
If all tests pass, I will add the following patch to gnulib.
diff --git a/m4/fnmatch.m4 b/m4/fnmatch.m4
index 2e1442eff7..e99737a476 100644
--- a/m4/fnmatch.m4
+++ b/m4/fnmatch.m4
@@ -1,4 +1,4 @@
-# Check for fnmatch - serial 18 -*- coding: utf-8 -*-
+# Check for fnmatch - serial 19 -*- coding: utf-8 -*-
# Copyright (C) 2000-2007, 2009-2023 Free Software Foundation, Inc.
# This file is free software; the Free Software Foundation
@@ -14,7 +14,7 @@ AC_DEFUN([gl_FUNC_FNMATCH_POSIX]
m4_divert_text([DEFAULTS], [gl_fnmatch_required=POSIX])
AC_REQUIRE([gl_FNMATCH_H])
- AC_REQUIRE([AC_CANONICAL_HOST]) dnl for cross-compiles
+ AC_REQUIRE([AC_CANONICAL_HOST])
gl_fnmatch_required_lowercase=`
echo $gl_fnmatch_required | LC_ALL=C tr '[[A-Z]]' '[[a-z]]'
`
@@ -164,7 +164,17 @@ AC_DEFUN([gl_FUNC_FNMATCH_POSIX]
dnl This is due to wchar_t being only 16 bits wide.
AC_REQUIRE([gl_UCHAR_H])
if test $SMALL_WCHAR_T = 1; then
- REPLACE_FNMATCH=1
+ case "$host_os" in
+ cygwin*)
+ dnl On Cygwin < 3.5.0, the above $gl_fnmatch_result came out as 'no',
+ dnl On Cygwin >= 3.5.0, fnmatch supports all Unicode characters,
+ dnl despite wchar_t being only 16 bits wide (because internally it
+ dnl works on wint_t values).
+ ;;
+ *)
+ REPLACE_FNMATCH=1
+ ;;
+ esac
fi
fi
if test $HAVE_FNMATCH = 0 || test $REPLACE_FNMATCH = 1; then
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements
2023-07-27 21:22 ` Bruno Haible
@ 2023-07-27 22:17 ` Brian Inglis
2023-07-28 9:00 ` Corinna Vinschen
0 siblings, 1 reply; 32+ messages in thread
From: Brian Inglis @ 2023-07-27 22:17 UTC (permalink / raw)
To: cygwin; +Cc: Bruno Haible
On 2023-07-27 15:22, Bruno Haible wrote:
> Brian Inglis wrote:
>> It was added in C99 TR19769, integrated in C/++11
>
> Yes.
>
>> available in libicu-devel:
>>
>> https://cplusplus.com/reference/cuchar/
>>
>> https://open-std.org/jtc1/sc22/open/n3579.pdf
>>
>> https://open-std.org/jtc1/sc22/wg14/www/docs/n1326.pdf
>>
>> https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf#page=416
>>
>> $ find /usr/include/ -name uchar.h
>> /usr/include/unicode/uchar.h
>>
>> $ cygcheck -f /usr/include/unicode/uchar.h
>> libicu-devel-72.1-1
>
> This file, <unicode/uchar.h> from ICU4C, is something completely different than
> ISO C's <uchar.h>.
This would then be a newlib addition (newlib AT sourceware DOT org), so we could
use FreeBSD's:
https://cgit.freebsd.org/src/blame/include/uchar.h?id=9f9d157d82e2332b74d9c45b596748e3e4691f2d
plus consideration of gnulib:
https://git.savannah.gnu.org/gitweb/?p=gnulib.git&a=search&h=HEAD&st=commit&s=uchar.h
and C2023 CD2:
https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3096.pdf
There are only symbol formatting changes in the N3148 comments, and N3149 is a
zip with a password-protected PDF, so it is likely the FDIS!
--
Take care. Thanks, Brian Inglis Calgary, Alberta, Canada
La perfection est atteinte Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer but when there is no more to cut
-- Antoine de Saint-Exupéry
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements
2023-07-27 21:40 ` Bruno Haible
@ 2023-07-28 8:53 ` Corinna Vinschen
2023-07-28 10:56 ` Bruno Haible
2023-07-28 11:12 ` fnmatch improvements Corinna Vinschen
0 siblings, 2 replies; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-28 8:53 UTC (permalink / raw)
To: Bruno Haible; +Cc: cygwin
On Jul 27 23:40, Bruno Haible via Cygwin wrote:
> Corinna Vinschen wrote:
> > > > 4. cd testdir-fnmatch-posix
> > > > ./configure 2>&1 | tee log1
> > > > make
> > > > make check
> >
> > I fixed the above problem and the POSIX check now works fine:
>
> Glad that the test suite was helpful (and that you fixed it before 3.5.0 —
> so, no additional configure tests needed on the gnulib side).
>
> > > > grep fnmatch log1
> >
> > checking for fnmatch.h... yes
> > checking for fnmatch... yes
> > checking for working POSIX fnmatch... yes
> >
> > I also extracted the fnmatch configure testcase and ran it manually.
> > It returns 0 now. But:
> >
> > > > grep REPLACE_FNMATCH config.status
> >
> > S["REPLACE_FNMATCH"]="1"
> >
> > Looks like the reason is that we don't have a uchar.h file? Seems
> > like this is of interest for AIX, but why should this be of
> > interest for fnmatch on other systems?
>
> Ah, that's because I made the assumption that if wchar_t is only 16-bits
> wide, fnmatch() can't be correct. Which is true for AIX (and on this
> platform, I prefer not to test the available locales). But not true
> with your implementation any more.
>
> What are the test suite results if you do
>
> - Replace S["REPLACE_FNMATCH"]="1" with S["REPLACE_FNMATCH"]="0"
> in config.status,
> - make clean
> - ./config.status
> - make
The build fails here. The reason is that the GNU extension FNM_EXTMATCH
is not supported by the FreeBSD code base of fnmatch, so it's not
defined in our fnmatch.h system header. Gnulib still tries to build
fnmatch_loop.c which uses FNM_EXTMATCH, but apparently it now relies on
using the system header?
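For code that has to build against either kind of header, a guard on the macro is one option. This is a minimal sketch with a hypothetical helper, not what Gnulib's fnmatch_loop.c does:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* lets glibc's <fnmatch.h> expose FNM_EXTMATCH */
#endif
#include <fnmatch.h>

/* Hypothetical helper: does `name` match "*.c" or "*.h"?
   Returns 0 on match, like fnmatch(). */
int match_c_or_h(const char *name)
{
#ifdef FNM_EXTMATCH
    /* GNU extension available: one ksh-style alternation pattern. */
    return fnmatch("@(*.c|*.h)", name, FNM_EXTMATCH);
#else
    /* Plain POSIX fnmatch(): try each alternative in turn. */
    if (fnmatch("*.c", name, 0) == 0)
        return 0;
    return fnmatch("*.h", name, 0);
#endif
}
```

With a system header that does not define FNM_EXTMATCH, only the second branch is compiled, so the build does not depend on the GNU extension.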
> - make check
>
> Then the tests will be run against Cygwin's fnmatch() function.
> If all tests pass, I will add the following patch to gnulib.
After the above failure, I tried from scratch with your patch below,
and I still get
$ grep REPLACE_FNMATCH ./config.status
S["REPLACE_FNMATCH"]="1"
Even though
$ grep fnmatch log1
checking for fnmatch.h... yes
checking for fnmatch... yes
checking for working POSIX fnmatch... yes
I'm quite puzzled.
Corinna
>
> diff --git a/m4/fnmatch.m4 b/m4/fnmatch.m4
> index 2e1442eff7..e99737a476 100644
> --- a/m4/fnmatch.m4
> +++ b/m4/fnmatch.m4
> @@ -1,4 +1,4 @@
> -# Check for fnmatch - serial 18 -*- coding: utf-8 -*-
> +# Check for fnmatch - serial 19 -*- coding: utf-8 -*-
>
> # Copyright (C) 2000-2007, 2009-2023 Free Software Foundation, Inc.
> # This file is free software; the Free Software Foundation
> @@ -14,7 +14,7 @@ AC_DEFUN([gl_FUNC_FNMATCH_POSIX]
> m4_divert_text([DEFAULTS], [gl_fnmatch_required=POSIX])
>
> AC_REQUIRE([gl_FNMATCH_H])
> - AC_REQUIRE([AC_CANONICAL_HOST]) dnl for cross-compiles
> + AC_REQUIRE([AC_CANONICAL_HOST])
> gl_fnmatch_required_lowercase=`
> echo $gl_fnmatch_required | LC_ALL=C tr '[[A-Z]]' '[[a-z]]'
> `
> @@ -164,7 +164,17 @@ AC_DEFUN([gl_FUNC_FNMATCH_POSIX]
> dnl This is due to wchar_t being only 16 bits wide.
> AC_REQUIRE([gl_UCHAR_H])
> if test $SMALL_WCHAR_T = 1; then
> - REPLACE_FNMATCH=1
> + case "$host_os" in
> + cygwin*)
> + dnl On Cygwin < 3.5.0, the above $gl_fnmatch_result came out as 'no',
> + dnl On Cygwin >= 3.5.0, fnmatch supports all Unicode characters,
> + dnl despite wchar_t being only 16 bits wide (because internally it
> + dnl works on wint_t values).
> + ;;
> + *)
> + REPLACE_FNMATCH=1
> + ;;
> + esac
> fi
> fi
> if test $HAVE_FNMATCH = 0 || test $REPLACE_FNMATCH = 1; then
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements
2023-07-27 22:17 ` Brian Inglis
@ 2023-07-28 9:00 ` Corinna Vinschen
2023-07-28 9:53 ` Corinna Vinschen
0 siblings, 1 reply; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-28 9:00 UTC (permalink / raw)
To: cygwin
On Jul 27 16:17, Brian Inglis via Cygwin wrote:
> On 2023-07-27 15:22, Bruno Haible wrote:
> > Brian Inglis wrote:
> > > It was added in C99 TR19769, integrated in C/++11
> >
> > Yes.
> >
> > > available in libicu-devel:
> > >
> > > https://cplusplus.com/reference/cuchar/
> > >
> > > https://open-std.org/jtc1/sc22/open/n3579.pdf
> > >
> > > https://open-std.org/jtc1/sc22/wg14/www/docs/n1326.pdf
> > >
> > > https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf#page=416
> > >
> > > $ find /usr/include/ -name uchar.h
> > > /usr/include/unicode/uchar.h
> > >
> > > $ cygcheck -f /usr/include/unicode/uchar.h
> > > libicu-devel-72.1-1
> >
> > This file, <unicode/uchar.h> from ICU4C, is something completely different than
> > ISO C's <uchar.h>.
>
> This would then be a *newlib* AT sourceware DOT org addition so we could use
> FreeBSD's:
We can use FreeBSD's version as a model, but we can't use the code
verbatim, given that FreeBSD assumes sizeof(wchar_t) == 4.
Since that's a Cygwin-only issue (2-byte wchar_t, that is), I guess we
should merge the code into the Cygwin code base, rather than newlib.
For mbrtoc32/c32rtomb, we can use the wirtomb/mbrtowi function I
introduced for the globbing code. If we do that, I think the functions
should actually be renamed accordingly and the globbing code should use
uchar32_t rather than wint_t.
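As a rough picture of what an mbrtoc32-style conversion does in a UTF-8 locale, here is a simplified sketch. It is not the wirtomb/mbrtowi code; the function name and the DECODE_ERROR marker are hypothetical.

```c
#include <stdint.h>

#define DECODE_ERROR 0xFFFFFFFFu   /* hypothetical error marker */

/* Hypothetical helper: decode the first UTF-8 sequence of a
   NUL-terminated string into one 32-bit code point. Simplified:
   overlong forms and surrogates are not rejected, and no conversion
   state is kept between calls. */
uint32_t utf8_first_c32(const char *str)
{
    const unsigned char *s = (const unsigned char *) str;
    uint32_t cp;
    int extra;   /* number of continuation bytes expected */

    if (s[0] < 0x80)                { cp = s[0];        extra = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; extra = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; extra = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; extra = 3; }
    else return DECODE_ERROR;   /* stray continuation or invalid lead byte */

    for (int i = 1; i <= extra; i++) {
        if ((s[i] & 0xC0) != 0x80)   /* also catches a premature NUL */
            return DECODE_ERROR;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    return cp;
}
```

The result is a single 32-bit value even for characters outside the BMP, which is the property the globbing code relies on when wchar_t itself is only 2 bytes wide.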
Also, it might be helpful to add the mbrtoc8/c8rtomb extensions at some
point, which are missing in FreeBSD.
Either way, I'd be grateful for patches in this area.
Corinna
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements
2023-07-28 9:00 ` Corinna Vinschen
@ 2023-07-28 9:53 ` Corinna Vinschen
0 siblings, 0 replies; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-28 9:53 UTC (permalink / raw)
To: cygwin
On Jul 28 11:00, Corinna Vinschen via Cygwin wrote:
> If we do that, I think the functions
> should actually be renamed accordingly and the globbing code should use
> uchar32_t rather than wint_t.
s/uchar32_t/char32_t/
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements
2023-07-28 8:53 ` Corinna Vinschen
@ 2023-07-28 10:56 ` Bruno Haible
2023-07-28 11:14 ` Corinna Vinschen
2023-07-28 18:59 ` Corinna Vinschen
2023-07-28 11:12 ` fnmatch improvements Corinna Vinschen
1 sibling, 2 replies; 32+ messages in thread
From: Bruno Haible @ 2023-07-28 10:56 UTC (permalink / raw)
To: cygwin
Corinna Vinschen wrote:
> After the above fail, I tried from scratch with your below patch,
> and I still get
>
> $ grep REPLACE_FNMATCH ./config.status
> S["REPLACE_FNMATCH"]="1"
>
> Even though
>
> $ grep fnmatch log1
> checking for fnmatch.h... yes
> checking for fnmatch... yes
> checking for working POSIX fnmatch... yes
>
> I'm quite puzzled.
It's sometimes hard to make incremental changes to generated files of the
GNU Build System plus Gnulib. I've therefore recreated a new tarball for you,
at https://haible.de/bruno/gnu/testdir-fnmatch.tar.gz .
The expected result is:
1. cd testdir-fnmatch-posix
   ./configure
   grep REPLACE_FNMATCH config.status
   (Expected: REPLACE_FNMATCH is 0)
   make
   make check
   (Expected: No test failures)
   cd ..
2. cd testdir-fnmatch-gnu
   ./configure
   grep REPLACE_FNMATCH config.status
   (Expected: REPLACE_FNMATCH is 1, because of FNM_EXTMATCH)
   make
   make check
   (Expected: No test failures)
   cd ..
Bruno
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements
2023-07-28 8:53 ` Corinna Vinschen
2023-07-28 10:56 ` Bruno Haible
@ 2023-07-28 11:12 ` Corinna Vinschen
2023-07-28 11:22 ` Bruno Haible
2023-07-28 21:42 ` Bill Stewart
1 sibling, 2 replies; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-28 11:12 UTC (permalink / raw)
To: Corinna Vinschen via Cygwin; +Cc: Bruno Haible
On Jul 28 10:53, Corinna Vinschen via Cygwin wrote:
> On Jul 27 23:40, Bruno Haible via Cygwin wrote:
> > Corinna Vinschen wrote:
> > > S["REPLACE_FNMATCH"]="1"
> > >
> > > Looks like the reason is that we don't have a uchar.h file? Seems
> > > like this is of interest for AIX, but why should this be of
> > > interest for fnmatch on other systems?
> >
> > Ah, that's because I made the assumption that if wchar_t is only 16-bits
> > wide, fnmatch() can't be correct. Which is true for AIX (and on this
> > platform, I prefer not to test the available locales). But not true
> > with your implementation any more.
> >
> > What are the test suite results if you do
> >
> > - Replace S["REPLACE_FNMATCH"]="1" with S["REPLACE_FNMATCH"]="0"
> > in config.status,
> > - make clean
> > - ./config.status
> > - make
>
> The build fails here. The reason is that the GNU extension FNM_EXTMATCH
> is not supported by the FreeBSD code base of fnmatch, so it's not
> defined in our fnmatch.h system header. Gnulib still tries to build
> fnmatch_loop.c which uses FNM_EXTMATCH, but apparently it now relies on
> using the system header?
> [...]
> After the above fail, I tried from scratch with your below patch,
> and I still get
>
> $ grep REPLACE_FNMATCH ./config.status
> S["REPLACE_FNMATCH"]="1"
>
> Even though
>
> $ grep fnmatch log1
> checking for fnmatch.h... yes
> checking for fnmatch... yes
> checking for working POSIX fnmatch... yes
>
> I'm quite puzzled.
I'm puzzled because I'm an idiot. I forgot autoreconf. After that and
another configure run, REPLACE_FNMATCH is correctly set to 0 *and* the
build runs fine.
I'll do the rest of the test later today.
Sorry,
Corinna
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements
2023-07-28 10:56 ` Bruno Haible
@ 2023-07-28 11:14 ` Corinna Vinschen
2023-07-28 18:59 ` Corinna Vinschen
1 sibling, 0 replies; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-28 11:14 UTC (permalink / raw)
To: Bruno Haible; +Cc: cygwin
On Jul 28 12:56, Bruno Haible via Cygwin wrote:
> Corinna Vinschen wrote:
> > After the above fail, I tried from scratch with your below patch,
> > and I still get
> >
> > $ grep REPLACE_FNMATCH ./config.status
> > S["REPLACE_FNMATCH"]="1"
> >
> > Even though
> >
> > $ grep fnmatch log1
> > checking for fnmatch.h... yes
> > checking for fnmatch... yes
> > checking for working POSIX fnmatch... yes
> >
> > I'm quite puzzled.
>
> It's sometimes hard to make incremental changes to generated files of the
> GNU Build System plus Gnulib. I've therefore recreated a new tarball for you,
> at https://haible.de/bruno/gnu/testdir-fnmatch.tar.gz .
Thanks, I'll use that for testing later today.
Corinna
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements
2023-07-28 11:12 ` fnmatch improvements Corinna Vinschen
@ 2023-07-28 11:22 ` Bruno Haible
2023-07-28 21:42 ` Bill Stewart
1 sibling, 0 replies; 32+ messages in thread
From: Bruno Haible @ 2023-07-28 11:22 UTC (permalink / raw)
To: Corinna Vinschen, Bruno Haible
Corinna Vinschen wrote:
> I'm puzzled because I'm an idiot. I forgot autoreconf.
Things like that happen to me as well. There are so many
generation phases (collect *.m4 files; autoconf; configure;
make) that it's easy to forget one when making incremental changes.
It's more reliable to regenerate the testdir from scratch.
Bruno
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements
2023-07-28 10:56 ` Bruno Haible
2023-07-28 11:14 ` Corinna Vinschen
@ 2023-07-28 18:59 ` Corinna Vinschen
2023-07-28 19:33 ` Bruno Haible
2023-07-28 19:54 ` GB18030 locale Bruno Haible
1 sibling, 2 replies; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-28 18:59 UTC (permalink / raw)
To: Bruno Haible; +Cc: cygwin
On Jul 28 12:56, Bruno Haible via Cygwin wrote:
> It's sometimes hard to make incremental changes to generated files of the
> GNU Build System plus Gnulib. I've therefore recreated a new tarball for you,
> at https://haible.de/bruno/gnu/testdir-fnmatch.tar.gz .
>
> The expected result is:
> 1. cd testdir-fnmatch-posix
> ./configure
> grep REPLACE_FNMATCH config.status
> (Expected: REPLACE_FNMATCH is 0)
$ grep REPLACE_FNMATCH config.status
S["REPLACE_FNMATCH"]="0"
> make
> make check
> (Expected: No test failures)
# TOTAL: 218
# PASS: 178
# SKIP: 40
# XFAIL: 0
# FAIL: 0
# XPASS: 0
# ERROR: 0
test-fnmatch-5.sh is SKIPped because we don't support zh_CN.GB18030.
> cd ..
> 2. cd testdir-fnmatch-gnu
> ./configure
> grep REPLACE_FNMATCH config.status
> (Expected: REPLACE_FNMATCH is 1, because of FNM_EXTMATCH)
$ grep REPLACE_FNMATCH config.status
S["REPLACE_FNMATCH"]="1"
> make
> make check
> (Expected: No test failures)
# TOTAL: 218
# PASS: 178
# SKIP: 40
# XFAIL: 0
# FAIL: 0
# XPASS: 0
# ERROR: 0
Same SKIP of test-fnmatch-5.sh.
Does that look ok?
Thanks,
Corinna
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements
2023-07-28 18:59 ` Corinna Vinschen
@ 2023-07-28 19:33 ` Bruno Haible
2023-07-28 19:54 ` GB18030 locale Bruno Haible
1 sibling, 0 replies; 32+ messages in thread
From: Bruno Haible @ 2023-07-28 19:33 UTC (permalink / raw)
To: cygwin
Corinna Vinschen wrote:
> > 1. cd testdir-fnmatch-posix
> > ./configure
> > grep REPLACE_FNMATCH config.status
> > (Expected: REPLACE_FNMATCH is 0)
>
> $ grep REPLACE_FNMATCH config.status
> S["REPLACE_FNMATCH"]="0"
>
> > make
> > make check
> > (Expected: No test failures)
>
> # TOTAL: 218
> # PASS: 178
> # SKIP: 40
> # XFAIL: 0
> # FAIL: 0
> # XPASS: 0
> # ERROR: 0
>
> test-fnmatch-5.sh is SKIPped because we don't support zh_CN.GB18030.
>
> > cd ..
> > 2. cd testdir-fnmatch-gnu
> > ./configure
> > grep REPLACE_FNMATCH config.status
> > (Expected: REPLACE_FNMATCH is 1, because of FNM_EXTMATCH)
>
> $ grep REPLACE_FNMATCH config.status
> S["REPLACE_FNMATCH"]="1"
>
> > make
> > make check
> > (Expected: No test failures)
>
> # TOTAL: 218
> # PASS: 178
> # SKIP: 40
> # XFAIL: 0
> # FAIL: 0
> # XPASS: 0
> # ERROR: 0
>
> Same SKIP of test-fnmatch-5.sh.
>
> Does that look ok?
Yes, that's all OK and as expected. I'll commit the fnmatch.m4 patch today.
When the user asks for an fnmatch() with FNM_EXTMATCH support, they will get
the Gnulib fnmatch(), as it supports these GNU extensions. I'll think about
how to make [=X=] and [.X.] work in this case too...
Thank you for your constructive cooperation!
Bruno
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GB18030 locale
2023-07-28 18:59 ` Corinna Vinschen
2023-07-28 19:33 ` Bruno Haible
@ 2023-07-28 19:54 ` Bruno Haible
2023-07-29 9:23 ` Corinna Vinschen
1 sibling, 1 reply; 32+ messages in thread
From: Bruno Haible @ 2023-07-28 19:54 UTC (permalink / raw)
To: cygwin
Corinna Vinschen wrote:
> test-fnmatch-5.sh is SKIPped because we don't support zh_CN.GB18030.
Hmm? When I read winsup/cygwin/release/3.5.0 and the commit
5da71b6059956a8f20a6be02e82867aa28aa3880, it seems the zh_CN.GB18030
locale (which on native Windows is called "Chinese_China.54936")
should be supported.
The Gnulib code which determines whether this locale is supported
is in m4/locale-zh.m4. Why does the
"checking for a transitional chinese locale..." test fail on
your system, when you call it as
LC_ALL=zh_CN.GB18030 LC_TIME= LC_CTYPE= ./conftest
?
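A quick check that can be done by hand, independent of the configure test, is to ask setlocale() directly. This sketch is not the test program from m4/locale-zh.m4, and the helper name is hypothetical.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: return the codeset name reported for `locale`,
   or NULL if setlocale() rejects the locale name. The returned string
   belongs to the C library and may be overwritten by later calls. */
const char *codeset_of(const char *locale)
{
    if (setlocale(LC_ALL, locale) == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}
```

If codeset_of ("zh_CN.GB18030") returns NULL, the locale name itself is being rejected. A non-NULL result would suggest that the failure lies elsewhere in the configure test.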
Bruno
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements
2023-07-28 11:12 ` fnmatch improvements Corinna Vinschen
2023-07-28 11:22 ` Bruno Haible
@ 2023-07-28 21:42 ` Bill Stewart
1 sibling, 0 replies; 32+ messages in thread
From: Bill Stewart @ 2023-07-28 21:42 UTC (permalink / raw)
To: cygwin
[-- Attachment #1: Type: text/plain, Size: 150 bytes --]
On Fri, Jul 28, 2023 at 5:12 AM Corinna Vinschen wrote:
I'm puzzled because I'm an idiot.
>
That's one thing you certainly are not.
Bill
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GB18030 locale
2023-07-28 19:54 ` GB18030 locale Bruno Haible
@ 2023-07-29 9:23 ` Corinna Vinschen
2023-07-29 9:53 ` Bruno Haible
0 siblings, 1 reply; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-29 9:23 UTC (permalink / raw)
To: Bruno Haible; +Cc: cygwin
On Jul 28 21:54, Bruno Haible via Cygwin wrote:
> Corinna Vinschen wrote:
> > test-fnmatch-5.sh is SKIPped because we don't support zh_CN.GB18030.
>
> Hmm? When I read winsup/cygwin/release/3.5.0 and the commit
> 5da71b6059956a8f20a6be02e82867aa28aa3880, it seems the zh_CN.GB18030
> locale (which on native Windows is called "Chinese_China.54936")
> should be supported.
You're right, I always had the idea to add GB18030 support and forgot
that I supposedly did that in 5da71b605995 ("Cygwin: add support for
GB18030 codeset"), sorry.
However, on debugging this, I see it's totally broken. Trying to fix
this in the existing functions is futile. We need dedicated
support functions for GB18030, kind of like the FreeBSD functions,
just with extra support for surrogate pairs, as with our UTF8 stuff.
Corinna
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GB18030 locale
2023-07-29 9:23 ` Corinna Vinschen
@ 2023-07-29 9:53 ` Bruno Haible
2023-07-31 10:07 ` Corinna Vinschen
0 siblings, 1 reply; 32+ messages in thread
From: Bruno Haible @ 2023-07-29 9:53 UTC (permalink / raw)
To: cygwin
Corinna Vinschen wrote:
> However, on debugging this, I see it's totally broken. Trying to fix
> this in the existing functions is futile. We need dedicated
> support functions for GB18030, kind of like the FreeBSD functions,
> just with extra support for surrogate pairs, as with our UTF8 stuff.
In case it helps: Find here a test suite for the various multibyte
functions with GB18030 specific test cases. (Extracted from gnulib.)
https://haible.de/bruno/gnu/testdir-gb18030.tar.gz
Bruno
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GB18030 locale
2023-07-29 9:53 ` Bruno Haible
@ 2023-07-31 10:07 ` Corinna Vinschen
2023-07-31 13:38 ` Corinna Vinschen
0 siblings, 1 reply; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-31 10:07 UTC (permalink / raw)
To: Bruno Haible; +Cc: cygwin
On Jul 29 11:53, Bruno Haible via Cygwin wrote:
> Corinna Vinschen wrote:
> > However, on debugging this, I see it's totally broken. Trying to fix
> > this in the existing functions is futile. We need dedicated
> > support functions for GB18030, kind of like the FreeBSD functions,
> > just with extra support for surrogate pairs, as with our UTF8 stuff.
>
> In case it helps: Find here a test suite for the various multibyte
> functions with GB18030 specific test cases. (Extracted from gnulib.)
> https://haible.de/bruno/gnu/testdir-gb18030.tar.gz
Thank you, I'm already hacking and testing :)
Corinna
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GB18030 locale
2023-07-31 10:07 ` Corinna Vinschen
@ 2023-07-31 13:38 ` Corinna Vinschen
2023-07-31 14:06 ` character class "alpha" Bruno Haible
0 siblings, 1 reply; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-31 13:38 UTC (permalink / raw)
To: Bruno Haible; +Cc: cygwin
Hi Bruno,
On Jul 31 12:07, Corinna Vinschen via Cygwin wrote:
> On Jul 29 11:53, Bruno Haible via Cygwin wrote:
> > Corinna Vinschen wrote:
> > > However, on debugging this, I see it's totally broken. Trying to fix
> > > this in the existing functions is futile. We need dedicated
> > > support functions for GB18030, kind of like the FreeBSD functions,
> > > just with extra support for surrogate pairs, as with our UTF8 stuff.
> >
> > In case it helps: Find here a test suite for the various multibyte
> > functions with GB18030 specific test cases. (Extracted from gnulib.)
> > https://haible.de/bruno/gnu/testdir-gb18030.tar.gz
>
> Thank you, I'm already hacking and testing :)
I have a problem with the c32isalpha function.
c32isalpha fails for the character U+FF11 FULLWIDTH DIGIT ONE,
because it expects the character to be an alphabetic character.
The Cygwin unicode information is automatically generated from the
Unicode data file UnicodeData.txt, fresh from their homepage. iswalpha
in newlib is checking for the Unicode categories, using the expression:
return cat == CAT_LC || cat == CAT_Lu || cat == CAT_Ll || cat == CAT_Lt
|| cat == CAT_Lm || cat == CAT_Lo
|| cat == CAT_Nl // Letter_Number
;
with CAT_foo being equivalent to Unicode category foo.
Per UnicodeData.txt, ff11 is of category Nd, so it's a digit, not an
alphabetic character.
I see that Glibc returns 1 from c32isalpha for U+FF11, but I don't see
where it gets that info from or why this is correct. Can you point me to
some info on this?
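A small probe along these lines can be used to compare libcs. It is only a sketch: the name classify_wchar and the CLS_* flags are made up for this example, and the result for U+FF11 is deliberately not stated because it depends on the libc and the locale in effect.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical flag names, used only in this sketch. */
enum { CLS_ALPHA = 1, CLS_DIGIT = 2, CLS_ALNUM = 4 };

/* Report which of the alpha/digit/alnum classes C belongs to in the
   current locale, as a bitmask of the CLS_* flags. */
unsigned
classify_wchar (wint_t c)
{
  unsigned r = 0;

  if (iswalpha (c))
    r |= CLS_ALPHA;
  if (iswdigit (c))
    r |= CLS_DIGIT;
  if (iswalnum (c))
    r |= CLS_ALNUM;
  return r;
}
```

Calling setlocale (LC_ALL, "") first and then classify_wchar (0xFF11) shows how a given platform treats FULLWIDTH DIGIT ONE.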
Thanks,
Corinna
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: character class "alpha"
2023-07-31 13:38 ` Corinna Vinschen
@ 2023-07-31 14:06 ` Bruno Haible
2023-07-31 17:46 ` Corinna Vinschen
0 siblings, 1 reply; 32+ messages in thread
From: Bruno Haible @ 2023-07-31 14:06 UTC (permalink / raw)
To: cygwin
Corinna Vinschen wrote:
> I have a problem with the c32isalpha function.
>
> c32isalpha fails for the character U+FF11 FULLWIDTH DIGIT ONE,
> because it expects the character to be an alphabetic character.
This is not a big problem. You can see in the test-c32isalpha.c file
that this test is disabled for many platforms, in particular glibc.
There's no problem with disabling it on Cygwin as well.
> The Cygwin unicode information is automatically generated from the
> Unicode data file UnicodeData.txt, fresh from their homepage. iswalpha
> in newlib is checking for the Unicode categories, using the expression:
>
> return cat == CAT_LC || cat == CAT_Lu || cat == CAT_Ll || cat == CAT_Lt
> || cat == CAT_Lm || cat == CAT_Lo
> || cat == CAT_Nl // Letter_Number
> ;
>
> with CAT_foo being equivalent to Unicode category foo.
>
> Per UnicodeData.txt, ff11 is of category Nd, so it's a digit, not an
> alphabetic character.
This is not wrong. However, see the comments in the generator of the
gnulib tables:
https://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/gen-uni-tables.c;h=0dceedc06cd72f886807fd575a2c4dba99cd147a;hb=HEAD#l5789
/* Consider all the non-ASCII digits as alphabetic.
ISO C 99 forbids us to have them in category "digit",
but we want iswalnum to return true on them. */
Likewise in the generator of the glibc tables:
https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/unicode-gen/unicode_utils.py;h=5af03113a2f1f063769752ea426fcaf6f6ba9e95;hb=HEAD#l274
The original comment (from 2000) was:
/* SUSV2 gives us some freedom for the "digit" category, but ISO C 99
takes it away:
7.25.2.1.5:
The iswdigit function tests for any wide character that corresponds
to a decimal-digit character (as defined in 5.2.1).
5.2.1:
the 10 decimal digits 0 1 2 3 4 5 6 7 8 9
*/
return (ch >= 0x0030 && ch <= 0x0039);
The question is: In which category do you put these non-ASCII digits?
"print" and "graph", sure. But other than that? "punct" or "alnum"?
"punct" seems wrong. If you, like me, decide to put them in "alnum",
then they need to be in "alpha" or "digit" (per POSIX
https://pubs.opengroup.org/onlinepubs/9699919799/functions/iswalnum.html ).
But ISO C 23 § 7.4.1.5 + § 5.2.1 does not allow them in category "digit".
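The constraint can be sketched as follows. The sketch_* functions are hypothetical stand-ins, not the glibc or gnulib tables: sketch_iswalpha only knows ASCII letters plus the fullwidth digits U+FF10..U+FF19, which is enough to show why a non-ASCII digit ends up in "alpha" once "digit" is closed and "alnum" must be the union of the two.

```c
#include <wctype.h>

/* ISO C restricts "digit" to the ten ASCII decimal digits. */
int
sketch_iswdigit (wint_t ch)
{
  return ch >= 0x0030 && ch <= 0x0039;
}

/* Illustrative stand-in for a real table lookup: ASCII letters, plus
   the fullwidth digits U+FF10..U+FF19, placed in "alpha" because they
   cannot go into "digit".  (Assumption for this sketch only.) */
int
sketch_iswalpha (wint_t ch)
{
  return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z')
         || (ch >= 0xFF10 && ch <= 0xFF19);
}

/* POSIX: "alnum" is exactly "alpha" or "digit". */
int
sketch_iswalnum (wint_t ch)
{
  return sketch_iswalpha (ch) || sketch_iswdigit (ch);
}
```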
Bruno
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: character class "alpha"
2023-07-31 14:06 ` character class "alpha" Bruno Haible
@ 2023-07-31 17:46 ` Corinna Vinschen
2023-07-31 18:20 ` Corinna Vinschen
0 siblings, 1 reply; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-31 17:46 UTC (permalink / raw)
To: Bruno Haible; +Cc: cygwin
On Jul 31 16:06, Bruno Haible via Cygwin wrote:
> Corinna Vinschen wrote:
> > I have a problem with the c32isalpha function.
> >
> > c32isalpha fails for the character U+FF11 FULLWIDTH DIGIT ONE,
> > because it expects the character to be an alphabetic character.
>
> This is not a big problem. You can see in the test-c32isalpha.c file
> that this test is disabled for many platforms, in particular glibc.
Which is interesting, because I actually tried that today on glibc, and
for iswalpha (0xff11) it returns 1. So it actually behaves as the
testcase expects.
> There's no problem with disabling it on Cygwin as well.
I'd rather make Cygwin do the same as glibc.
> > The Cygwin unicode information is automatically generated from the
> > Unicode data file UnicodeData.txt, fresh from their homepage. iswalpha
> > in newlib is checking for the Unicode categories, using the expression:
> >
> > return cat == CAT_LC || cat == CAT_Lu || cat == CAT_Ll || cat == CAT_Lt
> > || cat == CAT_Lm || cat == CAT_Lo
> > || cat == CAT_Nl // Letter_Number
> > ;
> >
> > with CAT_foo being equivalent to Unicode category foo.
> >
> > Per UnicodeData.txt, ff11 is of category Nd, so it's a digit, not an
> > alphabetic character.
>
> This is not wrong. However, see the comments in the generator of the
> gnulib tables:
>
> https://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/gen-uni-tables.c;h=0dceedc06cd72f886807fd575a2c4dba99cd147a;hb=HEAD#l5789
>
> /* Consider all the non-ASCII digits as alphabetic.
> ISO C 99 forbids us to have them in category "digit",
> but we want iswalnum to return true on them. */
>
> Likewise in the generator of the glibc tables:
>
> https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/unicode-gen/unicode_utils.py;h=5af03113a2f1f063769752ea426fcaf6f6ba9e95;hb=HEAD#l274
>
> The original comment (from 2000) was:
>
> /* SUSV2 gives us some freedom for the "digit" category, but ISO C 99
> takes it away:
> 7.25.2.1.5:
> The iswdigit function tests for any wide character that corresponds
> to a decimal-digit character (as defined in 5.2.1).
> 5.2.1:
> the 10 decimal digits 0 1 2 3 4 5 6 7 8 9
> */
> return (ch >= 0x0030 && ch <= 0x0039);
>
> The question is: In which category do you put these non-ASCII digits?
> "print" and "graph", sure. But other than that? "punct" or "alnum"?
> "punct" seems wrong. If you, like me, decide to put them in "alnum",
> then they need to be in "alpha" or "digit" (per POSIX
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/iswalnum.html ).
> But ISO C 23 § 7.4.1.5 + § 5.2.1 does not allow them in category "digit".
Thanks for the description. It was clear to me that they don't belong
in the ISO C digit category, but other than that...
So, if we change the expression in iswalpha_l to something like
return cat == CAT_LC || cat == CAT_Lu || cat == CAT_Ll || cat == CAT_Lt
|| cat == CAT_Lm || cat == CAT_Lo
|| cat == CAT_Nl // Letter_Number
/* Also all digits not allowed to be called digits per ISO C 99 */
|| (cat == CAT_Nd && !(c >= (wint_t)'0' && c <= (wint_t)'9'))
;
we're good?
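As a self-contained sketch of that expression: the enum below is a hypothetical subset of the Unicode general categories, and the caller passes the category in directly, since newlib's real table lookup is not reproduced here.

```c
#include <wchar.h>

/* Hypothetical subset of Unicode general categories for this sketch;
   CAT_Po is included only as an example of a non-alphabetic category. */
enum category
{
  CAT_LC, CAT_Lu, CAT_Ll, CAT_Lt, CAT_Lm, CAT_Lo,
  CAT_Nl, CAT_Nd, CAT_Po
};

/* Decide "alpha" from the general category CAT of character C,
   treating every decimal digit except ASCII '0'..'9' as alphabetic. */
int
alpha_from_category (enum category cat, wint_t c)
{
  return cat == CAT_LC || cat == CAT_Lu || cat == CAT_Ll || cat == CAT_Lt
         || cat == CAT_Lm || cat == CAT_Lo
         || cat == CAT_Nl /* Letter_Number */
         /* Also all digits not allowed to be called digits per ISO C 99 */
         || (cat == CAT_Nd && !(c >= (wint_t) '0' && c <= (wint_t) '9'));
}
```

With this, U+FF11 (category Nd) counts as alphabetic while ASCII '1' (also Nd) does not.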
Thanks,
Corinna
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: character class "alpha"
2023-07-31 17:46 ` Corinna Vinschen
@ 2023-07-31 18:20 ` Corinna Vinschen
2023-07-31 18:43 ` Bruno Haible
0 siblings, 1 reply; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-31 18:20 UTC (permalink / raw)
To: Bruno Haible; +Cc: cygwin
On Jul 31 19:46, Corinna Vinschen via Cygwin wrote:
> On Jul 31 16:06, Bruno Haible via Cygwin wrote:
> > Corinna Vinschen wrote:
> > > I have a problem with the c32isalpha function.
> > >
> > > c32isalpha fails for the character U+FF11 FULLWIDTH DIGIT ONE,
> > > because it expects the character to be an alphabetic character.
> >
> > This is not a big problem. You can see in the test-c32isalpha.c file
> > that this test is disabled for many platforms, in particular glibc.
>
> Which is interesting, because I actually tried that today on glibc, and
> for iswalpha (0xff11) it returns 1. So it actually behaves as the
> testcase expects.
>
> > There's no problem with disabling it on Cygwin as well.
>
> I'd rather make Cygwin do the same as glibc.
Hmm, there are more of those expressions which are disabled on glibc and
fail on Cygwin, for instance in test-c32iscntrl.c. Maybe it's actually
the better idea to disable them on Cygwin, too, rather than to change
a working system...
Corinna
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: character class "alpha"
2023-07-31 18:20 ` Corinna Vinschen
@ 2023-07-31 18:43 ` Bruno Haible
2023-07-31 21:12 ` Corinna Vinschen
2023-07-31 21:13 ` Brian Inglis
0 siblings, 2 replies; 32+ messages in thread
From: Bruno Haible @ 2023-07-31 18:43 UTC (permalink / raw)
To: cygwin
Corinna Vinschen wrote:
> there are more of those expressions which are disabled on glibc and
> fail on Cygwin, for instance in test-c32iscntrl.c. Maybe it's actually
> the better idea to disable them on Cygwin, too, rather than to change
> a working system...
Sure. There is no standard for how to map the Unicode properties to POSIX
character classes. Other than the mentioned ISO C constraints for
'digit' and 'xdigit' and a few POSIX constraints, you are free to
map them as you like. For glibc and gnulib, I mapped them in a way
that seemed to make most sense for applications. But different
people might come to different meanings of "make sense".
Bruno
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: character class "alpha"
2023-07-31 18:43 ` Bruno Haible
@ 2023-07-31 21:12 ` Corinna Vinschen
2023-08-01 16:29 ` Brian Inglis
2023-07-31 21:13 ` Brian Inglis
1 sibling, 1 reply; 32+ messages in thread
From: Corinna Vinschen @ 2023-07-31 21:12 UTC (permalink / raw)
To: Bruno Haible; +Cc: cygwin
Hi Bruno,
On Jul 31 20:43, Bruno Haible via Cygwin wrote:
> Corinna Vinschen wrote:
> > there are more of those expressions which are disabled on glibc and
> > fail on Cygwin, for instance in test-c32iscntrl.c. Maybe it's actually
> > the better idea to disable them on Cygwin, too, rather than to change
> > a working system...
>
> Sure. There is no standard how to map the Unicode properties to POSIX
> character classes. Other than the mentioned ISO C constraints for
> 'digit' and 'xdigit' and a few POSIX constraints, you are free to
> map them as you like. For glibc and gnulib, I mapped them in a way
> that seemed to make most sense for applications. But different
> people might come to different meanings of "make sense".
Ok, so I just pushed a patchset to Cygwin git, which should make GB18030
support actually work.
Also, the C11 functions c16rtomb, c32rtomb, mbrtoc16, mbrtoc32 are now
implemented in Cygwin and a uchar.h header exists now, too.
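For reference, a minimal sketch of how application code might use one of these functions. The helper name count_chars is made up for this example; it decodes according to whatever locale is currently set.

```c
#include <string.h>
#include <uchar.h>

/* Count the characters in the multibyte string S, decoded with
   mbrtoc32 according to the current locale.  Returns (size_t) -1 if S
   contains an invalid or truncated sequence. */
size_t
count_chars (const char *s)
{
  mbstate_t state;
  size_t len = strlen (s);
  size_t count = 0;

  memset (&state, 0, sizeof state); /* initial conversion state */
  while (len > 0)
    {
      char32_t c;
      size_t ret = mbrtoc32 (&c, s, len, &state);

      if (ret == (size_t) -1 || ret == (size_t) -2)
        return (size_t) -1;
      if (ret == (size_t) -3)
        {
          /* A further char32_t from earlier input; no bytes consumed. */
          count++;
          continue;
        }
      if (ret == 0)
        break; /* null character; not expected within strlen (s) bytes */
      s += ret;
      len -= ret;
      count++;
    }
  return count;
}
```

Unlike a loop over mbrtowc, this does not have to deal with surrogate pairs, even where wchar_t is only 16 bits wide.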
Assuming all gnulib tests disabled for GLibc in
test-c32isalpha.c
test-c32iscntrl.c
test-c32isprint.c
test-c32isgraph.c
test-c32ispunct.c
test-c32islower.c
will be disabled for Cygwin as well, all gb18030 and c32 tests in gnulib
work as desired now.
Thanks for your input and help!
Corinna
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: character class "alpha"
2023-07-31 18:43 ` Bruno Haible
2023-07-31 21:12 ` Corinna Vinschen
@ 2023-07-31 21:13 ` Brian Inglis
2023-07-31 21:37 ` Bruno Haible
1 sibling, 1 reply; 32+ messages in thread
From: Brian Inglis @ 2023-07-31 21:13 UTC (permalink / raw)
To: cygwin; +Cc: Bruno Haible
On 2023-07-31 12:43, Bruno Haible via Cygwin wrote:
> Corinna Vinschen wrote:
>> there are more of those expressions which are disabled on glibc and
>> fail on Cygwin, for instance in test-c32iscntrl.c. Maybe it's actually
>> the better idea to disable them on Cygwin, too, rather than to change
>> a working system...
>
> Sure. There is no standard how to map the Unicode properties to POSIX
> character classes. Other than the mentioned ISO C constraints for
> 'digit' and 'xdigit' and a few POSIX constraints, you are free to
> map them as you like. For glibc and gnulib, I mapped them in a way
> that seemed to make most sense for applications. But different
> people might come to different meanings of "make sense".
It seems to me that most application developers needing to support
non-Western-European languages might want a non-POSIX interpretation of digits.
Are the Unicode character attribute classes supported for those application use
cases that need more than POSIX limitations allow?
I know that I sometimes want to see some alternative numeric digit forms and
expect to be able to find those with an appropriate grep expression.
--
Take care. Thanks, Brian Inglis Calgary, Alberta, Canada
La perfection est atteinte Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer but when there is no more to cut
-- Antoine de Saint-Exupéry
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: character class "alpha"
2023-07-31 21:13 ` Brian Inglis
@ 2023-07-31 21:37 ` Bruno Haible
0 siblings, 0 replies; 32+ messages in thread
From: Bruno Haible @ 2023-07-31 21:37 UTC (permalink / raw)
To: cygwin, Brian Inglis
Brian Inglis wrote:
> It seems to me that most application developers needing to support
> non-Western-European languages might want a non-POSIX interpretation of digits.
Sure. GNU libunistring has a dedicated API for this:
- https://www.gnu.org/software/libunistring/manual/html_node/Object-oriented-API.html
UC_DECIMAL_DIGIT_NUMBER.
- https://www.gnu.org/software/libunistring/manual/html_node/Decimal-digit-value.html
- https://www.gnu.org/software/libunistring/manual/html_node/Digit-value.html
- https://www.gnu.org/software/libunistring/manual/html_node/Properties-as-objects.html
UC_PROPERTY_DECIMAL_DIGIT
- https://www.gnu.org/software/libunistring/manual/html_node/Properties-as-functions.html
uc_is_property_decimal_digit
I'm sure ICU4C has similar APIs too.
> Are the Unicode character attribute classes supported for those application use
> cases that need more than POSIX limitations allow?
POSIX allows the libc to define additional character classes. But these will be
platform and locale dependent, and I don't know of any application which makes
use of such additional character classes via wctype() and iswctype().
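For completeness, querying a class by name looks roughly like this. It is a sketch using only the standard wctype() and iswctype(); the wrapper name in_class is made up.

```c
#include <wctype.h>

/* Test WC against the character class NAME in the current locale.
   Returns 1 or 0, or -1 if the locale does not define a class with
   that name. */
int
in_class (wint_t wc, const char *name)
{
  wctype_t desc = wctype (name);

  if (desc == (wctype_t) 0)
    return -1;
  return iswctype (wc, desc) != 0;
}
```

The standard names such as "digit" and "alpha" always work; any other name has to be checked at run time, which is what makes such extra classes awkward for portable code.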
> I know that I sometimes want to see some alternative numeric digit forms and
> expect to be able to find those with an appropriate grep expression.
I think you can do so with GNU 'grep', when it was built with PCRE support.
PCRE includes support for Unicode character classes.
<https://www.pcre.org/current/doc/html/pcre2pattern.html>
Bruno
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: character class "alpha"
2023-07-31 21:12 ` Corinna Vinschen
@ 2023-08-01 16:29 ` Brian Inglis
2023-08-02 7:56 ` Corinna Vinschen
0 siblings, 1 reply; 32+ messages in thread
From: Brian Inglis @ 2023-08-01 16:29 UTC (permalink / raw)
To: cygwin; +Cc: Bruno Haible
On 2023-07-31 15:12, Corinna Vinschen via Cygwin wrote:
> Hi Bruno,
>
> On Jul 31 20:43, Bruno Haible via Cygwin wrote:
>> Corinna Vinschen wrote:
>>> there are more of those expressions which are disabled on glibc and
>>> fail on Cygwin, for instance in test-c32iscntrl.c. Maybe it's actually
>>> the better idea to disable them on Cygwin, too, rather than to change
>>> a working system...
>>
>> Sure. There is no standard how to map the Unicode properties to POSIX
>> character classes. Other than the mentioned ISO C constraints for
>> 'digit' and 'xdigit' and a few POSIX constraints, you are free to
>> map them as you like. For glibc and gnulib, I mapped them in a way
>> that seemed to make most sense for applications. But different
>> people might come to different meanings of "make sense".
>
> Ok, so I just pushed a patchset to Cygwin git, which should make GB18030
> support actually work.
>
> Also, the C11 functions c16rtomb, c32rtomb, mbrtoc16, mbrtoc32 are now
> implemented in Cygwin and a uchar.h header exists now, too.
>
> Assuming all gnulib tests disabled for GLibc in
>
> test-c32isalpha.c
> test-c32iscntrl.c
> test-c32isprint.c
> test-c32isgraph.c
> test-c32ispunct.c
> test-c32islower.c
>
> will be disabled for Cygwin as well, all gb18030 and c32 tests in gnulib
> work as desired now.
https://www.iso.org/standard/86539.html [ISO/IEC/IEEE 9945 CD]
Draft POSIX 2023 SUS V5 Issue 8 D3 CB2.1 proposes the following POSIX
Subprofiling Option Group: POSIX_C_LANG_UCHAR: ISO C Unicode Utilities.
https://www.iso.org/standard/82075.html [ISO/IEC 9899 DIS]
Draft Standard C 2023 is being voted on as of 2023-07-14, and if no technical
issues arise requiring tweaks, will become the new standard, in which Unicode
utilities <uchar.h> has some additions which you may wish to add; from:
https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3096.pdf#page=426
also:
https://en.cppreference.com/w/c/string/multibyte
https://en.cppreference.com/w/c/language/arithmetic_types
major additions (note November official standard publication date):
"7.30 Unicode utilities <uchar.h>
1 The header <uchar.h> declares one macro, a few types, and several functions
for manipulating Unicode characters.
2 The macro
__STDC_VERSION_UCHAR_H__
is an integer constant expression with a value equivalent to 202311L.
3 The types declared are mbstate_t (described in 7.31.1) and size_t (described
in 7.21);
char8_t
which is an unsigned integer type used for 8-bit characters and is the same type
as unsigned char;
...
7.30.1 Restartable multibyte/wide character conversion functions
...
2 When used in the functions in this subclause, the encoding of char8_t,
char16_t, and char32_t objects, and sequences of such objects, is UTF-8, UTF-16,
and UTF-32, respectively. Similarly, the encoding of char and wchar_t, and
sequences of such objects, is the execution and wide execution encodings
(6.2.9), respectively.
7.30.1.1 The mbrtoc8 function
Synopsis
1 #include <uchar.h>
size_t mbrtoc8(char8_t * restrict pc8, const char * restrict s,
size_t n, mbstate_t * restrict ps);
Description
2 If s is a null pointer, the mbrtoc8 function is equivalent to the call:
mbrtoc8(NULL, "", 1, ps)
In this case, the values of the parameters pc8 and n are ignored.
3 If s is not a null pointer, the mbrtoc8 function inspects at most n
bytes beginning with the byte pointed to by s to determine the number of bytes
needed to complete the next multibyte character (including any shift sequences).
If the function determines that the next multibyte character is complete and
valid, it determines the values of the corresponding characters and then, if pc8
is not a null pointer, stores the value of the first (or only) such character in
the object pointed to by pc8.
Subsequent calls will store successive characters without consuming any
additional input until all the characters have been stored. If the corresponding
character is the null character, the resulting state described is the initial
conversion state.
Returns
4 The mbrtoc8 function returns the first of the following that applies (given
the current conversion state):
0 if the next n or fewer bytes complete the multibyte character that corresponds
to the null character (which is the value stored).
between 1 and n inclusive if the next n or fewer bytes complete a valid
multibyte character (which is the value stored); the value returned is the
number of bytes that complete the multibyte character.
(size_t)(-3) if the next character resulting from a previous call has been
stored (no bytes from the input have been consumed by this call).
(size_t)(-2) if the next n bytes contribute to an incomplete (but potentially
valid) multibyte character, and all n bytes have been processed (no value is
stored).398)
(size_t)(-1) if an encoding error occurs, in which case the next n or fewer
bytes do not contribute to a complete and valid multibyte character (no value is
stored); the value of the macro EILSEQ is stored in errno, and the conversion
state is unspecified.
398)When n has at least the value of the MB_CUR_MAX macro, this case can only
occur if s points at a sequence of redundant
shift sequences (for implementations with state-dependent encodings).
7.30.1.2 The c8rtomb function
Synopsis
1 #include <uchar.h>
size_t c8rtomb(char * restrict s, char8_t c8, mbstate_t * restrict ps);
Description
2 If s is a null pointer, the c8rtomb function is equivalent to the call
c8rtomb(buf, u8’\0’, ps)
where buf is an internal buffer.
3 If s is not a null pointer, the c8rtomb function determines the number of
bytes needed to represent the multibyte character that corresponds to the
character given or completed by c8 (including any shift sequences), and stores
the multibyte character representation in the array whose first element is
pointed to by s, or stores nothing if c8 does not represent a complete
character. At most MB_CUR_MAX bytes are stored. If c8 is a null character, a
null byte is stored, preceded by any shift sequence needed to restore the
initial shift state; the resulting state described is the initial conversion state.
Returns
4 The c8rtomb function returns the number of bytes stored in the array object
(including any shift sequences). When c8 is not a valid character, an encoding
error occurs: the function stores the value of the macro EILSEQ in errno and
returns (size_t)(-1); the conversion state is unspecified.
..."
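The (size_t)(-3) return value exists because one multibyte character can correspond to up to four char8_t units, which mbrtoc8 hands out one per call. The expansion itself can be sketched like this; utf8_units is a hypothetical helper written for this illustration, not part of <uchar.h> and not the Cygwin implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode the Unicode code point CP as UTF-8 code units in OUT.
   Returns the number of units (1 to 4), or 0 if CP is not a valid
   Unicode scalar value.  Hypothetical helper for illustration. */
size_t
utf8_units (uint32_t cp, unsigned char out[4])
{
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  out[0] = (unsigned char) (0xF0 | (cp >> 18));
  out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
  out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
  out[3] = (unsigned char) (0x80 | (cp & 0x3F));
  return 4;
}
```

For a character that needs four units, a caller of mbrtoc8 would see one call that consumes the input bytes, followed by three calls returning (size_t)(-3).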
--
Take care. Thanks, Brian Inglis Calgary, Alberta, Canada
La perfection est atteinte Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer but when there is no more to cut
-- Antoine de Saint-Exupéry
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: character class "alpha"
2023-08-01 16:29 ` Brian Inglis
@ 2023-08-02 7:56 ` Corinna Vinschen
2023-08-02 15:06 ` Corinna Vinschen
0 siblings, 1 reply; 32+ messages in thread
From: Corinna Vinschen @ 2023-08-02 7:56 UTC (permalink / raw)
To: cygwin; +Cc: Brian Inglis, Bruno Haible
On Aug 1 10:29, Brian Inglis via Cygwin wrote:
> On 2023-07-31 15:12, Corinna Vinschen via Cygwin wrote:
> > Hi Bruno,
> >
> > On Jul 31 20:43, Bruno Haible via Cygwin wrote:
> > > Corinna Vinschen wrote:
> > > > there are more of those expressions which are disabled on glibc and
> > > > fail on Cygwin, for instance in test-c32iscntrl.c. Maybe it's actually
> > > > the better idea to disable them on Cygwin, too, rather than to change
> > > > a working system...
> > >
> > > Sure. There is no standard how to map the Unicode properties to POSIX
> > > character classes. Other than the mentioned ISO C constraints for
> > > 'digit' and 'xdigit' and a few POSIX constraints, you are free to
> > > map them as you like. For glibc and gnulib, I mapped them in a way
> > > that seemed to make most sense for applications. But different
> > > people might come to different meanings of "make sense".
> >
> > Ok, so I just pushed a patchset to Cygwin git, which should make GB18030
> > support actually work.
> >
> > Also, the C11 functions c16rtomb, c32rtomb, mbrtoc16, mbrtoc32 are now
> > implemented in Cygwin and a uchar.h header exists now, too.
> >
> > Assuming all gnulib tests disabled for GLibc in
> >
> > test-c32isalpha.c
> > test-c32iscntrl.c
> > test-c32isprint.c
> > test-c32isgraph.c
> > test-c32ispunct.c
> > test-c32islower.c
> >
> > will be disabled for Cygwin as well, all gb18030 and c32 tests in gnulib
> > work as desired now.
>
> https://www.iso.org/standard/86539.html [ISO/IEC/IEEE 9945 CD]
>
> Draft POSIX 2023 SUS V5 Issue 8 D3 CB2.1 proposes the following POSIX
> Subprofiling Option Group: POSIX_C_LANG_UCHAR: ISO C Unicode Utilities.
>
> https://www.iso.org/standard/82075.html [ISO/IEC 9899 DIS]
>
> Draft Standard C 2023 is being voted on as of 2023-07-14, and if no
> technical issues arise requiring tweaks, will become the new standard, in
> which Unicode utilities <uchar.h> has some additions which you may wish to
> add; from:
Maybe at some point, but nobody keeps you from sending patches :)
Corinna
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: character class "alpha"
2023-08-02 7:56 ` Corinna Vinschen
@ 2023-08-02 15:06 ` Corinna Vinschen
0 siblings, 0 replies; 32+ messages in thread
From: Corinna Vinschen @ 2023-08-02 15:06 UTC (permalink / raw)
To: Corinna Vinschen via Cygwin; +Cc: Brian Inglis, Bruno Haible
On Aug 2 09:56, Corinna Vinschen via Cygwin wrote:
> On Aug 1 10:29, Brian Inglis via Cygwin wrote:
> > On 2023-07-31 15:12, Corinna Vinschen via Cygwin wrote:
> > > Hi Bruno,
> > >
> > > On Jul 31 20:43, Bruno Haible via Cygwin wrote:
> > > > Corinna Vinschen wrote:
> > > > > there are more of those expressions which are disabled on glibc and
> > > > > fail on Cygwin, for instance in test-c32iscntrl.c. Maybe it's actually
> > > > > the better idea to disable them on Cygwin, too, rather than to change
> > > > > a working system...
> > > >
> > > > Sure. There is no standard how to map the Unicode properties to POSIX
> > > > character classes. Other than the mentioned ISO C constraints for
> > > > 'digit' and 'xdigit' and a few POSIX constraints, you are free to
> > > > map them as you like. For glibc and gnulib, I mapped them in a way
> > > > that seemed to make most sense for applications. But different
> > > > people might come to different meanings of "make sense".
> > >
> > > Ok, so I just pushed a patchset to Cygwin git, which should make GB18030
> > > support actually work.
> > >
> > > Also, the C11 functions c16rtomb, c32rtomb, mbrtoc16, mbrtoc32 are now
> > > implemented in Cygwin and a uchar.h header exists now, too.
> > >
> > > Assuming all gnulib tests disabled for GLibc in
> > >
> > > test-c32isalpha.c
> > > test-c32iscntrl.c
> > > test-c32isprint.c
> > > test-c32isgraph.c
> > > test-c32ispunct.c
> > > test-c32islower.c
> > >
> > > will be disabled for Cygwin as well, all gb18030 and c32 tests in gnulib
> > > work as desired now.
> >
> > https://www.iso.org/standard/86539.html [ISO/IEC/IEEE 9945 CD]
> >
> > Draft POSIX 2023 SUS V5 Issue 8 D3 CB2.1 proposes the following POSIX
> > Subprofiling Option Group: POSIX_C_LANG_UCHAR: ISO C Unicode Utilities.
> >
> > https://www.iso.org/standard/82075.html [ISO/IEC 9899 DIS]
> >
> > Draft Standard C 2023 is being voted on as of 2023-07-14, and if no
> > technical issues arise requiring tweaks, will become the new standard, in
> > which Unicode utilities <uchar.h> has some additions which you may wish to
> > add; from:
>
> Maybe at one point, but nobody keeps you from sending patches :)
Never mind, had a bit of time.
I fixed the uchar.h header and implemented c8rtomb and mbrtoc8.
Still needs testing. Does anybody know of an easily accessible
testsuite testing these functions?
However, I did not define __STDC_VERSION_UCHAR_H__ yet. I wasn't sure
my uchar.h is compliant, and Glibc doesn't define that macro yet,
either.
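Until the macro is defined, code that wants the C23 additions cannot detect them through it. A sketch of the check an application might make (the function name uchar_h_version is made up for this example):

```c
#include <uchar.h>

/* Return the value of __STDC_VERSION_UCHAR_H__ if <uchar.h> defines
   it, or 0 if it does not. */
long
uchar_h_version (void)
{
#ifdef __STDC_VERSION_UCHAR_H__
  return __STDC_VERSION_UCHAR_H__;
#else
  return 0L;
#endif
}
```

A result of 0 only says that the header does not advertise C23 conformance; the functions may still be present, so a configure-time check for mbrtoc8 and c8rtomb remains the safer test.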
Corinna
^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2023-08-02 15:06 UTC | newest]
Thread overview: 32+ messages
2023-07-27 10:15 fnmatch improvements Bruno Haible
2023-07-27 18:24 ` Corinna Vinschen
2023-07-27 19:05 ` Corinna Vinschen
2023-07-27 20:25 ` Brian Inglis
2023-07-27 21:22 ` Bruno Haible
2023-07-27 22:17 ` Brian Inglis
2023-07-28 9:00 ` Corinna Vinschen
2023-07-28 9:53 ` Corinna Vinschen
2023-07-27 21:40 ` Bruno Haible
2023-07-28 8:53 ` Corinna Vinschen
2023-07-28 10:56 ` Bruno Haible
2023-07-28 11:14 ` Corinna Vinschen
2023-07-28 18:59 ` Corinna Vinschen
2023-07-28 19:33 ` Bruno Haible
2023-07-28 19:54 ` GB18030 locale Bruno Haible
2023-07-29 9:23 ` Corinna Vinschen
2023-07-29 9:53 ` Bruno Haible
2023-07-31 10:07 ` Corinna Vinschen
2023-07-31 13:38 ` Corinna Vinschen
2023-07-31 14:06 ` character class "alpha" Bruno Haible
2023-07-31 17:46 ` Corinna Vinschen
2023-07-31 18:20 ` Corinna Vinschen
2023-07-31 18:43 ` Bruno Haible
2023-07-31 21:12 ` Corinna Vinschen
2023-08-01 16:29 ` Brian Inglis
2023-08-02 7:56 ` Corinna Vinschen
2023-08-02 15:06 ` Corinna Vinschen
2023-07-31 21:13 ` Brian Inglis
2023-07-31 21:37 ` Bruno Haible
2023-07-28 11:12 ` fnmatch improvements Corinna Vinschen
2023-07-28 11:22 ` Bruno Haible
2023-07-28 21:42 ` Bill Stewart