* fnmatch improvements
From: Bruno Haible @ 2023-07-27 10:15 UTC
To: cygwin

Hi,

Gnulib has, for the first time, an fnmatch() implementation that supports
characters outside the Unicode Basic Multilingual Plane (BMP), even on Cygwin
with its 16-bit wchar_t type. That is, in a UTF-8 locale, e.g.
  fnmatch ("x?y", "x\360\237\230\213y", 0)
now returns 0.

This implementation also implements GNU extensions, as documented in
https://www.gnu.org/software/libc/manual/html_node/Wildcard-Matching.html

Now, I see that in the Cygwin master branch the fnmatch implementation has
been improved, supposedly handling non-BMP characters and character classes
as well.

Therefore I would find it interesting to know whether the Cygwin 3.5.0 fnmatch()
still gets overridden by the gnulib one and, if not, whether it passes the
gnulib test suite.

I can't easily install a Cygwin 3.5.0 snapshot. If one of you would like to
help, here's how to:
  1. Create an environment for working with a Cygwin 3.5.0 snapshot (from
     March 2023 or newer).
  2. wget https://haible.de/bruno/gnu/testdir-fnmatch.tar.gz
  3. tar xfz testdir-fnmatch.tar.gz
  4. cd testdir-fnmatch-posix
     ./configure 2>&1 | tee log1
     make
     make check
     grep fnmatch log1
     grep REPLACE_FNMATCH config.status
     cd ..
  5. cd testdir-fnmatch-gnu
     ./configure 2>&1 | tee log1
     make
     make check
     grep fnmatch log1
     grep REPLACE_FNMATCH config.status
     cd ..
and provide the build and grep results.

Thanks!

Bruno
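Illustration (not part of the original message): the pattern "x?y" is hard to match on a platform with a 16-bit wchar_t because the byte sequence \360\237\230\213 is a single character, U+1F60B, that needs two UTF-16 units. The sketch below decodes the sequence and computes its surrogate pair. The helper names utf8_decode, utf16_units and utf16_surrogates are hypothetical and are not taken from gnulib or Cygwin; the decoder is deliberately minimal and does not reject overlong encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (at most n bytes) into *cp.
   Returns the number of bytes consumed, or 0 on truncated or malformed
   input. Hypothetical helper for illustration only; it does not reject
   overlong forms or encoded surrogates. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t c;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)                { len = 1; c = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else
        return 0;
    if (len > n)
        return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)   /* not a continuation byte */
            return 0;
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/* Number of 16-bit units a code point occupies in UTF-16. */
static int utf16_units(uint32_t cp)
{
    return cp >= 0x10000 ? 2 : 1;
}

/* Split a code point outside the BMP into its UTF-16 surrogate pair. */
static void utf16_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;
    *hi = (uint16_t)(0xD800 + (cp >> 10));
    *lo = (uint16_t)(0xDC00 + (cp & 0x3FF));
}
```

A matcher that steps through the string one wchar_t at a time sees two units where "?" must consume exactly one character, which is why working on 32-bit values internally avoids the problem.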
* Re: fnmatch improvements 2023-07-27 10:15 fnmatch improvements Bruno Haible @ 2023-07-27 18:24 ` Corinna Vinschen 2023-07-27 19:05 ` Corinna Vinschen 0 siblings, 1 reply; 32+ messages in thread From: Corinna Vinschen @ 2023-07-27 18:24 UTC (permalink / raw) To: Bruno Haible; +Cc: cygwin Hi Bruno, On Jul 27 12:15, Bruno Haible via Cygwin wrote: > Hi, > > Gnulib has, for the first time, an fnmatch() implementation that supports > characters outside the Unicode Basic Multilingual Plane (BMP), even on Cygwin > with its 16-bits wchar_t type. That is, in an UTF-8 locale, e.g. > fnmatch ("x?y", "x\360\237\230\213y", 0) > now returns 0. > > This implementation also implements GNU extensions, as documented in > https://www.gnu.org/software/libc/manual/html_node/Wildcard-Matching.html > > Now, I see that in the Cygwin master branch the fnmatch implementation has > been improved, supposedly handling non-BMP characters and character classes > as well. The major changes are using 32 bit unicode values internally and implementing collating symbols and equivalence class expressions. > Therefore I would find it interesting to know whether the Cygwin 3.5.0 fnmatch() > now still gets overridden by the gnulib one and, if no, whether it passes the > gnulib test suite. I'm looking into that. First thing, your testsuite uncovered a bug in the latest fnmatch in the C locale. Comparing pointers instead of comparing characters was never a good idea for pattern matching... When I'm done I hope that our 3.5 fnmatch won't be overridden by the gnulib version :} > I can't easily install a Cygwin 3.5.0 snapshot. If one of you would like to > help, here's how to: > 1. Create an environment for working with a Cygwin 3.5.0 snapshot (from > March 2023 or newer). > 2. wget https://haible.de/bruno/gnu/testdir-fnmatch.tar.gz > 3. tar xfz testdir-fnmatch.tar.gz > 4. 
cd testdir-fnmatch-posix > ./configure 2>&1 | tee log1 > make > make check > grep fnmatch log1 > grep REPLACE_FNMATCH config.status > cd .. > 5. cd testdir-fnmatch-gnu > ./configure 2>&1 | tee log1 > make > make check > grep fnmatch log1 > grep REPLACE_FNMATCH config.status > cd .. > and provide the build and grep results. > > Thanks! > > Bruno No worries, thanks for the testcases, I think I have some result tomorrow. Corinna ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements
From: Corinna Vinschen @ 2023-07-27 19:05 UTC
To: Corinna Vinschen via Cygwin; +Cc: Bruno Haible

On Jul 27 20:24, Corinna Vinschen via Cygwin wrote:
> On Jul 27 12:15, Bruno Haible via Cygwin wrote:
> I'm looking into that. First thing, your testsuite uncovered a bug in
> the latest fnmatch in the C locale. Comparing pointers instead of
> comparing characters was never a good idea for pattern matching...
>
> When I'm done I hope that our 3.5 fnmatch won't be overridden by the
> gnulib version :}
>
> > I can't easily install a Cygwin 3.5.0 snapshot. If one of you would like to
> > help, here's how to:
> > 1. Create an environment for working with a Cygwin 3.5.0 snapshot (from
> >    March 2023 or newer).
> > 2. wget https://haible.de/bruno/gnu/testdir-fnmatch.tar.gz
> > 3. tar xfz testdir-fnmatch.tar.gz
> > 4. cd testdir-fnmatch-posix
> >    ./configure 2>&1 | tee log1
> >    make
> >    make check

I fixed the above problem and the POSIX check now works fine:

> >    grep fnmatch log1

  checking for fnmatch.h... yes
  checking for fnmatch... yes
  checking for working POSIX fnmatch... yes

I also extracted the fnmatch configure testcase and ran it manually.
It returns 0 now. But:

> >    grep REPLACE_FNMATCH config.status

  S["REPLACE_FNMATCH"]="1"

Looks like the reason is that we don't have a uchar.h file? Seems
like this is of interest for AIX, but why should this be of
interest for fnmatch on other systems?

Thanks,
Corinna
* Re: fnmatch improvements 2023-07-27 19:05 ` Corinna Vinschen @ 2023-07-27 20:25 ` Brian Inglis 2023-07-27 21:22 ` Bruno Haible 2023-07-27 21:40 ` Bruno Haible 1 sibling, 1 reply; 32+ messages in thread From: Brian Inglis @ 2023-07-27 20:25 UTC (permalink / raw) To: Corinna Vinschen via Cygwin, Bruno Haible On 2023-07-27 13:05, Corinna Vinschen via Cygwin wrote: > On Jul 27 20:24, Corinna Vinschen via Cygwin wrote: >> On Jul 27 12:15, Bruno Haible via Cygwin wrote: >> I'm looking into that. First thing, your testsuite uncovered a bug in >> the latest fnmatch in the C locale. Comparing pointers instead of >> comparing characters was never a good idea for pattern matching... >> >> When I'm done I hope that our 3.5 fnmatch won't be overridden by the >> gnulib version :} >> >>> I can't easily install a Cygwin 3.5.0 snapshot. If one of you would like to >>> help, here's how to: >>> 1. Create an environment for working with a Cygwin 3.5.0 snapshot (from >>> March 2023 or newer). >>> 2. wget https://haible.de/bruno/gnu/testdir-fnmatch.tar.gz >>> 3. tar xfz testdir-fnmatch.tar.gz >>> 4. cd testdir-fnmatch-posix >>> ./configure 2>&1 | tee log1 >>> make >>> make check > > I fixed the above problem and the POSIX check now works fine: > >>> grep fnmatch log1 > > checking for fnmatch.h... yes > checking for fnmatch... yes > checking for working POSIX fnmatch... yes > > I also extraced the fnmatch configure testcase and ran it manually. > It returns 0 now. But: > >>> grep REPLACE_FNMATCH config.status > > S["REPLACE_FNMATCH"]="1" > > Looks like the reason is that we don't have a uchar.h file? Seems > like this is of interest for AIX, but why should this be of > interest for fnmatch on other systems? 
It was added in C99 TR19769, integrated in C/++11, available in libicu-devel: https://cplusplus.com/reference/cuchar/ https://open-std.org/jtc1/sc22/open/n3579.pdf https://open-std.org/jtc1/sc22/wg14/www/docs/n1326.pdf https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf#page=416 $ find /usr/include/ -name uchar.h /usr/include/unicode/uchar.h $ cygcheck -f /usr/include/unicode/uchar.h libicu-devel-72.1-1 -- Take care. Thanks, Brian Inglis Calgary, Alberta, Canada La perfection est atteinte Perfection is achieved non pas lorsqu'il n'y a plus rien à ajouter not when there is no more to add mais lorsqu'il n'y a plus rien à retirer but when there is no more to cut -- Antoine de Saint-Exupéry ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements 2023-07-27 20:25 ` Brian Inglis @ 2023-07-27 21:22 ` Bruno Haible 2023-07-27 22:17 ` Brian Inglis 0 siblings, 1 reply; 32+ messages in thread From: Bruno Haible @ 2023-07-27 21:22 UTC (permalink / raw) To: Brian.Inglis; +Cc: cygwin Brian Inglis wrote: > It was added in C99 TR19769, integrated in C/++11 Yes. > available in libicu-devel: > > https://cplusplus.com/reference/cuchar/ > > https://open-std.org/jtc1/sc22/open/n3579.pdf > > https://open-std.org/jtc1/sc22/wg14/www/docs/n1326.pdf > > https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf#page=416 > > $ find /usr/include/ -name uchar.h > /usr/include/unicode/uchar.h > > $ cygcheck -f /usr/include/unicode/uchar.h > libicu-devel-72.1-1 This file, <unicode/uchar.h> from ICU4C, is something completely different than ISO C's <uchar.h>. Bruno ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements 2023-07-27 21:22 ` Bruno Haible @ 2023-07-27 22:17 ` Brian Inglis 2023-07-28 9:00 ` Corinna Vinschen 0 siblings, 1 reply; 32+ messages in thread From: Brian Inglis @ 2023-07-27 22:17 UTC (permalink / raw) To: cygwin; +Cc: Bruno Haible On 2023-07-27 15:22, Bruno Haible wrote: > Brian Inglis wrote: >> It was added in C99 TR19769, integrated in C/++11 > > Yes. > >> available in libicu-devel: >> >> https://cplusplus.com/reference/cuchar/ >> >> https://open-std.org/jtc1/sc22/open/n3579.pdf >> >> https://open-std.org/jtc1/sc22/wg14/www/docs/n1326.pdf >> >> https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf#page=416 >> >> $ find /usr/include/ -name uchar.h >> /usr/include/unicode/uchar.h >> >> $ cygcheck -f /usr/include/unicode/uchar.h >> libicu-devel-72.1-1 > > This file, <unicode/uchar.h> from ICU4C, is something completely different than > ISO C's <uchar.h>. This would then be a *newlib* AT sourceware DOT org addition so we could use FreeBSD's: https://cgit.freebsd.org/src/blame/include/uchar.h?id=9f9d157d82e2332b74d9c45b596748e3e4691f2d plus consideration of: gnulib: https://git.savannah.gnu.org/gitweb/?p=gnulib.git&a=search&h=HEAD&st=commit&s=uchar.h and C2023 CD2: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3096.pdf there are only symbol formatting changes in N3148 comments and N3149 is a zip with a password protected PDF so likely FDIS! -- Take care. Thanks, Brian Inglis Calgary, Alberta, Canada La perfection est atteinte Perfection is achieved non pas lorsqu'il n'y a plus rien à ajouter not when there is no more to add mais lorsqu'il n'y a plus rien à retirer but when there is no more to cut -- Antoine de Saint-Exupéry ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements
From: Corinna Vinschen @ 2023-07-28 9:00 UTC
To: cygwin

On Jul 27 16:17, Brian Inglis via Cygwin wrote:
> On 2023-07-27 15:22, Bruno Haible wrote:
> > Brian Inglis wrote:
> > > It was added in C99 TR19769, integrated in C/++11
> >
> > Yes.
> >
> > > available in libicu-devel:
> > >
> > > https://cplusplus.com/reference/cuchar/
> > >
> > > https://open-std.org/jtc1/sc22/open/n3579.pdf
> > >
> > > https://open-std.org/jtc1/sc22/wg14/www/docs/n1326.pdf
> > >
> > > https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf#page=416
> > >
> > > $ find /usr/include/ -name uchar.h
> > > /usr/include/unicode/uchar.h
> > >
> > > $ cygcheck -f /usr/include/unicode/uchar.h
> > > libicu-devel-72.1-1
> >
> > This file, <unicode/uchar.h> from ICU4C, is something completely different than
> > ISO C's <uchar.h>.
>
> This would then be a *newlib* AT sourceware DOT org addition so we could use
> FreeBSD's:

We can use FreeBSD's version as a model, but we can't use the code
verbatim, given FreeBSD assumes sizeof(wchar_t) == 4.

Since that's a Cygwin-only issue (2 byte wchar_t, that is), I guess we
should merge the code into the Cygwin code base, rather than newlib.

For mbrtoc32/c32rtomb, we can use the wirtomb/mbrtowi functions I
introduced for the globbing code. If we do that, I think the functions
should actually be renamed accordingly and the globbing code should use
uchar32_t rather than wint_t.

Also, it might be helpful to add the mbrtoc8/c8rtomb extensions at one
point, which are missing in FreeBSD.

Either way, I'd be grateful for patches in this area.

Corinna
* Re: fnmatch improvements 2023-07-28 9:00 ` Corinna Vinschen @ 2023-07-28 9:53 ` Corinna Vinschen 0 siblings, 0 replies; 32+ messages in thread From: Corinna Vinschen @ 2023-07-28 9:53 UTC (permalink / raw) To: cygwin On Jul 28 11:00, Corinna Vinschen via Cygwin wrote: > If we do that, I think the functions > should actually be renamed accordingly and the globbing code should use > uchar32_t rather than wint_t. s/uchar32_t/char32_t/ ^ permalink raw reply [flat|nested] 32+ messages in thread
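Illustration (not part of the original messages): with a 2-byte wchar_t, code that wants to work on 32-bit values has to combine surrogate pairs while reading the wide string. The sketch below shows only that combining step; it is not Cygwin's mbrtowi/wirtomb code, and the names u32_t, utf16_next and utf16_count are hypothetical. A stand-in typedef is used instead of char32_t so the example does not depend on <uchar.h>, which the thread notes was missing on Cygwin at the time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for char32_t, so the sketch does not need <uchar.h>. */
typedef uint_least32_t u32_t;

/* Read one code point from a UTF-16 buffer holding n 16-bit units.
   Stores the code point in *out and returns the number of units
   consumed (1 or 2), or 0 if n is 0. An unpaired surrogate is passed
   through unchanged as a single unit. */
static size_t utf16_next(const uint16_t *s, size_t n, u32_t *out)
{
    if (n == 0)
        return 0;
    if (s[0] >= 0xD800 && s[0] <= 0xDBFF
        && n >= 2 && s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
        /* High surrogate followed by low surrogate: combine them. */
        *out = 0x10000
               + (((u32_t)(s[0] - 0xD800) << 10) | (u32_t)(s[1] - 0xDC00));
        return 2;
    }
    *out = s[0];
    return 1;
}

/* Count the characters (code points) in a UTF-16 buffer of n units. */
static size_t utf16_count(const uint16_t *s, size_t n)
{
    size_t count = 0, k;
    u32_t cp;

    assert(s != NULL || n == 0);
    while ((k = utf16_next(s, n, &cp)) != 0) {
        s += k;
        n -= k;
        count++;
    }
    return count;
}
```

For the string from the start of the thread, stored as the four units 'x', 0xD83D, 0xDE0B, 'y', utf16_count reports three characters, which is the view a pattern such as "x?y" needs.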
* Re: fnmatch improvements 2023-07-27 19:05 ` Corinna Vinschen 2023-07-27 20:25 ` Brian Inglis @ 2023-07-27 21:40 ` Bruno Haible 2023-07-28 8:53 ` Corinna Vinschen 1 sibling, 1 reply; 32+ messages in thread From: Bruno Haible @ 2023-07-27 21:40 UTC (permalink / raw) To: Corinna Vinschen, Bruno Haible Corinna Vinschen wrote: > > > 4. cd testdir-fnmatch-posix > > > ./configure 2>&1 | tee log1 > > > make > > > make check > > I fixed the above problem and the POSIX check now works fine: Glad that the test suite was helpful (and that you fixed it before 3.5.0 — so, no additional configure tests needed on the gnulib side). > > > grep fnmatch log1 > > checking for fnmatch.h... yes > checking for fnmatch... yes > checking for working POSIX fnmatch... yes > > I also extraced the fnmatch configure testcase and ran it manually. > It returns 0 now. But: > > > > grep REPLACE_FNMATCH config.status > > S["REPLACE_FNMATCH"]="1" > > Looks like the reason is that we don't have a uchar.h file? Seems > like this is of interest for AIX, but why should this be of > interest for fnmatch on other systems? Ah, that's because I made the assumption that if wchar_t is only 16-bits wide, fnmatch() can't be correct. Which is true for AIX (and on this platform, I prefer not to test the available locales). But not true with your implementation any more. What are the test suite results if you do - Replace S["REPLACE_FNMATCH"]="1" with S["REPLACE_FNMATCH"]="0" in config.status, - make clean - ./config.status - make - make check Then the tests will be run against Cygwin's fnmatch() function. If all tests pass, I will add the following patch to gnulib. diff --git a/m4/fnmatch.m4 b/m4/fnmatch.m4 index 2e1442eff7..e99737a476 100644 --- a/m4/fnmatch.m4 +++ b/m4/fnmatch.m4 @@ -1,4 +1,4 @@ -# Check for fnmatch - serial 18 -*- coding: utf-8 -*- +# Check for fnmatch - serial 19 -*- coding: utf-8 -*- # Copyright (C) 2000-2007, 2009-2023 Free Software Foundation, Inc. 
# This file is free software; the Free Software Foundation @@ -14,7 +14,7 @@ AC_DEFUN([gl_FUNC_FNMATCH_POSIX] m4_divert_text([DEFAULTS], [gl_fnmatch_required=POSIX]) AC_REQUIRE([gl_FNMATCH_H]) - AC_REQUIRE([AC_CANONICAL_HOST]) dnl for cross-compiles + AC_REQUIRE([AC_CANONICAL_HOST]) gl_fnmatch_required_lowercase=` echo $gl_fnmatch_required | LC_ALL=C tr '[[A-Z]]' '[[a-z]]' ` @@ -164,7 +164,17 @@ AC_DEFUN([gl_FUNC_FNMATCH_POSIX] dnl This is due to wchar_t being only 16 bits wide. AC_REQUIRE([gl_UCHAR_H]) if test $SMALL_WCHAR_T = 1; then - REPLACE_FNMATCH=1 + case "$host_os" in + cygwin*) + dnl On Cygwin < 3.5.0, the above $gl_fnmatch_result came out as 'no', + dnl On Cygwin >= 3.5.0, fnmatch supports all Unicode characters, + dnl despite wchar_t being only 16 bits wide (because internally it + dnl works on wint_t values). + ;; + *) + REPLACE_FNMATCH=1 + ;; + esac fi fi if test $HAVE_FNMATCH = 0 || test $REPLACE_FNMATCH = 1; then ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements 2023-07-27 21:40 ` Bruno Haible @ 2023-07-28 8:53 ` Corinna Vinschen 2023-07-28 10:56 ` Bruno Haible 2023-07-28 11:12 ` fnmatch improvements Corinna Vinschen 0 siblings, 2 replies; 32+ messages in thread From: Corinna Vinschen @ 2023-07-28 8:53 UTC (permalink / raw) To: Bruno Haible; +Cc: cygwin On Jul 27 23:40, Bruno Haible via Cygwin wrote: > Corinna Vinschen wrote: > > > > 4. cd testdir-fnmatch-posix > > > > ./configure 2>&1 | tee log1 > > > > make > > > > make check > > > > I fixed the above problem and the POSIX check now works fine: > > Glad that the test suite was helpful (and that you fixed it before 3.5.0 — > so, no additional configure tests needed on the gnulib side). > > > > > grep fnmatch log1 > > > > checking for fnmatch.h... yes > > checking for fnmatch... yes > > checking for working POSIX fnmatch... yes > > > > I also extraced the fnmatch configure testcase and ran it manually. > > It returns 0 now. But: > > > > > > grep REPLACE_FNMATCH config.status > > > > S["REPLACE_FNMATCH"]="1" > > > > Looks like the reason is that we don't have a uchar.h file? Seems > > like this is of interest for AIX, but why should this be of > > interest for fnmatch on other systems? > > Ah, that's because I made the assumption that if wchar_t is only 16-bits > wide, fnmatch() can't be correct. Which is true for AIX (and on this > platform, I prefer not to test the available locales). But not true > with your implementation any more. > > What are the test suite results if you do > > - Replace S["REPLACE_FNMATCH"]="1" with S["REPLACE_FNMATCH"]="0" > in config.status, > - make clean > - ./config.status > - make The build fails here. The reason is that the GNU extension FNM_EXTMATCH is not supported by the FreeBSD code base of fnmatch, so it's not defined in our fnmatch.h system header. Gnulib still tries to build fnmatch_loop.c which uses FNM_EXTMATCH, but apparently it now relies on using the system header? 
> - make check > > Then the tests will be run against Cygwin's fnmatch() function. > If all tests pass, I will add the following patch to gnulib. After the above fail, I tried from scratch with your below patch, and I still get $ grep REPLACE_FNMATCH ./config.status S["REPLACE_FNMATCH"]="1" Even though $ grep fnmatch log1 checking for fnmatch.h... yes checking for fnmatch... yes checking for working POSIX fnmatch... yes I'm quite puzzled. Corinna > > diff --git a/m4/fnmatch.m4 b/m4/fnmatch.m4 > index 2e1442eff7..e99737a476 100644 > --- a/m4/fnmatch.m4 > +++ b/m4/fnmatch.m4 > @@ -1,4 +1,4 @@ > -# Check for fnmatch - serial 18 -*- coding: utf-8 -*- > +# Check for fnmatch - serial 19 -*- coding: utf-8 -*- > > # Copyright (C) 2000-2007, 2009-2023 Free Software Foundation, Inc. > # This file is free software; the Free Software Foundation > @@ -14,7 +14,7 @@ AC_DEFUN([gl_FUNC_FNMATCH_POSIX] > m4_divert_text([DEFAULTS], [gl_fnmatch_required=POSIX]) > > AC_REQUIRE([gl_FNMATCH_H]) > - AC_REQUIRE([AC_CANONICAL_HOST]) dnl for cross-compiles > + AC_REQUIRE([AC_CANONICAL_HOST]) > gl_fnmatch_required_lowercase=` > echo $gl_fnmatch_required | LC_ALL=C tr '[[A-Z]]' '[[a-z]]' > ` > @@ -164,7 +164,17 @@ AC_DEFUN([gl_FUNC_FNMATCH_POSIX] > dnl This is due to wchar_t being only 16 bits wide. > AC_REQUIRE([gl_UCHAR_H]) > if test $SMALL_WCHAR_T = 1; then > - REPLACE_FNMATCH=1 > + case "$host_os" in > + cygwin*) > + dnl On Cygwin < 3.5.0, the above $gl_fnmatch_result came out as 'no', > + dnl On Cygwin >= 3.5.0, fnmatch supports all Unicode characters, > + dnl despite wchar_t being only 16 bits wide (because internally it > + dnl works on wint_t values). 
> + ;; > + *) > + REPLACE_FNMATCH=1 > + ;; > + esac > fi > fi > if test $HAVE_FNMATCH = 0 || test $REPLACE_FNMATCH = 1; then > > > > > -- > Problem reports: https://cygwin.com/problems.html > FAQ: https://cygwin.com/faq/ > Documentation: https://cygwin.com/docs.html > Unsubscribe info: https://cygwin.com/ml/#unsubscribe-simple ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements
From: Bruno Haible @ 2023-07-28 10:56 UTC
To: cygwin

Corinna Vinschen wrote:
> After the above fail, I tried from scratch with your below patch,
> and I still get
>
>   $ grep REPLACE_FNMATCH ./config.status
>   S["REPLACE_FNMATCH"]="1"
>
> Even though
>
>   $ grep fnmatch log1
>   checking for fnmatch.h... yes
>   checking for fnmatch... yes
>   checking for working POSIX fnmatch... yes
>
> I'm quite puzzled.

It's sometimes hard to make incremental changes to generated files of the
GNU Build System plus Gnulib. I've therefore recreated a new tarball for you,
at https://haible.de/bruno/gnu/testdir-fnmatch.tar.gz .

The expected result is:
  1. cd testdir-fnmatch-posix
     ./configure
     grep REPLACE_FNMATCH config.status
       (Expected: REPLACE_FNMATCH is 0)
     make
     make check
       (Expected: No test failures)
     cd ..
  2. cd testdir-fnmatch-gnu
     ./configure
     grep REPLACE_FNMATCH config.status
       (Expected: REPLACE_FNMATCH is 1, because of FNM_EXTMATCH)
     make
     make check
       (Expected: No test failures)
     cd ..

Bruno
* Re: fnmatch improvements 2023-07-28 10:56 ` Bruno Haible @ 2023-07-28 11:14 ` Corinna Vinschen 2023-07-28 18:59 ` Corinna Vinschen 1 sibling, 0 replies; 32+ messages in thread From: Corinna Vinschen @ 2023-07-28 11:14 UTC (permalink / raw) To: Bruno Haible; +Cc: cygwin On Jul 28 12:56, Bruno Haible via Cygwin wrote: > Corinna Vinschen wrote: > > After the above fail, I tried from scratch with your below patch, > > and I still get > > > > $ grep REPLACE_FNMATCH ./config.status > > S["REPLACE_FNMATCH"]="1" > > > > Even though > > > > $ grep fnmatch log1 > > checking for fnmatch.h... yes > > checking for fnmatch... yes > > checking for working POSIX fnmatch... yes > > > > I'm quite puzzled. > > It's sometimes hard to make incremental changes to generated files of the > GNU Build System plus Gnulib. I've therefore recreated a new tarball for you, > at https://haible.de/bruno/gnu/testdir-fnmatch.tar.gz . Thanks, I'll use that for testing later today. Corinna ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements 2023-07-28 10:56 ` Bruno Haible 2023-07-28 11:14 ` Corinna Vinschen @ 2023-07-28 18:59 ` Corinna Vinschen 2023-07-28 19:33 ` Bruno Haible 2023-07-28 19:54 ` GB18030 locale Bruno Haible 1 sibling, 2 replies; 32+ messages in thread From: Corinna Vinschen @ 2023-07-28 18:59 UTC (permalink / raw) To: Bruno Haible; +Cc: cygwin On Jul 28 12:56, Bruno Haible via Cygwin wrote: > It's sometimes hard to make incremental changes to generated files of the > GNU Build System plus Gnulib. I've therefore recreated a new tarball for you, > at https://haible.de/bruno/gnu/testdir-fnmatch.tar.gz . > > The expected result is: > 1. cd testdir-fnmatch-posix > ./configure > grep REPLACE_FNMATCH config.status > (Expected: REPLACE_FNMATCH is 0) $ grep REPLACE_FNMATCH config.status S["REPLACE_FNMATCH"]="0" > make > make check > (Expected: No test failures) # TOTAL: 218 # PASS: 178 # SKIP: 40 # XFAIL: 0 # FAIL: 0 # XPASS: 0 # ERROR: 0 test-fnmatch-5.sh is SKIPped because we don't support zh_CN.GB18030. > cd .. > 2. cd testdir-fnmatch-gnu > ./configure > grep REPLACE_FNMATCH config.status > (Expected: REPLACE_FNMATCH is 1, because of FNM_EXTMATCH) $ grep REPLACE_FNMATCH config.status S["REPLACE_FNMATCH"]="1" > make > make check > (Expected: No test failures) # TOTAL: 218 # PASS: 178 # SKIP: 40 # XFAIL: 0 # FAIL: 0 # XPASS: 0 # ERROR: 0 Same SKIP of test-fnmatch-5.sh. Does that look ok? Thanks, Corinna ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements 2023-07-28 18:59 ` Corinna Vinschen @ 2023-07-28 19:33 ` Bruno Haible 2023-07-28 19:54 ` GB18030 locale Bruno Haible 1 sibling, 0 replies; 32+ messages in thread From: Bruno Haible @ 2023-07-28 19:33 UTC (permalink / raw) To: cygwin Corinna Vinschen wrote: > > 1. cd testdir-fnmatch-posix > > ./configure > > grep REPLACE_FNMATCH config.status > > (Expected: REPLACE_FNMATCH is 0) > > $ grep REPLACE_FNMATCH config.status > S["REPLACE_FNMATCH"]="0" > > > make > > make check > > (Expected: No test failures) > > # TOTAL: 218 > # PASS: 178 > # SKIP: 40 > # XFAIL: 0 > # FAIL: 0 > # XPASS: 0 > # ERROR: 0 > > test-fnmatch-5.sh is SKIPped because we don't support zh_CN.GB18030. > > > cd .. > > 2. cd testdir-fnmatch-gnu > > ./configure > > grep REPLACE_FNMATCH config.status > > (Expected: REPLACE_FNMATCH is 1, because of FNM_EXTMATCH) > > $ grep REPLACE_FNMATCH config.status > S["REPLACE_FNMATCH"]="1" > > > make > > make check > > (Expected: No test failures) > > # TOTAL: 218 > # PASS: 178 > # SKIP: 40 > # XFAIL: 0 > # FAIL: 0 > # XPASS: 0 > # ERROR: 0 > > Same SKIP of test-fnmatch-5.sh. > > Does that look ok? Yes, that's all OK and as expected. I'll commit the fnmatch.m4 patch today. When the user asks for an fnmatch() with FNM_EXTMATCH support, they will get the Gnulib fnmatch(), as it supports these GNU extensions. I'll think about how to make [=X=] and [.X.] work in this case too... Thank you for your constructive cooperation! Bruno ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GB18030 locale 2023-07-28 18:59 ` Corinna Vinschen 2023-07-28 19:33 ` Bruno Haible @ 2023-07-28 19:54 ` Bruno Haible 2023-07-29 9:23 ` Corinna Vinschen 1 sibling, 1 reply; 32+ messages in thread From: Bruno Haible @ 2023-07-28 19:54 UTC (permalink / raw) To: cygwin Corinna Vinschen wrote: > test-fnmatch-5.sh is SKIPped because we don't support zh_CN.GB18030. Hmm? When I read winsup/cygwin/release/3.5.0 and the commit 5da71b6059956a8f20a6be02e82867aa28aa3880, it seems the zh_CN.GB18030 locale (which on native Windows is called "Chinese_China.54936") should be supported. The Gnulib code which determines whether this locale is supported is in m4/locale-zh.m4. Why does the "checking for a transitional chinese locale..." test fail on your system, when you call it as LC_ALL=zh_CN.GB18030 LC_TIME= LC_CTYPE= ./conftest ? Bruno ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GB18030 locale 2023-07-28 19:54 ` GB18030 locale Bruno Haible @ 2023-07-29 9:23 ` Corinna Vinschen 2023-07-29 9:53 ` Bruno Haible 0 siblings, 1 reply; 32+ messages in thread From: Corinna Vinschen @ 2023-07-29 9:23 UTC (permalink / raw) To: Bruno Haible; +Cc: cygwin On Jul 28 21:54, Bruno Haible via Cygwin wrote: > Corinna Vinschen wrote: > > test-fnmatch-5.sh is SKIPped because we don't support zh_CN.GB18030. > > Hmm? When I read winsup/cygwin/release/3.5.0 and the commit > 5da71b6059956a8f20a6be02e82867aa28aa3880, it seems the zh_CN.GB18030 > locale (which on native Windows is called "Chinese_China.54936") > should be supported. You're right, I always had the idea to add GB18030 support and forgot that I supposedly did that in 5da71b605995 ("Cygwin: add support for GB18030 codeset"), sorry. However, on debugging this, I see it's totally broken. Trying to fix this in the existing functions is futile. We need dedicated support functions for GB18030, kind of like the FreeBSD functions, just with extra support for surrogate pairs, as with our UTF8 stuff. Corinna ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GB18030 locale 2023-07-29 9:23 ` Corinna Vinschen @ 2023-07-29 9:53 ` Bruno Haible 2023-07-31 10:07 ` Corinna Vinschen 0 siblings, 1 reply; 32+ messages in thread From: Bruno Haible @ 2023-07-29 9:53 UTC (permalink / raw) To: cygwin Corinna Vinschen wrote: > However, on debugging this, I see it's totally broken. Trying to fix > this in the existing functions is futile. We need dedicated > support functions for GB18030, kind of like the FreeBSD functions, > just with extra support for surrogate pairs, as with our UTF8 stuff. In case it helps: Find here a test suite for the various multibyte functions with GB18030 specific test cases. (Extracted from gnulib.) https://haible.de/bruno/gnu/testdir-gb18030.tar.gz Bruno ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GB18030 locale 2023-07-29 9:53 ` Bruno Haible @ 2023-07-31 10:07 ` Corinna Vinschen 2023-07-31 13:38 ` Corinna Vinschen 0 siblings, 1 reply; 32+ messages in thread From: Corinna Vinschen @ 2023-07-31 10:07 UTC (permalink / raw) To: Bruno Haible; +Cc: cygwin On Jul 29 11:53, Bruno Haible via Cygwin wrote: > Corinna Vinschen wrote: > > However, on debugging this, I see it's totally broken. Trying to fix > > this in the existing functions is futile. We need dedicated > > support functions for GB18030, kind of like the FreeBSD functions, > > just with extra support for surrogate pairs, as with our UTF8 stuff. > > In case it helps: Find here a test suite for the various multibyte > functions with GB18030 specific test cases. (Extracted from gnulib.) > https://haible.de/bruno/gnu/testdir-gb18030.tar.gz Thank you, I'm already hacking and testing :) Corinna ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GB18030 locale 2023-07-31 10:07 ` Corinna Vinschen @ 2023-07-31 13:38 ` Corinna Vinschen 2023-07-31 14:06 ` character class "alpha" Bruno Haible 0 siblings, 1 reply; 32+ messages in thread From: Corinna Vinschen @ 2023-07-31 13:38 UTC (permalink / raw) To: Bruno Haible; +Cc: cygwin Hi Bruno, On Jul 31 12:07, Corinna Vinschen via Cygwin wrote: > On Jul 29 11:53, Bruno Haible via Cygwin wrote: > > Corinna Vinschen wrote: > > > However, on debugging this, I see it's totally broken. Trying to fix > > > this in the existing functions is futile. We need dedicated > > > support functions for GB18030, kind of like the FreeBSD functions, > > > just with extra support for surrogate pairs, as with our UTF8 stuff. > > > > In case it helps: Find here a test suite for the various multibyte > > functions with GB18030 specific test cases. (Extracted from gnulib.) > > https://haible.de/bruno/gnu/testdir-gb18030.tar.gz > > Thank you, I'm already hacking and testing :) I have a problem with the c32isalpha function. c32isalpha fails for the character U+FF11 FULLWIDTH DIGIT ONE, because it expects the character to be an alphabetic character. The Cygwin unicode information is automatically generated from the Unicode data file UnicodeData.txt, fresh from their homepage. iswalpha in newlib is checking for the Unicode categories, using the expression: return cat == CAT_LC || cat == CAT_Lu || cat == CAT_Ll || cat == CAT_Lt || cat == CAT_Lm || cat == CAT_Lo || cat == CAT_Nl // Letter_Number ; with CAT_foo being equivalent to Unicode category foo. Per UnicodeData.txt, ff11 is of category Nd, so it's a digit, not an alphabetic character. I see that Glibc returns 1 from c32isalpha for U+FF11, but I don't see where it takes that info and why this is correct. Can you point me to some info on this? Thanks, Corinna ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: character class "alpha" 2023-07-31 13:38 ` Corinna Vinschen @ 2023-07-31 14:06 ` Bruno Haible 2023-07-31 17:46 ` Corinna Vinschen 0 siblings, 1 reply; 32+ messages in thread From: Bruno Haible @ 2023-07-31 14:06 UTC (permalink / raw) To: cygwin Corinna Vinschen wrote: > I have a problem with the c32isalpha function. > > c32isalpha fails for the character U+FF11 FULLWIDTH DIGIT ONE, > because it expects the character to be an alphabetic character. This is not a big problem. You can see in the test-c32isalpha.c file that this test is disabled for many platforms, in particular glibc. There's no problem with disabling it on Cygwin as well. > The Cygwin unicode information is automatically generated from the > Unicode data file UnicodeData.txt, fresh from their homepage. iswalpha > in newlib is checking for the Unicode categories, using the expression: > > return cat == CAT_LC || cat == CAT_Lu || cat == CAT_Ll || cat == CAT_Lt > || cat == CAT_Lm || cat == CAT_Lo > || cat == CAT_Nl // Letter_Number > ; > > with CAT_foo being equivalent to Unicode category foo. > > Per UnicodeData.txt, ff11 is of category Nd, so it's a digit, not an > alphabetic character. This is not wrong. However, see the comments in the generator of the gnulib tables: https://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/gen-uni-tables.c;h=0dceedc06cd72f886807fd575a2c4dba99cd147a;hb=HEAD#l5789 /* Consider all the non-ASCII digits as alphabetic. ISO C 99 forbids us to have them in category "digit", but we want iswalnum to return true on them. */ Likewise in the generator of the glibc tables: https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/unicode-gen/unicode_utils.py;h=5af03113a2f1f063769752ea426fcaf6f6ba9e95;hb=HEAD#l274 The original comment (from 2000) was: /* SUSV2 gives us some freedom for the "digit" category, but ISO C 99 takes it away: 7.25.2.1.5: The iswdigit function tests for any wide character that corresponds to a decimal-digit character (as defined in 5.2.1). 
5.2.1: the 10 decimal digits 0 1 2 3 4 5 6 7 8 9 */ return (ch >= 0x0030 && ch <= 0x0039); The question is: In which category do you put these non-ASCII digits? "print" and "graph", sure. But other than that? "punct" or "alnum"? "punct" seems wrong. If you, like me, decide to put them in "alnum", then they need to be in "alpha" or "digit" (per POSIX https://pubs.opengroup.org/onlinepubs/9699919799/functions/iswalnum.html ). But ISO C 23 § 7.4.1.5 + § 5.2.1 does not allow them in category "digit". Bruno ^ permalink raw reply [flat|nested] 32+ messages in thread
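The constraint described above (alnum has to be the union of alpha and digit, while digit is limited to ASCII 0-9) can be checked against whatever libc is at hand. A minimal sketch, assuming only the standard <wctype.h> functions; the helper name alnum_partition_violations is made up for illustration and is not part of gnulib, glibc, or newlib.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count the code points in [lo, hi] for which the
   host libc's iswalnum() disagrees with iswalpha() || iswdigit() in the
   current locale.  Per the POSIX iswalnum description the count should
   be 0. */
static long alnum_partition_violations(wint_t lo, wint_t hi)
{
    long bad = 0;
    for (wint_t c = lo; c <= hi; c++) {
        int alnum = iswalnum(c) != 0;
        int alpha_or_digit = iswalpha(c) != 0 || iswdigit(c) != 0;
        if (alnum != alpha_or_digit)
            bad++;
        if (c == hi)   /* stop here so c++ cannot wrap around */
            break;
    }
    return bad;
}
```

Calling alnum_partition_violations(0xFF10, 0xFF19) after setlocale(LC_ALL, "") in a UTF-8 locale shows how a given libc treats the fullwidth digits; the result is platform dependent, which is the point of the discussion.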
* Re: character class "alpha" 2023-07-31 14:06 ` character class "alpha" Bruno Haible @ 2023-07-31 17:46 ` Corinna Vinschen 2023-07-31 18:20 ` Corinna Vinschen 0 siblings, 1 reply; 32+ messages in thread From: Corinna Vinschen @ 2023-07-31 17:46 UTC (permalink / raw) To: Bruno Haible; +Cc: cygwin On Jul 31 16:06, Bruno Haible via Cygwin wrote: > Corinna Vinschen wrote: > > I have a problem with the c32isalpha function. > > > > c32isalpha fails for the character U+FF11 FULLWIDTH DIGIT ONE, > > because it expects the character to be an alphabetic character. > > This is not a big problem. You can see in the test-c32isalpha.c file > that this test is disabled for many platforms, in particular glibc. Which is interesting, because I actually tried that today on glibc, and for iswalpha (0xff11) it returns 1. So it actually behaves as the testcase expects. > There's no problem with disabling it on Cygwin as well. I'd rather make Cygwin do the same as glibc. > > The Cygwin unicode information is automatically generated from the > > Unicode data file UnicodeData.txt, fresh from their homepage. iswalpha > > in newlib is checking for the Unicode categories, using the expression: > > > > return cat == CAT_LC || cat == CAT_Lu || cat == CAT_Ll || cat == CAT_Lt > > || cat == CAT_Lm || cat == CAT_Lo > > || cat == CAT_Nl // Letter_Number > > ; > > > > with CAT_foo being equivalent to Unicode category foo. > > > > Per UnicodeData.txt, ff11 is of category Nd, so it's a digit, not an > > alphabetic character. > > This is not wrong. However, see the comments in the generator of the > gnulib tables: > > https://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/gen-uni-tables.c;h=0dceedc06cd72f886807fd575a2c4dba99cd147a;hb=HEAD#l5789 > > /* Consider all the non-ASCII digits as alphabetic. > ISO C 99 forbids us to have them in category "digit", > but we want iswalnum to return true on them. 
*/ > > Likewise in the generator of the glibc tables: > > https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/unicode-gen/unicode_utils.py;h=5af03113a2f1f063769752ea426fcaf6f6ba9e95;hb=HEAD#l274 > > The original comment (from 2000) was: > > /* SUSV2 gives us some freedom for the "digit" category, but ISO C 99 > takes it away: > 7.25.2.1.5: > The iswdigit function tests for any wide character that corresponds > to a decimal-digit character (as defined in 5.2.1). > 5.2.1: > the 10 decimal digits 0 1 2 3 4 5 6 7 8 9 > */ > return (ch >= 0x0030 && ch <= 0x0039); > > The question is: In which category do you put these non-ASCII digits? > "print" and "graph", sure. But other than that? "punct" or "alnum"? > "punct" seems wrong. If you, like me, decide to put them in "alnum", > then you they need to be in "alpha" or "digit" (per POSIX > https://pubs.opengroup.org/onlinepubs/9699919799/functions/iswalnum.html ). > But ISO C 23 § 7.4.1.5 + § 5.2.1 does not allow them in category "digit".

Thanks for the description. It was clear to me that they don't belong in the ISO C digit category, but other than that...

So, if we change the expression in iswalpha_l to something like

  return cat == CAT_LC || cat == CAT_Lu || cat == CAT_Ll || cat == CAT_Lt
      || cat == CAT_Lm || cat == CAT_Lo
      || cat == CAT_Nl // Letter_Number
      /* Also all digits not allowed to be called digits per ISO C 99 */
      || (cat == CAT_Nd && !(c >= (wint_t)'0' && c <= (wint_t)'9'));

we're good?

Thanks,
Corinna

^ permalink raw reply [flat|nested] 32+ messages in thread
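The proposed expression can be tried out in isolation. The following is a minimal sketch under loud assumptions: the enum, the toy_category lookup (which knows only a handful of code points) and the sketch_* names are invented for illustration. newlib's real category table and its CAT_* constants are not reproduced here, and CAT_LC from the expression above is left out.

```c
#include <stdint.h>

/* Hypothetical stand-in for a few Unicode general categories. */
enum toy_cat { TOY_Lu, TOY_Ll, TOY_Lt, TOY_Lm, TOY_Lo, TOY_Nl, TOY_Nd, TOY_OTHER };

/* Toy lookup covering only a few ranges; a real implementation would
   consult tables generated from UnicodeData.txt. */
static enum toy_cat toy_category(uint32_t c)
{
    if (c >= 'A' && c <= 'Z') return TOY_Lu;
    if (c >= 'a' && c <= 'z') return TOY_Ll;
    if (c >= '0' && c <= '9') return TOY_Nd;
    if (c >= 0xFF10 && c <= 0xFF19) return TOY_Nd;  /* fullwidth digits */
    if (c >= 0x2160 && c <= 0x216F) return TOY_Nl;  /* Roman numerals */
    return TOY_OTHER;
}

/* ISO C: only the ten ASCII decimal digits are "digit". */
static int sketch_isdigit(uint32_t c)
{
    return c >= '0' && c <= '9';
}

/* Letters, letter numbers, and every Nd character that ISO C does not
   allow in "digit". */
static int sketch_isalpha(uint32_t c)
{
    enum toy_cat cat = toy_category(c);
    return cat == TOY_Lu || cat == TOY_Ll || cat == TOY_Lt
        || cat == TOY_Lm || cat == TOY_Lo
        || cat == TOY_Nl
        || (cat == TOY_Nd && !sketch_isdigit(c));
}

/* POSIX: alnum is the union of alpha and digit. */
static int sketch_isalnum(uint32_t c)
{
    return sketch_isalpha(c) || sketch_isdigit(c);
}
```

With this mapping U+FF11 is alpha and alnum but not digit, while ASCII '1' is digit and alnum but not alpha.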
* Re: character class "alpha" 2023-07-31 17:46 ` Corinna Vinschen @ 2023-07-31 18:20 ` Corinna Vinschen 2023-07-31 18:43 ` Bruno Haible 0 siblings, 1 reply; 32+ messages in thread From: Corinna Vinschen @ 2023-07-31 18:20 UTC (permalink / raw) To: Bruno Haible; +Cc: cygwin On Jul 31 19:46, Corinna Vinschen via Cygwin wrote: > On Jul 31 16:06, Bruno Haible via Cygwin wrote: > > Corinna Vinschen wrote: > > > I have a problem with the c32isalpha function. > > > > > > c32isalpha fails for the character U+FF11 FULLWIDTH DIGIT ONE, > > > because it expects the character to be an alphabetic character. > > > > This is not a big problem. You can see in the test-c32isalpha.c file > > that this test is disabled for many platforms, in particular glibc. > > Which is interesting, because I actually tried that today on glibc, and > for iswalpha (0xff11) it returns 1. So it actually behaves as the > testcase expects. > > > There's no problem with disabling it on Cygwin as well. > > I'd rather make Cygwin do the same as glibc. Hmm, there are more of those expressions which are disabled on glibc and fail on Cygwin, for instance in test-c32iscntrl.c. Maybe it's actually the better idea to disable them on Cygwin, too, rather than to change a working system... Corinna ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: character class "alpha" 2023-07-31 18:20 ` Corinna Vinschen @ 2023-07-31 18:43 ` Bruno Haible 2023-07-31 21:12 ` Corinna Vinschen 2023-07-31 21:13 ` Brian Inglis 0 siblings, 2 replies; 32+ messages in thread From: Bruno Haible @ 2023-07-31 18:43 UTC (permalink / raw) To: cygwin Corinna Vinschen wrote: > there are more of those expressions which are disabled on glibc and > fail on Cygwin, for instance in test-c32iscntrl.c. Maybe it's actually > the better idea to disable them on Cygwin, too, rather than to change > a working system... Sure. There is no standard how to map the Unicode properties to POSIX character classes. Other than the mentioned ISO C constraints for 'digit' and 'xdigit' and a few POSIX constraints, you are free to map them as you like. For glibc and gnulib, I mapped them in a way that seemed to make most sense for applications. But different people might come to different meanings of "make sense". Bruno ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: character class "alpha" 2023-07-31 18:43 ` Bruno Haible @ 2023-07-31 21:12 ` Corinna Vinschen 2023-08-01 16:29 ` Brian Inglis 2023-07-31 21:13 ` Brian Inglis 1 sibling, 1 reply; 32+ messages in thread From: Corinna Vinschen @ 2023-07-31 21:12 UTC (permalink / raw) To: Bruno Haible; +Cc: cygwin Hi Bruno, On Jul 31 20:43, Bruno Haible via Cygwin wrote: > Corinna Vinschen wrote: > > there are more of those expressions which are disabled on glibc and > > fail on Cygwin, for instance in test-c32iscntrl.c. Maybe it's actually > > the better idea to disable them on Cygwin, too, rather than to change > > a working system... > > Sure. There is no standard how to map the Unicode properties to POSIX > character classes. Other than the mentioned ISO C constraints for > 'digit' and 'xdigit' and a few POSIX constraints, you are free to > map them as you like. For glibc and gnulib, I mapped them in a way > that seemed to make most sense for applications. But different > people might come to different meanings of "make sense". Ok, so I just pushed a patchset to Cygwin git, which should make GB18030 support actually work. Also, the C11 functions c16rtomb, c32rtomb, mbrtoc16, mbrtoc32 are now implemented in Cygwin and a uchar.h header exists now, too. Assuming all gnulib tests disabled for GLibc in test-c32isalpha.c test-c32iscntrl.c test-c32isprint.c test-c32isgraph.c test-c32ispunct.c test-c32islower.c will be disabled for Cygwin as well, all gb18030 and c32 tests in gnulib work as desired now. Thanks for your input and help! Corinna ^ permalink raw reply [flat|nested] 32+ messages in thread
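The C11 conversion functions mentioned above can be exercised with a short loop. A minimal sketch, assuming a libc that ships <uchar.h>; the function name count_code_points is made up here, and nothing below is taken from the Cygwin patchset.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: count the characters in the multibyte string s,
   decoded with mbrtoc32() according to the current locale.  Returns -1
   on an invalid or truncated sequence. */
static long count_code_points(const char *s)
{
    mbstate_t state;
    size_t left = strlen(s);
    long count = 0;

    memset(&state, 0, sizeof state);   /* initial conversion state */
    while (left > 0) {
        char32_t c32;
        size_t n = mbrtoc32(&c32, s, left, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;                 /* encoding error or incomplete input */
        if (n == (size_t)-3) {         /* character stored, no input consumed */
            count++;
            continue;
        }
        if (n == 0)                    /* null character: end of string */
            break;
        s += n;
        left -= n;
        count++;
    }
    return count;
}
```

For input outside ASCII the result depends on the locale selected with setlocale(), for example a UTF-8 or GB18030 locale.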
* Re: character class "alpha" 2023-07-31 21:12 ` Corinna Vinschen @ 2023-08-01 16:29 ` Brian Inglis 2023-08-02 7:56 ` Corinna Vinschen 0 siblings, 1 reply; 32+ messages in thread From: Brian Inglis @ 2023-08-01 16:29 UTC (permalink / raw) To: cygwin; +Cc: Bruno Haible On 2023-07-31 15:12, Corinna Vinschen via Cygwin wrote: > Hi Bruno, > > On Jul 31 20:43, Bruno Haible via Cygwin wrote: >> Corinna Vinschen wrote: >>> there are more of those expressions which are disabled on glibc and >>> fail on Cygwin, for instance in test-c32iscntrl.c. Maybe it's actually >>> the better idea to disable them on Cygwin, too, rather than to change >>> a working system... >> >> Sure. There is no standard how to map the Unicode properties to POSIX >> character classes. Other than the mentioned ISO C constraints for >> 'digit' and 'xdigit' and a few POSIX constraints, you are free to >> map them as you like. For glibc and gnulib, I mapped them in a way >> that seemed to make most sense for applications. But different >> people might come to different meanings of "make sense". > > Ok, so I just pushed a patchset to Cygwin git, which should make GB18030 > support actually work. > > Also, the C11 functions c16rtomb, c32rtomb, mbrtoc16, mbrtoc32 are now > implemented in Cygwin and a uchar.h header exists now, too. > > Assuming all gnulib tests disabled for GLibc in > > test-c32isalpha.c > test-c32iscntrl.c > test-c32isprint.c > test-c32isgraph.c > test-c32ispunct.c > test-c32islower.c > > will be disabled for Cygwin as well, all gb18030 and c32 tests in gnulib > work as desired now. https://www.iso.org/standard/86539.html [ISO/IEC/IEEE 9945 CD] Draft POSIX 2023 SUS V5 Issue 8 D3 CB2.1 proposes the following POSIX Subprofiling Option Group: POSIX_C_LANG_UCHAR: ISO C Unicode Utilities. 
https://www.iso.org/standard/82075.html [ISO/IEC 9899 DIS] Draft Standard C 2023 is being voted on as of 2023-07-14, and if no technical issues arise requiring tweaks, will become the new standard, in which Unicode utilities <uchar.h> has some additions which you may wish to add; from: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3096.pdf#page=426 also: https://en.cppreference.com/w/c/string/multibyte https://en.cppreference.com/w/c/language/arithmetic_types major additions (note November official standard publication date): "7.30 Unicode utilities <uchar.h> 1 The header <uchar.h> declares one macro, a few types, and several functions for manipulating Unicode characters. 2 The macro __STDC_VERSION_UCHAR_H__ is an integer constant expression with a value equivalent to 202311L. 3 The types declared are mbstate_t (described in 7.31.1) and size_t (described in 7.21); char8_t which is an unsigned integer type used for 8-bit characters and is the same type as unsigned char; ... 7.30.1 Restartable multibyte/wide character conversion functions ... 2 When used in the functions in this subclause, the encoding of char8_t, char16_t, and char32_t objects, and sequences of such objects, is UTF-8, UTF-16, and UTF-32, respectively. Similarly, the encoding of char and wchar_t, and sequences of such objects, is the execution and wide execution encodings (6.2.9), respectively 7.30.1.1 The mbrtoc8 function Synopsis 1 #include <uchar.h> size_t mbrtoc8(char8_t * restrict pc8, const char * restrict s, size_t n, mbstate_t * restrict ps); Description 2 If s is a null pointer, the mbrtoc8 function is equivalent to the call: mbrtoc8(NULL, "", 1, ps) In this case, the values of the parameters pc8 and n are ignored. 3 If s is not a null pointer, the mbrtoc8 function function inspects at most n bytes beginning with the byte pointed to by s to determine the number of bytes needed to complete the next multibyte character (including any shift sequences). 
If the function determines that the next multibyte character is complete and valid, it determines the values of the corresponding characters and then, if pc8 is not a null pointer, stores the value of the first (or only) such character in the object pointed to by pc8. Subsequent calls will store successive characters without consuming any additional input until all the characters have been stored. If the corresponding character is the null character, the resulting state described is the initial conversion state. Returns 4 The mbrtoc8 function returns the first of the following that applies (given the current conversion state): 0 if the next n or fewer bytes complete the multibyte character that corresponds to the null character (which is the value stored). between 1 and n inclusive if the next n or fewer bytes complete a valid multibyte character (which is the value stored); the value returned is the number of bytes that complete the multibyte character. (size_t)(-3) if the next character resulting from a previous call has been stored (no bytes from the input have been consumed by this call). (size_t)(-2) if the next n bytes contribute to an incomplete (but potentially valid) multibyte character, and all n bytes have been processed (no value is stored).398) (size_t)(-1) if an encoding error occurs, in which case the next n or fewer bytes do not contribute to a complete and valid multibyte character (no value is stored); the value of the macro EILSEQ is stored in errno, and the conversion state is unspecified. 398)When n has at least the value of the MB_CUR_MAX macro, this case can only occur if s points at a sequence of redundant shift sequences (for implementations with state-dependent encodings). 
7.30.1.2 The c8rtomb function

Synopsis
1 #include <uchar.h>
  size_t c8rtomb(char * restrict s, char8_t c8, mbstate_t * restrict ps);

Description
2 If s is a null pointer, the c8rtomb function is equivalent to the call
  c8rtomb(buf, u8'\0', ps)
where buf is an internal buffer.
3 If s is not a null pointer, the c8rtomb function determines the number of bytes needed to represent the multibyte character that corresponds to the character given or completed by c8 (including any shift sequences), and stores the multibyte character representation in the array whose first element is pointed to by s, or stores nothing if c8 does not represent a complete character. At most MB_CUR_MAX bytes are stored. If c8 is a null character, a null byte is stored, preceded by any shift sequence needed to restore the initial shift state; the resulting state described is the initial conversion state.

Returns
4 The c8rtomb function returns the number of bytes stored in the array object (including any shift sequences). When c8 is not a valid character, an encoding error occurs: the function stores the value of the macro EILSEQ in errno and returns (size_t)(-1); the conversion state is unspecified. ..."

--
Take care. Thanks, Brian Inglis
Calgary, Alberta, Canada

La perfection est atteinte non pas lorsqu'il n'y a plus rien à ajouter mais lorsqu'il n'y a plus rien à retirer
Perfection is achieved not when there is no more to add but when there is no more to cut
-- Antoine de Saint-Exupéry

^ permalink raw reply [flat|nested] 32+ messages in thread
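Since the quoted draft fixes the encoding of char8_t sequences as UTF-8, the values that mbrtoc8 would hand out for one character over successive calls are the UTF-8 bytes of that code point. The sketch below computes those bytes directly. utf8_encode is a hypothetical helper, not the Cygwin or glibc implementation, and it does not deal with mbstate_t or locales at all.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode the code point cp as UTF-8 into out[] and
   return the number of bytes written (1 to 4), or 0 if cp is a
   surrogate or lies beyond U+10FFFF. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                      /* surrogates are not characters */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For U+1F60B this yields the four bytes F0 9F 98 8B, the same sequence as the "\360\237\230\213" in the fnmatch example at the start of the thread.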
* Re: character class "alpha" 2023-08-01 16:29 ` Brian Inglis @ 2023-08-02 7:56 ` Corinna Vinschen 2023-08-02 15:06 ` Corinna Vinschen 0 siblings, 1 reply; 32+ messages in thread From: Corinna Vinschen @ 2023-08-02 7:56 UTC (permalink / raw) To: cygwin; +Cc: Brian Inglis, Bruno Haible On Aug 1 10:29, Brian Inglis via Cygwin wrote: > On 2023-07-31 15:12, Corinna Vinschen via Cygwin wrote: > > Hi Bruno, > > > > On Jul 31 20:43, Bruno Haible via Cygwin wrote: > > > Corinna Vinschen wrote: > > > > there are more of those expressions which are disabled on glibc and > > > > fail on Cygwin, for instance in test-c32iscntrl.c. Maybe it's actually > > > > the better idea to disable them on Cygwin, too, rather than to change > > > > a working system... > > > > > > Sure. There is no standard how to map the Unicode properties to POSIX > > > character classes. Other than the mentioned ISO C constraints for > > > 'digit' and 'xdigit' and a few POSIX constraints, you are free to > > > map them as you like. For glibc and gnulib, I mapped them in a way > > > that seemed to make most sense for applications. But different > > > people might come to different meanings of "make sense". > > > > Ok, so I just pushed a patchset to Cygwin git, which should make GB18030 > > support actually work. > > > > Also, the C11 functions c16rtomb, c32rtomb, mbrtoc16, mbrtoc32 are now > > implemented in Cygwin and a uchar.h header exists now, too. > > > > Assuming all gnulib tests disabled for GLibc in > > > > test-c32isalpha.c > > test-c32iscntrl.c > > test-c32isprint.c > > test-c32isgraph.c > > test-c32ispunct.c > > test-c32islower.c > > > > will be disabled for Cygwin as well, all gb18030 and c32 tests in gnulib > > work as desired now. > > https://www.iso.org/standard/86539.html [ISO/IEC/IEEE 9945 CD] > > Draft POSIX 2023 SUS V5 Issue 8 D3 CB2.1 proposes the following POSIX > Subprofiling Option Group: POSIX_C_LANG_UCHAR: ISO C Unicode Utilities. 
> > https://www.iso.org/standard/82075.html [ISO/IEC 9899 DIS] > > Draft Standard C 2023 is being voted on as of 2023-07-14, and if no > technical issues arise requiring tweaks, will become the new standard, in > which Unicode utilities <uchar.h> has some additions which you may wish to > add; from: Maybe at one point, but nobody keeps you from sending patches :) Corinna ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: character class "alpha" 2023-08-02 7:56 ` Corinna Vinschen @ 2023-08-02 15:06 ` Corinna Vinschen 0 siblings, 0 replies; 32+ messages in thread From: Corinna Vinschen @ 2023-08-02 15:06 UTC (permalink / raw) To: Corinna Vinschen via Cygwin; +Cc: Brian Inglis, Bruno Haible On Aug 2 09:56, Corinna Vinschen via Cygwin wrote: > On Aug 1 10:29, Brian Inglis via Cygwin wrote: > > On 2023-07-31 15:12, Corinna Vinschen via Cygwin wrote: > > > Hi Bruno, > > > > > > On Jul 31 20:43, Bruno Haible via Cygwin wrote: > > > > Corinna Vinschen wrote: > > > > > there are more of those expressions which are disabled on glibc and > > > > > fail on Cygwin, for instance in test-c32iscntrl.c. Maybe it's actually > > > > > the better idea to disable them on Cygwin, too, rather than to change > > > > > a working system... > > > > > > > > Sure. There is no standard how to map the Unicode properties to POSIX > > > > character classes. Other than the mentioned ISO C constraints for > > > > 'digit' and 'xdigit' and a few POSIX constraints, you are free to > > > > map them as you like. For glibc and gnulib, I mapped them in a way > > > > that seemed to make most sense for applications. But different > > > > people might come to different meanings of "make sense". > > > > > > Ok, so I just pushed a patchset to Cygwin git, which should make GB18030 > > > support actually work. > > > > > > Also, the C11 functions c16rtomb, c32rtomb, mbrtoc16, mbrtoc32 are now > > > implemented in Cygwin and a uchar.h header exists now, too. > > > > > > Assuming all gnulib tests disabled for GLibc in > > > > > > test-c32isalpha.c > > > test-c32iscntrl.c > > > test-c32isprint.c > > > test-c32isgraph.c > > > test-c32ispunct.c > > > test-c32islower.c > > > > > > will be disabled for Cygwin as well, all gb18030 and c32 tests in gnulib > > > work as desired now. 
> > > > https://www.iso.org/standard/86539.html [ISO/IEC/IEEE 9945 CD] > > > > Draft POSIX 2023 SUS V5 Issue 8 D3 CB2.1 proposes the following POSIX > > Subprofiling Option Group: POSIX_C_LANG_UCHAR: ISO C Unicode Utilities. > > > > https://www.iso.org/standard/82075.html [ISO/IEC 9899 DIS] > > > > Draft Standard C 2023 is being voted on as of 2023-07-14, and if no > > technical issues arise requiring tweaks, will become the new standard, in > > which Unicode utilities <uchar.h> has some additions which you may wish to > > add; from: > > Maybe at one point, but nobody keeps you from sending patches :)

Never mind, had a bit of time. I fixed the uchar.h header and implemented c8rtomb and mbrtoc8. Still needs testing. Does anybody know of an easily accessible test suite for these functions?

However, I did not define __STDC_VERSION_UCHAR_H__ yet. I wasn't sure my uchar.h is compliant, and Glibc doesn't define that macro yet, either.

Corinna

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: character class "alpha" 2023-07-31 18:43 ` Bruno Haible 2023-07-31 21:12 ` Corinna Vinschen @ 2023-07-31 21:13 ` Brian Inglis 2023-07-31 21:37 ` Bruno Haible 1 sibling, 1 reply; 32+ messages in thread From: Brian Inglis @ 2023-07-31 21:13 UTC (permalink / raw) To: cygwin; +Cc: Bruno Haible On 2023-07-31 12:43, Bruno Haible via Cygwin wrote: > Corinna Vinschen wrote: >> there are more of those expressions which are disabled on glibc and >> fail on Cygwin, for instance in test-c32iscntrl.c. Maybe it's actually >> the better idea to disable them on Cygwin, too, rather than to change >> a working system... > > Sure. There is no standard how to map the Unicode properties to POSIX > character classes. Other than the mentioned ISO C constraints for > 'digit' and 'xdigit' and a few POSIX constraints, you are free to > map them as you like. For glibc and gnulib, I mapped them in a way > that seemed to make most sense for applications. But different > people might come to different meanings of "make sense".

It seems to me that most application developers needing to support non-Western-European languages might want a non-POSIX interpretation of digits.

Are the Unicode character attribute classes supported for those application use cases that need more than POSIX limitations allow?

I know that I sometimes want to see some alternative numeric digit forms and expect to be able to find those with an appropriate grep expression.

--
Take care. Thanks, Brian Inglis
Calgary, Alberta, Canada

La perfection est atteinte non pas lorsqu'il n'y a plus rien à ajouter mais lorsqu'il n'y a plus rien à retirer
Perfection is achieved not when there is no more to add but when there is no more to cut
-- Antoine de Saint-Exupéry

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: character class "alpha" 2023-07-31 21:13 ` Brian Inglis @ 2023-07-31 21:37 ` Bruno Haible 0 siblings, 0 replies; 32+ messages in thread From: Bruno Haible @ 2023-07-31 21:37 UTC (permalink / raw) To: cygwin, Brian Inglis Brian Inglis wrote: > It seems to me that most application developers needing to support > non-Western-European languages might want a non-POSIX interpretation of digits. Sure. GNU libunistring has dedicated API for this: - https://www.gnu.org/software/libunistring/manual/html_node/Object-oriented-API.html UC_DECIMAL_DIGIT_NUMBER. - https://www.gnu.org/software/libunistring/manual/html_node/Decimal-digit-value.html - https://www.gnu.org/software/libunistring/manual/html_node/Digit-value.html - https://www.gnu.org/software/libunistring/manual/html_node/Properties-as-objects.html UC_PROPERTY_DECIMAL_DIGIT - https://www.gnu.org/software/libunistring/manual/html_node/Properties-as-functions.html uc_is_property_decimal_digit I'm sure ICU4C has similar APIs too. > Are the Unicode character attribute classes supported for those application use > cases that need more than POSIX limitations allow? POSIX allows the libc to define additional character classes. But these will be platform and locale dependent, and I don't know of any application which makes use of such additional character classes via wctype() and iswctype(). > I know that I sometimes want to see some alternative numeric digit forms and > expect to be able to find those with an appropriate grep expression. I think you can do so with GNU 'grep', when it was built with PCRE support. PCRE includes support for Unicode character classes. <https://www.pcre.org/current/doc/html/pcre2pattern.html> Bruno ^ permalink raw reply [flat|nested] 32+ messages in thread
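The wctype() and iswctype() route mentioned above looks roughly like this in practice. A minimal sketch using only the standard <wctype.h> interface; the helper name in_named_class is invented for illustration, and which class names exist beyond the standard ones ("alpha", "digit", "punct", and so on) depends on the platform and locale.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: returns 1 if c belongs to the character class
   named class_name in the current locale, 0 if it does not, and -1 if
   the locale defines no class with that name. */
static int in_named_class(wint_t c, const char *class_name)
{
    wctype_t desc = wctype(class_name);

    if (desc == (wctype_t)0)
        return -1;                 /* unknown class in this locale */
    return iswctype(c, desc) != 0;
}
```

An application would call this with a locale-specific class name and fall back to a library with its own Unicode property tables when the result is -1.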
* Re: fnmatch improvements 2023-07-28 8:53 ` Corinna Vinschen 2023-07-28 10:56 ` Bruno Haible @ 2023-07-28 11:12 ` Corinna Vinschen 2023-07-28 11:22 ` Bruno Haible 2023-07-28 21:42 ` Bill Stewart 1 sibling, 2 replies; 32+ messages in thread From: Corinna Vinschen @ 2023-07-28 11:12 UTC (permalink / raw) To: Corinna Vinschen via Cygwin; +Cc: Bruno Haible On Jul 28 10:53, Corinna Vinschen via Cygwin wrote: > On Jul 27 23:40, Bruno Haible via Cygwin wrote: > > Corinna Vinschen wrote: > > > S["REPLACE_FNMATCH"]="1" > > > > > > Looks like the reason is that we don't have a uchar.h file? Seems > > > like this is of interest for AIX, but why should this be of > > > interest for fnmatch on other systems? > > > > Ah, that's because I made the assumption that if wchar_t is only 16-bits > > wide, fnmatch() can't be correct. Which is true for AIX (and on this > > platform, I prefer not to test the available locales). But not true > > with your implementation any more. > > > > What are the test suite results if you do > > > > - Replace S["REPLACE_FNMATCH"]="1" with S["REPLACE_FNMATCH"]="0" > > in config.status, > > - make clean > > - ./config.status > > - make > > The build fails here. The reason is that the GNU extension FNM_EXTMATCH > is not supported by the FreeBSD code base of fnmatch, so it's not > defined in our fnmatch.h system header. Gnulib still tries to build > fnmatch_loop.c which uses FNM_EXTMATCH, but apparently it now relies on > using the system header? > [...] > After the above fail, I tried from scratch with your below patch, > and I still get > > $ grep REPLACE_FNMATCH ./config.status > S["REPLACE_FNMATCH"]="1" > > Even though > > $ grep fnmatch log1 > checking for fnmatch.h... yes > checking for fnmatch... yes > checking for working POSIX fnmatch... yes > > I'm quite puzzled. I'm puzzled because I'm an idiot. I forgot autoreconf. After that and another configure run, REPLACE_FNMATCH is correctly set to 0 *and* the build runs fine. 
I'll do the rest of the test later today. Sorry, Corinna ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements 2023-07-28 11:12 ` fnmatch improvements Corinna Vinschen @ 2023-07-28 11:22 ` Bruno Haible 2023-07-28 21:42 ` Bill Stewart 1 sibling, 0 replies; 32+ messages in thread From: Bruno Haible @ 2023-07-28 11:22 UTC (permalink / raw) To: Corinna Vinschen, Bruno Haible Corinna Vinschen wrote: > I'm puzzled because I'm an idiot. I forgot autoreconf. Things like that happen to me as well. There are so many generation phases (collect *.m4 files; autoconf; configure; make) that it's easy to forget one when making incremental changes. It's more reliable to regenerate the testdir from scratch. Bruno ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: fnmatch improvements 2023-07-28 11:12 ` fnmatch improvements Corinna Vinschen 2023-07-28 11:22 ` Bruno Haible @ 2023-07-28 21:42 ` Bill Stewart 1 sibling, 0 replies; 32+ messages in thread From: Bill Stewart @ 2023-07-28 21:42 UTC (permalink / raw) To: cygwin [-- Attachment #1: Type: text/plain, Size: 150 bytes --] On Fri, Jul 28, 2023 at 5:12 AM Corinna Vinschen wrote: I'm puzzled because I'm an idiot. > That's one thing you certainly are not. Bill ^ permalink raw reply [flat|nested] 32+ messages in thread