public inbox for cygwin@cygwin.com
From: Corinna Vinschen <corinna-cygwin@cygwin.com>
To: Bruno Haible <bruno@clisp.org>
Cc: cygwin@cygwin.com
Subject: Re: character class "alpha"
Date: Mon, 31 Jul 2023 19:46:20 +0200
Message-ID: <ZMfzbOOJth8Mk+rJ@calimero.vinschen.de>
In-Reply-To: <5176597.IBPj4gxFZX@nimes>

On Jul 31 16:06, Bruno Haible via Cygwin wrote:
> Corinna Vinschen wrote:
> > I have a problem with the c32isalpha function.
> > 
> > The c32isalpha test fails for the character U+FF11 FULLWIDTH DIGIT ONE,
> > because the test expects the character to be an alphabetic character.
> 
> This is not a big problem. You can see in the test-c32isalpha.c file
> that this test is disabled for many platforms, in particular glibc.

Which is interesting, because I actually tried that today on glibc, and
for iswalpha (0xff11) it returns 1.  So it actually behaves as the
testcase expects.

> There's no problem with disabling it on Cygwin as well.

I'd rather make Cygwin do the same as glibc.

> > The Cygwin unicode information is automatically generated from the
> > Unicode data file UnicodeData.txt, fresh from their homepage.  iswalpha
> > in newlib is checking for the Unicode categories, using the expression:
> > 
> >     return cat == CAT_LC || cat == CAT_Lu || cat == CAT_Ll || cat == CAT_Lt
> >           || cat == CAT_Lm || cat == CAT_Lo
> > 	  || cat == CAT_Nl // Letter_Number
> > 	  ;
> > 
> > with CAT_foo being equivalent to Unicode category foo.
> > 
> > Per UnicodeData.txt, ff11 is of category Nd, so it's a digit, not an
> > alphabetic character.
> 
> This is not wrong. However, see the comments in the generator of the
> gnulib tables:
> 
> https://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/gen-uni-tables.c;h=0dceedc06cd72f886807fd575a2c4dba99cd147a;hb=HEAD#l5789
> 
>    /* Consider all the non-ASCII digits as alphabetic.
>       ISO C 99 forbids us to have them in category "digit",
>       but we want iswalnum to return true on them.  */
> 
> Likewise in the generator of the glibc tables:
> 
> https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/unicode-gen/unicode_utils.py;h=5af03113a2f1f063769752ea426fcaf6f6ba9e95;hb=HEAD#l274
> 
> The original comment (from 2000) was:
> 
>   /* SUSV2 gives us some freedom for the "digit" category, but ISO C 99
>      takes it away:
>      7.25.2.1.5:
>         The iswdigit function tests for any wide character that corresponds
>         to a decimal-digit character (as defined in 5.2.1).
>      5.2.1:
>         the 10 decimal digits 0 1 2 3 4 5 6 7 8 9
>    */
>   return (ch >= 0x0030 && ch <= 0x0039);
> 
> The question is: In which category do you put these non-ASCII digits?
> "print" and "graph", sure. But other than that? "punct" or "alnum"?
> "punct" seems wrong. If you, like me, decide to put them in "alnum",
> then they need to be in "alpha" or "digit" (per POSIX
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/iswalnum.html ).
> But ISO C 23 § 7.4.1.5 + § 5.2.1 does not allow them in category "digit".

Thanks for the description.  It was clear to me that they don't belong
in the ISO C digit category, but other than that...

So, if we change the expression in iswalpha_l to something like

  return cat == CAT_LC || cat == CAT_Lu || cat == CAT_Ll || cat == CAT_Lt
      || cat == CAT_Lm || cat == CAT_Lo
      || cat == CAT_Nl // Letter_Number
      /* Also all digits not allowed to be called digits per ISO C 99 */
      || (cat == CAT_Nd && !(c >= (wint_t)'0' && c <= (wint_t)'9'))
      ;

we're good?
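
To make the intended effect concrete, here is a minimal, self-contained
sketch of that classification.  Everything in it is an assumption for
illustration only: the enum, toy_category() and the sketch_* functions
are hypothetical stand-ins, not newlib's actual tables or internals, and
the toy lookup covers only a handful of code points.

```c
#include <stdbool.h>
#include <wchar.h>

/* Hypothetical stand-ins for the CAT_foo category constants. */
enum toy_cat {
  CAT_LC, CAT_Lu, CAT_Ll, CAT_Lt, CAT_Lm, CAT_Lo,
  CAT_Nl, CAT_Nd, CAT_other
};

/* Toy category lookup for a few code points only.  The real data is
   generated from UnicodeData.txt; this is just enough for the example. */
static enum toy_cat toy_category(wint_t c)
{
  if (c >= 0x0041 && c <= 0x005A) return CAT_Lu;  /* A..Z */
  if (c >= 0x0061 && c <= 0x007A) return CAT_Ll;  /* a..z */
  if (c >= 0x0030 && c <= 0x0039) return CAT_Nd;  /* ASCII digits */
  if (c >= 0xFF10 && c <= 0xFF19) return CAT_Nd;  /* fullwidth digits */
  if (c == 0x2160) return CAT_Nl;                 /* ROMAN NUMERAL ONE */
  return CAT_other;
}

/* ISO C only allows the ten ASCII decimal digits in "digit". */
static bool sketch_iswdigit(wint_t c)
{
  return c >= 0x0030 && c <= 0x0039;
}

/* Letters, letter numbers, and every Nd digit that is not an ISO C digit. */
static bool sketch_iswalpha(wint_t c)
{
  enum toy_cat cat = toy_category(c);

  return cat == CAT_LC || cat == CAT_Lu || cat == CAT_Ll || cat == CAT_Lt
      || cat == CAT_Lm || cat == CAT_Lo
      || cat == CAT_Nl
      || (cat == CAT_Nd && !sketch_iswdigit(c))
      ;
}

/* POSIX: alnum is the union of alpha and digit. */
static bool sketch_iswalnum(wint_t c)
{
  return sketch_iswalpha(c) || sketch_iswdigit(c);
}
```

With this arrangement sketch_iswalpha (0xff11) is true and
sketch_iswdigit (0xff11) is false, so U+FF11 still ends up in "alnum"
without ever being reported as a "digit", while ASCII '1' stays a digit
and not an alpha.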


Thanks,
Corinna

