public inbox for glibc-bugs@sourceware.org
* [Bug localedata/31149] New: combining characters (accents) misclassified as punct rather than alpha
@ 2023-12-12 15:53 vincent-srcware at vinc17 dot net
  2023-12-13 10:01 ` [Bug localedata/31149] " fweimer at redhat dot com
  2023-12-13 10:15 ` vincent-srcware at vinc17 dot net
  0 siblings, 2 replies; 3+ messages in thread
From: vincent-srcware at vinc17 dot net @ 2023-12-12 15:53 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=31149

            Bug ID: 31149
           Summary: combining characters (accents) misclassified as punct
                    rather than alpha
           Product: glibc
           Version: 2.37
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: localedata
          Assignee: unassigned at sourceware dot org
          Reporter: vincent-srcware at vinc17 dot net
                CC: libc-locales at sourceware dot org
  Target Milestone: ---

Combining characters such as U+0301 COMBINING ACUTE ACCENT are classified as
punct, whereas they should be classified as alpha.

With glibc 2.37 on Debian/unstable, I get the following properties for this
character:

Property alnum : no
Property alpha : no
Property cntrl : no
Property digit : no
Property graph : yes
Property lower : no
Property print : yes
Property punct : yes
Property space : no
Property upper : no
Property xdigit: no
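
As a cross-check that does not depend on glibc, the Unicode Character Database
entry for this code point can be inspected. The sketch below uses Python's
standard unicodedata module; the helper name describe() is made up for
illustration. It reports only the Unicode general category and does not by
itself settle which POSIX class the character should belong to.

```python
import unicodedata

def describe(c):
    """Return (name, general category, canonical combining class) for one code point."""
    return unicodedata.name(c), unicodedata.category(c), unicodedata.combining(c)

# U+0301 COMBINING ACUTE ACCENT, the character from the report above.
name, category, ccc = describe("\u0301")
print(name, category, ccc)

# In the Unicode general category scheme, punctuation categories start
# with "P" and mark categories start with "M".
print("punctuation per UCD:", category.startswith("P"))
print("mark per UCD:", category.startswith("M"))
```

The general category here is Mn (Nonspacing Mark), which is a mark category
and not one of the punctuation categories.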

This affects grep: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=27681 (where
the bug is attributed to GNU libc).

Corresponding Debian bug:
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=868654
(which was reported on 2017-07-17 and has had no activity since).

-- 
You are receiving this mail because:
You are on the CC list for the bug.

* [Bug localedata/31149] combining characters (accents) misclassified as punct rather than alpha
From: fweimer at redhat dot com @ 2023-12-13 10:01 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=31149

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fweimer at redhat dot com

--- Comment #1 from Florian Weimer <fweimer at redhat dot com> ---
Isn't the larger issue here that it's reasonable to expect that [[:alpha:]]
matches a single letter as perceived by the user: an entire grapheme cluster
comprising the base character(s), its associated combining characters and other
marks? We do not implement any of that in glibc, and there are no plans to do
so.
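
To illustrate the gap between code points and user-perceived letters described
in this comment, here is a deliberately simplified sketch in Python. The
function simple_clusters() is hypothetical: it only attaches code points with a
nonzero canonical combining class to the preceding base character. Real
grapheme cluster segmentation (Unicode UAX #29) has many more rules, and
nothing here reflects glibc behavior.

```python
import unicodedata

def simple_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Simplification: any code point whose canonical combining class is
    nonzero is appended to the previous cluster. A mark with no preceding
    base starts its own cluster.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

word = "cafe\u0301"  # "café" with a decomposed final letter
print(len(word))                   # number of code points
print(len(simple_clusters(word)))  # number of simplified clusters
```

For this input the string has five code points but four clusters, since
"e" followed by U+0301 is grouped as one unit.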

* [Bug localedata/31149] combining characters (accents) misclassified as punct rather than alpha
From: vincent-srcware at vinc17 dot net @ 2023-12-13 10:15 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=31149

--- Comment #2 from Vincent Lefèvre <vincent-srcware at vinc17 dot net> ---
(In reply to Florian Weimer from comment #1)
> Isn't the larger issue here that it's reasonable to expect that [[:alpha:]]
> matches a single letter as perceived by the user: an entire grapheme cluster
> comprising the base character(s), its associated combining characters and
> other marks? [...]

I don't think so. The functions iswctype(), iswalpha(), etc. take a single
code point (of type wint_t), and the regex(7) man page says:

  Within a bracket expression, the name of a character class enclosed
  in "[:" and ":]" stands for the list of all characters belonging to
  that class. Standard character class names are:

        alnum   digit   punct
        alpha   graph   space
        blank   lower   upper
        cntrl   print   xdigit

  These stand for the character classes defined in wctype(3). [...]

so [[:alpha:]] is expected to match a single character, just as the functions
above do.
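
The per-code-point view can be sketched as follows. This is an analogy only:
it uses Python's str.isalpha(), which is based on the Unicode general category
and is not glibc's iswalpha(), and the helper per_code_point() is made up for
illustration. It shows that a predicate taking one code point at a time sees
the decomposed and precomposed forms of the same letter differently.

```python
import unicodedata

decomposed = "e\u0301"                               # "e" + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single precomposed code point

def per_code_point(text):
    """Apply a single-code-point alphabetic test to every code point in text."""
    return [(f"U+{ord(c):04X}", c.isalpha()) for c in text]

print(per_code_point(decomposed))
print(per_code_point(composed))
```

With Python's predicate, the decomposed form yields one alphabetic code point
followed by one non-alphabetic code point, while the precomposed form yields a
single alphabetic code point.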

