public inbox for glibc-bugs@sourceware.org
* [Bug localedata/31149] New: combining characters (accents) misclassified as punct rather than alpha
@ 2023-12-12 15:53 vincent-srcware at vinc17 dot net
2023-12-13 10:01 ` [Bug localedata/31149] " fweimer at redhat dot com
2023-12-13 10:15 ` vincent-srcware at vinc17 dot net
0 siblings, 2 replies; 3+ messages in thread
From: vincent-srcware at vinc17 dot net @ 2023-12-12 15:53 UTC
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=31149
Bug ID: 31149
Summary: combining characters (accents) misclassified as punct rather than alpha
Product: glibc
Version: 2.37
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: localedata
Assignee: unassigned at sourceware dot org
Reporter: vincent-srcware at vinc17 dot net
CC: libc-locales at sourceware dot org
Target Milestone: ---
Combining characters such as U+0301 COMBINING ACUTE ACCENT are classified as
punct, whereas they should be classified as alpha.
With glibc 2.37 on Debian/unstable, I get the following for this character:
Property alnum : no
Property alpha : no
Property cntrl : no
Property digit : no
Property graph : yes
Property lower : no
Property print : yes
Property punct : yes
Property space : no
Property upper : no
Property xdigit: no
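A listing of this shape can be produced with the standard wctype() and iswctype() functions. The following is only a minimal sketch, not the program the reporter actually ran; the helper names in_class and print_properties are hypothetical, and the result depends entirely on the LC_CTYPE locale that is active when the helpers are called.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

/* Class names in the same order as the listing in the report. */
const char *const class_names[] = {
    "alnum", "alpha", "cntrl", "digit", "graph", "lower",
    "print", "punct", "space", "upper", "xdigit",
};

/* Hypothetical helper: nonzero if wc belongs to the named class in the
   current LC_CTYPE locale, 0 if it does not or the class is unknown. */
int in_class(wint_t wc, const char *name)
{
    assert(name != NULL);
    wctype_t desc = wctype(name);   /* returns 0 for an unknown class name */
    return desc != (wctype_t)0 && iswctype(wc, desc) != 0;
}

/* Hypothetical helper: print one "Property <class>: yes/no" line per class. */
void print_properties(wint_t wc)
{
    size_t n = sizeof class_names / sizeof class_names[0];
    for (size_t i = 0; i < n; i++)
        printf("Property %-6s: %s\n", class_names[i],
               in_class(wc, class_names[i]) ? "yes" : "no");
}
```

To check U+0301, a caller would first select a UTF-8 locale with setlocale(LC_ALL, "") from <locale.h> and then call print_properties(0x0301). Passing the code point directly assumes that wchar_t holds Unicode code points, as it does on glibc.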
This affects grep: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=27681 (where
it is said that the bug is in the GNU libc).
Corresponding Debian bug:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=868654
(which was reported on 2017-07-17 and has not seen any activity since).
--
You are receiving this mail because:
You are on the CC list for the bug.
* [Bug localedata/31149] combining characters (accents) misclassified as punct rather than alpha
From: fweimer at redhat dot com @ 2023-12-13 10:01 UTC
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=31149
Florian Weimer <fweimer at redhat dot com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fweimer at redhat dot com
--- Comment #1 from Florian Weimer <fweimer at redhat dot com> ---
Isn't the larger issue here that it's reasonable to expect that [[:alpha:]]
matches a single letter as perceived by the user: an entire grapheme cluster
comprising the base character(s), the associated combining characters and other
marks? We do not implement any of that in glibc, and there are no plans to do
so.
* [Bug localedata/31149] combining characters (accents) misclassified as punct rather than alpha
From: vincent-srcware at vinc17 dot net @ 2023-12-13 10:15 UTC
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=31149
--- Comment #2 from Vincent Lefèvre <vincent-srcware at vinc17 dot net> ---
(In reply to Florian Weimer from comment #1)
> Isn't the larger issue here that it's reasonable to expect that [[:alpha:]]
> matches a single letter as perceived by the user: an entire grapheme cluster
> comprising the base character(s), its associated combining characters and
> other marks. [...]
I don't think so. The functions iswctype(), iswalpha(), etc. take a single
code point (type wint_t), and the regex(7) man page says:

    Within a bracket expression, the name of a character class enclosed
    in "[:" and ":]" stands for the list of all characters belonging to
    that class. Standard character class names are:

        alnum   digit   punct
        alpha   graph   space
        blank   lower   upper
        cntrl   print   xdigit

    These stand for the character classes defined in wctype(3). [...]

so [[:alpha:]] is expected to match a single character, just as the functions
above classify a single code point.
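The link between the bracket expression and the per-character classification can be exercised through the POSIX regex API. The following is a minimal sketch under stated assumptions: the helper name all_alpha is hypothetical, and the pattern is only one way to test whether every character of a string is in the alpha class of the current locale.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if every character of text is matched by
   [[:alpha:]] in the current locale, 0 if not (including for an empty
   string), and -1 if the pattern fails to compile. */
int all_alpha(const char *text)
{
    assert(text != NULL);

    regex_t re;
    if (regcomp(&re, "^[[:alpha:]]+$", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, text, 0, NULL, 0);   /* 0 means the text matched */
    regfree(&re);
    return rc == 0;
}
```

After setlocale(LC_ALL, "") in a UTF-8 locale, a decomposed string such as "e\xcc\x81" (e followed by U+0301) would be expected to return 0 for as long as U+0301 is classified as punct, which is consistent with the grep behaviour referenced in the original report.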