public inbox for glibc-bugs@sourceware.org
* [Bug glob/31075] New: fnmatch("??") matches on 2-byte single characters (as well as 2 any-length characters)
@ 2023-11-18 10:03 stephane+sourceware at chazelas dot org
  2023-11-18 11:53 ` [Bug glob/31075] " stephane+sourceware at chazelas dot org
  2023-11-24 19:56 ` [Bug glob/31075] fnmatch("??") matches on one two byte valid character (as well as two " stephane+sourceware at chazelas dot org
  0 siblings, 2 replies; 3+ messages in thread
From: stephane+sourceware at chazelas dot org @ 2023-11-18 10:03 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=31075

            Bug ID: 31075
           Summary: fnmatch("??") matches on 2-byte single characters (as
                    well as 2 any-length characters)
           Product: glibc
           Version: 2.34
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: glob
          Assignee: unassigned at sourceware dot org
          Reporter: stephane+sourceware at chazelas dot org
  Target Milestone: ---

Regression introduced in 2.34 by commit
a79328c745219dcb395070cdcd3be065a8347f24. Reproduced on Ubuntu 22.04, Debian
sid (libc6:amd64 2.37-12), and current git HEAD
(dae3cf4134d476a4b4ef86fd7012231d6436c15e) built on that sid system.

find . -name '??'

In a UTF-8 locale, this matches a UTF-8 encoded éé (0xc3 0xa9 0xc3 0xa9), as
expected, but also a single UTF-8 encoded é (0xc3 0xa9).

To reproduce, from a shell with support for ksh93-style $'...' quotes (ksh93,
zsh, bash...) and on a system where the C.UTF-8 locale has been enabled (change
to any other UTF-8 locale if not):

(
  mkdir new-dir && cd new-dir || exit
  touch $'\xc3\xa9' $'\xc3\xa9\xc3\xa9'
  export LC_ALL=C.UTF-8
  locale charmap
  find . -name '??'
)

UTF-8
./é
./éé

It seems that when fnmatch() fails to match in wchar_t mode, it tries again in
char mode. The pattern is then also treated as a char[] array, which makes it
even worse than the (already quite buggy) behaviour of bash pattern matching
(https://lists.gnu.org/archive/html/bug-bash/2021-02/msg00054.html), as the
fallback happens even when both the subject and pattern are properly encoded in
the user's locale charmap.

$ find . -name '*[á-ä]*'
./é
./éé

Those didn't match in wchar_t mode but matched in char mode, where the pattern
became *[\303\241-\303\244]*: a set made of byte 0303, the byte range 0241 to
0303, and byte 0244. It therefore matches anything containing a byte in the
range 0241 to 0303.

As with bash, it becomes worse in locales that have characters whose encoding
contains the encoding of [, ] or \, as it can end up matching on a pattern
completely different from the one intended by the user.

