public inbox for glibc-bugs-regex@sourceware.org
* [Bug regex/1149] New: character class with range doesn't match half-width kana in SJIS locale
From: kimura dot koichi at canon dot co dot jp @ 2005-08-02  4:37 UTC (permalink / raw)
  To: glibc-bugs-regex

In GNU sed 4.1.4 (whose regex.c is drawn from glibc),
Japanese half-width characters in the SJIS locale do not match
a character class that contains a range.

LC_ALL=ja_JP.SJIS
export LC_ALL
echo ｱｲｳｴｵ | sed -ne '/[ｱ-ｵ]\+/p'

The shell script above prints nothing.
Full-width kana characters, by contrast, match correctly.

note:

echo ｱｲｳｴｵ | sed -ne '/[ｱｲｳｴｵ]\+/p'

prints the line correctly.

-- 
           Summary: character class with range doesn't match half-width kana
                    in SJIS locale
           Product: glibc
           Version: unspecified
            Status: NEW
          Severity: normal
          Priority: P2
         Component: regex
        AssignedTo: gotom at debian dot or dot jp
        ReportedBy: kimura dot koichi at canon dot co dot jp
                CC: glibc-bugs-regex at sources dot redhat dot com,
                    glibc-bugs at sources dot redhat dot com


http://sources.redhat.com/bugzilla/show_bug.cgi?id=1149

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.



* [Bug regex/1149] character class with range doesn't match half-width kana in SJIS locale
From: drepper at redhat dot com @ 2005-09-27 20:05 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From drepper at redhat dot com  2005-09-27 20:05 -------
You really cannot use character ranges outside the C locale, since their
definition depends on the locale description, more specifically the collation
data.  That data currently contains nothing for these characters.  And even if
it did, there is no guarantee that the result would be what you expect.  Just
don't use ranges.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |WONTFIX


http://sourceware.org/bugzilla/show_bug.cgi?id=1149


* [Bug regex/1149] character class with range doesn't match half-width kana in SJIS locale
From: kimura dot koichi at canon dot co dot jp @ 2006-01-27  6:00 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From kimura dot koichi at canon dot co dot jp  2006-01-27 06:00 -------
You say that I should not use character ranges outside the C locale.
But I still have a question.

Why are the characters that form the start and end of the range not printed?

Half-width katakana characters in the SJIS locale are one byte wide
(the code point is below 0xff) but have a large code point in Unicode
(above U+0100).
In regcomp.c, I guess half-width katakana characters should be registered in
the fastmap as single-byte characters.
And in regexec.c, half-width katakana characters should be treated as
single-byte characters, with bitset_set() called to register them in the bitmap.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|WONTFIX                     |


http://sourceware.org/bugzilla/show_bug.cgi?id=1149


* [Bug regex/1149] character class with range doesn't match half-width kana in SJIS locale
From: kimura dot koichi at canon dot co dot jp @ 2006-02-01  4:48 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From kimura dot koichi at canon dot co dot jp  2006-02-01 04:48 -------
(In reply to comment #2)
I think I have found the source of the problem.
Here is a patch.

--- regcomp.c.1~	2005-07-18 11:51:43.000000000 +0900
+++ regcomp.c	2006-02-01 13:26:41.078750000 +0900
@@ -397,9 +397,13 @@ re_compile_fastmap_iter (bufp, init_stat
 		}
 # else
 	      if (dfa->mb_cur_max > 1)
-		for (i = 0; i < SBC_MAX; ++i)
-		  if (__btowc (i) == WEOF)
-		    re_set_fastmap (fastmap, icase, i);
+                  for (i = 0; i < SBC_MAX; ++i) {
+		    wint_t wc;
+		    wc = __btowc (i);
+
+		    if (wc == WEOF || wc >= SBC_MAX)
+		      re_set_fastmap (fastmap, icase, i);
+		  }
 # endif /* not _LIBC */
 	    }
 	  for (i = 0; i < cset->nmbchars; ++i)

--- regexec.c.1~	2005-07-18 11:51:42.000000000 +0900
+++ regexec.c	2006-02-01 13:26:44.016250000 +0900
@@ -3715,6 +3715,7 @@ check_node_accept_bytes (dfa, node_idx, 
   const re_token_t *node = dfa->nodes + node_idx;
   int char_len, elem_len;
   int i;
+  wint_t wc;
 
   if (BE (node->type == OP_UTF8_PERIOD, 0))
     {
@@ -3784,7 +3785,8 @@ check_node_accept_bytes (dfa, node_idx, 
     }
 
   elem_len = re_string_elem_size_at (input, str_idx);
-  if ((elem_len <= 1 && char_len <= 1) || char_len == 0)
+  wc = __btowc(*(input->mbs+str_idx));
+  if (((elem_len <= 1 && char_len <= 1) || char_len == 0) && (wc != WEOF && wc < SBC_MAX))
     return 0;
 
   if (node->type == COMPLEX_BRACKET)

This patch covers only the non-_LIBC part, since I could not follow the flow of
the _LIBC part.

-- 


http://sourceware.org/bugzilla/show_bug.cgi?id=1149


* [Bug regex/1149] character class with range doesn't match half-width kana in SJIS locale
From: drepper at redhat dot com @ 2006-04-25 18:12 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From drepper at redhat dot com  2006-04-25 18:12 -------
Patches for non-_LIBC shouldn't be sent here.  This is the *libc* bugzilla. 
Send them to the sed list and let those people look at them.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|                            |WONTFIX


http://sourceware.org/bugzilla/show_bug.cgi?id=1149


* [Bug regex/1149] character class with range doesn't match half-width kana in SJIS locale
From: bonzini at gnu dot org @ 2006-04-26  7:04 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From bonzini at gnu dot org  2006-04-26 07:04 -------
So you WONTFIX a bug just because the patch sent is not for glibc?  Either the
bug is invalid, and you mark it as INVALID; or you just ignore the patch.  But
not WONTFIX.

The patch is not OK because it unnecessarily slows down the function, and regex
is already slow enough.  We should probably cache the results of btowc (at least
for the non-_LIBC case).

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|WONTFIX                     |


http://sourceware.org/bugzilla/show_bug.cgi?id=1149


* [Bug regex/1149] character class with range doesn't match half-width kana in SJIS locale
From: drepper at redhat dot com @ 2006-05-02 22:33 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From drepper at redhat dot com  2006-05-02 22:33 -------
This is glibc's bugzilla.  I mark it WONTFIX because I have nothing to do with
the non-glibc code.  Stop reopening.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|                            |WONTFIX


http://sourceware.org/bugzilla/show_bug.cgi?id=1149
