* [PATCH] Some regex word matching fixes for multi-byte locales
From: Jakub Jelinek @ 2003-11-21 0:12 UTC
To: Ulrich Drepper, Roland McGrath; +Cc: Glibc hackers
Hi!
This patch does a few things:
a) remove or #ifdef out ~ 1.5KB of dead code: check_matching is never
called with fl_search != 0, and transit_state_sb is called only
from unreachable code (the else branch of an if (1), with no labels).
If someone later hacks up regex so that re_search_internal doesn't
call check_matching at each location separately, in order to optimize
searching for patterns like .abababababababababababababababc,
it can IMHO easily be added back, and it doesn't seem to be
tested code anyway.
b) build_trtable assumed that every word and non-word character is
encoded in a single byte (it used just IS_WORD_CHAR).
For multi-byte locales it now builds, if needed, a trtable twice as
large, with one half for !IS_WORD_CONTEXT and the other half for
IS_WORD_CONTEXT (only if the two actually differ; transit_state calls
re_string_context_at to pick the half).
c) changes group_nodes_into_DFAstates to not touch !ascii characters
for NEXT_WORD_CONSTRAINT or NEXT_NOTWORD_CONSTRAINT.
d) enables some testcases in bug-regex19.c and runs tst-rxspencer
in --utf8 mode as well.
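
To make b) concrete, here is a minimal sketch of the doubled table and its lookup. It is not the patch code: struct toy_state, toy_split_table and toy_next_state are hypothetical stand-ins, and the in_word_context flag stands for what transit_state derives from re_string_context_at and IS_WORD_CONTEXT. Only SBC_MAX (assumed to be 256 here), trtable and word_trtable are names taken from the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SBC_MAX 256   /* assumption: one slot per possible byte value */

/* Hypothetical stand-in for re_dfastate_t; only the two fields the
   patch touches are modelled.  */
struct toy_state
{
  struct toy_state **trtable;     /* SBC_MAX or 2 * SBC_MAX entries */
  unsigned int word_trtable : 1;  /* set once the table is doubled */
};

/* Grow a single table into a doubled one.  The upper half starts as a
   copy of the lower half, so bytes that do not care about the word
   context keep the same destination in both halves.
   Returns 0 on success, -1 if realloc fails (table left untouched).  */
static int
toy_split_table (struct toy_state *state)
{
  struct toy_state **t;

  assert (state->trtable != NULL && !state->word_trtable);
  t = realloc (state->trtable, sizeof (*t) * 2 * SBC_MAX);
  if (t == NULL)
    return -1;
  memcpy (t + SBC_MAX, t, sizeof (*t) * SBC_MAX);
  state->trtable = t;
  state->word_trtable = 1;
  return 0;
}

/* Look up the destination for byte CH.  IN_WORD_CONTEXT is only
   consulted when the table has been doubled.  */
static struct toy_state *
toy_next_state (const struct toy_state *state, unsigned char ch,
                int in_word_context)
{
  if (state->word_trtable && in_word_context)
    return state->trtable[ch + SBC_MAX];
  return state->trtable[ch];
}
```

The point of the design is that single-byte locales and states whose destinations never depend on the context keep the cheap one-table lookup; only states that need it pay for the second half and the context check.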
The c) change is problematic, but the current group_nodes_into_DFAstates
is more so. Using isalnum (ch) on the first byte of a multi-byte character
is simply wrong: a >= 0x80 byte might be accepted or not,
or accepted in some cases and not in others.
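
A small illustration of why the first byte alone cannot decide this. In UTF-8, A with diaeresis (\xc3\x84) and the division sign (\xc3\xb7) share the lead byte 0xc3, yet only the former is a letter. The sketch below is not regex code: decode_utf8_2 and toy_is_word are hypothetical helpers, and toy_is_word hard-codes the two letters used in this thread instead of asking the locale.

```c
#include <assert.h>
#include <stddef.h>

/* Decode a well-formed two-byte UTF-8 sequence to its code point.
   Returns -1 if S does not start with such a sequence.  */
static long
decode_utf8_2 (const unsigned char *s)
{
  assert (s != NULL);
  if (s[0] < 0xc2 || s[0] > 0xdf || (s[1] & 0xc0) != 0x80)
    return -1;
  return ((long) (s[0] & 0x1f) << 6) | (s[1] & 0x3f);
}

/* Toy word-character test covering only the code points discussed
   here; real code would consult the locale (e.g. iswalnum).  */
static int
toy_is_word (long cp)
{
  switch (cp)
    {
    case 0xc4:  /* LATIN CAPITAL LETTER A WITH DIAERESIS */
    case 0xd6:  /* LATIN CAPITAL LETTER O WITH DIAERESIS */
      return 1;
    default:    /* includes 0xf7, DIVISION SIGN */
      return 0;
    }
}
```

Both sequences start with 0xc3, so any per-byte bitmask has to either accept or reject that byte for both of them; the word/non-word decision only becomes possible once the whole character is known.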
I'll start by writing lots of testcases and then see what can be done.
In simple cases, such as when a certain word anchor is always followed by
word characters, or always by non-word characters starting with certain
bytes, we can certainly handle it as quickly as is done now. The question
is how to handle stuff like "\\b(\xc3\x84\\|\xc3\xb7\\)"
(A with diaeresis and division sign) or "\\b(\xc3\x96\\|.a)" etc.
At least on the two tests I've tried so far it seemed to work
with COMPLEX_BRACKET, but I guess I really need to look at many testcases
first.
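
For trying out such patterns by hand, a throwaway harness along these lines may help. It is only a sketch: try_match is a hypothetical helper built on the POSIX regcomp/regexec interface, whereas bug-regex19.c goes through re_search with explicit start offsets. The locale name passed in is the caller's choice and must be installed.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Compile PATTERN with CFLAGS under LOCALE and search STRING.
   Returns the byte offset of the first match, -1 if there is no match,
   or -2 if the locale cannot be set or the pattern does not compile.
   CFLAGS must not contain REG_NOSUB, since the offset is read back.  */
static int
try_match (const char *locale, const char *pattern, int cflags,
           const char *string)
{
  regex_t re;
  regmatch_t m;
  int rc;

  assert (pattern != NULL && string != NULL);
  if (setlocale (LC_ALL, locale) == NULL)
    return -2;
  if (regcomp (&re, pattern, cflags) != 0)
    return -2;
  rc = regexec (&re, string, 1, &m, 0);
  regfree (&re);
  if (rc == REG_NOMATCH)
    return -1;
  if (rc != 0)
    return -2;
  return (int) m.rm_so;
}
```

A call such as try_match ("cs_CZ.UTF-8", "\\b(\xc3\x84|\xc3\xb7)", REG_EXTENDED, "a\xc3\x84") would then exercise the first case above in ERE syntax; note that \b is a GNU extension, so the result is only meaningful with glibc's regex.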
2003-11-20 Jakub Jelinek <jakub@redhat.com>
* posix/regex_internal.h (re_dfastate_t): Remove trtable_search.
Add word_trtable.
* posix/regex_internal.c (create_newstate_common, free_state):
Don't free trtable_search.
* posix/regexec.c (check_matching): Remove fl_search argument.
(transit_state_sb): Likewise. #ifdef out as unused.
(build_trtable): Remove fl_search argument. Set state->word_trtable
and state->trtable. Build separate word and non-word tables if
multi-byte and they differ for some character.
(transit_state): Remove fl_search argument. Don't update
state->trtable here. Handle state->word_trtable.
#ifdef out unused call to transit_state_sb.
(re_search_internal): Update check_matching caller.
(group_nodes_into_DFAstates): Don't clear non-ascii chars in accepts
bitmask for multi-byte locales.
* posix/bug-regex19.c (tests): Enable some commented out tests, add
2 new tests.
* posix/tst-rxspencer.c (mb_tests): Don't test [[=b=]] for now as
multi-byte. Don't run identical multi-byte tests multiple times
unnecessarily.
(main): Check setlocale return value.
* posix/Makefile (tst-rxspencer-ARGS): Add --utf8 argument.
(tst-rxspencer-ENV): Remove MALLOC_TRACE, add LOCPATH.
($(objpfx)tst-rxspencer-mem): Run another tst-rxspencer test
here, without --utf8 argument but with MALLOC_TRACE.
localedata/
* Makefile (LOCALES): Add cs_CZ.UTF-8.
--- libc/localedata/Makefile.jj 2003-11-18 11:34:26.000000000 +0100
+++ libc/localedata/Makefile 2003-11-20 14:30:02.000000000 +0100
@@ -132,7 +132,7 @@ LOCALES := de_DE.ISO-8859-1 de_DE.UTF-8
en_US.ISO-8859-1 ja_JP.EUC-JP da_DK.ISO-8859-1 \
hr_HR.ISO-8859-2 sv_SE.ISO-8859-1 ja_JP.SJIS fr_FR.ISO-8859-1 \
vi_VN.TCVN5712-1 nb_NO.ISO-8859-1 nn_NO.ISO-8859-1 \
- tr_TR.UTF-8
+ tr_TR.UTF-8 cs_CZ.UTF-8
LOCALE_SRCS := $(shell echo "$(LOCALES)"|sed 's/\([^ .]*\)[^ ]*/\1/g')
CHARMAPS := $(shell echo "$(LOCALES)" | \
sed -e 's/[^ .]*[.]\([^ ]*\)/\1/g' -e s/SJIS/SHIFT_JIS/g)
--- libc/posix/regex_internal.h.jj 2003-11-20 00:40:48.000000000 +0100
+++ libc/posix/regex_internal.h 2003-11-20 12:26:11.000000000 +0100
@@ -456,7 +456,6 @@ struct re_dfastate_t
re_node_set nodes;
re_node_set *entrance_nodes;
struct re_dfastate_t **trtable;
- struct re_dfastate_t **trtable_search;
/* If this state is a special state.
A state is a special state if the state is the halt state, or
a anchor. */
@@ -469,6 +468,7 @@ struct re_dfastate_t
/* If this state has backreference node(s). */
unsigned int has_backref : 1;
unsigned int has_constraint : 1;
+ unsigned int word_trtable : 1;
};
typedef struct re_dfastate_t re_dfastate_t;
--- libc/posix/regexec.c.jj 2003-11-20 00:40:48.000000000 +0100
+++ libc/posix/regexec.c 2003-11-20 21:25:22.000000000 +0100
@@ -57,7 +57,7 @@ static re_dfastate_t *acquire_init_state
static reg_errcode_t prune_impossible_nodes (const regex_t *preg,
re_match_context_t *mctx);
static int check_matching (const regex_t *preg, re_match_context_t *mctx,
- int fl_search, int fl_longest_match);
+ int fl_longest_match);
static int check_halt_node_context (const re_dfa_t *dfa, int node,
unsigned int context);
static int check_halt_state_context (const regex_t *preg,
@@ -123,15 +123,16 @@ static reg_errcode_t merge_state_array (
re_dfastate_t **src, int num);
static re_dfastate_t *transit_state (reg_errcode_t *err, const regex_t *preg,
re_match_context_t *mctx,
- re_dfastate_t *state, int fl_search);
+ re_dfastate_t *state);
static reg_errcode_t check_subexp_matching_top (re_dfa_t *dfa,
re_match_context_t *mctx,
re_node_set *cur_nodes,
int str_idx);
+#if 0
static re_dfastate_t *transit_state_sb (reg_errcode_t *err, const regex_t *preg,
re_dfastate_t *pstate,
- int fl_search,
re_match_context_t *mctx);
+#endif
#ifdef RE_ENABLE_I18N
static reg_errcode_t transit_state_mb (const regex_t *preg,
re_dfastate_t *pstate,
@@ -173,8 +174,7 @@ static reg_errcode_t expand_bkref_cache
int last_str, int subexp_num,
int fl_open);
static re_dfastate_t **build_trtable (const regex_t *dfa,
- const re_dfastate_t *state,
- int fl_search);
+ re_dfastate_t *state);
#ifdef RE_ENABLE_I18N
static int check_node_accept_bytes (const regex_t *preg, int node_idx,
const re_string_t *input, int idx);
@@ -741,7 +741,7 @@ re_search_internal (preg, string, length
/* It seems to be appropriate one, then use the matcher. */
/* We assume that the matching starts from 0. */
mctx.state_log_top = mctx.nbkref_ents = mctx.max_mb_elem_len = 0;
- match_last = check_matching (preg, &mctx, 0, fl_longest_match);
+ match_last = check_matching (preg, &mctx, fl_longest_match);
if (match_last != -1)
{
if (BE (match_last == -2, 0))
@@ -919,8 +919,8 @@ acquire_init_state_context (err, preg, m
if (dfa->init_state->has_constraint)
{
unsigned int context;
- context = re_string_context_at (mctx->input, idx - 1, mctx->eflags,
- preg->newline_anchor);
+ context = re_string_context_at (mctx->input, idx - 1, mctx->eflags,
+ preg->newline_anchor);
if (IS_WORD_CONTEXT (context))
return dfa->init_state_word;
else if (IS_ORDINARY_CONTEXT (context))
@@ -947,16 +947,15 @@ acquire_init_state_context (err, preg, m
/* Check whether the regular expression match input string INPUT or not,
and return the index where the matching end, return -1 if not match,
or return -2 in case of an error.
- FL_SEARCH means we must search where the matching starts,
FL_LONGEST_MATCH means we want the POSIX longest matching.
Note that the matcher assume that the maching starts from the current
index of the buffer. */
static int
-check_matching (preg, mctx, fl_search, fl_longest_match)
+check_matching (preg, mctx, fl_longest_match)
const regex_t *preg;
re_match_context_t *mctx;
- int fl_search, fl_longest_match;
+ int fl_longest_match;
{
re_dfa_t *dfa = (re_dfa_t *) preg->buffer;
reg_errcode_t err;
@@ -1006,31 +1005,15 @@ check_matching (preg, mctx, fl_search, f
while (!re_string_eoi (mctx->input))
{
- cur_state = transit_state (&err, preg, mctx, cur_state,
- fl_search && !match);
+ cur_state = transit_state (&err, preg, mctx, cur_state);
if (cur_state == NULL) /* Reached at the invalid state or an error. */
{
cur_str_idx = re_string_cur_idx (mctx->input);
if (BE (err != REG_NOERROR, 0))
return -2;
- if (fl_search && !match)
- {
- /* Restart from initial state, since we are searching
- the point from where matching start. */
-#ifdef RE_ENABLE_I18N
- if (dfa->mb_cur_max == 1
- || re_string_first_byte (mctx->input, cur_str_idx))
-#endif /* RE_ENABLE_I18N */
- cur_state = acquire_init_state_context (&err, preg, mctx,
- cur_str_idx);
- if (BE (cur_state == NULL && err != REG_NOERROR, 0))
- return -2;
- if (mctx->state_log != NULL)
- mctx->state_log[cur_str_idx] = cur_state;
- }
- else if (!fl_longest_match && match)
+ if (!fl_longest_match && match)
break;
- else /* (fl_longest_match && match) || (!fl_search && !match) */
+ else
{
if (mctx->state_log == NULL)
break;
@@ -2069,12 +2052,11 @@ sift_states_iter_mb (preg, mctx, sctx, n
update the destination of STATE_LOG. */
static re_dfastate_t *
-transit_state (err, preg, mctx, state, fl_search)
+transit_state (err, preg, mctx, state)
reg_errcode_t *err;
const regex_t *preg;
re_match_context_t *mctx;
re_dfastate_t *state;
- int fl_search;
{
re_dfa_t *dfa = (re_dfa_t *) preg->buffer;
re_dfastate_t **trtable, *next_state;
@@ -2113,24 +2095,40 @@ transit_state (err, preg, mctx, state, f
{
/* Use transition table */
ch = re_string_fetch_byte (mctx->input);
- trtable = fl_search ? state->trtable_search : state->trtable;
+ trtable = state->trtable;
if (trtable == NULL)
{
- trtable = build_trtable (preg, state, fl_search);
- if (fl_search)
- state->trtable_search = trtable;
+ trtable = build_trtable (preg, state);
+ if (trtable == NULL)
+ {
+ *err = REG_ESPACE;
+ return NULL;
+ }
+ }
+ if (BE (state->word_trtable, 0))
+ {
+ unsigned int context;
+ context
+ = re_string_context_at (mctx->input,
+ re_string_cur_idx (mctx->input) - 1,
+ mctx->eflags, preg->newline_anchor);
+ if (IS_WORD_CONTEXT (context))
+ next_state = trtable[ch + SBC_MAX];
else
- state->trtable = trtable;
+ next_state = trtable[ch];
}
- next_state = trtable[ch];
+ else
+ next_state = trtable[ch];
}
+#if 0
else
{
/* don't use transition table */
- next_state = transit_state_sb (err, preg, state, fl_search, mctx);
+ next_state = transit_state_sb (err, preg, state, mctx);
if (BE (next_state == NULL && err != REG_NOERROR, 0))
return NULL;
}
+#endif
}
cur_idx = re_string_cur_idx (mctx->input);
@@ -2242,15 +2240,15 @@ check_subexp_matching_top (dfa, mctx, cu
return REG_NOERROR;
}
+#if 0
/* Return the next state to which the current state STATE will transit by
accepting the current input byte. */
static re_dfastate_t *
-transit_state_sb (err, preg, state, fl_search, mctx)
+transit_state_sb (err, preg, state, mctx)
reg_errcode_t *err;
const regex_t *preg;
re_dfastate_t *state;
- int fl_search;
re_match_context_t *mctx;
{
re_dfa_t *dfa = (re_dfa_t *) preg->buffer;
@@ -2276,29 +2274,6 @@ transit_state_sb (err, preg, state, fl_s
}
}
}
- if (fl_search)
- {
-#ifdef RE_ENABLE_I18N
- int not_initial = 0;
- if (dfa->mb_cur_max > 1)
- for (node_cnt = 0; node_cnt < next_nodes.nelem; ++node_cnt)
- if (dfa->nodes[next_nodes.elems[node_cnt]].type == CHARACTER)
- {
- not_initial = dfa->nodes[next_nodes.elems[node_cnt]].mb_partial;
- break;
- }
- if (!not_initial)
-#endif
- {
- *err = re_node_set_merge (&next_nodes,
- dfa->init_state->entrance_nodes);
- if (BE (*err != REG_NOERROR, 0))
- {
- re_node_set_free (&next_nodes);
- return NULL;
- }
- }
- }
context = re_string_context_at (mctx->input, cur_str_idx, mctx->eflags,
preg->newline_anchor);
next_state = re_acquire_state_context (err, dfa, &next_nodes, context);
@@ -2309,6 +2284,7 @@ transit_state_sb (err, preg, state, fl_s
re_string_skip_bytes (mctx->input, 1);
return next_state;
}
+#endif
#ifdef RE_ENABLE_I18N
static reg_errcode_t
@@ -3117,10 +3093,9 @@ expand_bkref_cache (preg, mctx, cur_node
Return the new table if succeeded, otherwise return NULL. */
static re_dfastate_t **
-build_trtable (preg, state, fl_search)
+build_trtable (preg, state)
const regex_t *preg;
- const re_dfastate_t *state;
- int fl_search;
+ re_dfastate_t *state;
{
reg_errcode_t err;
re_dfa_t *dfa = (re_dfa_t *) preg->buffer;
@@ -3154,6 +3129,7 @@ build_trtable (preg, state, fl_search)
/* Initialize transiton table. */
trtable = (re_dfastate_t **) calloc (sizeof (re_dfastate_t *), SBC_MAX);
+ state->word_trtable = 0;
if (BE (trtable == NULL, 0))
{
if (dests_node_malloced)
@@ -3170,7 +3146,10 @@ build_trtable (preg, state, fl_search)
free (dests_node);
/* Return NULL in case of an error, trtable otherwise. */
if (ndests == 0)
- return trtable;
+ {
+ state->trtable = trtable;
+ return trtable;
+ }
free (trtable);
return NULL;
}
@@ -3224,26 +3203,6 @@ out_free:
goto out_free;
}
}
- /* If search flag is set, merge the initial state. */
- if (fl_search)
- {
-#ifdef RE_ENABLE_I18N
- int not_initial = 0;
- for (j = 0; j < follows.nelem; ++j)
- if (dfa->nodes[follows.elems[j]].type == CHARACTER)
- {
- not_initial = dfa->nodes[follows.elems[j]].mb_partial;
- break;
- }
- if (!not_initial)
-#endif
- {
- err = re_node_set_merge (&follows,
- dfa->init_state->entrance_nodes);
- if (BE (err != REG_NOERROR, 0))
- goto out_free;
- }
- }
dest_states[i] = re_acquire_state_context (&err, dfa, &follows, 0);
if (BE (dest_states[i] == NULL && err != REG_NOERROR, 0))
goto out_free;
@@ -3274,31 +3233,41 @@ out_free:
for (j = 0; j < UINT_BITS; ++j, ++ch)
if ((acceptable[i] >> j) & 1)
{
- /* The current state accepts the character ch. */
- if (IS_WORD_CHAR (ch))
- {
- for (k = 0; k < ndests; ++k)
- if ((dests_ch[k][i] >> j) & 1)
+ for (k = 0; k < ndests; ++k)
+ if ((dests_ch[k][i] >> j) & 1)
+ {
+ /* k-th destination accepts the word character ch. */
+ if (state->word_trtable)
{
- /* k-th destination accepts the word character ch. */
- trtable[ch] = dest_states_word[k];
- /* There must be only one destination which accepts
- character ch. See group_nodes_into_DFAstates. */
- break;
+ trtable[ch] = dest_states[k];
+ trtable[ch + SBC_MAX] = dest_states_word[k];
}
- }
- else /* not WORD_CHAR */
- {
- for (k = 0; k < ndests; ++k)
- if ((dests_ch[k][i] >> j) & 1)
+ else if (dfa->mb_cur_max > 1
+ && dest_states[k] != dest_states_word[k])
{
- /* k-th destination accepts the non-word character ch. */
+ re_dfastate_t **new_trtable;
+
+ new_trtable = (re_dfastate_t **)
+ realloc (trtable,
+ sizeof (re_dfastate_t *)
+ * 2 * SBC_MAX);
+ if (BE (new_trtable == NULL, 0))
+ goto out_free;
+ memcpy (new_trtable + SBC_MAX, new_trtable,
+ sizeof (re_dfastate_t *) * SBC_MAX);
+ trtable = new_trtable;
+ state->word_trtable = 1;
trtable[ch] = dest_states[k];
- /* There must be only one destination which accepts
- character ch. See group_nodes_into_DFAstates. */
- break;
+ trtable[ch + SBC_MAX] = dest_states_word[k];
}
- }
+ else if (IS_WORD_CHAR (ch))
+ trtable[ch] = dest_states_word[k];
+ else
+ trtable[ch] = dest_states[k];
+ /* There must be only one destination which accepts
+ character ch. See group_nodes_into_DFAstates. */
+ break;
+ }
}
/* new line */
if (bitset_contain (acceptable, NEWLINE_CHAR))
@@ -3309,6 +3278,8 @@ out_free:
{
/* k-th destination accepts newline character. */
trtable[NEWLINE_CHAR] = dest_states_nl[k];
+ if (state->word_trtable)
+ trtable[NEWLINE_CHAR + SBC_MAX] = dest_states_nl[k];
/* There must be only one destination which accepts
newline. See group_nodes_into_DFAstates. */
break;
@@ -3325,6 +3296,7 @@ out_free:
if (dests_node_malloced)
free (dests_node);
+ state->trtable = trtable;
return trtable;
}
@@ -3386,6 +3358,8 @@ group_nodes_into_DFAstates (preg, state,
match it the context. */
if (constraint)
{
+ int word_char_max;
+
if (constraint & NEXT_NEWLINE_CONSTRAINT)
{
int accepts_newline = bitset_contain (accepts, NEWLINE_CHAR);
@@ -3400,11 +3374,16 @@ group_nodes_into_DFAstates (preg, state,
bitset_empty (accepts);
continue;
}
+
+ /* This assumes ASCII compatible locale. We cannot say
+ anything about the non-ascii chars. */
+ word_char_max
+ = dfa->mb_cur_max > 1 ? BITSET_UINTS / 2 : BITSET_UINTS;
if (constraint & NEXT_WORD_CONSTRAINT)
- for (j = 0; j < BITSET_UINTS; ++j)
+ for (j = 0; j < word_char_max; ++j)
accepts[j] &= dfa->word_char[j];
if (constraint & NEXT_NOTWORD_CONSTRAINT)
- for (j = 0; j < BITSET_UINTS; ++j)
+ for (j = 0; j < word_char_max; ++j)
accepts[j] &= ~dfa->word_char[j];
}
--- libc/posix/regex_internal.c.jj 2003-11-19 10:24:36.000000000 +0100
+++ libc/posix/regex_internal.c 2003-11-20 21:25:25.000000000 +0100
@@ -1207,7 +1207,6 @@ create_newstate_common (dfa, nodes, hash
return NULL;
}
newstate->trtable = NULL;
- newstate->trtable_search = NULL;
newstate->hash = hash;
return newstate;
}
@@ -1369,6 +1368,5 @@ free_state (state)
}
re_node_set_free (&state->nodes);
re_free (state->trtable);
- re_free (state->trtable_search);
re_free (state);
}
--- libc/posix/bug-regex19.c.jj 2003-11-12 18:41:47.000000000 +0100
+++ libc/posix/bug-regex19.c 2003-11-20 21:24:03.000000000 +0100
@@ -37,17 +37,21 @@ static struct
\xc3\x96 LATIN CAPITAL LETTER O WITH DIAERESIS
\xe2\x80\x94 EM DASH */
/* Should not match. */
+ {RE_SYNTAX_POSIX_BASIC, "\\<A", "aOAA", 0, -1},
{RE_SYNTAX_POSIX_BASIC, "\\<A", "aOAA", 2, -1},
{RE_SYNTAX_POSIX_BASIC, "A\\>", "aAAO", 1, -1},
+ {RE_SYNTAX_POSIX_BASIC, "\\bA", "aOAA", 0, -1},
{RE_SYNTAX_POSIX_BASIC, "\\bA", "aOAA", 2, -1},
{RE_SYNTAX_POSIX_BASIC, "A\\b", "aAAO", 1, -1},
+ {RE_SYNTAX_POSIX_BASIC, "\\<\xc3\x84", "a\xc3\x96\xc3\x84\xc3\x84", 0, -1},
{RE_SYNTAX_POSIX_BASIC, "\\<\xc3\x84", "a\xc3\x96\xc3\x84\xc3\x84", 3, -1},
{RE_SYNTAX_POSIX_BASIC, "\xc3\x84\\>", "a\xc3\x84\xc3\x84\xc3\x96", 1, -1},
#if 0
- /* XXX Not used since they fail so far. */
+ /* XXX these 2 tests still fail. */
+ {RE_SYNTAX_POSIX_BASIC, "\\b\xc3\x84", "a\xc3\x96\xc3\x84\xc3\x84", 0, -1},
{RE_SYNTAX_POSIX_BASIC, "\\b\xc3\x84", "a\xc3\x96\xc3\x84\xc3\x84", 3, -1},
- {RE_SYNTAX_POSIX_BASIC, "\xc3\x84\\b", "a\xc3\x84\xc3\x84\xc3\x96", 1, -1},
#endif
+ {RE_SYNTAX_POSIX_BASIC, "\xc3\x84\\b", "a\xc3\x84\xc3\x84\xc3\x96", 1, -1},
/* Should match. */
{RE_SYNTAX_POSIX_BASIC, "\\<A", "AA", 0, 0},
{RE_SYNTAX_POSIX_BASIC, "\\<A", "a-AA", 2, 2},
@@ -57,8 +61,6 @@ static struct
{RE_SYNTAX_POSIX_BASIC, "\\bA", "a-AA", 2, 2},
{RE_SYNTAX_POSIX_BASIC, "A\\b", "aAA-", 1, 2},
{RE_SYNTAX_POSIX_BASIC, "A\\b", "aAA", 1, 2},
-#if 0
- /* XXX Not used since they fail so far. */
{RE_SYNTAX_POSIX_BASIC, "\\<\xc3\x84", "\xc3\x84\xc3\x84", 0, 0},
{RE_SYNTAX_POSIX_BASIC, "\\<\xc3\x84", "a\xe2\x80\x94\xc3\x84\xc3\x84", 4, 4},
{RE_SYNTAX_POSIX_BASIC, "\xc3\x84\\>", "a\xc3\x84\xc3\x84\xe2\x80\x94", 1, 3},
@@ -67,7 +69,6 @@ static struct
{RE_SYNTAX_POSIX_BASIC, "\\b\xc3\x84", "a\xe2\x80\x94\xc3\x84\xc3\x84", 4, 4},
{RE_SYNTAX_POSIX_BASIC, "\xc3\x84\\b", "a\xc3\x84\xc3\x84\xe2\x80\x94", 1, 3},
{RE_SYNTAX_POSIX_BASIC, "\xc3\x84\\b", "a\xc3\x84\xc3\x84", 1, 3}
-#endif
};
int
--- libc/posix/tst-rxspencer.c.jj 2003-11-20 00:40:48.000000000 +0100
+++ libc/posix/tst-rxspencer.c 2003-11-20 14:56:22.000000000 +0100
@@ -350,16 +350,28 @@ mb_tests (const char *pattern, int cflag
if (strstr (pattern, "[:xdigit:]"))
return 0;
+ /* XXX: regex ATM handles only single byte equivalence classes. */
+ if (strstr (pattern, "[[=b=]]"))
+ return 0;
+
for (i = 1; i < 16; ++i)
{
char *p = letters;
- if (i & 1)
+ if ((i & 1)
+ && (strchr (pattern, 'a') || strchr (string, 'a')
+ || strchr (pattern, 'A') || strchr (string, 'A')))
*p++ = 'a', *p++ = 'A';
- if (i & 2)
+ if ((i & 2)
+ && (strchr (pattern, 'b') || strchr (string, 'b')
+ || strchr (pattern, 'B') || strchr (string, 'B')))
*p++ = 'b', *p++ = 'B';
- if (i & 4)
+ if ((i & 4)
+ && (strchr (pattern, 'c') || strchr (string, 'c')
+ || strchr (pattern, 'C') || strchr (string, 'C')))
*p++ = 'c', *p++ = 'C';
- if (i & 8)
+ if ((i & 8)
+ && (strchr (pattern, 'd') || strchr (string, 'd')
+ || strchr (pattern, 'D') || strchr (string, 'D')))
*p++ = 'd', *p++ = 'D';
*p++ = '\0';
sprintf (fail, "UTF-8 %s FAIL", letters);
@@ -489,7 +501,11 @@ main (int argc, char **argv)
replace_special_chars (matches);
}
- setlocale (LC_ALL, "C");
+ if (setlocale (LC_ALL, "C") == NULL)
+ {
+ puts ("setlocale C failed");
+ ret = 1;
+ }
if (test (pattern, cflags, string, eflags, expect, matches, "FAIL")
|| (try_bre_ere
&& test (pattern, cflags & ~REG_EXTENDED, string, eflags,
@@ -497,12 +513,16 @@ main (int argc, char **argv)
ret = 1;
else if (test_utf8)
{
- setlocale (LC_ALL, "cs_CZ.UTF-8");
- if (test (pattern, cflags, string, eflags, expect, matches,
- "UTF-8 FAIL")
- || (try_bre_ere
- && test (pattern, cflags & ~REG_EXTENDED, string, eflags,
- expect, matches, "UTF-8 FAIL")))
+ if (setlocale (LC_ALL, "cs_CZ.UTF-8") == NULL)
+ {
+ puts ("setlocale cs_CZ.UTF-8 failed");
+ ret = 1;
+ }
+ else if (test (pattern, cflags, string, eflags, expect, matches,
+ "UTF-8 FAIL")
+ || (try_bre_ere
+ && test (pattern, cflags & ~REG_EXTENDED, string,
+ eflags, expect, matches, "UTF-8 FAIL")))
ret = 1;
else if (mb_tests (pattern, cflags, string, eflags, expect, matches)
|| (try_bre_ere
--- libc/posix/Makefile.jj 2003-11-20 00:40:47.000000000 +0100
+++ libc/posix/Makefile 2003-11-20 15:18:18.000000000 +0100
@@ -148,7 +148,6 @@ tst-exec-ARGS = -- $(built-program-cmd)
tst-spawn-ARGS = -- $(built-program-cmd)
tst-dir-ARGS = `pwd` `cd $(common-objdir)/$(subdir); pwd` `cd $(common-objdir); pwd` $(objpfx)tst-dir
tst-chmod-ARGS = `pwd`
-tst-rxspencer-ARGS = rxspencer/tests
tst-fnmatch-ENV = LOCPATH=$(common-objpfx)localedata
tst-regexloc-ENV = LOCPATH=$(common-objpfx)localedata
@@ -160,6 +159,8 @@ bug-regex17-ENV = LOCPATH=$(common-objpf
bug-regex18-ENV = LOCPATH=$(common-objpfx)localedata
bug-regex19-ENV = LOCPATH=$(common-objpfx)localedata
bug-regex20-ENV = LOCPATH=$(common-objpfx)localedata
+tst-rxspencer-ARGS = --utf8 rxspencer/tests
+tst-rxspencer-ENV = LOCPATH=$(common-objpfx)localedata
testcases.h: TESTS TESTS2C.sed
sed -f TESTS2C.sed < $< > $@T
@@ -207,9 +208,13 @@ bug-regex21-ENV = MALLOC_TRACE=$(objpfx)
$(objpfx)bug-regex21-mem: $(objpfx)bug-regex21.out
$(common-objpfx)malloc/mtrace $(objpfx)bug-regex21.mtrace > $@
-tst-rxspencer-ENV = MALLOC_TRACE=$(objpfx)tst-rxspencer.mtrace
-
+# tst-rxspencer.mtrace is generated only when run without --utf8
+# option, since otherwise the file has almost 100M and takes very long
+# time to process.
$(objpfx)tst-rxspencer-mem: $(objpfx)tst-rxspencer.out
+ MALLOC_TRACE=$(objpfx)tst-rxspencer.mtrace $(tst-rxspencer-ENV) \
+ $(run-program-prefix) $(objpfx)tst-rxspencer rxspencer/tests \
+ > /dev/null
$(common-objpfx)malloc/mtrace $(objpfx)tst-rxspencer.mtrace > $@
$(objpfx)tst-getconf.out: tst-getconf.sh $(objpfx)getconf
Jakub
* Re: [PATCH] Some regex word matching fixes for multi-byte locales
From: Ulrich Drepper @ 2003-11-21 8:39 UTC
To: Jakub Jelinek; +Cc: Glibc hackers
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
I've applied this patch. Thanks,
- --
⧠Ulrich Drepper ⧠Red Hat, Inc. ⧠444 Castro St ⧠Mountain View, CA â
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)
iD8DBQE/vVwN2ijCOnn/RHQRAszQAKCK7Kd4CxKHL1K/60FNanz7ksGoXQCeP1HM
8Y/ltHMQ79Vrc9/CahzNefE=
=XApt
-----END PGP SIGNATURE-----