public inbox for glibc-bugs-regex@sourceware.org
* [Bug regex/693] New: mishandled '\B'
@ 2005-01-25 10:32 kasal at ucw dot cz
2005-01-25 13:53 ` [Bug regex/693] " tee at sgi dot com
` (10 more replies)
0 siblings, 11 replies; 12+ messages in thread
From: kasal at ucw dot cz @ 2005-01-25 10:32 UTC
To: glibc-bugs-regex
The GNU extension '\B' has always meant non-\b.
The dfa.[ch] code included in grep and gawk still handles it this way.
Try echo ' ' | grep ' \B ' on your system, try gsub(/ \B /,...) in gawk-3.1.1
or gawk '/ \B /' with current gawk.
I have checked the Perl documentation; it also defines '\B' as non-\b.
But the current regex has changed to interpret '\B' as the empty string within a word. Try the above
gsub with current awk.
See also http://lists.gnu.org/archive/html/bug-gnu-utils/2005-01/msg00087.html
IMHO, the current regex code is not correct.
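A minimal sketch of the "non-\b" reading, assuming Python's re module as a stand-in engine (the thread itself concerns glibc regex, dfa.c and Perl, not Python). It checks that every position in a string satisfies exactly one of \b and \B. The helper name classify and the two-space subject string are illustrative choices, not taken from the report.

```python
import re

WORD_BOUNDARY = re.compile(r"\b")
NOT_WORD_BOUNDARY = re.compile(r"\B")


def classify(s):
    """Label each position 0..len(s) with 'b' (boundary) or 'B' (not a boundary)."""
    labels = []
    for pos in range(len(s) + 1):
        at_b = WORD_BOUNDARY.match(s, pos) is not None
        at_not_b = NOT_WORD_BOUNDARY.match(s, pos) is not None
        # Under the "non-\b" reading the two anchors partition all positions.
        assert at_b != at_not_b
        labels.append("b" if at_b else "B")
    return "".join(labels)


print(classify("a b"))                    # every position is a word boundary
print(classify("  "))                     # no position is a word boundary
print(bool(re.search(r" \B ", "  ")))     # \B holds between two spaces
```

In an engine that instead treats \B as "inside a word", the last search would fail, because neither neighbour of the middle position is a word character.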
--
Summary: mishandled '\B'
Product: glibc
Version: unspecified
Status: NEW
Severity: normal
Priority: P2
Component: regex
AssignedTo: gotom at debian dot or dot jp
ReportedBy: kasal at ucw dot cz
CC: glibc-bugs-regex at sources dot redhat dot com, glibc-bugs at sources dot redhat dot com
http://sources.redhat.com/bugzilla/show_bug.cgi?id=693
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
* [Bug regex/693] mishandled '\B'
From: tee at sgi dot com @ 2005-01-25 13:53 UTC
To: glibc-bugs-regex
------- Additional Comments From tee at sgi dot com 2005-01-25 13:53 -------
If the '\B' functionality is fixed, then the man page will also need to be updated.
It currently says that '\B' "matches the empty string within a word." Also, I
noticed that '\b' is not documented in the man page at all.
Thank you.
--
What |Removed |Added
----------------------------------------------------------------------------
CC| |tee at sgi dot com
* [Bug regex/693] mishandled '\B'
From: jakub at redhat dot com @ 2005-01-25 14:08 UTC
To: glibc-bugs-regex
------- Additional Comments From jakub at redhat dot com 2005-01-25 14:08 -------
regex.texi:
@title Regex
@subtitle edition 0.12a
@subtitle 19 September 1992
@author Kathryn A. Hargreaves
@author Karl Berry
has:
@cindex @samp{\B}
This operator (represented by @samp{\B}) matches the empty string within
a word. For example, @samp{c\Brat\Be} matches @samp{crate}, but
@samp{dirty \Brat} doesn't match @samp{dirty rat}.
so to me it seems that the current glibc regex works as documented.
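The two readings of \B in play here can be written out as position predicates. The following is a sketch in Python under stated assumptions: word characters are taken to be alphanumerics plus underscore, string edges count as non-word, and all function names are hypothetical. It is not the glibc implementation.

```python
def is_word(ch):
    return ch.isalnum() or ch == "_"


def neighbours(s, i):
    """Return (previous char is a word char, next char is a word char) at position i."""
    prev_word = i > 0 and is_word(s[i - 1])
    next_word = i < len(s) and is_word(s[i])
    return prev_word, next_word


def inside_word(s, i):
    """Manual's wording: the empty string within a word."""
    prev_word, next_word = neighbours(s, i)
    return prev_word and next_word


def not_word_boundary(s, i):
    """Complement of \\b: both neighbours are of the same kind."""
    prev_word, next_word = neighbours(s, i)
    return prev_word == next_word


def positions(s, pred):
    return [i for i in range(len(s) + 1) if pred(s, i)]


for subject in ("crate", "dirty rat", "  "):
    print(repr(subject),
          positions(subject, inside_word),
          positions(subject, not_word_boundary))
```

On the manual's own examples ("crate" and "dirty rat") the two predicates select the same positions, which is why the examples alone cannot settle the question. They differ only where both neighbours are non-word characters, as in the two-space string.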
--
* [Bug regex/693] mishandled '\B'
From: karl at freefriends dot org @ 2005-01-25 17:11 UTC
To: glibc-bugs-regex
------- Additional Comments From karl at freefriends dot org 2005-01-25 17:11 -------
When Kathy and I wrote that description of \B more than a decade ago, we did not
have any deep reasoning behind it. In fact, we never thought of the possibility
that "inverse of \b" and "empty string within a word" were two different things.
We were just trying to give simple examples.
The manual was never meant to be taken as gospel or a standard; we were trying
to describe how things worked rather than prescribe how things should
work. (It also desperately needs updating.)
So, I'm sorry that so much effort has gone into implementing our off-the-cuff
description of \B. However, it seems to me that it would be better for users if
\B in the new regex had the same definition as it's always had -- not \b. I
don't see any advantage to being incompatible with the past here; just the opposite.
--
* [Bug regex/693] mishandled '\B'
From: arnold at skeeve dot com @ 2005-01-26 13:44 UTC
To: glibc-bugs-regex
------- Additional Comments From arnold at skeeve dot com 2005-01-26 13:44 -------
I'll second Karl's motion here; the current regex should be fixed to work like
the old one did. This brings dfa and regex back into line, which both grep
and gawk need, and provides backwards compatibility. The manual can and probably
should be changed, although that's a separate issue. The compatibility with
perl is also a welcome thing to have.
--
What |Removed |Added
----------------------------------------------------------------------------
CC| |arnold at skeeve dot com
* [Bug regex/693] mishandled '\B'
From: jakub at redhat dot com @ 2005-01-26 17:33 UTC
To: glibc-bugs-regex
------- Additional Comments From jakub at redhat dot com 2005-01-26 17:33 -------
Patch here: <http://sources.redhat.com/ml/libc-hacker/2005-01/msg00066.html>
--
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |ASSIGNED
* [Bug regex/693] mishandled '\B'
From: arnold at skeeve dot com @ 2005-01-30 12:09 UTC
To: glibc-bugs-regex
------- Additional Comments From arnold at skeeve dot com 2005-01-30 12:09 -------
Hi. First, I appreciate the quick response on this issue. I am thus
saddened to say that the behavior between the current CVS and the original
regex is not identical. And to be honest, I'm not sure which is "correct".
Here's a script showing the difference:
$ cat typescript
Script started on Sun Jan 30 14:03:31 2005
bash-2.05b$ cat gnureop2.awk
BEGIN {
    print (" " ~ / \B /)    # test dfa matcher
    a = " "
    gsub(/\B/, "x", a)      # test regex matcher
    print a
}
bash-2.05b$ gawk-3.1.1 -f gnureop2.awk # old regex
1
x
bash-2.05b$ gawk-3.1.4 -f gnureop2.awk # previous glibc regex
1
bash-2.05b$ ./gawk -f gnureop2.awk # current CVS glibc regex
1
x x x
bash-2.05b$
Script done on Sun Jan 30 14:04:26 2005
--
* [Bug regex/693] mishandled '\B'
From: paolo dot bonzini at lu dot unisi dot ch @ 2005-01-30 13:33 UTC
To: glibc-bugs-regex
------- Additional Comments From paolo dot bonzini at lu dot unisi dot ch 2005-01-30 13:33 -------
Subject: Re: mishandled '\B'
> bash-2.05b$ gawk-3.1.1 -f gnureop2.awk # old regex
> 1
> x
> bash-2.05b$ ./gawk -f gnureop2.awk # current CVS glibc regex
> 1
> x x x
It looks like the old regex special-cased the first and last character so
that it was neither a word character nor a non-word character. The
current behavior is more consistent. FWIW, PCRE also shows the same
behavior as current CVS glibc regex.
Paolo
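The difference described above can be modelled with two rules for where \B holds. This is a hypothetical Python sketch, not code from either regex implementation: the edge-excluding rule is only one way to express the guess that the old regex special-cased the first and last character, and the two-space input is an assumption chosen so that there are three candidate insertion points.

```python
def is_word(ch):
    return ch.isalnum() or ch == "_"


def not_boundary(s, i):
    """Consistent rule: string edges behave like non-word characters."""
    prev_word = i > 0 and is_word(s[i - 1])
    next_word = i < len(s) and is_word(s[i])
    return prev_word == next_word


def not_boundary_edges_excluded(s, i):
    """Assumed model of the old behaviour: \\B never holds at either end."""
    return 0 < i < len(s) and not_boundary(s, i)


def insert_at(s, pred, mark="x"):
    """Insert mark at every position where pred holds, like gsub on an empty match."""
    out = []
    for i in range(len(s) + 1):
        if pred(s, i):
            out.append(mark)
        if i < len(s):
            out.append(s[i])
    return "".join(out)


print(repr(insert_at("  ", not_boundary_edges_excluded)))
print(repr(insert_at("  ", not_boundary)))
```

With the consistent rule every position between and around the two spaces qualifies; with the edge-excluding rule only the middle one does.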
--
* [Bug regex/693] mishandled '\B'
From: kasal at ucw dot cz @ 2005-01-31 5:40 UTC
To: glibc-bugs-regex
------- Additional Comments From kasal at ucw dot cz 2005-01-31 05:40 -------
(In reply to comment #6)
> print (" " ~ / \B /) # test dfa matcher
I tried the following:
$ gawk 'BEGIN{print "a b" ~ /\B/}'
0
$ gawk 'BEGIN{print " b" ~ /\B/}'
1
$ gawk 'BEGIN{print "a " ~ /\B/}'
1
This proves that the dfa matcher has the same opinion as the current CVS regex.
I'd conclude that the old regex contained a bug which was not discovered
until now, when it has been dead for more than 2 years.
(FWIW, I have also verified that perl has the same behaviour as PCRE and new
regex and dfa.c.)
The current regex.c seems OK. Thank you again, Jakub.
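For comparison, the same three subjects can be run through another engine that defines \B as the complement of \b. The sketch below uses Python's re module, which is an assumption of this example and is not one of the engines tested in the thread; the helper name has_not_boundary is hypothetical.

```python
import re


def has_not_boundary(subject):
    """Return 1 if \\B matches somewhere in subject, else 0 (mirrors the awk output)."""
    return int(re.search(r"\B", subject) is not None)


for subject in ("a b", " b", "a "):
    print(repr(subject), has_not_boundary(subject))
```

In "a b" every position lies between a word character and a non-word character or a string edge, so nothing qualifies. In the other two subjects the space adjacent to a string edge provides a position that is not a boundary.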
--
What |Removed |Added
----------------------------------------------------------------------------
Status|ASSIGNED |RESOLVED
Resolution| |FIXED
* [Bug regex/693] mishandled '\B'
From: arnold at skeeve dot com @ 2005-01-31 7:54 UTC
To: glibc-bugs-regex
------- Additional Comments From arnold at skeeve dot com 2005-01-31 07:53 -------
At this point, I too am inclined to stay with current CVS regex behavior.
To the best of my knowledge, Emacs still uses the old regex; it may or may not
be worthwhile mentioning this to RMS or whoever maintains Emacs. Then again,
it may also be best to let sleeping dogs lie. :-)
Thanks again to Jakub and everyone else. -- Arnold
--
* [Bug regex/693] mishandled '\B'
From: roland at gnu dot org @ 2005-02-16 4:19 UTC
To: glibc-bugs-regex
--
What |Removed |Added
----------------------------------------------------------------------------
OtherBugsDependingOnThis| |724
* [Bug regex/693] mishandled '\B'
From: cvs-commit at gcc dot gnu dot org @ 2005-02-16 11:09 UTC
To: glibc-bugs-regex
------- Additional Comments From cvs-commit at gcc dot gnu dot org 2005-02-16 11:09 -------
Subject: Bug 693
CVSROOT: /cvs/glibc
Module name: libc
Branch: glibc-2_3-branch
Changes by: roland@sources.redhat.com 2005-02-16 11:09:25
Modified files:
posix          : bug-regex19.c regcomp.c tst-rxspencer.c regex_internal.h
posix/rxspencer: tests
Log message:
2005-01-26 Jakub Jelinek <jakub@redhat.com>
[BZ #693]
* posix/regex_internal.h (DUMMY_CONSTRAINT): Rename to...
(WORD_DELIM_CONSTRAINT): ...this.
(NOT_WORD_DELIM_CONSTRAINT): Define.
(re_context_type): Add INSIDE_NOTWORD and NOT_WORD_DELIM,
change WORD_DELIM to use WORD_DELIM_CONSTRAINT.
* posix/regcomp.c (peek_token): For \B create NOT_WORD_DELIM
anchor instead of INSIDE_WORD.
(parse_expression): Handle NOT_WORD_DELIM constraint.
* posix/bug-regex19.c (tests): Adjust tests that relied on \B
being inside word instead of not word delim.
* posix/tst-rxspencer.c (mb_frob_pattern): Don't frob escaped
characters.
* posix/rxspencer/tests: Add some new tests.
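One plausible reading of the log message above is that each anchor is a set of constraints on the characters before and after a position, and that \B now expands to "inside a word or inside a non-word run" rather than "inside a word" only. The Python sketch below illustrates that reading. The names INSIDE_WORD, INSIDE_NOTWORD, WORD_DELIM and NOT_WORD_DELIM come from the log; the bit values, the Ctx class, the PREV_/NEXT_ flag names, the WORD_FIRST and WORD_LAST helpers, and the expansion into alternatives are assumptions that have not been checked against the patch, which is written in C.

```python
from enum import IntFlag


class Ctx(IntFlag):
    # Bit values are illustrative assumptions, not taken from regex_internal.h.
    PREV_WORD = 0x1
    PREV_NOTWORD = 0x2
    NEXT_WORD = 0x4
    NEXT_NOTWORD = 0x8


INSIDE_WORD = Ctx.PREV_WORD | Ctx.NEXT_WORD
INSIDE_NOTWORD = Ctx.PREV_NOTWORD | Ctx.NEXT_NOTWORD
WORD_FIRST = Ctx.PREV_NOTWORD | Ctx.NEXT_WORD
WORD_LAST = Ctx.PREV_WORD | Ctx.NEXT_NOTWORD

# Assumed expansion: each delimiter anchor is an alternation of two contexts.
WORD_DELIM = (WORD_FIRST, WORD_LAST)
NOT_WORD_DELIM = (INSIDE_WORD, INSIDE_NOTWORD)


def is_word(ch):
    return ch.isalnum() or ch == "_"


def context_at(s, i):
    """Describe position i; string edges count as non-word (an assumption)."""
    prev_word = i > 0 and is_word(s[i - 1])
    next_word = i < len(s) and is_word(s[i])
    ctx = Ctx.PREV_WORD if prev_word else Ctx.PREV_NOTWORD
    ctx |= Ctx.NEXT_WORD if next_word else Ctx.NEXT_NOTWORD
    return ctx


def anchor_matches(alternatives, s, i):
    """True if the context at i satisfies at least one alternative."""
    ctx = context_at(s, i)
    return any((ctx & alt) == alt for alt in alternatives)


def matching_positions(alternatives, s):
    return [i for i in range(len(s) + 1) if anchor_matches(alternatives, s, i)]


subject = "ab  "
print(matching_positions((INSIDE_WORD,), subject))   # \B as "inside word"
print(matching_positions(NOT_WORD_DELIM, subject))   # \B as "not a word delimiter"
print(matching_positions(WORD_DELIM, subject))       # \b
```

In this model the positions selected by NOT_WORD_DELIM and WORD_DELIM together cover every position exactly once, which is the "non-\b" property the bug report asked for.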
Patches:
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/bug-regex19.c.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.6&r2=1.6.4.1
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/regcomp.c.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.87&r2=1.87.2.1
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/tst-rxspencer.c.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.7&r2=1.7.4.1
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/regex_internal.h.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.56&r2=1.56.2.1
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/rxspencer/tests.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.5&r2=1.5.2.1
--
Thread overview: 12+ messages
2005-01-25 10:32 [Bug regex/693] New: mishandled '\B' kasal at ucw dot cz
2005-01-25 13:53 ` [Bug regex/693] " tee at sgi dot com
2005-01-25 14:08 ` jakub at redhat dot com
2005-01-25 17:11 ` karl at freefriends dot org
2005-01-26 13:44 ` arnold at skeeve dot com
2005-01-26 17:33 ` jakub at redhat dot com
2005-01-30 12:09 ` arnold at skeeve dot com
2005-01-30 13:33 ` paolo dot bonzini at lu dot unisi dot ch
2005-01-31 5:40 ` kasal at ucw dot cz
2005-01-31 7:54 ` arnold at skeeve dot com
2005-02-16 4:19 ` roland at gnu dot org
2005-02-16 11:09 ` cvs-commit at gcc dot gnu dot org