public inbox for glibc-bugs-regex@sourceware.org
* [Bug regex/693] New: mishandled '\B'
From: kasal at ucw dot cz @ 2005-01-25 10:32 UTC (permalink / raw)
  To: glibc-bugs-regex

The GNU extension '\B' has always meant non-\b.
The dfa.[ch] code included in grep and gawk still handles it this way.
Try echo '  ' | grep ' \B ' on your system, or try gsub(/ \B /,...) in gawk-3.1.1
or gawk '/ \B /' with current gawk.

I have checked the Perl documentation; it also defines '\B' as non-\b.

But the current regex has changed to interpret '\B' as the empty string inside
a word.  Try the above gsub with current awk.

See also http://lists.gnu.org/archive/html/bug-gnu-utils/2005-01/msg00087.html

IMHO, the current regex code is not correct.
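
The two readings of '\B' discussed in this report can be sketched as a small Python model. This is only an illustration, not the glibc or dfa.c code; the helper names, and the assumption that word characters are alphanumerics plus underscore, are made up for the example.

```python
def is_word(ch):
    # Assumption: a word character is an alphanumeric or an underscore.
    return ch.isalnum() or ch == "_"


def sides(s, i):
    # Word-ness of the characters left and right of position i;
    # the string edges count as non-word in this model.
    left = i > 0 and is_word(s[i - 1])
    right = i < len(s) and is_word(s[i])
    return left, right


def not_boundary(s):
    # Positions where '\B' matches if it means "not \b".
    return [i for i in range(len(s) + 1) if sides(s, i)[0] == sides(s, i)[1]]


def inside_word(s):
    # Positions where '\B' matches if it means "empty string within a word".
    return [i for i in range(len(s) + 1) if all(sides(s, i))]


print(not_boundary("  "))                      # [0, 1, 2]
print(inside_word("  "))                       # []
print(not_boundary("ab"), inside_word("ab"))   # [1] [1]
```

Under the first reading, ' \B ' can match two spaces because position 1 lies between them; under the second reading it cannot, which is the difference reported here.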

-- 
           Summary: mishandled '\B'
           Product: glibc
           Version: unspecified
            Status: NEW
          Severity: normal
          Priority: P2
         Component: regex
        AssignedTo: gotom at debian dot or dot jp
        ReportedBy: kasal at ucw dot cz
                CC: glibc-bugs-regex at sources dot redhat dot com,
                    glibc-bugs at sources dot redhat dot com


http://sources.redhat.com/bugzilla/show_bug.cgi?id=693

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.



* [Bug regex/693] mishandled '\B'
From: tee at sgi dot com @ 2005-01-25 13:53 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From tee at sgi dot com  2005-01-25 13:53 -------
If the '\B' functionality is fixed, then the man page will also need to be updated.
It currently says that '\B' "matches the empty string within a word."  Also, I
noticed that '\b' is not documented in the man page at all.

Thank you.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tee at sgi dot com



* [Bug regex/693] mishandled '\B'
From: jakub at redhat dot com @ 2005-01-25 14:08 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From jakub at redhat dot com  2005-01-25 14:08 -------
regex.texi:
@title Regex
@subtitle edition 0.12a
@subtitle 19 September 1992
@author Kathryn A. Hargreaves
@author Karl Berry

has:
@cindex @samp{\B}

This operator (represented by @samp{\B}) matches the empty string within
a word. For example, @samp{c\Brat\Be} matches @samp{crate}, but
@samp{dirty \Brat} doesn't match @samp{dirty rat}.

so to me it seems that the current glibc regex works as documented.
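
As a rough check, the manual's two examples can be run under both readings of '\B'. The sketch below uses Python's re module as a stand-in, not glibc regex: Python's own \B plays the "not \b" reading, and the lookaround pair (?<=\w)(?=\w) is an assumed emulation of "empty string within a word". The {B} placeholder and the helper name are invented for the example.

```python
import re

NOT_B = r"\B"                 # Python's re: "not \b"
INSIDE = r"(?<=\w)(?=\w)"     # emulation of "empty string within a word"


def matches(template, assertion, text):
    # Substitute the chosen assertion for the placeholder {B} and search.
    return re.search(template.replace("{B}", assertion), text) is not None


cases = [("c{B}rat{B}e", "crate"), ("dirty {B}rat", "dirty rat"), (" {B} ", "  ")]
for template, text in cases:
    print(repr(text), matches(template, NOT_B, text), matches(template, INSIDE, text))
# 'crate' True True
# 'dirty rat' False False
# '  ' True False
```

Both readings agree on the two examples from regex.texi; only a case such as ' \B ' against two spaces tells them apart.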


* [Bug regex/693] mishandled '\B'
From: karl at freefriends dot org @ 2005-01-25 17:11 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From karl at freefriends dot org  2005-01-25 17:11 -------
When Kathy and I wrote that description of \B more than a decade ago, we did not
have any deep reasoning behind it.  In fact, we never thought of the possibility
that "inverse of \b" and "empty string within a word" were two different things.
 We were just trying to give simple examples.  

The manual was never meant to be taken as gospel or a standard; we were trying
to describe how things worked rather than prescribe how things should work.
(It also desperately needs updating.)

So, I'm sorry that so much effort has gone into implementing our off-the-cuff
description of \B.  However, it seems to me that it would be better for users if
\B in the new regex had the same definition as it's always had -- not \b.  I
don't see any advantage to being incompatible with the past here; just the opposite.



* [Bug regex/693] mishandled '\B'
From: arnold at skeeve dot com @ 2005-01-26 13:44 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From arnold at skeeve dot com  2005-01-26 13:44 -------
I'll second Karl's motion here; the current regex should be fixed to work like
the old one did.  This brings dfa and regex back into line, which both grep
and gawk need, and provides backwards compatibility.  The manual can and probably
should be changed, although that's a separate issue.  The compatibility with
perl is also a welcome thing to have.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |arnold at skeeve dot com



* [Bug regex/693] mishandled '\B'
From: jakub at redhat dot com @ 2005-01-26 17:33 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From jakub at redhat dot com  2005-01-26 17:33 -------
Patch here: <http://sources.redhat.com/ml/libc-hacker/2005-01/msg00066.html>

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED



* [Bug regex/693] mishandled '\B'
From: arnold at skeeve dot com @ 2005-01-30 12:09 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From arnold at skeeve dot com  2005-01-30 12:09 -------
Hi. First, I appreciate the quick response on this issue.  I am thus
saddened to say that the behavior of the current CVS regex and the original
regex is not identical.  And to be honest, I'm not sure which is "correct".
Here's a script showing the difference:

$ cat typescript 
Script started on Sun Jan 30 14:03:31 2005
bash-2.05b$ cat gnureop2.awk 
BEGIN {
        print ("  " ~ / \B /)   # test dfa matcher
        a = "  "
        gsub(/\B/, "x", a)      # test regex matcher
        print a
}
bash-2.05b$ gawk-3.1.1 -f gnureop2.awk # old regex
1
 x 
bash-2.05b$ gawk-3.1.4 -f gnureop2.awk # previous glibc regex
1
  
bash-2.05b$ ./gawk -f gnureop2.awk      # current CVS glibc regex
1
x x x
bash-2.05b$ 
Script done on Sun Jan 30 14:04:26 2005



* [Bug regex/693] mishandled '\B'
From: paolo dot bonzini at lu dot unisi dot ch @ 2005-01-30 13:33 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From paolo dot bonzini at lu dot unisi dot ch  2005-01-30 13:33 -------
Subject: Re:  mishandled '\B'

> bash-2.05b$ gawk-3.1.1 -f gnureop2.awk # old regex
> 1
>  x 
> bash-2.05b$ ./gawk -f gnureop2.awk      # current CVS glibc regex
> 1
> x x x

It looks like the old regex special-cased the first and last character so 
that it was neither a word character nor a non-word character.  The 
current behavior is more consistent.  FWIW, PCRE also shows the same 
behavior as current CVS glibc regex.

Paolo
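
The edge-handling hypothesis above can be sketched as a toy model in Python. It is not the old or new regex code: the edges_match flag, the helper names, and the word-character test are assumptions made for the example. With the flag off, the start and end of the string never satisfy '\B'; with it on, the edges count as non-word.

```python
def is_word(ch):
    # Assumption: a word character is an alphanumeric or an underscore.
    return ch.isalnum() or ch == "_"


def match_positions(s, edges_match):
    # Positions where '\B' (read as "not \b") matches. With edges_match=False
    # the start and end of the string are skipped, which mimics the
    # special-casing suggested for the old regex.
    out = []
    for i in range(len(s) + 1):
        at_edge = i == 0 or i == len(s)
        if at_edge and not edges_match:
            continue
        left = i > 0 and is_word(s[i - 1])
        right = i < len(s) and is_word(s[i])
        if left == right:
            out.append(i)
    return out


def gsub_x(s, positions):
    # Insert "x" at each matching position, like gsub(/\B/, "x", a).
    return "".join(("x" if i in positions else "") + s[i:i + 1]
                   for i in range(len(s) + 1))


print(repr(gsub_x("  ", match_positions("  ", edges_match=False))))  # ' x '
print(repr(gsub_x("  ", match_positions("  ", edges_match=True))))   # 'x x x'
```

The two printed strings correspond to the gawk-3.1.1 and current-CVS lines of the quoted transcript, which is consistent with the special-casing explanation but does not prove it.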



* [Bug regex/693] mishandled '\B'
From: kasal at ucw dot cz @ 2005-01-31  5:40 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From kasal at ucw dot cz  2005-01-31 05:40 -------
(In reply to comment #6)
>         print ("  " ~ / \B /)   # test dfa matcher

I tried the following:
$ gawk 'BEGIN{print "a b" ~ /\B/}'
0
$ gawk 'BEGIN{print " b" ~ /\B/}'
1
$ gawk 'BEGIN{print "a " ~ /\B/}'
1

This proves that the dfa matcher has the same opinion as the current CVS regex.
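
For comparison, the same three strings can be tried against another engine that reads '\B' as "not \b". The sketch below uses Python's re module purely as a stand-in; the function name is made up for the example, and it says nothing about gawk or glibc internals.

```python
import re


def has_not_boundary(text):
    # 1 if '\B' matches anywhere in the text, else 0, as in the gawk runs.
    return int(re.search(r"\B", text) is not None)


for text in ["a b", " b", "a "]:
    print(repr(text), has_not_boundary(text))
# 'a b' 0
# ' b' 1
# 'a ' 1
```

In "a b" every position has a word character on exactly one side, so '\B' cannot match; in the other two strings the position between the space and the string edge qualifies.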

I'd like to conclude that the old regex contained a bug that was not discovered
until now, more than 2 years after it was retired.

(FWIW, I have also verified that perl has the same behaviour as PCRE and new
regex and dfa.c.)

The current regex.c seems OK.  Thank you again, Jakub.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED



* [Bug regex/693] mishandled '\B'
From: arnold at skeeve dot com @ 2005-01-31  7:54 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From arnold at skeeve dot com  2005-01-31 07:53 -------
At this point, I too am inclined to stay with current CVS regex behavior.
To the best of my knowledge, Emacs still uses the old regex; it may or may not
be worthwhile mentioning this to RMS or whoever maintains Emacs.  Then again,
it may also be best to let sleeping dogs lie. :-)

Thanks again to Jakub and everyone else. -- Arnold



* [Bug regex/693] mishandled '\B'
From: roland at gnu dot org @ 2005-02-16  4:19 UTC (permalink / raw)
  To: glibc-bugs-regex



-- 
           What             |Removed                     |Added
-------------------------------------------------------------------------------------
    OtherBugsDependingOnThis|                            |724



* [Bug regex/693] mishandled '\B'
From: cvs-commit at gcc dot gnu dot org @ 2005-02-16 11:09 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From cvs-commit at gcc dot gnu dot org  2005-02-16 11:09 -------
Subject: Bug 693

CVSROOT:	/cvs/glibc
Module name:	libc
Branch: 	glibc-2_3-branch
Changes by:	roland@sources.redhat.com	2005-02-16 11:09:25

Modified files:
	posix          : bug-regex19.c regcomp.c tst-rxspencer.c 
	                 regex_internal.h 
	posix/rxspencer: tests 

Log message:
	2005-01-26  Jakub Jelinek  <jakub@redhat.com>
	
	[BZ #693]
	* posix/regex_internal.h (DUMMY_CONSTRAINT): Rename to...
	(WORD_DELIM_CONSTRAINT): ...this.
	(NOT_WORD_DELIM_CONSTRAINT): Define.
	(re_context_type): Add INSIDE_NOTWORD and NOT_WORD_DELIM,
	change WORD_DELIM to use WORD_DELIM_CONSTRAINT.
	* posix/regcomp.c (peek_token): For \B create NOT_WORD_DELIM
	anchor instead of INSIDE_WORD.
	(parse_expression): Handle NOT_WORD_DELIM constraint.
	* posix/bug-regex19.c (tests): Adjust tests that relied on \B
	being inside word instead of not word delim.
	* posix/tst-rxspencer.c (mb_frob_pattern): Don't frob escaped
	characters.
	* posix/rxspencer/tests: Add some new tests.

Patches:
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/bug-regex19.c.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.6&r2=1.6.4.1
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/regcomp.c.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.87&r2=1.87.2.1
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/tst-rxspencer.c.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.7&r2=1.7.4.1
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/regex_internal.h.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.56&r2=1.56.2.1
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/posix/rxspencer/tests.diff?cvsroot=glibc&only_with_tag=glibc-2_3-branch&r1=1.5&r2=1.5.2.1


