public inbox for cygwin@cygwin.com
* regex library fails git tests
@ 2013-07-20 21:26 Mark Levedahl
  2013-07-22  2:59 ` Corinna Vinschen
  0 siblings, 1 reply; 6+ messages in thread
From: Mark Levedahl @ 2013-07-20 21:26 UTC (permalink / raw)
  To: cygwin

Current git fails two sets of tests on Cygwin, apparently because of
problems in the regex library. One set of tests does language-based
word matching and has a common failure during regex compilation. The
suffix clause ("|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+") is common to all
of these; removing that clause eliminates the regcomp failure.

A test case extracted from the git sources is below - this works 
correctly on Fedora 18, fails on Cygwin:

$ gcc test-regex.c
$ ./a.out
failed regcomp() for pattern '[^<>=     ]+|[^[:space:]]|[▒-▒][▒-▒]+'

The failure disappears when the suffix clause is removed from pat_html.

This is happening on a current installation:
$ uname -a
CYGWIN_NT-5.1 virt-winxp 1.7.21(0.267/5/3) 2013-07-15 12:17 i686 Cygwin
$ cygcheck -c gcc-core gcc-g++
Cygwin Package Information
Package              Version        Status
gcc-core             4.7.3-1        OK
gcc-g++              4.7.3-1        OK

------------

#include <regex.h>
#include <stdio.h>

int main(int argc, char **argv)
{
	/* Word-matching pattern from git; the last alternative uses raw bytes >= 0x80. */
	const char *pat_html = "[^<>= \t]+"
		"|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+";
	const char *str = "={}\nfred";
	regex_t r;
	regmatch_t m[1];

	/* Compilation is the step that fails on Cygwin. */
	if (regcomp(&r, pat_html, REG_EXTENDED | REG_NEWLINE)) {
		printf("failed regcomp() for pattern '%s'\n", pat_html);
		return 1;
	}
	if (regexec(&r, str, 1, m, 0)) {
		printf("no match of pattern '%s' to string '%s'\n",
			   pat_html, str);
		regfree(&r);
		return 1;
	}
	regfree(&r);
	return 0;
}

Mark


--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


* Re: regex library fails git tests
  2013-07-20 21:26 regex library fails git tests Mark Levedahl
@ 2013-07-22  2:59 ` Corinna Vinschen
  2013-07-22  8:07   ` Mark Levedahl
  0 siblings, 1 reply; 6+ messages in thread
From: Corinna Vinschen @ 2013-07-22  2:59 UTC (permalink / raw)
  To: cygwin

On Jul 20 15:52, Mark Levedahl wrote:
> Current git fails two sets of tests on Cygwin, apparently because of
> problems in the regex library. One set of tests does language-based
> word matching and has a common failure during regex compilation.
> The suffix clause ("|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+") is
> common to all of these; removing that clause eliminates the regcomp
> failure.
> 
> A test case extracted from the git sources is below - this works
> correctly on Fedora 18, fails on Cygwin:
> 
> $ gcc test-regex.c
> $ ./a.out
> failed regcomp() for pattern '[^<>=     ]+|[^[:space:]]|[▒-▒][▒-▒]+'
> 
> The failure disappears when the suffix clause is removed from pat_html.
> 
> This is happening on a current installation:
> $ uname -a
> CYGWIN_NT-5.1 virt-winxp 1.7.21(0.267/5/3) 2013-07-15 12:17 i686 Cygwin

Thanks for the testcase.  The problem is this:  Cygwin's regex is taken
from FreeBSD, so it's not identical to the glibc implementation on Linux.
The FreeBSD implementation converts all input chars to wchar_t and then
handles everything, the pattern as well as the input string, in wchar_t
to be locale- and codeset-independent.

Your application does not call setlocale, so the locale is "C" or "POSIX"
and the codeset is ANSI_X3.4-1968 (aka ASCII).  The conversion to wchar_t
is performed by calling the mbrtowc function.  This function behaves on
Cygwin the same as on Linux:  If the current locale's codeset is ASCII,
and if the input character is >= 0x80, mbrtowc returns -1 with errno set
to EILSEQ.

This happens on Cygwin.  The regcomp routine converting the input string
to wchar_t calls mbrtowc, and mbrtowc returns -1 (EILSEQ) because the
input character is >= 0x80 in the bracket expression.

Even though the mbrtowc functions behave the same in Cygwin and glibc,
the glibc implementation of regcomp apparently does not call mbrtowc
under all circumstances, namely not in the "C"/"POSIX" locale or if the
locale's codeset is ASCII.  Therefore it does not treat the chars >= 0x80
as invalid characters.

So, what I did now was this:  I added a workaround to Cygwin's regcomp.
If the current codeset is ASCII, the characters in the pattern are
converted to wchar_t by simply using their unsigned value verbatim.
This allows the patterns in the git testcases to be compiled (and tested).

However, please note that this behaviour, while being provided by glibc
and now by Cygwin, is *not* standards-compliant.  In the narrow sense
the characters beyond 0x7f are still invalid ASCII chars, and other
functions working with wchar_t strings won't be as forgiving when using
invalid input.


HTH,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat


* Re: regex library fails git tests
  2013-07-22  2:59 ` Corinna Vinschen
@ 2013-07-22  8:07   ` Mark Levedahl
  2013-07-22 10:32     ` Corinna Vinschen
  0 siblings, 1 reply; 6+ messages in thread
From: Mark Levedahl @ 2013-07-22  8:07 UTC (permalink / raw)
  To: cygwin

On 07/21/2013 03:39 PM, Corinna Vinschen wrote:

> So, what I did now was this:  I added a workaround to Cygwin's regcomp.
> If the current codeset is ASCII, the characters in the pattern are
> converted to wchar_t by simply using their unsigned value verbatim.
> This allows the patterns in the git testcases to be compiled (and tested).
>
> However, please note that this behaviour, while being provided by glibc
> and now by Cygwin, is *not* standards-compliant.  In the narrow sense
> the characters beyond 0x7f are still invalid ASCII chars, and other
> functions working with wchar_t strings won't be as forgiving when using
> invalid input.
>
>
> HTH,
> Corinna
>

Thank you. I confirm that git passes the two test cases (t4018 and 
t4034) using today's snapshot. I will pass your comments about use of 
characters 0x80 and above to the git list to see if they wish to change 
anything.

Mark




* Re: regex library fails git tests
  2013-07-22  8:07   ` Mark Levedahl
@ 2013-07-22 10:32     ` Corinna Vinschen
  2013-07-22 19:19       ` Eric Blake
  0 siblings, 1 reply; 6+ messages in thread
From: Corinna Vinschen @ 2013-07-22 10:32 UTC (permalink / raw)
  To: cygwin

On Jul 21 22:59, Mark Levedahl wrote:
> On 07/21/2013 03:39 PM, Corinna Vinschen wrote:
> 
> >So, what I did now was this:  I added a workaround to Cygwin's regcomp.
> >If the current codeset is ASCII, the characters in the pattern are
> >converted to wchar_t by simply using their unsigned value verbatim.
> >This allows the patterns in the git testcases to be compiled (and tested).
> >
> >However, please note that this behaviour, while being provided by glibc
> >and now by Cygwin, is *not* standards-compliant.  In the narrow sense
> >the characters beyond 0x7f are still invalid ASCII chars, and other
> >functions working with wchar_t strings won't be as forgiving when using
> >invalid input.
> >
> >
> >HTH,
> >Corinna
> >
> 
> Thank you. I confirm that git passes the two test cases (t4018 and
> t4034) using today's snapshot.

Thanks for your feedback and for testing the snapshot.  I created the
snapshots yesterday but then forgot to mention them here.

> I will pass your comments about use
> of characters 0x80 and above to the git list to see if they wish to
> change anything.

After some sleep, I think I now understand why the glibc devs made
regcomp work this way.  This behaviour is backward compatible with
non-locale-aware applications.  In the "C" locale, a char is just some
arbitrary byte between 0 and 255.  So this pattern always worked before
in the "C" locale, therefore it makes sense that it continues to work,
even if it won't when using other locales/codesets.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat


* Re: regex library fails git tests
  2013-07-22 10:32     ` Corinna Vinschen
@ 2013-07-22 19:19       ` Eric Blake
  2013-07-22 21:26         ` Corinna Vinschen
  0 siblings, 1 reply; 6+ messages in thread
From: Eric Blake @ 2013-07-22 19:19 UTC (permalink / raw)
  To: cygwin


On 07/22/2013 02:12 AM, Corinna Vinschen wrote:

>>> However, please note that this behaviour, while being provided by glibc
>>> and now by Cygwin, is *not* standards-compliant.  In the narrow sense
>>> the characters beyond 0x7f are still invalid ASCII chars, and other
>>> functions working with wchar_t strings won't be as forgiving when using
>>> invalid input.
>>>

> After some sleep, I think I now understand why the glibc devs made
> regcomp work this way.  This behaviour is backward compatible with
> non-locale-aware applications.  In the "C" locale, a char is just some
> arbitrary byte between 0 and 255.  So this pattern always worked before
> in the "C" locale, therefore it makes sense that it continues to work,
> even if it won't when using other locales/codesets.

By the way, there is currently a big debate going on in the Austin Group
(the people responsible for POSIX) on whether the "C" locale must be
8-bit clean (the way glibc behaves) or whether it was intended to allow
UTF-8 encoding by default (the way musl libc wants to behave); and
resolution of the debate will require input from the C standards
committee.  There may be some interesting fallout, no matter which
solution is finally reached.  http://austingroupbugs.net/view.php?id=663

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



* Re: regex library fails git tests
  2013-07-22 19:19       ` Eric Blake
@ 2013-07-22 21:26         ` Corinna Vinschen
  0 siblings, 0 replies; 6+ messages in thread
From: Corinna Vinschen @ 2013-07-22 21:26 UTC (permalink / raw)
  To: cygwin

On Jul 22 11:17, Eric Blake wrote:
> On 07/22/2013 02:12 AM, Corinna Vinschen wrote:
> 
> >>> However, please note that this behaviour, while being provided by glibc
> >>> and now by Cygwin, is *not* standards-compliant.  In the narrow sense
> >>> the characters beyond 0x7f are still invalid ASCII chars, and other
> >>> functions working with wchar_t strings won't be as forgiving when using
> >>> invalid input.
> >>>
> 
> > After some sleep, I think I now understand why the glibc devs made
> > regcomp work this way.  This behaviour is backward compatible with
> > non-locale-aware applications.  In the "C" locale, a char is just some
> > arbitrary byte between 0 and 255.  So this pattern always worked before
> > in the "C" locale, therefore it makes sense that it continues to work,
> > even if it won't when using other locales/codesets.
> 
> By the way, there is currently a big debate going on in the Austin Group
> (the people responsible for POSIX) on whether the "C" locale must be
> 8-bit clean (the way glibc behaves) or whether it was intended to allow
> UTF-8 encoding by default (the way musl libc wants to behave); and
> resolution of the debate will require input from the C standards
> committee.  There may be some interesting fallout, no matter which
> solution is finally reached.  http://austingroupbugs.net/view.php?id=663

Thanks for letting us know.  This really may get interesting...


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat

