public inbox for gcc-bugs@sourceware.org
* [Bug c++/63776] New: [C++11] Regex collate matching not working
@ 2014-11-07 18:39 gnu-org at bignm dot com
  2014-11-08  8:35 ` [Bug libstdc++/63776] " timshen at gcc dot gnu.org
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: gnu-org at bignm dot com @ 2014-11-07 18:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63776

            Bug ID: 63776
           Summary: [C++11] Regex collate matching not working
           Product: gcc
           Version: 4.9.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gnu-org at bignm dot com

Created attachment 33919
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33919&action=edit
Full test program source code

The locale has been set to "pt_BR.UTF-8" and the following String should match
the Regexp passed to the group_regexp() function. It seems as if the collation
matching is not working.

String = "João Méroço" <email@isp.com>
DEBUG: group_regexp(): Using 'c' flag
DEBUG: group_regexp(): Using 'o' flag
DEBUG: group_regexp(): Match Failed!
group_regexp('/("[[:alpha:][:space:]]+")/co') returned failure!

Full test program source code is attached below.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug libstdc++/63776] [C++11] Regex collate matching not working
  2014-11-07 18:39 [Bug c++/63776] New: [C++11] Regex collate matching not working gnu-org at bignm dot com
@ 2014-11-08  8:35 ` timshen at gcc dot gnu.org
  2014-11-08 10:56 ` gnu-org at bignm dot com
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: timshen at gcc dot gnu.org @ 2014-11-08  8:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63776

Tim Shen <timshen at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |timshen at gcc dot gnu.org

--- Comment #1 from Tim Shen <timshen at gcc dot gnu.org> ---
    std::string s = "\"João Méroço\" <email@isp.com>";

...is not what you want. It is not a decoded Unicode string, but just "a
sequence of bytes":

    std::string s = "\"João Méroço\" <email@isp.com>";
    for (const auto& it : s) {
        std::cout << it << "\n";
    }
...prints:
"
J
o

�
o

M

�
r
o

�
o
"

<
e
m
a
i
l
@
i
s
p
.
c
o
m
>

I don't know much about the status of Unicode support in the standard library.
@Jon, can you comment?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug libstdc++/63776] [C++11] Regex collate matching not working
  2014-11-07 18:39 [Bug c++/63776] New: [C++11] Regex collate matching not working gnu-org at bignm dot com
  2014-11-08  8:35 ` [Bug libstdc++/63776] " timshen at gcc dot gnu.org
@ 2014-11-08 10:56 ` gnu-org at bignm dot com
  2014-11-10 18:26 ` redi at gcc dot gnu.org
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: gnu-org at bignm dot com @ 2014-11-08 10:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63776

--- Comment #3 from Tom Straub <gnu-org at bignm dot com> ---
Hi Tim,

OOPS! The versions used were Boost REGEX 1.55.0 and ICU 52. Got the versions
mixed up in my head.

Tom


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug libstdc++/63776] [C++11] Regex collate matching not working
  2014-11-07 18:39 [Bug c++/63776] New: [C++11] Regex collate matching not working gnu-org at bignm dot com
  2014-11-08  8:35 ` [Bug libstdc++/63776] " timshen at gcc dot gnu.org
  2014-11-08 10:56 ` gnu-org at bignm dot com
@ 2014-11-10 18:26 ` redi at gcc dot gnu.org
  2015-01-20 11:05 ` gnu-org at bignm dot com
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: redi at gcc dot gnu.org @ 2014-11-10 18:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63776

--- Comment #4 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Tom Straub from comment #2)
> Hi Tim,
> 
> Okay, a program very similar to this using the Boost REGEX library and ICU
> 4.55 works just fine with this.
> 
> According to my understanding, the "char" data type and "std::string"
> classes were specifically set up in C++11 to handle UTF-8 sequences.

Nothing was done to char or std::string because they can already hold UTF-8
data. The new u8 literal prefix was added to produce UTF8-encoded string
literals.

> The "sequence of bytes" are actually valid UNICODE characters.

Right, and if your source character set wasn't UTF-8 you could still initialize
the std::string correctly with e.g.

  std::string s = u8"Jo\u00e3o M\u00e9ro\u00e7o";

> It is acting as if it is still in POSIX or C locale, since it doesn't
> recognize the accented characters as "[:alpha:]" class.

Yes, I don't know how it's supposed to work in <regex> but I agree there seems
to be a bug.

(In reply to Tim Shen from comment #1)
> I don't know much about unicode support status in the standard library.
> @Jon, can you put a comment?

We're missing the facilities for converting between different unicode encodings
which would allow us to convert multibyte char strings to wchar_t so that we
can use the ctype<wchar_t>::is(ctype_base::mask, wchar_t) function to match
non-ASCII characters.

Maybe a workaround for now would be to detect when we reach the first byte of a
UTF-8 character, read the rest of the multibyte character into a wchar_t and
use ctype<wchar_t> for that.

Or just wait for the rest of the ctype and codecvt features to be implemented.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug libstdc++/63776] [C++11] Regex collate matching not working
  2014-11-07 18:39 [Bug c++/63776] New: [C++11] Regex collate matching not working gnu-org at bignm dot com
                   ` (2 preceding siblings ...)
  2014-11-10 18:26 ` redi at gcc dot gnu.org
@ 2015-01-20 11:05 ` gnu-org at bignm dot com
  2015-01-20 11:10 ` gnu-org at bignm dot com
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: gnu-org at bignm dot com @ 2015-01-20 11:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63776

--- Comment #5 from Tom Straub <gnu-org at bignm dot com> ---
Created attachment 34497
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=34497&action=edit
Test Program for UTF-8 CPP library


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug libstdc++/63776] [C++11] Regex collate matching not working
  2014-11-07 18:39 [Bug c++/63776] New: [C++11] Regex collate matching not working gnu-org at bignm dot com
                   ` (3 preceding siblings ...)
  2015-01-20 11:05 ` gnu-org at bignm dot com
@ 2015-01-20 11:10 ` gnu-org at bignm dot com
  2015-01-20 13:08 ` redi at gcc dot gnu.org
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: gnu-org at bignm dot com @ 2015-01-20 11:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63776

--- Comment #6 from Tom Straub <gnu-org at bignm dot com> ---
Hi Tim,

After banging my head against the wall looking for a solution to the C++11
UTF-8 support problems, I finally found what seems to be a great addition to my
project, which I think might be beneficial to GNU as well.

It is the UTF-8 CPP code. It is implemented as a header file, so there are no
linking issues involved. It seems to add all the functionality needed that
seems to be missing in GCC currently.

The project can be found at:  http://sourceforge.net/projects/utfcpp/

It would certainly make implementation of the codecvt stuff less painful.

Best, Tom


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug libstdc++/63776] [C++11] Regex collate matching not working
  2014-11-07 18:39 [Bug c++/63776] New: [C++11] Regex collate matching not working gnu-org at bignm dot com
                   ` (4 preceding siblings ...)
  2015-01-20 11:10 ` gnu-org at bignm dot com
@ 2015-01-20 13:08 ` redi at gcc dot gnu.org
  2015-02-06  6:55 ` timshen at gcc dot gnu.org
  2015-03-09  6:49 ` timshen at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: redi at gcc dot gnu.org @ 2015-01-20 13:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63776

--- Comment #7 from Jonathan Wakely <redi at gcc dot gnu.org> ---
The codecvt stuff was implemented last week. Probably incorrectly, because I
didn't really know what I was doing.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug libstdc++/63776] [C++11] Regex collate matching not working
  2014-11-07 18:39 [Bug c++/63776] New: [C++11] Regex collate matching not working gnu-org at bignm dot com
                   ` (5 preceding siblings ...)
  2015-01-20 13:08 ` redi at gcc dot gnu.org
@ 2015-02-06  6:55 ` timshen at gcc dot gnu.org
  2015-03-09  6:49 ` timshen at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: timshen at gcc dot gnu.org @ 2015-02-06  6:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63776

--- Comment #8 from Tim Shen <timshen at gcc dot gnu.org> ---
I'm not sure how you call boost::regex in your code, here's what I did:

// g++ b.cc -lboost_regex -licuuc
#include <boost/regex/icu.hpp>
#include <boost/regex.hpp>
#include <iostream>
#include <string>
using namespace boost;

int main() {
    std::locale loc("en_US.UTF-8");
    std::string s(u8"Ī");
    u32regex re = make_u32regex("[[:alpha:]]");
    std::cout << u32regex_match(s.data(), s.data() + s.size(), re) << "\n";
    return 0;
}


If this is the way that we do UTF-8 matching using boost, then I don't think
std::regex_match and boost::u32regex_match (notice that it's not
boost::regex_match) have the same semantics.

A user who uses boost::u32regex_match explicitly tells the library "I want a
Unicode match here; here's my regex object, of type u32regex; please do the
decoding and matching for me", and u32regex is actually
boost::basic_regex< ::UChar32, icu_regex_traits> with a library-defined
regex_traits. u32regex_match, on the other hand, takes no user-defined
regex_traits type, but u32regex only.

I don't think std::regex_match<BiIter, Alloc, char, RegexTraits> should care
about decoding a char string to wchar_t string and call
std::regex_match<AnotherBiIter, AnotherAlloc, wchar_t,
std::regex_traits<wchar_t>>, leaving user defined RegexTraits potentially
unused.

Instead, the user can manually decode the UTF-8 string (I'm sad we don't have a
standard char iterator adaptor which converts a UTF-8 char iterator to a
char32_t iterator) and call std::regex_match<..., wchar_t, ...>.

This is my understanding, so it's certainly possible that I'm missing something.

Thoughts?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug libstdc++/63776] [C++11] Regex collate matching not working
  2014-11-07 18:39 [Bug c++/63776] New: [C++11] Regex collate matching not working gnu-org at bignm dot com
                   ` (6 preceding siblings ...)
  2015-02-06  6:55 ` timshen at gcc dot gnu.org
@ 2015-03-09  6:49 ` timshen at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: timshen at gcc dot gnu.org @ 2015-03-09  6:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63776

--- Comment #9 from Tim Shen <timshen at gcc dot gnu.org> ---
Ping.


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2015-03-09  6:49 UTC | newest]
