From mboxrd@z Thu Jan 1 00:00:00 1970
Received: (qmail 9630 invoked by alias); 10 Nov 2014 18:26:15 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
Sender: gcc-bugs-owner@gcc.gnu.org
Received: (qmail 9587 invoked by uid 48); 10 Nov 2014 18:26:11 -0000
From: "redi at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug libstdc++/63776] [C++11] Regex collate matching not working
Date: Mon, 10 Nov 2014 18:26:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: libstdc++
X-Bugzilla-Version: 4.9.1
X-Bugzilla-Severity: normal
X-Bugzilla-Who: redi at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2014-11/txt/msg00756.txt.bz2

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63776

--- Comment #4 from Jonathan Wakely ---
(In reply to Tom Straub from comment #2)
> Hi Tim,
>
> Okay, a program very similar to this using the Boost REGEX library and ICU
> 4.55 works just fine with this.
>
> According to my understanding, the "char" data type and "std::string"
> classes were specifically set up in C++11 to handle UTF-8 sequences.

Nothing was done to char or std::string because they can already hold UTF-8
data. The new u8 literal prefix was added to produce UTF8-encoded string
literals.

> The "sequence of bytes" are actually valid UNICODE characters.

Right, and if your source character set wasn't UTF-8 you could still initialize
the std::string correctly with e.g.
std::string s = u8"Jo\u00e3o M\u00e9ro\u00e7o";

> It is acting as if it is still in POSIX or C locale, since it doesn't
> recognize the accented characters as "[:alpha:]" class.

Yes, I don't know how it's supposed to work in <regex> but I agree there seems
to be a bug.

(In reply to Tim Shen from comment #1)
> I don't know much about unicode support status in the standard library.
> @Jon, can you put a comment?

We're missing the facilities for converting between different unicode encodings
which would allow us to convert multibyte char strings to wchar_t so that we
can use the ctype<wchar_t>::is(ctype_base::mask, wchar_t) function to match
non-ASCII characters.

Maybe a workaround for now would be to detect when we reach the first byte of a
UTF-8 character, read the rest of the multibyte character into a wchar_t and
use ctype<wchar_t> for that. Or just wait for the rest of the ctype and codecvt
features to be implemented.
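
The workaround described above (spot the lead byte of a UTF-8 sequence, gather
the whole multibyte character into a wchar_t, and classify it with
ctype<wchar_t>) could be sketched roughly as follows. This is only an
illustration, not libstdc++ code: decode_utf8 and all_alpha are hypothetical
names, the decoder does not reject overlong encodings or surrogates, and it
assumes wchar_t is wide enough to hold the code point (32 bits, as on
GNU/Linux). Whether a non-ASCII character is reported as alpha depends entirely
on the locale passed in.

```cpp
#include <cassert>
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical helper: decode the UTF-8 sequence that starts at s[i].
// On success, stores the code point in cp, advances i past the sequence and
// returns true. On truncated or malformed input, returns false and leaves i
// unchanged. Overlong forms and surrogates are NOT rejected (sketch only).
bool decode_utf8(const std::string& s, std::size_t& i, char32_t& cp)
{
    if (i >= s.size())
        return false;

    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    if (lead < 0x80)                { cp = lead;        len = 1; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
    else
        return false;               // stray continuation byte or invalid lead

    if (i + len > s.size())
        return false;               // sequence is cut off

    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char c = static_cast<unsigned char>(s[i + k]);
        if ((c & 0xC0) != 0x80)
            return false;           // not a continuation byte
        cp = (cp << 6) | (c & 0x3F);
    }
    i += len;
    return true;
}

// Hypothetical helper: true if every character of the UTF-8 string is
// classified as alpha by the ctype<wchar_t> facet of loc.
// Assumes wchar_t can represent every decoded code point.
bool all_alpha(const std::string& utf8, const std::locale& loc)
{
    const auto& ct = std::use_facet<std::ctype<wchar_t>>(loc);
    std::size_t i = 0;
    while (i < utf8.size()) {
        char32_t cp = 0;
        if (!decode_utf8(utf8, i, cp))
            return false;
        assert(i <= utf8.size());   // the decoder never runs past the end
        if (!ct.is(std::ctype_base::alpha, static_cast<wchar_t>(cp)))
            return false;
    }
    return true;
}
```

With a UTF-8 locale that is installed on the system, for example
std::locale("en_US.UTF-8"), all_alpha would be expected to accept a string
such as the one in the example above minus the space, whereas the classic "C"
locale may well reject the accented characters.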