From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 51922 invoked by alias); 9 Jun 2015 23:10:41 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id:
List-Archive:
List-Post:
List-Help:
Sender: gcc-bugs-owner@gcc.gnu.org
Received: (qmail 51865 invoked by uid 48); 9 Jun 2015 23:10:36 -0000
From: "lcarreon at bigpond dot net.au"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value
Date: Tue, 09 Jun 2015 23:10:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: libstdc++
X-Bugzilla-Version: 5.1.1
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: lcarreon at bigpond dot net.au
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: redi at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2015-06/txt/msg00865.txt.bz2

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464

--- Comment #2 from Leo Carreon ---
Just clarifying that my comments are to do with codecvt_utf16 and codecvt_utf8.

The way I understand it, codecvt_utf16 should be converting between UTF-16 and UCS-4. UTF-16 uses 2 bytes for characters in the BMP (characters in the range 0x0000 to 0xFFFF) and 4 bytes (surrogate pairs) for characters above the BMP (0x010000 to 0x10FFFF). UCS-4 uses 4-byte values. Therefore, codecvt_utf16::max_length() should be returning 4 if the BOM is not taken into account.

codecvt_utf8 converts between UTF-8 and UCS-4. UTF-8 can use up to 4 bytes for characters up to the range 0x10FFFF. Therefore, codecvt_utf8::max_length() should be returning 4 if the BOM is not taken into account.
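To make the byte counts above concrete, here is a small sketch of the per-character sizes. The helper names utf16_bytes and utf8_bytes are made up for illustration and are not part of libstdc++; the code also assumes the input is a valid code point (no surrogates, nothing above 0x10FFFF) and ignores the BOM.

```cpp
#include <cstddef>

// Hypothetical helper: bytes needed to encode one code point in UTF-16.
// Assumes cp is a valid Unicode scalar value; BOM not counted.
constexpr std::size_t utf16_bytes(char32_t cp) {
    // BMP characters take one 16-bit unit, everything above takes a
    // surrogate pair (two 16-bit units).
    return cp <= 0xFFFF ? 2 : 4;
}

// Hypothetical helper: bytes needed to encode one code point in UTF-8.
// Assumes cp is a valid Unicode scalar value; BOM not counted.
constexpr std::size_t utf8_bytes(char32_t cp) {
    return cp <= 0x7F ? 1 : cp <= 0x7FF ? 2 : cp <= 0xFFFF ? 3 : 4;
}

// The worst case for both encodings is 4 bytes per UCS-4 character.
static_assert(utf16_bytes(0x0041) == 2, "BMP character: 2 bytes in UTF-16");
static_assert(utf16_bytes(0x10FFFF) == 4, "above BMP: 4 bytes in UTF-16");
static_assert(utf8_bytes(0x10FFFF) == 4, "highest code point: 4 bytes in UTF-8");
```

Under those assumptions neither function ever returns more than 4, which is the value argued for above.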

As I said in my previous post, I'm not sure if the BOM should be accounted for in max_length(). If I'm not mistaken, the purpose of this function is to allow a user to estimate how many bytes are required to fit a UCS-4 string when converted to either UTF-16 or UTF-8. My guess is that the BOM can be taken into account separately when doing the estimation.

For example, when wstring_convert estimates the length of the std::string to be generated by wstring_convert::to_bytes(), it should take the number of UCS-4 characters multiplied by max_length() and then add the size of the BOM if required. The resulting std::string can be resized after the conversion to eliminate the unused bytes.

Note that the comment you mentioned in your reply probably only applies to codecvt_utf8_utf16, which converts between UTF-8 and UTF-16 directly without going through the UCS-4 conversion.
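
As a rough illustration of the estimate-then-resize idea (characters times max_length(), plus the BOM if wanted), here is a hand-written UTF-8 sketch. It is not how wstring_convert is implemented in libstdc++; kMaxLength, kBomSize, estimate_bytes and to_utf8 are names invented for this example, kMaxLength stands in for the max_length() value proposed above, and the input is assumed to contain only valid code points.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption: worst-case bytes per UCS-4 character, BOM excluded.
constexpr std::size_t kMaxLength = 4;
// Size of the UTF-8 BOM (EF BB BF).
constexpr std::size_t kBomSize = 3;

// Upper bound on the output size: characters * max_length, plus the BOM.
inline std::size_t estimate_bytes(std::size_t n_chars, bool with_bom) {
    return n_chars * kMaxLength + (with_bom ? kBomSize : 0);
}

// Hypothetical converter: allocate the estimate, encode, then shrink.
inline std::string to_utf8(const std::u32string& in, bool with_bom) {
    std::string out(estimate_bytes(in.size(), with_bom), '\0');
    std::size_t n = 0;
    if (with_bom) {
        out[n++] = '\xEF';
        out[n++] = '\xBB';
        out[n++] = '\xBF';
    }
    for (char32_t cp : in) {
        if (cp <= 0x7F) {
            out[n++] = static_cast<char>(cp);
        } else if (cp <= 0x7FF) {
            out[n++] = static_cast<char>(0xC0 | (cp >> 6));
            out[n++] = static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp <= 0xFFFF) {
            out[n++] = static_cast<char>(0xE0 | (cp >> 12));
            out[n++] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out[n++] = static_cast<char>(0xF0 | (cp >> 18));
            out[n++] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out[n++] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    assert(n <= out.size());  // the estimate must never be too small
    out.resize(n);            // drop the unused bytes
    return out;
}
```

For instance, to_utf8(U"A", true) would start from an estimate of 7 bytes and end up with 4 after the resize (3 for the BOM, 1 for the character).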