From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugs-return-488533-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 51922 invoked by alias); 9 Jun 2015 23:10:41 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
Received: (qmail 51865 invoked by uid 48); 9 Jun 2015 23:10:36 -0000
From: "lcarreon at bigpond dot net.au" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value
Date: Tue, 09 Jun 2015 23:10:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: libstdc++
X-Bugzilla-Version: 5.1.1
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: lcarreon at bigpond dot net.au
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: redi at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID: <bug-66464-4-O0LkqzyDuT@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-66464-4@http.gcc.gnu.org/bugzilla/>
References: <bug-66464-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2015-06/txt/msg00865.txt.bz2

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464
--- Comment #2 from Leo Carreon <lcarreon at bigpond dot net.au> ---
Just clarifying that my comments concern codecvt_utf16<char32_t> and
codecvt_utf8<char32_t>.

The way I understand it, codecvt_utf16<char32_t> should convert between
UTF-16 and UCS-4.  UTF-16 uses 2 bytes for characters in the BMP (characters in
the range 0x0000 to 0xFFFF) and 4 bytes (a surrogate pair) for characters above
the BMP (0x010000 to 0x10FFFF).  UCS-4 uses 4-byte values.  Therefore,
codecvt_utf16<char32_t>::max_length() should return 4 if the BOM is not
taken into account.
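
To make the UTF-16 arithmetic above concrete, here is a minimal sketch.  The
helper name utf16_bytes is hypothetical (it is not part of libstdc++), and it
assumes the input is a valid code point and that no BOM is counted.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not a libstdc++ function: number of bytes needed to
// encode a single code point in UTF-16, ignoring any BOM.
// Assumes cp is a valid code point (at most 0x10FFFF); surrogate values
// are not rejected here.
inline std::size_t utf16_bytes(char32_t cp)
{
    assert(cp <= 0x10FFFF);
    // BMP characters take one 16-bit code unit; everything above the BMP
    // takes a surrogate pair, i.e. two 16-bit code units.
    return cp <= 0xFFFF ? 2 : 4;
}
```

The largest value this can return is 4, which is the figure argued for above.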

codecvt_utf8<char32_t> converts between UTF-8 and UCS-4.  UTF-8 uses up to 4
bytes per character for code points up to 0x10FFFF.  Therefore,
codecvt_utf8<char32_t>::max_length() should return 4 if the BOM is not
taken into account.
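
The same reasoning for UTF-8 can be sketched as follows.  Again, utf8_bytes is
a hypothetical helper, not a libstdc++ function, and it assumes a valid code
point and no BOM.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not a libstdc++ function: number of bytes needed to
// encode a single code point in UTF-8, ignoring any BOM.
// Assumes cp is a valid code point (at most 0x10FFFF).
inline std::size_t utf8_bytes(char32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;  // 7 payload bits
    if (cp < 0x800)   return 2;  // 11 payload bits
    if (cp < 0x10000) return 3;  // 16 payload bits
    return 4;                    // up to 0x10FFFF
}
```

Here too the worst case is 4 bytes per UCS-4 character.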

As I said in my previous post, I'm not sure whether the BOM should be accounted
for in max_length().  If I'm not mistaken, the purpose of this function is to
let a user estimate how many bytes are needed to hold a UCS-4 string once it is
converted to either UTF-16 or UTF-8.  My guess is that the BOM can be taken
into account separately when doing the estimation.  For example, when
wstring_convert estimates the length of the std::string to be generated by
wstring_convert::to_bytes(), the estimate should be the number of UCS-4
characters multiplied by max_length(), plus the size of the BOM if one is
required.  The resulting std::string can be resized after the conversion to
eliminate the unused bytes.
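
A rough sketch of that estimate-then-shrink approach is below.  It does not
show how wstring_convert is actually implemented: worst_case_bytes and to_utf8
are hypothetical names, the encoder is hand-written rather than a codecvt
facet, and the hard-coded 4 stands in for the max_length() value argued for
above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: worst-case output size, with the BOM counted
// separately from the per-character maximum.
inline std::size_t worst_case_bytes(std::size_t n_chars,
                                    std::size_t max_len,
                                    std::size_t bom_size)
{
    return n_chars * max_len + bom_size;
}

// Hypothetical hand-written UCS-4 to UTF-8 encoder (not a codecvt facet).
// Assumes every element of `in` is a valid code point; no validation is done.
inline std::string to_utf8(const std::u32string& in, bool with_bom)
{
    const std::size_t bom_size = with_bom ? 3 : 0;

    // Allocate the worst case up front: 4 bytes per character plus the BOM.
    std::string out(worst_case_bytes(in.size(), 4, bom_size), '\0');
    std::size_t pos = 0;

    if (with_bom) {
        out[pos++] = '\xEF';
        out[pos++] = '\xBB';
        out[pos++] = '\xBF';
    }

    for (char32_t cp : in) {
        if (cp < 0x80) {
            out[pos++] = static_cast<char>(cp);
        } else if (cp < 0x800) {
            out[pos++] = static_cast<char>(0xC0 | (cp >> 6));
            out[pos++] = static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[pos++] = static_cast<char>(0xE0 | (cp >> 12));
            out[pos++] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out[pos++] = static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out[pos++] = static_cast<char>(0xF0 | (cp >> 18));
            out[pos++] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out[pos++] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out[pos++] = static_cast<char>(0x80 | (cp & 0x3F));
        }
    }

    // Shrink to the number of bytes actually written.
    out.resize(pos);
    return out;
}
```

For instance, to_utf8(U"A", true) starts from a 7-byte buffer (1 * 4 + 3) and
ends up holding 4 bytes: the 3-byte BOM followed by 'A'.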

Note that the comment you mentioned in your reply probably only applies to
codecvt_utf8_utf16, which converts between UTF-8 and UTF-16 directly without
going through the UCS-4 conversion.