public inbox for gcc-bugs@sourceware.org
* [Bug libstdc++/66464] New: codecvt_utf16 max_length returning incorrect value
@ 2015-06-08 23:44 lcarreon at bigpond dot net.au
2015-06-09 9:14 ` [Bug libstdc++/66464] " redi at gcc dot gnu.org
` (6 more replies)
0 siblings, 7 replies; 8+ messages in thread
From: lcarreon at bigpond dot net.au @ 2015-06-08 23:44 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464
Bug ID: 66464
Summary: codecvt_utf16 max_length returning incorrect value
Product: gcc
Version: 5.1.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: lcarreon at bigpond dot net.au
Target Milestone: ---
I just noticed that codecvt_utf16<char32_t>::max_length() is returning 3.
This appears to be the wrong value because a surrogate pair is composed of 4
bytes, so max_length() should return at least 4.
I'm also wondering whether the BOM should be taken into account. If a UTF-16
string begins with a BOM followed by a surrogate pair, 6 bytes have to be
consumed to generate a single UCS-4 character.
Should the same thing be considered for codecvt_utf8<char32_t>::max_length(),
which currently returns 4? Taking into account the BOM and the longest UTF-8
sequence for a character up to 0x10FFFF, shouldn't max_length() return 7?
I'm not really sure if the BOM should be taken into account because the
standard's definition for do_max_length() simply says the maximum number of
input characters that needs to be consumed to generate a single output
character.
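
The values discussed in this report can be checked directly. The following is
a minimal sketch, assuming a C++11 (or later) standard library that still
ships <codecvt>; the two wrapper function names are hypothetical and exist
only for illustration. On an affected libstdc++ 5.1 the first function would
return 3; fixed releases are expected to return at least 4.

```cpp
#include <codecvt>
#include <locale>

// Hypothetical wrappers (not part of any library): construct each facet as a
// plain local object and ask it for max_length(). The codecvt_utf16 and
// codecvt_utf8 facets have public destructors, so this is permitted.
int utf16_to_ucs4_max_length() {
    std::codecvt_utf16<char32_t> cvt;  // default mode: no BOM handling
    return cvt.max_length();
}

int utf8_to_ucs4_max_length() {
    std::codecvt_utf8<char32_t> cvt;   // default mode: no BOM handling
    return cvt.max_length();
}
```

Both facets are constructed with the default Mode argument, so the result says
nothing about what an implementation reports when consume_header is set.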
* [Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value
From: redi at gcc dot gnu.org @ 2015-06-09 9:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464
Jonathan Wakely <redi at gcc dot gnu.org> changed:
           What    |Removed                      |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                  |ASSIGNED
   Last reconfirmed|                             |2015-06-09
           Assignee|unassigned at gcc dot gnu.org|redi at gcc dot gnu.org
     Ever confirmed|0                            |1
--- Comment #1 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Leo Carreon from comment #0)
> I just noticed that codecvt_utf16<char32_t>::max_length() is returning 3.
>
> This appears to be the wrong value because a surrogate pair is composed of 4
> bytes, so max_length() should return at least 4.
Agreed, I think that's just a mistake.
I wrote this comment in the code:
  int
  codecvt<char16_t, char, mbstate_t>::do_max_length() const throw()
  {
    // Any valid UTF-8 sequence of 3 bytes fits in a single 16-bit code unit,
    // whereas 4 byte sequences require two 16-bit code units.
    return 3;
  }
But that reasoning (even if it's correct!) doesn't apply to
codecvt_utf16<char32_t>.
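
The reasoning in that code comment can be sketched as follows. This is an
illustration only, not libstdc++ code: both function names are hypothetical,
and the lead-byte check deliberately ignores overlong and out-of-range
sequences.

```cpp
// Length in bytes of a UTF-8 sequence, judged from its lead byte alone.
// Returns 0 for a byte that cannot start a sequence (e.g. a continuation).
constexpr int utf8_sequence_length(unsigned char lead) {
    return lead < 0x80 ? 1
         : (lead & 0xE0) == 0xC0 ? 2
         : (lead & 0xF0) == 0xE0 ? 3
         : (lead & 0xF8) == 0xF0 ? 4
         : 0;
}

// UTF-16 code units produced by a valid UTF-8 sequence of the given length:
// sequences of 1 to 3 bytes encode U+0000..U+FFFF (one code unit), while
// 4-byte sequences encode U+10000..U+10FFFF (a surrogate pair).
constexpr int utf16_units_for_utf8_length(int len) {
    return len == 4 ? 2 : 1;
}
```

For example, the lead byte 0xE2 of U+20AC (E2 82 AC) maps to one code unit,
while the lead byte 0xF0 of U+1F600 (F0 9F 98 80) maps to two.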
> I'm also wondering whether the BOM should be taken into account. If a UTF-16
> string begins with a BOM followed by a surrogate pair, 6 bytes have to be
> consumed to generate a single UCS-4 character.
>
> Should the same thing be considered for
> codecvt_utf8<char32_t>::max_length(), which currently returns 4? Taking into
> account the BOM and the longest UTF-8 sequence for a character up to
> 0x10FFFF, shouldn't max_length() return 7?
>
> I'm not really sure if the BOM should be taken into account because the
> standard's definition for do_max_length() simply says the maximum number of
> input characters that needs to be consumed to generate a single output
> character.
That's a very good question.
* [Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value
From: lcarreon at bigpond dot net.au @ 2015-06-09 23:10 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464
--- Comment #2 from Leo Carreon <lcarreon at bigpond dot net.au> ---
Just clarifying that my comments are to do with codecvt_utf16<char32_t> and
codecvt_utf8<char32_t>.
The way I understand it, codecvt_utf16<char32_t> should be converting between
UTF-16 and UCS-4. UTF-16 uses 2 bytes for characters in the BMP (characters in
the range 0x0000 to 0xFFFF) and 4 bytes (surrogate pairs) for characters above
the BMP (0x010000 to 0x10FFFF). UCS-4 uses 4 byte values. Therefore,
codecvt_utf16<char32_t>::max_length() should be returning 4 if the BOM is not
taken into account.
codecvt_utf8<char32_t> converts between UTF-8 and UCS-4. UTF-8 can use up to 4
bytes for characters in the range up to 0x10FFFF. Therefore,
codecvt_utf8<char32_t>::max_length() should be returning 4 if the BOM is not
taken into account.
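
The byte counts in the two paragraphs above can be written out as a small
sketch. The function and constant names are hypothetical, and the functions
assume their argument is a valid Unicode scalar value (at most 0x10FFFF and
not a surrogate).

```cpp
#include <cstddef>

// Bytes needed to encode one code point in UTF-16: 2 in the BMP,
// 4 (a surrogate pair) above it.
constexpr std::size_t utf16_bytes(char32_t cp) {
    return cp <= 0xFFFF ? 2 : 4;
}

// Bytes needed to encode one code point in UTF-8.
constexpr std::size_t utf8_bytes(char32_t cp) {
    return cp <= 0x7F ? 1 : cp <= 0x7FF ? 2 : cp <= 0xFFFF ? 3 : 4;
}

// Sizes of the encoded byte order mark U+FEFF.
constexpr std::size_t utf16_bom_bytes = 2;  // FE FF or FF FE
constexpr std::size_t utf8_bom_bytes = 3;   // EF BB BF

// Worst cases without and with a leading BOM, matching the figures
// 4, 6 and 7 mentioned in this thread.
static_assert(utf16_bytes(0x10FFFF) == 4, "surrogate pair is 4 bytes");
static_assert(utf8_bytes(0x10FFFF) == 4, "longest UTF-8 sequence is 4 bytes");
static_assert(utf16_bom_bytes + utf16_bytes(0x10FFFF) == 6, "BOM + pair");
static_assert(utf8_bom_bytes + utf8_bytes(0x10FFFF) == 7, "BOM + 4 bytes");
```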
As I said in my previous post, I'm not sure if the BOM should be accounted for
in max_length(). If I'm not mistaken, the purpose of this function is to allow
a user to estimate how many bytes are required to fit a UCS-4 string when
converted to either UTF-16 or UTF-8. My guess is that the BOM can be taken into
account separately when doing the estimation. For example, when
wstring_convert estimates the length of the std::string to be generated by
wstring_convert::to_bytes(), it should use the number of UCS-4 characters
multiplied by max_length(), plus the size of the BOM if required. The
resulting std::string can be resized after the conversion to eliminate the
unused bytes.
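
The estimate-then-shrink approach described above might look like the
following sketch. It is not how wstring_convert is implemented: to_utf16be is
a hypothetical hand-written encoder, the max_length value of 4 is hard-coded
rather than queried from a facet, and every input value is assumed to be a
valid Unicode scalar value.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encode UCS-4 text as big-endian UTF-16, sizing the
// output up front as N * max_length + BOM size and trimming afterwards.
std::string to_utf16be(const std::u32string& in, bool with_bom) {
    const std::size_t max_length = 4;              // worst case: surrogate pair
    const std::size_t bom_size = with_bom ? 2 : 0; // BOM counted separately
    std::string out(in.size() * max_length + bom_size, '\0');

    std::size_t pos = 0;
    // Write one 16-bit code unit, most significant byte first.
    auto put16 = [&](unsigned unit) {
        out[pos++] = static_cast<char>((unit >> 8) & 0xFF);
        out[pos++] = static_cast<char>(unit & 0xFF);
    };

    if (with_bom) put16(0xFEFF);
    for (char32_t c : in) {
        if (c <= 0xFFFF) {
            put16(static_cast<unsigned>(c));       // BMP: one code unit
        } else {
            const unsigned v = static_cast<unsigned>(c) - 0x10000;
            put16(0xD800 + (v >> 10));             // high surrogate
            put16(0xDC00 + (v & 0x3FF));           // low surrogate
        }
    }
    out.resize(pos);  // drop the unused tail of the estimate
    return out;
}
```

Because the BOM is added once per string rather than once per character,
keeping it out of the per-character factor gives a tighter upper bound than
multiplying N by a max_length() that includes the BOM.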
Note that the comment you mentioned in your reply probably only applies to
codecvt_utf8_utf16 which converts between UTF-8 and UTF-16 directly without
going through the UCS-4 conversion.
* [Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value
From: redi at gcc dot gnu.org @ 2015-06-12 10:26 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464
--- Comment #3 from Jonathan Wakely <redi at gcc dot gnu.org> ---
Author: redi
Date: Fri Jun 12 10:26:05 2015
New Revision: 224415
URL: https://gcc.gnu.org/viewcvs?rev=224415&root=gcc&view=rev
Log:
PR libstdc++/66464
* src/c++11/codecvt.cc (codecvt_utf16_base<char32_t>::do_max_length):
Return 4 not 3.
Modified:
trunk/libstdc++-v3/ChangeLog
trunk/libstdc++-v3/src/c++11/codecvt.cc
* [Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value
From: redi at gcc dot gnu.org @ 2015-06-12 11:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464
--- Comment #4 from Jonathan Wakely <redi at gcc dot gnu.org> ---
Author: redi
Date: Fri Jun 12 11:22:01 2015
New Revision: 224417
URL: https://gcc.gnu.org/viewcvs?rev=224417&root=gcc&view=rev
Log:
PR libstdc++/66464
* src/c++11/codecvt.cc (codecvt_utf16_base<char32_t>::do_max_length):
Return 4 not 3.
Modified:
branches/gcc-5-branch/libstdc++-v3/ChangeLog
branches/gcc-5-branch/libstdc++-v3/src/c++11/codecvt.cc
* [Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value
From: redi at gcc dot gnu.org @ 2015-06-12 11:34 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464
Jonathan Wakely <redi at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED
   Target Milestone|---                         |5.2
--- Comment #5 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Leo Carreon from comment #2)
> Just clarifying that my comments are to do with codecvt_utf16<char32_t> and
> codecvt_utf8<char32_t>.
>
> The way I understand it, codecvt_utf16<char32_t> should be converting
> between UTF-16 and UCS-4. UTF-16 uses 2 bytes for characters in the BMP
> (characters in the range 0x0000 to 0xFFFF) and 4 bytes (surrogate pairs) for
> characters above the BMP (0x010000 to 0x10FFFF). UCS-4 uses 4 byte values.
> Therefore, codecvt_utf16<char32_t>::max_length() should be returning 4 if
> the BOM is not taken into account.
Yes, that's now fixed.
> codecvt_utf8<char32_t> converts between UTF-8 and UCS-4. UTF-8 can use up
> to 4 bytes for characters in the range up to 0x10FFFF. Therefore,
> codecvt_utf8<char32_t>::max_length() should be returning 4 if the BOM is not
> taken into account.
>
> As I said in my previous post, I'm not sure if the BOM should be accounted
> for in max_length().
I've raised that question with the C++ committee.
> If I'm not mistaken, the purpose of this function is
> to allow a user to estimate how many bytes are required to fit a UCS-4
> string when converted to either UTF-16 or UTF-8. My guess is that the BOM
> can be taken into account separately when doing the estimation. For example,
> when wstring_convert estimates the length of the std::string to be generated
> by wstring_convert::to_bytes(), it should use the number of UCS-4 characters
> multiplied by max_length(), plus the size of the BOM if required.
> The resulting std::string can be resized after the conversion to eliminate
> the unused bytes.
I believe that's the usual use case for max_length, and agree it's better to
calculate N * max_length() + length(BOM) rather than have max_length() include
the BOM. However, the way max_length() is specified in the standard does
suggest it should include the BOM. We'll discuss it in the committee and
process it as a defect report against the standard if necessary.
> Note that the comment you mentioned in your reply probably only applies to
> codecvt_utf8_utf16 which converts between UTF-8 and UTF-16 directly without
> going through the UCS-4 conversion.
Agreed.
* [Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value
From: lcarreon at bigpond dot net.au @ 2015-06-18 23:15 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464
--- Comment #6 from Leo Carreon <lcarreon at bigpond dot net.au> ---
Has this fix been included in the recent gcc-5.1.1-3 update on Fedora 22?
* [Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value
From: redi at gcc dot gnu.org @ 2015-06-19 9:47 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464
--- Comment #7 from Jonathan Wakely <redi at gcc dot gnu.org> ---
No, it will be in 5.1.1-4