[Bug libstdc++/66464] New: codecvt_utf16 max_length returning incorrect value

public inbox for gcc-bugs@sourceware.org

* [Bug libstdc++/66464] New: codecvt_utf16 max_length returning incorrect value
@ 2015-06-08 23:44 lcarreon at bigpond dot net.au
  2015-06-09  9:14 ` [Bug libstdc++/66464] " redi at gcc dot gnu.org
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: lcarreon at bigpond dot net.au @ 2015-06-08 23:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464

            Bug ID: 66464
           Summary: codecvt_utf16 max_length returning incorrect value
           Product: gcc
           Version: 5.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: lcarreon at bigpond dot net.au
  Target Milestone: ---

I just noticed that codecvt_utf16<char32_t>::max_length() is returning 3.

This appears to be the wrong value: a surrogate pair occupies 4 bytes, so
max_length() should return at least 4.

I'm also wondering whether the BOM should be taken into account.  If a UTF-16
string begins with a BOM followed immediately by a surrogate pair, 6 bytes
have to be consumed to generate a single UCS-4 character.

Should the same thing be considered for codecvt_utf8<char32_t>::max_length(),
which currently returns 4?  Taking into account the BOM and the longest UTF-8
encoding of a character up to 0x10FFFF, shouldn't max_length() return 7?

I'm not really sure whether the BOM should be taken into account, because the
standard's definition of do_max_length() simply says it is the maximum number
of input characters that need to be consumed to generate a single output
character.
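
The byte counts in this report can be checked with a small sketch.  The helper
names below (utf16_bytes, utf8_bytes, and the two BOM constants) are
hypothetical and exist only for illustration; they are not part of libstdc++
and say nothing about what max_length() is required to return.

```cpp
#include <cstddef>

// Hypothetical helpers for illustration only; not part of libstdc++.

// Bytes needed to encode code point `c` in UTF-16: one 16-bit code unit for
// the BMP, a surrogate pair (two code units) above it.
constexpr std::size_t utf16_bytes(char32_t c)
{ return c <= 0xFFFF ? 2 : 4; }

// Bytes needed to encode code point `c` in UTF-8.
constexpr std::size_t utf8_bytes(char32_t c)
{ return c <= 0x7F ? 1 : c <= 0x7FF ? 2 : c <= 0xFFFF ? 3 : 4; }

constexpr std::size_t utf16_bom_bytes = 2;  // FE FF or FF FE
constexpr std::size_t utf8_bom_bytes  = 3;  // EF BB BF

// Without a BOM, the longest encoding of one character is 4 bytes in both.
static_assert(utf16_bytes(U'\U0010FFFF') == 4, "surrogate pair is 4 bytes");
static_assert(utf8_bytes(U'\U0010FFFF') == 4, "longest UTF-8 form is 4 bytes");

// With a leading BOM, the first character may need 6 or 7 input bytes.
static_assert(utf16_bom_bytes + utf16_bytes(U'\U0010FFFF') == 6, "");
static_assert(utf8_bom_bytes + utf8_bytes(U'\U0010FFFF') == 7, "");
```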


* [Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value
  2015-06-08 23:44 [Bug libstdc++/66464] New: codecvt_utf16 max_length returning incorrect value lcarreon at bigpond dot net.au
@ 2015-06-09  9:14 ` redi at gcc dot gnu.org
  2015-06-09 23:10 ` lcarreon at bigpond dot net.au
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: redi at gcc dot gnu.org @ 2015-06-09  9:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464

Jonathan Wakely <redi at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2015-06-09
           Assignee|unassigned at gcc dot gnu.org      |redi at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Leo Carreon from comment #0)
> I just noticed that codecvt_utf16<char32_t>::max_length() is returning 3.
> 
> This appears to be the wrong value because a surrogate pair is composed of 4
> bytes therefore max_length() should at least be returning 4.

Agreed, I think that's just a mistake.

I wrote this comment in the code:

int
codecvt<char16_t, char, mbstate_t>::do_max_length() const throw()
{
  // Any valid UTF-8 sequence of 3 bytes fits in a single 16-bit code unit,
  // whereas 4 byte sequences require two 16-bit code units.
  return 3;
}

But that reasoning (even if it's correct!) doesn't apply to
codecvt_utf16<char32_t>.
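
The relationship described in that code comment can be sketched as follows.
Both helper names are hypothetical and are not libstdc++ internals; the sketch
only relates UTF-8 sequence lengths to the number of 16-bit code units they
produce, and does not settle which value do_max_length() should return.

```cpp
#include <cstddef>

// Hypothetical helpers for illustration only; not libstdc++ internals.

// Length in bytes of a well-formed UTF-8 sequence, judged from its lead
// byte.  Returns 0 for a byte that cannot start a well-formed sequence.
constexpr std::size_t utf8_sequence_length(unsigned char lead)
{
  return lead < 0x80 ? 1
       : (lead >= 0xC2 && lead <= 0xDF) ? 2
       : (lead >= 0xE0 && lead <= 0xEF) ? 3
       : (lead >= 0xF0 && lead <= 0xF4) ? 4
       : 0;
}

// Number of 16-bit code units produced by a UTF-8 sequence of `utf8_len`
// bytes (precondition: 1 <= utf8_len <= 4).  Only 4-byte sequences encode
// characters above the BMP, which need a surrogate pair.
constexpr std::size_t utf16_units(std::size_t utf8_len)
{ return utf8_len == 4 ? 2 : 1; }

// U+20AC is E2 82 AC in UTF-8: three bytes, one 16-bit code unit.
static_assert(utf16_units(utf8_sequence_length(0xE2)) == 1, "");
// U+1F600 is F0 9F 98 80 in UTF-8: four bytes, two 16-bit code units.
static_assert(utf16_units(utf8_sequence_length(0xF0)) == 2, "");
```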

> I'm also wondering whether the BOM should be taken into account.  If it so
> happens that at the beginning of a UTF-16 string which has a BOM and it so
> happens to start with a surrogate pair, 6 bytes have to be consumed to
> generate a single UCS-4 character.
> 
> Should the same thing be considered with
> codecvt_utf8<char32_t>::max_length() which currently returns 4.  Taking into
> account the BOM and the longest UTF-8 character below 0x10FFFF, shouldn't
> max_length() return 7.
> 
> I'm not really sure if the BOM should be taken into account because the
> standard's definition for do_max_length() simply says the maximum number of
> input characters that needs to be consumed to generate a single output
> character.

That's a very good question.



* [Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value
  2015-06-08 23:44 [Bug libstdc++/66464] New: codecvt_utf16 max_length returning incorrect value lcarreon at bigpond dot net.au
  2015-06-09  9:14 ` [Bug libstdc++/66464] " redi at gcc dot gnu.org
@ 2015-06-09 23:10 ` lcarreon at bigpond dot net.au
  2015-06-12 10:26 ` redi at gcc dot gnu.org
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: lcarreon at bigpond dot net.au @ 2015-06-09 23:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464

--- Comment #2 from Leo Carreon <lcarreon at bigpond dot net.au> ---
Just clarifying that my comments are to do with codecvt_utf16<char32_t> and
codecvt_utf8<char32_t>.

The way I understand it, codecvt_utf16<char32_t> should convert between
UTF-16 and UCS-4.  UTF-16 uses 2 bytes for characters in the BMP (the range
0x0000 to 0xFFFF) and 4 bytes (a surrogate pair) for characters above the BMP
(0x010000 to 0x10FFFF).  UCS-4 uses 4-byte values.  Therefore,
codecvt_utf16<char32_t>::max_length() should return 4 if the BOM is not taken
into account.

codecvt_utf8<char32_t> converts between UTF-8 and UCS-4.  UTF-8 uses up to 4
bytes for characters in the range up to 0x10FFFF.  Therefore,
codecvt_utf8<char32_t>::max_length() should return 4 if the BOM is not taken
into account.

As I said in my previous post, I'm not sure whether the BOM should be
accounted for in max_length().  If I'm not mistaken, the purpose of this
function is to let a user estimate how many bytes are required to hold a
UCS-4 string once it is converted to either UTF-16 or UTF-8.  My guess is
that the BOM can be taken into account separately when doing the estimation.
For example, when wstring_convert estimates the length of the std::string to
be generated by wstring_convert::to_bytes(), the estimate should be the number
of UCS-4 characters multiplied by max_length(), plus the size of the BOM if
one is required.  The resulting std::string can be resized after the
conversion to eliminate the unused bytes.

Note that the comment you mentioned in your reply probably only applies to
codecvt_utf8_utf16, which converts between UTF-8 and UTF-16 directly without
going through UCS-4.
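
The estimate-then-resize approach described above can be sketched as follows.
This is a minimal illustration under stated assumptions, not how
wstring_convert is implemented: estimate_bytes and to_utf8 are hypothetical
names, the encoder is written by hand rather than using a codecvt facet, and
the input is assumed to contain only valid code points (no surrogates).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a standard library function): upper bound on the
// bytes needed for `n_chars` characters, with the BOM counted separately.
inline std::size_t estimate_bytes(std::size_t n_chars, std::size_t max_length,
                                  std::size_t bom_length)
{ return n_chars * max_length + bom_length; }

// Hand-written UCS-4 to UTF-8 encoder, used only to exercise the estimate.
// Assumes every input value is a valid code point.
inline std::string to_utf8(const std::u32string& in, bool write_bom)
{
  const std::size_t bom_length = write_bom ? 3 : 0;
  // Over-allocate: at most 4 bytes per character, plus the BOM.
  std::string out(estimate_bytes(in.size(), 4, bom_length), '\0');
  std::size_t pos = 0;
  if (write_bom) {
    out[pos++] = '\xEF'; out[pos++] = '\xBB'; out[pos++] = '\xBF';
  }
  for (char32_t c : in) {
    assert(c <= 0x10FFFF);
    if (c <= 0x7F) {
      out[pos++] = static_cast<char>(c);
    } else if (c <= 0x7FF) {
      out[pos++] = static_cast<char>(0xC0 | (c >> 6));
      out[pos++] = static_cast<char>(0x80 | (c & 0x3F));
    } else if (c <= 0xFFFF) {
      out[pos++] = static_cast<char>(0xE0 | (c >> 12));
      out[pos++] = static_cast<char>(0x80 | ((c >> 6) & 0x3F));
      out[pos++] = static_cast<char>(0x80 | (c & 0x3F));
    } else {
      out[pos++] = static_cast<char>(0xF0 | (c >> 18));
      out[pos++] = static_cast<char>(0x80 | ((c >> 12) & 0x3F));
      out[pos++] = static_cast<char>(0x80 | ((c >> 6) & 0x3F));
      out[pos++] = static_cast<char>(0x80 | (c & 0x3F));
    }
  }
  out.resize(pos);  // drop the unused tail of the over-allocated buffer
  return out;
}
```

For three characters with a BOM, estimate_bytes(3, 4, 3) reserves 15 bytes;
to_utf8(U"a\u20AC\U0001F600", true) then writes 3 + 1 + 3 + 4 = 11 bytes and
shrinks the string to that size.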


* [Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value
  2015-06-08 23:44 [Bug libstdc++/66464] New: codecvt_utf16 max_length returning incorrect value lcarreon at bigpond dot net.au
  2015-06-09  9:14 ` [Bug libstdc++/66464] " redi at gcc dot gnu.org
  2015-06-09 23:10 ` lcarreon at bigpond dot net.au
@ 2015-06-12 10:26 ` redi at gcc dot gnu.org
  2015-06-12 11:22 ` redi at gcc dot gnu.org
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: redi at gcc dot gnu.org @ 2015-06-12 10:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464

--- Comment #3 from Jonathan Wakely <redi at gcc dot gnu.org> ---
Author: redi
Date: Fri Jun 12 10:26:05 2015
New Revision: 224415

URL: https://gcc.gnu.org/viewcvs?rev=224415&root=gcc&view=rev
Log:
        PR libstdc++/66464
        * src/c++11/codecvt.cc (codecvt_utf16_base<char32_t>::do_max_length):
        Return 4 not 3.

Modified:
    trunk/libstdc++-v3/ChangeLog
    trunk/libstdc++-v3/src/c++11/codecvt.cc



* [Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value
  2015-06-08 23:44 [Bug libstdc++/66464] New: codecvt_utf16 max_length returning incorrect value lcarreon at bigpond dot net.au
                   ` (2 preceding siblings ...)
  2015-06-12 10:26 ` redi at gcc dot gnu.org
@ 2015-06-12 11:22 ` redi at gcc dot gnu.org
  2015-06-12 11:34 ` redi at gcc dot gnu.org
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: redi at gcc dot gnu.org @ 2015-06-12 11:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464

--- Comment #4 from Jonathan Wakely <redi at gcc dot gnu.org> ---
Author: redi
Date: Fri Jun 12 11:22:01 2015
New Revision: 224417

URL: https://gcc.gnu.org/viewcvs?rev=224417&root=gcc&view=rev
Log:
        PR libstdc++/66464
        * src/c++11/codecvt.cc (codecvt_utf16_base<char32_t>::do_max_length):
        Return 4 not 3.

Modified:
    branches/gcc-5-branch/libstdc++-v3/ChangeLog
    branches/gcc-5-branch/libstdc++-v3/src/c++11/codecvt.cc



* [Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value
  2015-06-08 23:44 [Bug libstdc++/66464] New: codecvt_utf16 max_length returning incorrect value lcarreon at bigpond dot net.au
                   ` (3 preceding siblings ...)
  2015-06-12 11:22 ` redi at gcc dot gnu.org
@ 2015-06-12 11:34 ` redi at gcc dot gnu.org
  2015-06-18 23:15 ` lcarreon at bigpond dot net.au
  2015-06-19  9:47 ` redi at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: redi at gcc dot gnu.org @ 2015-06-12 11:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464

Jonathan Wakely <redi at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED
   Target Milestone|---                         |5.2

--- Comment #5 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Leo Carreon from comment #2)
> Just clarifying that my comments are to do with codecvt_utf16<char32_t> and
> codecvt_utf8<char32_t>.
> 
> The way I understand it, codecvt_utf16<char32_t> should be converting
> between UTF-16 and UCS-4.  UTF-16 uses 2 bytes for characters in the BMP
> (characters in the range 0x0000 to 0xFFFF) and 4 bytes (surrogate pairs) for
> characters above the BMP (0x010000 to 0x10FFFF).  UCS-4 uses 4 byte values. 
> Therefore, codecvt_utf16<char32_t>::max_length() should be returning 4 if
> the BOM is not taken into account.

Yes, that's now fixed.

> codecvt_utf8<char32_t> converts between UTF-8 and UCS-4.  UTF-8 can use up
> to 4 bytes for characters up to the range 0x10FFFF.  Therefore,
> codecvt_utf8<char32_t>::max_length() should be returning 4 if the BOM is not
> taken into account.
> 
> As I said in my previous post, I'm not sure if the BOM should be accounted
> for in max_length().

I've raised that question with the C++ committee.

>  If I'm not mistaken, the purpose of this function is
> to allow a user to estimate how many bytes are required to fit a UCS-4
> string when converted to either UTF-16 or UTF-8.  And my guess, the BOM can
> be taken into account separately when doing the estimation.  For example,
> when wstring_convert estimates the length of the std::string to be generated
> by wstring_convert::to_bytes().  It should be the number of UCS-4 characters
> multiplied by max_length() and then add the size of the BOM if required. 
> The resulting std::string can be resized after the conversion to eliminate
> the unused bytes.

I believe that's the usual use case for max_length(), and I agree it's better
to calculate N * max_length() + length(BOM) than to have max_length() include
the BOM.  However, the way max_length() is specified in the standard does
suggest that it should include the BOM.  We'll discuss it in the committee and
process it as a defect report against the standard if necessary.

> Note that the comment you mentioned in your reply probably only applies to
> codecvt_utf8_utf16 which converts between UTF-8 and UTF-16 directly without
> going thru the UCS-4 conversion.

Agreed.



* [Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value
  2015-06-08 23:44 [Bug libstdc++/66464] New: codecvt_utf16 max_length returning incorrect value lcarreon at bigpond dot net.au
                   ` (4 preceding siblings ...)
  2015-06-12 11:34 ` redi at gcc dot gnu.org
@ 2015-06-18 23:15 ` lcarreon at bigpond dot net.au
  2015-06-19  9:47 ` redi at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: lcarreon at bigpond dot net.au @ 2015-06-18 23:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464

--- Comment #6 from Leo Carreon <lcarreon at bigpond dot net.au> ---
Has this fix been included in the recent gcc-5.1.1-3 update on Fedora 22?



* [Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value
  2015-06-08 23:44 [Bug libstdc++/66464] New: codecvt_utf16 max_length returning incorrect value lcarreon at bigpond dot net.au
                   ` (5 preceding siblings ...)
  2015-06-18 23:15 ` lcarreon at bigpond dot net.au
@ 2015-06-19  9:47 ` redi at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: redi at gcc dot gnu.org @ 2015-06-19  9:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464

--- Comment #7 from Jonathan Wakely <redi at gcc dot gnu.org> ---
No, it will be in 5.1.1-4

