public inbox for gcc-bugs@sourceware.org
* [Bug libstdc++/31643]  New: Codecvt facets with UTF-8 encoding fail to recognize partial encoding sequences
@ 2007-04-20 21:26 jcavalla at postini dot com
  2007-04-20 21:30 ` [Bug libstdc++/31643] " jcavalla at postini dot com
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: jcavalla at postini dot com @ 2007-04-20 21:26 UTC (permalink / raw)
  To: gcc-bugs

Problem:

When calling the in() method of a codecvt facet for a locale that specifies
UTF-8 encoding, the method fails to recognize partial (i.e., incomplete) UTF-8
encoding sequences at the end of the source string.  Instead of returning the
expected std::codecvt_base::partial status code with the returned source
position (arg-4) indexing the start of the incomplete sequence, the method
returns std::codecvt_base::ok with the returned source position just past the
end of the source string.  Nothing from the partial sequence ends up in the
destination wide string (as expected).

Compilation:

gcc -v --save-temps -Wall -ansi -pedantic -g -o localetest localetest.cxx

Compilation output:

Using built-in specs.
Target: i386-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=i386-redhat-linux
Thread model: posix
gcc version 4.1.1 20070105 (Red Hat 4.1.1-51)
 /usr/libexec/gcc/i386-redhat-linux/4.1.1/cc1plus -E -quiet -v -D_GNU_SOURCE localetest.cxx -mtune=generic -ansi -Wall -pedantic -fworking-directory -fpch-preprocess -o localetest.ii
ignoring nonexistent directory "/usr/lib/gcc/i386-redhat-linux/4.1.1/../../../../i386-redhat-linux/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/lib/gcc/i386-redhat-linux/4.1.1/../../../../include/c++/4.1.1
 /usr/lib/gcc/i386-redhat-linux/4.1.1/../../../../include/c++/4.1.1/i386-redhat-linux
 /usr/lib/gcc/i386-redhat-linux/4.1.1/../../../../include/c++/4.1.1/backward
 /usr/local/include
 /usr/lib/gcc/i386-redhat-linux/4.1.1/include
 /usr/include
End of search list.
 /usr/libexec/gcc/i386-redhat-linux/4.1.1/cc1plus -fpreprocessed localetest.ii -quiet -dumpbase localetest.cxx -mtune=generic -ansi -auxbase localetest -g -Wall -pedantic -ansi -version -o localetest.s
GNU C++ version 4.1.1 20070105 (Red Hat 4.1.1-51) (i386-redhat-linux)
        compiled by GNU C version 4.1.1 20070105 (Red Hat 4.1.1-51).
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: 4720743fdfefd64206c8550433f6e508
 as -V -Qy -o localetest.o localetest.s
GNU assembler version 2.17.50.0.6-2.fc6 (i386-redhat-linux) using BFD version 2.17.50.0.6-2.fc6 20061020
 /usr/libexec/gcc/i386-redhat-linux/4.1.1/collect2 --eh-frame-hdr -m elf_i386 --hash-style=gnu -dynamic-linker /lib/ld-linux.so.2 -o localetest /usr/lib/gcc/i386-redhat-linux/4.1.1/../../../crt1.o /usr/lib/gcc/i386-redhat-linux/4.1.1/../../../crti.o /usr/lib/gcc/i386-redhat-linux/4.1.1/crtbegin.o -L/usr/lib/gcc/i386-redhat-linux/4.1.1 -L/usr/lib/gcc/i386-redhat-linux/4.1.1 -L/usr/lib/gcc/i386-redhat-linux/4.1.1/../../.. localetest.o -lstdc++ -lm -lgcc_s -lgcc -lc -lgcc_s -lgcc /usr/lib/gcc/i386-redhat-linux/4.1.1/crtend.o /usr/lib/gcc/i386-redhat-linux/4.1.1/../../../crtn.o

Test Source File (localetest.cxx):

//
// This test demonstrates that UTF-8 codecvt facets are ignoring incomplete
// trailing encoding sequences.  The expected behavior is a return of the
// status value std::codecvt_base::partial, with the returned current source
// position at the start of the failed sequence.  The actual behavior is a
// return of std::codecvt_base::ok, with the returned current source position
// at the end of the source string (i.e., the incomplete sequence is ignored).
// 

#include <cstring>              // memset
#include <iostream>
#include <string>
#include <locale>

using namespace std;

//
// Some typedefs to help with facet access.
//

typedef codecvt_base::result Result;
typedef string::traits_type::state_type State;
typedef codecvt<wstring::value_type, string::value_type, State> Converter;

wchar_t to[256];                // Destination buffer.

//
// Perform each test iteration fresh, just to make sure that there isn't any
// lingering context between tests.
//

void
dotest(
    const string &test_name,
    const char *const locale_name,
    const string &test_string
) {

    State q;                            // Shift state context.
    const string::value_type *me = 0;   // Multibyte source current position.
    wstring::value_type *we = 0;        // Wide destination current position.
    Result status;                      // Conversion status.

    //
    // Set the current locale.
    //

    locale loc(locale_name);
    locale::global(loc);
    cout.imbue(loc);

    //
    // Start with a clear output buffer.
    //

    memset(to, 0, sizeof(to));

    //
    // Do the conversion from narrow multibyte to wide unicode.
    //

    const Converter& cvt = use_facet<Converter>(loc);
    memset(&q, 0, sizeof(q));
    string::size_type src_size = test_string.size();

    status =
        cvt.in(
            q,
            test_string.data(), test_string.data() + src_size, me,
            to, to + sizeof(to)/sizeof(to[0]), we
        );

    string::size_type mpos = me - test_string.data();
    wstring::size_type wpos = we - to;

    //
    // Display the results:
    //

    cout << endl;

    cout << test_name << ": " << loc.name() << endl;
    cout << "  Input:";
    for (
        string::const_iterator i = test_string.begin();
        i != test_string.end();
        ++i
    )
        cout << " " << hex << ((*i)&0xFF);
    cout << " \"" << test_string << "\"" << endl;

    cout << dec << "  Result=" << status
        << " Source=" << mpos
        << " Dest=" << wpos
        << endl;

    cout << "  Output:";
    for (size_t i = 0; i < wpos; ++i)
        cout << " " << hex << to[i];
    cout << endl;

    cout << endl;

    return;

}

//
// Do three tests for each locale: one with a good string, one with a partial
// string, and one with an error string.
//

string from_ok("\xC2\xA1Hasta ma\xC3\xB1\x61na!");
                                // Whole string, with complete lowercase en-yay
                                // sequence (\xC3\xB1).

string from_partial("\xC2\xA1Hasta ma\xC3");
                                // Partial string, with lowercase en-yay cut
                                // off after the first byte of the two-byte
                                // sequence.

string from_error("\xC2\xA1Hasta\xFF ma\xC3\xB1\x61na!");
                                // An error in the middle of the string, for
                                // comparison purposes.

void
dolocale(const char *const locale_name) {
    dotest("Complete", locale_name, from_ok);
    dotest("Partial", locale_name, from_partial);
    dotest("Error", locale_name, from_error);
    return;
}

//
// Do the test across 3 different locales, all with UTF-8 encoding.
//

int
main(int argc, char *argv[]) {

    dolocale("en_US.UTF-8");
    dolocale("es_US.UTF-8");
    dolocale("es_CR.UTF-8");

    return 0;

}

Test Output:

Complete: en_US.UTF-8
  Input: c2 a1 48 61 73 74 61 20 6d 61 c3 b1 61 6e 61 21 "¡Hasta mañana!"
  Result=0 Source=16 Dest=14
  Output: a1 48 61 73 74 61 20 6d 61 f1 61 6e 61 21


Partial: en_US.UTF-8
  Input: c2 a1 48 61 73 74 61 20 6d 61 c3 "¡Hasta ma&#65533;"
  Result=0 Source=11 Dest=9
  Output: a1 48 61 73 74 61 20 6d 61


Error: en_US.UTF-8
  Input: c2 a1 48 61 73 74 61 ff 20 6d 61 c3 b1 61 6e 61 21 "¡Hasta&#65533;
mañana!"
  Result=2 Source=7 Dest=6
  Output: a1 48 61 73 74 61


Complete: es_US.UTF-8
  Input: c2 a1 48 61 73 74 61 20 6d 61 c3 b1 61 6e 61 21 "¡Hasta mañana!"
  Result=0 Source=16 Dest=14
  Output: a1 48 61 73 74 61 20 6d 61 f1 61 6e 61 21


Partial: es_US.UTF-8
  Input: c2 a1 48 61 73 74 61 20 6d 61 c3 "¡Hasta ma&#65533;"
  Result=0 Source=11 Dest=9
  Output: a1 48 61 73 74 61 20 6d 61


Error: es_US.UTF-8
  Input: c2 a1 48 61 73 74 61 ff 20 6d 61 c3 b1 61 6e 61 21 "¡Hasta&#65533;
mañana!"
  Result=2 Source=7 Dest=6
  Output: a1 48 61 73 74 61


Complete: es_CR.UTF-8
  Input: c2 a1 48 61 73 74 61 20 6d 61 c3 b1 61 6e 61 21 "¡Hasta mañana!"
  Result=0 Source=16 Dest=14
  Output: a1 48 61 73 74 61 20 6d 61 f1 61 6e 61 21


Partial: es_CR.UTF-8
  Input: c2 a1 48 61 73 74 61 20 6d 61 c3 "¡Hasta ma&#65533;"
  Result=0 Source=11 Dest=9
  Output: a1 48 61 73 74 61 20 6d 61


Error: es_CR.UTF-8
  Input: c2 a1 48 61 73 74 61 ff 20 6d 61 c3 b1 61 6e 61 21 "¡Hasta&#65533;
mañana!"
  Result=2 Source=7 Dest=6
  Output: a1 48 61 73 74 61


Test Results:

Note that each of the error cases properly reports the invalid encoding byte
in the source string (\xFF) at source position 7; however, the partial test
cases improperly ignore the partial encoding sequence (\xC3) at the end of the
partial test strings.
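
As long as the facet reports ok for such input, a caller that needs to know
whether more input is required can inspect the tail of the source buffer
itself.  The following is only a minimal sketch and not part of the attached
test case: incomplete_utf8_tail is a hypothetical helper name, it looks at
lead and continuation bytes only, and it leaves full validation (such as the
\xFF error case) to the converter.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a library function): returns how many bytes at
// the end of [begin, end) belong to a UTF-8 sequence that is still missing
// continuation bytes, or 0 if the buffer ends on a sequence boundary.
// Malformed tails also report 0; detecting those is left to the converter.
std::size_t incomplete_utf8_tail(const char* begin, const char* end)
{
    assert(begin <= end);                   // precondition: a valid range

    std::size_t back = 0;                   // bytes examined from the end
    const char* p = end;
    while (p != begin && back < 4) {        // a sequence is at most 4 bytes
        --p;
        ++back;
        const unsigned char b = static_cast<unsigned char>(*p);
        if ((b & 0xC0) == 0x80)
            continue;                       // continuation byte: keep looking

        std::size_t need;                   // sequence length implied by lead
        if (b < 0x80)
            need = 1;                       // plain ASCII
        else if ((b & 0xE0) == 0xC0)
            need = 2;
        else if ((b & 0xF0) == 0xE0)
            need = 3;
        else if ((b & 0xF8) == 0xF0)
            need = 4;
        else
            return 0;                       // invalid lead byte, not "partial"

        return back < need ? back : 0;      // fewer bytes present than needed
    }
    return 0;                               // empty input or no lead byte found
}
```

For from_partial above this returns 1, which puts the start of the incomplete
sequence at source position 10 rather than the reported Source=11.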


-- 
           Summary: Codecvt facets with UTF-8 encoding fail to recognize
                    partial encoding sequences
           Product: gcc
           Version: 4.1.1
            Status: UNCONFIRMED
          Severity: major
          Priority: P3
         Component: libstdc++
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: jcavalla at postini dot com
GCC target triplet: i386-redhat-linux


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31643



* [Bug libstdc++/31643] Codecvt facets with UTF-8 encoding fail to recognize partial encoding sequences
From: jcavalla at postini dot com @ 2007-04-20 21:30 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #1 from jcavalla at postini dot com  2007-04-20 22:30 -------
Created an attachment (id=13395)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=13395&action=view)
Verbose compilation output

Produced with:

g++ -v --save-temps -Wall -ansi -pedantic -g -o localetest localetest.cxx



* [Bug libstdc++/31643] Codecvt facets with UTF-8 encoding fail to recognize partial encoding sequences
From: jcavalla at postini dot com @ 2007-04-20 21:31 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #2 from jcavalla at postini dot com  2007-04-20 22:31 -------
Created an attachment (id=13396)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=13396&action=view)
Original test case source file



* [Bug libstdc++/31643] Codecvt facets with UTF-8 encoding fail to recognize partial encoding sequences
From: jcavalla at postini dot com @ 2007-04-20 21:32 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #4 from jcavalla at postini dot com  2007-04-20 22:32 -------
Created an attachment (id=13398)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=13398&action=view)
Test results



* [Bug libstdc++/31643] Codecvt facets with UTF-8 encoding fail to recognize partial encoding sequences
From: jcavalla at postini dot com @ 2007-04-20 21:32 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #3 from jcavalla at postini dot com  2007-04-20 22:31 -------
Created an attachment (id=13397)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=13397&action=view)
Preprocessed intermediate file



* [Bug libstdc++/31643] Codecvt facets with UTF-8 encoding fail to recognize partial encoding sequences
From: jcavalla at postini dot com @ 2007-04-20 21:37 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #5 from jcavalla at postini dot com  2007-04-20 22:37 -------
1.  Please note that 'g++' was used to compile, not 'gcc' as stated in the
original report.  Sorry.

2.  I marked this bug 'major' instead of 'normal' because callers will not be
able to determine whether or not they need to supply more input to complete a
sequence.  In a read/convert loop that preserves the shift state across
convert calls, this may not matter.  I will run such a test and post the
results.
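
A read/convert loop of the kind mentioned in point 2 could look roughly like
the sketch below.  It is an illustration under assumptions, not the test that
was actually run: Widened and widen_in_chunks are hypothetical names, and the
only library calls used are codecvt::in() and std::mbsinit().  Whether
mbsinit() reports a non-initial state after a trailing partial sequence
depends on the C library, so the incomplete_tail flag should be verified on
the platform in question.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

typedef std::codecvt<wchar_t, char, std::mbstate_t> Converter;

// Hypothetical result type for this sketch.
struct Widened {
    std::wstring text;       // everything converted so far
    bool incomplete_tail;    // input appears to end inside a sequence
};

// Hypothetical helper: feed `src` to in() in pieces of at most `chunk`
// bytes, reusing one shift state for every call.
Widened widen_in_chunks(const std::locale& loc, const std::string& src,
                        std::size_t chunk)
{
    assert(chunk > 0);
    const Converter& cvt = std::use_facet<Converter>(loc);
    std::mbstate_t state = std::mbstate_t();    // initial shift state
    Widened result;
    result.incomplete_tail = false;
    std::string pending;                        // bytes in() left unconsumed

    for (std::size_t pos = 0; pos < src.size(); pos += chunk) {
        pending.append(src, pos, chunk);        // count is clamped at the end
        std::vector<wchar_t> wide(pending.size());  // one wchar_t per byte is enough
        const char* from = pending.data();
        const char* from_next = from;
        wchar_t* to_next = &wide[0];

        const std::codecvt_base::result r =
            cvt.in(state, from, from + pending.size(), from_next,
                   &wide[0], &wide[0] + wide.size(), to_next);
        if (r == std::codecvt_base::error)
            throw std::runtime_error("invalid multibyte sequence");
        // noconv is not expected from a wchar_t/char facet and is not
        // handled specially here.

        result.text.append(&wide[0],
                           static_cast<std::size_t>(to_next - &wide[0]));
        pending.erase(0, static_cast<std::size_t>(from_next - from));
    }

    // Either the facet handed the partial bytes back (pending is non-empty)
    // or it absorbed them into the shift state (state is not initial).
    result.incomplete_tail = !pending.empty() || !std::mbsinit(&state);
    return result;
}
```

The loop works whether the facet returns partial and leaves the trailing
bytes unconsumed, or returns ok and keeps them in the shift state; only the
final check differs between the two cases.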



* [Bug libstdc++/31643] Codecvt facets with UTF-8 encoding fail to recognize partial encoding sequences
From: jcavalla at postini dot com @ 2007-04-20 21:59 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #6 from jcavalla at postini dot com  2007-04-20 22:59 -------
I ran additional tests just to make sure that the shift state was valid across
calls, even though partial is not returned when a chunk ends in a partial
encoding sequence.  I split several 2-, 3-, and 4-byte UTF-8 character
sequences across two calls to the codecvt in() method.  Each time, the
sequence was correctly widened into one UTF-32 character code.  Thus, the
shift state appears to be OK.  Only the return value of 'ok' is incorrect.
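
To illustrate what a shift state that stays valid across calls has to carry,
here is a hand-written, simplified UTF-8 decoder that keeps its own state
between calls.  It is not the libstdc++ or glibc implementation: Utf8State
and decode_chunk are hypothetical names, and overlong forms and surrogates
are deliberately not rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the shift state: a partly assembled code point
// plus the number of continuation bytes still owed.
struct Utf8State {
    unsigned long code_point;
    int remaining;
    Utf8State() : code_point(0), remaining(0) {}
};

// Decode [begin, end) into `out`, resuming from and updating `state`.
// Returns false on a malformed byte.  Simplification: overlong encodings
// and surrogate code points are accepted.
bool decode_chunk(Utf8State& state, const char* begin, const char* end,
                  std::vector<unsigned long>& out)
{
    assert(begin <= end);                       // precondition: a valid range

    for (const char* p = begin; p != end; ++p) {
        const unsigned char b = static_cast<unsigned char>(*p);
        if (state.remaining > 0) {
            if ((b & 0xC0) != 0x80)
                return false;                   // expected 10xxxxxx
            state.code_point = (state.code_point << 6) | (b & 0x3F);
            if (--state.remaining == 0)
                out.push_back(state.code_point);
        } else if (b < 0x80) {
            out.push_back(b);                   // ASCII
        } else if ((b & 0xE0) == 0xC0) {
            state.code_point = b & 0x1F;        // 2-byte lead
            state.remaining = 1;
        } else if ((b & 0xF0) == 0xE0) {
            state.code_point = b & 0x0F;        // 3-byte lead
            state.remaining = 2;
        } else if ((b & 0xF8) == 0xF0) {
            state.code_point = b & 0x07;        // 4-byte lead
            state.remaining = 3;
        } else {
            return false;                       // stray continuation or bad lead
        }
    }
    return true;                                // state.remaining > 0: need more input
}
```

Feeding \xC3 in one call and \xB1 in the next yields the single code point
0xF1, the same value that appears as f1 in the Complete test output.  After
the first call state.remaining is 1, which is the information a caller would
need in order to know that more input is required.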



* [Bug libstdc++/31643] [DR 382] Codecvt facets with UTF-8 encoding fail to recognize partial encoding sequences
From: pcarlini at suse dot de @ 2007-04-22  0:06 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #7 from pcarlini at suse dot de  2007-04-22 01:06 -------
(In reply to comment #6)
> I ran additional tests just to make sure that the shift state was valid across
> calls, even though partial is not returned when a chunk ends in a partial
> encoding sequence.  I split several 2-, 3-, and 4-byte UTF-8 character
> sequences across two calls to the codecvt in() method.  Each time, the
> sequence was correctly widened into one UTF-32 character code.  Thus, the
> shift state appears to be OK.  Only the return value of 'ok' is incorrect.

Indeed, that's the point. And this is DR 382, mentioned by Petur in various
places in the implementation.

  http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-active.html#382

We will revisit the issue when 382 gets a complete resolution.


-- 

pcarlini at suse dot de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|major                       |normal
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|0                           |1
   Last reconfirmed|0000-00-00 00:00:00         |2007-04-22 01:06:27
               date|                            |
            Summary|Codecvt facets with UTF-8   |[DR 382] Codecvt facets with
                   |encoding fail to recognize  |UTF-8 encoding fail to
                   |partial encoding sequences  |recognize partial encoding
                   |                            |sequences



* [Bug libstdc++/31643] [DR 382] Codecvt facets with UTF-8 encoding fail to recognize partial encoding sequences
From: pcarlini at suse dot de @ 2007-04-22  0:06 UTC (permalink / raw)
  To: gcc-bugs



-- 

pcarlini at suse dot de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |SUSPENDED



* [Bug libstdc++/31643] [DR 382] Codecvt facets with UTF-8 encoding fail to recognize partial encoding sequences
From: paolo dot carlini at oracle dot com @ 2009-12-22 10:23 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #8 from paolo dot carlini at oracle dot com  2009-12-22 10:23 -------
Recently, the DR has been closed as NAD:

  http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-closed.html#382

Given that, I see no reason to keep this PR open: absent a resolution that
mandates a different specific behavior, we are not going to change a very old
behavior of our implementation within the current ABI.


-- 

paolo dot carlini at oracle dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|SUSPENDED                   |RESOLVED
         Resolution|                            |INVALID


