From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 26 Aug 2008 18:37:00 -0000
From: me22
To: "Jim Cobban"
Subject: Re: UTF-8, UTF-16 and UTF-32
Cc: GCC-help
In-Reply-To: <48B44687.2040106@magma.ca>
References: <000c01c90568$d436d230$3b9c65dc@testserver> <48B30C97.1080509@users.sourceforge.net> <002101c90722$8ca0ca00$3b9c65dc@testserver> <48B44687.2040106@magma.ca>
Mailing-List: contact gcc-help-help@gcc.gnu.org; run by ezmlm
Sender: gcc-help-owner@gcc.gnu.org
X-SW-Source: 2008-08/txt/msg00294.txt.bz2

On Tue, Aug 26, 2008 at 14:08, Jim Cobban wrote:
> The definition of a wchar_t string or std::wstring, even if a wchar_t is
> 16 bits in size, is not the same thing as UTF-16. A wchar_t string or
> std::wstring, as defined by the C, C++, and POSIX standards, contains ONE
> wchar_t value for each displayed glyph. Alternatively, the value of
> wcslen() for a wchar_t string is the same as the number of glyphs in the
> displayed representation of the string.

One wchar_t value for each codepoint -- glyphs can be formed from multiple
codepoints. (Combining characters and ligatures, for example.)
> In these standards the size of a wchar_t is not explicitly defined, except
> that it must be large enough to represent every text "character". It is
> critical to understand that a wchar_t string, as defined by these
> standards, is not the same thing as a UTF-16 string, even if a wchar_t is
> 16 bits in size. UTF-16 may use up to THREE 16-bit words to represent a
> single glyph, although I believe that almost all symbols actually used by
> living languages can be represented in a single word in UTF-16. I have
> not worked with Visual C++ recently precisely because it accepts a
> non-portable language. The last time I used it the M$ library was
> standards compliant, with the understanding that its definition of
> wchar_t as a 16-bit word meant the library could not support some
> languages. If the implementation of the wchar_t strings in the Visual
> C++ library has been changed to implement UTF-16 internally, then in my
> opinion it is not compliant with the POSIX, C, and C++ standards.

The outdated encoding that only supports codepoints 0x0000 through 0xFFFF is
called UCS-2. (And UTF-16 never needs three 16-bit units for a codepoint:
one unit covers the Basic Multilingual Plane, and a surrogate pair of two
units covers everything above it.)
( See http://en.wikipedia.org/wiki/UTF-16 )

~ Scott