From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 26 Aug 2008 18:37:00 -0000
From: me22
To: "Jim Cobban"
Subject: Re: UTF-8, UTF-16 and UTF-32
Cc: GCC-help
In-Reply-To: <48B44687.2040106@magma.ca>
References: <000c01c90568$d436d230$3b9c65dc@testserver> <48B30C97.1080509@users.sourceforge.net> <002101c90722$8ca0ca00$3b9c65dc@testserver> <48B44687.2040106@magma.ca>
Mailing-List: contact gcc-help-help@gcc.gnu.org; run by ezmlm
Sender: gcc-help-owner@gcc.gnu.org
X-SW-Source: 2008-08/txt/msg00294.txt.bz2

On Tue, Aug 26, 2008 at 14:08, Jim Cobban wrote:
> The definition of a wchar_t string or std::wstring, even if a wchar_t is
> 16 bits in size, is not the same thing as UTF-16. A wchar_t string or
> std::wstring, as defined by the C, C++, and POSIX standards, contains ONE
> wchar_t value for each displayed glyph. Alternatively, the value of
> wcslen() for a wchar_t string is the same as the number of glyphs in the
> displayed representation of the string.

One wchar_t value for each codepoint -- glyphs can be formed from multiple
codepoints. (Combining characters and ligatures, for example.)
> In these standards the size of a wchar_t is not explicitly defined, except
> that it must be large enough to represent every text "character". It is
> critical to understand that a wchar_t string, as defined by these
> standards, is not the same thing as a UTF-16 string, even if a wchar_t is
> 16 bits in size. UTF-16 may use up to THREE 16-bit words to represent a
> single glyph, although I believe that almost all symbols actually used by
> living languages can be represented in a single word in UTF-16. I have
> not worked with Visual C++ recently precisely because it accepts a
> non-portable language. The last time I used it the M$ library was
> standards compliant, with the understanding that its definition of
> wchar_t as a 16-bit word meant the library could not support some
> languages. If the implementation of the wchar_t strings in the Visual
> C++ library has been changed to implement UTF-16 internally, then in my
> opinion it is not compliant with the POSIX, C, and C++ standards.

The outdated encoding that only supports codepoints 0x0000 through 0xFFFF is
called UCS-2. (And UTF-16 never needs three 16-bit units for a codepoint:
one unit covers the Basic Multilingual Plane, and a surrogate pair of two
units covers everything above it.)
( See http://en.wikipedia.org/wiki/UTF-16 )

~ Scott