From: me22 <me22.ca@gmail.com>
To: "Jim Cobban" <jcobban@magma.ca>
Cc: GCC-help <gcc-help@gcc.gnu.org>
Subject: Re: UTF-8, UTF-16 and UTF-32
Date: Wed, 27 Aug 2008 13:29:00 -0000
Message-ID: <fa28b9250808261124s53988606l83dc8bba04b8dec6@mail.gmail.com>
Message-ID: <20080827132900.i4ENsQaqFeCgD56QAZytas8RflQ_8Vt-ycTc7gV1ugQ@z>
In-Reply-To: <48B44687.2040106@magma.ca>

On Tue, Aug 26, 2008 at 14:08, Jim Cobban <jcobban@magma.ca> wrote:
> The definition of a wchar_t string or std::wstring, even if a wchar_t is 16
> bits in size, is not the same thing as UTF-16.  A wchar_t string or
> std::wstring, as defined by the C, C++, and POSIX standards, contains ONE
> wchar_t value for each displayed glyph.  Alternatively the value of strlen()
> for a wchar_t string is the same as the number of glyphs in the displayed
> representation of the string.
>

One wchar_t value for each code point, not for each glyph -- a glyph
can be formed from multiple code points (combining characters and
ligatures, for example).  Also, the length function for wchar_t
strings is wcslen(), not strlen().
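
To make the distinction concrete, here is a minimal sketch, assuming a
C++11 compiler and a wide execution character set of UTF-16 or UTF-32
(as with GCC's defaults).  The helper count_wchars and the two variable
names are hypothetical, made up for this illustration.  Both strings
normally display as the single glyph "é", but they hold different
numbers of wchar_t values:

```cpp
#include <cstddef>

// Counts wchar_t values up to the terminating null, like wcslen().
// count_wchars is a hypothetical helper, written recursively so that it
// can be used in constant expressions from C++11 on.
constexpr std::size_t count_wchars(const wchar_t* s) {
    return *s ? 1 + count_wchars(s + 1) : 0;
}

// Precomposed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
constexpr const wchar_t* precomposed = L"\u00E9";

// Decomposed form: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
constexpr const wchar_t* decomposed = L"e\u0301";
```

Here count_wchars(precomposed) is 1 and count_wchars(decomposed) is 2,
so the element count follows code points, not displayed glyphs.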

> In these standards the size of a wchar_t is not explicitly defined except
> that it must be large enough to represent every text "character".  It is
> critical to understand that a wchar_t string, as defined by these standards,
> is not the same thing as a UTF-16 string, even if a wchar_t is 16 bits in
> size.  UTF-16 may use up to THREE 16-bit words to represent a single glyph,
> although I believe that almost all symbols actually used by living languages
> can be represented in a single word in UTF-16.  I have not worked with
> Visual C++ recently precisely because it accepts a non-portable language.
>  The last time I used it the M$ library was standards compliant, with the
> understanding that its definition of wchar_t as a 16-bit word meant the
> library could not support some languages.  If the implementation of the
> wchar_t strings in the Visual C++ library has been changed to implement
> UTF-16 internally, then in my opinion it is not compliant with the POSIX, C,
> and C++ standards.
>

The outdated encoding that only supports codepoints 0x0000 through
0xFFFF is called UCS-2.  (See http://en.wikipedia.org/wiki/UTF-16)
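
As a rough sketch of the difference, assuming C++11's char16_t and
char32_t types: UTF-16 encodes a code point above 0xFFFF as a pair of
16-bit surrogates, which UCS-2 has no way to express.  The function
names below are hypothetical, and the surrogate helpers assume a valid
code point in the range 0x10000 through 0x10FFFF:

```cpp
// Number of 16-bit code units UTF-16 needs for one code point.
// UCS-2 can only represent the one-unit case.
constexpr int utf16_units(char32_t cp) {
    return cp <= 0xFFFF ? 1 : 2;
}

// High (leading) surrogate: 0xD800 plus the top 10 bits of cp - 0x10000.
constexpr char16_t high_surrogate(char32_t cp) {
    return static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10));
}

// Low (trailing) surrogate: 0xDC00 plus the low 10 bits of cp - 0x10000.
constexpr char16_t low_surrogate(char32_t cp) {
    return static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF));
}
```

For example, U+1D11E (MUSICAL SYMBOL G CLEF) needs two units:
high_surrogate(0x1D11E) is 0xD834 and low_surrogate(0x1D11E) is 0xDD1E.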

~ Scott
