public inbox for gcc-help@gcc.gnu.org
From: me22 <me22.ca@gmail.com>
To: "Andrew Haley" <aph@redhat.com>
Cc: GCC-help <gcc-help@gcc.gnu.org>
Subject: Re: UTF-8, UTF-16 and UTF-32
Date: Wed, 27 Aug 2008 08:18:00 -0000
Message-ID: <fa28b9250808261150v32bdd89bod1154c659f3ac9a5@mail.gmail.com>
Message-ID: <20080827081800.5_YPck6Gjs_qniVNNmtvhT6FdbbxXvPGmrBcsi27wlk@z>
In-Reply-To: <48B44C2C.5040301@redhat.com>

On Tue, Aug 26, 2008 at 14:32, Andrew Haley <aph@redhat.com> wrote:
>
> Just in case anyone thinks that UTF-16 might be a good format for saving
> data in files or for data to be sent over a network, here's a gem from
> Microsoft:
>
> 'The example in the documentation didn't specify Little Endian, so
> the Unicode string that the code generates is Big Endian.  The
> SQL Server Driver for PHP expected Little Endian, so the data
> written to SQL Server is not what was expected.  However, because
> the code to retrieve the data converts the string from Big Endian
> back to UTF-8, the resulting string in the example matches the
> original string.
>
> 'If you change the Unicode charset in the example from "UTF-16"
> to "UCS-2LE" or "UTF-16LE" in both calls to iconv, you'll still
> see the original and resulting strings match but now you'll also
> see that the code sends the expected data to the database.'
>
> http://forums.microsoft.com/msdn/ShowPost.aspx?PostID=3644735&SiteID=1
>

Absolutely.  UTF-8 is the only one of the three without possible
byte-ordering issues, so it (or UTF-7, if a 7-bit channel requires it)
is the only reasonable option for interchange.  For text, the size
overhead isn't that high anyway, and with compression it's not bad at
all.  (Within a given language, the bytes of a codepoint's UTF-8
representation are all the same except the last and maybe the second
last, so even a naive Huffman coder can pretty much eliminate the size
cost over UTF-16.  UTF-16 has the same kind of repeated prelude bytes
for specific languages anyway.)
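
A minimal Python sketch of both points, for illustration only.  The
sample strings and the helper names utf8_lead_bytes and
utf16_high_bytes are made up here; nothing in it comes from the
Microsoft example.

```python
def utf8_lead_bytes(text):
    """Return the set of first bytes of each codepoint's UTF-8 encoding."""
    return {ch.encode("utf-8")[0] for ch in text}


def utf16_high_bytes(text):
    """Return the set of high-order bytes of the UTF-16 code units."""
    data = text.encode("utf-16-be")
    return set(data[0::2])  # in big-endian order the high byte comes first


# Assumed sample text: "hello" with an e-acute.
sample = "h\u00e9llo"

# UTF-16 has two byte orders, so the same text has two serializations.
utf16_be = sample.encode("utf-16-be")
utf16_le = sample.encode("utf-16-le")

# Reading big-endian data as little-endian silently yields different text.
misread = utf16_be.decode("utf-16-le")

# UTF-8 is a byte sequence with a single serialization.
utf8 = sample.encode("utf-8")

# Assumed sample word in Cyrillic: its codepoints share a couple of UTF-8
# lead bytes, much as its UTF-16 code units share one high byte.
cyrillic = "\u043f\u0440\u0438\u0432\u0435\u0442"
print(sorted(hex(b) for b in utf8_lead_bytes(cyrillic)))
print(sorted(hex(b) for b in utf16_high_bytes(cyrillic)))
```

The repeated lead bytes are what a simple entropy coder can squeeze
out, which is the basis for the size argument above.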

And really, since at the glyph level even UTF-32 is a variable-width
encoding, you have to think about variable width anyway, so I don't see
why it's worth not just using UTF-8 everywhere.  (For example, suppose
you have an s codepoint followed by a combining accent codepoint.
Pressing "backspace" with the cursor after it should probably erase
both codepoints.  At the same time, if it's an ffi ligature, then
backspace should probably replace it with an ff ligature.  So you can't
just do --size on your string in any encoding...)
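
A rough Python sketch of that backspace behaviour, under stated
assumptions: the function name backspace and the LIGATURE_BACKSPACE
table are hypothetical, and the logic only handles trailing combining
marks and one ligature.  It is not a full grapheme-cluster
implementation.

```python
import unicodedata

# Hypothetical table for illustration: what a ligature turns into after
# one backspace.  U+FB03 is the ffi ligature, U+FB00 the ff ligature.
LIGATURE_BACKSPACE = {"\uFB03": "\uFB00"}


def backspace(s):
    """Delete one user-visible unit from the end of s (simplified)."""
    if not s:
        return s
    last = s[-1]
    if last in LIGATURE_BACKSPACE:
        # Shorten the ligature instead of deleting it outright.
        return s[:-1] + LIGATURE_BACKSPACE[last]
    i = len(s)
    # Skip back over any trailing combining marks.
    while i > 0 and unicodedata.combining(s[i - 1]) != 0:
        i -= 1
    # Then drop the base character they were attached to.
    return s[:max(i - 1, 0)]


accented = "as\u0301"  # 'a', 's', COMBINING ACUTE ACCENT
print(len(accented))                      # codepoints, not glyphs
print(len(accented.encode("utf-32-be")))  # fixed 4 bytes per codepoint
print(backspace(accented))
```

Even with 4 bytes per codepoint, the accented s is two codepoints, so
a fixed-width encoding does not make "delete the last character" a
fixed-size operation.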

~ Scott

P.S.  Are there any architectures around using middle-endian UTF-32? ;)

