From: Michael Enright <mike@kmcardiff.com>
To: cygwin@cygwin.com
Subject: Re: UTF-8 character encoding
Date: Wed, 27 Jun 2018 07:50:00 -0000
Message-ID: <CAOC2fq-OZE1DvG_p1tcVpfJHD3x2rg2nmiY1ox8mAoVSEM3S7w@mail.gmail.com>
In-Reply-To: <CAD8GWsuevQX6fBUzkEvUs5rBPehhG7-ht+FPZU=eOaACF5uCPg@mail.gmail.com>

On Mon, Jun 25, 2018 at 11:33 AM, Lee <ler762@gmail.com> wrote:
> I'm still trying to figure utf-8 out, but it seems to me that 0x0 -
> 0xff is part of the utf-8 encoding.

I don't see how you arrived at this. 0xFF is never the first byte of
a valid UTF-8 byte sequence. It is also inconsistent with the
statement you make later:

>  An easy way to remember this transformation format is to note that the
>  number of high-order 1's in the first byte is the same as the number of
>  subsequent bytes in the multibyte character:

This is true, but a zero bit must also terminate the run of
high-order 1's, which means that 0xFF (all ones, with no terminating
zero) is not a valid lead byte. 0x7F is the highest byte value that
can stand alone as a single-byte UTF-8 sequence.
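
To make the lead-byte rule concrete, here is a minimal sketch in
Python. The function name utf8_sequence_length is my own, not from any
library, and the ranges are the lead-byte ranges permitted by RFC 3629.
It returns the total length of the sequence a byte would start, or
None if the byte cannot start one:

```python
from typing import Optional


def utf8_sequence_length(lead: int) -> Optional[int]:
    """Return the total byte length of the UTF-8 sequence that `lead`
    starts, or None if `lead` is not a valid lead byte (RFC 3629).

    Hypothetical helper for illustration only.
    """
    if not 0 <= lead <= 0xFF:
        raise ValueError("lead must be a single byte value (0..255)")
    if lead <= 0x7F:            # 0xxxxxxx: single-byte (ASCII) sequence
        return 1
    if 0xC2 <= lead <= 0xDF:    # 110xxxxx: two bytes (0xC0, 0xC1 excluded)
        return 2
    if 0xE0 <= lead <= 0xEF:    # 1110xxxx: three bytes
        return 3
    if 0xF0 <= lead <= 0xF4:    # 11110xxx: four bytes, capped at U+10FFFF
        return 4
    # 0x80..0xBF are continuation bytes; 0xC0, 0xC1 and 0xF5..0xFF
    # never begin a valid sequence.
    return None


print(utf8_sequence_length(0x7F))  # highest single-byte value
print(utf8_sequence_length(0xFF))  # not a valid lead byte
```

With this, 0x7F yields 1 and 0xFF yields None.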

Perhaps your statement about 0-0xFF was meant to be read differently.

Thomas Wolff's note seems to be objecting to the inclusion of
characters above U+10FFFF, which are not legal UTF-8 but were in the
original proposal. Otherwise, rows 1-4 of your table are correct.
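
The U+10FFFF limit can be checked against Python's built-in strict
UTF-8 decoder. This is only a sketch: is_valid_utf8 is a hypothetical
helper name, and the byte strings are hand-built examples, not data
from this thread.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes under Python's strict UTF-8 codec.

    Hypothetical helper for illustration only.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# F4 8F BF BF encodes U+10FFFF, the highest legal code point.
print(is_valid_utf8(b"\xf4\x8f\xbf\xbf"))
# F4 90 80 80 would encode U+110000, which is out of range.
print(is_valid_utf8(b"\xf4\x90\x80\x80"))
# A five-byte form from the original proposal is rejected as well.
print(is_valid_utf8(b"\xf8\x88\x80\x80\x80"))
```

Only the first of the three sequences is accepted.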

The standards, such as IETF RFC 3629, are easy enough to read, so I
recommend reading them and citing them to others instead of trying to
summarize them.


