From: Michael Enright
Date: Wed, 27 Jun 2018 07:50:00 -0000
Subject: Re: UTF-8 character encoding
To: cygwin@cygwin.com

On Mon, Jun 25, 2018 at 11:33 AM, Lee
wrote:
> I'm still trying to figure utf-8 out, but it seems to me that 0x0 -
> 0xff is part of the utf-8 encoding.

I don't see how you arrived at this. 0xFF is not the initial byte of any
valid UTF-8 byte sequence. It also doesn't agree with the statement you
make later:

> An easy way to remember this transformation format is to note that the
> number of high-order 1's in the first byte is the same as the number of
> subsequent bytes in the multibyte character:

This is true, but there is also a zero bit that ends the run of
high-order 1's, which means that 0xFF (all ones, with no terminating
zero) is not a valid lead byte. 0x7F is the highest byte value that you
can have as a single-byte UTF-8 string. Perhaps your statement about
0-0xFF was meant to be read differently.

Thomas Wolff's note seems to be objecting to the inclusion of characters
above U+10FFFF, which are not legal in UTF-8 but were in the original
proposal. Otherwise rows 1-4 of your table are correct.

The standards, such as IETF RFC 3629, are easy enough to read, so I
recommend using them and citing them to others instead of trying to
summarize.
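The lead-byte rule above can be checked mechanically. Below is a minimal Python sketch, based on my reading of the byte ranges in RFC 3629; the names utf8_sequence_length and decodes_as_utf8 are made up for illustration and are not part of any library. For multibyte sequences the lead byte's run of high-order 1's, ended by a zero bit, fixes the total length of the sequence. The second helper simply asks Python's own strict UTF-8 decoder as a cross-check.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the total sequence length implied by a lead byte, or 0 if
    the byte cannot start a valid UTF-8 sequence.

    Hypothetical helper for illustration; ranges follow RFC 3629.
    """
    if not 0 <= lead <= 0xFF:
        raise ValueError("not a byte value")
    if lead <= 0x7F:
        return 1  # 0xxxxxxx: single byte, 0x7F is the highest such value
    if 0xC2 <= lead <= 0xDF:
        return 2  # 110xxxxx (0xC0 and 0xC1 would only give overlong forms)
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4  # 11110xxx (0xF5 and up would encode beyond U+10FFFF)
    # Everything else: continuation bytes 0x80-0xBF, 0xC0, 0xC1, 0xF5-0xFF.
    return 0


def decodes_as_utf8(data: bytes) -> bool:
    """Cross-check using Python's built-in strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    for b in (0x7F, 0xC3, 0xE2, 0xF0, 0xFF):
        print(hex(b), utf8_sequence_length(b))
    print(decodes_as_utf8(b"\xff"))
```

With this sketch, 0xFF is rejected both by the table lookup and by the decoder, while 0x7F is accepted as a one-byte sequence. A four-byte sequence for U+110000 (b"\xf4\x90\x80\x80") is also rejected by the decoder, which matches the U+10FFFF limit mentioned above.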