From: Thomas Taylor
To: cygwin@cygwin.com
Subject: Re: Need help with multibyte UTF-8 characters
Date: Tue, 12 Dec 2017 03:43:00 -0000
Message-ID: <9d3b73ff-f596-51a2-909a-30a767e3e9b3@gmail.com>
In-Reply-To: <626a3c06-e9f2-1932-f1f3-47ddb2051215@gmail.com>
References: <626a3c06-e9f2-1932-f1f3-47ddb2051215@gmail.com>

Thank you for your advice on setting my locale to en_US.UTF-8.
Unfortunately, Cygwin still seems to have trouble displaying some
three-byte UTF-8 encoded characters correctly.

For example, see the following snippet from a "sed" file.  This file
attempts to convert XML-encoded filenames to UTF-8.  As you can see, it
converts one- and two-byte encodings correctly, but fails on some
three-byte encodings (the en dash, the em dash, and the ellipsis, all of
which are displayed as a filled-in rectangle):

# Match longest strings first
# Three-byte encodings:
# En dash
s/%[Ee]2%80%93/–/g
# Em dash
s/%[Ee]2%80%94/—/g
# Horizontal ellipsis
s/%[Ee]2%80%[Aa]6/…/g
# Less-than-or-equal sign
s/%[Ee]2%89%[Aa]4/≤/g
# Euro symbol
s/%[Ee]2%82%[Aa][Cc]/€/g
# Two-byte encodings:
# Non-break space
#s/%[Cc]2%[Aa]0/⎵/g
# Lowercase a with acute accent
s/%[Cc]3%[Aa]1/á/g
# Lowercase a with umlaut (a.k.a. diaeresis)
s/%[Cc]3%[Aa]4/ä/g
# Lowercase e with acute accent
s/%[Cc]3%[Aa]9/é/g
# Lowercase i with acute accent
s/%[Cc]3%[Aa][Dd]/í/g
# Lowercase o with acute accent
s/%[Cc]3%[Bb]3/ó/g
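
One way to narrow this down is to look at the bytes sed emits rather than at what the terminal draws. The sketch below is only an illustration: decode_hex is a hypothetical helper name, it covers just the en dash rule from the snippet above, and it assumes the script is saved as UTF-8 and that od and tr are available. If the dump contains e2 80 93, the substitution itself is producing valid UTF-8, which would point toward the terminal or its font rather than sed (a guess, not something established in this thread).

```shell
# Hypothetical helper: apply the en dash rule, then dump the result as hex.
# Assumes this file is saved as UTF-8, so the literal dash is bytes e2 80 93.
decode_hex() {
    printf '%s' "$1" \
        | sed 's/%[Ee]2%80%93/–/g' \
        | od -An -tx1 \
        | tr -d ' \n'
}

# Example: 'a' and 'b' surround the escape so the byte boundaries are visible.
decode_hex 'a%E2%80%93b'
echo
```

The same check can be repeated for the em dash and ellipsis rules by swapping in their patterns.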