From mboxrd@z Thu Jan 1 00:00:00 1970
From: Doug Henderson
Date: Tue, 04 Sep 2018 19:59:00 -0000
Subject: Re: Cygwin fails to utilize Unicode replacement character
To: cygwin
References: <5b8aba97.1c69fb81.96f14.1b37@mx.google.com>
In-Reply-To: <5b8aba97.1c69fb81.96f14.1b37@mx.google.com>
Mailing-List: contact cygwin-help@cygwin.com; run by ezmlm
Content-Type: text/plain; charset="UTF-8"
X-SW-Source:
2018-09/txt/msg00076.txt.bz2 On Sat, 1 Sep 2018 at 10:13, Steven Penny wrote: > You get this result with Linux: > > $ cat alfa.txt > =EF=BF=BD > > Where "cat" properly outputs Unicode 'REPLACEMENT CHARACTER' (U+FFFD). Ho= wever > with Cygwin you get this: > > $ cat alfa.txt > =E2=96=92 > > Where "cat" outputs Unicode Character 'MEDIUM SHADE' (U+2592). My preference is to remove the output fiddling code that Corrina has been working on. It is trying to solve the wrong problem. I think we have gone down a rabbit hole at the wrong end of cat's data flow. Should any changes to the way a character is displayed be required, it needs to be in the terminal program that display the character, not in cygwin which should pass the character along unmodified. Both cygwin and Debian 9.5 show: $ file alfa.txt alfa.txt: ISO-8859 text When Linux reads the file, it assumes the encoding is UTF-8. When cygwin reads the file, it assume the encoding is CP1252 This command shows the problem $ iconv -f utf8 alfa.txt iconv: alfa.txt:1:0: incomplete character or shift sequence On Linux, this shows a slightly different message, with the same intent. Try using this string: $ printf "\xC3\xAB\353\n" =C3=AB=E2=96=92 to get a better understanding of the problem. It contains two representation of LATIN SMALL LETTER E WITH DIAERESIS, first encoded in UTF-8, then using ISO-8859-1. There are two different reasons for the MEDIUM SHADE. Here it indicates an invalid UTF-8 character, and the font does not have a glyph for REPLACEMENT CHARACTER. The MEDIUM SHADE is also used in place of an ordinary character without a glyph in the font. HTH Doug --=20 Doug Henderson, Calgary, Alberta, Canada - from gmail.com -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple
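
P.S. To look at the printf bytes without a terminal or font in the way, here is a minimal Python sketch. It is only an illustration: the helper names (is_valid_utf8, decode_lenient, describe) are made up for this example, and it models how a generic decoder treats the bytes, not what Cygwin or any particular terminal does internally.

```python
import unicodedata

# Bytes produced by: printf "\xC3\xAB\353\n"
# C3 AB is e-diaeresis in UTF-8; EB is the same letter in ISO-8859-1.
SAMPLE = b"\xC3\xAB\xEB\n"


def is_valid_utf8(raw: bytes) -> bool:
    """Return True only if every byte sequence in raw is well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def decode_lenient(raw: bytes, encoding: str) -> str:
    """Decode raw, substituting U+FFFD for anything the codec rejects."""
    return raw.decode(encoding, errors="replace")


def describe(text: str) -> list:
    """List each code point with its Unicode name (controls have none)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<control>')}" for ch in text]


if __name__ == "__main__":
    print("valid UTF-8:", is_valid_utf8(SAMPLE))
    for enc in ("utf-8", "cp1252"):
        print(enc)
        for line in describe(decode_lenient(SAMPLE, enc)):
            print("   ", line)
```

Read as UTF-8, the lone EB byte is the one that fails and becomes U+FFFD. Read as CP1252, every byte decodes, but the first two turn into separate characters. Whether U+FFFD then shows up as its own glyph or as a MEDIUM SHADE is up to the terminal and font, which this sketch does not touch.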