From: Eric Blake
To: cygwin@cygwin.com, ynnor@mm.st
Subject: Re: Bug: grep behaves incorrectly under the locale C.UTF-8, if a file contains Umlaut characters
Message-ID: <3c344ecb-6ef3-9d54-a627-4714382d4d84@redhat.com>
Date: Wed, 24 May 2017 12:33:00 -0000

On 05/24/2017 02:52 AM, Ronald Fischer wrote:
> I have a file X which contains ASCII text, but also in some lines German
> umlaut characters. The file is classified as:
>
> $ file X
> X: ISO-8859 text, with CRLF line terminators

In ISO-8859, a German umlaut occupies one byte with the high-bit set.

>
> However, if LANG is set to C.UTF-8, two things happen:
>
> - grep classifies the file as binary file and produces the error message
> "Binary file X matches"

In UTF-8, any one-byte sequence with the high bit set in isolation is an
encoding error (all high-bit bytes in UTF-8 occur in 2-or-more byte
sequences). According to POSIX, grep is only required to operate on text
files, and the definition of a text file includes a requirement that ALL
bytes in the file form valid encodings of characters in the current locale.
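
To make the encoding point concrete, here is a minimal Python sketch. It assumes the file is ISO-8859-1 (`file` only reports "ISO-8859", so the exact part is a guess), and the sample line and the `is_valid` helper are made up for illustration. It only checks whether a byte string decodes cleanly under a given encoding, which is roughly the property the text-file definition asks for; it is not how grep itself decides.

```python
# Hypothetical sample line: 0xE4 is "ä" in ISO-8859-1 (assumed encoding),
# followed by the CRLF terminator that `file` reported.
latin1_line = b"M\xe4dchen\r\n"


def is_valid(data: bytes, encoding: str) -> bool:
    """Return True if every byte of data decodes under the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# The lone high-bit byte is a complete character in ISO-8859-1 ...
valid_as_latin1 = is_valid(latin1_line, "iso-8859-1")
# ... but in UTF-8 it starts a multi-byte sequence that never completes.
valid_as_utf8 = is_valid(latin1_line, "utf-8")

print("valid as ISO-8859-1:", valid_as_latin1)
print("valid as UTF-8:", valid_as_utf8)
```

The same bytes pass one check and fail the other, so whether they count as text depends on the encoding the locale expects.
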
Yes, this means that there are files that are valid text files in some
locales and invalid in others (such as your file here). Once you violate
the POSIX constraint of passing a non-text file to grep, all bets are off,
and grep can do whatever it wants, including telling you that a binary file
matches.

>
> - Both the grepped lines (i.e. in our example the non-empty lines) AND
> the error message end up in the standard output (i.e. in file Y).

Yes, that's the current intended behavior in upstream grep. It's not
unique to Cygwin, so complaining here won't change it.

>
> IMO, there are several problems with this:
>
> 1. It's hard to see, why an umlaut character makes the file X binary
> under encoding C.UTF-8,

Because it's not a valid UTF-8 encoding. Use iconv to convert your file
from ISO-8859 to UTF-8 if you want to grep it under C.UTF-8.

> but not under encoding UTF-8 or C.en_EN

Those aren't valid locale names. But if you mean that it does what you
want under LC_ALL=C, that's because in the straight C locale, there are no
multi-byte characters, and therefore no encoding errors are possible, and
therefore you can't get a binary file in that locale due merely to an
encoding error.

>
> 2. If grep classifies a file as binary, I think the desired behaviour
> would be to NOT produce any output, unless the -a flag has been
> supplied.

Once behavior is in the realm of the undefined, it's hard to say what the
desired behavior should be. But again, if you want the current behavior
changed, it's an upstream issue to complain about on bug-grep, and not
something that I'm going to change for Cygwin in isolation.

>
> 3. If grep writes a message "Binary file ... matches", this message
> should go to stderr, not stdout. The stdout is supposed to contain only
> a subset of the input lines.

The message "Binary file ... matches" has always gone to stdout, even
before upstream was tightened to flag more encoding errors as binary files.
Whether the behavior of mixing it with actual output is desirable is a
question for upstream.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
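
For the conversion step suggested above, iconv is the tool named in the reply. The following is only a Python stand-in for that step, sketched under the assumption that the source file is ISO-8859-1 (the `file` output does not say which ISO-8859 part it is). The `convert_to_utf8` helper, the `X.utf8` output name, and the sample content are hypothetical.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding="iso-8859-1"):
    """Re-encode src_path into UTF-8 at dst_path.

    Files are handled in binary mode so CRLF line terminators pass through
    unchanged. Decoding can raise UnicodeDecodeError for source encodings
    that do not map every byte value.
    """
    with open(src_path, "rb") as src:
        raw = src.read()
    text = raw.decode(src_encoding)
    with open(dst_path, "wb") as dst:
        dst.write(text.encode("utf-8"))


# Demonstration on a throwaway file with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "X")
    dst_path = os.path.join(tmp, "X.utf8")
    with open(src_path, "wb") as f:
        f.write(b"M\xe4dchen\r\n")  # "ä" as the single byte 0xE4
    convert_to_utf8(src_path, dst_path)
    with open(dst_path, "rb") as f:
        converted = f.read()

# After conversion, "ä" is the two-byte UTF-8 sequence 0xC3 0xA4.
print(converted)
```

The converted copy contains only valid UTF-8 sequences, so it meets the text-file requirement in a UTF-8 locale; the original file is left untouched.
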