From: Florian Weimer
To: Carlos O'Donell
Cc: Mike FABIAN, Andreas Schwab, libc-alpha@sourceware.org
Subject: Re: Is it OK to write ASCII strings directly into locale source files?
References: <5f71f2f6-be0e-2b5d-91ce-03386eafa7f7@redhat.com> <87h8y13gvb.fsf@mid.deneb.enyo.de> <87379lczdi.fsf@mid.deneb.enyo.de> <7fa0552d-c24b-3c5c-cad3-1359eb4dd6bd@redhat.com>
Date: Tue, 25 Jul 2017 14:21:00 -0000
In-Reply-To: (Carlos O'Donell's message of "Tue, 25 Jul 2017 08:17:44 -0400")
Message-ID: <87mv7sbo75.fsf@mid.deneb.enyo.de>

* Carlos O'Donell:

> On 07/25/2017 02:20 AM, Mike FABIAN wrote:
>> Carlos O'Donell wrote:
>>
>>> My only argument is that when you are forced to use encoding it
>>> is empirically less likely you'll make a mistake. Like reading a sentence
>>> backwards to catch errors since it prevents your brain from filling in
>>> the missing information.
>>
>> But there are also many mistakes because somebody mistyped code points.
>> Several weird typos in things like month names look as if somebody
>> mistyped code points.
>
> Ultimately I defer to your judgement as localedata maintainer to create
> a workflow that is easy for you and benefits your work.
>
> However, I caution against throwing away the compatibility of our locales
> with POSIX, which doesn't seem to allow UTF-8 in the specification.

It does, to some extent:

| A character in the portable character set can be represented by the
| character itself, in which case the value of the character is
| implementation-defined. (Implementations may allow other characters
| to be represented as themselves, but such locale definitions are not
| portable.)

You'll need a very hostile interpretation to say that this doesn't
allow multi-byte character sequences in localedef input.

But I found this in the guts of localedef:

  /* The standards leave it up to the implementation to decide
     what to do with character which stand for themself.  We
     could jump through hoops to find out the value relative to
     the charmap and the repertoire map, but instead we leave it
     up to the locale definition author to write a better
     definition.  We assume here that every character which
     stands for itself is encoded using ISO 8859-1.  Using the
     escape character is allowed.  */

So we currently hard-code ISO 8859-1 (not UTF-8) to avoid the
bootstrapping problem.
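
To illustrate what that assumption would mean for a UTF-8 source file,
here is a rough Python sketch. It is not localedef's code; it only models
the rule stated in the comment, that each byte standing for itself is read
as one ISO 8859-1 character. The helper names and the sample string are
made up for the illustration, and the <Uxxxx> spelling is shown in its
four-digit form only.

```python
def read_as_latin1(text: str) -> str:
    """Model the stated assumption: take the UTF-8 bytes of `text` and
    read every byte as a single ISO 8859-1 character."""
    return text.encode("utf-8").decode("iso-8859-1")


def to_ucs_escapes(text: str) -> str:
    """Hypothetical helper: spell each character as a <Uxxxx> symbolic
    name (sketch only; assumes code points up to U+FFFF)."""
    return "".join(f"<U{ord(ch):04X}>" for ch in text)


# Sample month name containing one non-ASCII character (U+00E4).
intended = "März"

# What the author meant: four characters.
print(to_ucs_escapes(intended))

# What a byte-per-character ISO 8859-1 reading would yield: the two
# UTF-8 bytes of U+00E4 turn into two separate characters.
print(to_ucs_escapes(read_as_latin1(intended)))
```

Pure ASCII text comes out unchanged under both readings, since ASCII
bytes mean the same thing in UTF-8 and ISO 8859-1. Only the non-ASCII
characters are affected.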