From mboxrd@z Thu Jan 1 00:00:00 1970
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Sender: libc-alpha-owner@sourceware.org
From: Mike FABIAN
To: Carlos O'Donell
Cc: libc-alpha@sourceware.org
Subject: Re: Is it OK to write ASCII strings directly into locale source files?
References: <5f71f2f6-be0e-2b5d-91ce-03386eafa7f7@redhat.com>
Date: Mon, 24 Jul 2017 13:32:00 -0000
In-Reply-To: <5f71f2f6-be0e-2b5d-91ce-03386eafa7f7@redhat.com> (Carlos
	O'Donell's message of "Mon, 24 Jul 2017 09:22:48 -0400")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1.50 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-SW-Source: 2017-07/txt/msg00812.txt.bz2

Carlos O'Donell wrote:

> On 07/24/2017 09:09 AM, Mike FABIAN wrote:
>>
>> Currently the locale source files use a lot of code points even for
>> strings which are pure ASCII. For example localedata/locales/de_DE
>> contains:
>>
>>     % "%a %d %b %Y %T %Z"
>>     d_t_fmt
>>     "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020><U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>"
>>
>> Would it be OK to write this as
>>
>>     d_t_fmt "%a %d %b %Y %T %Z"
>>
>> ??
>>
>> This would make the files much more readable.
>>
>> Stuff that is mostly ASCII can probably be written like this:
>>
>>     % https://oc.wikipedia.org/wiki/Fran%C3%A7a França
>>     country_name "Fran<U00E7>a"
>>
>> which is already more readable than writing it all in code points.
>>
>> It would be even nicer to write it completely in UTF-8, i.e.:
>>
>>     country_name "França"
>>
>> but I am not sure whether this is allowed in the locale source files.
>>
>> But at least for everything which is ASCII, it might be OK already to
>> write the characters directly.
>>
>> Is writing ASCII there allowed or not??
>
> It's not ASCII though is it? Since '<' and '>' have to be reserved
> to support parsing of UTF-8 code points, so it's "almost ASCII."
>
> I'm ok using 'almost' ASCII characters as their 1-byte UTF-8 form
> instead of the verbose code-points, but we need to document exactly
> which characters are allowed. I believe the answer is everything
> except '<>'.
>
> I'm not entirely ready to allow all UTF-8, since that descends into
> the much more complex discussion around NFC, NFKC, NFD, NFKD etc. and
> which form should be used. Then there are discussions around uniqueness
> of decomposition and exactly what did the source author want.
>
> So let us start slowly and agree with 'ASCII - [<>]' where < denotes
> the start of a code point and > the end of the code point.

Yes, that sounds like a very reasonable first step!

Is it OK to use that already *now*? Or is any change necessary to make
that work?

I tried

    country_name "Fran<U00E7>a"

and it seems to work:

    bash-4.4# LC_ALL=oc_FR.UTF-8 locale -k country_name
    country_name="França"

So maybe it is possible to use that right now without having to change
anything in the code parsing the locale source files.

-- 
Mike FABIAN
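
For illustration only: the 'ASCII - [<>]' rule discussed above could be
applied mechanically to existing locale sources with a small script
along the following lines. This is a rough sketch, not an existing
glibc tool. The names simplify_code_points and KEEP_ENCODED are
hypothetical, and leaving '"' and '/' encoded is an extra, conservative
assumption of the sketch (string delimiter and the usual escape
character), not something agreed in this thread.

```python
import re

# '<' and '>' stay encoded, following the 'ASCII - [<>]' rule from the thread.
# '"' and '/' are also kept encoded; that is an assumption of this sketch only.
KEEP_ENCODED = set('<>"/')

# Matches symbolic code points such as <U0041> or <U0001F600>.
CODE_POINT = re.compile(r"<U([0-9A-Fa-f]{4,8})>")


def simplify_code_points(text):
    """Replace <UXXXX> with the literal character when it is printable ASCII."""

    def repl(match):
        cp = int(match.group(1), 16)
        # Only printable ASCII (space through '~') is written directly.
        if 0x20 <= cp <= 0x7E and chr(cp) not in KEEP_ENCODED:
            return chr(cp)
        # Everything else, including non-ASCII, keeps its <UXXXX> form.
        return match.group(0)

    return CODE_POINT.sub(repl, text)


if __name__ == "__main__":
    line = 'country_name "<U0046><U0072><U0061><U006E><U00E7><U0061>"'
    print(simplify_code_points(line))
```

Non-ASCII characters such as ç are deliberately left as code points, so
the normalization questions (NFC, NFD and so on) raised above do not
come up.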