From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-alpha-return-82368-listarch-libc-alpha=sources.redhat.com@sourceware.org>
Received: (qmail 16679 invoked by alias); 24 Jul 2017 13:28:36 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Received: (qmail 16618 invoked by uid 89); 24 Jul 2017 13:28:34 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.9 required=5.0 tests=AC_HTML_NONSENSE_TAGS,BAYES_00,RP_MATCHES_RCVD,SPF_HELO_PASS,T_FILL_THIS_FORM_SHORT autolearn=ham version=3.3.2 spammy=
X-HELO: mx1.redhat.com
DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com A1B4EC00AFDC
Authentication-Results: ext-mx07.extmail.prod.ext.phx2.redhat.com; dmarc=none (p=none dis=none) header.from=redhat.com
Authentication-Results: ext-mx07.extmail.prod.ext.phx2.redhat.com; spf=pass smtp.mailfrom=mfabian@redhat.com
DKIM-Filter: OpenDKIM Filter v2.11.0 mx1.redhat.com A1B4EC00AFDC
From: Mike FABIAN <mfabian@redhat.com>
To: Carlos O'Donell <carlos@redhat.com>
Cc: libc-alpha@sourceware.org
Subject: Re: Is it OK to write ASCII strings directly into locale source files?
References: <s9d8tje9e1k.fsf@redhat.com>
	<5f71f2f6-be0e-2b5d-91ce-03386eafa7f7@redhat.com>
Date: Mon, 24 Jul 2017 13:32:00 -0000
In-Reply-To: <5f71f2f6-be0e-2b5d-91ce-03386eafa7f7@redhat.com> (Carlos
	O'Donell's message of "Mon, 24 Jul 2017 09:22:48 -0400")
Message-ID: <s9d7eyy6k1y.fsf@redhat.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1.50 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-SW-Source: 2017-07/txt/msg00812.txt.bz2

Carlos O'Donell <carlos@redhat.com> wrote:

> On 07/24/2017 09:09 AM, Mike FABIAN wrote:
>> 
>> Currently the locale source files use a lot of code points even for
>> strings which are pure ASCII. For example localedata/locales/de_DE
>> contains:
>> 
>> %	"%a %d %b %Y %T %Z"
>> d_t_fmt
>> "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020><U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>"
>> 
>> Would it be OK to write this as
>> 
>> d_t_fmt "%a %d %b %Y %T %Z"
>> 
>> ??
>> 
>> This would make the files much more readable.
>> 
>> Stuff that is mostly ASCII can probably be written like this:
>> 
>> % https://oc.wikipedia.org/wiki/Fran%C3%A7a França
>> country_name "Fran<U00E7>a"
>> 
>> which is already more readable than writing it all in <U00??> code points.
>> 
>> It would be even nicer to write it completely in UTF-8, i.e.:
>> 
>> country_name "França"
>> 
>> but I am not sure whether this is allowed in the locale source files.
>> 
>> But at least for everything which is ASCII, it might be OK already to
>> write the characters directly.
>> 
>> Is writing ASCII there allowed or not??
>  
> It's not ASCII though is it? Since '<' and '>' have to be reserved
> to support parsing of UTF-8 code points, so it's "almost ASCII."
>
> I'm ok using 'almost' ASCII characters as their 1-byte UTF-8 form
> instead of the verbose code-points, but we need to document exactly
> which characters are allowed. I believe the answer is everything
> except '<>'.
>
> I'm not entirely ready to allow all UTF-8, since that descends into
> the much more complex discussion around NFC, NFKC, NFD, NFKD etc. and
> which form should be used. Then there are discussions around uniqueness
> of decomposition and exactly what did the source author want.
>
> So let us start slowly and agree with 'ASCII - [<>]' where < denotes
> the start of a code point and > the end of the code point.

Yes, that sounds like a very reasonable first step!

Is it OK to use that already *now*?

Or is any change necessary to make that work?

I tried

country_name "Fran<U00E7>a"

and it seems to work:

bash-4.4# LC_ALL=oc_FR.UTF-8 locale -k country_name
country_name="França"

So maybe it is possible to use that right now without having to change
anything in the code parsing the locale source files.

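To make the 'ASCII - [<>]' convention concrete, here is a minimal Python
sketch of it. This is not the glibc locale parser: decode() and encode()
are hypothetical helper names, the sketch assumes code points are written
as <U...> with 4 to 8 hex digits, it also writes control characters as
code points, and it ignores quoting, escape_char and comment_char handling
entirely.

```python
import re

# Matches symbolic code points such as <U0025> or <U00E7>.
# Assumption: 4 to 8 hex digits between "<U" and ">".
_CODE_POINT = re.compile(r"<U([0-9A-Fa-f]{4,8})>")


def decode(source):
    """Replace every <UXXXX> token with the character it names."""
    return _CODE_POINT.sub(lambda m: chr(int(m.group(1), 16)), source)


def encode(text):
    """Write printable ASCII except '<' and '>' literally.

    Everything else (non-ASCII, control characters, '<' and '>')
    is written as a <UXXXX> code point.
    """
    out = []
    for ch in text:
        if " " <= ch <= "~" and ch not in "<>":
            out.append(ch)
        else:
            out.append("<U%04X>" % ord(ch))
    return "".join(out)


if __name__ == "__main__":
    # The d_t_fmt value from localedata/locales/de_DE quoted above.
    print(decode("<U0025><U0061><U0020><U0025><U0064><U0020>"
                 "<U0025><U0062><U0020><U0025><U0059><U0020>"
                 "<U0025><U0054><U0020><U0025><U005A>"))
    # A mostly-ASCII string with one non-ASCII character.
    print(encode("Fran\u00e7a"))
```

Because '<' and '>' are never written literally, any '<U...>' sequence in
the output is unambiguous, which is the reason those two characters are
reserved in the proposal.
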
-- 
Mike FABIAN <mfabian@redhat.com>