From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-alpha-return-82369-listarch-libc-alpha=sources.redhat.com@sourceware.org>
Received: (qmail 25383 invoked by alias); 24 Jul 2017 13:32:19 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Received: (qmail 25324 invoked by uid 89); 24 Jul 2017 13:32:16 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.5 required=5.0 tests=AC_HTML_NONSENSE_TAGS,AWL,BAYES_00,RCVD_IN_DNSWL_NONE,RCVD_IN_SORBS_SPAM,T_FILL_THIS_FORM_SHORT autolearn=no version=3.3.2 spammy=consensus, Hx-spam-relays-external:209.85.220.178, H*RU:209.85.220.178
X-HELO: mail-qk0-f178.google.com
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:subject:to:cc:references:from:organization
         :message-id:date:user-agent:mime-version:in-reply-to
         :content-language:content-transfer-encoding;
        bh=xjheTe3X6/Cv/QVDowtvwvCF+Qe3V/R9SoN6ZQCJ/Ao=;
        b=s6PwaYlqOEhN/XP1HbtB/4j95+X9CIVYcPsN6OmFt3PTJnbZ9+W4cR8ucs2mrNuujq
         gsuXmfHyQlBWN423e+DHz+NbtPWvE+XBxY1HkdkoCvzRTO4SX8kRuYkQW7cGtsvoR9QH
         oaor+L72CEaIoBYQNYQBF1jZSSe/+wjr+ImNa7AeVGaHb6ce+hdlxfaXNqw2S8H2ezN3
         DvYMTuLHi/eY5avRF4A0g8dCVW5Yp7yRU8z8aZftFpe7dpOiofK2udOMMJmZkD8itJVd
         f+Ek4Bq9Vxg7IrqNlPz7DC/7w4e3FL3faStg7u+8Wq6GdOfRMn/xTvP0IQAJARHMhBnT
         WjMw==
X-Gm-Message-State: AIVw110IBPG6AriAzuSxdr9oUahIHw4duK7dAMD3lY4uXbhxO5nVGrhy
	0PiLhP3P/PD1ypX266Cncw==
X-Received: by 10.55.180.198 with SMTP id d189mr18654958qkf.103.1500903132324;
        Mon, 24 Jul 2017 06:32:12 -0700 (PDT)
Subject: Re: Is it OK to write ASCII strings directly into locale source
 files?
To: Mike FABIAN <mfabian@redhat.com>
Cc: libc-alpha@sourceware.org
References: <s9d8tje9e1k.fsf@redhat.com>
 <5f71f2f6-be0e-2b5d-91ce-03386eafa7f7@redhat.com>
 <s9d7eyy6k1y.fsf@redhat.com>
From: Carlos O'Donell <carlos@redhat.com>
Message-ID: <9d38a4b0-9b06-8ee5-79b7-ed6b5e7fc40d@redhat.com>
Date: Mon, 24 Jul 2017 14:47:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.2.1
MIME-Version: 1.0
In-Reply-To: <s9d7eyy6k1y.fsf@redhat.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-SW-Source: 2017-07/txt/msg00813.txt.bz2

On 07/24/2017 09:28 AM, Mike FABIAN wrote:
> Carlos O'Donell <carlos@redhat.com> wrote:
> 
>> On 07/24/2017 09:09 AM, Mike FABIAN wrote:
>>>
>>> Currently the locale source files use a lot of code points even for
>>> strings which are pure ASCII. For example localedata/locales/de_DE
>>> contains:
>>>
>>> %	"%a %d %b %Y %T %Z"
>>> d_t_fmt
>>> "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020><U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>"
>>>
>>> Would it be OK to write this as
>>>
>>> d_t_fmt "%a %d %b %Y %T %Z"
>>>
>>> ??
>>>
>>> This would make the files much more readable.
>>>
>>> Stuff that is mostly ASCII can probably be written like this:
>>>
>>> % https://oc.wikipedia.org/wiki/Fran%C3%A7a França
>>> country_name "Fran<U00E7>a"
>>>
>>> which is already more readable than writing it all in <U00??> code points.
>>>
>>> It would be even nicer to write it completely in UTF-8, i.e.:
>>>
>>> country_name "França"
>>>
>>> but I am not sure whether this is allowed in the locale source files.
>>>
>>> But at least for everything which is ASCII, it might be OK already to
>>> write the characters directly.
>>>
>>> Is writing ASCII there allowed or not??
>>  
>> It's not ASCII though is it? Since '<' and '>' have to be reserved
>> to support parsing of UTF-8 code points, so it's "almost ASCII."
>>
>> I'm ok using 'almost' ASCII characters as their 1-byte UTF-8 form
>> instead of the verbose code-points, but we need to document exactly
>> which characters are allowed. I believe the answer is everything
>> except '<>'.
>>
>> I'm not entirely ready to allow all UTF-8, since that descends into
>> the much more complex discussion around NFC, NFKC, NFD, NFKD etc. and
>> which form should be used. Then there are discussions around uniqueness
>> of decomposition and exactly what did the source author want.
>>
>> So let us start slowly and agree with 'ASCII - [<>]' where < denotes
>> the start of a code point and > the end of the code point.
> 
> Yes, that sounds like a very reasonable first step!
> 
> Is it OK to use that already *now*?

You and Rafal are the localedata maintainers, so you can assume consensus
and start changing things in whatever way you wish.

Before you change this, though, I would like to see your list of reasons
for making the change. What benefits do you see it bringing? Is readability
the only one?

> Or is any change necessary to make that work?

I do not know.

> I tried
> 
> country_name "Fran<U00E7>a"
> 
> and it seems to work:
> 
> bash-4.4# LC_ALL=oc_FR.UTF-8 locale -k country_name
> country_name="França"
> 
> So maybe it is possible to use that right now without having to change
> anything in the code parsing the locale source files.
 
You need to document somewhere what is acceptable and what is not and
which ASCII characters cannot be used.
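
To make the 'ASCII - [<>]' rule concrete, here is a rough sketch of how
a string value could be rewritten mechanically. It is only an illustration
under stated assumptions, not an existing glibc tool: the names simplify,
CODE_POINT and KEEP_ENCODED are hypothetical, and keeping '"' encoded is
an extra assumption of the sketch (it delimits the string), not something
agreed in this thread. Anything outside printable ASCII is left untouched.

```python
import re

# Characters that stay in <UXXXX> form. '<' and '>' come from the thread;
# '"' is an extra assumption of this sketch because it delimits the string.
KEEP_ENCODED = {"<", ">", '"'}

# Matches one <UXXXX> token (4 to 8 hex digits) in a locale source string.
CODE_POINT = re.compile(r"<U([0-9A-Fa-f]{4,8})>")


def simplify(value: str) -> str:
    """Replace <UXXXX> tokens by the literal character when that character
    is printable ASCII and not in KEEP_ENCODED; leave all others as-is."""

    def replace(match: "re.Match[str]") -> str:
        code = int(match.group(1), 16)
        # Only printable ASCII (space through '~') is ever written directly.
        if 0x20 <= code <= 0x7E and chr(code) not in KEEP_ENCODED:
            return chr(code)
        return match.group(0)

    return CODE_POINT.sub(replace, value)
```

Applied to the de_DE d_t_fmt value quoted above, this gives
"%a %d %b %Y %T %Z", while "Fran<U00E7>a" comes back unchanged because
U+00E7 is not ASCII.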

-- 
Cheers,
Carlos.