From mboxrd@z Thu Jan  1 00:00:00 1970
Subject: Re: Is it OK to write ASCII strings directly into locale source
 files?
To: Mike FABIAN <mfabian@redhat.com>, libc-alpha@sourceware.org
References: <s9d8tje9e1k.fsf@redhat.com>
From: Carlos O'Donell <carlos@redhat.com>
Message-ID: <5f71f2f6-be0e-2b5d-91ce-03386eafa7f7@redhat.com>
Date: Mon, 24 Jul 2017 13:28:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.2.1
MIME-Version: 1.0
In-Reply-To: <s9d8tje9e1k.fsf@redhat.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-SW-Source: 2017-07/txt/msg00811.txt.bz2

On 07/24/2017 09:09 AM, Mike FABIAN wrote:
> 
> Currently the locale source files use a lot of code points even for
> strings which are pure ASCII. For example localedata/locales/de_DE
> contains:
> 
> %	"%a %d %b %Y %T %Z"
> d_t_fmt "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020><U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>"
> 
> Would it be OK to write this as
> 
> d_t_fmt "%a %d %b %Y %T %Z"
> 
> ??
> 
> This would make the files much more readable.
> 
> Stuff that is mostly ASCII can probably be written like this:
> 
> % https://oc.wikipedia.org/wiki/Fran%C3%A7a França
> country_name "Fran<U00E7>a"
> 
> which is already more readable than writing it all in <U00??> code points.
> 
> It would be even nicer to write it completely in UTF-8, i.e.:
> 
> country_name "França"
> 
> but I am not sure whether this is allowed in the locale source files.
> 
> But at least for everything which is ASCII, it might be OK already to
> write the characters directly.
> 
> Is writing ASCII there allowed or not??
 
It's not quite ASCII though, is it? '<' and '>' have to stay reserved
to support parsing of <Uxxxx> code points, so it's "almost ASCII."

I'm ok using 'almost' ASCII characters as their 1-byte UTF-8 form
instead of the verbose code-points, but we need to document exactly
which characters are allowed. I believe the answer is everything
except '<>'.

I'm not entirely ready to allow all UTF-8, since that descends into
the much more complex discussion around NFC, NFKC, NFD, NFKD etc. and
which form should be used. Then there are discussions around uniqueness
of decomposition and exactly what did the source author want.
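
To make the normalization concern concrete, here is a small Python
sketch. It uses only the standard unicodedata module and is not glibc
code; the helper as_code_points is a hypothetical name introduced for
illustration. It shows that the same visible string "França" maps to
different code point sequences depending on the normalization form:

```python
import unicodedata


def as_code_points(text):
    """Render text in the <Uxxxx> notation used by locale source files."""
    return "".join("<U{:04X}>".format(ord(ch)) for ch in text)


# "França" written with a precomposed c-with-cedilla (U+00E7).
name = "Fran\u00e7a"

nfc = unicodedata.normalize("NFC", name)  # composed form
nfd = unicodedata.normalize("NFD", name)  # decomposed form

print(as_code_points(nfc))
print(as_code_points(nfd))
```

With explicit <Uxxxx> code points the author's choice is visible in the
source; with raw UTF-8 the two forms look identical in an editor.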

So let us start slowly and agree on 'ASCII - [<>]', where '<' denotes
the start of a code point and '>' the end of the code point.
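
A rough sketch of what that rule could look like when applied to an
existing string, in Python. This is not an existing glibc tool; the
names CODE_POINT, is_plain and simplify are hypothetical. It assumes
"ASCII" means printable ASCII (0x20 to 0x7E), so control characters stay
as code points, and it does not deal with the locale file's escape
character or with embedded double quotes:

```python
import re

# Matches the <Uxxxx> notation (4 to 8 hex digits assumed).
CODE_POINT = re.compile(r"<U([0-9A-Fa-f]{4,8})>")


def is_plain(cp):
    """True for printable ASCII other than '<' and '>' (assumed allowed set)."""
    return 0x20 <= cp <= 0x7E and chr(cp) not in "<>"


def simplify(value):
    """Replace <Uxxxx> by the literal character when it is in the allowed set."""
    def replace(match):
        cp = int(match.group(1), 16)
        # Anything outside 'ASCII - [<>]' keeps its code point notation.
        return chr(cp) if is_plain(cp) else match.group(0)
    return CODE_POINT.sub(replace, value)


print(simplify("<U0025><U0061><U0020><U0025><U0064>"))
print(simplify("Fran<U00E7>a"))
print(simplify("<U003C><U0061><U003E>"))
```

Non-ASCII characters such as <U00E7> and the reserved <U003C> and
<U003E> are left untouched, so the output can still be parsed
unambiguously.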

-- 
Cheers,
Carlos.