From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-locales-return-2730-listarch-libc-locales=sources.redhat.com@sourceware.org>
Received: (qmail 26633 invoked by alias); 14 Nov 2013 07:47:42 -0000
Mailing-List: contact libc-locales-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-locales.sourceware.org>
List-Subscribe: <mailto:libc-locales-subscribe@sourceware.org>
List-Post: <mailto:libc-locales@sourceware.org>
List-Help: <mailto:libc-locales-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: libc-locales-owner@sourceware.org
Received: (qmail 26622 invoked by uid 89); 14 Nov 2013 07:47:41 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=1.6 required=5.0 tests=BAYES_50,RDNS_NONE,URIBL_BLOCKED autolearn=no version=3.3.2
X-HELO: mailgw1.tentacle.fi
Subject: Re: locale encodings
From: Troy Korjuslommi <tjk@tksoft.com>
To: Steven Abner <pheonix@zoomtown.com>
Cc: libc-locales@sourceware.org, Carlos O'Donell <carlos@redhat.com>, Keld
 Simonsen <keld@keldix.com>
In-Reply-To: <EC3F7154-A278-4126-B33C-10E107B63BD9@zoomtown.com>
References: <31AACAB8-A716-47CC-B755-F33DD77BA51E@zoomtown.com>
	 <1384174607.4028.8.camel@uno11.loco> <20131112012257.GA31828@rap.rap.dk>
	 <5281BEB1.2010909@redhat.com> <20131112133642.GA22738@rap.rap.dk>
	 <98244D14-49A6-4953-8F6B-9D393E435324@zoomtown.com>
	 <EC3F7154-A278-4126-B33C-10E107B63BD9@zoomtown.com>
Content-Type: text/plain; charset="UTF-8"
Date: Thu, 14 Nov 2013 07:47:00 -0000
Message-ID: <1384415405.2935.29.camel@uno11.loco>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
X-SW-Source: 2013-q4/txt/msg00085.txt.bz2

By the way, I ran some tests on the fi_FI locale in glibc-2.18, and its
collation information seems to be out of date. The correct collation
order and data are specified in the Finnish standard SFS-EN 13710,
published in 2011 (the Finnish standard based on EN 13710, which roughly
corresponds to ISO/IEC 14651), and in CLDR, and are implemented in ICU.
A quick look at the fi_FI file tells me that at least the dates are off,
which suggests that the data is off as well. The collation errors seem
to be diacritic related, so I would have to go through the actual data
to determine whether the error is in strcoll's handling of UTF-8 or in
the collation data. The collation data seems to be the more likely
suspect. Keld, your name is listed as the contact, so it may be best
that you check this, in case only the comments are off. Also, the
charset is wrong: it is listed as iso-8859-1 for fi_FI and iso-8859-15
for fi_FI@euro. The correct charset for Finnish is UTF-8. Only UTF-8
includes all the characters included in the current standards.
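
For anyone who wants to repeat this kind of check, here is a minimal
sketch in Python, not the exact test I ran. It assumes a POSIX system
where locale.nl_langinfo is available; the helper name locale_report and
the sample words are made up for illustration. It reports the codeset a
locale declares and sorts a word list through the C library's collation
(locale.strxfrm), returning None when the locale is not installed.

```python
import locale


def locale_report(name, words):
    """Return (codeset, words sorted by the locale's collation).

    Returns None if the named locale is not installed on this system.
    Hypothetical helper, for illustration only.
    """
    saved = locale.setlocale(locale.LC_ALL)  # remember the current setting
    try:
        try:
            locale.setlocale(locale.LC_ALL, name)
        except locale.Error:
            return None
        codeset = locale.nl_langinfo(locale.CODESET)
        # strxfrm produces sort keys that follow the locale's LC_COLLATE data
        return codeset, sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_ALL, saved)  # always restore


if __name__ == "__main__":
    sample = ["zeta", "äiti", "vesi", "wau", "öljy"]  # arbitrary sample words
    for name in ("fi_FI", "fi_FI@euro", "fi_FI.UTF-8"):
        print(name, locale_report(name, sample))
```

The resulting order can then be compared by hand against the order the
standard prescribes.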

Since EN 13710 specifies a European collation order, it should also be
used in other European locales as the default sorting order.

I've tried to push for more cooperation with CLDR in the past too, and
this is a good case in point for why it would be a good idea to keep an
eye on CLDR. There is no need to automate the process (the difficulty of
which seems to be the reason for resisting CLDR); just get the relevant
data. Running comparison tests between CLDR and libc would also be a
good idea. ICU is pretty up to date in terms of CLDR and other
Unicode.org data, so that would be an easy way to implement the tests.
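
As a rough illustration of such a comparison test, the Python sketch
below lists the word pairs that two sort-key functions order
differently. The function name collation_disagreements is hypothetical.
In a real run, key_a could be locale.strxfrm after setting the locale
under test, and key_b a sort-key function from an ICU binding (assumed
to be installed separately); the demo uses two plain stand-in keys so
that it runs anywhere.

```python
from itertools import combinations


def _sign(a, b):
    """Three-way comparison result: -1, 0 or 1."""
    return (a > b) - (a < b)


def collation_disagreements(words, key_a, key_b):
    """Return the pairs of words whose relative order differs between
    two sort-key functions (hypothetical helper, for illustration)."""
    diffs = []
    for x, y in combinations(words, 2):
        if _sign(key_a(x), key_a(y)) != _sign(key_b(x), key_b(y)):
            diffs.append((x, y))
    return diffs


if __name__ == "__main__":
    # Stand-in keys: raw code point order versus case-insensitive order.
    words = ["a", "B", "c"]
    print(collation_disagreements(words, lambda s: s, str.lower))
```

An empty result means the two implementations agree on the sample; any
pair that is reported is a candidate for checking against the standard.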

Troy


On Tue, 2013-11-12 at 10:37 -0500, Steven Abner wrote:
> On 12 Nov 2013, at 9:34 AM, Steven Abner wrote:
> 
> > all data that is important, save one, is in POSIX's 7-bit ASCII
> 
>  I wish to add, the quoted strings however are UTF8 instead of the default set. Off the top of my
> head, the JP file has quoted ("") strings for correct display of months, hours, etc. in UTF8.
>  As far as embedded, a Japanese microwave doesn't need UTF8 for display, but the designer
> who butchers the code for the microwave, even a Japanese one, can readily use UTF8 to set up
> JIS0201 or even their own proprietary 128 or less byte display code, and internal communications.
> That same designer could use UTF8, and default character information from glibc locales to
> create an embedded version of a code set for microwaves in China.
>   Not saying this is standard, but my point was, I guess, is default character set for the locale could
> or should go into the ASCII section of "LC" data. Comments in any encoding get gobbled, quoted
> strings either in default character set or UTF8.
>   I am no expert, just food for thought.
> Steve