From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: locale encodings
From: Troy Korjuslommi
To: Steven Abner
Cc: libc-locales@sourceware.org, Carlos O'Donell, Keld Simonsen
References: <31AACAB8-A716-47CC-B755-F33DD77BA51E@zoomtown.com> <1384174607.4028.8.camel@uno11.loco> <20131112012257.GA31828@rap.rap.dk> <5281BEB1.2010909@redhat.com> <20131112133642.GA22738@rap.rap.dk> <98244D14-49A6-4953-8F6B-9D393E435324@zoomtown.com>
Date: Thu, 14 Nov 2013 07:47:00 -0000
Message-ID: <1384415405.2935.29.camel@uno11.loco>

By the way, I ran some tests on the fi_FI locale for glibc-2.18 and it
seems to contain out-of-date collation information. The correct
collation order and data are specified in the Finnish standard
SFS-EN 13710, published in 2011 (the Finnish standard based on
EN 13710, approximately ISO/IEC 14651), and in CLDR, and are
implemented in ICU.

A quick look at the fi_FI file tells me that at least the dates are
off, which suggests that the data is out of date as well. The collation
errors seem to be diacritic related, so I would have to go through the
actual data to determine whether the error is in strcoll's handling of
UTF-8 or in the collation data. The collation data seems the most
likely suspect.

Keld, your name is listed as the contact, so it is probably best that
you check this out, in case only the comments are off.
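For anyone who wants to reproduce this kind of check, here is a minimal
sketch, not the tests referred to above. It assumes Python 3 and that an
fi_FI.UTF-8 locale is installed; the function name and the word list are
made up for illustration. It sorts the words with the C library's
collation (via strxfrm), so the result can be compared by eye against
the order the standard or ICU gives.

```python
import locale


def sort_with_locale(words, locale_name="fi_FI.UTF-8"):
    """Sort words using the C library's collation for locale_name.

    Returns None if the locale is not available on this system.
    The function name and default are illustrative assumptions.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # strxfrm produces keys that compare the way strcoll would.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


# Hypothetical sample with letters whose order depends on the
# collation tables (v/w, and the diacritics a-ring, a/o-umlaut, u-umlaut).
SAMPLE = ["vaara", "waara", "väärä", "öljy", "zeppelin", "åker", "ydin", "über"]

if __name__ == "__main__":
    result = sort_with_locale(SAMPLE)
    if result is None:
        print("fi_FI.UTF-8 is not installed; nothing to compare")
    else:
        for word in result:
            print(word)
```

The output depends entirely on the locale data installed, which is the
point: a difference from the expected order points at the collation
tables rather than at the test.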
Also, the charset is wrong. It is listed as iso-8859-1 for fi_FI and
iso-8859-15 for fi_FI@euro. The correct charset for Finnish is UTF-8.
Only UTF-8 includes all the characters included in the current
standards.

Since EN 13710 specifies a European collation order, it should also be
used as the default sorting order in other European locales.

I've tried to push for more cooperation with CLDR in the past too, and
this is a good case in point for why it would be a good idea to keep an
eye on CLDR. There is no need to automate the process (the difficulty
of which seems to be the reason for resisting CLDR); just get the
relevant data. Running comparison tests between CLDR and libc would
also be a good idea. ICU is pretty up to date in terms of CLDR and
other Unicode.org data, so that would be an easy way to implement the
tests.

Troy

On Tue, 2013-11-12 at 10:37 -0500, Steven Abner wrote:
> On 12 Nov 2013, at 9:34 AM, Steven Abner wrote:
>
> > all data that is important, save one, is in POSIX's 7-bit ASCII
>
> I wish to add, the quoted strings however are UTF8 instead of the default set. Off the top of my
> head, the JP file has quoted ("") strings for correct display of months, hours, etc. in UTF8.
> As far as embedded, a Japanese microwave doesn't need UTF8 for display, but the designer
> who butchers the code for the microwave, even a Japanese one, can readily use UTF8 to set up
> JIS0201 or even their own proprietary 128 or less byte display code, and internal communications.
> That same designer could use UTF8, and default character information from glibc locales to
> create an embedded version of a code set for microwaves in China.
> Not saying this is standard, but my point was, I guess, is default character set for the locale could
> or should go into the ASCII section of "LC" data. Comments in any encoding get gobbled, quoted
> strings either in default character set or UTF8.
> I am no expert, just food for thought.
> Steve