From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 6362 invoked by alias); 14 Nov 2013 11:33:19 -0000
Mailing-List: contact libc-locales-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id:
List-Subscribe:
List-Post:
List-Help: ,
Sender: libc-locales-owner@sourceware.org
Received: (qmail 6349 invoked by uid 89); 14 Nov 2013 11:33:18 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=1.3 required=5.0 tests=AWL,BAYES_50,RDNS_NONE,URIBL_BLOCKED autolearn=no version=3.3.2
X-HELO: rap.rap.dk
Date: Thu, 14 Nov 2013 11:33:00 -0000
From: Keld Simonsen
To: Troy Korjuslommi
Cc: Steven Abner, libc-locales@sourceware.org, Carlos O'Donell
Subject: Re: locale encodings
Message-ID: <20131114113308.GA9638@rap.rap.dk>
References: <31AACAB8-A716-47CC-B755-F33DD77BA51E@zoomtown.com>
 <1384174607.4028.8.camel@uno11.loco>
 <20131112012257.GA31828@rap.rap.dk>
 <5281BEB1.2010909@redhat.com>
 <20131112133642.GA22738@rap.rap.dk>
 <98244D14-49A6-4953-8F6B-9D393E435324@zoomtown.com>
 <1384415405.2935.29.camel@uno11.loco>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
In-Reply-To: <1384415405.2935.29.camel@uno11.loco>
User-Agent: Mutt/1.5.20 (2009-06-14)
X-SW-Source: 2013-q4/txt/msg00086.txt.bz2

I am aware of the problem, and will look into it.
It may take some time, tho.

Best regards
keld

On Thu, Nov 14, 2013 at 09:50:05AM +0200, Troy Korjuslommi wrote:
> By the way, I ran some tests on the fi_FI locale for glibc-2.18 and it
> seems to contain out of date information in regards to collation. The
> correct collation order/data are specified in Finnish standard SFS-EN
> 13710 published in 2011 (Finnish standard based on EN 13710 ~aka ISO/IEC
> 14651) and CLDR, and implemented in ICU. Quick look at the fi_FI file
> tells me that at least the dates are off, which would imply the data
> being off.
> The collation errors seem to be diacritic related, so I would
> have to go through the actual data to determine whether the error is in
> strcoll's dealing with UTF-8 or the collation data. The collation data
> seems to be the most likely suspect. Keld, your name is listed as the
> contact, so maybe best that you check this out. In case only the
> comments are off. Also, the charset is wrong. It is listed as iso-8859-1
> for fi_FI and iso-8859-15 for fi_FI@euro. The correct charset for
> Finnish is UTF-8. Only UTF-8 includes all the characters included in the
> current standards.
>
> Since EN 13710 specifies a European collation order, it should also be
> used in other European locales as the default sorting order.
>
> I've tried to push for more cooperation with CLDR in the past too, and
> here is a good case in point why it would actually be a good idea to
> keep an eye on CLDR. There is no need to automate the process
> (difficulty of which seems to be the reason for resisting CLDR), just
> get the relevant data. Running comparison tests between CLDR and libc
> would also be a good idea. ICU is pretty up-to-date in terms of CLDR and
> other Unicode.org data, so that would be an easy way to implement the
> tests.
>
> Troy
>
>
> On Tue, 2013-11-12 at 10:37 -0500, Steven Abner wrote:
> > On 12 Nov 2013, at 9:34 AM, Steven Abner wrote:
> >
> > > all data that is important, save one, is in POSIX's 7-bit ASCII
> >
> > I wish to add, the quoted strings however are UTF8 instead of the default set. Off the top of my
> > head, the JP file has quoted ("") strings for correct display of months, hours, etc. in UTF8.
> > As far as embedded, a Japanese microwave doesn't need UTF8 for display, but the designer
> > who butchers the code for the microwave, even a Japanese one, can readily use UTF8 to set up
> > JIS0201 or even their own proprietary 128 or less byte display code, and internal communications.
> > That same designer could use UTF8, and default character information from glibc locales to
> > create an embedded version of a code set for microwaves in China.
> > Not saying this is standard, but my point, I guess, is that the default character set for the locale could
> > or should go into the ASCII section of "LC" data. Comments in any encoding get gobbled, quoted
> > strings either in default character set or UTF8.
> > I am no expert, just food for thought.
> > Steve
>