public inbox for libc-locales@sourceware.org
From: Troy Korjuslommi <tjk@tksoft.com>
To: Steven Abner <pheonix@zoomtown.com>
Cc: libc-locales@sourceware.org, Carlos O'Donell <carlos@redhat.com>,
	Keld Simonsen <keld@keldix.com>
Subject: Re: locale encodings
Date: Thu, 14 Nov 2013 07:47:00 -0000
Message-ID: <1384415405.2935.29.camel@uno11.loco>
In-Reply-To: <EC3F7154-A278-4126-B33C-10E107B63BD9@zoomtown.com>

By the way, I ran some tests on the fi_FI locale in glibc-2.18, and its
collation information seems to be out of date. The correct collation
order and data are specified in the Finnish standard SFS-EN 13710,
published in 2011 (a Finnish standard based on EN 13710, roughly
ISO/IEC 14651), and in CLDR, and are implemented in ICU. A quick look at
the fi_FI file tells me that at least the dates are off, which suggests
that the data is off too. The collation errors seem to be
diacritic-related, so I would have to go through the actual data to
determine whether the error is in strcoll's handling of UTF-8 or in the
collation data. The collation data seems the most likely suspect. Keld,
your name is listed as the contact, so it may be best that you check
this, in case only the comments are off. Also, the charset is wrong: it
is listed as iso-8859-1 for fi_FI and iso-8859-15 for fi_FI@euro. The
correct charset for Finnish is UTF-8, since only UTF-8 includes all the
characters in the current standards.
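
A minimal sketch of the kind of test described above, assuming a Python
interpreter on a glibc system where the fi_FI.UTF-8 locale has been
generated. The helper name sort_with_locale and the sample words are
illustrative only; they are not taken from the tests mentioned in this
message. The function returns None when the locale is not installed, so
it makes no claim about what order glibc actually produces.

```python
import locale
from functools import cmp_to_key


def sort_with_locale(words, locale_name="fi_FI.UTF-8"):
    """Sort words with strcoll under locale_name.

    Returns None if the locale is not available on this system.
    The previous LC_COLLATE setting is restored before returning.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # locale.strcoll wraps the C library's strcoll()
        return sorted(words, key=cmp_to_key(locale.strcoll))
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


if __name__ == "__main__":
    # Hypothetical sample; pick words that exercise diacritics.
    sample = ["zeta", "ämpäri", "omena", "öljy", "åker", "aamu"]
    result = sort_with_locale(sample)
    if result is None:
        print("fi_FI.UTF-8 is not installed; nothing to compare")
    else:
        print(result)
```

The Finnish alphabet places å, ä and ö after z, so words starting with
those letters are a reasonable first check of the output.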

Since EN 13710 specifies a European collation order, it should also be
used in other European locales as the default sorting order.

I've tried to push for more cooperation with CLDR in the past too, and
this is a good case in point for why it would be a good idea to keep an
eye on CLDR. There is no need to automate the process (the difficulty of
which seems to be the reason for resisting CLDR); just get the relevant
data. Running comparison tests between CLDR and libc would also be a
good idea. ICU is pretty up to date with CLDR and other Unicode.org
data, so it would be an easy way to implement the tests.
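
One way such a comparison test could look is sketched below. It is not
tied to any particular library: find_disagreements is a hypothetical
helper that takes two comparison functions and lists the word pairs
they order differently. The two comparators in the demonstration are
stand-ins chosen so the sketch runs anywhere.

```python
from itertools import combinations


def sign(n):
    """Reduce a comparator result to -1, 0 or 1."""
    return (n > 0) - (n < 0)


def find_disagreements(words, cmp_a, cmp_b):
    """Return the pairs (x, y) whose relative order differs
    between the two comparison functions."""
    return [
        (x, y)
        for x, y in combinations(words, 2)
        if sign(cmp_a(x, y)) != sign(cmp_b(x, y))
    ]


def by_codepoint(x, y):
    """Stand-in comparator: plain code point order."""
    return (x > y) - (x < y)


def by_casefold(x, y):
    """Stand-in comparator: case-insensitive code point order."""
    return by_codepoint(x.casefold(), y.casefold())


if __name__ == "__main__":
    for pair in find_disagreements(["a", "B", "c"], by_codepoint, by_casefold):
        print(pair)
```

For a real libc versus ICU run, cmp_a could be locale.strcoll after
setting LC_COLLATE to fi_FI.UTF-8, and cmp_b could be the compare method
of a collator from the third-party PyICU package, assuming it is
installed. Each reported pair is then a candidate difference to check
against the standard by hand.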

Troy


On Tue, 2013-11-12 at 10:37 -0500, Steven Abner wrote:
> On 12 Nov 2013, at 9:34 AM, Steven Abner wrote:
> 
> > all data that is important, save one, is in POSIX's 7-bit ASCII
> 
>  I wish to add, the quoted strings however are UTF8 instead of the default set. Off the top of my
> head, the JP file has quoted ("") strings for correct display of months, hours, etc. in UTF8.
>  As far as embedded, a Japanese microwave doesn't need UTF8 for display, but the designer
> who butchers the code for the microwave, even a Japanese one, can readily use UTF8 to set up
> JIS0201 or even their own proprietary 128 or less byte display code, and internal communications.
> That same designer could use UTF8, and default character information from glibc locales to
> create an embedded version of a code set for microwaves in China.
>   Not saying this is standard, but my point was, I guess, is default character set for the locale could
> or should go into the ASCII section of "LC" data. Comments in any encoding get gobbled, quoted
> strings either in default character set or UTF8.
>   I am no expert, just food for thought.
> Steve


Thread overview: 16+ messages
2013-11-11  1:28 Steven Abner
2013-11-11  5:19 ` Carlos O'Donell
2013-11-11 12:58 ` Troy Korjuslommi
2013-11-12  1:23   ` Keld Simonsen
2013-11-12  5:38     ` Carlos O'Donell
2013-11-12 13:36       ` Keld Simonsen
2013-11-12 14:39         ` Carlos O'Donell
2013-11-12 16:11           ` Keld Simonsen
2013-11-12 14:52         ` Steven Abner
2013-11-12 16:15           ` Steven Abner
2013-11-14  7:47             ` Troy Korjuslommi [this message]
2013-11-14 11:33               ` Keld Simonsen
2013-11-14 20:47               ` Steven Abner
2013-11-14 21:17                 ` Steven Abner
2013-11-14 21:17                 ` Keld Simonsen
2013-11-26 17:05 Marko Myllynen
