locale encodings

public inbox for libc-locales@sourceware.org

* locale encodings
@ 2013-11-11  1:28 Steven Abner
  2013-11-11  5:19 ` Carlos O'Donell
  2013-11-11 12:58 ` Troy Korjuslommi
  0 siblings, 2 replies; 16+ messages in thread
From: Steven Abner @ 2013-11-11  1:28 UTC (permalink / raw)
  To: libc-locales

Hi,
 Can you tell me what encoding the "cs_CZ", "sk_SK", "sv_SE" and "wo_SN" files use? I was going to try
to fix them for my own use, but I can't open them in a normal editor. I was doing a design test when these files tripped a check for codes outside the POSIX portable character set in my scanf()'s isspace(). I think they might be ISO8859-2, but I'm not sure. My normal editor claims they can't be
opened as UTF-8. I'd rather not second-guess someone else's work, if I can avoid it. If they are ISO8859-2, I'll just transcode them to
UTF-8 to examine. Two other files have UTF8 encodings, which is no problem. Others do too, but weren't within the scope of
the trap (comment character to first word after). I am only trying to verify that the file parser is picking up the exact data and, hopefully,
is not being corrupted by unusual codes, as some have been.
Thanks,
Steve
pheonix@zoomtown.com


* Re: locale encodings
  2013-11-11  1:28 locale encodings Steven Abner
@ 2013-11-11  5:19 ` Carlos O'Donell
  2013-11-11 12:58 ` Troy Korjuslommi
  1 sibling, 0 replies; 16+ messages in thread
From: Carlos O'Donell @ 2013-11-11  5:19 UTC (permalink / raw)
  To: Steven Abner, libc-locales

On 11/10/2013 07:03 PM, Steven Abner wrote:
> Hi, Can you tell me what file format "cs_CZ", "sk_SK", "sv_SE" and
> "wo_SN" are encoded in? I was going to try to fix it for my use, but
> can't open in a normal editor. I was doing a design test when these
> files tripped a non-POSIX portable character set code in my scanf()'s
> isspace(). I think they might be ISO8859-2 but not sure. Normal
> editor claims it can't be open in UTF-8. I'd rather not second guess
> someone else's work, if I can. If it is  ISO8859-2, I'll just
> decode/encode me a UTF file to examine. Two other files have UTF8
> encodings, which is no problem. Others do but weren't within scope
> of the trap (comment character to first word after). I am only trying
> to verify the file parser is picking up exact data, and hopefully not
> being corrupted by unusual codes, as some have been.

No idea, and I've been around the project for a long time.

Some of these files are quite historical, and we didn't have all
of the tools we do today. The goal today is that everything
should be UTF-8, but not all of them are.

I tried chardet and it says MacCyrillic:
Python 2.7.5 (default, Oct  8 2013, 12:19:40) 
[GCC 4.8.1 20130603 (Red Hat 4.8.1-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import chardet
>>> rawdata = open ("localedata/locales/cs_CZ", "r").read()
>>> result = chardet.detect(rawdata)
>>> charenc = result['encoding']
>>> print result
{'confidence': 0.7721607087786949, 'encoding': 'MacCyrillic'}

It would be great to have these properly encoded into UTF-8.

I would accept patches to do so unless someone says it *can't*
be encoded in UTF-8 (which I would find very odd).

Cheers,
Carlos.
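
Since chardet only offers a statistical guess, a complementary check is to ask where a file stops being valid UTF-8. What follows is a minimal Python 3 sketch (the session above is Python 2). The helper name `first_invalid_utf8` is hypothetical and is not part of chardet or glibc.

```python
def first_invalid_utf8(raw: bytes):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte the decoder rejected.
        return err.start
    return None


# 0xE8 is a lone ISO-8859-2 style byte, not a complete UTF-8 sequence.
sample = b"a\xe8b"
print(first_invalid_utf8(sample))  # offset of the offending byte
```

Applied to the bytes of a locale source such as localedata/locales/cs_CZ, a non-None result gives the position to inspect in a hex viewer.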


* Re: locale encodings
  2013-11-11  1:28 locale encodings Steven Abner
  2013-11-11  5:19 ` Carlos O'Donell
@ 2013-11-11 12:58 ` Troy Korjuslommi
  2013-11-12  1:23   ` Keld Simonsen
  1 sibling, 1 reply; 16+ messages in thread
From: Troy Korjuslommi @ 2013-11-11 12:58 UTC (permalink / raw)
  To: Steven Abner; +Cc: libc-locales

If you mean the locale data files, they have a line such as 
"% Charset: ISO-8859-1"
which tells you the charset.

It would indeed be a good idea to tell the files' maintainers to use
UTF-8 from now on. For now you can use iconv or uconv to convert them.
E.g. iconv -f iso-8859-1 -t utf-8 < file > newfile

Troy
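
The same conversion can be sketched in Python by reading the charset from the comment line instead of typing it by hand. This is a rough illustration only: it assumes the header looks like the "% Charset: ISO-8859-1" line quoted above, splits on commas in case several names are listed, and uses the first one. The names `declared_charsets` and `to_utf8` are hypothetical helpers.

```python
import re

# Matches a comment line such as "% Charset: ISO-8859-1" (format assumed from the thread).
CHARSET_RE = re.compile(
    rb"^%[ \t]*Charset:[ \t]*(.+?)[ \t\r]*$", re.IGNORECASE | re.MULTILINE
)


def declared_charsets(raw: bytes):
    """Return the charset names listed on the first '% Charset:' line, if any."""
    match = CHARSET_RE.search(raw)
    if match is None:
        return []
    names = match.group(1).decode("ascii").split(",")
    return [name.strip() for name in names if name.strip()]


def to_utf8(raw: bytes) -> bytes:
    """Decode raw using the first declared charset and re-encode it as UTF-8."""
    charsets = declared_charsets(raw)
    if not charsets:
        raise ValueError("no '% Charset:' line found")
    return raw.decode(charsets[0]).encode("utf-8")
```

Note that this leaves the "% Charset:" comment itself unchanged, so the header would still name the old encoding after conversion.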




* Re: locale encodings
  2013-11-11 12:58 ` Troy Korjuslommi
@ 2013-11-12  1:23   ` Keld Simonsen
  2013-11-12  5:38     ` Carlos O'Donell
  0 siblings, 1 reply; 16+ messages in thread
From: Keld Simonsen @ 2013-11-12  1:23 UTC (permalink / raw)
  To: Troy Korjuslommi; +Cc: Steven Abner, libc-locales

Well, the encoding of the source code of all locales should be 7-bit ASCII, for
maximum portability. Then the target encoding should be recorded via the 
% charset specification, which gives a list of possible charsets, comma separated.
UTF-8 should always be included there, but other encodings should also be available.

best regards
keld
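
To see how far a given source file is from pure 7-bit ASCII, one could list the lines that contain bytes of 0x80 or above, separating comment lines from everything else. This is a minimal sketch, assuming "%" is the comment character; `classify_non_ascii` is a hypothetical helper name.

```python
def classify_non_ascii(raw: bytes, comment_char: bytes = b"%"):
    """Return (comment_lines, other_lines): 1-based numbers of lines with bytes >= 0x80."""
    in_comments, in_other = [], []
    for number, line in enumerate(raw.splitlines(), start=1):
        if line.isascii():
            continue  # pure 7-bit line, nothing to report
        if line.lstrip().startswith(comment_char):
            in_comments.append(number)
        else:
            in_other.append(number)
    return in_comments, in_other
```

Two empty lists mean the file is already 7-bit ASCII. Entries only in the first list mean the non-ASCII bytes are confined to comments.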


* Re: locale encodings
  2013-11-12  1:23   ` Keld Simonsen
@ 2013-11-12  5:38     ` Carlos O'Donell
  2013-11-12 13:36       ` Keld Simonsen
  0 siblings, 1 reply; 16+ messages in thread
From: Carlos O'Donell @ 2013-11-12  5:38 UTC (permalink / raw)
  To: Keld Simonsen, Troy Korjuslommi; +Cc: Steven Abner, libc-locales

On 11/11/2013 08:22 PM, Keld Simonsen wrote:
> Well, the encoding of the source code of all locales should be 7-bit ASCII, for
> maximum portability. Then the target encoding should be recorded via the 
> % charset specification, which gives a list of possible charsets, comma separated.
> UTF-8 should always be included there, but other encodings should also be available.

So one of the points that we've been trying to gather consensus on is:
Is it really important to have 7-bit ASCII? Why not use UTF-8 for
the locale source? It's readily readable by all editors and allows
language-specific comments in the source files for maximum maintainability.

Cheers,
Carlos.


* Re: locale encodings
  2013-11-12  5:38     ` Carlos O'Donell
@ 2013-11-12 13:36       ` Keld Simonsen
  2013-11-12 14:39         ` Carlos O'Donell
  2013-11-12 14:52         ` Steven Abner
  0 siblings, 2 replies; 16+ messages in thread
From: Keld Simonsen @ 2013-11-12 13:36 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: Troy Korjuslommi, Steven Abner, libc-locales

On Tue, Nov 12, 2013 at 12:37:53AM -0500, Carlos O'Donell wrote:
> On 11/11/2013 08:22 PM, Keld Simonsen wrote:
> > Well, the encoding of the source code of all locales should be 7-bit ASCII, for
> > maximum portability. Then the target encoding should be recorded via the 
> > % charset specification, which gives a list of possible charsets, comma separated.
> > UTF-8 should always be included there, but other encodings should also be available.
> 
> So one of the points that we've been trying to gather consensus on is:
> Is it really important to have 7-bit ASCII? Why not use UTF-8 for
> the locale source? It's readily readable by all editors and allows
> language-specific comments in the source files for maximum maintainability.

I think to have UTF-8 is a bad idea, eg for embedded systems, and for systems that is
not maintained in UTF-8. It also can give trouble when communicating the source.

Best regards
keld


* Re: locale encodings
  2013-11-12 13:36       ` Keld Simonsen
@ 2013-11-12 14:39         ` Carlos O'Donell
  2013-11-12 16:11           ` Keld Simonsen
  2013-11-12 14:52         ` Steven Abner
  1 sibling, 1 reply; 16+ messages in thread
From: Carlos O'Donell @ 2013-11-12 14:39 UTC (permalink / raw)
  To: Keld Simonsen; +Cc: Troy Korjuslommi, Steven Abner, libc-locales

On 11/12/2013 08:36 AM, Keld Simonsen wrote:
> On Tue, Nov 12, 2013 at 12:37:53AM -0500, Carlos O'Donell wrote:
>> On 11/11/2013 08:22 PM, Keld Simonsen wrote:
>>> Well, the encoding of the source code of all locales should be 7-bit ASCII, for
>>> maximum portability. Then the target encoding should be recorded via the 
>>> % charset specification, which gives a list of possible charsets, comma separated.
>>> UTF-8 should always be included there, but other encodings should also be available.
>>
>> So one of the points that we've been trying to gather consensus on is:
>> Is it really important to have 7-bit ASCII? Why not use UTF-8 for
>> the locale source? It's readily readable by all editors and allows
>> language-specific comments in the source files for maximum maintainability.
> 
> I think to have UTF-8 is a bad idea, eg for embedded systems, and for systems that is
> not maintained in UTF-8. It also can give trouble when communicating the source.

Sorry, could you please expand on that?

Do you have examples of embedded systems that use glibc locale source and
don't support UTF-8? All such embedded systems that I know of run Linux
and do support UTF-8.

What do you mean by "systems that is [sic] not maintained in UTF-8?"

What kind of problems do you foresee when communicating the source?

Cheers,
Carlos.


* Re: locale encodings
  2013-11-12 13:36       ` Keld Simonsen
  2013-11-12 14:39         ` Carlos O'Donell
@ 2013-11-12 14:52         ` Steven Abner
  2013-11-12 16:15           ` Steven Abner
  1 sibling, 1 reply; 16+ messages in thread
From: Steven Abner @ 2013-11-12 14:52 UTC (permalink / raw)
  To: Keld Simonsen; +Cc: Troy Korjuslommi, libc-locales


On 12 Nov 2013, at 8:36 AM, Keld Simonsen wrote:

> On Tue, Nov 12, 2013 at 12:37:53AM -0500, Carlos O'Donell wrote:
>> On 11/11/2013 08:22 PM, Keld Simonsen wrote:
>>> Well, the encoding of the source code of all locales should be 7-bit ASCII, for
>>> maximum portability. Then the target encoding should be recorded via the 
>>> % charset specification, which gives a list of possible charsets, comma separated.
>>> UTF-8 should always be included there, but other encodings should also be available.
>> 
>> So one of the points that we've been trying to gather consensus on is:
>> Is it really important to have 7-bit ASCII? Why not use UTF-8 for
>> the locale source? It's readily readable by all editors and allows
>> language-specific comments in the source files for maximum maintainability.
> 
> I think to have UTF-8 is a bad idea, eg for embedded systems, and for systems that is
> not maintained in UTF-8. It also can give trouble when communicating the source.

FWIW, all the data that is important, save one item, is in POSIX's 7-bit ASCII. The files I've examined and
patched seem to be almost identical copies of Section 7 of The Open Group Base Specifications.
Some have minor data problems, but I was trying to access the default character set,
which happens to be in the "comments" section for some reason.
Steve


* Re: locale encodings
  2013-11-12 14:39         ` Carlos O'Donell
@ 2013-11-12 16:11           ` Keld Simonsen
  0 siblings, 0 replies; 16+ messages in thread
From: Keld Simonsen @ 2013-11-12 16:11 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: Troy Korjuslommi, Steven Abner, libc-locales

On Tue, Nov 12, 2013 at 09:39:08AM -0500, Carlos O'Donell wrote:
> On 11/12/2013 08:36 AM, Keld Simonsen wrote:
> > On Tue, Nov 12, 2013 at 12:37:53AM -0500, Carlos O'Donell wrote:
> >> On 11/11/2013 08:22 PM, Keld Simonsen wrote:
> >>> Well, the encoding of the source code of all locales should be 7-bit ASCII, for
> >>> maximum portability. Then the target encoding should be recorded via the 
> >>> % charset specification, which gives a list of possible charsets, comma separated.
> >>> UTF-8 should always be included there, but other encodings should also be available.
> >>
> >> So one of the points that we've been trying to gather consensus on is:
> >> Is it really important to have 7-bit ASCII? Why not use UTF-8 for
> >> the locale source? It's readily readable by all editors and allows
> >> language-specific comments in the source files for maximum maintainability.
> > 
> > I think to have UTF-8 is a bad idea, eg for embedded systems, and for systems that is
> > not maintained in UTF-8. It also can give trouble when communicating the source.
> 
> Sorry, could you please expand on that?
> 
> Do you have examples of embedded systems that use glibc locale source and
> don't support UTF-8? All such embedded systems that I know of run Linux
> and do support UTF-8.

No, I don't have examples of embedded systems that do not run in UTF-8.
But I believe they are out there: TV sets, routers and the like.
And non-Linux systems. libc can run on many platforms, not just Linux.

> What do you mean by "systems that is [sic] not maintained in UTF-8?"

Many Linux systems do not run UTF-8 natively. My own, for example.
And all the UTF-16 and UCS-2 systems. Think Apple.

> What kind of problems do you foresee when communicating the source?

In some IBM systems even some ASCII characters are converted wrongly. Thus the use of %
as a comment character instead of #. On some printers # is printed wrongly.
And so on. In Japan \ is sometimes printed wrongly. In my own country
Ø is sometimes printed wrongly. If we go to full UCS, then many printers
do not support full UCS. Even among fonts, many do not support full UCS,
and certainly not the latest version of 10646.

Even if a character is correctly displayed, it could be difficult to see
which character it is, out of the over 100,000 characters in ISO 10646.

Many of our sources restrict themselves to a limited ASCII subset, for the same reasons.
This includes ISO 14652 and ISO 30112. I also believe the Unicode tables do the same.

Best regards
keld


* Re: locale encodings
  2013-11-12 14:52         ` Steven Abner
@ 2013-11-12 16:15           ` Steven Abner
  2013-11-14  7:47             ` Troy Korjuslommi
  0 siblings, 1 reply; 16+ messages in thread
From: Steven Abner @ 2013-11-12 16:15 UTC (permalink / raw)
  To: libc-locales; +Cc: Carlos O'Donell, Keld Simonsen, Steven Abner

On 12 Nov 2013, at 9:34 AM, Steven Abner wrote:

> all data that is important, save one, is in POSIX's 7-bit ASCII

 I wish to add, the quoted strings however are UTF8 instead of the default set. Off the top of my
head, the JP file has quoted ("") strings for correct display of months, hours, etc. in UTF8.
 As far as embedded, a Japanese microwave doesn't need UTF8 for display, but the designer
who butchers the code for the microwave, even a Japanese one, can readily use UTF8 to set up
JIS0201 or even their own proprietary 128 or less byte display code, and internal communications.
That same designer could use UTF8, and default character information from glibc locales to
create an embedded version of a code set for microwaves in China.
  I'm not saying this is standard, but my point, I guess, is that the default character set for the locale could
or should go into the ASCII section of the "LC" data. Comments in any encoding get gobbled; quoted
strings are either in the default character set or in UTF8.
  I am no expert, just food for thought.
Steve


* Re: locale encodings
  2013-11-12 16:15           ` Steven Abner
@ 2013-11-14  7:47             ` Troy Korjuslommi
  2013-11-14 11:33               ` Keld Simonsen
  2013-11-14 20:47               ` Steven Abner
  0 siblings, 2 replies; 16+ messages in thread
From: Troy Korjuslommi @ 2013-11-14  7:47 UTC (permalink / raw)
  To: Steven Abner; +Cc: libc-locales, Carlos O'Donell, Keld Simonsen

By the way, I ran some tests on the fi_FI locale for glibc-2.18 and it
seems to contain out of date information in regards to collation. The
correct collation order/data are specified in Finnish standard SFS-EN
13710 published in 2011 (Finnish standard based on EN 13710 ~aka ISO/IEC
14651) and CLDR, and implemented in ICU. A quick look at the fi_FI file
tells me that at least the dates are off, which would imply that the data
is off too. The collation errors seem to be diacritic-related, so I would
have to go through the actual data to determine whether the error is in
strcoll's handling of UTF-8 or in the collation data. The collation data
seems to be the most likely suspect. Keld, your name is listed as the
contact, so it may be best that you check this out, in case only the
comments are off. Also, the charset is wrong. It is listed as iso-8859-1
for fi_FI and iso-8859-15 for fi_FI@euro. The correct charset for
Finnish is UTF-8. Only UTF-8 includes all the characters included in the
current standards.

Since EN 13710 specifies a European collation order, it should also be
used in other European locales as the default sorting order.

I've tried to push for more cooperation with CLDR in the past too, and
here is a good case in point why it would actually be a good idea to
keep an eye on CLDR. There is no need to automate the process
(difficulty of which seems to be the reason for resisting CLDR), just
get the relevant data. Running comparison tests between CLDR and libc
would also be a good idea. ICU is pretty up-to-date in terms of CLDR and
other Unicode.org data, so that would be an easy way to implement the
tests.

Troy
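
The comparison tests suggested above could start from a harness like the following minimal sketch. It sorts a word list with the C library's collation through Python's locale module. `libc_sorted` is a hypothetical helper name, and the locale passed in (for example "fi_FI.UTF-8") is assumed to be installed on the test machine. The reference order to compare against would have to come from ICU or CLDR and is not shown here.

```python
import locale


def libc_sorted(words, locale_name):
    """Sort words using the C library's LC_COLLATE rules for locale_name.

    Returns None when the C library does not accept the locale name.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # strxfrm maps each string to a key that compares like strcoll would.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)
```

A test would then compare `libc_sorted(words, "fi_FI.UTF-8")` with the same words in the order an ICU-based tool produces, and report any difference.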


* Re: locale encodings
  2013-11-14  7:47             ` Troy Korjuslommi
@ 2013-11-14 11:33               ` Keld Simonsen
  2013-11-14 20:47               ` Steven Abner
  1 sibling, 0 replies; 16+ messages in thread
From: Keld Simonsen @ 2013-11-14 11:33 UTC (permalink / raw)
  To: Troy Korjuslommi; +Cc: Steven Abner, libc-locales, Carlos O'Donell

I am aware of the problem, and will look into it.
It may take some time, tho.

Best regards
keld


* Re: locale encodings
  2013-11-14  7:47             ` Troy Korjuslommi
  2013-11-14 11:33               ` Keld Simonsen
@ 2013-11-14 20:47               ` Steven Abner
  2013-11-14 21:17                 ` Steven Abner
  2013-11-14 21:17                 ` Keld Simonsen
  1 sibling, 2 replies; 16+ messages in thread
From: Steven Abner @ 2013-11-14 20:47 UTC (permalink / raw)
  To: Troy Korjuslommi; +Cc: libc-locales, Keld Simonsen

On 14 Nov 2013, at 2:50 AM, Troy Korjuslommi wrote:

>  Also, the charset is wrong. It is listed as iso-8859-1
> for fi_FI and iso-8859-15 for fi_FI@euro. The correct charset for
> Finnish is UTF-8. 

I did a little more digging; it appears "% Charset:" might just be there for historical purposes.
The ones I was going to fill in have to be UTF-8. Most of the information is in the "% Charset:"
enumerated set, but all locales now reference, either directly or indirectly, "copy "i18n"" in
the LC_CTYPE section, and this includes codes outside the listed Charset. I am not sure yet
about the interaction with "_t1_common", but it looks like all of them, even the Chinese ones, must use UTF-8,
unless
  we (those using these files) are to isolate only those codes in i18n and t1 that apply to the Charset?
Steve
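
To check which locale sources pull in shared data through copy directives, a small scan like the one below may help. It is only a sketch: it assumes each category opens with a bare "LC_..." line, closes with "END LC_...", and that a copy appears as copy "name" on its own line. `copy_directives` is a hypothetical helper name.

```python
import re

COPY_RE = re.compile(r'^\s*copy\s+"([^"]+)"')
SECTION_RE = re.compile(r"^(LC_[A-Z_]+)\s*$")


def copy_directives(source: str):
    """Return (section, copied_name) pairs found in a locale source text."""
    found = []
    section = None
    for line in source.splitlines():
        opened = SECTION_RE.match(line)
        if opened:
            section = opened.group(1)
            continue
        if line.startswith("END "):
            section = None
            continue
        copied = COPY_RE.match(line)
        if copied:
            found.append((section, copied.group(1)))
    return found
```

Running this over every file in the locales directory would show how many of them depend on "i18n", and therefore on characters outside their listed Charset.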


* Re: locale encodings
  2013-11-14 20:47               ` Steven Abner
@ 2013-11-14 21:17                 ` Steven Abner
  2013-11-14 21:17                 ` Keld Simonsen
  1 sibling, 0 replies; 16+ messages in thread
From: Steven Abner @ 2013-11-14 21:17 UTC (permalink / raw)
  To: Troy Korjuslommi; +Cc: Keld Simonsen, libc-locales, Steven Abner

  If I now understand how your file data works, all locales are loaded as UTF-8.
Your modifiers were an attempt to limit the UTF-8 to a particular Charset?
Locales load, but the encoder/decoder weeds out input/output of the UTF-8 set.
This would imply that should the end user (the person at the keyboard) select a set that can't
display his currency symbol, that's correct behavior!
  I would assume, though I have not looked into it, that a modifier changes a "Keyword" and has nothing
to do with Charset.
  So now only re-encoding old files, if one wants to view them, is left from the original question, IF my
understanding of the files you supply is correct?
Steve


* Re: locale encodings
  2013-11-14 20:47               ` Steven Abner
  2013-11-14 21:17                 ` Steven Abner
@ 2013-11-14 21:17                 ` Keld Simonsen
  1 sibling, 0 replies; 16+ messages in thread
From: Keld Simonsen @ 2013-11-14 21:17 UTC (permalink / raw)
  To: Steven Abner; +Cc: Troy Korjuslommi, libc-locales

On Thu, Nov 14, 2013 at 03:26:31PM -0500, Steven Abner wrote:
> 
> On 14 Nov 2013, at 2:50 AM, Troy Korjuslommi wrote:
> 
> >  Also, the charset is wrong. It is listed as iso-8859-1
> > for fi_FI and iso-8859-15 for fi_FI@euro. The correct charset for
> > Finnish is UTF-8. 
> 
> I did a little more digging, it appears "% Charset:" might just be for historical purposes.
> The ones I was going to fill in have to be UTF-8. Most of the information is in the "% Charset:"
> enumerated set, but because all now reference either directly or indirectly "copy "i18n"" in
> the LC_CTYPE section. This includes codes outside the listed Charset. I am not sure yet
> as to interaction of "_t1_common" but it looks like all, even the Chinese must use UTF-8,
> unless:
>   We (those using these files) are to isolate only those codes in i18n and t1 that apply to the Charset?
> Steve

The charset specification lists the encodings that a given locale can work with. I think all should work in UTF-8,
but some also work in other encodings. I know the Danish locale also works with iso-8859-1 and iso-8859-15.
I would also think that the Finnish locale would work well in iso-8859-1 and iso-8859-15.

Best regards
keld
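
Whether a locale "can work with" a given encoding can be probed roughly by checking which of its characters that encoding is able to represent. This is a minimal sketch using Python's built-in codecs. `unencodable` is a hypothetical helper, and the sample string (the euro sign plus a few Nordic letters) is an illustrative choice, not data taken from the locale files.

```python
def unencodable(text: str, charset: str):
    """Return the sorted list of characters in text that charset cannot represent."""
    missing = set()
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            missing.add(ch)
    return sorted(missing)


sample = "\u20ac \u00e5\u00e4\u00f6"  # euro sign, then a-ring, a-diaeresis, o-diaeresis
for name in ("iso-8859-1", "iso-8859-15", "utf-8"):
    print(name, unencodable(sample, name))
```

An empty list for a charset means every character in the sample can be written in it. Feeding in the strings a locale actually defines (month names, currency symbol and so on) would give a more meaningful answer.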


* Re: locale encodings
@ 2013-11-26 17:05 Marko Myllynen
  0 siblings, 0 replies; 16+ messages in thread
From: Marko Myllynen @ 2013-11-26 17:05 UTC (permalink / raw)
  To: 'Troy Korjuslommi'; +Cc: libc-locales

Hi Troy,

> I ran some tests on the fi_FI locale for glibc-2.18 and it seems to
> contain out of date information in regards to collation. The correct
> collation order/data are specified in Finnish standard SFS-EN 13710
> published in 2011 (Finnish standard based on EN 13710 ~aka ISO/IEC
> 14651) and CLDR, and implemented in ICU.

Yes, as Keld mentioned, we're aware of this. Once EN 13710 is
implemented it should be easy to implement SFS-EN 13710 on top of it.
We're tracking EN 13710 support in
https://sourceware.org/bugzilla/show_bug.cgi?id=16052.

> The charset is wrong. It is listed as iso-8859-1 for fi_FI and
> iso-8859-15 for fi_FI@euro. The correct charset for Finnish is UTF-8.
> Only UTF-8 includes all the characters included in the current
> standards.

Good point, ISO-8859-1 is certainly incorrect since the introduction of
the Euro sign, I'll send a patch to fix this shortly.

> I've tried to push for more cooperation with CLDR in the past too, and
> here is a good case in point why it would actually be a good idea to
> keep an eye on CLDR. There is no need to automate the process
> (difficulty of which seems to be the reason for resisting CLDR), just
> get the relevant data. Running comparison tests between cldr and libc
> would also be a good idea. ICU is pretty up-to-date in terms of CLDR
> and other Unicode.org data, so that would be an easy way to implement
> the tests.

I updated fi_FI two years ago to match CLDR where applicable and to
implement some missing fields (see
https://sourceware.org/bugzilla/show_bug.cgi?id=12962). Based on that,
comparing glibc vs CLDR data manually is quite tedious and even today
some parts are not fully compatible with CLDR/recommendations due to
limitations in POSIX (see
https://sourceware.org/bugzilla/show_bug.cgi?id=12747). However, now
that it's been done once it should be pretty straightforward to keep the
glibc fi_FI data in sync with CLDR.

Thanks,

-- 
Marko Myllynen
