public inbox for glibc-bugs@sourceware.org
* [Bug localedata/14095] New: Review / update collation data from Unicode / ISO 14651
@ 2012-05-10 20:32 jsm28 at gcc dot gnu.org
  2013-11-26 17:05 ` [Bug localedata/14095] " myllynen at redhat dot com
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: jsm28 at gcc dot gnu.org @ 2012-05-10 20:32 UTC (permalink / raw)
  To: glibc-bugs

http://sourceware.org/bugzilla/show_bug.cgi?id=14095

             Bug #: 14095
           Summary: Review / update collation data from Unicode / ISO
                    14651
           Product: glibc
           Version: 2.15
            Status: NEW
          Severity: normal
          Priority: P2
         Component: localedata
        AssignedTo: unassigned@sourceware.org
        ReportedBy: jsm28@gcc.gnu.org
                CC: libc-locales@sources.redhat.com
    Classification: Unclassified


Judging by their names, the localedata/locales/iso14651_t1_* files were
probably originally based on some version of the ISO 14651 collation data.
They should be updated, if possible, to be based on the current Unicode
collation data and algorithms.

http://www.unicode.org/reports/tr10/
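
As a rough illustration of where this collation data ends up being used: a program that sorts strings through the C library's locale-aware transform is ordering them by the LC_COLLATE rules compiled from files such as these. The sketch below is a minimal example under that assumption; the helper name sort_in_locale is hypothetical, and only the "C" locale is guaranteed to be installed on any given system.

```python
import locale


def sort_in_locale(words, locale_name):
    """Sort words using the LC_COLLATE rules of the named locale."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        # Raises locale.Error if the requested locale is not installed.
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # strxfrm maps each string to a key whose ordinary comparison
        # matches the locale's collation order.
        return sorted(words, key=locale.strxfrm)
    finally:
        # Restore the previous setting so callers are not affected.
        locale.setlocale(locale.LC_COLLATE, previous)


if __name__ == "__main__":
    print(sort_in_locale(["b", "a", "B"], "C"))
```

Passing a named locale instead of "C" would exercise the collation tables discussed here, provided that locale has been generated on the system.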

Since there have been a lot of changes to these files since the original
addition in

2000-05-24  Ulrich Drepper  <drepper@redhat.com>

        * locales/iso14651_t1: New file.

it's likely there will be a lot of work to understand how the files relate to
ISO 14651 and what local changes are still relevant.

-- 
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.



* [Bug localedata/14095] Review / update collation data from Unicode / ISO 14651
From: myllynen at redhat dot com @ 2013-11-26 17:05 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=14095

Marko Myllynen <myllynen at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |myllynen at redhat dot com


* [Bug localedata/14095] Review / update collation data from Unicode / ISO 14651
From: pravin.d.s at gmail dot com @ 2014-02-18  9:24 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=14095

Pravin S <pravin.d.s at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pravin.d.s at gmail dot com


* [Bug localedata/14095] Review / update collation data from Unicode / ISO 14651
From: fweimer at redhat dot com @ 2014-06-25 11:02 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=14095

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|                            |security-


* [Bug localedata/14095] Review / update collation data from Unicode / ISO 14651
From: maiku.fabian at gmail dot com @ 2014-10-10 15:26 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=14095

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |maiku.fabian at gmail dot com


* [Bug localedata/14095] Review / update collation data from Unicode / ISO 14651
From: pabs3 at bonedaddy dot net @ 2015-06-30  3:52 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=14095

Paul Wise <pabs3 at bonedaddy dot net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pabs3 at bonedaddy dot net

--- Comment #1 from Paul Wise <pabs3 at bonedaddy dot net> ---
Why did glibc fork the Unicode collation data instead of sending changes
upstream?


* [Bug localedata/14095] Review / update collation data from Unicode / ISO 14651
From: joseph at codesourcery dot com @ 2015-06-30 11:14 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=14095

--- Comment #2 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---
The people involved in getting the collation data to its present state are 
mostly no longer involved in glibc development, so if you want an 
authoritative answer you'll need to do a lot of work tracking them down.  
My hypothesis would be that each person submitting a change generally had 
their own itch to scratch (supporting collation for their own language 
better, with no interest in a more general update to a newer version of 
ISO 14651, if a newer version even existed at that time, or insufficient 
time / expertise / resources to get involved in their national standards 
committees parallel to JTC1/SC2/WG2, if ISO 14651 did not support their 
language then) and that each person accepting such a change decided that 
it was better to have the incremental improvement than to have no 
collation support for that language for the indefinite future until 
someone appeared to contribute a more thorough update.

We don't, however, need to know people's motivations for making 
incremental changes rather than larger bulk updates.  The questions that 
are actually relevant for updating the data now are more along the lines 
of: for the original addition of the ISO 14651 data, what differences are 
there from the relevant version of ISO 14651?  Do those differences relate 
to conceptual differences between the POSIX collation model and the ISO 
14651 collation model, or do they reflect different choices for how to 
collate particular characters?  If they reflect different choices, do we 
still agree that those choices are appropriate for the contexts in which 
glibc locales are used, or, with hindsight, would the ISO 14651 choices 
now be better?  Where a change was made subsequently affecting existing 
characters, is the change still at variance with current ISO 14651, and do 
we think there is still a good reason for such a difference?  Where 
collation support for new characters was added, how does that support 
compare to the support, if any, for those characters in current ISO 14651, 
and are there any differences we think are deliberate and should be 
preserved?  Do any differences reflect cases where e.g. different national 
standards specify different collation for the same characters (or 
collation differs by context), and so individual locales may need to 
override the generic international version?

Yes, there is a lot of detailed, careful work involved in analysis of the 
history of the current collation data in order to produce a justified 
analysis of those questions with recommendations for how to use data from 
current ISO 14651.  Given the responsibility to users to avoid 
regressions, we need to understand what changes would be involved in such 
an update, and satisfy ourselves that they are good changes rather than 
regressions, as part of making such an update.  Contributors willing to 
help with that careful analysis are welcome.


* [Bug localedata/14095] Review / update collation data from Unicode / ISO 14651
From: carlos at redhat dot com @ 2015-06-30 13:44 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=14095

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |carlos at redhat dot com

--- Comment #3 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to joseph@codesourcery.com from comment #2)
> Yes, there is a lot of detailed, careful work involved in analysis of the 
> history of the current collation data in order to produce a justified 
> analysis of those questions with recommendations for how to use data from 
> current ISO 14651.  Given the responsibility to users to avoid 
> regressions, we need to understand what changes would be involved in such 
> an update, and satisfy ourselves that they are good changes rather than 
> regressions, as part of making such an update.  Contributors willing to 
> help with that careful analysis are welcome.

I agree completely with Joseph.


* [Bug localedata/14095] Review / update collation data from Unicode / ISO 14651
From: keld at keldix dot com @ 2015-06-30 15:29 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=14095

--- Comment #4 from keld at keldix dot com <keld at keldix dot com> ---
Well, I was the author of many of the collation specs for different
languages, I am still around, and I even joined glibc maintenance
just a few years ago.

The ISO 14651 and POSIX models are the same, or rather, 14651 is backwards
compatible with POSIX. We cannot claim to follow POSIX strictly, because
then the locales would not work: POSIX is not well suited for
ISO 10646 UCS. So we adhere not to POSIX but to 14651.

The collation data for the different locales was designed to adhere to
14651 in an orthogonal way, just as 14651 was designed to be used.

I am willing to contribute with a look on the different issues.

Best regards
Keld


* [Bug localedata/14095] Review / update collation data from Unicode / ISO 14651
From: joseph at codesourcery dot com @ 2015-06-30 16:03 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=14095

--- Comment #5 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---
On Tue, 30 Jun 2015, keld at keldix dot com wrote:

> I am willing to contribute with a look on the different issues.

That would be very helpful, thanks!  The first question would probably be 
where the original iso14651_t1 file (added in commit 
b0a3e2e6238f4846bc7a99145d2721b8d5b5ec31 in the history repository) came 
from; if we can reproduce it from old ISO 14651 data, we can hopefully 
build a corresponding file from current ISO 14651 data - and then start to 
understand, for all the changes made to the data over the past 15 years, 
which of them are still relevant and desirable given current ISO 14651 / 
Unicode data as a base, and what the right way is to handle those changes.
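
The comparison step described above could start from something like the following sketch. It assumes a simplified view of the table format, in which each entry is a line beginning with a <Uxxxx> symbol followed by a semicolon-separated weight string and an optional # comment; directives, collating-symbol definitions and ordering by position are ignored. The function names parse_weights and diff_tables are hypothetical and not part of any glibc tooling.

```python
import re

# Matches entries such as "<U0061> <a>;<BAS>;<MIN>;IGNORE # comment".
# Assumption: the weight string itself contains no whitespace.
ENTRY = re.compile(r"^(<U[0-9A-Fa-f]{4,8}>)\s+([^#\s]+)")


def parse_weights(text):
    """Map each <Uxxxx> symbol to its weight string, ignoring comments."""
    table = {}
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            table[match.group(1)] = match.group(2)
    return table


def diff_tables(old, new):
    """Return (added, removed, changed) lists of symbols, each sorted."""
    added = sorted(k for k in new if k not in old)
    removed = sorted(k for k in old if k not in new)
    changed = sorted(k for k in old if k in new and old[k] != new[k])
    return added, removed, changed


if __name__ == "__main__":
    # Made-up sample data, not taken from any real table.
    old_text = (
        "<U0061> <a>;<BAS>;<MIN>;IGNORE # a\n"
        "<U0062> <b>;<BAS>;<MIN>;IGNORE # b\n"
    )
    new_text = (
        "<U0061> <a>;<BAS>;<MIN>;IGNORE\n"
        "<U0062> <b>;<BAS>;<CAP>;IGNORE\n"
        "<U0063> <c>;<BAS>;<MIN>;IGNORE\n"
    )
    print(diff_tables(parse_weights(old_text), parse_weights(new_text)))
```

A report like this only says which entries differ. Deciding whether each difference is a deliberate local choice or an accident of history is the manual analysis the thread is asking for.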


* [Bug localedata/14095] Review / update collation data from Unicode / ISO 14651
From: keld at keldix dot com @ 2015-07-01  7:58 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=14095

--- Comment #6 from keld at keldix dot com <keld at keldix dot com> ---
It is my plan to work with the editor of 14651 on making the 14651
data directly useable with glibc. This is not currently the case
and we know it.

Keld
