From: Carlos O'Donell <carlos@redhat.com>
To: Florian Weimer <fw@deneb.enyo.de>,
	Alexander Bantyev <balsoft@balsoft.ru>
Cc: libc-help@sourceware.org, mfabian@redhat.com
Subject: Re: [idea] Update ISO 14651 file in locales to the latest standard version
Date: Tue, 2 Nov 2021 12:52:58 -0400
Message-ID: <4d7eb2e9-7a41-97ce-1041-795e266803ce@redhat.com>
In-Reply-To: <878rz0zwvu.fsf@mid.deneb.enyo.de>

On 10/10/21 12:07, Florian Weimer wrote:
> * Alexander Bantyev:
> 
>> The file localedef/locales/iso14651_t1_common is, as far as I can tell,
>> supposed to be taken from <https://standards.iso.org/iso-iec/14651>. However,
>> the version in glibc repository is quite old (from 2016, I think) and is
>> missing some new Unicode codepoints. There have been new editions to the
>> standard, the newest being edition 6 from 2020:
>> <https://standards.iso.org/iso-iec/14651/ed-6/en/ISO14651_2020_TABLE1_en.txt>
>>
>> Perhaps the file in the glibc repository can be updated to match the
>> latest standard?
> 
> I think it's scary to update this file because it alters the result of
> bracket patterns in regular expressions.  The file is no longer fully
> automatically generated, I think.  Implementing rational ranges where
> it counts in glibc would be one way forward here.
> 
> Cc:ing Mike and Carlos, who have more details.

(1) Where does glibc's ISO 14651 data come from?

We use ISO 14651 in glibc for collation weights.

We do not use ISO 14651 in glibc for collation element ordering (CEO).
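
For illustration only: collation weights are what programs see through
strcoll(3) and strxfrm(3). The sketch below is a minimal example, assuming
Python's locale module (which wraps those libc functions). The word list is
made up, and en_US.UTF-8 is only an assumed locale name that may not be
installed on your system, so the code skips it if it is missing.

```python
import locale

# Made-up sample data, chosen so that case affects the result.
words = ["apple", "Banana", "cherry"]

def sort_with_active_collation(items):
    """Sort using the collation weights of the active LC_COLLATE locale."""
    return sorted(items, key=locale.strxfrm)

# The "C" locale is always available and compares by code point,
# so uppercase letters sort before lowercase ones.
locale.setlocale(locale.LC_COLLATE, "C")
c_sorted = sort_with_active_collation(words)
print("C locale:   ", c_sorted)

# A locale built on the ISO 14651 weights may order the same words differently.
# Assumption: en_US.UTF-8 might not exist on this system.
try:
    locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
    print("en_US.UTF-8:", sort_with_active_collation(words))
except locale.Error:
    print("en_US.UTF-8 is not installed; skipping")
finally:
    # Restore a known state so later code is not affected.
    locale.setlocale(locale.LC_COLLATE, "C")
```

Only the "C" result is predictable everywhere; what the second call prints
depends on the locale data installed, which is exactly the data discussed here.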

(2) Is glibc's ISO 14651 data updated in an automated fashion?

No. Importing new ISO 14651 data is a manual and difficult process: the
collation tailorings of every existing locale must be reviewed and harmonized
with the updates from ISO 14651.

(3) What about regexp ranges?

Regular expression ranges rely on "collation element ordering" (not weights),
so after importing ISO 14651 updates we must also update the element order to
retain rational ranges, i.e. ranges that match what English speakers expect
from [a-z], [A-Z], and [0-9].
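
To show why the element order matters for ranges, here is a toy model. It is
not glibc's regex implementation, and the interleaved order "a A b B ... z Z"
is a hypothetical ordering chosen for illustration, not data taken from
ISO 14651 or any glibc locale. The helper range_members is likewise invented
for this sketch.

```python
import string

# Hypothetical collation element order: a A b B ... z Z (assumption).
INTERLEAVED = [
    ch
    for pair in zip(string.ascii_lowercase, string.ascii_uppercase)
    for ch in pair
]

# Code point order for ASCII letters: A..Z followed by a..z.
CODEPOINT = sorted(string.ascii_letters)

def range_members(order, lo, hi):
    """Return every element from lo to hi (inclusive) in the given order."""
    start, end = order.index(lo), order.index(hi)
    return order[start:end + 1]

# What a range like [a-z] would cover under each ordering.
interleaved_range = range_members(INTERLEAVED, "a", "z")
codepoint_range = range_members(CODEPOINT, "a", "z")

print("interleaved:", "".join(interleaved_range))
print("code point: ", "".join(codepoint_range))
```

Under the interleaved order the range from "a" to "z" also picks up "A"
through "Y", which is the kind of surprise rational ranges are meant to avoid.
Under code point order it covers only the 26 lowercase letters.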

(4) When was the ISO 14651 data last updated for glibc?

In 2018 we updated to ISO 14651 4th Edition, which was harmonized with
Unicode 9.0.0.

We have not updated to 5th or 6th Edition yet.

I've filed the following bug to track this:
Bug 28528 - Update to ISO 14651 6th Edition 2020.
https://sourceware.org/bugzilla/show_bug.cgi?id=28528

Hopefully this answers your questions.

-- 
Cheers,
Carlos.

