public inbox for glibc-bugs@sourceware.org
From: "myllynen at redhat dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug localedata/16061] Review / update transliteration data
Date: Mon, 04 May 2015 07:53:00 -0000
Message-ID: <bug-16061-131-O6vnhnxkxX@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-16061-131@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=16061

--- Comment #4 from Marko Myllynen <myllynen at redhat dot com> ---
(In reply to Mike FABIAN from comment #2)
> (In reply to Marko Myllynen from comment #0)
> 
> C-translit.h.in seems to be manually edited and not generated from
> Unicode data.

Based on earlier changelog comments, it seems that C-translit.h.in was updated
manually for Unicode 3.2.0. Should it now be updated for Unicode 7.0.0 by some
means?

As discussed off-list, it seems that some transliterations are defined only in
C-translit.h.in (like U+20B9, INDIAN RUPEE SIGN). They take effect only with the
C/POSIX locale and are not in any translit_* file. Should C-translit.h.in
and the translit_* files be synced for such cases? Or should C/POSIX perhaps be
"pure", with no rules except those derived from Unicode, while the
rest could use locally added rules as well?

> These files seem to be automatically generated with some manual additions:
> 
>     locales/translit_circle
>     locales/translit_cjk_compat
>     locales/translit_combining
>     locales/translit_compat
>     locales/translit_font  
>     locales/translit_fraction
> 
> my patch updates them automatically from UnicodeData.txt keeping
> the manual additions wherever they seem to make sense.

Related to the above, I wonder whether we should make local changes more obvious,
for example by having translit_combining_unicode included from translit_combining.
It would make it much easier for others to see which definitions come from
Unicode and which are provided by glibc. Alternatively, we could group
the generated rules separately inside the translit_* files.

> is apparently manually edited and not generated.
> 
>     locales/translit_cjk_variants
> 
> is not generated from Unicode data either but from a UniVariants.Z
> file which can still be found here:
> 
> http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/ftp/CJKtable/UniVariants.Z
> 
> It is from 2002-08-15 and I have no idea how it has been created.
> So I did not touch translit_cjk_variants.

Perhaps we could add a note about its origins to the file.

> The following files
> 
>     locales/translit_hangul
>     locales/translit_narrow
>     locales/translit_small
>     locales/translit_wide
> 
> are automatically generated, but generating them automatically from
> Unicode 7.0.0 data would just reproduce the files as they are now,
> there have been no updates. Therefore I didn’t write generator
> scripts for these. Would generator scripts nevertheless be useful,
> so that we would notice if a change happens? I think a change
> in these files is very unlikely though.

It indeed sounds unlikely, but having a generator available might make things
easier 10 or 20 years from now if someone wants to verify the situation then.
But I think it's your call; I'm OK either way.
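
A minimal sketch of what such a check could look like for translit_wide,
assuming the rules correspond to the <wide> compatibility decompositions in
UnicodeData.txt. This is not the generator from the patch: the function names
wide_mappings and render are hypothetical, the output layout only approximates
the real file, and Python's unicodedata module carries whatever Unicode
version the interpreter ships, not necessarily 7.0.0.

```python
import sys
import unicodedata


def wide_mappings():
    """Yield (wide, narrow) code point pairs (hypothetical helper).

    A character qualifies when its decomposition field has the form
    "<wide> XXXX", i.e. a single-character compatibility mapping.
    """
    for cp in range(sys.maxunicode + 1):
        fields = unicodedata.decomposition(chr(cp)).split()
        if len(fields) == 2 and fields[0] == "<wide>":
            yield cp, int(fields[1], 16)


def render(pairs):
    """Render pairs as rule lines; the exact file layout is an assumption."""
    return ["<U%04X> <U%04X>" % pair for pair in pairs]


if __name__ == "__main__":
    # Print the regenerated rules so they can be compared with the
    # checked-in file.
    print("\n".join(render(wide_mappings())))
```

Comparing this output against the rules in the checked-in file after each
Unicode update would show whether anything has changed.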

> > Some individual examples of currently missing characters are U+00D8 (Ø)
> 
> This is already here:
> 
>     translit_neutral:<U00D8> "<U004F><U0045>"

Yes, it was added a bit after this report; see bug 15593 and commit f20820.

> (manually edited). And my patch adds it to translit_combining as:
> 
>     +% LATIN CAPITAL LETTER O WITH STROKE
>     +<U00D8> <U004F>
> 
> (But as a special hack, this does not come from UnicodeData.txt).

Please see the above bug for more discussion on this. I am not sure there is one
right answer as to which transliteration is the correct one to use here.

Also, shouldn't Ø and Æ be handled in the same way?

Looking at translit_neutral in more detail, I think it is actually the wrong place
for letters. It should contain non-letters only, and if specific rules are
needed for letters like Ø or Æ, those should be added directly to locale files
(so the patch discussed in bug 15593 should not have been applied to
translit_neutral after all). This would also mean that the special rules in the
generator for cases like EM DASH and EN DASH should probably end up in
translit_neutral, not translit_combining.

> > and U+0110 (Đ)
> 
> Adding this seems to make sense as well

Perhaps it might be best to start with a minimal set of special rules and commit
additional ones later? (For example, I'd like to see U+00D0, U+00DE, and U+014A
added, along with their lowercase counterparts.)

> > but some characters (like U+00D6, Ö) have decomposition defined in
> > Unicode but not in glibc.
> 
> glibc had this already in translit_combining:
> 
> (was already there, not added by my patch, it is generated from
> UnicodeData.txt by decomposing to U+004F U+0308 and then stripping the
> combining character U+0308).

Yes, I think what I meant to say was that the decomposition to U+004F U+0308
was missing, but as you point out, it is defined in some locales where it would
be needed. By the way, I wonder whether U+00D6 should actually decompose to
U+004F U+00A8 after U+004F U+0308 in those locales?
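
To make the decompose-and-strip step discussed above concrete, here is a
minimal sketch using Python's standard unicodedata module. It is not the
generator from the patch: the helper names strip_combining and
as_translit_rule are hypothetical, the rule layout only imitates the lines
quoted in this thread, and the Unicode data is whatever version the Python
build ships.

```python
import unicodedata


def strip_combining(ch):
    """Canonically decompose ch and drop combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)


def as_translit_rule(ch):
    """Format a rule roughly in the <UXXXX> <UXXXX> style quoted in this thread."""
    target = strip_combining(ch)
    rhs = "".join("<U%04X>" % ord(c) for c in target)
    return "<U%04X> %s" % (ord(ch), rhs)


# U+00D6 decomposes to U+004F U+0308; stripping U+0308 leaves U+004F.
print(unicodedata.decomposition("\u00d6"))
print(as_translit_rule("\u00d6"))

# U+00D8 has an empty decomposition field, so this approach leaves it
# unchanged and a manual rule is needed for it.
print(repr(unicodedata.decomposition("\u00d8")))
```

This also illustrates why Ø needs a special case: unlike Ö, it has no
decomposition in the Unicode data to derive a rule from.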

Thanks.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
Thread overview: 8+ messages
2013-10-18  8:04 [Bug localedata/16061] New: " myllynen at redhat dot com
2014-02-18  9:24 ` [Bug localedata/16061] " pravin.d.s at gmail dot com
2014-06-13 12:38 ` fweimer at redhat dot com
2014-10-10 15:26 ` maiku.fabian at gmail dot com
2015-04-28 17:30 ` maiku.fabian at gmail dot com
2015-04-29  7:12 ` maiku.fabian at gmail dot com
2015-05-04  7:53 ` myllynen at redhat dot com [this message]
2015-05-04 10:42 ` maiku.fabian at gmail dot com
