public inbox for glibc-bugs@sourceware.org
From: "myllynen at redhat dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug localedata/16061] Review / update transliteration data
Date: Mon, 04 May 2015 07:53:00 -0000
Message-ID: <bug-16061-131-O6vnhnxkxX@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-16061-131@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=16061

--- Comment #4 from Marko Myllynen <myllynen at redhat dot com> ---
(In reply to Mike FABIAN from comment #2)
> (In reply to Marko Myllynen from comment #0)
> > C-translit.h.in seems to be manually edited and not generated from
> Unicode data.

Based on earlier changelog comments it seems that C-translit.h.in was
updated manually for Unicode 3.2.0. Should it now be updated for
Unicode 7.0.0 by some means?

As discussed off-list, it seems that there are transliterations defined
only in C-translit.h.in (like U+20B9, INDIAN RUPEE SIGN) which take
effect only with the C/POSIX locale but are not in any translit_*
files. Should C-translit.h.in and the translit_* files be synced for
such cases? Or should C/POSIX perhaps be "pure", without any rules
except those derived from Unicode, while the rest could use locally
added rules as well?

> These files seem to be automatically generated with some manual additions:
>
> locales/translit_circle
> locales/translit_cjk_compat
> locales/translit_combining
> locales/translit_compat
> locales/translit_font
> locales/translit_fraction
>
> my patch updates them automatically from UnicodeData.txt keeping
> the manual additions wherever they seem to make sense.

Related to the above, I wonder whether we should make local changes
more obvious, for example by having translit_combining_unicode included
from translit_combining? It would make it much easier for others to see
which definitions come from Unicode and which ones are provided by
glibc. Alternatively, the generated rules could be grouped separately
inside the translit_* files.

> is apparently manually edited and not generated.
>
> locales/translit_cjk_variants
>
> is not generated from Unicode data either but from a UniVariants.Z
> file which can still be found here:
>
> http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/ftp/CJKtable/UniVariants.Z
>
> It is from 2002-08-15 and I have no idea how it has been created.
> So I did not touch /translit_cjk_variants.

Perhaps we could add a note about its origins to the file.

> The following files
>
> locales/translit_hangul
> locales/translit_narrow
> locales/translit_small
> locales/translit_wide
>
> are automatically generated, but generating them automatically from
> Unicode 7.0.0 data would just reproduce the files as they are now,
> there have been no updates. Therefore I didn’t write generator
> scripts for these. Would generator scripts nevertheless be useful,
> so that we would notice if a change happens? I think a change
> in these files is very unlikely though.

It does sound unlikely, but having a generator available might make
things easier 10 or 20 years from now if someone wants to verify the
situation then. But I think it's your call, I'm ok either way.

> > Some individual examples of currently missing characters are U+00D8 (Ø)
>
> This is already here:
>
> translit_neutral:<U00D8> "<U004F><U0045>"

Yes, it was added a bit after this report, see bug 15593 and commit
f20820.

> (manually edited). And my patch adds it to translit_combining as:
>
> +% LATIN CAPITAL LETTER O WITH STROKE
> +<U00D8> <U004F>
>
> (But as a special hack, this does not come from UnicodeData.txt).

Please see the above bug for more discussion on this; I'm not sure
there is one right answer as to which transliteration is the correct
one to use here. Also, shouldn't Ø and Æ be handled in the same way?

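To illustrate why Ø needs a special case while a character like Ö does not, here is a minimal Python sketch. It is not the generator script from the patch; the function name `translit_combining_rule` is hypothetical, and it relies on the Unicode data bundled with the Python interpreter (which may be newer than 7.0.0) instead of reading UnicodeData.txt directly. It takes the canonical decomposition of a character, drops the combining marks, and prints the result in the same form as the translit_combining entry quoted above.

```python
import unicodedata


def translit_combining_rule(ch):
    """Return a translit_combining-style rule for ch, or None.

    Hypothetical helper for illustration only: it uses Python's bundled
    Unicode data, not the UnicodeData.txt file used by the real patch.
    """
    # Full canonical decomposition (NFD); unchanged text means the
    # character has no canonical decomposition at all.
    decomposed = unicodedata.normalize("NFD", ch)
    if decomposed == ch:
        return None
    # Strip combining marks (non-zero canonical combining class).
    base = [c for c in decomposed if unicodedata.combining(c) == 0]
    if not base:
        return None
    source = "<U%04X>" % ord(ch)
    target = "".join("<U%04X>" % ord(c) for c in base)
    if len(base) > 1:
        # Multi-character replacements are written as quoted strings.
        target = '"%s"' % target
    return "%% %s\n%s %s" % (unicodedata.name(ch), source, target)


if __name__ == "__main__":
    print(translit_combining_rule("\u00d6"))  # Ö decomposes to O + U+0308
    print(translit_combining_rule("\u00d8"))  # Ø has no decomposition
```

With this approach Ö yields a rule mapping `<U00D6>` to `<U004F>`, while Ø yields nothing because the Unicode data defines no decomposition for it, so any rule for Ø has to be added by hand.
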
Looking at translit_neutral in more detail, I think it is actually the
wrong place for letters: it should contain non-letters only, and if
specific rules are needed for letters like Ø or Æ, those should be
added directly in the locale files (so the patch discussed in bug 15593
should not have been applied to translit_neutral after all). This would
also mean that the special rules in the generator for cases like EM
DASH and EN DASH should probably end up in translit_neutral, not
translit_combining.

> > and U+0110 (Đ)
>
> Adding this seems to make sense as well

Perhaps it would be best to start with a minimal set of special rules
and commit additional ones later (for example, I'd like to see U+00D0,
U+00DE, and U+014A added with their lowercase counterparts)?

> > but some characters (like U+00D6, Ö) have decomposition defined in
> > Unicode but not in glibc.
>
> glibc had this already in translit_combining:
>
> (was already there, not added by my patch, it is generated from
> UnicodeData.txt by decomposing to U+004F U+0308 and then stripping the
> combining character U+0308).

Yes, I think what I meant to say was that the decomposition to U+004F
U+0308 was missing, but as you point out it is defined in some locales
where it would be needed. Btw, I wonder whether U+00D6 should actually
decompose to U+004F U+00A8 after U+004F U+0308 in those locales?

Thanks.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

>From glibc-bugs-return-28139-listarch-glibc-bugs=sources.redhat.com@sourceware.org Mon May 04 10:02:19 2015
Return-Path: <glibc-bugs-return-28139-listarch-glibc-bugs=sources.redhat.com@sourceware.org>
Delivered-To: listarch-glibc-bugs@sources.redhat.com
Received: (qmail 20163 invoked by alias); 4 May 2015 10:02:19 -0000
Mailing-List: contact glibc-bugs-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <glibc-bugs.sourceware.org>
List-Subscribe: <mailto:glibc-bugs-subscribe@sourceware.org>
List-Post: <mailto:glibc-bugs@sourceware.org>
List-Help: <mailto:glibc-bugs-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: glibc-bugs-owner@sourceware.org
Delivered-To: mailing list glibc-bugs@sourceware.org
Received: (qmail 20114 invoked by uid 48); 4 May 2015 10:02:15 -0000
From: "fweimer at redhat dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug libc/18369] va_list versions of error.h functions (error, error_at_line) missing
Date: Mon, 04 May 2015 10:02:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: glibc
X-Bugzilla-Component: libc
X-Bugzilla-Version: 2.21
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: fweimer at redhat dot com
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at sourceware dot org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: security-
X-Bugzilla-Changed-Fields: cc flagtypes.name
Message-ID: <bug-18369-131-zs1D0C6tHB@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-18369-131@http.sourceware.org/bugzilla/>
References: <bug-18369-131@http.sourceware.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2015-05/txt/msg00004.txt.bz2
Content-length: 453

https://sourceware.org/bugzilla/show_bug.cgi?id=18369

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fweimer at redhat dot com
              Flags|                            |security-

-- 
You are receiving this mail because:
You are on the CC list for the bug.
Thread overview: 8+ messages
2013-10-18  8:04 [Bug localedata/16061] New: " myllynen at redhat dot com
2014-02-18  9:24 ` [Bug localedata/16061] " pravin.d.s at gmail dot com
2014-06-13 12:38 ` fweimer at redhat dot com
2014-10-10 15:26 ` maiku.fabian at gmail dot com
2015-04-28 17:30 ` maiku.fabian at gmail dot com
2015-04-29  7:12 ` maiku.fabian at gmail dot com
2015-05-04  7:53 ` myllynen at redhat dot com [this message]
2015-05-04 10:42 ` maiku.fabian at gmail dot com