public inbox for glibc-bugs@sourceware.org
* [Bug localedata/16061] New: Review / update transliteration data
@ 2013-10-18  8:04 myllynen at redhat dot com
  2014-02-18  9:24 ` [Bug localedata/16061] " pravin.d.s at gmail dot com
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: myllynen at redhat dot com @ 2013-10-18  8:04 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=16061

            Bug ID: 16061
           Summary: Review / update transliteration data
           Product: glibc
           Version: 2.18
            Status: NEW
          Severity: normal
          Priority: P2
         Component: localedata
          Assignee: unassigned at sourceware dot org
          Reporter: myllynen at redhat dot com
                CC: libc-locales at sourceware dot org

The localedata/locales/translit_* files are probably, based on comments in
them, at least partially generated from some version of UnicodeData.txt (based
on 93a568 it looks like the last major update has been for Unicode 3.2 and
17b16e suggests them originally coming from an external contributor). However,
there are some characters missing even from the Latin-1 Supplement block and in
general it doesn't seem possible to update the files just by using
UnicodeData.txt. Some of the rules live in locale/C-translit.h /
locale/C-translit.h.in which also contain local changes (like 61d5a6 / 2a81ea).

It will likely take a lot of work to understand how the files were
generated in the first place, how to identify relevant local changes, and how
to automate updating them in the future.

Some individual examples of currently missing characters are U+00D8 (Ø) and
U+0110 (Đ) whereas other characters like U+00C6 (Æ) and U+0141 (Ł) from their
blocks (Latin-1 Supplement and Latin Extended-A, respectively) are present.
Some characters (like U+2033, ″) have their decomposition defined as it is in
Unicode, but others (like U+00D6, Ö) have a decomposition defined in Unicode
but not in glibc.

-- 
You are receiving this mail because:
You are on the CC list for the bug.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug localedata/16061] Review / update transliteration data
  2013-10-18  8:04 [Bug localedata/16061] New: Review / update transliteration data myllynen at redhat dot com
@ 2014-02-18  9:24 ` pravin.d.s at gmail dot com
  2014-06-13 12:38 ` fweimer at redhat dot com
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: pravin.d.s at gmail dot com @ 2014-02-18  9:24 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=16061

Pravin S <pravin.d.s at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pravin.d.s at gmail dot com



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug localedata/16061] Review / update transliteration data
  2013-10-18  8:04 [Bug localedata/16061] New: Review / update transliteration data myllynen at redhat dot com
  2014-02-18  9:24 ` [Bug localedata/16061] " pravin.d.s at gmail dot com
@ 2014-06-13 12:38 ` fweimer at redhat dot com
  2014-10-10 15:26 ` maiku.fabian at gmail dot com
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: fweimer at redhat dot com @ 2014-06-13 12:38 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=16061

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|                            |security-



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug localedata/16061] Review / update transliteration data
  2013-10-18  8:04 [Bug localedata/16061] New: Review / update transliteration data myllynen at redhat dot com
  2014-02-18  9:24 ` [Bug localedata/16061] " pravin.d.s at gmail dot com
  2014-06-13 12:38 ` fweimer at redhat dot com
@ 2014-10-10 15:26 ` maiku.fabian at gmail dot com
  2015-04-28 17:30 ` maiku.fabian at gmail dot com
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: maiku.fabian at gmail dot com @ 2014-10-10 15:26 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=16061

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |maiku.fabian at gmail dot com



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug localedata/16061] Review / update transliteration data
  2013-10-18  8:04 [Bug localedata/16061] New: Review / update transliteration data myllynen at redhat dot com
                   ` (2 preceding siblings ...)
  2014-10-10 15:26 ` maiku.fabian at gmail dot com
@ 2015-04-28 17:30 ` maiku.fabian at gmail dot com
  2015-04-29  7:12 ` maiku.fabian at gmail dot com
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: maiku.fabian at gmail dot com @ 2015-04-28 17:30 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=16061

--- Comment #1 from Mike FABIAN <maiku.fabian at gmail dot com> ---
I made a patch to update most
of the translit_* files, see:

https://sourceware.org/ml/libc-alpha/2015-04/msg00361.html

Joseph S. Myers answered on libc-alpha:

Joseph S. Myers> This is [BZ #16061] (I don't know if it fully
Joseph S. Myers> addresses everything discussed in that bug, but it's
Joseph S. Myers> at least relevant to that bug so the bug number
Joseph S. Myers> should be mentioned in the ChangeLog entry, whether
Joseph S. Myers> or not it's also added to NEWS and the bug closed
Joseph S. Myers> once the patch is in).  I don't know if it's relevant
Joseph S. Myers> to any other open transliteration bug.

Ah, yes, I’ll add the bug number.



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug localedata/16061] Review / update transliteration data
  2013-10-18  8:04 [Bug localedata/16061] New: Review / update transliteration data myllynen at redhat dot com
                   ` (3 preceding siblings ...)
  2015-04-28 17:30 ` maiku.fabian at gmail dot com
@ 2015-04-29  7:12 ` maiku.fabian at gmail dot com
  2015-05-04  7:53 ` myllynen at redhat dot com
  2015-05-04 10:42 ` maiku.fabian at gmail dot com
  6 siblings, 0 replies; 8+ messages in thread
From: maiku.fabian at gmail dot com @ 2015-04-29  7:12 UTC (permalink / raw)
  To: glibc-bugs

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="UTF-8", Size: 7812 bytes --]

https://sourceware.org/bugzilla/show_bug.cgi?id=16061

--- Comment #2 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Marko Myllynen from comment #0)
> The localedata/locales/translit_* files are probably, based on comments in
> them, at least partially generated from some version of UnicodeData.txt
> (based on 93a568 it looks like the last major update has been for Unicode
> 3.2 and 17b16e suggests them originally coming from an external
> contributor). However, there are some characters missing even from the
> Latin-1 Supplement block and in general it doesn't seem possible to update
> the files just by using UnicodeData.txt. Some of the rules live in
> locale/C-translit.h / locale/C-translit.h.in which also contain local
> changes (like 61d5a6 / 2a81ea).

C-translit.h.in seems to be manually edited and not generated from
Unicode data.

> It requires likely a lot of work to understand how the files have been
> generated in the first place, how to identify relevant local changes, and
> how to automate the process to update them in the future.

These files seem to be automatically generated with some manual additions:

    locales/translit_circle
    locales/translit_cjk_compat
    locales/translit_combining
    locales/translit_compat
    locales/translit_font  
    locales/translit_fraction

my patch updates them automatically from UnicodeData.txt keeping
the manual additions wherever they seem to make sense.

    locales/translit_neutral

is apparently manually edited and not generated.

    locales/translit_cjk_variants

is not generated from Unicode data either but from a UniVariants.Z
file which can still be found here:

http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/ftp/CJKtable/UniVariants.Z

It is from 2002-08-15 and I have no idea how it has been created.
So I did not touch /translit_cjk_variants.

The following files

    locales/translit_hangul
    locales/translit_narrow
    locales/translit_small
    locales/translit_wide

are automatically generated, but generating them automatically from
Unicode 7.0.0 data would just reproduce the files as they are now,
there have been no updates. Therefore I didn’t write generator
scripts for these. Would generator scripts nevertheless be useful,
so that we would notice if a change happens? I think a change
in these files is very unlikely though.

> Some individual examples of currently missing characters are U+00D8 (Ø)

This is already here:

    translit_neutral:<U00D8> "<U004F><U0045>"

(manually edited). And my patch adds it to translit_combining as:

    +% LATIN CAPITAL LETTER O WITH STROKE
    +<U00D8> <U004F>

(But as a special hack, this does not come from UnicodeData.txt).

> and U+0110 (Đ)

Adding this seems to make sense as well, I added it to the “special
hack” section of my gen_translit_combining.py:

    special_decompose_dict = {
        (0x0110,): [0x0044], # Đ → D
        (0x0111,): [0x0064], # đ → d
    ...
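
To illustrate why such a "special hack" table is needed, here is a minimal
sketch (not code from the patch) of a lookup that prefers a manual override and
otherwise falls back to the Unicode decomposition. It uses Python's standard
unicodedata module as a stand-in for parsing UnicodeData.txt; the names
SPECIAL_DECOMPOSE and decompose are hypothetical, and only the two table
entries quoted above come from the patch.

```python
import unicodedata

# Hypothetical stand-in for the patch's special_decompose_dict; only these
# two entries are quoted in this thread.
SPECIAL_DECOMPOSE = {
    (0x0110,): [0x0044],  # Đ → D
    (0x0111,): [0x0064],  # đ → d
}


def decompose(code_point):
    """Return replacement code points, preferring a manual override."""
    key = (code_point,)
    if key in SPECIAL_DECOMPOSE:
        return SPECIAL_DECOMPOSE[key]
    fields = unicodedata.decomposition(chr(code_point)).split()
    # Drop a compatibility tag such as "<compat>" if one is present.
    return [int(field, 16) for field in fields if not field.startswith("<")]


# U+0110 has no decomposition in the Unicode data, so only the override helps.
print(unicodedata.decomposition("\u0110") == "")
print([hex(cp) for cp in decompose(0x0110)])
print([hex(cp) for cp in decompose(0x00D6)])
```

The override table is consulted first so that manual rules win even if a later
Unicode version were to add a decomposition for the same character; that
ordering is a design assumption of this sketch.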

> whereas other characters like U+00C6 (Æ) and U+0141
> (Ł) from their blocks (Latin-1 Supplement and Latin Extended-A,
> respectively) are present. Some characters (like U+2033, ″) have
> decomposition defined as is in Unicode

Yes, this one is in translit_compat:

    $ grep -B1 U2033 translit_compat 
    % DOUBLE PRIME
    <U2033> "<U2032><U2032>"

(was already there, not added by my patch, it is generated from
UnicodeData.txt).
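
As a rough sketch of how such an entry can be derived, the snippet below reads
the compatibility decomposition of a character and prints it in the same
notation as the entry above. It is not the generator from the patch: it uses
Python's unicodedata module instead of parsing UnicodeData.txt, and the helper
names ucs and compat_rule are made up for this example.

```python
import unicodedata


def ucs(code_point):
    """Format a code point in the <UXXXX> notation used by the locale files."""
    return "<U{:04X}>".format(code_point)


def compat_rule(char):
    """Return a translit_compat-style entry, or None if there is no
    <compat> decomposition for the character."""
    fields = unicodedata.decomposition(char).split()
    if not fields or fields[0] != "<compat>":
        return None
    targets = "".join(ucs(int(field, 16)) for field in fields[1:])
    comment = "% " + unicodedata.name(char)
    return '{}\n{} "{}"'.format(comment, ucs(ord(char)), targets)


print(compat_rule("\u2033"))
```

For U+2033 this prints the same two lines as the grep output above.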

> but some characters (like U+00D6, Ö) have decomposition defined in
> Unicode but not in glibc.

glibc had this already in translit_combining:

    $ grep -B1 U00D6 translit_combining
    % LATIN CAPITAL LETTER O WITH DIAERESIS
    <U00D6> <U004F>

(was already there, not added by my patch, it is generated from
UnicodeData.txt by decomposing to U+004F U+0308 and then stripping the
combining character U+0308).
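
The decompose-then-strip step can be sketched in a few lines of Python. This is
only an illustration using the standard unicodedata module, not the code from
the patch, and strip_combining is a name chosen for this example.

```python
import unicodedata


def strip_combining(char):
    """Canonically decompose a character and drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", char)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(strip_combining("\u00D6"))  # Ö decomposes to U+004F U+0308
print(strip_combining("\u00D8"))  # Ø has no decomposition and is unchanged
```

The second call shows why Ø needs a manually added rule: there is nothing to
strip, so the character comes back unchanged.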

Before commit 18a3a9a3 this was in locale/C-translit.h.in,
but it was apparently removed on purpose by that commit:

    -/* <U00D6> LATIN CAPITAL LETTER O WITH DIAERESIS.  */
    -/* XXX It is not clear whether this is the best transliteration for
    -   all locales.  If not, we probably have to take it out completely.  */
    -"\xd6"   "OE"

“Ö” is transliterated to “OE” for example in German, but in English
one usually transliterates it just as “O”. Therefore, translit_combining
transliterates it to “O” by decomposing and stripping the combining
character and locales like de_DE add their own transliteration rules:

    $ grep -A20 translit_start de_DE
    translit_start

    include "translit_combining";""

    % German umlauts.
    % LATIN CAPITAL LETTER A WITH DIAERESIS.
    <U00C4> "<U0041><U0308>";"<U0041><U0045>"
    % LATIN CAPITAL LETTER O WITH DIAERESIS.
    <U00D6> "<U004F><U0308>";"<U004F><U0045>"
    % LATIN CAPITAL LETTER U WITH DIAERESIS.
    <U00DC> "<U0055><U0308>";"<U0055><U0045>"
    % LATIN SMALL LETTER A WITH DIAERESIS.
    <U00E4> "<U0061><U0308>";"<U0061><U0065>"
    % LATIN SMALL LETTER O WITH DIAERESIS.
    <U00F6> "<U006F><U0308>";"<U006F><U0065>"
    % LATIN SMALL LETTER U WITH DIAERESIS.
    <U00FC> "<U0075><U0308>";"<U0075><U0065>"

    % Danish.
    % LATIN CAPITAL LETTER A WITH RING ABOVE.
    <U00C5> "<U0041><U030A>";"<U0041><U0041>"
    ...
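
Each rule above lists alternatives separated by ";". As I understand it, they
are tried in order and the first one that can be represented in the target
character set is used. The following sketch models that behaviour in Python; it
is not glibc's implementation, and parse_rule and transliterate are
hypothetical helpers written for this example.

```python
import re

RULE = '<U00D6> "<U004F><U0308>";"<U004F><U0045>"'


def decode(text):
    """Turn a run of <UXXXX> symbols into a Python string."""
    hex_codes = re.findall(r"<U([0-9A-Fa-f]{4,8})>", text)
    return "".join(chr(int(code, 16)) for code in hex_codes)


def parse_rule(line):
    """Split a rule into its source character and ordered candidates."""
    source, _, rest = line.partition(" ")
    return decode(source), [decode(part) for part in rest.split(";")]


def transliterate(candidates, encoding):
    """Return the first candidate representable in the target encoding."""
    for candidate in candidates:
        try:
            candidate.encode(encoding)
            return candidate
        except UnicodeEncodeError:
            continue
    return None


source, candidates = parse_rule(RULE)
print(source, "->", transliterate(candidates, "ascii"))
```

With an ASCII target the first candidate (O followed by U+0308) cannot be
encoded, so the sketch falls through to "OE"; with a UTF-8 target the first
candidate would be chosen.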


From: maiku.fabian at gmail dot com @ 2015-04-29  7:24 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=16061

--- Comment #3 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Created attachment 8287
  --> https://sourceware.org/bugzilla/attachment.cgi?id=8287&action=edit
0001-Update-the-translit-files-to-Unicode-7.0.0.patch

updated patch.



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug localedata/16061] Review / update transliteration data
  2013-10-18  8:04 [Bug localedata/16061] New: Review / update transliteration data myllynen at redhat dot com
                   ` (4 preceding siblings ...)
  2015-04-29  7:12 ` maiku.fabian at gmail dot com
@ 2015-05-04  7:53 ` myllynen at redhat dot com
  2015-05-04 10:42 ` maiku.fabian at gmail dot com
  6 siblings, 0 replies; 8+ messages in thread
From: myllynen at redhat dot com @ 2015-05-04  7:53 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=16061

--- Comment #4 from Marko Myllynen <myllynen at redhat dot com> ---
(In reply to Mike FABIAN from comment #2)
> (In reply to Marko Myllynen from comment #0)
> 
> C-translit.h.in seems to be manually edited and not generated from
> Unicode data.

Based on earlier changelog comments, it seems that C-translit.h.in was updated
manually for Unicode 3.2.0. Should it now be updated for Unicode 7.0.0 by some
means?

As discussed off-list, some transliterations (like U+20B9, INDIAN RUPEE SIGN)
are defined only in C-translit.h.in, so they take effect only with the C/POSIX
locale and are not in any translit_* file. Should C-translit.h.in and the
translit_* files be synced for such cases? Or should C/POSIX perhaps be "pure",
with no rules other than those derived from Unicode, while the rest could use
locally added rules as well?

> These files seem to be automatically generated with some manual additions:
> 
>     locales/translit_circle
>     locales/translit_cjk_compat
>     locales/translit_combining
>     locales/translit_compat
>     locales/translit_font  
>     locales/translit_fraction
> 
> my patch updates them automatically from UnicodeData.txt keeping
> the manual additions wherever they seem to make sense.

Related to the above, I wonder whether we should make local changes more
obvious, for example by having translit_combining_unicode included from
translit_combining. It would make it much easier for others to see which
definitions come from Unicode and which are provided by glibc. Alternatively,
the generated rules could be grouped separately inside the translit_* files.

> is apparently manually edited and not generated.
> 
>     locales/translit_cjk_variants
> 
> is not generated from Unicode data either but from a UniVariants.Z
> file which can still be found here:
> 
> http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/ftp/CJKtable/UniVariants.Z
> 
> It is from 2002-08-15 and I have no idea how it has been created.
> So I did not touch /translit_cjk_variants.

Perhaps we could add a note about its origins to the file.

> The following files
> 
>     locales/translit_hangul
>     locales/translit_narrow
>     locales/translit_small
>     locales/translit_wide
> 
> are automatically generated, but generating them automatically from
> Unicode 7.0.0 data would just reproduce the files as they are now,
> there have been no updates. Therefore I didn’t write generator
> scripts for these. Would generator scripts nevertheless be useful,
> so that we would notice if a change happens? I think a change
> in these files is very unlikely though.

It indeed sounds unlikely but having a generator available might make things
easier 10 or 20 years from now if someone wants to verify the situation then.
But I think it's your call, I'm ok either way.

> > Some individual examples of currently missing characters are U+00D8 (Ø)
> 
> This is already here:
> 
>     translit_neutral:<U00D8> "<U004F><U0045>"

Yes, it was added a bit after this report, see bug 15593 and commit f20820.

> (manually edited). And my patch adds it to translit_combining as:
> 
>     +% LATIN CAPITAL LETTER O WITH STROKE
>     +<U00D8> <U004F>
> 
> (But as a special hack, this does not come from UnicodeData.txt).

Please see the above bug for more discussion on this; I am not sure there is
one right answer as to which transliteration is the correct one to use here.

Also, shouldn't Ø and Æ be handled in the same way?

Looking at translit_neutral in more detail, I think it's actually the wrong
place for letters. It should contain non-letters only, and if specific rules
are needed for letters like Ø or Æ, those should be added directly in locale
files (so the patch discussed in bug 15593 should not have been applied to
translit_neutral after all). This would also mean that the special rules in the
generator for cases like EM DASH and EN DASH should probably end up in
translit_neutral, not translit_combining.

> > and U+0110 (Đ)
> 
> Adding this seems to make sense as well

Perhaps it might be best to start with a minimal set of special rules and
commit additional ones later (for example, I'd like to see U+00D0, U+00DE, and
U+014A added along with their lowercase counterparts)?

> > but some characters (like U+00D6, Ö) have decomposition defined in
> > Unicode but not in glibc.
> 
> glibc had this already in translit_combining:
> 
> (was already there, not added by my patch, it is generated from
> UnicodeData.txt by decomposing to U+004F U+0308 and then stripping the
> combining character U+0308).

Yes, I think what I meant to say was that the decomposition to U+004F U+0308
was missing, but as you point out it is defined in some locales where it would
be needed. Btw, I wonder whether U+00D6 should actually decompose to
U+004F U+00A8 after U+004F U+0308 in those locales?

Thanks.



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug localedata/16061] Review / update transliteration data
  2013-10-18  8:04 [Bug localedata/16061] New: Review / update transliteration data myllynen at redhat dot com
                   ` (5 preceding siblings ...)
  2015-05-04  7:53 ` myllynen at redhat dot com
@ 2015-05-04 10:42 ` maiku.fabian at gmail dot com
  6 siblings, 0 replies; 8+ messages in thread
From: maiku.fabian at gmail dot com @ 2015-05-04 10:42 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=16061

--- Comment #5 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Marko Myllynen from comment #4)
> (In reply to Mike FABIAN from comment #2)
> > (In reply to Marko Myllynen from comment #0)
> > 
> > C-translit.h.in seems to be manually edited and not generated from
> > Unicode data.
> 
> Based on earlier changelog comments it seems that C-translit.h.in was
> updated manually for Unicode 3.2.0, should it now be updated for Unicode
> 7.0.0 by some means?

Probably, but how?

> > is apparently manually edited and not generated.
> > 
> >     locales/translit_cjk_variants
> > 
> > is not generated from Unicode data either but from a UniVariants.Z
> > file which can still be found here:
> > 
> > http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/ftp/CJKtable/UniVariants.Z
> > 
> > It is from 2002-08-15 and I have no idea how it has been created.
> > So I did not touch /translit_cjk_variants.
> 
> Perhaps we could add a note about its origins to the file.

There is already a note in the comment section of that file.

> Also, shouldn't Ø and Æ be handled in the same way?

What do you mean by “handled in the same way”? 

> Looking at translit_neutral in more detail, I think it's actually wrong
> place for letters, it should contain non-letters only and if specific rules
> are needed for letters like Ø or Æ, those should be added directly in locale
> files (so the patch discussed in bug 15593 should have not been applied to
> translit_neutral after all). This would also mean that the special rules in
> the generator for cases like EM DASH and EN DASH should probably end up to
> translit_neutral not translit_combining.

My guess is that the purpose of translit_neutral is to contain
transliterations which are locale “neutral”, i.e. are the same for
all locales. So I see no reason not to include letters.

> > > but some characters (like U+00D6, Ö) have decomposition defined in
> > > Unicode but not in glibc.
> > 
> > glibc had this already in translit_combining:
> > 
> > (was already there, not added by my patch, it is generated from
> > UnicodeData.txt by decomposing to U+004F U+0308 and then stripping the
> > combining character U+0308).
> 
> Yes, I think what I meant to say was that the decomposition to U+004F U+0308
> was missing but as you point out it is defined in some locales where it
> would be needed. Btw, I wonder should U+00D6 actually decompose to U+004F
> U+00A8 after U+004F U+0308 in those locales?

Ö -> O¨

Why? Is that a reasonable transliteration? It throws away less
information but I think it is common practice to transliterate Ö
just as O in English for example.


From: myllynen at redhat dot com @ 2015-05-04 11:37 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=16061

--- Comment #6 from Marko Myllynen <myllynen at redhat dot com> ---
(In reply to Mike FABIAN from comment #5)
> (In reply to Marko Myllynen from comment #4)
> > (In reply to Mike FABIAN from comment #2)
> > > (In reply to Marko Myllynen from comment #0)
> > > 
> > > C-translit.h.in seems to be manually edited and not generated from
> > > Unicode data.
> > 
> > Based on earlier changelog comments it seems that C-translit.h.in was
> > updated manually for Unicode 3.2.0, should it now be updated for Unicode
> > 7.0.0 by some means?
> 
> Probably, but how?

Good question - do you think it is feasible to use the generator to also
produce C-translit.h.in (sans the previous individual additions)?

> > Perhaps we could add a note about its origins to the file.
> 
> There is already a note in the comment section of that file.

Ah, not sure how I missed that.

> > Also, shouldn't Ø and Æ be handled in the same way?
> 
> What do you mean by “handled in the same way”? 

After applying the patch we would have different kinds of rules for Ø (U+00D8)
and Æ (U+00C6):

locales/translit_combining:<U00D8> <U004F>
locales/translit_neutral:<U00D8> "<U004F><U0045>"
locales/translit_combining:<U00C6> "<U0041><U0045>"
locales/translit_neutral:<U00C6> "<U0041><U0045>"
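
To make the difference concrete, here is a minimal sketch that parses the four
rule lines above and reports which characters get different results depending
on the file. It assumes each rule has exactly one target (real translit files
may list several alternatives separated by ";", which this does not handle),
and the helper names decode, collect and conflicts are made up for this
illustration; they are not part of glibc or its generator.

```python
import re
from collections import defaultdict

# The four rule lines quoted above, in "file:rule" form.
RULES = """\
locales/translit_combining:<U00D8> <U004F>
locales/translit_neutral:<U00D8> "<U004F><U0045>"
locales/translit_combining:<U00C6> "<U0041><U0045>"
locales/translit_neutral:<U00C6> "<U0041><U0045>"
"""

# Matches the <Uxxxx> symbolic names used in locale source files.
UCS = re.compile(r"<U([0-9A-Fa-f]{4,8})>")


def decode(symbols):
    """Turn a run of <Uxxxx> names (optionally quoted) into a string."""
    return "".join(chr(int(h, 16)) for h in UCS.findall(symbols))


def collect(text):
    """Map each source character to {file name: transliteration}."""
    table = defaultdict(dict)
    for line in text.splitlines():
        filename, rule = line.split(":", 1)
        source, target = rule.split(None, 1)
        table[decode(source)][filename] = decode(target)
    return table


def conflicts(table):
    """Keep only characters whose result differs between files."""
    return {ch: files for ch, files in table.items()
            if len(set(files.values())) > 1}


if __name__ == "__main__":
    for ch, files in conflicts(collect(RULES)).items():
        print(f"U+{ord(ch):04X} {ch}: {files}")
```

With these four lines only U+00D8 is reported (O in translit_combining, OE in
translit_neutral); U+00C6 maps to AE in both files.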

> > Looking at translit_neutral in more detail, I think it's actually wrong
> > place for letters, it should contain non-letters only and if specific rules
> > are needed for letters like Ø or Æ, those should be added directly in locale
> > files (so the patch discussed in bug 15593 should have not been applied to
> > translit_neutral after all). This would also mean that the special rules in
> > the generator for cases like EM DASH and EN DASH should probably end up to
> > translit_neutral not translit_combining.
> 
> My guess is that the purpose of translit_neutral is to contain
> transliterations which are locale “neutral”, i.e. are the same for
> all locales. So I see no reason not to include letters.

Yeah, outright excluding *all* letters might be too harsh for cases where it's
clear what the result should be, but from the discussion in bug 15593 and the
above handling of Ø I got the impression that translit_neutral is probably not
the right place for it. If a letter is being added to translit_combining by the
generator, isn't it better to have it there than in the manually created
translit_neutral? I see that i18n includes translit_neutral; I'm not sure
whether that imposes any requirements.

> > > > but some characters (like U+00D6, Ö) have decomposition defined in
> > > > Unicode but not in glibc.
> > > 
> > > glibc had this already in translit_combining:
> > > 
> > > (was already there, not added by my patch, it is generated from
> > > UnicodeData.txt by decomposing to U+004F U+0308 and then stripping the
> > > combining character U+0308).
> > 
> > Yes, I think what I meant to say was that the decomposition to U+004F U+0308
> > was missing but as you point out it is defined in some locales where it
> > would be needed. Btw, I wonder should U+00D6 actually decompose to U+004F
> > U+00A8 after U+004F U+0308 in those locales?
> 
> Ö -> O¨
> 
> Why? Is that a reasonable transliteration? It throws away less
> information but I think it is common practice to transliterate Ö
> just as O in English for example.

I was merely speculating on this, perhaps we can forget this part.
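
For reference, the "decompose, then strip the combining character" step
described in the quoted text can be sketched with Python's standard
unicodedata module. This is only an illustration of the idea, not the actual
generator; strip_combining is a hypothetical helper name.

```python
import unicodedata


def strip_combining(ch):
    """Canonically decompose ch and drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)


if __name__ == "__main__":
    for ch in ("\u00d6", "\u00d8", "\u00c6"):
        # decomposition() returns the raw UnicodeData.txt mapping, or ""
        # when the character has none.
        mapping = unicodedata.decomposition(ch) or "(none)"
        print(f"U+{ord(ch):04X} {ch}: decomposition {mapping}, "
              f"stripped {strip_combining(ch)}")
```

Ö decomposes to U+004F U+0308 and so reduces to O, whereas Ø and Æ have no
decomposition in UnicodeData.txt and come back unchanged, which is why they
need explicit rules somewhere.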

Thanks.



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2015-05-04 10:42 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-18  8:04 [Bug localedata/16061] New: Review / update transliteration data myllynen at redhat dot com
2014-02-18  9:24 ` [Bug localedata/16061] " pravin.d.s at gmail dot com
2014-06-13 12:38 ` fweimer at redhat dot com
2014-10-10 15:26 ` maiku.fabian at gmail dot com
2015-04-28 17:30 ` maiku.fabian at gmail dot com
2015-04-29  7:12 ` maiku.fabian at gmail dot com
2015-05-04  7:53 ` myllynen at redhat dot com
2015-05-04 10:42 ` maiku.fabian at gmail dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).