public inbox for glibc-bugs@sourceware.org
* [Bug locale/26984] New: conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters
@ 2020-11-30 13:59 vincent-srcware at vinc17 dot net
  2021-04-29 21:58 ` [Bug locale/26984] " carlos at redhat dot com
                   ` (12 more replies)
  0 siblings, 13 replies; 14+ messages in thread
From: vincent-srcware at vinc17 dot net @ 2020-11-30 13:59 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26984

            Bug ID: 26984
           Summary: conversion to ascii//TRANSLIT with iconv does not work
                    in C locale on many characters
           Product: glibc
           Version: 2.31
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: locale
          Assignee: unassigned at sourceware dot org
          Reporter: vincent-srcware at vinc17 dot net
  Target Milestone: ---

Conversion to ascii//TRANSLIT with iconv does not work in C locale on many
characters:

$ for i in C en_US.iso885915 en_US.utf8 ; do echo '.éèêëâîôûàùïçå.€.²³⁴.Ææ.' |
LC_ALL=$i /usr/bin/iconv -f utf-8 -t ascii//TRANSLIT ; done
.?????????????.EUR.???.AEae.
.eeeeaiouauica.EUR.234.AEae.
.eeeeaiouauica.EUR.234.AEae.

In this example, only €, Æ and æ are correctly handled in the C locale.

This is similar to bug 12031 comment 8, but that bug has been fixed (and
indeed, I can't reproduce the issue when using LANG).
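
A minimal sketch of the same reproduction as a reusable script. The helper name translit_ascii and the list of locale names are assumptions made for illustration; a locale that is not installed will most likely behave like C, so compare the names against the output of `locale -a` before drawing conclusions.

```shell
#!/bin/sh
# translit_ascii LOCALE TEXT
# Hypothetical helper: convert UTF-8 TEXT to ASCII with //TRANSLIT while
# iconv runs under LOCALE (set through LC_ALL, as in the report).
translit_ascii() {
    printf '%s\n' "$2" | LC_ALL="$1" iconv -f utf-8 -t ascii//TRANSLIT
}

sample='.éèêëâîôûàùïçå.€.²³⁴.Ææ.'

# The locale names are assumptions; adjust them to what `locale -a` lists.
for loc in C C.UTF-8 en_US.utf8; do
    printf '%-12s ' "$loc"
    translit_ascii "$loc" "$sample" || echo "(iconv exited with an error)"
done
```

Printing the locale name next to each result makes it easier to see which line of output belongs to which locale than in the one-line loop.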

-- 
You are receiving this mail because:
You are on the CC list for the bug.

* [Bug locale/26984] conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters
  2020-11-30 13:59 [Bug locale/26984] New: conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters vincent-srcware at vinc17 dot net
@ 2021-04-29 21:58 ` carlos at redhat dot com
  2021-04-30  0:55 ` vincent-srcware at vinc17 dot net
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: carlos at redhat dot com @ 2021-04-29 21:58 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26984

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |carlos at redhat dot com

--- Comment #1 from Carlos O'Donell <carlos at redhat dot com> ---
The POSIX and C locale have no specified transliteration rules. That means that
the LC_CTYPE contains no transliteration rules for any of the characters you
have specified if they are outside of ASCII and so they are replaced with
REPLACEMENT CHARACTER (roughly '?' in the C locale).

In practice this is not a bug, but a mismatch between expectations and what the
C locale provides. The C locale is very small and very restricted. Do you have
access to a distribution that ships C.UTF-8?

Example on Fedora:
for i in C.UTF-8 en_US.iso885915 en_US.utf8 ; do echo '.éèêëâîôûàùïçå.€.²³⁴.Ææ.' |
LC_ALL=$i /usr/bin/iconv -f utf-8 -t ascii//TRANSLIT ; done
.eeeeaiouauica.EUR.234.AEae.
.eeeeaiouauica.EUR.234.AEae.
.eeeeaiouauica.EUR.234.AEae.

Does that answer your question?

* [Bug locale/26984] conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters
  2020-11-30 13:59 [Bug locale/26984] New: conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters vincent-srcware at vinc17 dot net
  2021-04-29 21:58 ` [Bug locale/26984] " carlos at redhat dot com
@ 2021-04-30  0:55 ` vincent-srcware at vinc17 dot net
  2021-04-30  2:05 ` carlos at redhat dot com
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: vincent-srcware at vinc17 dot net @ 2021-04-30  0:55 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26984

--- Comment #2 from Vincent Lefèvre <vincent-srcware at vinc17 dot net> ---
(In reply to Carlos O'Donell from comment #1)
> The POSIX and C locale have no specified transliteration rules.

But why does this depend on the locale, anyway? This is silly! And this doesn't
behave as documented: "The iconv program reads in text in one encoding and
outputs the text in another encoding."

* [Bug locale/26984] conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters
  2020-11-30 13:59 [Bug locale/26984] New: conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters vincent-srcware at vinc17 dot net
  2021-04-29 21:58 ` [Bug locale/26984] " carlos at redhat dot com
  2021-04-30  0:55 ` vincent-srcware at vinc17 dot net
@ 2021-04-30  2:05 ` carlos at redhat dot com
  2021-04-30  8:00 ` vincent-srcware at vinc17 dot net
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: carlos at redhat dot com @ 2021-04-30  2:05 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26984

--- Comment #3 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Vincent Lefèvre from comment #2)
> (In reply to Carlos O'Donell from comment #1)
> > The POSIX and C locale have no specified transliteration rules.
> 
> But why does this depend on the locale, anyway? This is silly! And this
> doesn't behave as documented: "The iconv program reads in text in one
> encoding and outputs the text in another encoding."

It depends on locale because transliteration is influenced by localization.

There are more than 100 language-influenced transliterations across all the
implemented localizations.

What might be obvious to an English speaker as a way to break down a letter or
sound into another letter or sound may not be obvious to an Arabic speaker. 

There are some "neutral" transliterations (language independent), but today the
POSIX and C locales specify no transliterations.

It does behave as documented, the text is read in from one encoding and output
to the other encoding. It is just that the transliteration rules applied depend
on your localization.

The behaviour you are seeing has existed for decades at this point, and
changing it could have an impact on existing applications, particularly for
the POSIX and C locales.

My suggestion is to use C.UTF-8 if available in your distribution (we are
working to add a harmonized generic C.UTF-8 to glibc).

The Linux man page for iconv could also gain a clarifying sentence about
where the transliterations come from. Note that the Linux man pages are a
distinct project (https://www.kernel.org/doc/man-pages/contributing.html).

Does that answer your question?

* [Bug locale/26984] conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters
  2020-11-30 13:59 [Bug locale/26984] New: conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters vincent-srcware at vinc17 dot net
                   ` (2 preceding siblings ...)
  2021-04-30  2:05 ` carlos at redhat dot com
@ 2021-04-30  8:00 ` vincent-srcware at vinc17 dot net
  2021-04-30 10:21 ` schwab@linux-m68k.org
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: vincent-srcware at vinc17 dot net @ 2021-04-30  8:00 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26984

--- Comment #4 from Vincent Lefèvre <vincent-srcware at vinc17 dot net> ---
(In reply to Carlos O'Donell from comment #3)
> It depends on locale because transliteration is influenced by localization.
> 
> There are more than 100 language-influenced transliterations across all the
> implemented localizations.

Could you explain how the "²" transliteration depends on the language?

And why not use a common transliteration for letters, derived from the
Unicode description? For instance, since "é" is "e" with an accent, the
transliteration to ASCII should be just "e". In any case, this is better than
the replacement character.

> There are some "neutral" transliterations (language independent), but today
> the POSIX and C locales specify no transliterations.

The POSIX standard does not define the concept of transliteration at all. So it
doesn't make sense to rely on it. Everything else is an extension.

> My suggestion is to use C.UTF-8 if available in your distribution (we are
> working to add a harmonized generic C.UTF-8 to glibc).

C.UTF-8 doesn't specify a language, just like C. So the fact that it behaves
differently from C is wrong.

Note also that using the C.UTF-8 locale is incorrect when using a terminal that
doesn't support UTF-8 (because error messages could potentially contain
non-ASCII characters), which is precisely the case where transliteration is
needed.

* [Bug locale/26984] conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters
  2020-11-30 13:59 [Bug locale/26984] New: conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters vincent-srcware at vinc17 dot net
                   ` (3 preceding siblings ...)
  2021-04-30  8:00 ` vincent-srcware at vinc17 dot net
@ 2021-04-30 10:21 ` schwab@linux-m68k.org
  2021-04-30 11:13 ` vincent-srcware at vinc17 dot net
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: schwab@linux-m68k.org @ 2021-04-30 10:21 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26984

--- Comment #5 from Andreas Schwab <schwab@linux-m68k.org> ---
The C locale does not contain any characters that need transliteration.

* [Bug locale/26984] conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters
  2020-11-30 13:59 [Bug locale/26984] New: conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters vincent-srcware at vinc17 dot net
                   ` (4 preceding siblings ...)
  2021-04-30 10:21 ` schwab@linux-m68k.org
@ 2021-04-30 11:13 ` vincent-srcware at vinc17 dot net
  2021-04-30 11:59 ` schwab@linux-m68k.org
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: vincent-srcware at vinc17 dot net @ 2021-04-30 11:13 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26984

--- Comment #6 from Vincent Lefèvre <vincent-srcware at vinc17 dot net> ---
(In reply to Andreas Schwab from comment #5)
> The C locale does not contain any characters that need transliteration.

Yes, but the input text does.

* [Bug locale/26984] conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters
  2020-11-30 13:59 [Bug locale/26984] New: conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters vincent-srcware at vinc17 dot net
                   ` (5 preceding siblings ...)
  2021-04-30 11:13 ` vincent-srcware at vinc17 dot net
@ 2021-04-30 11:59 ` schwab@linux-m68k.org
  2021-04-30 12:05 ` vincent-srcware at vinc17 dot net
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: schwab@linux-m68k.org @ 2021-04-30 11:59 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26984

--- Comment #7 from Andreas Schwab <schwab@linux-m68k.org> ---
Characters outside the range of the locale charset are forbidden.

* [Bug locale/26984] conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters
  2020-11-30 13:59 [Bug locale/26984] New: conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters vincent-srcware at vinc17 dot net
                   ` (6 preceding siblings ...)
  2021-04-30 11:59 ` schwab@linux-m68k.org
@ 2021-04-30 12:05 ` vincent-srcware at vinc17 dot net
  2021-04-30 16:18 ` carlos at redhat dot com
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: vincent-srcware at vinc17 dot net @ 2021-04-30 12:05 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26984

--- Comment #8 from Vincent Lefèvre <vincent-srcware at vinc17 dot net> ---
(In reply to Andreas Schwab from comment #7)
> Characters outside the range of the locale charset are forbidden.

Then what's the point of the -f option?
And in the C locale, why does "€" get converted?

* [Bug locale/26984] conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters
  2020-11-30 13:59 [Bug locale/26984] New: conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters vincent-srcware at vinc17 dot net
                   ` (7 preceding siblings ...)
  2021-04-30 12:05 ` vincent-srcware at vinc17 dot net
@ 2021-04-30 16:18 ` carlos at redhat dot com
  2021-04-30 16:25 ` carlos at redhat dot com
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: carlos at redhat dot com @ 2021-04-30 16:18 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26984

--- Comment #9 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Vincent Lefèvre from comment #8)
> (In reply to Andreas Schwab from comment #7)
> > Characters outside the range of the locale charset are forbidden.
> 
> Then what's the point of the -f option?
> And in the C locale, why does "€" get converted?

This got me digging. It turns out I was wrong.

Because the C locale is "builtin" to the library to be able to provide it 100%
of the time regardless of the upgraded state of the system, and to provide it
in a consistent way, there is a *limited* set of C locale transliteration
rules... but they are embedded (locale/C-translit.h.in).

There are indeed more than 1650 transliteration rules defined for the C locale.
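
A rough way to check such a count from a glibc source tree is sketched below. It assumes that each rule in locale/C-translit.h.in occupies one line starting with a double-quoted source string, followed by the quoted replacement and a comment; the helper name count_translit_rules is hypothetical, and the number it prints will depend on the glibc version.

```shell
#!/bin/sh
# count_translit_rules FILE
# Hypothetical helper: count the lines of FILE that start with a double
# quote, on the assumption that each rule is written as
#   "<source>" "<replacement>" /* comment */
count_translit_rules() {
    grep -c '^"' "$1"
}

# Path relative to the top of a glibc source tree (assumed to be the cwd).
if [ -f locale/C-translit.h.in ]; then
    count_translit_rules locale/C-translit.h.in
fi
```

If the file layout differs from this assumption, the pattern passed to grep is the only part that needs adjusting.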

So this refutes my argument that we shouldn't be doing transliterations in
C/POSIX.

In that case we *could* attempt to take all the neutral transliterations and
autogenerate locale/C-translit.h.in from them and thus resolve this issue by
providing all "neutral" transliterations to C/POSIX in a builtin way.

This would increase the size of .data in libc.so.6, which would have to carry
these builtin rules...

Thoughts?

* [Bug locale/26984] conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters
  2020-11-30 13:59 [Bug locale/26984] New: conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters vincent-srcware at vinc17 dot net
                   ` (8 preceding siblings ...)
  2021-04-30 16:18 ` carlos at redhat dot com
@ 2021-04-30 16:25 ` carlos at redhat dot com
  2021-04-30 16:31 ` carlos at redhat dot com
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: carlos at redhat dot com @ 2021-04-30 16:25 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26984

--- Comment #10 from Carlos O'Donell <carlos at redhat dot com> ---
If we automatically processed localedata/locales/translit_neutral, we would be
adding ~25,000 transliterations to the builtin C/POSIX locale. These
transliterations don't always produce ASCII, so we may need to do some more
processing of the mappings to end up at ASCII. That's a lot of
transliterations.

The intermediate fix is to add just the mappings for a subset, say all accented
characters.

* [Bug locale/26984] conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters
  2020-11-30 13:59 [Bug locale/26984] New: conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters vincent-srcware at vinc17 dot net
                   ` (9 preceding siblings ...)
  2021-04-30 16:25 ` carlos at redhat dot com
@ 2021-04-30 16:31 ` carlos at redhat dot com
  2021-04-30 16:47 ` vincent-srcware at vinc17 dot net
  2021-04-30 18:03 ` carlos at redhat dot com
  12 siblings, 0 replies; 14+ messages in thread
From: carlos at redhat dot com @ 2021-04-30 16:31 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26984

--- Comment #11 from Carlos O'Donell <carlos at redhat dot com> ---
So as a start I would accept and review a patch that updates
locale/C-translit.h.in with blocks for all the accented characters; that has
immediate value until we solve the bigger issue.

* [Bug locale/26984] conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters
  2020-11-30 13:59 [Bug locale/26984] New: conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters vincent-srcware at vinc17 dot net
                   ` (10 preceding siblings ...)
  2021-04-30 16:31 ` carlos at redhat dot com
@ 2021-04-30 16:47 ` vincent-srcware at vinc17 dot net
  2021-04-30 18:03 ` carlos at redhat dot com
  12 siblings, 0 replies; 14+ messages in thread
From: vincent-srcware at vinc17 dot net @ 2021-04-30 16:47 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26984

--- Comment #12 from Vincent Lefèvre <vincent-srcware at vinc17 dot net> ---
I'm wondering. In the C/POSIX locale, can't iconv internally use the
transliteration rules from C.UTF-8 when they are available?

During a system upgrade, transliteration may not fully work, but I don't think
that this is a big problem compared to other things that will not work.

* [Bug locale/26984] conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters
  2020-11-30 13:59 [Bug locale/26984] New: conversion to ascii//TRANSLIT with iconv does not work in C locale on many characters vincent-srcware at vinc17 dot net
                   ` (11 preceding siblings ...)
  2021-04-30 16:47 ` vincent-srcware at vinc17 dot net
@ 2021-04-30 18:03 ` carlos at redhat dot com
  12 siblings, 0 replies; 14+ messages in thread
From: carlos at redhat dot com @ 2021-04-30 18:03 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26984

--- Comment #13 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Vincent Lefèvre from comment #12)
> I'm wondering. In the C/POSIX locale, can't iconv internally use the
> transliteration rules from C.UTF-8 when they are available?

It can. It would be a bespoke thing. We are working to make C.UTF-8 become
builtin, but for now it's a distinct locale (that on Fedora can't be
uninstalled with a package manager).

Right now I'm going to make sure C.UTF-8 does what you're asking for, and then
we still need to circle back and implement something like this check:

* When the locale is C or POSIX
* When the system has a usable C.UTF-8
* Use the transliteration rules from C.UTF-8 (superset of C/POSIX) to support
broader transliteration.
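
Until something like this exists inside glibc, the three conditions above can be approximated in a wrapper script. This is only a user-space sketch, not the proposed glibc change: the function name choose_translit_locale is hypothetical, and it assumes that `locale -a` lists the locale as either C.UTF-8 or C.utf8.

```shell
#!/bin/sh
# choose_translit_locale CURRENT "AVAILABLE..."
# Hypothetical helper: if CURRENT is C or POSIX and the whitespace-separated
# AVAILABLE list contains C.UTF-8 (or its C.utf8 spelling), print that name;
# otherwise print CURRENT unchanged.
choose_translit_locale() {
    current=$1
    available=$2
    case $current in
        C|POSIX)
            for cand in $available; do
                case $cand in
                    C.UTF-8|C.utf8)
                        printf '%s\n' "$cand"
                        return 0
                        ;;
                esac
            done
            ;;
    esac
    printf '%s\n' "$current"
}

# Effective locale following the usual LC_ALL > LC_CTYPE > LANG precedence.
effective=${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}
chosen=$(choose_translit_locale "$effective" "$(locale -a 2>/dev/null)")
echo "locale to use for transliteration: $chosen"
```

The chosen name could then be passed as LC_ALL for the single iconv invocation only, which leaves the locale of the rest of the session untouched.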

The short term fix is to update C-translit.h.in though... and you or others
could work on that and we'd have a fairly robust fix right away :-)
