public inbox for glibc-bugs@sourceware.org
* [Bug libc/2253] unicode combining accents can't be iconv-ed to latin (and others)
       [not found] <bug-2253-131@http.sourceware.org/bugzilla/>
@ 2012-05-06  9:08 ` aj at suse dot de
  2012-05-06 12:24 ` bugdal at aerifal dot cx
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: aj at suse dot de @ 2012-05-06  9:08 UTC (permalink / raw)
  To: glibc-bugs

http://sourceware.org/bugzilla/show_bug.cgi?id=2253

Andreas Jaeger <aj at suse dot de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2012-05-06
                 CC|                            |aj at suse dot de
         AssignedTo|drepper.fsp at gmail dot    |unassigned at sourceware
                   |com                         |dot org

--- Comment #2 from Andreas Jaeger <aj at suse dot de> 2012-05-06 09:07:06 UTC ---
This is reproducible with glibc 2.15.

-- 
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.



* [Bug libc/2253] unicode combining accents can't be iconv-ed to latin (and others)
       [not found] <bug-2253-131@http.sourceware.org/bugzilla/>
  2012-05-06  9:08 ` [Bug libc/2253] unicode combining accents can't be iconv-ed to latin (and others) aj at suse dot de
@ 2012-05-06 12:24 ` bugdal at aerifal dot cx
  2012-05-06 14:52 ` [Bug libc/2253] unicode combining accents can't be iconv-ed to latin//translit " samuel.thibault@ens-lyon.org
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: bugdal at aerifal dot cx @ 2012-05-06 12:24 UTC (permalink / raw)
  To: glibc-bugs

http://sourceware.org/bugzilla/show_bug.cgi?id=2253

Rich Felker <bugdal at aerifal dot cx> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bugdal at aerifal dot cx

--- Comment #3 from Rich Felker <bugdal at aerifal dot cx> 2012-05-06 12:23:53 UTC ---
Please mark this bug as INVALID. This conversion definitely should NOT happen
unless //TRANSLIT is used. By default iconv should accurately reflect the
one-to-one nature of Unicode's round-trip mappings to legacy character sets.
This is important because many users of iconv will test performing a conversion
to a particular legacy character set to determine if the data can be faithfully
stored in that character set, e.g. for compression or transmission purposes. As
an example, mutt does this to choose the charset to send messages in, using a
user-provided list of charsets to try. I believe several IM clients also do it.
If iconv silently converted (for example) U+0061 U+0300 to U+00E0, such
applications would wrongly assume that this destructive conversion was
faithful, causing them to lose data. (That is, converting back would not
faithfully restore the original data, and the application should have just
stored it in the original UTF-8, but it had no way to know this because iconv
lied.)


* [Bug libc/2253] unicode combining accents can't be iconv-ed to latin//translit (and others)
       [not found] <bug-2253-131@http.sourceware.org/bugzilla/>
  2012-05-06  9:08 ` [Bug libc/2253] unicode combining accents can't be iconv-ed to latin (and others) aj at suse dot de
  2012-05-06 12:24 ` bugdal at aerifal dot cx
@ 2012-05-06 14:52 ` samuel.thibault@ens-lyon.org
  2014-02-07  2:55 ` [Bug localedata/2253] " jsm28 at gcc dot gnu.org
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: samuel.thibault@ens-lyon.org @ 2012-05-06 14:52 UTC (permalink / raw)
  To: glibc-bugs

http://sourceware.org/bugzilla/show_bug.cgi?id=2253

Samuel Thibault <samuel.thibault@ens-lyon.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|unicode combining accents   |unicode combining accents
                   |can't be iconv-ed to latin  |can't be iconv-ed to
                   |(and others)                |latin//translit (and
                   |                            |others)

--- Comment #4 from Samuel Thibault <samuel.thibault@ens-lyon.org> 2012-05-06 14:51:41 UTC ---
Right, thus changing bug title: the transliteration however still produces "e",
while it could produce "é".


* [Bug localedata/2253] unicode combining accents can't be iconv-ed to latin//translit (and others)
       [not found] <bug-2253-131@http.sourceware.org/bugzilla/>
                   ` (2 preceding siblings ...)
  2012-05-06 14:52 ` [Bug libc/2253] unicode combining accents can't be iconv-ed to latin//translit " samuel.thibault@ens-lyon.org
@ 2014-02-07  2:55 ` jsm28 at gcc dot gnu.org
  2014-06-26  5:08 ` pravin.d.s at gmail dot com
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: jsm28 at gcc dot gnu.org @ 2014-02-07  2:55 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=2253

Joseph Myers <jsm28 at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |libc-locales at sourceware dot org
          Component|libc                        |localedata


* [Bug localedata/2253] unicode combining accents can't be iconv-ed to latin//translit (and others)
       [not found] <bug-2253-131@http.sourceware.org/bugzilla/>
                   ` (3 preceding siblings ...)
  2014-02-07  2:55 ` [Bug localedata/2253] " jsm28 at gcc dot gnu.org
@ 2014-06-26  5:08 ` pravin.d.s at gmail dot com
  2015-05-04 18:23 ` maiku.fabian at gmail dot com
  2015-05-05  9:40 ` maiku.fabian at gmail dot com
  6 siblings, 0 replies; 8+ messages in thread
From: pravin.d.s at gmail dot com @ 2014-06-26  5:08 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=2253

Pravin S <pravin.d.s at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pravin.d.s at gmail dot com


* [Bug localedata/2253] unicode combining accents can't be iconv-ed to latin//translit (and others)
       [not found] <bug-2253-131@http.sourceware.org/bugzilla/>
                   ` (4 preceding siblings ...)
  2014-06-26  5:08 ` pravin.d.s at gmail dot com
@ 2015-05-04 18:23 ` maiku.fabian at gmail dot com
  2015-05-05  9:40 ` maiku.fabian at gmail dot com
  6 siblings, 0 replies; 8+ messages in thread
From: maiku.fabian at gmail dot com @ 2015-05-04 18:23 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=2253

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |maiku.fabian at gmail dot com

--- Comment #5 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Samuel Thibault from comment #4)
> Right, thus changing bug title: the transliteration however still produces
> "e", while it could produce "é".

Transliterating to “e” is probably OK in most locales; for
example, in English, dropping accents seems to be common usage.

* [Bug localedata/12031] iconv -t ascii//translit with Greek characters
From: maiku.fabian at gmail dot com @ 2015-05-04 18:53 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=12031

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |maiku.fabian at gmail dot com

--- Comment #8 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Petter Reinholdtsen from comment #5)
> (In reply to comment #4)
> > gives me "ae,?,a" but in my opinion it should give me "ae,o,a".
> [...]
> > Is this a bug?
> 
> I believe it is a bug.

It works in recent glibc (glibc-2.20-8.fc21.x86_64)
in *all* locales except C/POSIX. 

$ echo 'Æ,æ,Ø,ø,Å,å' | LANG=nb_NO.UTF-8 iconv -t ascii//TRANSLIT 
AE,ae,OE,oe,A,a

$ echo 'Æ,æ,Ø,ø,Å,å' | LANG=en_US.UTF-8 iconv -t ascii//TRANSLIT 
AE,ae,OE,oe,A,a

$ echo 'Æ,æ,Ø,ø,Å,å' | LANG=POSIX iconv -t ascii//TRANSLIT 
iconv: illegal input sequence at position 0

It is independent of the locale because all locales (except C/POSIX)
include translit_neutral where this is defined.

> The request to change transliteration for æøå is
> http://sourceware.org/bugzilla/show_bug.cgi?id=89 .  Please explain there
> why you believe it should transliterate to ae,o,a and not ae,oe,aa.

For Scandinavian locales, transliterating 'Æ,æ,Ø,ø,Å,å' to 'Ae, ae,
Oe, oe, Aa, aa' is more appropriate. For most other locales,
transliterating å to a is probably OK.  I am a bit puzzled about Æ ->
AE: shouldn’t this be transliterated to Ae, even in English locales?
(The same goes for Ø: transliterating it to just O, or maybe Oe, in
translit_neutral for all locales which do not have special rules
seems better.)

The patch attached to

https://sourceware.org/bugzilla/show_bug.cgi?id=89#c5

fixes the transliteration for Norwegian locales (nn_NO and nb_NO).
Probably the same fix should be applied also for Swedish and Finnish
locales (and maybe Icelandic locales as well).
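
The layering described above (locale-specific rules taking precedence over
translit_neutral, and no rules at all in C/POSIX) can be modelled roughly as
follows. This is a sketch under stated assumptions, not glibc's data or
code: the tables only mirror the outputs quoted in this message, and the
names TRANSLIT_NEUTRAL, LOCALE_RULES and transliterate are hypothetical.

```python
# Assumed contents, copied from the outputs shown in this thread only.
TRANSLIT_NEUTRAL = {
    "Æ": "AE", "æ": "ae",
    "Ø": "OE", "ø": "oe",
    "Å": "A", "å": "a",
}

# Hypothetical per-locale overrides following the Scandinavian suggestion.
LOCALE_RULES = {
    "nb_NO": {"Æ": "Ae", "Ø": "Oe", "Å": "Aa", "å": "aa"},
    "nn_NO": {"Æ": "Ae", "Ø": "Oe", "Å": "Aa", "å": "aa"},
}


def transliterate(text: str, locale: str) -> str:
    """Transliterate text to ASCII using locale rules over neutral rules."""
    if locale in ("C", "POSIX"):
        table = {}  # modelled as having no transliteration rules
    else:
        # Later entries win, so locale rules override the neutral table.
        table = {**TRANSLIT_NEUTRAL, **LOCALE_RULES.get(locale, {})}
    out = []
    for position, ch in enumerate(text):
        if ch.isascii():
            out.append(ch)
        elif ch in table:
            out.append(table[ch])
        else:
            raise ValueError(f"illegal input sequence at position {position}")
    return "".join(out)


print(transliterate("Æ,æ,Ø,ø,Å,å", "en_US"))  # AE,ae,OE,oe,A,a
print(transliterate("Æ,æ,Ø,ø,Å,å", "nb_NO"))  # Ae,ae,Oe,oe,Aa,aa
```

With locale set to "POSIX" the same call raises ValueError at position 0,
which corresponds to the iconv error shown earlier.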

* [Bug localedata/89] Locales nb_NO and nn_NO should transliterate æøå
From: maiku.fabian at gmail dot com @ 2015-05-04 18:54 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=89

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |maiku.fabian at gmail dot com

--- Comment #8 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Aurelien Jarno from comment #5)
> Created attachment 6193 [details]
> Patch to fix the issue
>
> Patch attached for reference

I guess this fix should also be added to the sv_SE, fi_FI and maybe is_IS
locales, right?


* [Bug localedata/2253] unicode combining accents can't be iconv-ed to latin//translit (and others)
       [not found] <bug-2253-131@http.sourceware.org/bugzilla/>
                   ` (5 preceding siblings ...)
  2015-05-04 18:23 ` maiku.fabian at gmail dot com
@ 2015-05-05  9:40 ` maiku.fabian at gmail dot com
  6 siblings, 0 replies; 8+ messages in thread
From: maiku.fabian at gmail dot com @ 2015-05-05  9:40 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=2253

--- Comment #7 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Samuel Thibault from comment #6)
> Err, but here e+combineacute *is* representable in latin1, it's eacute. So
> transliteration should not discard the accent.

Yes, maybe.

But is this doable with the glibc transliteration system?
All the glibc/localedata/locales/translit_* files just transliterate
one single character to another character or a list of characters.
A rule never starts from a character sequence. So I guess this is not supported.

As Jungshik Shin suggests in comment#1, iconv could
normalize the input to NFC before attempting a transliteration.

Certainly not without transliteration, as Rich Felker writes in
comment#3, but *if* transliteration is used, normalizing to NFC and
then doing the transliteration might be a reasonable approach.
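
A rough sketch of that idea, in Python rather than inside glibc: normalize
to NFC first, then fall back per character. Everything here is an
assumption for illustration: the FALLBACK table is invented and keyed on
single characters like the translit_* files, translit_encode is a
hypothetical name, and dropping an unmappable combining mark is only a
model of the "e" result reported in this bug.

```python
import unicodedata

# Hypothetical single-character fallback table; not glibc's actual data.
FALLBACK = {"\u00e6": "ae", "\u2019": "'"}


def translit_encode(text: str, charset: str, normalize: bool = True) -> bytes:
    """Encode text to charset, transliterating what does not fit."""
    if normalize:
        # Compose base letter + combining mark where a precomposed
        # character exists, e.g. U+0065 U+0301 becomes U+00E9.
        text = unicodedata.normalize("NFC", text)
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            if unicodedata.combining(ch):
                continue  # model: a stray combining mark is dropped
            # Per-character rule; unknown characters become "?".
            out += FALLBACK.get(ch, "?").encode(charset, errors="replace")
    return bytes(out)


e_acute = "e\u0301"  # e + COMBINING ACUTE ACCENT

print(translit_encode(e_acute, "latin-1", normalize=False))  # b'e'
print(translit_encode(e_acute, "latin-1"))                   # b'\xe9'
```

Because the table is keyed on single characters, the only way the sequence
can reach "é" in this sketch is through the normalization step; when the
target is ASCII, the normalized "é" has no rule in this small table and
ends up as "?".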


* [Bug libc/2253] unicode combining accents can't be iconv-ed to latin (and others)
  2006-01-31 22:22 [Bug libc/2253] New: unicode combining accents can't be iconv-ed to latin " samuel dot thibault at ens-lyon dot org
@ 2007-02-02  1:58 ` jshin1987 at gmail dot com
  0 siblings, 0 replies; 8+ messages in thread
From: jshin1987 at gmail dot com @ 2007-02-02  1:58 UTC (permalink / raw)
  To: glibc-bugs


------- Additional Comments From jshin1987 at gmail dot com  2007-02-02 01:58 -------
Before the conversion, the input should be normalized to NFC. However, I'm not
sure whether that is within the scope of 'iconv'.


-- 


http://sourceware.org/bugzilla/show_bug.cgi?id=2253

