From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <d3342ff4-f717-f882-5c41-b27ab272dc03@cyberXpress.co.nz>
 <CAAr43iOdVea3YYThgdYpJxRCaVtFVhyHz_FwMTQhqTw8+YT-zg@mail.gmail.com>
 <606AD7CE.6090606@tlinx.org>
In-Reply-To: <606AD7CE.6090606@tlinx.org>
From: Joel Rees <joel.rees@gmail.com>
Date: Mon, 5 Apr 2021 19:49:39 +0900
Message-ID: <CAAr43iMuc3LRxy=BqJJuZTkzU14c+XERMv2oVVc7Lg-kuMY5BQ@mail.gmail.com>
Subject: Re: Perl Unidecode modules - which to use (if not Text::Unidecode)?
To: cygwin@cygwin.com
Content-Type: text/plain; charset="UTF-8"

On Mon, Apr 5, 2021 at 6:26 PM L A Walsh <cygwin@tlinx.org> wrote:
>
> On 2021/04/04 14:26, Joel Rees via Cygwin wrote:
> >
> >> 1. What perl Unicode modules should I consider, if not Text::Unidecode?
> >> The present need is to be able to convert those few "foreign" characters
> >> (like ÇĆĈĊçĉċĜĞĠĢĝģğġËÌÍÎÏÒÓÔÕ) that are basically ASCII with accent
> >> marks to their closest ASCII equivalents, but I'd like to do more with
> >> Unicode in the future, without going down any dead-ends as far as being
> >> able to run under cygwin is concerned.
> >>
> >>
> >
> > "Stripping those few foreign accent characters" is probably not really
> > what you want to do.
> >
> ----
>     Why not?  You don't know his use case and you are misinterpreting his
> example as random garbage.

Actually, I was specifically _not_ interpreting them as random garbage. If
they were random garbage, it wouldn't matter what he does with them.

> Those aren't a random foreign encoding -- those are C's G's then E, I O
> with accent variations that he may want to collapse for purposes of storing
> in a text storage and retrieval (search) application.

In this world many things are possible. Those may actually be intentional
strings of characters with assorted diacriticals, some sort of example of
diacriticals, and he may have some reason to force the characters to their
base form instead of regenerating the text. Or maybe I'm misinterpreting
his intent. Maybe he doesn't want to strip the diacriticals so much as
convert the combinations to something like punycode.

> They are all well
> formed/well-coded UTF-8 characters -- they are not some 8-bit encoding
> that was remangled during a no-recoding display of them in a UTF-8
> context.

I've seen lots of strings like that which are the result of e-mail software
mangling. In Japan, we call it 文字化け (mojibake). And, yes, the e-mail
software "helpfully" converts the misinterpreted bytes to well-formed
but entirely irrelevant UTF-8 in many cases.

I will acknowledge that I don't see it as often as I used to, but it
still happens.
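
For anyone who hasn't watched it happen, here is a minimal sketch of that
kind of mangling, assuming the software guesses ISO-8859-1 for bytes that
are really UTF-8 (just one of several ways it can go wrong). The helper
name mangle is made up for illustration; Encode ships with perl.

```perl
use strict;
use warnings;
use utf8;                      # string literals below are UTF-8 source text
use Encode qw(encode decode);

# Hypothetical helper: reproduce the mangling by decoding UTF-8 bytes
# as if they were ISO-8859-1 (Latin-1).
sub mangle {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);      # the bytes actually sent
    return decode('ISO-8859-1', $bytes);     # the wrong guess on receipt
}

binmode STDOUT, ':encoding(UTF-8)';
my $mangled = mangle("é");
# Prints: é -> Ã© (2 characters)
printf "%s -> %s (%d characters)\n", "é", $mangled, length $mangled;
```

The result is perfectly well-formed once it is written back out as UTF-8,
which is why well-formedness alone doesn't tell you the text is what the
sender meant.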

> I didn't know about Text::Unidecode -- but it is specifically meant to create
> Latinized alternatives to foreign characters.  That was another hint
> that it wasn't a random mistake.  The manpage for it says:
>
>        It often happens that you have non-Roman text data in Unicode, but
>        you can't display it -- usually because you're trying to show it to
>        a user via an application that doesn't support Unicode, or because
>        the fonts you need aren't accessible.  You could represent the
>        Unicode characters as "???????" or "\15BA\15A0\1610...", but that's
>        nearly useless to the user who actually wants to read what the text
>        says.
>
> An example was like:
>
> tperl
> use utf8;
> use Text::Unidecode;
> my $name="\x{5317}\x{4EB0}";
>
> printf "name, %s == %s\n", $name, unidecode($name);
> '
> name, 北亰 == Bei Jing

I would not call that "stripping" accent marks. It's a process of recognizing
the characters, looking them up in a dictionary, and finding a reasonable
Latinized equivalent, which is a fairly involved process requiring a bit of
heuristics, since there is often a many-to-many mapping involved.
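
To make the distinction concrete, here is a rough sketch of what literally
stripping the marks might look like, assuming the core Unicode::Normalize
module. The helper name strip_marks is made up for illustration, and this
is not how Text::Unidecode works internally.

```perl
use strict;
use warnings;
use utf8;                            # string literals below are UTF-8
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose, then drop the combining marks.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);     # "Ç" becomes "C" followed by U+0327
    $decomposed =~ s/\p{Mn}+//g;     # remove the nonspacing marks
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_marks("ÇĆĈĊ ĜĞĠĢ ËÌÍÎÏ"), "\n";   # prints "CCCC GGGG EIIII"
print strip_marks("北亰"), "\n";               # unchanged: nothing to strip
```

Decomposition handles the accented Latin letters, but it leaves the second
string alone. Getting from there to "Bei Jing" takes a lookup table, which
is the part I would not call stripping.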

> It's not just about removing accents but getting an English
> like translation based on the foreign text.

And that's actually what I was trying to point him to?

Okay, maybe my suggestions were too elliptical. Maybe I should have told
myself I was too busy and ignored his question like everybody else.

[snip]

-- 
Joel Rees

http://reiisi.blogspot.jp/p/novels-i-am-writing.html