In-Reply-To: <CAAr43iMuc3LRxy=BqJJuZTkzU14c+XERMv2oVVc7Lg-kuMY5BQ@mail.gmail.com>
References: <d3342ff4-f717-f882-5c41-b27ab272dc03@cyberXpress.co.nz>
 <CAAr43iOdVea3YYThgdYpJxRCaVtFVhyHz_FwMTQhqTw8+YT-zg@mail.gmail.com>
 <606AD7CE.6090606@tlinx.org>
 <CAAr43iMuc3LRxy=BqJJuZTkzU14c+XERMv2oVVc7Lg-kuMY5BQ@mail.gmail.com>
X-Referenced-Uid: 147829
Thread-Topic: Re: Perl Unidecode modules - which to use (if not
 Text::Unidecode)?
User-Agent: Android
X-Is-Generated-Message-Id: true
MIME-Version: 1.0
Subject: Re: Perl Unidecode modules - which to use (if not Text::Unidecode)?
From: Mark Aitchison <mark.aitchison@cyberxpress.co.nz>
Date: Tue, 06 Apr 2021 09:50:14 +1200
To: Joel Rees <joel.rees@gmail.com>
CC: cygwin@cygwin.com
Message-ID: <abb3cba3-2d64-4ffd-bedb-e63df3f34439@cyberxpress.co.nz>


A little more detail... I realise that stripping accents off is often not
a good thing to do, but at the moment that is basically what I'm after.
To be more specific: I want to know whether a character is a consonant or
a vowel, because this odd application basically throws away vowels and
punctuation. Later I will want to do all sorts of things with input text
that might be UTF-8 or UTF-16 or some other encoding that (hopefully) I
can guess, translate to one standard encoding, and ultimately spit out
on a web page.
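
A minimal sketch of that consonant test, using only Unicode::Normalize,
which has been a core Perl module since 5.8 and so should not need a CPAN
download. The helper names (base_letter, is_consonant, consonants_only)
and the rule that a consonant is any A-Z letter other than a, e, i, o, u
(so "y" counts as a consonant) are assumptions for illustration only:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                          # this source file contains UTF-8 literals
use Unicode::Normalize qw(NFD);    # core module, no CPAN install needed

# Decompose a character and drop its combining marks, e.g. "Ç" -> "C".
# Letters without a canonical decomposition (e.g. "ø") pass through as-is.
sub base_letter {
    my ($char) = @_;
    my $decomposed = NFD($char);
    $decomposed =~ s/\p{Mn}+//g;   # \p{Mn} = non-spacing (combining) marks
    return $decomposed;
}

# Assumed rule: a consonant is a basic Latin letter that is not a/e/i/o/u.
sub is_consonant {
    my ($char) = @_;
    my $base = base_letter($char);
    return $base =~ /\A[a-z]\z/i && $base !~ /\A[aeiou]\z/i ? 1 : 0;
}

# Reduce text to unaccented consonants; vowels, punctuation, digits and
# whitespace are all thrown away.
sub consonants_only {
    my ($text) = @_;
    return join '',
        grep { is_consonant($_) }
        map  { base_letter($_) }
        split //, $text;
}

binmode STDOUT, ':encoding(UTF-8)';
# The sample characters from the original question.
print consonants_only("ÇĆĈĊçĉċĜĞĠĢĝģğġËÌÍÎÏÒÓÔÕ"), "\n";  # prints "CCCCcccGGGGgggg"
```

This only handles letters whose accents are separate combining marks
after decomposition; anything else (Cyrillic, CJK, "ø", "ß") is simply
dropped by the A-Z test, which may or may not be what is wanted.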

There seem to be many Perl modules that do similar things... I want to
be able to distribute my code without requiring people to download
things from CPAN. I'd like to stick with modules that are as stock
standard as can be, i.e. ones that are in a standard Cygwin distribution
and are normally found in other Perl environments. In a sense, searching
CPAN gives me too many options, because it includes modules that might
require a customer to do more than I should ask of them, when that could
have been avoided by my choosing a more standard way of achieving the
goal in the first place.

What I probably should have asked is...
1. What Perl module that comes with Cygwin is good for telling whether a
   letter is a consonant?
2. Later on I will also need something that makes a reasonable guess as
   to what kind of encoding is used in some text (that might not have a
   helpful header telling me the answer), with a view to converting it
   to whatever encoding I want. I can find software to do this, but I
   would like to restrict the options to just those a Cygwin user can
   install with the setup program... if I'm not being too unrealistic
   about that requirement.
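
For the second question, one candidate is Encode::Guess, which ships with
the core Encode distribution. The sketch below is only an illustration:
decode_guessed is a hypothetical helper name, and the sample bytes are
made up. By default the module only tries ASCII, UTF-8 and BOM-marked
UTF-16/32, and it cannot tell single-byte encodings such as Latin-1 and
CP1252 apart, so a failed or ambiguous guess has to be handled:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;    # part of the core Encode distribution

# Guess the encoding of raw bytes and return (encoding name, decoded text).
# Returns an empty list when the guess fails or is ambiguous.
# @suspects are optional extra candidate encodings beyond the defaults.
sub decode_guessed {
    my ($bytes, @suspects) = @_;
    my $guess = guess_encoding($bytes, @suspects);
    return unless ref $guess;    # on failure an error string is returned
    return ($guess->name, $guess->decode($bytes));
}

# Made-up sample input: "Ça va?" encoded as UTF-8 bytes.
my $bytes = encode('UTF-8', "\x{C7}a va?");
my ($name, $text) = decode_guessed($bytes);

if (defined $text) {
    print "guessed: $name\n";
    # Re-encode to whatever the output needs, e.g. UTF-8 for a web page.
    print encode('UTF-8', $text), "\n";
}
else {
    print "could not guess the encoding\n";
}
```
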
Thanks, Mark

On 5 Apr 2021, at 22:50, Joel Rees via Cygwin <cygwin@cygwin.com> wrote:
>On Mon, Apr 5, 2021 at 6:26 PM L A Walsh <cygwin@tlinx.org> wrote:
>>
>> On 2021/04/04 14:26, Joel Rees via Cygwin wrote:
>> >
>> >> 1. What perl Unicode modules should I consider, if not
>> >> Text::Unidecode?
>> >> The present need is to be able to convert those few "foreign"
>> >> characters (like ÇĆĈĊçĉċĜĞĠĢĝģğġËÌÍÎÏÒÓÔÕ) that are basically
>> >> ASCII with accent marks to their closest ASCII equivalents, but
>> >> I'd like to do more with Unicode in the future, without going
>> >> down any dead-ends as far as being able to run under cygwin is
>> >> concerned.
>> >>
>> >
>> > "Stripping those few foreign accent characters" is probably not
>> > really what you want to do.
>> >
>> ----
>>     Why not?  You don't know his use case and you are misinterpreting
>> his example as random garbage.
>
>Actually, I was specifically _not_ interpreting them as random garbage.
>If they were random garbage, it wouldn't matter what he does with them.
>
>> Those aren't a random foreign encoding -- those are C's, G's, then E,
>> I, O with accent variations that he may want to collapse for purposes
>> of storing in a text storage and retrieval (search) application.
>
>In this world many things are possible, and those may actually be
>intentional strings of characters with assorted diacriticals, some sort
>of example of diacriticals, and he may have some reason to force the
>characters to their base form instead of regenerating the text. Or maybe
>I'm misinterpreting his intent. Maybe he doesn't want to strip the
>diacriticals so much as convert the combinations to something like
>punycode.
>
>> They are all well formed/well-coded UTF-8 characters -- they are not
>> some 8-bit encoding that was remangled during a no-recoding display of
>> them in a UTF-8 context.
>
>I've seen lots of strings like that that are the result of e-mail
>software mangling. In Japan, we call it 文字化け (mojibake). And, yes,
>the e-mail software "helpfully" converts the misinterpreted bytes to
>well-formed but entirely irrelevant UTF-8 in many cases.
>
>I will acknowledge that I don't see it as often as I used to, but it
>still happens.
>
>> I didn't know about Text::Unidecode -- but it exists specifically to
>> create Latinized alternatives to foreign characters.  That was another
>> hint that it wasn't a random mistake.  The manpage for it says:
>>
>>        It often happens that you have non-Roman text data in Unicode,
>>        but you can't display it -- usually because you're trying to
>>        show it to a user via an application that doesn't support
>>        Unicode, or because the fonts you need aren't accessible.  You
>>        could represent the Unicode characters as "???????" or
>>        "\15BA\15A0\1610...", but that's nearly useless to the user who
>>        actually wants to read what the text says.
>>
>> An example was like:
>>
>> tperl
>> use utf8;
>> use Text::Unidecode;
>> my $name="\x{5317}\x{4EB0}";
>>
>> printf "name, %s == %s\n", $name, unidecode($name);
>> '
>> name, 北亰 == Bei Jing
>
>I would not call that "stripping" accent marks. It's a process of
>recognizing the characters, looking them up in a dictionary, and finding
>a reasonable Latinized equivalent, which is a fairly involved process
>requiring a bit of heuristics, since there is often a many-to-many
>mapping involved.
>
>> It's not just about removing accents but getting an English-like
>> translation based on the foreign text.
>
>And that's actually what I was trying to point him to?
>
>Okay, maybe my suggestions were too elliptical. Maybe I should have told
>myself I was too busy and ignored his question like everybody else.
>
>[snip]
>
>-- 
>Joel Rees
>
>http://reiisi.blogspot.jp/p/novels-i-am-writing.html