Subject: Re: Perl Unidecode modules - which to use (if not Text::Unidecode)?
From: Mark Aitchison
Date: Tue, 06 Apr 2021 09:50:14 +1200
To: Joel Rees
CC: cygwin@cygwin.com

A little more detail... I realise that stripping accents off is often not
a good thing to do, but at the moment that basically is what I'm after, or
to be more specific: I want to know if the character is a consonant or
vowel... I basically throw away vowels and punctuation in this odd
application. Later I will want to do all sorts of things with input text
that might be UTF-8 or UTF-16 or some encoding that (hopefully) I can guess
and translate to the same standard and ultimately spit out on a web page.

There seem to be many Perl modules that do similar things... I want to be
able to distribute my code and not require people to download things from
CPAN. I'd like to stick with modules that are as stock standard as standard
can be, i.e. are in a standard Cygwin distribution, and are normally found
in other Perl environments. In a sense, searching CPAN gives me too many
options because that includes modules that might require a customer to do
more than I should ask them to have to do, if it could have been avoided by
me choosing a more standard way of achieving the goal in the first place.

What I probably should have asked is...

1. What Perl module, that comes with Cygwin, is good for telling whether a
   letter is a consonant?
2. Later on I will also need something that makes a reasonable guess as to
   what kind of encoding is used in some text (that might not have a helpful
   header telling me the answer), with the view to converting it to whatever
   encoding I want?

I can find software to do this, but I would like to restrict options to just
those a Cygwin user can install with the setup program... if I'm not being
too unrealistic about that requirement.

Thanks,
Mark

On 5 Apr 2021, 22:50, at 22:50, Joel Rees via Cygwin wrote:
>On Mon, Apr 5, 2021 at 6:26 PM L A Walsh wrote:
>>
>> On 2021/04/04 14:26, Joel Rees via Cygwin wrote:
>> >
>> >> 1. What perl Unicode modules should I consider, if not Text::Unidecode?
>> >> The present need
>> >> is to be able to convert those few "foreign" characters (like
>> >> ÇĆĈĊçĉċĜĞĠĢĝģğġËÌÍÎÏÒÓÔÕ)
>> >> that are basically ASCII with accent marks to their closest ASCII
>> >> equivalents, but I'd
>> >> like to do more with Unicode in the future, without going down any
>> >> dead-ends as far as
>> >> being able to run under cygwin is concerned.
>> >>
>> >
>> > "Stripping those few foreign accent characters" is probably not really what
>> > you want to do.
>> >
>> ----
>> Why not?
>> You don't know his use case and you are misinterpreting his
>> example as random garbage.
>
>Actually, I was specifically _not_ interpreting them as random garbage. If they
>were random garbage, it wouldn't matter what he does with them.
>
>> Those aren't a random foreign encoding -- those are C's, G's, then E, I, O
>> with accent variations that he may want to collapse for purposes of storing
>> in a text storage and retrieval (search) application.
>
>In this world many things are possible, and those may actually be intentional
>strings of characters with assorted diacriticals, some sort of example of
>diacriticals, and he may have some reason to force the characters to their
>base form instead of regenerating the text. Or maybe I'm misinterpreting
>his intent. Maybe he doesn't want to strip the diacriticals so much as convert
>the combinations to something like punycode.
>
>> They are all well
>> formed/well-coded UTF-8 characters -- they are not some 8-bit encoding
>> that was remangled during a no-recoding display of them in a UTF-8
>> context.
>
>I've seen lots of strings like that that are the result of e-mail software
>mangling. In Japan, we call it 文字化け (mojibake). And, yes, the e-mail
>software "helpfully" converts the misinterpreted bytes to well-formed
>but entirely irrelevant UTF-8 in many cases.
>
>I will acknowledge that I don't see it as often as I used to, but it
>still happens.
>
>> I didn't know about Text::Unidecode -- but it is specifically meant to create
>> Latinized alternatives to foreign characters. That was another hint
>> that it wasn't a random mistake. The manpage for it says:
>>
>>   It often happens that you have non-Roman text data in Unicode, but you
>>   can't display it -- usually because you're trying to show it to a user
>>   via an application that doesn't support Unicode, or because the fonts
>>   you need aren't
>>   accessible. You could represent the Unicode characters
>>   as "???????" or "\15BA\15A0\1610...", but that's nearly useless to the
>>   user who actually wants to read what the text says.
>>
>> An example was like:
>>
>> tperl
>> use utf8;
>> use Text::Unidecode;
>> my $name="\x{5317}\x{4EB0}";
>>
>> printf "name, %s == %s\n", $name, unidecode($name);
>> '
>> name, 北亰 == Bei Jing
>
>I would not call that "stripping" accent marks. It's a process of recognizing the
>characters, looking them up in a dictionary, and finding a reasonable Latinized
>equivalent, which is a fairly involved process requiring a bit of heuristics, since
>there is often a many-to-many mapping involved.
>
>> It's not just about removing accents but getting an English-like
>> translation based on the foreign text.
>
>And that's actually what I was trying to point him to?
>
>Okay, maybe my suggestions were too elliptical. Maybe I should have told
>myself I was too busy and ignored his question like everybody else.
>
>[snip]
>
>--
>Joel Rees
>
>http://reiisi.blogspot.jp/p/novels-i-am-writing.html
>--
>Problem reports:      https://cygwin.com/problems.html
>FAQ:                  https://cygwin.com/faq/
>Documentation:        https://cygwin.com/docs.html
>Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
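
For question 1 in the message above (telling consonants from vowels once accents are ignored), here is a minimal sketch rather than a definitive answer. It assumes the Unicode::Normalize module that ships with core Perl (so nothing extra from CPAN), decomposes each character with NFD, and drops the combining marks. The helper names strip_marks, classify and consonants_only are hypothetical, and the sketch assumes that "vowel" means a, e, i, o, u, with y counted as a consonant. Input is assumed to be already decoded into Perl characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);   # core module, no CPAN download needed

# Hypothetical helper: remove accents by decomposing, then dropping marks.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);  # e.g. U+00C7 becomes "C" + U+0327
    $decomposed =~ s/\p{Mn}+//g;  # delete the combining (nonspacing) marks
    return $decomposed;
}

# Hypothetical helper: classify a single character.
# Assumption: vowels are a, e, i, o, u; "y" is treated as a consonant.
sub classify {
    my ($char) = @_;
    my $base = lc strip_marks($char);
    return 'vowel'     if $base =~ /\A[aeiou]\z/;
    return 'consonant' if $base =~ /\A[a-z]\z/;
    return 'other';               # digits, punctuation, non-Latin letters
}

# Hypothetical helper: keep only the unaccented consonants of a string,
# throwing away vowels, punctuation and everything else.
sub consonants_only {
    my ($text) = @_;
    my $plain = strip_marks($text);
    $plain =~ s/[^A-Za-z]//g;     # drop everything that is not an ASCII letter
    $plain =~ s/[aeiou]//gi;      # drop the vowels
    return $plain;
}

# "\x{C7}a va, \x{C9}mile?" is "Ça va, Émile?" written with escapes so the
# script does not depend on the encoding of the source file.
print consonants_only("\x{C7}a va, \x{C9}mile?"), "\n";   # prints "Cvml"
```

This only covers letters that have a canonical decomposition into a base letter plus marks. Characters such as ø or ł do not decompose that way and come out as 'other' here; handling them would need a lookup table or a transliteration module such as Text::Unidecode.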