From: L A Walsh
Date: Mon, 05 Apr 2021 02:26:38 -0700
To: Joel Rees
CC: cygwin@cygwin.com
Subject: Re: Perl Unidecode modules - which to use (if not Text::Unidecode)?
Message-ID: <606AD7CE.6090606@tlinx.org>
List-Id: General Cygwin discussions and problem reports

On 2021/04/04 14:26, Joel Rees via Cygwin wrote:
>
>> 1. What perl Unicode modules should I consider, if not Text::Unidecode?
>> The present need is to be able to convert those few "foreign"
>> characters (like
>> ÇĆĈĊçĉċĜĞĠĢĝģğġËÌÍÎÏÒÓÔÕ)
>> that are basically ASCII with accent marks to their closest ASCII
>> equivalents, but I'd like to do more with Unicode in the future,
>> without going down any dead-ends as far as being able to run under
>> cygwin is concerned.
>>
>
> "Stripping those few foreign accent characters" is probably not really what
> you want to do.
>
----
Why not? You don't know his use case, and you are misinterpreting
his example as random garbage. Those aren't a random foreign encoding --
those are C's and G's, then E, I and O, with accent variations that he may
want to collapse for purposes of storing in a text storage and retrieval
(search) application.

They are all well-formed, well-coded UTF-8 characters -- they are not
some 8-bit encoding that was remangled during a no-recoding display of them
in a UTF-8 context.

I didn't know about Text::Unidecode -- but it exists specifically to create
Latinized alternatives to foreign characters. That was another hint
that it wasn't a random mistake. The manpage for it says:

    It often happens that you have non-Roman text data in Unicode,
    but you can't display it -- usually because you're trying to
    show it to a user via an application that doesn't support
    Unicode, or because the fonts you need aren't accessible. You
    could represent the Unicode characters as "???????" or
    "\15BA\15A0\1610...", but that's nearly useless to the user
    who actually wants to read what the text says.

An example was like:

    tperl
    use utf8;
    use Text::Unidecode;
    my $name="\x{5317}\x{4EB0}";
    printf "name, %s == %s\n", $name, unidecode($name);
    '
    name, 北亰 == Bei Jing

It's not just about removing accents but getting an English-like
transliteration based on the foreign text.
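For the narrower job of collapsing accented Latin letters to their base
letters, a minimal sketch (not from the thread) could use the core
Unicode::Normalize module instead of Text::Unidecode. The helper name
strip_marks is hypothetical, and the sketch assumes the input is already a
decoded Perl character string rather than raw UTF-8 bytes:

```perl
use strict;
use warnings;
use utf8;                          # this source file contains UTF-8 literals
use Unicode::Normalize qw(NFD);    # core module, no CPAN install needed

# Hypothetical helper (an assumption, not from the thread): decompose each
# character into base letter + combining marks, then delete the marks.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);   # e.g. U+00C7 becomes "C" + U+0327
    $decomposed =~ s/\p{Mn}+//g;   # \p{Mn} matches nonspacing (combining) marks
    return $decomposed;
}

binmode(STDOUT, ':encoding(UTF-8)');
my $sample = "ÇĆĈĊçĉċĜĞĠĢĝģğġËÌÍÎÏÒÓÔÕ";
printf "%s -> %s\n", $sample, strip_marks($sample);
```

With the sample from the original question this should print
CCCCcccGGGGggggEIIIIOOOO after the arrow. Characters that have no canonical
decomposition (ø, ß, CJK ideographs) pass through unchanged, which is the
gap that a transliterating module like Text::Unidecode is meant to fill.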
All of the characters he used as examples were well-coded UTF-8 characters --

> Those "accent characters" are misinterpreted foreign encoding (likely not
> to be Unicode) characters. Simply "stripping" the "accent characters" will
> basically convert them to truly meaningless junk. I suppose the meaningless
> junk can then be interpreted by the reader as "used to be a foreign
> word here", but why bother contributing further to information entropy?
>
>> 2. I see some talk of Internationalization in Chapter 2 of "Setting up
>> Cygwin", but cannot see anything relating to perl modules, and I don't
>> see any easy way to search many months of the mailing list for a
>> keyword... is there any information I should know about?
>
> Have you read the perldoc on internationalization?