public inbox for cygwin@cygwin.com
From: L A Walsh <cygwin@tlinx.org>
To: Joel Rees <joel.rees@gmail.com>
Cc: cygwin@cygwin.com
Subject: Re: Perl Unidecode modules - which to use (if not Text::Unidecode)?
Date: Mon, 05 Apr 2021 02:26:38 -0700
Message-ID: <606AD7CE.6090606@tlinx.org>
In-Reply-To: <CAAr43iOdVea3YYThgdYpJxRCaVtFVhyHz_FwMTQhqTw8+YT-zg@mail.gmail.com>

On 2021/04/04 14:26, Joel Rees via Cygwin wrote:
>
>> 1. What perl Unicode modules should I consider, if not Text::Unidecode?
>> The present need is to be able to convert those few "foreign" characters
>> (like ÇĆĈĊçĉċĜĞĠĢĝģğġËÌÍÎÏÒÓÔÕ) that are basically ASCII with accent
>> marks to their closest ASCII equivalents, but I'd like to do more with
>> Unicode in the future, without going down any dead-ends as far as being
>> able to run under cygwin is concerned.
>>
>
> "Stripping those few foreign accent characters" is probably not really what
> you want to do.
>   
----
    Why not?  You don't know his use case and you are misinterpreting his
example as random garbage.

Those aren't a random foreign encoding -- those are C's and G's, then E, I
and O, with accent variations that he may want to collapse for storage in a
text storage and retrieval (search) application.  They are all well-formed,
correctly encoded UTF-8 characters -- they are not some 8-bit encoding
that was mangled by being displayed, without recoding, in a UTF-8 context.
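
For the narrow accent-collapsing case, here is a minimal sketch that uses
only the core Unicode::Normalize module rather than Text::Unidecode.  It
decomposes each character (NFD) and then drops the combining marks.  The
helper name fold_accents is made up for illustration, and the sketch assumes
the text is already decoded into Perl characters.  Letters with no canonical
decomposition (ø, ß and so on) pass through unchanged, so this is not a
general transliterator.

```perl
use strict;
use warnings;
use utf8;                       # the string literal below is UTF-8 source text
use Unicode::Normalize qw(NFD); # core module

# Hypothetical helper: fold accented Latin letters to their base letters.
sub fold_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);   # e.g. "Ç" becomes "C" followed by U+0327
    $decomposed =~ s/\p{Mn}+//g;   # remove the nonspacing combining marks
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
my $sample = 'ÇĆĈĊçĉċĜĞĠĢĝģğġËÌÍÎÏÒÓÔÕ';
printf "%s -> %s\n", $sample, fold_accents($sample);
```

For a search application one would probably apply the same folding to both
the stored text and the query, and keep the original text for display.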

I didn't know about Text::Unidecode, but it exists specifically to create
Latinized alternatives to foreign characters.  That was another hint
that his example wasn't a random mistake.  The manpage for it says:

       It often happens that you have non-Roman text data in Unicode, but
       you can't display it -- usually because you're trying to show it to
       a user via an application that doesn't support Unicode, or because
       the fonts you need aren't accessible.  You could represent the
       Unicode characters as "???????" or "\15BA\15A0\1610...", but that's
       nearly useless to the user who actually wants to read what the text
       says.

For example:

perl -CS -e '
use utf8;
use Text::Unidecode;
my $name="\x{5317}\x{4EB0}";

printf "name, %s == %s\n", $name, unidecode($name);
'
name, 北亰 == Bei Jing

It's not just about removing accents, but about getting an English-like
transliteration of the foreign text.

All of the characters he used as examples were well-formed UTF-8
characters.
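
That claim can be checked rather than argued about: a strict decode fails
on anything that is not well-formed UTF-8.  A sketch, assuming the core
Encode module; the helper name is_valid_utf8 and the two sample byte strings
are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();   # core module

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;   # private copy; decode() with a CHECK value may modify it
    # FB_CROAK makes decode() die on malformed input instead of substituting.
    return eval { Encode::decode('UTF-8', $bytes, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

my $utf8_bytes   = "\xC3\x87\xC4\x9C";  # "ÇĜ" encoded as UTF-8
my $latin1_bytes = "\xC7\xCB";          # "ÇË" as ISO-8859-1 bytes

printf "UTF-8 sample:   %s\n", is_valid_utf8($utf8_bytes)   ? 'valid' : 'invalid';
printf "Latin-1 sample: %s\n", is_valid_utf8($latin1_bytes) ? 'valid' : 'invalid';
```

Text in an 8-bit encoding that was pasted into a UTF-8 context without
recoding would normally fail this check, which is the opposite of what his
sample does.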



> Those "accent characters" are misinterpreted foreign encoding (likely not
> to be Unicode) characters. Simply "stripping" the "accent characters" will
> basically convert them to truly meaningless junk. I suppose the meaningless
> junk can then be interpreted by the reader as "used to be a be a foreign
> word here", but why bother contributing further to information entropy?
>
>> 2. I see some talk of Internationalization in Chapter 2 of "Setting up
>> Cygwin", but cannot see anything relating to perl modules, and I don't
>> see any easy way to search many months of the mailing list for a
>> keyword... is there any information I should know about?
>
> Have you read the perldoc on internationalization?

