From: Corinna Vinschen <corinna-cygwin@cygwin.com>
To: cygwin@cygwin.com
Subject: Re: Unicode width data inconsistent/outdated
Date: Tue, 08 Aug 2017 08:22:00 -0000
Message-ID: <20170808082220.GA13759@calimero.vinschen.de>
In-Reply-To: <3eb4ee2f-f62c-cb19-3e4b-10cc57852ba9@towo.net>

On Aug  7 21:27, Thomas Wolff wrote:
> Am 07.08.2017 um 11:28 schrieb Corinna Vinschen:
> > On Aug  5 21:06, Thomas Wolff wrote:
> > > I have a working version now, and it uses much less space, as the
> > > category table is range-based.
> > > Another table is needed for case conversion. Size estimates are as follows
> > > (based on Unicode 5.2 for a fair comparison, going up a little bit for 10.0
> > > of course):
> > > 
> > > Categories: 2313 entries (10.0: 2715)
> > > each entry needs 9 bytes, total 20817 bytes
> > > I don't know whether that expands by some word-alignment.
> > > I could pack entries to 7 bytes, or even 6 bytes if that helps (total 16191
> > > or 13878).
> > > 
> > > Case conversion: 2062 entries (10.0: 2621)
> > > each entry needs 12 bytes, total 24744
> > > packed 8 bytes, total 16496
> > > 
> > > The Categories table could be boiled down to 1223 entries (penalty: double
> > > runtime for iswupper and iswlower)
> > > The Case conversion table could be transformed to a compact form
> > > Case conversion compact: 1201 entries
> > > each entry needs 16 bytes, total 19216
> > > packed 12 or 11 (or even 10), total 14412 (or 12010)
> > > So I think the increase is acceptable for the benefit of simple and
> > > automatic generation
> > So we're at 40K+ plus code then.
> No, if I implement the packed versions, it's 19.3K, so even smaller than
> currently.

Apparently I added up wrongly.
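
[Editor's note: the figures quoted above can be re-added with a short
script. This is only a sketch of the arithmetic. The entry counts and
per-entry sizes are the Unicode 5.2 numbers from the quoted mail; the
names Table, total_bytes, UNPACKED, PACKED and COMPACT_PACKED are made up
for illustration. The last combination (reduced category table at 6 bytes
per entry plus compact case table at 10 bytes per entry) is an assumption
about what the "19.3K" refers to; the mail does not spell it out.]

```python
from typing import NamedTuple


class Table(NamedTuple):
    """One generated lookup table: entry count and size of each entry."""
    name: str
    entries: int
    bytes_per_entry: int


def total_bytes(tables):
    """Sum entries * bytes_per_entry, ignoring any alignment padding."""
    return sum(t.entries * t.bytes_per_entry for t in tables)


# Figures as quoted in the thread (Unicode 5.2 based).
UNPACKED = [Table("categories", 2313, 9),
            Table("case conversion", 2062, 12)]
PACKED = [Table("categories", 2313, 6),
          Table("case conversion", 2062, 8)]
# Assumed combination behind the "19.3K" figure (not stated in the mail).
COMPACT_PACKED = [Table("categories, reduced", 1223, 6),
                  Table("case conversion, compact", 1201, 10)]

for label, tables in (("unpacked", UNPACKED),
                      ("packed", PACKED),
                      ("compact+packed", COMPACT_PACKED)):
    print(f"{label:>15}: {total_bytes(tables):6d} bytes")
```

[Under these assumptions the totals are 45561, 30374 and 19348 bytes, which
is consistent with both the "40K+" estimate for the unpacked tables and the
19.3K estimate for the packed, compact ones.]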

> > > I had noticed meanwhile that this is not active in Cygwin, but it's broken
> > > anyway for multiple reasons:
> > >     * platforms for which wchar_t is not Unicode should be explicitly listed
> > >     * if used, the transformation needs to be applied to all non-Unicode
> > > locales (also Chinese, Korean, and even 8-bit locales such as *.CP1252)
> > >     * for towupper and towlower, the result must be back-transformed into the
> > > respective locale encoding
> > >     * particularly the locale-specific _l functions inconsistently do not use
> > > the transformation but have this note:
> > No, no, no.  The functionality is restricted to certain use-cases and
> > always was.  It was a paid-for customer extension back in the day and it
> > was *sufficient* for the use-cases.  It's not clear how many newlib
> > users are still using it, but it's not a good idea to remove it without
> > checking first.  That means, ask on the newlib mailing list how many are
> > using the historical jp2uc code, and if we don't get a reply within,
> > say, a month, we can probably nuke it.
> OK, let's make such a request after holiday time.
> But, even if this shall persist as a special solution, it's still broken and
> should be fixed.
> Can we then substitute the current table with calling the iconvdata
> functions? In that case, as I said, the back-conversion would be available
> too, and I could fix that and add the missing handling of the _l functions,
> for a consistent solution.

I'm not quite sure I follow.  Do you mean iconvdata tables for the
three Japanese codesets only?  Wouldn't that mean converting the
multibyte stuff into Unicode and vice versa, basically getting rid
of the jp2uc workaround?

After a night's sleep, that might actually be the best way anyway.  I
agree that the jp2uc workaround is a bit of a hack.  Well, not a bit.
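
[Editor's note: the idea discussed here, converting a character from the
locale codeset to Unicode, applying the case mapping there, and converting
the result back, can be sketched as follows. This is not newlib code and
does not use newlib's iconvdata tables; it uses Python's built-in codecs
only to show the round trip. The function name towupper_via_unicode and
its fallback behaviour (return the input unchanged when the uppercase form
is not a single character or cannot be encoded in the codeset) are
assumptions made for this illustration.]

```python
def towupper_via_unicode(mb: bytes, codeset: str) -> bytes:
    """Uppercase one character given in `codeset` by going through Unicode.

    Steps: decode to Unicode, apply the Unicode case mapping, encode back.
    The input is returned unchanged if it is not exactly one character,
    if its uppercase form is not a single character, or if that form has
    no representation in `codeset`.
    """
    ch = mb.decode(codeset)          # locale encoding -> Unicode
    if len(ch) != 1:
        return mb
    upper = ch.upper()               # case mapping done on Unicode
    if len(upper) != 1:              # e.g. sharp s maps to "SS"
        return mb
    try:
        return upper.encode(codeset)  # Unicode -> locale encoding
    except UnicodeEncodeError:        # no uppercase form in this codeset
        return mb


# Fullwidth small "a" (U+FF41) in two Japanese codesets and an 8-bit one.
for codeset in ("euc_jp", "shift_jis"):
    src = "\uff41".encode(codeset)
    print(codeset, src.hex(), "->", towupper_via_unicode(src, codeset).hex())
print("cp1252", towupper_via_unicode(b"\xe9", "cp1252").hex())
```

[The same round trip would apply to any non-Unicode locale, which is why a
generic conversion makes the special-purpose table unnecessary.]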

However, given that this does not affect Cygwin, we should really discuss
this on the newlib list.


Thanks,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat
