* Unicode update of width and other character properties
@ 2017-08-06 5:36 Thomas Wolff
2017-08-07 10:31 ` Corinna Vinschen
2017-12-02 11:25 ` Ping: " Thomas Wolff
0 siblings, 2 replies; 15+ messages in thread
From: Thomas Wolff @ 2017-08-06 5:36 UTC (permalink / raw)
To: newlib
Hi,
this is a proposal to update wcwidth and the character properties
functions isw*/towupper/towlower to Unicode 10.0, as discussed in the
mail thread https://cygwin.com/ml/cygwin/2017-07/msg00366.html,
as well as to simplify automatic generation of respective tables for an
easier update step.
Table size is moderate (using ranges for character properties) but there
is still an option to reduce the two big tables in size.
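As a rough illustration of the range-based approach (a sketch only, not code
from the patch): a property is stored as a sorted list of code point ranges
and looked up by binary search. The names interval, demo_combining, in_ranges
and demo_is_combining are hypothetical, and the three ranges are a hand-picked
excerpt rather than generated data.

```c
#include <stddef.h>

/* One closed range [first, last] of code points sharing a property. */
struct interval {
    unsigned int first;
    unsigned int last;
};

/* Hypothetical excerpt for illustration only, NOT the generated table.
   Entries must be sorted and non-overlapping. */
const struct interval demo_combining[] = {
    { 0x0300, 0x036F },
    { 0x0483, 0x0489 },
    { 0x200B, 0x200F },
};

/* Binary search: returns 1 if ucs falls inside one of the n ranges. */
int in_ranges(unsigned int ucs, const struct interval *table, size_t n)
{
    size_t lo = 0, hi = n;          /* search window is [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ucs > table[mid].last)
            lo = mid + 1;           /* ucs lies above this range */
        else if (ucs < table[mid].first)
            hi = mid;               /* ucs lies below this range */
        else
            return 1;               /* first <= ucs <= last */
    }
    return 0;
}

/* Hypothetical wrapper over the demo table. */
int demo_is_combining(unsigned int ucs)
{
    return in_ranges(ucs, demo_combining,
                     sizeof demo_combining / sizeof demo_combining[0]);
}
```

With such a layout the table grows with the number of ranges rather than the
number of characters, which is why ranges keep the size moderate.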
The patch can be retrieved from http://towo.net/cygwin/charprops10.zip .
The Makefile.widthdata does not yet distinguish the two subdirectories
(libc/string, libc/ctype) as it comes from a common development directory.
There is a test program that compares the isw*/tow* functions between
the current and the patched implementation.
I also provide a log of deviations of the new approach from the current
implementation, based on Unicode 5.2 data, to compare and check.
If there are any disputable cases, I would of course consider them.
My main aim was actually to get the wcwidth data updated, for which the
change is more clear-cut.
Thanks
Thomas
* Re: Unicode update of width and other character properties
2017-08-06 5:36 Unicode update of width and other character properties Thomas Wolff
@ 2017-08-07 10:31 ` Corinna Vinschen
2017-08-07 19:18 ` Thomas Wolff
2017-12-02 11:25 ` Ping: " Thomas Wolff
1 sibling, 1 reply; 15+ messages in thread
From: Corinna Vinschen @ 2017-08-07 10:31 UTC (permalink / raw)
To: newlib
[-- Attachment #1: Type: text/plain, Size: 1024 bytes --]
On Aug 6 07:36, Thomas Wolff wrote:
> Hi,
> this is a proposal to update wcwidth and the character properties functions
> isw*/towupper/towlower to Unicode 10.0, as discussed in the mail thread
> https://cygwin.com/ml/cygwin/2017-07/msg00366.html,
> as well as to simplify automatic generation of respective tables for an
> easier update step.
> Table size is moderate (using ranges for character properties) but there is
> still an option to reduce the two big tables in size.
As per the aforementioned discussion the table sizes are at least
twice as big, so this should be done with all due caution towards
the goals of smaller targets.
> The patch can be retrieved from http://towo.net/cygwin/charprops10.zip .
That's not how it works. Please create a git patch series and post
it here.
There's probably also a bit more to discuss before changing how this
works since it affects all targets using wide char functions.
Thanks,
Corinna
--
Corinna Vinschen
Cygwin Maintainer
Red Hat
* Re: Unicode update of width and other character properties
2017-08-07 10:31 ` Corinna Vinschen
@ 2017-08-07 19:18 ` Thomas Wolff
2017-08-08 8:30 ` Corinna Vinschen
0 siblings, 1 reply; 15+ messages in thread
From: Thomas Wolff @ 2017-08-07 19:18 UTC (permalink / raw)
To: newlib
On 07.08.2017 at 12:30, Corinna Vinschen wrote:
> On Aug 6 07:36, Thomas Wolff wrote:
>> Hi,
>> this is a proposal to update wcwidth and the character properties functions
>> isw*/towupper/towlower to Unicode 10.0, as discussed in the mail thread
>> https://cygwin.com/ml/cygwin/2017-07/msg00366.html,
>> as well as to simplify automatic generation of respective tables for an
>> easier update step.
>> Table size is moderate (using ranges for character properties) but there is
>> still an option to reduce the two big tables in size.
> As per the aforementioned discussion the table sizes are at least
> twice as big, so this should be done with all due caution towards
> the goals of smaller targets.
If I'm going to implement the packed versions, they will be even smaller
than the current tables.
>> The patch can be retrieved from http://towo.net/cygwin/charprops10.zip .
> That's not how it works. Please create a git patch series and post it here.
Any howto available, please? What's the git URL, how to produce the
desired patch format/series.
And then the patch would be included here by email?
Thomas
* Re: Unicode update of width and other character properties
2017-08-07 19:18 ` Thomas Wolff
@ 2017-08-08 8:30 ` Corinna Vinschen
2017-08-17 11:03 ` Thomas Wolff
0 siblings, 1 reply; 15+ messages in thread
From: Corinna Vinschen @ 2017-08-08 8:30 UTC (permalink / raw)
To: newlib
[-- Attachment #1: Type: text/plain, Size: 1566 bytes --]
On Aug 7 21:18, Thomas Wolff wrote:
> On 07.08.2017 at 12:30, Corinna Vinschen wrote:
> > On Aug 6 07:36, Thomas Wolff wrote:
> > > Hi,
> > > this is a proposal to update wcwidth and the character properties functions
> > > isw*/towupper/towlower to Unicode 10.0, as discussed in the mail thread
> > > https://cygwin.com/ml/cygwin/2017-07/msg00366.html,
> > > as well as to simplify automatic generation of respective tables for an
> > > easier update step.
> > > Table size is moderate (using ranges for character properties) but there is
> > > still an option to reduce the two big tables in size.
> > As per the aforementioned discussion the table sizes are at least
> > twice as big, so this should be done with all due caution towards
> > the goals of smaller targets.
> If I'm going to implement the packed versions, they will be even smaller
> than the current tables.
>
> > > The patch can be retrieved from http://towo.net/cygwin/charprops10.zip .
> > That's not how it works. Please create a git patch series and post it here.
> Any howto available, please? What's the git URL,
https://cygwin.com/git.html
> how to produce the desired patch format/series.
Just as with any other git-based project:
$ git checkout -b my-stuff
[hack, hack, hack]
$ git commit [in useful chunks]
$ git format-patch -X (X == number of commits)
> And then the patch would be included here by email?
Yes:
$ git send-email --to="newlib@sourceware.org"
Thanks,
Corinna
--
Corinna Vinschen
Cygwin Maintainer
Red Hat
* Re: Unicode update of width and other character properties
2017-08-08 8:30 ` Corinna Vinschen
@ 2017-08-17 11:03 ` Thomas Wolff
2017-12-03 14:07 ` Corinna Vinschen
0 siblings, 1 reply; 15+ messages in thread
From: Thomas Wolff @ 2017-08-17 11:03 UTC (permalink / raw)
To: newlib
[-- Attachment #1: Type: text/plain, Size: 2396 bytes --]
On 08.08.2017 at 10:30, Corinna Vinschen wrote:
> On Aug 7 21:18, Thomas Wolff wrote:
>> On 07.08.2017 at 12:30, Corinna Vinschen wrote:
>>> On Aug 6 07:36, Thomas Wolff wrote:
>>>> Hi,
>>>> this is a proposal to update wcwidth and the character properties functions
>>>> isw*/towupper/towlower to Unicode 10.0, as discussed in the mail thread
>>>> https://cygwin.com/ml/cygwin/2017-07/msg00366.html,
>>>> as well as to simplify automatic generation of respective tables for an
>>>> easier update step.
>>>> Table size is moderate (using ranges for character properties) but there is
>>>> still an option to reduce the two big tables in size.
>>> As per the aforementioned discussion the table sizes are at least
>>> twice as big, so this should be done with all due caution towards
>>> the goals of smaller targets.
>> If I'm going to implement the packed versions, they will be even smaller
>> than the current tables.
>>
>> ...
>> how to produce the desired patch format/series.
> Just as with any other git-based project:
>
> $ git checkout -b my-stuff
> [hack, hack, hack]
> $ git commit [in useful chunks]
> $ git format-patch -X (X == number of commits)
>
>> And then the patch would be included here by email?
> Yes:
>
> $ git send-email --to="newlib@sourceware.org"
I'm attaching my patches here for assessment.
I have revised table handling further, using gcc bit struct packing. The
two big tables have a total size of 14340 bytes now, for Unicode 10.0.
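To illustrate what bit struct packing buys (a sketch under assumptions, not
the layout used in the patch): with the gcc/clang packed attribute, bit-fields
can be sized to the value ranges actually needed. The struct names, field
names and field widths below are hypothetical.

```c
/* Unpacked layout: three full words per entry. */
struct plain_entry {
    unsigned int first;
    unsigned int last;
    int delta;
};

/* Hypothetical packed layout (gcc/clang extension): 21 + 10 + 17 = 48 bits. */
struct packed_entry {
    unsigned int first : 21;  /* code points go up to 0x10FFFF, i.e. 21 bits */
    unsigned int span  : 10;  /* last - first, assumed to be <= 1023 */
    signed int   delta : 17;  /* signed offset, e.g. to the other case */
} __attribute__((packed));

/* Build an entry; the caller must keep last - first within 10 bits. */
struct packed_entry make_entry(unsigned int first, unsigned int last, int delta)
{
    struct packed_entry e;

    e.first = first;
    e.span = last - first;
    e.delta = delta;
    return e;
}

/* Recover the end of the range from the stored start and span. */
unsigned int entry_last(struct packed_entry e)
{
    return e.first + e.span;
}
```

For example, make_entry(0x61, 0x7A, -32) would describe the ASCII lowercase
letters with their offset to uppercase. The attribute is compiler-specific, so
a portable build would need a fallback layout.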
I have fixed locale handling in the isw* and tow* functions, but I've
not yet changed JP conversion. Unfortunately, the routines from
newlib/iconvdata are not as straightforward to employ as I
thought, because they work on multi-byte representations.
Also the mapping of ctype charsets (JIS, SJIS, EUC-JP) to the subsets
handled in iconvdata (JIS-201/208/212) is a little bit obscure.
Likewise obscure is the relation between newlib/iconvdata and
newlib/libc/iconv.
To be on the safe side, I'm leaving the actual jp2uc conversion
untouched for now, and I've just added a dummy back-conversion uc2jp
with a #warning. If the #warning is ignored or removed, the non-Cygwin
build should work as before, fixing just locale handling.
I'm attaching the wcwidth part here; all patches are available at
http://towo.net/cygwin/Unicode_and_locale_tweaks.zip (they don't fit within
the mailbox size limit).
Thomas
[-- Attachment #2: 0001-creation-of-width-data-supporting-Unicode-updates.patch --]
[-- Type: text/plain, Size: 26840 bytes --]
From 9c5d6b1adcf949269e3fceeaf31203921745d2c9 Mon Sep 17 00:00:00 2001
From: mintty <mintty@users.noreply.github.com>
Date: Mon, 14 Aug 2017 21:59:25 +0200
Subject: [PATCH 1/4] creation of width data, supporting Unicode updates
---
newlib/libc/string/Makefile.widthdata | 47 +++
newlib/libc/string/mkwide | 49 +++
newlib/libc/string/mkwidthA | 20 +
newlib/libc/string/uniset | 678 ++++++++++++++++++++++++++++++++++
4 files changed, 794 insertions(+)
create mode 100644 newlib/libc/string/Makefile.widthdata
create mode 100755 newlib/libc/string/mkwide
create mode 100755 newlib/libc/string/mkwidthA
create mode 100755 newlib/libc/string/uniset
diff --git a/newlib/libc/string/Makefile.widthdata b/newlib/libc/string/Makefile.widthdata
new file mode 100644
index 0000000..14adab5
--- /dev/null
+++ b/newlib/libc/string/Makefile.widthdata
@@ -0,0 +1,47 @@
+#############################################################################
+# generate Unicode width data for newlib/libc/string/wcwidth.c
+
+
+#############################################################################
+# table sets to be generated
+
+widthdata=combining.t ambiguous.t wide.t
+
+widthdata: $(widthdata)
+
+
+#############################################################################
+# tools and data
+
+#WGET=wget -N -t 1 --timeout=55
+WGET=curl -R -O --connect-timeout 55
+WGET+=-z $@
+
+%.txt:
+ ln -s /usr/share/unicode/ucd/$@ . || $(WGET) http://unicode.org/Public/UNIDATA/$@
+
+uniset.tar.gz:
+ $(WGET) http://www.cl.cam.ac.uk/~mgk25/download/uniset.tar.gz
+
+uniset: uniset.tar.gz
+ gzip -dc uniset.tar.gz | tar xvf - uniset
+
+
+#############################################################################
+# width data for libc/string/wcwidth.c
+
+combining.t: uniset UnicodeData.txt Blocks.txt
+ PATH="${PATH}:." uniset +cat=Me +cat=Mn +cat=Cf -00AD +1160-11FF +200B +D7B0-D7C6 +D7CB-D7FB c > combining.t
+
+WIDTH-A: uniset UnicodeData.txt Blocks.txt EastAsianWidth.txt
+ PATH="${PATH}:." sh ./mkwidthA
+
+ambiguous.t: uniset WIDTH-A UnicodeData.txt Blocks.txt
+ PATH="${PATH}:." uniset +WIDTH-A -cat=Me -cat=Mn -cat=Cf c > ambiguous.t
+
+wide.t: uniset UnicodeData.txt Blocks.txt EastAsianWidth.txt
+ PATH="${PATH}:." sh ./mkwide
+
+
+#############################################################################
+# end
diff --git a/newlib/libc/string/mkwide b/newlib/libc/string/mkwide
new file mode 100755
index 0000000..55a0bab
--- /dev/null
+++ b/newlib/libc/string/mkwide
@@ -0,0 +1,49 @@
+#! /bin/sh
+
+# generate list of wide characters, with convex closure
+
+skipcheck=false
+
+if [ ! -r EastAsianWidth.txt ]
+then ln -s /usr/share/unicode/ucd/EastAsianWidth.txt . || exit 1
+fi
+if [ ! -r UnicodeData.txt ]
+then ln -s /usr/share/unicode/ucd/UnicodeData.txt . || exit 1
+fi
+if [ ! -r Blocks.txt ]
+then ln -s /usr/share/unicode/ucd/Blocks.txt . || exit 1
+fi
+
+sed -e "s,^\([^;]*\);[NAH],\1," -e t -e d EastAsianWidth.txt > wide.na
+sed -e "s,^\([^;]*\);[WF],\1," -e t -e d EastAsianWidth.txt > wide.fw
+
+PATH="$PATH:." # for uniset
+
+nrfw=`uniset +wide.fw nr | sed -e 's,.*:,,'`
+echo FW $nrfw
+nrna=`uniset +wide.na nr | sed -e 's,.*:,,'`
+echo NAH $nrna
+
+extrablocks="2E80-303E"
+
+# check all blocks
+includes () {
+ nr=`uniset +wide.$2 -$1 nr | sed -e 's,.*:,,'`
+ test $nr != $3
+}
+echo "adding compact closure of wide ranges, this may take ~10min"
+for b in $extrablocks `sed -e 's,^\([0-9A-F]*\)\.\.\([0-9A-F]*\).*,\1-\2,' -e t -e d Blocks.txt`
+do range=$b
+ echo checking $range $* >&2
+ if includes $range fw $nrfw && ! includes $range na $nrna
+ then echo $range
+ fi
+done > wide.blocks
+
+(
+sed -e "s,^,//," -e 1q EastAsianWidth.txt
+sed -e "s,^,//," -e 1q Blocks.txt
+uniset `sed -e 's,^,+,' wide.blocks` +wide.fw c
+) > wide.t
+
+rm -f wide.na wide.fw wide.blocks
diff --git a/newlib/libc/string/mkwidthA b/newlib/libc/string/mkwidthA
new file mode 100755
index 0000000..343ab40
--- /dev/null
+++ b/newlib/libc/string/mkwidthA
@@ -0,0 +1,20 @@
+#! /bin/sh
+
+# generate WIDTH-A file, listing Unicode characters with width property
+# Ambiguous, from EastAsianWidth.txt
+
+if [ ! -r EastAsianWidth.txt ]
+then ln -s /usr/share/unicode/ucd/EastAsianWidth.txt . || exit 1
+fi
+if [ ! -r UnicodeData.txt ]
+then ln -s /usr/share/unicode/ucd/UnicodeData.txt . || exit 1
+fi
+if [ ! -r Blocks.txt ]
+then ln -s /usr/share/unicode/ucd/Blocks.txt . || exit 1
+fi
+
+sed -e "s,^\([^;]*\);A,\1," -e t -e d EastAsianWidth.txt > width-a-new
+rm -f WIDTH-A
+echo "# UAX #11: East Asian Ambiguous" > WIDTH-A
+PATH="$PATH:." uniset +width-a-new compact >> WIDTH-A
+rm -f width-a-new
diff --git a/newlib/libc/string/uniset b/newlib/libc/string/uniset
new file mode 100755
index 0000000..415e219
--- /dev/null
+++ b/newlib/libc/string/uniset
@@ -0,0 +1,678 @@
+#!/usr/bin/perl
+# Uniset -- Unicode subset manager -- Markus Kuhn
+# http://www.cl.cam.ac.uk/~mgk25/download/uniset.tar.gz
+# $Id: uniset,v 1.18 2004-04-10 21:19:39+01 mgk25 Exp mgk25 $
+
+require 5.008;
+use open ':utf8';
+
+binmode(STDOUT, ":utf8");
+binmode(STDIN, ":utf8");
+
+my (%name, %invname, %category, %comment);
+
+print <<End if $#ARGV < 0;
+Uniset -- Unicode subset manager -- Markus Kuhn
+
+Uniset allows to merge and subtract Unicode subsets. It can output and
+analyse the resulting set in various formats.
+
+The following commands can be supplied to uniset on the command line:
+
+Commands to define a set of characters:
+
+ + filename add the character set described in the file to the set
+ - filename remove the character set described in the file from the set
+ +: filename add the characters in the UTF-8 file to the set
+ -: filename remove the characters in the UTF-8 file from the set
+ +xxxx..yyyy add the range to the set (xxxx and yyyy are hex numbers)
+ -xxxx..yyyy remove the range from the set (xxxx and yyyy are hex numbers)
+ +cat=Xx add all Unicode characters with category code Xx
+ -cat=Xx remove all Unicode characters with category code Xx
+ -cat!=Xx remove all Unicode characters without category code Xx
+ clean remove any elements that do not appear in the Unicode database
+ unknown remove any elements that do appear in the Unicode database
+
+Command to output descriptions of the constructed set of characters:
+
+ table write a full table with one line per character
+ compact output the set in compact MES format
+ c output the set as C interval array
+ nr output the number of characters
+ sources output a table that shows the number of characters contributed
+ by the various combinations of input sets added with +.
+ utf8-list output a list of all characters encoded in UTF-8
+
+Commands to tailor the following output commands:
+
+ html write HTML tables instead of plain text
+ ucs add the unicode character itself to the table (UTF-8 in
+ plain table, numeric character reference in HTML)
+
+Formats of character set input files read by the + and - command:
+
+Empty lines, white space at the start and end of the line and any
+comment text following a \# are ignored. The following formats are
+recognized
+
+xx yyyy xx is the hex code in an 8-bit character set and yyyy
+ is the corresponding Unicode value. Both can optionally
+ be prefixed by 0x. This is the format used in the
+ files on <ftp://ftp.unicode.org/Public/MAPPINGS/>.
+
+yyyy yyyy (optionally prefixed with 0x) is a Unicode character
+ belonging to the specified subset.
+
+yyyy-yyyy a range of Unicode characters belonging to
+yyyy..yyyy the specified subset.
+
+xx yy yy yy-yy yy xx denotes a row (high-byte) and the yy specify
+ corresponding low bytes or with a hyphen also ranges of
+ low bytes in the Unicode values that belong to this
+ subset. This is also the format that is generated by
+ the compact command.
+End
+exit 1 if $#ARGV < 0;
+
+
+# Subroutine to identify whether the ISO 10646/Unicode character code
+# ucs belongs into the East Asian Wide (W) or East Asian FullWidth
+# (F) category as defined in Unicode Technical Report #11.
+
+sub iswide ($) {
+ my $ucs = shift(@_);
+
+ return ($ucs >= 0x1100 &&
+ ($ucs <= 0x115f || # Hangul Jamo
+ $ucs == 0x2329 || $ucs == 0x232a ||
+ ($ucs >= 0x2e80 && $ucs <= 0xa4cf &&
+ $ucs != 0x303f) || # CJK .. Yi
+ ($ucs >= 0xac00 && $ucs <= 0xd7a3) || # Hangul Syllables
+ ($ucs >= 0xf900 && $ucs <= 0xfaff) || # CJK Comp. Ideographs
+ ($ucs >= 0xfe30 && $ucs <= 0xfe6f) || # CJK Comp. Forms
+ ($ucs >= 0xff00 && $ucs <= 0xff60) || # Fullwidth Forms
+ ($ucs >= 0xffe0 && $ucs <= 0xffe6) ||
+ ($ucs >= 0x20000 && $ucs <= 0x2fffd) ||
+ ($ucs >= 0x30000 && $ucs <= 0x3fffd)));
+}
+
+# Return the Unicode name that belongs to a given character code
+
+# Jamo short names, see Unicode 3.0, table 4-4, page 86
+
+my @lname = ('G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '',
+ 'J', 'JJ', 'C', 'K', 'T', 'P', 'H'); # 1100..1112
+my @vname = ('A', 'AE', 'YA', 'YAE', 'EO', 'E', 'YEO', 'YE', 'O',
+ 'WA', 'WAE', 'OE', 'YO', 'U', 'WEO', 'WE', 'WI', 'YU',
+ 'EU', 'YI', 'I'); # 1161..1175
+my @tname = ('G', 'GG', 'GS', 'N', 'NJ', 'NH', 'D', 'L', 'LG', 'LM',
+ 'LB', 'LS', 'LT', 'LP', 'LH', 'M', 'B', 'BS', 'S', 'SS',
+ 'NG', 'J', 'C', 'K', 'T', 'P', 'H'); # 11a8..11c2
+
+sub name {
+ my $ucs = shift(@_);
+
+ # The intervals used here reflect Unicode Version 3.2
+ if (($ucs >= 0x3400 && $ucs <= 0x4db5) ||
+ ($ucs >= 0x4e00 && $ucs <= 0x9fa5) ||
+ ($ucs >= 0x20000 && $ucs <= 0x2a6d6)) {
+ return "CJK UNIFIED IDEOGRAPH-" . sprintf("%04X", $ucs);
+ }
+
+ if ($ucs >= 0xac00 && $ucs <= 0xd7a3) {
+ my $s = $ucs - 0xac00;
+ my $l = 0x1100 + int($s / (21 * 28));
+ my $v = 0x1161 + int(($s % (21 * 28)) / 28);
+ my $t = 0x11a7 + $s % 28;
+ return "HANGUL SYLLABLE " .
+ ($lname[int($s / (21 * 28))] .
+ $vname[int(($s % (21 * 28)) / 28)] .
+ $tname[$s % 28 - 1]);
+ }
+
+ return $name{$ucs};
+}
+
+sub is_unicode {
+ my $ucs = shift(@_);
+
+ # The intervals used here reflect Unicode Version 3.2
+ if (($ucs >= 0x3400 && $ucs <= 0x4db5) ||
+ ($ucs >= 0x4e00 && $ucs <= 0x9fa5) ||
+ ($ucs >= 0xac00 && $ucs <= 0xd7a3) ||
+ ($ucs >= 0x20000 && $ucs <= 0x2a6d6)) {
+ return 1;
+ }
+
+ return exists $name{$ucs};
+}
+
+
+my $html = 0;
+my $image = 0;
+my $adducs = 0;
+my $unicodedata = "UnicodeData.txt";
+my $blockdata = "Blocks.txt";
+my $datadir = "$ENV{HOME}/local/lib/ucs";
+
+# read list of all Unicode names
+if (!open(UDATA, $unicodedata) && !open(UDATA, "$datadir/$unicodedata")) {
+ die ("Can't open Unicode database '$unicodedata':\n$!\n\n" .
+ "Please make sure that you have downloaded the file\n" .
+ "ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt\n");
+}
+while (<UDATA>) {
+ if (/^([0-9,A-F]{4,8});([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*)$/) {
+ next if $2 ne '<control>' && substr($2, 0, 1) eq '<';
+ $ucs = hex($1);
+ $name{$ucs} = $2;
+ $invname{$2} = $ucs;
+ $category{$ucs} = $3;
+ $comment{$ucs} = $12;
+ } else {
+ die("Syntax error in line '$_' in file '$unicodedata'");
+ }
+}
+close(UDATA);
+
+# read list of all Unicode blocks
+if (!open(UDATA, $blockdata) && !open(UDATA, "$datadir/$blockdata")) {
+ die ("Can't open Unicode blockname list '$blockdata':\n$!\n\n" .
+ "Please make sure that you have downloaded the file\n" .
+ "ftp://ftp.unicode.org/Public/UNIDATA/Blocks.txt\n");
+}
+my $blocks = 0;
+my (@blockstart, @blockend, @blockname);
+while (<UDATA>) {
+ if (/^\s*([0-9,A-F]{4,8})\s*\.\.\s*([0-9,A-F]{4,8})\s*;\s*(.*)$/) {
+ $blockstart[$blocks] = hex($1);
+ $blockend [$blocks] = hex($2);
+ $blockname [$blocks] = $3;
+ $blocks++;
+ } elsif (/^\s*\#/ || /^\s*$/) {
+ # ignore comments and empty lines
+ } else {
+ die("Syntax error in line '$_' in file '$blockdata'");
+ }
+}
+close(UDATA);
+if ($blockend[$blocks-1] < 0x110000) {
+ $blockstart[$blocks] = 0x110000;
+ $blockend [$blocks] = 0x7FFFFFFF;
+ $blockname [$blocks] = "Beyond Plane 16";
+ $blocks++;
+}
+
+# process command line arguments
+while ($_ = shift(@ARGV)) {
+ if (/^html$/) {
+ $html = 1;
+ } elsif (/^ucs$/) {
+ $adducs = 1;
+ } elsif (/^img$/) {
+ $html = 1;
+ $image = 1;
+ } elsif (/^template$/) {
+ $template = shift(@ARGV);
+ open(TEMPLATE, $template) || die("Can't open template file '$template': '$!'");
+ while (<TEMPLATE>) {
+ if (/^\#\s*include\s+\"([^\"]*)\"\s*$/) {
+ open(INCLUDE, $1) || die("Can't open template include file '$1': '$!'");
+ while (<INCLUDE>) {
+ print $_;
+ }
+ close(INCLUDE);
+ } elsif (/^\#\s*quote\s+\"([^\"]*)\"\s*$/) {
+ open(INCLUDE, $1) || die("Can't open template include file '$1': '$!'");
+ while (<INCLUDE>) {
+ s/&/&amp;/g;
+ s/</&lt;/g;
+ print $_;
+ }
+ close(INCLUDE);
+ } else {
+ print $_;
+ }
+ }
+ close(TEMPLATE);
+ } elsif (/^\+cat=(.+)$/) {
+ # add characters with given category
+ $cat = $1;
+ for $i (keys(%category)) {
+ $used{$i} = "[${cat}]" if $category{$i} eq $cat;
+ }
+ } elsif (/^\-cat=(.+)$/) {
+ # remove characters with given category
+ $cat = $1;
+ for $i (keys(%category)) {
+ delete $used{$i} if $category{$i} eq $cat;
+ }
+ } elsif (/^\-cat!=(.+)$/) {
+ # remove characters without given category
+ $cat = $1;
+ for $i (keys(%category)) {
+ delete $used{$i} unless $category{$i} eq $cat;
+ }
+ } elsif (/^([+-]):(.*)/) {
+ $remove = $1 eq "-";
+ $setfile = $2;
+ $setfile = shift(@ARGV) if $setfile eq "";
+ push(@SETS, $setfile);
+ open(SET, $setfile) || die("Can't open set file '$setfile': '$!'");
+ $setname = $setfile;
+ while (<SET>) {
+ while ($_) {
+ $i = ord($_);
+ $used{$i} .= "[${setname}]" unless $remove;
+ delete $used{$i} if $remove;
+ $_ = substr($_, 1);
+ }
+ }
+ close SET;
+ } elsif (/^([+-])(.*)/) {
+ $remove = $1 eq "-";
+ $setfile = $2;
+ $setfile = "$setfile..$setfile" if $setfile =~ /^([0-9A-Fa-f]{4,8})$/;
+ if ($setfile =~ /^([0-9A-Fa-f]{4,8})(-|\.\.)([0-9A-Fa-f]{4,8})$/) {
+ # handle interval specification on command line
+ $first = hex($1);
+ $last = hex($3);
+ for ($i = $first; $i <= $last; $i++) {
+ $used{$i} .= "[ARG]" unless $remove;
+ delete $used{$i} if $remove;
+ }
+ next;
+ }
+ $setfile = shift(@ARGV) if $setfile eq "";
+ push(@SETS, $setfile);
+ open(SET, $setfile) || die("Can't open set file '$setfile': '$!'");
+ $cedf = ($setfile =~ /cedf/); # detect Kosta Kosti's trans CEDF format by path name
+ $setname = $setfile;
+ $setname =~ s/([^.\[\]]*)\..*/$1/;
+ while (<SET>) {
+ if (/^<code_set_name>/) {
+ # handle ISO 15897 (POSIX registry) charset mapping format
+ undef $comment_char;
+ undef $escape_char;
+ while (<SET>) {
+ if ($comment_char && /^$comment_char/) {
+ # remove comments
+ $_ = $`;
+ }
+ next if (/^\032?\s*$/); # skip empty lines
+ if (/^<comment_char> (\S)$/) {
+ $comment_char = $1;
+ } elsif (/^<escape_char> (\S)$/) {
+ $escape_char = $1;
+ } elsif (/^(END )?CHARMAP$/) {
+ #ignore
+ } elsif (/^<.*>\s*\/x([0-9A-F]{2})\s*<U([0-9A-F]{4,8})>/) {
+ $used{hex($2)} .= "[${setname}{$1}]" unless $remove;
+ delete $used{hex($2)} if $remove;
+ } else {
+ die("Syntax error in line $. in file '$setfile':\n'$_'\n");
+ }
+ }
+ next;
+ } elsif (/^STARTFONT /) {
+ # handle X11 BDF file
+ while (<SET>) {
+ if (/^ENCODING\s+([0-9]+)/) {
+ $used{$1} .= "[${setname}]" unless $remove;
+ delete $used{$1} if $remove;
+ }
+ }
+ next;
+ }
+ tr/a-z/A-Z/; # make input uppercase
+ if ($cedf) {
+ if ($. > 4) {
+ if (/^([0-9A-F]{2})\t.?\t(.*)$/) {
+ # handle Kosta Kosti's trans CEDF format
+ next if (hex($1) < 32 || (hex($1) > 0x7e && hex($1) < 0xa0));
+ $ucs = $invname{$2};
+ die "unknown ISO 10646 name '$2' in '$setfile' line $..\n" if ! $ucs;
+ $used{$ucs} .= "[${setname}{$1}]" unless $remove;
+ delete $used{$ucs} if $remove;
+ } else {
+ die("Syntax error in line $. in CEDF file '$setfile':\n'$_'\n");
+ }
+ }
+ next;
+ }
+ if (/^\s*(0X|U\+|U-)?([0-9A-F]{2})\s+\#\s*UNDEFINED\s*$/) {
+ # ignore ftp.unicode.org mapping file lines with #UNDEFINED
+ next;
+ }
+ s/^([^\#]*)\#.*$/$1/; # remove comments
+ next if (/^\032?\s*$/); # skip empty lines
+ if (/^\s*(0X)?([0-9A-F-]{2})\s+(0X|U\+|U-)?([0-9A-F]{4,8})\s*$/) {
+ # handle entry from a ftp.unicode.org mapping file
+ $used{hex($4)} .= "[${setname}{$2}]" unless $remove;
+ delete $used{hex($4)} if $remove;
+ } elsif (/^\s*(0X|U\+|U-)?([0-9A-F]{4,8})(\s*-\s*|\s*\.\.\s*|\s+)(0X|U\+|U-)?([0-9A-F]{4,8})\s*$/) {
+ # handle interval specification
+ $first = hex($2);
+ $last = hex($5);
+ for ($i = $first; $i <= $last; $i++) {
+ $used{$i} .= "[${setname}]" unless $remove;
+ delete $used{$i} if $remove;
+ }
+ } elsif (/^\s*([0-9A-F]{2,6})(\s+[0-9A-F]{2},?|\s+[0-9A-F]{2}-[0-9A-F]{2},?)+/) {
+ # handle lines from P10 MES draft
+ $row = $1;
+ $cols = $_;
+ $cols =~ s/^\s*([0-9A-F]{2,6})\s*(.*)\s*$/$2/;
+ $cols =~ tr/,//d;
+ @cols = split(/\s+/, $cols);
+ for (@cols) {
+ if (/^(..)$/) {
+ $first = hex("$row$1");
+ $last = $first;
+ } elsif (/^(..)-(..)$/) {
+ $first = hex("$row$1");
+ $last = hex("$row$2");
+ } else {
+ die ("this should never happen '$_'");
+ }
+ for ($i = $first; $i <= $last; $i++) {
+ $used{$i} .= "[${setname}]" unless $remove;
+ delete $used{$i} if $remove;
+ }
+ }
+ } elsif (/^\s*(0X|U\+|U-)?([0-9A-F]{4,8})\s*/) {
+ # handle single character
+ $used{hex($2)} .= "[${setname}]" unless $remove;
+ delete $used{hex($2)} if $remove;
+ } else {
+ die("Syntax error in line $. in file '$setfile':\n'$_'\n") unless /^\s*(\#.*)?$/;
+ }
+ }
+ close SET;
+ } elsif (/^loadimages$/ || /^loadbigimages$/) {
+ if (/^loadimages$/) {
+ $prefix = "Small.Glyphs";
+ } else {
+ $prefix = "Glyphs";
+ }
+ $total = 0;
+ for $i (keys(%used)) {
+ next if ($name{$i} eq "<control>");
+ $total++;
+ }
+ $count = 0;
+ $| = 1;
+ for $i (sort({$a <=> $b} keys(%used))) {
+ next if ($name{$i} eq "<control>");
+ $count++;
+ $j = sprintf("%04X", $i);
+ $j =~ /(..)(..)/;
+ $gif = "http://charts.unicode.org/Unicode.charts/$prefix/$1/U$j.gif";
+ print("\r$count/$total: $gif");
+ system("mkdir -p $prefix/$1; cd $prefix/$1; webcopy -u -s $gif &");
+ select(undef, undef, undef, 0.2);
+ }
+ print("\n");
+ exit 0;
+ } elsif (/^giftable/) {
+ # form a table of glyphs (requires pbmtools installed)
+ $count = 0;
+ for $i (keys(%used)) {
+ $count++ unless $name{$i} eq "<control>";
+ }
+ $width = int(sqrt($count/sqrt(2)) + 0.5);
+ $width = $1 if /^giftable([0-9]+)$/;
+ system("rm -f tmp-*.pnm table.pnm~ table.pnm");
+ $col = 0;
+ $row = 0;
+ for $i (sort({$a <=> $b} keys(%used))) {
+ next if ($name{$i} eq "<control>");
+ $j = sprintf("%04X", $i);
+ $j =~ /(..)(..)/;
+ $gif = "Small.Glyphs/$1/U$j.gif";
+ $pnm = sprintf("tmp-%02x.pnm", $col);
+ $fallback = "Small.Glyphs/FF/UFFFD.gif";
+ system("giftopnm $gif >$pnm || { rm $pnm ; giftopnm $fallback >$pnm ; }");
+ if (++$col == $width) {
+ system("pnmcat -lr tmp-*.pnm | cat >tmp-row.pnm");
+ if ($row == 0) {
+ system("mv tmp-row.pnm table.pnm");
+ } else {
+ system("mv table.pnm table.pnm~; pnmcat -tb table.pnm~ tmp-row.pnm >table.pnm");
+ }
+ $row++;
+ $col = 0;
+ system("rm -f tmp-*.pnm table.pnm~");
+ }
+ }
+ if ($col > 0) {
+ system("pnmcat -lr tmp-*.pnm | cat >tmp-row.pnm");
+ if ($row == 0) {
+ system("mv tmp-row.pnm table.pnm");
+ } else {
+ system("mv table.pnm table.pnm~; pnmcat -tb -jleft -black table.pnm~ tmp-row.pnm >table.pnm");
+ }
+ }
+ system("rm -f table.gif ; ppmtogif table.pnm > table.gif");
+ system("rm -f tmp-*.pnm table.pnm~ table.pnm");
+ } elsif (/^table$/) {
+ # go through all used names to print full table
+ print "<TABLE border=2>\n" if $html;
+ for $i (sort({$a <=> $b} keys(%used))) {
+ next if ($name{$i} eq "<control>");
+ if ($html) {
+ $sources = $used{$i};
+ $sources =~ s/\]\[/, /g;
+ $sources =~ s/^\[//g;
+ $sources =~ s/\]$//g;
+ $sources =~ s/\{(..)\}/<SUB>$1<\/SUB>/g;
+ $j = sprintf("%04X", $i);
+ $j =~ /(..)(..)/;
+ $gif = "Small.Glyphs/$1/U$j.gif";
+ print "<TR>";
+ print "<TD><img width=32 height=32 src=\"$gif\">" if $image;
+ printf("<TD>&#%d;", $i) if $adducs;
+ print "<TD><SAMP>$j</SAMP><TD><SAMP>" . name($i);
+ print " ($comment{$i})" if $comment{$i};
+ print "</SAMP><TD><SMALL>$sources</SMALL>\n";
+ } else {
+ printf("%04X \# ", $i);
+ print pack("U", $i) . " " if $adducs;
+ print name($i) ."\n";
+ }
+ }
+ print "</TABLE>\n" if $html;
+ } elsif (/^imgblock$/) {
+ $width = 16;
+ $width = $1 if /giftable([0-9]+)/;
+ $col = 0;
+ $subline = "";
+ print "\n<P><TABLE cellspacing=0 cellpadding=0>";
+ for $i (sort({$a <=> $b} keys(%used))) {
+ print "<TR>" if $col == 0;
+ $j = sprintf("%04X", $i);
+ $j =~ /(..)(..)/;
+ $gif = "Small.Glyphs/$1/U$j.gif";
+ $alt = name($i);
+ print "<TD><img width=32 height=32 src=\"$gif\" alt=\"$alt\">";
+ $subline .= "<TD><SMALL><SAMP>$j</SAMP></SMALL>";
+ if (++$col == $width) {
+ print "<TR align=center>$subline";
+ $col = 0;
+ $subline = "";
+ }
+ }
+ print "<TR align=center>$subline" if ($col > 0);
+ print "</TABLE>\n";
+ } elsif (/^sources$/) {
+ # count how many characters are attributed to the various source set combinations
+ print "<P>Number of occurrences of source character set combinations:\n<TABLE border=2>" if $html;
+ for $i (keys(%used)) {
+ next if ($name{$i} eq "<control>");
+ $sources = $used{$i};
+ $sources =~ s/\]\[/, /g;
+ $sources =~ s/^\[//g;
+ $sources =~ s/\]$//g;
+ $sources =~ s/\{(..)\}//g;
+ $contribs{$sources} += 1;
+ }
+ for $j (keys(%contribs)) {
+ print "<TR><TD>$contribs{$j}<TD>$j\n" if $html;
+ }
+ print "</TABLE>\n" if $html;
+ } elsif (/^compact$/) {
+ # print compact table in P10 MES format
+ print "<P>Compact representation of this character set:\n<TABLE border=2>" if $html;
+ print "<TR><TD><B>Rows</B><TD><B>Positions (Cells)</B>" if $html;
+ print "\n# Plane 00\n# Rows\tPositions (Cells)\n" unless $html;
+ $current_row = '';
+ $start_col = '';
+ $last_col = '';
+ for $i (sort({$a <=> $b} keys(%used))) {
+ next if ($name{$i} eq "<control>");
+ $row = sprintf("%02X", $i >> 8);
+ $col = sprintf("%02X", $i & 0xff);
+ if ($row ne $current_row) {
+ if (($last_col ne '') and ($last_col ne $start_col)) {
+ print "-$last_col";
+ print "</SAMP>" if $html;
+ }
+ print "<TR><TD><SAMP>$row</SAMP><TD><SAMP>" if $html;
+ print "\n $row\t" unless $html;
+ $len = 0;
+ $current_row = $row;
+ $start_col = '';
+ }
+ if ($start_col eq '') {
+ print "$col";
+ $len += 2;
+ $start_col = $col;
+ $last_col = $col;
+ } elsif (hex($col) == hex($last_col) + 1) {
+ $last_col = $col;
+ } else {
+ if ($last_col ne $start_col) {
+ print "-$last_col";
+ $len += 3;
+ }
+ if ($len > 60 && !$html) {
+ print "\n $row\t";
+ $len = 0;
+ };
+ print " " if $len;
+ print "$col";
+ $len += 2 + !! $len;
+ $start_col = $col;
+ $last_col = $col;
+ }
+ }
+ if (($last_col ne '') and ($last_col ne $start_col)) {
+ print "-$last_col";
+ print "</SAMP>" if $html;
+ }
+ print "\n" if ($current_row ne '');
+ print "</TABLE>\n" if $html;
+ print "\n";
+ } elsif (/^c$/) {
+ # print table as C interval array
+ print "{";
+ $last_i = '';
+ $columns = 3;
+ $col = $columns;
+ for $i (sort({$a <=> $b} keys(%used))) {
+ next if ($name{$i} eq "<control>");
+ if ($last_i eq '') {
+ if (++$col > $columns) { $col = 1; print "\n "; }
+ printf(" { 0x%04X, ", $i);
+ $last_i = $i;
+ } elsif ($i == $last_i + 1) {
+ $last_i = $i;
+ } else {
+ printf("0x%04X },", $last_i);
+ if (++$col > $columns) { $col = 1; print "\n "; }
+ printf(" { 0x%04X, ", $i);
+ $last_i = $i;
+ }
+ }
+ if ($last_i ne '') {
+ printf("0x%04X }", $last_i);
+ }
+ print "\n};\n";
+ } elsif (/^utf8-list$/) {
+ $col = 0;
+ $block = 0;
+ $last = -1;
+ for $i (sort({$a <=> $b} keys(%used))) {
+ next if ($name{$i} eq "<control>");
+ while ($blockend[$block] < $i && $block < $blocks - 1) {
+ $block++;
+ }
+ if ($last <= $blockend[$block-1] &&
+ $i < $blockstart[$block]) {
+ print "\n" if ($col);
+ printf "\nFree block (U+%04X-U+%04X):\n\n",
+ $blockend[$block-1] + 1, $blockstart[$block] - 1;
+ $col = 0;
+ }
+ if ($last < $blockstart[$block] && $i >= $blockstart[$block]) {
+ print "\n" if ($col);
+ printf "\n$blockname[$block] (U+%04X-U+%04X):\n\n",
+ $blockstart[$block], $blockend[$block];
+ $col = 0;
+ }
+ if ($category{$i} eq 'Mn') {
+ # prefix non-spacing character with U+25CC DOTTED CIRCLE
+ print "\x{25CC}";
+ } elsif ($category{$i} eq 'Me') {
+ # prefix enclosing non-spacing character with space
+ print " ";
+ }
+ print pack("U", $i);
+ $col += 1 + iswide($i);
+ if ($col >= 64) {
+ print "\n";
+ $col = 0;
+ }
+ $last = $i;
+ }
+ print "\n" if ($col);
+ } elsif (/^collections$/) {
+ $block = 0;
+ $last = -1;
+ for $i (sort({$a <=> $b} keys(%used))) {
+ next if ($name{$i} eq "<control>");
+ while ($blockend[$block] < $i && $block < $blocks - 1) {
+ $block++;
+ }
+ if ($last < $blockstart[$block] && $i >= $blockstart[$block]) {
+ print $blockname[$block],
+ " " x (40 - length($blockname[$block]));
+ printf "%04X-%04X\n",
+ $blockstart[$block], $blockend[$block];
+ }
+ $last = $i;
+ }
+ } elsif (/^nr$/) {
+ print "<P>" if $html;
+ print "# " unless $html;
+ print "Number of characters in above table: ";
+ $count = 0;
+ for $i (keys(%used)) {
+ $count++ unless $name{$i} eq "<control>";
+ }
+ print $count;
+ print "\n";
+ } elsif (/^clean$/) {
+ # remove characters from set that are not in $unicodedata
+ for $i (keys(%used)) {
+ delete $used{$i} unless is_unicode($i);
+ }
+ } elsif (/^unknown$/) {
+ # remove characters from set that are in $unicodedata
+ for $i (keys(%used)) {
+ delete $used{$i} if is_unicode($i);
+ }
+ } else {
+ die("Unknown command line command '$_'");
+ };
+}
--
2.13.2
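For readers following the "c" command of the script above: it walks the sorted code points and starts a new { first, last } pair whenever the next code point is not the direct successor of the previous one. A minimal C sketch of the same idea follows. The names struct interval and collapse_intervals are assumptions for illustration only and are not part of the patch.

```c
#include <stddef.h>

/* Closed range [first, last] of code points (hypothetical helper type). */
struct interval
{
  unsigned long first;
  unsigned long last;
};

/* Collapse a sorted array of distinct code points into closed intervals.
   out must have room for n entries.  Returns the number of intervals
   written.  This mirrors the run detection of the script's "c" command;
   it is a sketch, not the script's actual implementation.  */
static size_t
collapse_intervals (const unsigned long *cp, size_t n, struct interval *out)
{
  size_t count = 0;
  size_t i;

  for (i = 0; i < n; i++)
    {
      if (count > 0 && cp[i] == out[count - 1].last + 1)
        {
          /* Direct successor: extend the current run.  */
          out[count - 1].last = cp[i];
        }
      else
        {
          /* Gap (or first element): open a new run.  */
          out[count].first = cp[i];
          out[count].last = cp[i];
          count++;
        }
    }
  return count;
}
```

Fed with 0xA1, 0xA4, 0xA7, 0xA8, 0xAA, this yields four intervals, with 0xA7 and 0xA8 merged into one pair, which corresponds to the first entries of ambiguous.t in the next patch.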
[-- Attachment #3: 0002-generated-width-data-included-in-repository-because-.patch --]
[-- Type: text/plain, Size: 22167 bytes --]
From 00c7da38274b433f952a87732e58f2e22fc5229e Mon Sep 17 00:00:00 2001
From: mintty <mintty@users.noreply.github.com>
Date: Mon, 14 Aug 2017 22:00:44 +0200
Subject: [PATCH 2/4] generated width data, included in repository because of
long creation time
---
newlib/libc/string/WIDTH-A | 569 +++++++++++++++++++++++++++++++++++++++++
newlib/libc/string/ambiguous.t | 61 +++++
newlib/libc/string/combining.t | 107 ++++++++
newlib/libc/string/wide.t | 33 +++
4 files changed, 770 insertions(+)
create mode 100644 newlib/libc/string/WIDTH-A
create mode 100644 newlib/libc/string/ambiguous.t
create mode 100644 newlib/libc/string/combining.t
create mode 100644 newlib/libc/string/wide.t
diff --git a/newlib/libc/string/WIDTH-A b/newlib/libc/string/WIDTH-A
new file mode 100644
index 0000000..51e8f23
--- /dev/null
+++ b/newlib/libc/string/WIDTH-A
@@ -0,0 +1,569 @@
+# UAX #11: East Asian Ambiguous
+
+# Plane 00
+# Rows Positions (Cells)
+
+ 00 A1 A4 A7-A8 AA AD-AE B0-B4 B6-BA BC-BF C6 D0 D7-D8 DE-E1 E6 E8-EA
+ 00 EC-ED F0 F2-F3 F7-FA FC FE
+ 01 01 11 13 1B 26-27 2B 31-33 38 3F-42 44 48-4B 4D 52-53 66-67 6B
+ 01 CE D0 D2 D4 D6 D8 DA DC
+ 02 51 61 C4 C7 C9-CB CD D0 D8-DB DD DF
+ 03 00-6F 91-A1 A3-A9 B1-C1 C3-C9
+ 04 01 10-4F 51
+ 20 10 13-16 18-19 1C-1D 20-22 24-27 30 32-33 35 3B 3E 74 7F 81-84
+ 20 AC
+ 21 03 05 09 13 16 21-22 26 2B 53-54 5B-5E 60-6B 70-79 89 90-99 B8-B9
+ 21 D2 D4 E7
+ 22 00 02-03 07-08 0B 0F 11 15 1A 1D-20 23 25 27-2C 2E 34-37 3C-3D
+ 22 48 4C 52 60-61 64-67 6A-6B 6E-6F 82-83 86-87 95 99 A5 BF
+ 23 12
+ 24 60-E9 EB-FF
+ 25 00-4B 50-73 80-8F 92-95 A0-A1 A3-A9 B2-B3 B6-B7 BC-BD C0-C1 C6-C8
+ 25 CB CE-D1 E2-E5 EF
+ 26 05-06 09 0E-0F 1C 1E 40 42 60-61 63-65 67-6A 6C-6D 6F 9E-9F BF
+ 26 C6-CD CF-D3 D5-E1 E3 E8-E9 EB-F1 F4 F6-F9 FB-FC FE-FF
+ 27 3D 76-7F
+ 2B 56-59
+ 32 48-4F
+ E0 00-FF
+ E1 00-FF
+ E2 00-FF
+ E3 00-FF
+ E4 00-FF
+ E5 00-FF
+ E6 00-FF
+ E7 00-FF
+ E8 00-FF
+ E9 00-FF
+ EA 00-FF
+ EB 00-FF
+ EC 00-FF
+ ED 00-FF
+ EE 00-FF
+ EF 00-FF
+ F0 00-FF
+ F1 00-FF
+ F2 00-FF
+ F3 00-FF
+ F4 00-FF
+ F5 00-FF
+ F6 00-FF
+ F7 00-FF
+ F8 00-FF
+ FE 00-0F
+ FF FD
+ 1F1 00-0A 10-2D 30-69 70-8D 8F-90 9B-AC
+ E01 00-EF
+ F00 00-FF
+ F01 00-FF
+ F02 00-FF
+ F03 00-FF
+ F04 00-FF
+ F05 00-FF
+ F06 00-FF
+ F07 00-FF
+ F08 00-FF
+ F09 00-FF
+ F0A 00-FF
+ F0B 00-FF
+ F0C 00-FF
+ F0D 00-FF
+ F0E 00-FF
+ F0F 00-FF
+ F10 00-FF
+ F11 00-FF
+ F12 00-FF
+ F13 00-FF
+ F14 00-FF
+ F15 00-FF
+ F16 00-FF
+ F17 00-FF
+ F18 00-FF
+ F19 00-FF
+ F1A 00-FF
+ F1B 00-FF
+ F1C 00-FF
+ F1D 00-FF
+ F1E 00-FF
+ F1F 00-FF
+ F20 00-FF
+ F21 00-FF
+ F22 00-FF
+ F23 00-FF
+ F24 00-FF
+ F25 00-FF
+ F26 00-FF
+ F27 00-FF
+ F28 00-FF
+ F29 00-FF
+ F2A 00-FF
+ F2B 00-FF
+ F2C 00-FF
+ F2D 00-FF
+ F2E 00-FF
+ F2F 00-FF
+ F30 00-FF
+ F31 00-FF
+ F32 00-FF
+ F33 00-FF
+ F34 00-FF
+ F35 00-FF
+ F36 00-FF
+ F37 00-FF
+ F38 00-FF
+ F39 00-FF
+ F3A 00-FF
+ F3B 00-FF
+ F3C 00-FF
+ F3D 00-FF
+ F3E 00-FF
+ F3F 00-FF
+ F40 00-FF
+ F41 00-FF
+ F42 00-FF
+ F43 00-FF
+ F44 00-FF
+ F45 00-FF
+ F46 00-FF
+ F47 00-FF
+ F48 00-FF
+ F49 00-FF
+ F4A 00-FF
+ F4B 00-FF
+ F4C 00-FF
+ F4D 00-FF
+ F4E 00-FF
+ F4F 00-FF
+ F50 00-FF
+ F51 00-FF
+ F52 00-FF
+ F53 00-FF
+ F54 00-FF
+ F55 00-FF
+ F56 00-FF
+ F57 00-FF
+ F58 00-FF
+ F59 00-FF
+ F5A 00-FF
+ F5B 00-FF
+ F5C 00-FF
+ F5D 00-FF
+ F5E 00-FF
+ F5F 00-FF
+ F60 00-FF
+ F61 00-FF
+ F62 00-FF
+ F63 00-FF
+ F64 00-FF
+ F65 00-FF
+ F66 00-FF
+ F67 00-FF
+ F68 00-FF
+ F69 00-FF
+ F6A 00-FF
+ F6B 00-FF
+ F6C 00-FF
+ F6D 00-FF
+ F6E 00-FF
+ F6F 00-FF
+ F70 00-FF
+ F71 00-FF
+ F72 00-FF
+ F73 00-FF
+ F74 00-FF
+ F75 00-FF
+ F76 00-FF
+ F77 00-FF
+ F78 00-FF
+ F79 00-FF
+ F7A 00-FF
+ F7B 00-FF
+ F7C 00-FF
+ F7D 00-FF
+ F7E 00-FF
+ F7F 00-FF
+ F80 00-FF
+ F81 00-FF
+ F82 00-FF
+ F83 00-FF
+ F84 00-FF
+ F85 00-FF
+ F86 00-FF
+ F87 00-FF
+ F88 00-FF
+ F89 00-FF
+ F8A 00-FF
+ F8B 00-FF
+ F8C 00-FF
+ F8D 00-FF
+ F8E 00-FF
+ F8F 00-FF
+ F90 00-FF
+ F91 00-FF
+ F92 00-FF
+ F93 00-FF
+ F94 00-FF
+ F95 00-FF
+ F96 00-FF
+ F97 00-FF
+ F98 00-FF
+ F99 00-FF
+ F9A 00-FF
+ F9B 00-FF
+ F9C 00-FF
+ F9D 00-FF
+ F9E 00-FF
+ F9F 00-FF
+ FA0 00-FF
+ FA1 00-FF
+ FA2 00-FF
+ FA3 00-FF
+ FA4 00-FF
+ FA5 00-FF
+ FA6 00-FF
+ FA7 00-FF
+ FA8 00-FF
+ FA9 00-FF
+ FAA 00-FF
+ FAB 00-FF
+ FAC 00-FF
+ FAD 00-FF
+ FAE 00-FF
+ FAF 00-FF
+ FB0 00-FF
+ FB1 00-FF
+ FB2 00-FF
+ FB3 00-FF
+ FB4 00-FF
+ FB5 00-FF
+ FB6 00-FF
+ FB7 00-FF
+ FB8 00-FF
+ FB9 00-FF
+ FBA 00-FF
+ FBB 00-FF
+ FBC 00-FF
+ FBD 00-FF
+ FBE 00-FF
+ FBF 00-FF
+ FC0 00-FF
+ FC1 00-FF
+ FC2 00-FF
+ FC3 00-FF
+ FC4 00-FF
+ FC5 00-FF
+ FC6 00-FF
+ FC7 00-FF
+ FC8 00-FF
+ FC9 00-FF
+ FCA 00-FF
+ FCB 00-FF
+ FCC 00-FF
+ FCD 00-FF
+ FCE 00-FF
+ FCF 00-FF
+ FD0 00-FF
+ FD1 00-FF
+ FD2 00-FF
+ FD3 00-FF
+ FD4 00-FF
+ FD5 00-FF
+ FD6 00-FF
+ FD7 00-FF
+ FD8 00-FF
+ FD9 00-FF
+ FDA 00-FF
+ FDB 00-FF
+ FDC 00-FF
+ FDD 00-FF
+ FDE 00-FF
+ FDF 00-FF
+ FE0 00-FF
+ FE1 00-FF
+ FE2 00-FF
+ FE3 00-FF
+ FE4 00-FF
+ FE5 00-FF
+ FE6 00-FF
+ FE7 00-FF
+ FE8 00-FF
+ FE9 00-FF
+ FEA 00-FF
+ FEB 00-FF
+ FEC 00-FF
+ FED 00-FF
+ FEE 00-FF
+ FEF 00-FF
+ FF0 00-FF
+ FF1 00-FF
+ FF2 00-FF
+ FF3 00-FF
+ FF4 00-FF
+ FF5 00-FF
+ FF6 00-FF
+ FF7 00-FF
+ FF8 00-FF
+ FF9 00-FF
+ FFA 00-FF
+ FFB 00-FF
+ FFC 00-FF
+ FFD 00-FF
+ FFE 00-FF
+ FFF 00-FD
+ 1000 00-FF
+ 1001 00-FF
+ 1002 00-FF
+ 1003 00-FF
+ 1004 00-FF
+ 1005 00-FF
+ 1006 00-FF
+ 1007 00-FF
+ 1008 00-FF
+ 1009 00-FF
+ 100A 00-FF
+ 100B 00-FF
+ 100C 00-FF
+ 100D 00-FF
+ 100E 00-FF
+ 100F 00-FF
+ 1010 00-FF
+ 1011 00-FF
+ 1012 00-FF
+ 1013 00-FF
+ 1014 00-FF
+ 1015 00-FF
+ 1016 00-FF
+ 1017 00-FF
+ 1018 00-FF
+ 1019 00-FF
+ 101A 00-FF
+ 101B 00-FF
+ 101C 00-FF
+ 101D 00-FF
+ 101E 00-FF
+ 101F 00-FF
+ 1020 00-FF
+ 1021 00-FF
+ 1022 00-FF
+ 1023 00-FF
+ 1024 00-FF
+ 1025 00-FF
+ 1026 00-FF
+ 1027 00-FF
+ 1028 00-FF
+ 1029 00-FF
+ 102A 00-FF
+ 102B 00-FF
+ 102C 00-FF
+ 102D 00-FF
+ 102E 00-FF
+ 102F 00-FF
+ 1030 00-FF
+ 1031 00-FF
+ 1032 00-FF
+ 1033 00-FF
+ 1034 00-FF
+ 1035 00-FF
+ 1036 00-FF
+ 1037 00-FF
+ 1038 00-FF
+ 1039 00-FF
+ 103A 00-FF
+ 103B 00-FF
+ 103C 00-FF
+ 103D 00-FF
+ 103E 00-FF
+ 103F 00-FF
+ 1040 00-FF
+ 1041 00-FF
+ 1042 00-FF
+ 1043 00-FF
+ 1044 00-FF
+ 1045 00-FF
+ 1046 00-FF
+ 1047 00-FF
+ 1048 00-FF
+ 1049 00-FF
+ 104A 00-FF
+ 104B 00-FF
+ 104C 00-FF
+ 104D 00-FF
+ 104E 00-FF
+ 104F 00-FF
+ 1050 00-FF
+ 1051 00-FF
+ 1052 00-FF
+ 1053 00-FF
+ 1054 00-FF
+ 1055 00-FF
+ 1056 00-FF
+ 1057 00-FF
+ 1058 00-FF
+ 1059 00-FF
+ 105A 00-FF
+ 105B 00-FF
+ 105C 00-FF
+ 105D 00-FF
+ 105E 00-FF
+ 105F 00-FF
+ 1060 00-FF
+ 1061 00-FF
+ 1062 00-FF
+ 1063 00-FF
+ 1064 00-FF
+ 1065 00-FF
+ 1066 00-FF
+ 1067 00-FF
+ 1068 00-FF
+ 1069 00-FF
+ 106A 00-FF
+ 106B 00-FF
+ 106C 00-FF
+ 106D 00-FF
+ 106E 00-FF
+ 106F 00-FF
+ 1070 00-FF
+ 1071 00-FF
+ 1072 00-FF
+ 1073 00-FF
+ 1074 00-FF
+ 1075 00-FF
+ 1076 00-FF
+ 1077 00-FF
+ 1078 00-FF
+ 1079 00-FF
+ 107A 00-FF
+ 107B 00-FF
+ 107C 00-FF
+ 107D 00-FF
+ 107E 00-FF
+ 107F 00-FF
+ 1080 00-FF
+ 1081 00-FF
+ 1082 00-FF
+ 1083 00-FF
+ 1084 00-FF
+ 1085 00-FF
+ 1086 00-FF
+ 1087 00-FF
+ 1088 00-FF
+ 1089 00-FF
+ 108A 00-FF
+ 108B 00-FF
+ 108C 00-FF
+ 108D 00-FF
+ 108E 00-FF
+ 108F 00-FF
+ 1090 00-FF
+ 1091 00-FF
+ 1092 00-FF
+ 1093 00-FF
+ 1094 00-FF
+ 1095 00-FF
+ 1096 00-FF
+ 1097 00-FF
+ 1098 00-FF
+ 1099 00-FF
+ 109A 00-FF
+ 109B 00-FF
+ 109C 00-FF
+ 109D 00-FF
+ 109E 00-FF
+ 109F 00-FF
+ 10A0 00-FF
+ 10A1 00-FF
+ 10A2 00-FF
+ 10A3 00-FF
+ 10A4 00-FF
+ 10A5 00-FF
+ 10A6 00-FF
+ 10A7 00-FF
+ 10A8 00-FF
+ 10A9 00-FF
+ 10AA 00-FF
+ 10AB 00-FF
+ 10AC 00-FF
+ 10AD 00-FF
+ 10AE 00-FF
+ 10AF 00-FF
+ 10B0 00-FF
+ 10B1 00-FF
+ 10B2 00-FF
+ 10B3 00-FF
+ 10B4 00-FF
+ 10B5 00-FF
+ 10B6 00-FF
+ 10B7 00-FF
+ 10B8 00-FF
+ 10B9 00-FF
+ 10BA 00-FF
+ 10BB 00-FF
+ 10BC 00-FF
+ 10BD 00-FF
+ 10BE 00-FF
+ 10BF 00-FF
+ 10C0 00-FF
+ 10C1 00-FF
+ 10C2 00-FF
+ 10C3 00-FF
+ 10C4 00-FF
+ 10C5 00-FF
+ 10C6 00-FF
+ 10C7 00-FF
+ 10C8 00-FF
+ 10C9 00-FF
+ 10CA 00-FF
+ 10CB 00-FF
+ 10CC 00-FF
+ 10CD 00-FF
+ 10CE 00-FF
+ 10CF 00-FF
+ 10D0 00-FF
+ 10D1 00-FF
+ 10D2 00-FF
+ 10D3 00-FF
+ 10D4 00-FF
+ 10D5 00-FF
+ 10D6 00-FF
+ 10D7 00-FF
+ 10D8 00-FF
+ 10D9 00-FF
+ 10DA 00-FF
+ 10DB 00-FF
+ 10DC 00-FF
+ 10DD 00-FF
+ 10DE 00-FF
+ 10DF 00-FF
+ 10E0 00-FF
+ 10E1 00-FF
+ 10E2 00-FF
+ 10E3 00-FF
+ 10E4 00-FF
+ 10E5 00-FF
+ 10E6 00-FF
+ 10E7 00-FF
+ 10E8 00-FF
+ 10E9 00-FF
+ 10EA 00-FF
+ 10EB 00-FF
+ 10EC 00-FF
+ 10ED 00-FF
+ 10EE 00-FF
+ 10EF 00-FF
+ 10F0 00-FF
+ 10F1 00-FF
+ 10F2 00-FF
+ 10F3 00-FF
+ 10F4 00-FF
+ 10F5 00-FF
+ 10F6 00-FF
+ 10F7 00-FF
+ 10F8 00-FF
+ 10F9 00-FF
+ 10FA 00-FF
+ 10FB 00-FF
+ 10FC 00-FF
+ 10FD 00-FF
+ 10FE 00-FF
+ 10FF 00-FD
+
diff --git a/newlib/libc/string/ambiguous.t b/newlib/libc/string/ambiguous.t
new file mode 100644
index 0000000..f8b7842
--- /dev/null
+++ b/newlib/libc/string/ambiguous.t
@@ -0,0 +1,61 @@
+{
+ { 0x00A1, 0x00A1 }, { 0x00A4, 0x00A4 }, { 0x00A7, 0x00A8 },
+ { 0x00AA, 0x00AA }, { 0x00AE, 0x00AE }, { 0x00B0, 0x00B4 },
+ { 0x00B6, 0x00BA }, { 0x00BC, 0x00BF }, { 0x00C6, 0x00C6 },
+ { 0x00D0, 0x00D0 }, { 0x00D7, 0x00D8 }, { 0x00DE, 0x00E1 },
+ { 0x00E6, 0x00E6 }, { 0x00E8, 0x00EA }, { 0x00EC, 0x00ED },
+ { 0x00F0, 0x00F0 }, { 0x00F2, 0x00F3 }, { 0x00F7, 0x00FA },
+ { 0x00FC, 0x00FC }, { 0x00FE, 0x00FE }, { 0x0101, 0x0101 },
+ { 0x0111, 0x0111 }, { 0x0113, 0x0113 }, { 0x011B, 0x011B },
+ { 0x0126, 0x0127 }, { 0x012B, 0x012B }, { 0x0131, 0x0133 },
+ { 0x0138, 0x0138 }, { 0x013F, 0x0142 }, { 0x0144, 0x0144 },
+ { 0x0148, 0x014B }, { 0x014D, 0x014D }, { 0x0152, 0x0153 },
+ { 0x0166, 0x0167 }, { 0x016B, 0x016B }, { 0x01CE, 0x01CE },
+ { 0x01D0, 0x01D0 }, { 0x01D2, 0x01D2 }, { 0x01D4, 0x01D4 },
+ { 0x01D6, 0x01D6 }, { 0x01D8, 0x01D8 }, { 0x01DA, 0x01DA },
+ { 0x01DC, 0x01DC }, { 0x0251, 0x0251 }, { 0x0261, 0x0261 },
+ { 0x02C4, 0x02C4 }, { 0x02C7, 0x02C7 }, { 0x02C9, 0x02CB },
+ { 0x02CD, 0x02CD }, { 0x02D0, 0x02D0 }, { 0x02D8, 0x02DB },
+ { 0x02DD, 0x02DD }, { 0x02DF, 0x02DF }, { 0x0391, 0x03A1 },
+ { 0x03A3, 0x03A9 }, { 0x03B1, 0x03C1 }, { 0x03C3, 0x03C9 },
+ { 0x0401, 0x0401 }, { 0x0410, 0x044F }, { 0x0451, 0x0451 },
+ { 0x2010, 0x2010 }, { 0x2013, 0x2016 }, { 0x2018, 0x2019 },
+ { 0x201C, 0x201D }, { 0x2020, 0x2022 }, { 0x2024, 0x2027 },
+ { 0x2030, 0x2030 }, { 0x2032, 0x2033 }, { 0x2035, 0x2035 },
+ { 0x203B, 0x203B }, { 0x203E, 0x203E }, { 0x2074, 0x2074 },
+ { 0x207F, 0x207F }, { 0x2081, 0x2084 }, { 0x20AC, 0x20AC },
+ { 0x2103, 0x2103 }, { 0x2105, 0x2105 }, { 0x2109, 0x2109 },
+ { 0x2113, 0x2113 }, { 0x2116, 0x2116 }, { 0x2121, 0x2122 },
+ { 0x2126, 0x2126 }, { 0x212B, 0x212B }, { 0x2153, 0x2154 },
+ { 0x215B, 0x215E }, { 0x2160, 0x216B }, { 0x2170, 0x2179 },
+ { 0x2189, 0x2189 }, { 0x2190, 0x2199 }, { 0x21B8, 0x21B9 },
+ { 0x21D2, 0x21D2 }, { 0x21D4, 0x21D4 }, { 0x21E7, 0x21E7 },
+ { 0x2200, 0x2200 }, { 0x2202, 0x2203 }, { 0x2207, 0x2208 },
+ { 0x220B, 0x220B }, { 0x220F, 0x220F }, { 0x2211, 0x2211 },
+ { 0x2215, 0x2215 }, { 0x221A, 0x221A }, { 0x221D, 0x2220 },
+ { 0x2223, 0x2223 }, { 0x2225, 0x2225 }, { 0x2227, 0x222C },
+ { 0x222E, 0x222E }, { 0x2234, 0x2237 }, { 0x223C, 0x223D },
+ { 0x2248, 0x2248 }, { 0x224C, 0x224C }, { 0x2252, 0x2252 },
+ { 0x2260, 0x2261 }, { 0x2264, 0x2267 }, { 0x226A, 0x226B },
+ { 0x226E, 0x226F }, { 0x2282, 0x2283 }, { 0x2286, 0x2287 },
+ { 0x2295, 0x2295 }, { 0x2299, 0x2299 }, { 0x22A5, 0x22A5 },
+ { 0x22BF, 0x22BF }, { 0x2312, 0x2312 }, { 0x2460, 0x24E9 },
+ { 0x24EB, 0x254B }, { 0x2550, 0x2573 }, { 0x2580, 0x258F },
+ { 0x2592, 0x2595 }, { 0x25A0, 0x25A1 }, { 0x25A3, 0x25A9 },
+ { 0x25B2, 0x25B3 }, { 0x25B6, 0x25B7 }, { 0x25BC, 0x25BD },
+ { 0x25C0, 0x25C1 }, { 0x25C6, 0x25C8 }, { 0x25CB, 0x25CB },
+ { 0x25CE, 0x25D1 }, { 0x25E2, 0x25E5 }, { 0x25EF, 0x25EF },
+ { 0x2605, 0x2606 }, { 0x2609, 0x2609 }, { 0x260E, 0x260F },
+ { 0x261C, 0x261C }, { 0x261E, 0x261E }, { 0x2640, 0x2640 },
+ { 0x2642, 0x2642 }, { 0x2660, 0x2661 }, { 0x2663, 0x2665 },
+ { 0x2667, 0x266A }, { 0x266C, 0x266D }, { 0x266F, 0x266F },
+ { 0x269E, 0x269F }, { 0x26BF, 0x26BF }, { 0x26C6, 0x26CD },
+ { 0x26CF, 0x26D3 }, { 0x26D5, 0x26E1 }, { 0x26E3, 0x26E3 },
+ { 0x26E8, 0x26E9 }, { 0x26EB, 0x26F1 }, { 0x26F4, 0x26F4 },
+ { 0x26F6, 0x26F9 }, { 0x26FB, 0x26FC }, { 0x26FE, 0x26FF },
+ { 0x273D, 0x273D }, { 0x2776, 0x277F }, { 0x2B56, 0x2B59 },
+ { 0x3248, 0x324F }, { 0xE000, 0xF8FF }, { 0xFFFD, 0xFFFD },
+ { 0x1F100, 0x1F10A }, { 0x1F110, 0x1F12D }, { 0x1F130, 0x1F169 },
+ { 0x1F170, 0x1F18D }, { 0x1F18F, 0x1F190 }, { 0x1F19B, 0x1F1AC },
+ { 0xF0000, 0xFFFFD }, { 0x100000, 0x10FFFD }
+};
diff --git a/newlib/libc/string/combining.t b/newlib/libc/string/combining.t
new file mode 100644
index 0000000..629d8f8
--- /dev/null
+++ b/newlib/libc/string/combining.t
@@ -0,0 +1,107 @@
+{
+ { 0x0300, 0x036F }, { 0x0483, 0x0489 }, { 0x0591, 0x05BD },
+ { 0x05BF, 0x05BF }, { 0x05C1, 0x05C2 }, { 0x05C4, 0x05C5 },
+ { 0x05C7, 0x05C7 }, { 0x0600, 0x0605 }, { 0x0610, 0x061A },
+ { 0x061C, 0x061C }, { 0x064B, 0x065F }, { 0x0670, 0x0670 },
+ { 0x06D6, 0x06DD }, { 0x06DF, 0x06E4 }, { 0x06E7, 0x06E8 },
+ { 0x06EA, 0x06ED }, { 0x070F, 0x070F }, { 0x0711, 0x0711 },
+ { 0x0730, 0x074A }, { 0x07A6, 0x07B0 }, { 0x07EB, 0x07F3 },
+ { 0x0816, 0x0819 }, { 0x081B, 0x0823 }, { 0x0825, 0x0827 },
+ { 0x0829, 0x082D }, { 0x0859, 0x085B }, { 0x08D4, 0x0902 },
+ { 0x093A, 0x093A }, { 0x093C, 0x093C }, { 0x0941, 0x0948 },
+ { 0x094D, 0x094D }, { 0x0951, 0x0957 }, { 0x0962, 0x0963 },
+ { 0x0981, 0x0981 }, { 0x09BC, 0x09BC }, { 0x09C1, 0x09C4 },
+ { 0x09CD, 0x09CD }, { 0x09E2, 0x09E3 }, { 0x0A01, 0x0A02 },
+ { 0x0A3C, 0x0A3C }, { 0x0A41, 0x0A42 }, { 0x0A47, 0x0A48 },
+ { 0x0A4B, 0x0A4D }, { 0x0A51, 0x0A51 }, { 0x0A70, 0x0A71 },
+ { 0x0A75, 0x0A75 }, { 0x0A81, 0x0A82 }, { 0x0ABC, 0x0ABC },
+ { 0x0AC1, 0x0AC5 }, { 0x0AC7, 0x0AC8 }, { 0x0ACD, 0x0ACD },
+ { 0x0AE2, 0x0AE3 }, { 0x0AFA, 0x0AFF }, { 0x0B01, 0x0B01 },
+ { 0x0B3C, 0x0B3C }, { 0x0B3F, 0x0B3F }, { 0x0B41, 0x0B44 },
+ { 0x0B4D, 0x0B4D }, { 0x0B56, 0x0B56 }, { 0x0B62, 0x0B63 },
+ { 0x0B82, 0x0B82 }, { 0x0BC0, 0x0BC0 }, { 0x0BCD, 0x0BCD },
+ { 0x0C00, 0x0C00 }, { 0x0C3E, 0x0C40 }, { 0x0C46, 0x0C48 },
+ { 0x0C4A, 0x0C4D }, { 0x0C55, 0x0C56 }, { 0x0C62, 0x0C63 },
+ { 0x0C81, 0x0C81 }, { 0x0CBC, 0x0CBC }, { 0x0CBF, 0x0CBF },
+ { 0x0CC6, 0x0CC6 }, { 0x0CCC, 0x0CCD }, { 0x0CE2, 0x0CE3 },
+ { 0x0D00, 0x0D01 }, { 0x0D3B, 0x0D3C }, { 0x0D41, 0x0D44 },
+ { 0x0D4D, 0x0D4D }, { 0x0D62, 0x0D63 }, { 0x0DCA, 0x0DCA },
+ { 0x0DD2, 0x0DD4 }, { 0x0DD6, 0x0DD6 }, { 0x0E31, 0x0E31 },
+ { 0x0E34, 0x0E3A }, { 0x0E47, 0x0E4E }, { 0x0EB1, 0x0EB1 },
+ { 0x0EB4, 0x0EB9 }, { 0x0EBB, 0x0EBC }, { 0x0EC8, 0x0ECD },
+ { 0x0F18, 0x0F19 }, { 0x0F35, 0x0F35 }, { 0x0F37, 0x0F37 },
+ { 0x0F39, 0x0F39 }, { 0x0F71, 0x0F7E }, { 0x0F80, 0x0F84 },
+ { 0x0F86, 0x0F87 }, { 0x0F8D, 0x0F97 }, { 0x0F99, 0x0FBC },
+ { 0x0FC6, 0x0FC6 }, { 0x102D, 0x1030 }, { 0x1032, 0x1037 },
+ { 0x1039, 0x103A }, { 0x103D, 0x103E }, { 0x1058, 0x1059 },
+ { 0x105E, 0x1060 }, { 0x1071, 0x1074 }, { 0x1082, 0x1082 },
+ { 0x1085, 0x1086 }, { 0x108D, 0x108D }, { 0x109D, 0x109D },
+ { 0x1160, 0x11FF }, { 0x135D, 0x135F }, { 0x1712, 0x1714 },
+ { 0x1732, 0x1734 }, { 0x1752, 0x1753 }, { 0x1772, 0x1773 },
+ { 0x17B4, 0x17B5 }, { 0x17B7, 0x17BD }, { 0x17C6, 0x17C6 },
+ { 0x17C9, 0x17D3 }, { 0x17DD, 0x17DD }, { 0x180B, 0x180E },
+ { 0x1885, 0x1886 }, { 0x18A9, 0x18A9 }, { 0x1920, 0x1922 },
+ { 0x1927, 0x1928 }, { 0x1932, 0x1932 }, { 0x1939, 0x193B },
+ { 0x1A17, 0x1A18 }, { 0x1A1B, 0x1A1B }, { 0x1A56, 0x1A56 },
+ { 0x1A58, 0x1A5E }, { 0x1A60, 0x1A60 }, { 0x1A62, 0x1A62 },
+ { 0x1A65, 0x1A6C }, { 0x1A73, 0x1A7C }, { 0x1A7F, 0x1A7F },
+ { 0x1AB0, 0x1ABE }, { 0x1B00, 0x1B03 }, { 0x1B34, 0x1B34 },
+ { 0x1B36, 0x1B3A }, { 0x1B3C, 0x1B3C }, { 0x1B42, 0x1B42 },
+ { 0x1B6B, 0x1B73 }, { 0x1B80, 0x1B81 }, { 0x1BA2, 0x1BA5 },
+ { 0x1BA8, 0x1BA9 }, { 0x1BAB, 0x1BAD }, { 0x1BE6, 0x1BE6 },
+ { 0x1BE8, 0x1BE9 }, { 0x1BED, 0x1BED }, { 0x1BEF, 0x1BF1 },
+ { 0x1C2C, 0x1C33 }, { 0x1C36, 0x1C37 }, { 0x1CD0, 0x1CD2 },
+ { 0x1CD4, 0x1CE0 }, { 0x1CE2, 0x1CE8 }, { 0x1CED, 0x1CED },
+ { 0x1CF4, 0x1CF4 }, { 0x1CF8, 0x1CF9 }, { 0x1DC0, 0x1DF9 },
+ { 0x1DFB, 0x1DFF }, { 0x200B, 0x200F }, { 0x202A, 0x202E },
+ { 0x2060, 0x2064 }, { 0x2066, 0x206F }, { 0x20D0, 0x20F0 },
+ { 0x2CEF, 0x2CF1 }, { 0x2D7F, 0x2D7F }, { 0x2DE0, 0x2DFF },
+ { 0x302A, 0x302D }, { 0x3099, 0x309A }, { 0xA66F, 0xA672 },
+ { 0xA674, 0xA67D }, { 0xA69E, 0xA69F }, { 0xA6F0, 0xA6F1 },
+ { 0xA802, 0xA802 }, { 0xA806, 0xA806 }, { 0xA80B, 0xA80B },
+ { 0xA825, 0xA826 }, { 0xA8C4, 0xA8C5 }, { 0xA8E0, 0xA8F1 },
+ { 0xA926, 0xA92D }, { 0xA947, 0xA951 }, { 0xA980, 0xA982 },
+ { 0xA9B3, 0xA9B3 }, { 0xA9B6, 0xA9B9 }, { 0xA9BC, 0xA9BC },
+ { 0xA9E5, 0xA9E5 }, { 0xAA29, 0xAA2E }, { 0xAA31, 0xAA32 },
+ { 0xAA35, 0xAA36 }, { 0xAA43, 0xAA43 }, { 0xAA4C, 0xAA4C },
+ { 0xAA7C, 0xAA7C }, { 0xAAB0, 0xAAB0 }, { 0xAAB2, 0xAAB4 },
+ { 0xAAB7, 0xAAB8 }, { 0xAABE, 0xAABF }, { 0xAAC1, 0xAAC1 },
+ { 0xAAEC, 0xAAED }, { 0xAAF6, 0xAAF6 }, { 0xABE5, 0xABE5 },
+ { 0xABE8, 0xABE8 }, { 0xABED, 0xABED }, { 0xD7B0, 0xD7C6 },
+ { 0xD7CB, 0xD7FB }, { 0xFB1E, 0xFB1E }, { 0xFE00, 0xFE0F },
+ { 0xFE20, 0xFE2F }, { 0xFEFF, 0xFEFF }, { 0xFFF9, 0xFFFB },
+ { 0x101FD, 0x101FD }, { 0x102E0, 0x102E0 }, { 0x10376, 0x1037A },
+ { 0x10A01, 0x10A03 }, { 0x10A05, 0x10A06 }, { 0x10A0C, 0x10A0F },
+ { 0x10A38, 0x10A3A }, { 0x10A3F, 0x10A3F }, { 0x10AE5, 0x10AE6 },
+ { 0x11001, 0x11001 }, { 0x11038, 0x11046 }, { 0x1107F, 0x11081 },
+ { 0x110B3, 0x110B6 }, { 0x110B9, 0x110BA }, { 0x110BD, 0x110BD },
+ { 0x11100, 0x11102 }, { 0x11127, 0x1112B }, { 0x1112D, 0x11134 },
+ { 0x11173, 0x11173 }, { 0x11180, 0x11181 }, { 0x111B6, 0x111BE },
+ { 0x111CA, 0x111CC }, { 0x1122F, 0x11231 }, { 0x11234, 0x11234 },
+ { 0x11236, 0x11237 }, { 0x1123E, 0x1123E }, { 0x112DF, 0x112DF },
+ { 0x112E3, 0x112EA }, { 0x11300, 0x11301 }, { 0x1133C, 0x1133C },
+ { 0x11340, 0x11340 }, { 0x11366, 0x1136C }, { 0x11370, 0x11374 },
+ { 0x11438, 0x1143F }, { 0x11442, 0x11444 }, { 0x11446, 0x11446 },
+ { 0x114B3, 0x114B8 }, { 0x114BA, 0x114BA }, { 0x114BF, 0x114C0 },
+ { 0x114C2, 0x114C3 }, { 0x115B2, 0x115B5 }, { 0x115BC, 0x115BD },
+ { 0x115BF, 0x115C0 }, { 0x115DC, 0x115DD }, { 0x11633, 0x1163A },
+ { 0x1163D, 0x1163D }, { 0x1163F, 0x11640 }, { 0x116AB, 0x116AB },
+ { 0x116AD, 0x116AD }, { 0x116B0, 0x116B5 }, { 0x116B7, 0x116B7 },
+ { 0x1171D, 0x1171F }, { 0x11722, 0x11725 }, { 0x11727, 0x1172B },
+ { 0x11A01, 0x11A06 }, { 0x11A09, 0x11A0A }, { 0x11A33, 0x11A38 },
+ { 0x11A3B, 0x11A3E }, { 0x11A47, 0x11A47 }, { 0x11A51, 0x11A56 },
+ { 0x11A59, 0x11A5B }, { 0x11A8A, 0x11A96 }, { 0x11A98, 0x11A99 },
+ { 0x11C30, 0x11C36 }, { 0x11C38, 0x11C3D }, { 0x11C3F, 0x11C3F },
+ { 0x11C92, 0x11CA7 }, { 0x11CAA, 0x11CB0 }, { 0x11CB2, 0x11CB3 },
+ { 0x11CB5, 0x11CB6 }, { 0x11D31, 0x11D36 }, { 0x11D3A, 0x11D3A },
+ { 0x11D3C, 0x11D3D }, { 0x11D3F, 0x11D45 }, { 0x11D47, 0x11D47 },
+ { 0x16AF0, 0x16AF4 }, { 0x16B30, 0x16B36 }, { 0x16F8F, 0x16F92 },
+ { 0x1BC9D, 0x1BC9E }, { 0x1BCA0, 0x1BCA3 }, { 0x1D167, 0x1D169 },
+ { 0x1D173, 0x1D182 }, { 0x1D185, 0x1D18B }, { 0x1D1AA, 0x1D1AD },
+ { 0x1D242, 0x1D244 }, { 0x1DA00, 0x1DA36 }, { 0x1DA3B, 0x1DA6C },
+ { 0x1DA75, 0x1DA75 }, { 0x1DA84, 0x1DA84 }, { 0x1DA9B, 0x1DA9F },
+ { 0x1DAA1, 0x1DAAF }, { 0x1E000, 0x1E006 }, { 0x1E008, 0x1E018 },
+ { 0x1E01B, 0x1E021 }, { 0x1E023, 0x1E024 }, { 0x1E026, 0x1E02A },
+ { 0x1E8D0, 0x1E8D6 }, { 0x1E944, 0x1E94A }, { 0xE0001, 0xE0001 },
+ { 0xE0020, 0xE007F }, { 0xE0100, 0xE01EF }
+};
diff --git a/newlib/libc/string/wide.t b/newlib/libc/string/wide.t
new file mode 100644
index 0000000..8d0e243
--- /dev/null
+++ b/newlib/libc/string/wide.t
@@ -0,0 +1,33 @@
+//# EastAsianWidth-10.0.0.txt
+//# Blocks-10.0.0.txt
+{
+ { 0x1100, 0x115F }, { 0x231A, 0x231B }, { 0x2329, 0x232A },
+ { 0x23E9, 0x23EC }, { 0x23F0, 0x23F0 }, { 0x23F3, 0x23F3 },
+ { 0x25FD, 0x25FE }, { 0x2614, 0x2615 }, { 0x2648, 0x2653 },
+ { 0x267F, 0x267F }, { 0x2693, 0x2693 }, { 0x26A1, 0x26A1 },
+ { 0x26AA, 0x26AB }, { 0x26BD, 0x26BE }, { 0x26C4, 0x26C5 },
+ { 0x26CE, 0x26CE }, { 0x26D4, 0x26D4 }, { 0x26EA, 0x26EA },
+ { 0x26F2, 0x26F3 }, { 0x26F5, 0x26F5 }, { 0x26FA, 0x26FA },
+ { 0x26FD, 0x26FD }, { 0x2705, 0x2705 }, { 0x270A, 0x270B },
+ { 0x2728, 0x2728 }, { 0x274C, 0x274C }, { 0x274E, 0x274E },
+ { 0x2753, 0x2755 }, { 0x2757, 0x2757 }, { 0x2795, 0x2797 },
+ { 0x27B0, 0x27B0 }, { 0x27BF, 0x27BF }, { 0x2B1B, 0x2B1C },
+ { 0x2B50, 0x2B50 }, { 0x2B55, 0x2B55 }, { 0x2E80, 0x303E },
+ { 0x3040, 0x321E }, { 0x3220, 0x3247 }, { 0x3250, 0x32FE },
+ { 0x3300, 0x4DBF }, { 0x4E00, 0xA4CF }, { 0xA960, 0xA97F },
+ { 0xAC00, 0xD7AF }, { 0xF900, 0xFAFF }, { 0xFE10, 0xFE1F },
+ { 0xFE30, 0xFE6F }, { 0xFF01, 0xFF60 }, { 0xFFE0, 0xFFE6 },
+ { 0x16FE0, 0x18AFF }, { 0x1B000, 0x1B12F }, { 0x1B170, 0x1B2FF },
+ { 0x1F004, 0x1F004 }, { 0x1F0CF, 0x1F0CF }, { 0x1F18E, 0x1F18E },
+ { 0x1F191, 0x1F19A }, { 0x1F200, 0x1F320 }, { 0x1F32D, 0x1F335 },
+ { 0x1F337, 0x1F37C }, { 0x1F37E, 0x1F393 }, { 0x1F3A0, 0x1F3CA },
+ { 0x1F3CF, 0x1F3D3 }, { 0x1F3E0, 0x1F3F0 }, { 0x1F3F4, 0x1F3F4 },
+ { 0x1F3F8, 0x1F43E }, { 0x1F440, 0x1F440 }, { 0x1F442, 0x1F4FC },
+ { 0x1F4FF, 0x1F53D }, { 0x1F54B, 0x1F54E }, { 0x1F550, 0x1F567 },
+ { 0x1F57A, 0x1F57A }, { 0x1F595, 0x1F596 }, { 0x1F5A4, 0x1F5A4 },
+ { 0x1F5FB, 0x1F64F }, { 0x1F680, 0x1F6C5 }, { 0x1F6CC, 0x1F6CC },
+ { 0x1F6D0, 0x1F6D2 }, { 0x1F6EB, 0x1F6EC }, { 0x1F6F4, 0x1F6F8 },
+ { 0x1F910, 0x1F93E }, { 0x1F940, 0x1F94C }, { 0x1F950, 0x1F96B },
+ { 0x1F980, 0x1F997 }, { 0x1F9C0, 0x1F9C0 }, { 0x1F9D0, 0x1F9E6 },
+ { 0x20000, 0x2FFFD }, { 0x30000, 0x3FFFD }
+};
--
2.13.2
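A note on reading WIDTH-A above: each line gives a row (the code point shifted right by 8 bits) followed by positions (the low 8 bits), either single cells or cell ranges. The following C sketch turns one position token of a given row into a code point range. The function name row_token_to_interval and the struct are hypothetical, and the sketch assumes well-formed two-digit hex cells; it does not check that the end of a range is not below its start.

```c
#include <stdio.h>

/* Closed range [first, last] of code points (hypothetical helper type). */
struct interval
{
  unsigned long first;
  unsigned long last;
};

/* Convert one position token of a row, such as "AD" or "A7-A8", into a
   code point interval.  Returns 1 on success, 0 if the token does not
   start with a hex cell.  */
static int
row_token_to_interval (unsigned long row, const char *token,
                       struct interval *out)
{
  unsigned int lo, hi;
  int fields = sscanf (token, "%2x-%2x", &lo, &hi);

  if (fields < 1)
    return 0;
  if (fields == 1)
    hi = lo;                    /* single cell, not a range */

  out->first = (row << 8) | lo;
  out->last = (row << 8) | hi;
  return 1;
}
```

For example, row 1F1 with token "00-0A" maps to the range 0x1F100 to 0x1F10A, which matches the corresponding entry in ambiguous.t.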
[-- Attachment #4: 0003-use-generated-width-data.patch --]
[-- Type: text/plain, Size: 9789 bytes --]
From 5d73691295b0013d78c1ce7c7ab0b0be0549d754 Mon Sep 17 00:00:00 2001
From: mintty <mintty@users.noreply.github.com>
Date: Mon, 14 Aug 2017 22:01:01 +0200
Subject: [PATCH 3/4] use generated width data
---
newlib/libc/string/wcwidth.c | 146 +++++++------------------------------------
1 file changed, 22 insertions(+), 124 deletions(-)
diff --git a/newlib/libc/string/wcwidth.c b/newlib/libc/string/wcwidth.c
index ac5c47f..73c036a 100644
--- a/newlib/libc/string/wcwidth.c
+++ b/newlib/libc/string/wcwidth.c
@@ -7,18 +7,18 @@ INDEX
ANSI_SYNOPSIS
#include <wchar.h>
- int wcwidth(const wchar_t <[wc]>);
+ int wcwidth(const wint_t <[wc]>);
TRAD_SYNOPSIS
#include <wchar.h>
int wcwidth(<[wc]>)
- wchar_t *<[wc]>;
+ wint_t *<[wc]>;
DESCRIPTION
The <<wcwidth>> function shall determine the number of column
positions required for the wide character <[wc]>. The application
shall ensure that the value of <[wc]> is a character representable
- as a wchar_t, and is a wide-character code corresponding to a
+ as a wint_t, and is a wide-character code corresponding to a
valid character in the current locale.
RETURNS
@@ -174,112 +174,18 @@ _DEFUN (__wcwidth, (ucs),
#ifdef _MB_CAPABLE
/* sorted list of non-overlapping intervals of East Asian Ambiguous
* characters, generated by "uniset +WIDTH-A -cat=Me -cat=Mn -cat=Cf c" */
- static const struct interval ambiguous[] = {
- { 0x00A1, 0x00A1 }, { 0x00A4, 0x00A4 }, { 0x00A7, 0x00A8 },
- { 0x00AA, 0x00AA }, { 0x00AE, 0x00AE }, { 0x00B0, 0x00B4 },
- { 0x00B6, 0x00BA }, { 0x00BC, 0x00BF }, { 0x00C6, 0x00C6 },
- { 0x00D0, 0x00D0 }, { 0x00D7, 0x00D8 }, { 0x00DE, 0x00E1 },
- { 0x00E6, 0x00E6 }, { 0x00E8, 0x00EA }, { 0x00EC, 0x00ED },
- { 0x00F0, 0x00F0 }, { 0x00F2, 0x00F3 }, { 0x00F7, 0x00FA },
- { 0x00FC, 0x00FC }, { 0x00FE, 0x00FE }, { 0x0101, 0x0101 },
- { 0x0111, 0x0111 }, { 0x0113, 0x0113 }, { 0x011B, 0x011B },
- { 0x0126, 0x0127 }, { 0x012B, 0x012B }, { 0x0131, 0x0133 },
- { 0x0138, 0x0138 }, { 0x013F, 0x0142 }, { 0x0144, 0x0144 },
- { 0x0148, 0x014B }, { 0x014D, 0x014D }, { 0x0152, 0x0153 },
- { 0x0166, 0x0167 }, { 0x016B, 0x016B }, { 0x01CE, 0x01CE },
- { 0x01D0, 0x01D0 }, { 0x01D2, 0x01D2 }, { 0x01D4, 0x01D4 },
- { 0x01D6, 0x01D6 }, { 0x01D8, 0x01D8 }, { 0x01DA, 0x01DA },
- { 0x01DC, 0x01DC }, { 0x0251, 0x0251 }, { 0x0261, 0x0261 },
- { 0x02C4, 0x02C4 }, { 0x02C7, 0x02C7 }, { 0x02C9, 0x02CB },
- { 0x02CD, 0x02CD }, { 0x02D0, 0x02D0 }, { 0x02D8, 0x02DB },
- { 0x02DD, 0x02DD }, { 0x02DF, 0x02DF }, { 0x0391, 0x03A1 },
- { 0x03A3, 0x03A9 }, { 0x03B1, 0x03C1 }, { 0x03C3, 0x03C9 },
- { 0x0401, 0x0401 }, { 0x0410, 0x044F }, { 0x0451, 0x0451 },
- { 0x2010, 0x2010 }, { 0x2013, 0x2016 }, { 0x2018, 0x2019 },
- { 0x201C, 0x201D }, { 0x2020, 0x2022 }, { 0x2024, 0x2027 },
- { 0x2030, 0x2030 }, { 0x2032, 0x2033 }, { 0x2035, 0x2035 },
- { 0x203B, 0x203B }, { 0x203E, 0x203E }, { 0x2074, 0x2074 },
- { 0x207F, 0x207F }, { 0x2081, 0x2084 }, { 0x20AC, 0x20AC },
- { 0x2103, 0x2103 }, { 0x2105, 0x2105 }, { 0x2109, 0x2109 },
- { 0x2113, 0x2113 }, { 0x2116, 0x2116 }, { 0x2121, 0x2122 },
- { 0x2126, 0x2126 }, { 0x212B, 0x212B }, { 0x2153, 0x2154 },
- { 0x215B, 0x215E }, { 0x2160, 0x216B }, { 0x2170, 0x2179 },
- { 0x2190, 0x2199 }, { 0x21B8, 0x21B9 }, { 0x21D2, 0x21D2 },
- { 0x21D4, 0x21D4 }, { 0x21E7, 0x21E7 }, { 0x2200, 0x2200 },
- { 0x2202, 0x2203 }, { 0x2207, 0x2208 }, { 0x220B, 0x220B },
- { 0x220F, 0x220F }, { 0x2211, 0x2211 }, { 0x2215, 0x2215 },
- { 0x221A, 0x221A }, { 0x221D, 0x2220 }, { 0x2223, 0x2223 },
- { 0x2225, 0x2225 }, { 0x2227, 0x222C }, { 0x222E, 0x222E },
- { 0x2234, 0x2237 }, { 0x223C, 0x223D }, { 0x2248, 0x2248 },
- { 0x224C, 0x224C }, { 0x2252, 0x2252 }, { 0x2260, 0x2261 },
- { 0x2264, 0x2267 }, { 0x226A, 0x226B }, { 0x226E, 0x226F },
- { 0x2282, 0x2283 }, { 0x2286, 0x2287 }, { 0x2295, 0x2295 },
- { 0x2299, 0x2299 }, { 0x22A5, 0x22A5 }, { 0x22BF, 0x22BF },
- { 0x2312, 0x2312 }, { 0x2460, 0x24E9 }, { 0x24EB, 0x254B },
- { 0x2550, 0x2573 }, { 0x2580, 0x258F }, { 0x2592, 0x2595 },
- { 0x25A0, 0x25A1 }, { 0x25A3, 0x25A9 }, { 0x25B2, 0x25B3 },
- { 0x25B6, 0x25B7 }, { 0x25BC, 0x25BD }, { 0x25C0, 0x25C1 },
- { 0x25C6, 0x25C8 }, { 0x25CB, 0x25CB }, { 0x25CE, 0x25D1 },
- { 0x25E2, 0x25E5 }, { 0x25EF, 0x25EF }, { 0x2605, 0x2606 },
- { 0x2609, 0x2609 }, { 0x260E, 0x260F }, { 0x2614, 0x2615 },
- { 0x261C, 0x261C }, { 0x261E, 0x261E }, { 0x2640, 0x2640 },
- { 0x2642, 0x2642 }, { 0x2660, 0x2661 }, { 0x2663, 0x2665 },
- { 0x2667, 0x266A }, { 0x266C, 0x266D }, { 0x266F, 0x266F },
- { 0x273D, 0x273D }, { 0x2776, 0x277F }, { 0xE000, 0xF8FF },
- { 0xFFFD, 0xFFFD }, { 0xF0000, 0xFFFFD }, { 0x100000, 0x10FFFD }
- };
+ static const struct interval ambiguous[] =
+#include "ambiguous.t"
+
/* sorted list of non-overlapping intervals of non-spacing characters */
- /* generated by "uniset +cat=Me +cat=Mn +cat=Cf -00AD +1160-11FF +200B c" */
- static const struct interval combining[] = {
- { 0x0300, 0x036F }, { 0x0483, 0x0486 }, { 0x0488, 0x0489 },
- { 0x0591, 0x05BD }, { 0x05BF, 0x05BF }, { 0x05C1, 0x05C2 },
- { 0x05C4, 0x05C5 }, { 0x05C7, 0x05C7 }, { 0x0600, 0x0603 },
- { 0x0610, 0x0615 }, { 0x064B, 0x065E }, { 0x0670, 0x0670 },
- { 0x06D6, 0x06E4 }, { 0x06E7, 0x06E8 }, { 0x06EA, 0x06ED },
- { 0x070F, 0x070F }, { 0x0711, 0x0711 }, { 0x0730, 0x074A },
- { 0x07A6, 0x07B0 }, { 0x07EB, 0x07F3 }, { 0x0901, 0x0902 },
- { 0x093C, 0x093C }, { 0x0941, 0x0948 }, { 0x094D, 0x094D },
- { 0x0951, 0x0954 }, { 0x0962, 0x0963 }, { 0x0981, 0x0981 },
- { 0x09BC, 0x09BC }, { 0x09C1, 0x09C4 }, { 0x09CD, 0x09CD },
- { 0x09E2, 0x09E3 }, { 0x0A01, 0x0A02 }, { 0x0A3C, 0x0A3C },
- { 0x0A41, 0x0A42 }, { 0x0A47, 0x0A48 }, { 0x0A4B, 0x0A4D },
- { 0x0A70, 0x0A71 }, { 0x0A81, 0x0A82 }, { 0x0ABC, 0x0ABC },
- { 0x0AC1, 0x0AC5 }, { 0x0AC7, 0x0AC8 }, { 0x0ACD, 0x0ACD },
- { 0x0AE2, 0x0AE3 }, { 0x0B01, 0x0B01 }, { 0x0B3C, 0x0B3C },
- { 0x0B3F, 0x0B3F }, { 0x0B41, 0x0B43 }, { 0x0B4D, 0x0B4D },
- { 0x0B56, 0x0B56 }, { 0x0B82, 0x0B82 }, { 0x0BC0, 0x0BC0 },
- { 0x0BCD, 0x0BCD }, { 0x0C3E, 0x0C40 }, { 0x0C46, 0x0C48 },
- { 0x0C4A, 0x0C4D }, { 0x0C55, 0x0C56 }, { 0x0CBC, 0x0CBC },
- { 0x0CBF, 0x0CBF }, { 0x0CC6, 0x0CC6 }, { 0x0CCC, 0x0CCD },
- { 0x0CE2, 0x0CE3 }, { 0x0D41, 0x0D43 }, { 0x0D4D, 0x0D4D },
- { 0x0DCA, 0x0DCA }, { 0x0DD2, 0x0DD4 }, { 0x0DD6, 0x0DD6 },
- { 0x0E31, 0x0E31 }, { 0x0E34, 0x0E3A }, { 0x0E47, 0x0E4E },
- { 0x0EB1, 0x0EB1 }, { 0x0EB4, 0x0EB9 }, { 0x0EBB, 0x0EBC },
- { 0x0EC8, 0x0ECD }, { 0x0F18, 0x0F19 }, { 0x0F35, 0x0F35 },
- { 0x0F37, 0x0F37 }, { 0x0F39, 0x0F39 }, { 0x0F71, 0x0F7E },
- { 0x0F80, 0x0F84 }, { 0x0F86, 0x0F87 }, { 0x0F90, 0x0F97 },
- { 0x0F99, 0x0FBC }, { 0x0FC6, 0x0FC6 }, { 0x102D, 0x1030 },
- { 0x1032, 0x1032 }, { 0x1036, 0x1037 }, { 0x1039, 0x1039 },
- { 0x1058, 0x1059 }, { 0x1160, 0x11FF }, { 0x135F, 0x135F },
- { 0x1712, 0x1714 }, { 0x1732, 0x1734 }, { 0x1752, 0x1753 },
- { 0x1772, 0x1773 }, { 0x17B4, 0x17B5 }, { 0x17B7, 0x17BD },
- { 0x17C6, 0x17C6 }, { 0x17C9, 0x17D3 }, { 0x17DD, 0x17DD },
- { 0x180B, 0x180D }, { 0x18A9, 0x18A9 }, { 0x1920, 0x1922 },
- { 0x1927, 0x1928 }, { 0x1932, 0x1932 }, { 0x1939, 0x193B },
- { 0x1A17, 0x1A18 }, { 0x1B00, 0x1B03 }, { 0x1B34, 0x1B34 },
- { 0x1B36, 0x1B3A }, { 0x1B3C, 0x1B3C }, { 0x1B42, 0x1B42 },
- { 0x1B6B, 0x1B73 }, { 0x1DC0, 0x1DCA }, { 0x1DFE, 0x1DFF },
- { 0x200B, 0x200F }, { 0x202A, 0x202E }, { 0x2060, 0x2063 },
- { 0x206A, 0x206F }, { 0x20D0, 0x20EF }, { 0x302A, 0x302F },
- { 0x3099, 0x309A }, { 0xA806, 0xA806 }, { 0xA80B, 0xA80B },
- { 0xA825, 0xA826 }, { 0xFB1E, 0xFB1E }, { 0xFE00, 0xFE0F },
- { 0xFE20, 0xFE23 }, { 0xFEFF, 0xFEFF }, { 0xFFF9, 0xFFFB },
- { 0x10A01, 0x10A03 }, { 0x10A05, 0x10A06 }, { 0x10A0C, 0x10A0F },
- { 0x10A38, 0x10A3A }, { 0x10A3F, 0x10A3F }, { 0x1D167, 0x1D169 },
- { 0x1D173, 0x1D182 }, { 0x1D185, 0x1D18B }, { 0x1D1AA, 0x1D1AD },
- { 0x1D242, 0x1D244 }, { 0xE0001, 0xE0001 }, { 0xE0020, 0xE007F },
- { 0xE0100, 0xE01EF }
- };
+ static const struct interval combining[] =
+#include "combining.t"
+
+ /* sorted list of non-overlapping intervals of wide characters,
+ ranges extended to Blocks where possible
+ */
+ static const struct interval wide[] =
+#include "wide.t"
/* Test for NUL character */
if (ucs == 0)
@@ -310,20 +216,12 @@ _DEFUN (__wcwidth, (ucs),
/* if we arrive here, ucs is not a combining or C0/C1 control character */
- return 1 +
- (ucs >= 0x1100 &&
- (ucs <= 0x115f || /* Hangul Jamo init. consonants */
- ucs == 0x2329 || ucs == 0x232a ||
- (ucs >= 0x2e80 && ucs <= 0xa4cf &&
- ucs != 0x303f) || /* CJK ... Yi */
- (ucs >= 0xac00 && ucs <= 0xd7a3) || /* Hangul Syllables */
- (ucs >= 0xf900 && ucs <= 0xfaff) || /* CJK Compatibility Ideographs */
- (ucs >= 0xfe10 && ucs <= 0xfe19) || /* Vertical forms */
- (ucs >= 0xfe30 && ucs <= 0xfe6f) || /* CJK Compatibility Forms */
- (ucs >= 0xff00 && ucs <= 0xff60) || /* Fullwidth Forms */
- (ucs >= 0xffe0 && ucs <= 0xffe6) ||
- (ucs >= 0x20000 && ucs <= 0x2fffd) ||
- (ucs >= 0x30000 && ucs <= 0x3fffd)));
+ /* binary search in table of wide character codes */
+ if (bisearch(ucs, wide,
+ sizeof(wide) / sizeof(struct interval) - 1))
+ return 2;
+ else
+ return 1;
#else /* !_MB_CAPABLE */
if (iswprint (ucs))
return 1;
@@ -333,9 +231,9 @@ _DEFUN (__wcwidth, (ucs),
#endif /* _MB_CAPABLE */
}
-int
+int
_DEFUN (wcwidth, (wc),
- _CONST wchar_t wc)
+ _CONST wint_t wc)
{
wint_t wi = wc;
--
2.13.2
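
For illustration, the table lookup that the hunk above switches to can be
sketched as follows. This is a minimal sketch, not the code from the patch:
in_table(), sample_width() and the two tiny tables are hypothetical names and
excerpts (their ranges are copied from the code removed above), standing in
for bisearch() and the generated combining.t / wide.t data. The sketch also
leaves out the control-character check that the real __wcwidth performs.

```c
#include <stddef.h>

/* Inclusive range of code points, as in the interval tables of the patch. */
struct interval {
  unsigned int first;
  unsigned int last;
};

/* Hypothetical excerpts; the real data comes from the generated *.t files.
   Both tables must be sorted and non-overlapping for the search to work. */
static const struct interval combining_sample[] = {
  { 0x0E34, 0x0E3A }, { 0x200B, 0x200F }, { 0xFE00, 0xFE0F }
};
static const struct interval wide_sample[] = {
  { 0x1100, 0x115F }, { 0xAC00, 0xD7A3 }, { 0xF900, 0xFAFF },
  { 0xFE30, 0xFE6F }
};

#define COUNT(t) (sizeof (t) / sizeof ((t)[0]))

/* Binary search over the half-open index range [lo, hi).
   Returns 1 if ucs lies inside one of the intervals, 0 otherwise. */
static int
in_table (unsigned int ucs, const struct interval *table, size_t count)
{
  size_t lo = 0, hi = count;

  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;

      if (ucs > table[mid].last)
        lo = mid + 1;
      else if (ucs < table[mid].first)
        hi = mid;
      else
        return 1;
    }
  return 0;
}

/* Width decision in the same order as the patched function:
   NUL, then combining (width 0), then wide (width 2), else 1. */
static int
sample_width (unsigned int ucs)
{
  if (ucs == 0)
    return 0;
  if (in_table (ucs, combining_sample, COUNT (combining_sample)))
    return 0;
  if (in_table (ucs, wide_sample, COUNT (wide_sample)))
    return 2;
  return 1;
}
```

With these sample tables, sample_width (0xAC00) yields 2, sample_width (0x200B)
yields 0, and sample_width ('A') yields 1. Passing an element count rather than
a last index is a choice of this sketch; the patch itself passes
sizeof(wide) / sizeof(struct interval) - 1 to bisearch().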
^ permalink raw reply [flat|nested] 15+ messages in thread
* Ping: Unicode update of width and other character properties
2017-08-06 5:36 Unicode update of width and other character properties Thomas Wolff
2017-08-07 10:31 ` Corinna Vinschen
@ 2017-12-02 11:25 ` Thomas Wolff
1 sibling, 0 replies; 15+ messages in thread
From: Thomas Wolff @ 2017-12-02 11:25 UTC (permalink / raw)
To: newlib
Hi,
this is a reminder of my patch for wcwidth Unicode consistency, as
requested.
Thomas
-------- Forwarded Message --------
Subject: Unicode update of width and other character properties
Date: Sun, 6 Aug 2017 07:36:10 +0200
From: Thomas Wolff <towo@towo.net>
To: newlib@sourceware.org
Hi,
this is a proposal to update wcwidth and the character properties
functions isw*/towupper/towlower to Unicode 10.0, as discussed in the
mail thread https://cygwin.com/ml/cygwin/2017-07/msg00366.html,
as well as to simplify automatic generation of respective tables for an
easier update step.
Table size is moderate (using ranges for character properties) but there
is still an option to reduce the two big tables in size.
The patch can be retrieved from http://towo.net/cygwin/charprops10.zip .
The Makefile.widthdata does not yet distinguish the two subdirectories
(libc/string, libc/ctype), as it comes from a common development directory.
There is a test program with which the isw*/tow* functions of the
current and the patched implementation can be compared.
I also provide a log of deviations of the new approach from the current
implementation, based on Unicode 5.2 data, to compare and check.
If there are any disputable cases, I would consider that of course.
My main aim was actually to get the wcwidth data updated, for which the
benefit of the change is more obvious.
Thanks
Thomas
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Unicode update of width and other character properties
2017-08-17 11:03 ` Thomas Wolff
@ 2017-12-03 14:07 ` Corinna Vinschen
2017-12-03 17:31 ` Thomas Wolff
0 siblings, 1 reply; 15+ messages in thread
From: Corinna Vinschen @ 2017-12-03 14:07 UTC (permalink / raw)
To: newlib
[-- Attachment #1: Type: text/plain, Size: 1874 bytes --]
Sorry for the late reply, I forgot this patch.
On Aug 17 07:53, Thomas Wolff wrote:
> [...]
> I'm attaching my patches here for assessment.
> I have revised table handling further, using gcc bit struct packing. The two
> big tables have a total size of 14340 bytes now, for Unicode 10.0.
> I have fixed locale handling in the isw* and tow* functions, but I've not
> yet changed JP conversion. Unfortunately, the routines from newlib/iconvdata
> are not as straight-forward to be employed as I thought, because they work on
> multi-byte representations.
> Also the mapping of ctype charsets (JIS, SJIS, EUC-JP) to the subsets
> handled in iconvdata (JIS-201/208/212) is a little bit obscure.
> Likewise obscure is the relation between newlib/iconvdata and
> newlib/libc/iconv.
This is really old stuff. I wonder if anybody is still using it with
Unicode around for a long time...
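
The "gcc bit struct packing" mentioned in the quoted text further up can be
illustrated roughly as below. This is only a sketch under assumptions: the
struct names, field names and field widths are hypothetical and not taken
from the patch; it presumes that the technique meant is bit-fields combined
with GCC's __attribute__((packed)), which Clang also accepts.

```c
#include <stddef.h>

/* Plain layout: three full-width members per range entry. */
struct range_plain {
  unsigned int first;
  unsigned int last;
  int category;
};

/* Packed layout (hypothetical): code points need at most 21 bits, so the
   start of the range and a small category code share one 32-bit unit, and
   the end of the range is stored as a 16-bit offset from the start.
   __attribute__((packed)) is a GCC extension that removes tail padding. */
struct range_packed {
  unsigned int first : 24;
  unsigned int category : 8;
  unsigned short length;      /* last code point = first + length */
} __attribute__((packed));

/* Recover the inclusive end of a packed range. */
static unsigned int
range_last (const struct range_packed *r)
{
  return r->first + r->length;
}
```

A range longer than 65535 code points would have to be split into several
entries in this layout; whether the patch does anything like that is not
stated in the thread.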
> To be on the safe side, I’m leaving the actual jp2uc conversion untouched
> for now, and I’ve just added a dummy back-conversion uc2jp with a #warning.
> If the #warning is ignored or removed, the non-Cygwin build should work as
> before, fixing just locale handling.
>
> I'm attaching the wcwidth part here, all patches are available at
> http://towo.net/cygwin/Unicode_and_locale_tweaks.zip (don't fit in the
> mailbox size limit).
So why don't you use git send-email (ideally with a cover letter, see
`git format-patch --cover-letter') instead of attaching the patches to a
single email? This is the correct way of sending patch series and it
gets you around the size limit.
The below patches are missing a patch, last one is patch 3/4.
Patches 2 and 3 are ok, afaics, but as for patch 1, why did you create
an extra Makefile? This should be merged into string/Makefile.am.
Corinna
--
Corinna Vinschen
Cygwin Maintainer
Red Hat
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Unicode update of width and other character properties
2017-12-03 14:07 ` Corinna Vinschen
@ 2017-12-03 17:31 ` Thomas Wolff
2017-12-03 17:33 ` Jon Turney
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Thomas Wolff @ 2017-12-03 17:31 UTC (permalink / raw)
To: newlib
> Sorry for the late reply, I forgot this patch.
>
> On Aug 17 07:53, Thomas Wolff wrote:
>> [...]
>> I'm attaching my patches here for assessment.
>> I have revised table handling further, using gcc bit struct packing. The two
>> big tables have a total size of 14340 bytes now, for Unicode 10.0.
>> I have fixed locale handling in the isw* and tow* functions, but I've not
>> yet changed JP conversion. Unfortunately, the routines from newlib/iconvdata
>> are not as straight-forward to be employed as I thought, because they work on
>> multi-byte representations.
>> Also the mapping of ctype charsets (JIS, SJIS, EUC-JP) to the subsets
>> handled in iconvdata (JIS-201/208/212) is a little bit obscure.
>> Likewise obscure is the relation between newlib/iconvdata and
>> newlib/libc/iconv.
> This is really old stuff. I wonder if anybody is still using it with
> Unicode around for a long time...
>
>> To be on the safe side, I'm leaving the actual jp2uc conversion untouched
>> for now, and I've just added a dummy back-conversion uc2jp with a #warning.
>> If the #warning is ignored or removed, the non-Cygwin build should work as
>> before, fixing just locale handling.
>>
>> I'm attaching the wcwidth part here, all patches are available at
>> http://towo.net/cygwin/Unicode_and_locale_tweaks.zip (don't fit in the
>> mailbox size limit).
> So why don't you use git send-email (ideally with a cover letter, see
> `git format-patch --cover-letter') instead of attaching the patches to a
> single email? This is the correct way of sending patch series and it
> gets you around the size limit.
Because of:
LC_ALL=C git send-email
git: 'send-email' is not a git command. See 'git --help'.
Are there any working instructions for newlib contributions to be found
anywhere?
> The below patches are missing a patch, last one is patch 3/4.
>
> Patches 2 and 3 are ok, afaics, but as for patch 1, why did you create
> an extra Makefile? This should be merged into string/Makefile.am.
I was awaiting feedback before doing further integration work. The
Makefile includes test stuff and also the table generation. Would the
generation be invoked every time or rather called manually?
Thomas
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Unicode update of width and other character properties
2017-12-03 17:31 ` Thomas Wolff
@ 2017-12-03 17:33 ` Jon Turney
2017-12-04 7:32 ` Brian Inglis
2017-12-04 9:05 ` Corinna Vinschen
2 siblings, 0 replies; 15+ messages in thread
From: Jon Turney @ 2017-12-03 17:33 UTC (permalink / raw)
To: Thomas Wolff, newlib
On 03/12/2017 14:07, Thomas Wolff wrote:
>> On Aug 17 07:53, Thomas Wolff wrote:
>>> I'm attaching the wcwidth part here, all patches are available at
>>> http://towo.net/cygwin/Unicode_and_locale_tweaks.zip (don't fit in the
>>> mailbox size limit).
>> So why don't you use git send-email (ideally with a cover letter, see
>> `git format-patch --cover-letter') instead of attaching the patches to a
>> single email? This is the correct way of sending patch series and it
>> gets you around the size limit.
> Because of:
> LC_ALL=C git send-email
> git: 'send-email' is not a git command. See 'git --help'.
>
> Are there any working instructions for newlib contributions to be found
> anywhere?
Due to extra deps, git-send-email is usually packaged separately.
On Cygwin, the package name is 'git-email' ('Email tools for Git')
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Unicode update of width and other character properties
2017-12-03 17:31 ` Thomas Wolff
2017-12-03 17:33 ` Jon Turney
@ 2017-12-04 7:32 ` Brian Inglis
2017-12-04 9:05 ` Corinna Vinschen
2 siblings, 0 replies; 15+ messages in thread
From: Brian Inglis @ 2017-12-04 7:32 UTC (permalink / raw)
To: newlib
On 2017-12-03 07:07, Thomas Wolff wrote:
>> On Aug 17 07:53, Thomas Wolff wrote:
>> So why don't you use git send-email (ideally with a cover letter, see `git
>> format-patch --cover-letter') instead of attaching the patches to a single
>> email? This is the correct way of sending patch series and it gets you
>> around the size limit.
> Because of:
> LC_ALL=C git send-email
> git: 'send-email' is not a git command. See 'git --help'.
You need to install git-email, and if you run X you might also want git-gui.
Ensure $GIT_EDITOR|$VISUAL|$EDITOR stays in foreground so that you can edit
commit messages, emails, interactive rebases, merges, etc. e.g.:
git config --global core.editor 'gvim -f'
You should not need LC_ALL=C most places these days, except to get sort, join,
uniq to play well together.
> Are there any working instructions for newlib contributions to be found
> anywhere?
Everyone assumes you are comfortable with git and understand its model.
Advice I git:
cd .../repo
git checkout master
git pull
git checkout -b BRANCH
$VISUAL FILE
git add FILE
git commit FILE
...
git format-patch -o PATH/ --stat --cover-letter -#commits
$VISUAL PATH/0000-*cover-letter.patch
git send-email --compose PATH/000?-*.patch
Update branch to the latest upstream master:
git checkout master
git pull
git rebase [-i] master BRANCH
...
--
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Unicode update of width and other character properties
2017-12-03 17:31 ` Thomas Wolff
2017-12-03 17:33 ` Jon Turney
2017-12-04 7:32 ` Brian Inglis
@ 2017-12-04 9:05 ` Corinna Vinschen
2018-02-25 17:14 ` Thomas Wolff
2 siblings, 1 reply; 15+ messages in thread
From: Corinna Vinschen @ 2017-12-04 9:05 UTC (permalink / raw)
To: newlib
[-- Attachment #1: Type: text/plain, Size: 2717 bytes --]
On Dec 3 15:07, Thomas Wolff wrote:
> > On Aug 17 07:53, Thomas Wolff wrote:
> > > [...]
> > > I have fixed locale handling in the isw* and tow* functions, but I've not
> > > yet changed JP conversion. Unfortunately, the routines from newlib/iconvdata
> > > are not as straight-forward to be employed as I thought, because they work on
> > > multi-byte representations.
> > > Also the mapping of ctype charsets (JIS, SJIS, EUC-JP) to the subsets
> > > handled in iconvdata (JIS-201/208/212) is a little bit obscure.
> > > Likewise obscure is the relation between newlib/iconvdata and
> > > newlib/libc/iconv.
> > This is really old stuff. I wonder if anybody is still using it with
> > Unicode around for a long time...
I forgot to mention, I think your approach to keep this is the best
one for now so as not to break anything for small targets.
> > > To be on the safe side, I’m leaving the actual jp2uc conversion untouched
> > > for now, and I’ve just added a dummy back-conversion uc2jp with a #warning.
> > > If the #warning is ignored or removed, the non-Cygwin build should work as
> > > before, fixing just locale handling.
> > >
> > > I'm attaching the wcwidth part here, all patches are available at
> > > http://towo.net/cygwin/Unicode_and_locale_tweaks.zip (don't fit in the
> > > mailbox size limit).
> > So why don't you use git send-email (ideally with a cover letter, see
> > `git format-patch --cover-letter') instead of attaching the patches to a
> > single email? This is the correct way of sending patch series and it
> > gets you around the size limit.
> Because of:
> LC_ALL=C git send-email
> git: 'send-email' is not a git command. See 'git --help'.
>
> Are there any working instructions for newlib contributions to be found
> anywhere?
Jon and Brian answered that.
> > The below patches are missing a patch, last one is patch 3/4.
> >
> > Patches 2 and 3 are ok, afaics, but as for patch 1, why did you create
> > an extra Makefile? This should be merged into string/Makefile.am.
> I was awaiting feedback before doing further integration work. The Makefile
> includes test stuff and also the table generation. Would the generation be
> invoked every time or rather called manually?
Keeping generated files in the repo is frowned upon these days, but
we have been doing this for a pretty long time already and don't want
everybody to have to perform these mostly awkward steps. So, yeah, keeping
the tables in the repo and manually calling the generation targets sounds
right to me. Only maintainers (or interested parties) need to do this once
in a while.
Thanks,
Corinna
--
Corinna Vinschen
Cygwin Maintainer
Red Hat
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Unicode update of width and other character properties
2017-12-04 9:05 ` Corinna Vinschen
@ 2018-02-25 17:14 ` Thomas Wolff
2018-02-26 17:20 ` Corinna Vinschen
0 siblings, 1 reply; 15+ messages in thread
From: Thomas Wolff @ 2018-02-25 17:14 UTC (permalink / raw)
To: newlib
I have finally revamped, manually rebased, and repackaged my Unicode
data patches, which I'll send in a separate mail.
However, as I don't have a command-line sendmail set up (and apparently
it's not as easy as it used to be),
I'll send zip archives which contain git-patch files.
There are two patches:
libc/string: wcwidth using generated width data, with data generated
from Unicode 10.0
libc/ctype: isw* and tow* functions using generated case conversion and
character class data, with Unicode 10.0 data
For both, a generation script and a makefile (Makefile.widthdata /
Makefile.chardata) are included. As these are meant to be run in the source
directory, not the binary target directory, when a future Unicode update is
due, they are not connected to the other Makefiles.
In ctype/, there is one new source (categories.c) which should be
compiled separately but although I tried to include it in Makefile.am,
I could not get the build process to compile it. So the current solution
is to include it from one of the other sources (the one that also
maintains the case conversion table).
Thomas
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Unicode update of width and other character properties
2018-02-25 17:14 ` Thomas Wolff
@ 2018-02-26 17:20 ` Corinna Vinschen
2018-02-26 20:02 ` Thomas Wolff
0 siblings, 1 reply; 15+ messages in thread
From: Corinna Vinschen @ 2018-02-26 17:20 UTC (permalink / raw)
To: Thomas Wolff; +Cc: newlib
[-- Attachment #1.1: Type: text/plain, Size: 2158 bytes --]
On Feb 25 18:14, Thomas Wolff wrote:
> I have finally revamped, manually rebased, and repackaged my Unicode data
> patches, which I'll send in a separate mail.
> However, as I don't have a command-line sendmail set up (and apparently it's
> not as easy as it used to be),
> I'll send zip archives which contain git-patch files.
No, sorry, but no. It's not that tricky to send standard git patch
series, we're all doing this. If your MUA doesn't fit, use another MUA
or *attach* the patches, one per mail.
> There are two patches:
> libc/string: wcwidth using generated width data, with data generated from
> Unicode 10.0
> libc/ctype: isw* and tow* functions using generated case conversion and
> character class data, with Unicode 10.0 data
> For both, generation script and a Makefile.widthdata / Makefile.chardata is
> included. As these are to be used in the source directory,
> not the binary target directory, in case of future Unicode update, they are
> not related to the other Makefiles.
Eh, what? If you read back, I had no problems with your patches 2 and
3, only with patch 1 adding new makefiles. So the only thing I actually
asked for was to integrate the creation of the generated tables into
Makefile.am and now you're telling me this is not what you changed...?
> In ctype/, there is one new source (categories.c) which should be compiled
> separately but although I tried to include it in Makefile.am,
> I could not get the build process to compile it. So the current solution is
> to include it from one of the other sources (the one that also maintains the
> case conversion table).
That's a workaround, not a solution. When you change Makefile.am you
have to regenerate Makefile.in, obviously.
However, since regenerating Makefile.in for newlib is (unfortunately,
for historical reasons) non-obvious, you can just go ahead and manually
add categories.* to Makefile.in where it belongs, kind of like the
attached. A later regeneration run by one of the maintainers will fix
the formatting so that's nothing to worry about.
Corinna
--
Corinna Vinschen
Cygwin Maintainer
Red Hat
[-- Attachment #1.2: x --]
[-- Type: text/plain, Size: 2902 bytes --]
diff --git a/newlib/libc/ctype/Makefile.am b/newlib/libc/ctype/Makefile.am
index 898693571bd1..fa6a70d3a1bf 100644
--- a/newlib/libc/ctype/Makefile.am
+++ b/newlib/libc/ctype/Makefile.am
@@ -24,6 +24,7 @@ if ELIX_LEVEL_1
ELIX_SOURCES =
else
ELIX_SOURCES = \
+ categories.c \
isalnum_l.c \
isalpha_l.c \
isascii.c \
diff --git a/newlib/libc/ctype/Makefile.in b/newlib/libc/ctype/Makefile.in
index 2b2331767a0f..9932a9494b09 100644
--- a/newlib/libc/ctype/Makefile.in
+++ b/newlib/libc/ctype/Makefile.in
@@ -79,7 +79,8 @@ am__objects_1 = lib_a-ctype_.$(OBJEXT) lib_a-isalnum.$(OBJEXT) \
lib_a-ispunct.$(OBJEXT) lib_a-isspace.$(OBJEXT) \
lib_a-isxdigit.$(OBJEXT) lib_a-tolower.$(OBJEXT) \
lib_a-toupper.$(OBJEXT)
-@ELIX_LEVEL_1_FALSE@am__objects_2 = lib_a-isalnum_l.$(OBJEXT) \
+@ELIX_LEVEL_1_FALSE@am__objects_2 = lib_a-categories.$(OBJEXT) \
+@ELIX_LEVEL_1_FALSE@ lib_a-isalnum_l.$(OBJEXT) \
@ELIX_LEVEL_1_FALSE@ lib_a-isalpha_l.$(OBJEXT) \
@ELIX_LEVEL_1_FALSE@ lib_a-isascii.$(OBJEXT) \
@ELIX_LEVEL_1_FALSE@ lib_a-isascii_l.$(OBJEXT) \
@@ -142,7 +143,7 @@ libctype_la_LIBADD =
am__objects_3 = ctype_.lo isalnum.lo isalpha.lo iscntrl.lo isdigit.lo \
islower.lo isupper.lo isprint.lo ispunct.lo isspace.lo \
isxdigit.lo tolower.lo toupper.lo
-@ELIX_LEVEL_1_FALSE@am__objects_4 = isalnum_l.lo isalpha_l.lo \
+@ELIX_LEVEL_1_FALSE@am__objects_4 = categories.lo isalnum_l.lo isalpha_l.lo \
@ELIX_LEVEL_1_FALSE@ isascii.lo isascii_l.lo isblank.lo \
@ELIX_LEVEL_1_FALSE@ isblank_l.lo iscntrl_l.lo isdigit_l.lo \
@ELIX_LEVEL_1_FALSE@ islower_l.lo isupper_l.lo isprint_l.lo \
@@ -351,6 +352,7 @@ GENERAL_SOURCES = \
toupper.c
@ELIX_LEVEL_1_FALSE@ELIX_SOURCES = \
+@ELIX_LEVEL_1_FALSE@ categories.c \
@ELIX_LEVEL_1_FALSE@ isalnum_l.c \
@ELIX_LEVEL_1_FALSE@ isalpha_l.c \
@ELIX_LEVEL_1_FALSE@ isascii.c \
@@ -609,6 +611,12 @@ lib_a-toupper.o: toupper.c
lib_a-toupper.obj: toupper.c
$(CC) $(DEFS) $(DEFAULT_INCLUDES) $(INCLUDES) $(AM_CPPFLAGS) $(CPPFLAGS) $(lib_a_CFLAGS) $(CFLAGS) -c -o lib_a-toupper.obj `if test -f 'toupper.c'; then $(CYGPATH_W) 'toupper.c'; else $(CYGPATH_W) '$(srcdir)/toupper.c'; fi`
+lib_a-categories.o: categories.c
+ $(CC) $(DEFS) $(DEFAULT_INCLUDES) $(INCLUDES) $(AM_CPPFLAGS) $(CPPFLAGS) $(lib_a_CFLAGS) $(CFLAGS) -c -o lib_a-categories.o `test -f 'categories.c' || echo '$(srcdir)/'`categories.c
+
+lib_a-categories.obj: categories.c
+ $(CC) $(DEFS) $(DEFAULT_INCLUDES) $(INCLUDES) $(AM_CPPFLAGS) $(CPPFLAGS) $(lib_a_CFLAGS) $(CFLAGS) -c -o lib_a-categories.obj `if test -f 'categories.c'; then $(CYGPATH_W) 'categories.c'; else $(CYGPATH_W) '$(srcdir)/categories.c'; fi`
+
lib_a-isalnum_l.o: isalnum_l.c
$(CC) $(DEFS) $(DEFAULT_INCLUDES) $(INCLUDES) $(AM_CPPFLAGS) $(CPPFLAGS) $(lib_a_CFLAGS) $(CFLAGS) -c -o lib_a-isalnum_l.o `test -f 'isalnum_l.c' || echo '$(srcdir)/'`isalnum_l.c
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Unicode update of width and other character properties
2018-02-26 17:20 ` Corinna Vinschen
@ 2018-02-26 20:02 ` Thomas Wolff
2018-02-26 20:25 ` Hans-Bernhard Bröker
0 siblings, 1 reply; 15+ messages in thread
From: Thomas Wolff @ 2018-02-26 20:02 UTC (permalink / raw)
To: newlib
On 26.02.2018 at 18:20, Corinna Vinschen wrote:
> On Feb 25 18:14, Thomas Wolff wrote:
>> I have finally revamped, manually rebased, and repackaged my Unicode data
>> patches, which I'll send in a separate mail.
>> ...
> ...
> or *attach* the patches, one per mail.
That will do, thanks.
>> There are two patches:
>> libc/string: wcwidth using generated width data, with data generated from
>> Unicode 10.0
>> libc/ctype: isw* and tow* functions using generated case conversion and
>> character class data, with Unicode 10.0 data
>> For both, generation script and a Makefile.widthdata / Makefile.chardata is
>> included. As these are to be used in the source directory,
>> not the binary target directory, in case of future Unicode update, they are
>> not related to the other Makefiles.
> Eh, what? If you read back, I had no problems with your patches 2 and
> 3, only with patch 1 adding new makefiles. So the only thing I actually
> asked for was to integrate the creation of the generated tables into
> Makefile.am and now you're telling me this is not what you changed...?
First I added an include of the generation makefile to Makefile.am;
then it occurred to me that the generated makefile resides in the target
hierarchy while the generation should probably be invoked in the source
directory, so I removed it again.
I'm not sure about the best or preferred invocation interface for such a
step. Maybe I should just provide the generation scripts (mk*) and leave
it up to you to integrate them into the Makefile.am.
>> In ctype/, there is one new source (categories.c) which should be compiled
>> separately but although I tried to include it in Makefile.am,
>> I could not get the build process to compile it. So the current solution is
>> to include it from one of the other sources (the one that also maintains the
>> case conversion table).
> That's a workaround, not a solution. When you change Makefile.am you
> have to regenerate Makefile.in, obviously.
>
> However, since regenerating Makefile.in for newlib is (unfortunately,
> for historical reasons) non-obvious, you can just go ahead and manually
> add categories.* to Makefile.in where it belongs, kind of like the
> attached.
Thanks for the patch.
I'll resubmit the wcwidth patch soon, maybe you can tell me how you'd
like the data generation to be invoked, or I can submit it just with the
script. I'll submit the ctype patch later; all works fine on my Windows
10 system but there is some obscure trouble on a Windows 7 system which
I'd like to check out first.
And there's also a locale patch, which I presented in a separate mail.
Thomas
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Unicode update of width and other character properties
2018-02-26 20:02 ` Thomas Wolff
@ 2018-02-26 20:25 ` Hans-Bernhard Bröker
0 siblings, 0 replies; 15+ messages in thread
From: Hans-Bernhard Bröker @ 2018-02-26 20:25 UTC (permalink / raw)
To: newlib
On 26.02.2018 at 21:02, Thomas Wolff wrote:
> First I added an include of the generation makefile to Makefile.am;
> then it occurred to me that the generated makefile resides in the target
> hierarchy while the generation should probably be invoked in the source
> directory, so I removed it again.
Which directory the tool is to be invoked in is irrelevant for this.
Anything you put into a Makefile.am will automatically go into
Makefile.in and Makefile, anyway. Well, assuming you actually run the
autotools, that is.
If you need to run things from inside the source tree, you're supposed to
prefix them with something like
cd $(srcdir) &&
cd $(top_srcdir)/some/where &&
just as, e.g., the auto-generated production rules for
Makefile.in and configure do.
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2018-02-26 20:25 UTC | newest]
Thread overview: 15+ messages
2017-08-06 5:36 Unicode update of width and other character properties Thomas Wolff
2017-08-07 10:31 ` Corinna Vinschen
2017-08-07 19:18 ` Thomas Wolff
2017-08-08 8:30 ` Corinna Vinschen
2017-08-17 11:03 ` Thomas Wolff
2017-12-03 14:07 ` Corinna Vinschen
2017-12-03 17:31 ` Thomas Wolff
2017-12-03 17:33 ` Jon Turney
2017-12-04 7:32 ` Brian Inglis
2017-12-04 9:05 ` Corinna Vinschen
2018-02-25 17:14 ` Thomas Wolff
2018-02-26 17:20 ` Corinna Vinschen
2018-02-26 20:02 ` Thomas Wolff
2018-02-26 20:25 ` Hans-Bernhard Bröker
2017-12-02 11:25 ` Ping: " Thomas Wolff