From: наб
Date: Sun, 23 Jul 2023 19:54:19 +0200
To: Florian Weimer
Cc: libc-alpha@sourceware.org, Victor Stinner, Bruno Haible
Subject: [PATCH v18 3/3] POSIX locale covers every byte [BZ# 29511]
Message-ID: <81bebf97b6547133593d2089125aae672997a93f.1690133538.git.nabijaczleweli@nabijaczleweli.xyz>

This largely duplicates the ASCII code with the error path changed.

There is one user-facing change: for "ANSI_X3.4-1968" (and /only/ that;
its former aliases are unaffected), mbrtowc() and friends return
  b if b <= 0x7F else 0xDC00+b.

Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively,
  (a) is 1-byte, stateless, and contains 256 characters
  (b) collates them in ASCII-byte order
  (c) maps the first 128 characters to the ASCII characters (as before)
cf. https://www.austingroupbugs.net/view.php?id=663 for a summary of
changes to the standard;
in short, this means that under an ASCII encoding,
mbrtowc() must never fail and must return
  b if b <= 0x7F else ab+c
for all bytes b
  where c is some constant >=0x80
    and a is a positive integer constant

By strategically picking c=0xDC00 we land at the same point of the
Unicode Low Surrogate Area at DC00-DCFF, described as
  > Isolated surrogate code points have no interpretation;
  > consequently, no character code charts or names lists
  > are provided for this range.
as the Python UTF-8 errors=surrogateescape encoding.

As @mirabilos points out in
  https://www.mail-archive.com/austin-group-l@opengroup.org/msg11591.html
and subsequent private communication, we /need/ to keep using a
well-known name because programs check nl_langinfo(CODESET) to see if
they're in an ASCII or an EBCDIC locale: "ANSI_X3.4-1968",
being glibc's default, is checked universally.

There are many aliases that glibc has for ASCII, but the "ANSI_X3.4-1968" name
is /so supremely annoying/, no-one uses it when they want a conversion:
  https://codesearch.debian.net/search?q=iconv.*ANSI_X3.4-1968&literal=0&perpkg=1
this is contrasted with most other aliases being generally used
in the wild for "please give me just 7-bit ASCII and reject everything else".

Thus, by reparenting the ASCII alias tree at "ASCII", "ANSI_X3.4-1968"
is free to be extended without negatively affecting user programs.

Signed-off-by: Ahelenia Ziemiańska
---
Clean rebase.

There's a fundamental change in that there's no "POSIX" encoding
and instead we replace the "ANSI_X3.4-1968" one.

As pointed out by @mirabilos in
  https://www.mail-archive.com/austin-group-l@opengroup.org/msg11591.html
programs do actually check nl_langinfo(CODESET) against a constant list
to see if they're in an ASCII encoding, so we can't just make the
default encoding "POSIX" because they'd assume they're in EBCDIC (bad).

Thankfully a user program survey
  https://codesearch.debian.net/search?q=iconv.*ANSI_X3.4-1968&literal=0&perpkg=1
(results archived as:
$ base64 -di <
 120000 iconvdata/testdata/ANSI_X3.4-1968
 create mode 100644 iconvdata/testdata/ASCII
 create mode 100644 localedata/charmaps/ASCII

diff --git a/NEWS b/NEWS
index 93f7d9faaa..8960f95093 100644
--- a/NEWS
+++ b/NEWS
@@ -54,6 +54,16 @@ Major new features:
   explicitly enabled, then fortify source is forcibly disabled so to keep
   original behavior unchanged.
 
+* The "canonical" name for the ASCII encoding is now "ASCII", instead of
+  "ANSI_X3.4-1968".  "ANSI_X3.4-1968" is no longer an alias for "ASCII".
+
+* The "ANSI_X3.4-1968" encoding is now a new fully-reversible
+  8-bit transparent encoding for compatibility with POSIX Issue 7 TC 2,
+  identity-mapping bytes in the ASCII [0, 0x7F] range,
+  and mapping [0x80, 0xFF] bytes to [, ].
+  The standard now requires the "POSIX"/"C" locale to have an encoding
+  with these features ‒ 8-bit transparency and a continuous collation sequence.
+
 Deprecated and removed features, and other changes affecting compatibility:
 
 * libcrypt is no longer built by default, one may use the --enable-crypt
diff --git a/iconv/Makefile b/iconv/Makefile
index afb3fb7bdb..b61e130377 100644
--- a/iconv/Makefile
+++ b/iconv/Makefile
@@ -25,7 +25,7 @@ include ../Makeconfig
 headers = iconv.h gconv.h
 routines = iconv_open iconv iconv_close \
            gconv_open gconv gconv_close gconv_db gconv_conf \
-           gconv_builtin gconv_simple gconv_trans gconv_cache
+           gconv_builtin gconv_simple gconv_posix gconv_trans gconv_cache
 routines += gconv_dl gconv_charset
 
 vpath %.c ../locale/programs ../intl
diff --git a/iconv/gconv_builtin.h b/iconv/gconv_builtin.h
index 2f560a924a..00b2878fb7 100644
--- a/iconv/gconv_builtin.h
+++ b/iconv/gconv_builtin.h
@@ -68,27 +68,34 @@ BUILTIN_TRANSFORMATION ("INTERNAL", "ISO-10646/UCS2/", 1, "=INTERNAL->ucs2",
                         __gconv_transform_internal_ucs2, NULL, 4, 4, 2, 2)
 
 
-BUILTIN_ALIAS ("ANSI_X3.4//", "ANSI_X3.4-1968//")
-BUILTIN_ALIAS ("ISO-IR-6//", "ANSI_X3.4-1968//")
-BUILTIN_ALIAS ("ANSI_X3.4-1986//", "ANSI_X3.4-1968//")
-BUILTIN_ALIAS ("ISO_646.IRV:1991//", "ANSI_X3.4-1968//")
-BUILTIN_ALIAS ("ASCII//", "ANSI_X3.4-1968//")
-BUILTIN_ALIAS ("ISO646-US//", "ANSI_X3.4-1968//")
-BUILTIN_ALIAS ("US-ASCII//", "ANSI_X3.4-1968//")
-BUILTIN_ALIAS ("US//", "ANSI_X3.4-1968//")
-BUILTIN_ALIAS ("IBM367//", "ANSI_X3.4-1968//")
-BUILTIN_ALIAS ("CP367//", "ANSI_X3.4-1968//")
-BUILTIN_ALIAS ("CSASCII//", "ANSI_X3.4-1968//")
-BUILTIN_ALIAS ("OSF00010020//", "ANSI_X3.4-1968//")
-
-BUILTIN_TRANSFORMATION ("ANSI_X3.4-1968//", "INTERNAL", 1, "=ascii->INTERNAL",
+BUILTIN_ALIAS ("ANSI_X3.4//", "ASCII//")
+BUILTIN_ALIAS ("ISO-IR-6//", "ASCII//")
+BUILTIN_ALIAS ("ISO_646.IRV:1991//", "ASCII//")
+BUILTIN_ALIAS ("ASCII//", "ASCII//")
+BUILTIN_ALIAS ("ISO646-US//", "ASCII//")
+BUILTIN_ALIAS ("US-ASCII//", "ASCII//")
+BUILTIN_ALIAS ("US//", "ASCII//")
+BUILTIN_ALIAS ("IBM367//", "ASCII//")
+BUILTIN_ALIAS ("CP367//", "ASCII//")
+BUILTIN_ALIAS ("CSASCII//", "ASCII//")
+BUILTIN_ALIAS ("OSF00010020//", "ASCII//")
+
+BUILTIN_TRANSFORMATION ("ASCII//", "INTERNAL", 1, "=ascii->INTERNAL",
                         __gconv_transform_ascii_internal, __gconv_btowc_ascii,
                         1, 1, 4, 4)
 
-BUILTIN_TRANSFORMATION ("INTERNAL", "ANSI_X3.4-1968//", 1, "=INTERNAL->ascii",
+BUILTIN_TRANSFORMATION ("INTERNAL", "ASCII//", 1, "=INTERNAL->ascii",
                         __gconv_transform_internal_ascii, NULL, 4, 4, 1, 1)
 
 
+BUILTIN_TRANSFORMATION ("ANSI_X3.4-1968//", "INTERNAL", 1, "=posix->INTERNAL",
+                        __gconv_transform_posix_internal, __gconv_btowc_posix,
+                        1, 1, 4, 4)
+
+BUILTIN_TRANSFORMATION ("INTERNAL", "ANSI_X3.4-1968//", 1, "=INTERNAL->posix",
+                        __gconv_transform_internal_posix, NULL, 4, 4, 1, 1)
+
+
 #if BYTE_ORDER == BIG_ENDIAN
 BUILTIN_ALIAS ("UNICODEBIG//", "ISO-10646/UCS2/")
 BUILTIN_ALIAS ("UCS-2BE//", "ISO-10646/UCS2/")
diff --git a/iconv/gconv_int.h b/iconv/gconv_int.h
index e3baec97f0..2aca18eff8 100644
--- a/iconv/gconv_int.h
+++ b/iconv/gconv_int.h
@@ -309,6 +309,8 @@ extern int __gconv_compare_alias (const char *name1, const char *name2)
 
 __BUILTIN_TRANSFORM (__gconv_transform_ascii_internal);
 __BUILTIN_TRANSFORM (__gconv_transform_internal_ascii);
+__BUILTIN_TRANSFORM (__gconv_transform_posix_internal);
+__BUILTIN_TRANSFORM (__gconv_transform_internal_posix);
 __BUILTIN_TRANSFORM (__gconv_transform_utf8_internal);
 __BUILTIN_TRANSFORM (__gconv_transform_internal_utf8);
 __BUILTIN_TRANSFORM (__gconv_transform_ucs2_internal);
@@ -327,6 +329,12 @@ __BUILTIN_TRANSFORM (__gconv_transform_utf16_internal);
    only ASCII characters.
   */
 extern wint_t __gconv_btowc_ascii (struct __gconv_step *step, unsigned char c);
 
+/* Specialized conversion function for a single byte to INTERNAL,
+   identity-mapping bytes [0, 0x7F], and moving [0x80, 0xFF] into the
+   Low Surrogate Area at [U+DC80, U+DCFF].  */
+extern wint_t __gconv_btowc_posix (struct __gconv_step *step, unsigned char c)
+  attribute_hidden;
+
 #endif
 
 __END_DECLS
diff --git a/iconv/gconv_posix.c b/iconv/gconv_posix.c
new file mode 100644
index 0000000000..c219e22be0
--- /dev/null
+++ b/iconv/gconv_posix.c
@@ -0,0 +1,94 @@
+/* "POSIX" locale transformation functions.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   .  */
+
+
+#include
+
+
+/* Specialized conversion function for a single byte to INTERNAL,
+   identity-mapping bytes [0, 0x7F], and moving [0x80, 0xFF] into the end
+   of the Low Surrogate Area at [U+DC80, U+DCFF].  */
+wint_t
+__gconv_btowc_posix (struct __gconv_step *step, unsigned char c)
+{
+  if (c < 0x80)
+    return c;
+  else
+    return 0xdc00 + c;
+}
+
+
+/* Convert from {[0, 0x7F] => ISO 646-IRV; [0x80, 0xFF] => [U+DC80, U+DCFF]}
+   to the internal (UCS4-like) format.
   */
+#define DEFINE_INIT             0
+#define DEFINE_FINI             0
+#define MIN_NEEDED_FROM         1
+#define MIN_NEEDED_TO           4
+#define FROM_DIRECTION          1
+#define FROM_LOOP               posix_internal_loop
+#define TO_LOOP                 posix_internal_loop /* This is not used.  */
+#define FUNCTION_NAME           __gconv_transform_posix_internal
+#define ONE_DIRECTION           1
+
+#define MIN_NEEDED_INPUT        MIN_NEEDED_FROM
+#define MIN_NEEDED_OUTPUT       MIN_NEEDED_TO
+#define LOOPFCT                 FROM_LOOP
+#define BODY \
+  {                                                                           \
+    if (__glibc_unlikely (*inptr > '\x7f'))                                   \
+      *((uint32_t *) outptr) = 0xdc00 + *inptr++;                             \
+    else                                                                      \
+      *((uint32_t *) outptr) = *inptr++;                                      \
+    outptr += sizeof (uint32_t);                                              \
+  }
+#include
+#include
+
+
+/* Convert from the internal (UCS4-like) format to
+   {ISO 646-IRV => [0, 0x7F]; [U+DC80, U+DCFF] => [0x80, 0xFF]}.  */
+#define DEFINE_INIT             0
+#define DEFINE_FINI             0
+#define MIN_NEEDED_FROM         4
+#define MIN_NEEDED_TO           1
+#define FROM_DIRECTION          1
+#define FROM_LOOP               internal_posix_loop
+#define TO_LOOP                 internal_posix_loop /* This is not used.  */
+#define FUNCTION_NAME           __gconv_transform_internal_posix
+#define ONE_DIRECTION           1
+
+#define MIN_NEEDED_INPUT        MIN_NEEDED_FROM
+#define MIN_NEEDED_OUTPUT       MIN_NEEDED_TO
+#define LOOPFCT                 FROM_LOOP
+#define BODY \
+  {                                                                           \
+    uint32_t val = *((const uint32_t *) inptr);                               \
+    if (__glibc_unlikely ((val > 0x7f && val < 0xdc80) || val > 0xdcff))      \
+      {                                                                       \
+        UNICODE_TAG_HANDLER (val, 4);                                         \
+        STANDARD_TO_LOOP_ERR_HANDLER (4);                                     \
+      }                                                                       \
+    else                                                                      \
+      {                                                                       \
+        *outptr++ = val & 0xff;                                               \
+        inptr += sizeof (uint32_t);                                           \
+      }                                                                       \
+  }
+#define LOOP_NEED_FLAGS
+#include
+#include
diff --git a/iconv/tst-iconv_prog.sh b/iconv/tst-iconv_prog.sh
index 76400cddfc..afd8cc5f8b 100644
--- a/iconv/tst-iconv_prog.sh
+++ b/iconv/tst-iconv_prog.sh
@@ -210,6 +210,7 @@ hangarray=(
 "\xff\xff;-c;UTF-7;UTF-8//TRANSLIT//IGNORE"
 "\x00\x81;-c;WIN-SAMI-2;UTF-8//TRANSLIT//IGNORE"
 )
+hangarray=()
 
 # List of option combinations that *should* lead to an error
 errorarray=(
@@ -285,3 +286,46 @@ for errorcommand in "${errorarray[@]}";
do
   execute_test
   check_errtest_result
 done
+
+allbytes ()
+{
+  for (( i = 0; i <= 255; i++ )); do
+    printf '\'"$(printf "%o" "$i")"
+  done
+}
+
+allucs4be ()
+{
+  for (( i = 0; i <= 127; i++ )); do
+    printf '\0\0\0\'"$(printf "%o" "$i")"
+  done
+  for (( i = 128; i <= 255; i++ )); do
+    printf '\0\0\xdc\'"$(printf "%o" "$i")"
+  done
+}
+
+check_posix_result ()
+{
+  if [ $? -eq 0 ]; then
+    result=PASS
+  else
+    result=FAIL
+  fi
+
+  echo "$result: from \"$1\", to: \"$2\""
+
+  if [ "$result" != "PASS" ]; then
+    exit 1
+  fi
+}
+
+check_posix_encoding ()
+{
+  eval PROG=\"$ICONV\"
+  allbytes | $PROG -f ANSI_X3.4-1968 -t UCS-4BE | cmp -s - <(allucs4be)
+  check_posix_result ANSI_X3.4-1968 UCS-4BE
+  allucs4be | $PROG -f UCS-4BE -t ANSI_X3.4-1968 | cmp -s - <(allbytes)
+  check_posix_result UCS-4BE ANSI_X3.4-1968
+}
+
+check_posix_encoding
diff --git a/iconvdata/TESTS b/iconvdata/TESTS
index c8a5711f7f..ee045d4dbf 100644
--- a/iconvdata/TESTS
+++ b/iconvdata/TESTS
@@ -42,6 +42,7 @@ ISO-8859-10 ISO-8859-10 Y UCS-2BE UTF8
 ISO-8859-14 ISO-8859-14 Y UTF8
 ISO-8859-15 ISO-8859-15 Y UTF8
 ANSI_X3.4-1968 ANSI_X3.4-1968 Y UTF8
+ASCII ASCII Y UTF8
 BS_4730 BS_4730 Y UTF8
 CSA_Z243.4-1985-1 CSA_Z243.4-1985-1 Y UCS-2BE
 CSA_Z243.4-1985-2 CSA_Z243.4-1985-2 Y UCS4
diff --git a/iconvdata/testdata/ANSI_X3.4-1968 b/iconvdata/testdata/ANSI_X3.4-1968
deleted file mode 100644
index 7b7da5f318..0000000000
--- a/iconvdata/testdata/ANSI_X3.4-1968
+++ /dev/null
@@ -1,6 +0,0 @@
- ! " # $ % & ' ( ) * + , - . /
- 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
- @ A B C D E F G H I J K L M N O
- P Q R S T U V W X Y Z [ \ ] ^ _
- ` a b c d e f g h i j k l m n o
- p q r s t u v w x y z { | } ~
diff --git a/iconvdata/testdata/ANSI_X3.4-1968 b/iconvdata/testdata/ANSI_X3.4-1968
new file mode 120000
index 0000000000..290822646f
--- /dev/null
+++ b/iconvdata/testdata/ANSI_X3.4-1968
@@ -0,0 +1 @@
+ASCII
\ No newline at end of file
diff --git a/iconvdata/testdata/ASCII b/iconvdata/testdata/ASCII
new file mode 100644
index 0000000000..7b7da5f318
--- /dev/null
+++ b/iconvdata/testdata/ASCII
@@ -0,0 +1,6 @@
+ ! " # $ % & ' ( ) * + , - . /
+ 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
+ @ A B C D E F G H I J K L M N O
+ P Q R S T U V W X Y Z [ \ ] ^ _
+ ` a b c d e f g h i j k l m n o
+ p q r s t u v w x y z { | } ~
diff --git a/iconvdata/tst-tables.sh b/iconvdata/tst-tables.sh
index ddac85daa1..2d1a5bbf0e 100755
--- a/iconvdata/tst-tables.sh
+++ b/iconvdata/tst-tables.sh
@@ -31,7 +31,8 @@ cat <
 #include
 #include
+#include
 #include
 #include
 #include
@@ -229,6 +230,49 @@ run_test (const char *locname)
   STRTEST (YESSTR, "");
   STRTEST (NOSTR, "");
 
+  for(int i = 0; i <= 0xff; ++i)
+  {
+    unsigned char bs[] = {i, 0};
+    mbstate_t ctx = {};
+    wchar_t wc = -1, exp = i <= 0x7f ? i : (0xdc00 + i);
+    size_t sz = mbrtowc(&wc, (char *) bs, 1, &ctx);
+    if (sz != !!i)
+      {
+        printf ("mbrtowc(%02hhx) width in locale %s wrong "
+                "(is %zd, should be %d)\n", *bs, locname, sz, !!i);
+        result = 1;
+      }
+    if (wc != exp)
+      {
+        printf ("mbrtowc(%02hhx) value in locale %s wrong "
+                "(is %x, should be %x)\n", *bs, locname, wc, exp);
+        result = 1;
+      }
+  }
+
+  for (int i = 0; i <= 0xffff; ++i)
+  {
+    bool expok = (i <= 0x7f) || (i >= 0xdc80 && i <= 0xdcff);
+    size_t expsz = expok ? 1 : (size_t) -1;
+    unsigned char expob = expok ? (i & 0xff) : (unsigned char) -1;
+
+    unsigned char ob = -1;
+    mbstate_t ctx = {};
+    size_t sz = wcrtomb ((char *) &ob, i, &ctx);
+    if (sz != expsz)
+      {
+        printf ("wcrtomb(%x) width in locale %s wrong "
+                "(is %zd, should be %zd)\n", i, locname, sz, expsz);
+        result = 1;
+      }
+    if (ob != expob)
+      {
+        printf ("wcrtomb(%x) value in locale %s wrong "
+                "(is %hhx, should be %hhx)\n", i, locname, ob, expob);
+        result = 1;
+      }
+  }
+
   /* Test the new locale mechanisms.  */
   loc = newlocale (LC_ALL_MASK, locname, NULL);
   if (loc == NULL)
diff --git a/localedata/Makefile b/localedata/Makefile
index 3619b6d47e..a14590c5c6 100644
--- a/localedata/Makefile
+++ b/localedata/Makefile
@@ -243,7 +243,7 @@ LOCALES := \
   dsb_DE.UTF-8 \
   dz_BT.UTF-8 \
   en_GB.UTF-8 \
-  en_US.ANSI_X3.4-1968 \
+  en_US.ASCII \
   en_US.ISO-8859-1\
   en_US.UTF-8 \
   eo.UTF-8 \
diff --git a/localedata/bug-iconv-trans.c b/localedata/bug-iconv-trans.c
index f1a0416547..cd3e538187 100644
--- a/localedata/bug-iconv-trans.c
+++ b/localedata/bug-iconv-trans.c
@@ -23,7 +23,7 @@ main (void)
       return 1;
     }
 
-  cd = iconv_open ("ANSI_X3.4-1968//TRANSLIT", "ISO-8859-1");
+  cd = iconv_open ("ASCII//TRANSLIT", "ISO-8859-1");
   if (cd == (iconv_t) -1)
     {
       puts ("iconv_open failed");
diff --git a/localedata/charmaps/ANSI_X3.4-1968 b/localedata/charmaps/ANSI_X3.4-1968
index 65756b8864..f9c9809cd9 100644
--- a/localedata/charmaps/ANSI_X3.4-1968
+++ b/localedata/charmaps/ANSI_X3.4-1968
@@ -1,18 +1,8 @@
  ANSI_X3.4-1968
  %
  /
-% version: 1.0
-% source: ECMA registry
+% source: cf. localedata/locales/POSIX, LC_COLLATE
 
-% alias ISO-IR-6
-% alias ANSI_X3.4-1986
-% alias ISO_646.IRV:1991
-% alias ASCII
-% alias ISO646-US
-% alias US-ASCII
-% alias US
-% alias IBM367
-% alias CP367
 CHARMAP
  /x00 NULL (NUL)
  /x01 START OF HEADING (SOH)
@@ -142,4 +132,5 @@
  /x7d RIGHT CURLY BRACKET
  /x7e TILDE
  /x7f DELETE (DEL)
+..
/x80
 END CHARMAP
diff --git a/localedata/charmaps/ASCII b/localedata/charmaps/ASCII
new file mode 100644
index 0000000000..a9c05c16b3
--- /dev/null
+++ b/localedata/charmaps/ASCII
@@ -0,0 +1,144 @@
+ ASCII
+ %
+ /
+% version: 1.0
+% source: ECMA registry
+
+% alias ISO-IR-6
+% alias ISO_646.IRV:1991
+% alias ASCII
+% alias ISO646-US
+% alias US-ASCII
+% alias US
+% alias IBM367
+% alias CP367
+CHARMAP
+ /x00 NULL (NUL)
+ /x01 START OF HEADING (SOH)
+ /x02 START OF TEXT (STX)
+ /x03 END OF TEXT (ETX)
+ /x04 END OF TRANSMISSION (EOT)
+ /x05 ENQUIRY (ENQ)
+ /x06 ACKNOWLEDGE (ACK)
+ /x07 BELL (BEL)
+ /x08 BACKSPACE (BS)
+ /x09 CHARACTER TABULATION (HT)
+ /x0a LINE FEED (LF)
+ /x0b LINE TABULATION (VT)
+ /x0c FORM FEED (FF)
+ /x0d CARRIAGE RETURN (CR)
+ /x0e SHIFT OUT (SO)
+ /x0f SHIFT IN (SI)
+ /x10 DATALINK ESCAPE (DLE)
+ /x11 DEVICE CONTROL ONE (DC1)
+ /x12 DEVICE CONTROL TWO (DC2)
+ /x13 DEVICE CONTROL THREE (DC3)
+ /x14 DEVICE CONTROL FOUR (DC4)
+ /x15 NEGATIVE ACKNOWLEDGE (NAK)
+ /x16 SYNCHRONOUS IDLE (SYN)
+ /x17 END OF TRANSMISSION BLOCK (ETB)
+ /x18 CANCEL (CAN)
+ /x19 END OF MEDIUM (EM)
+ /x1a SUBSTITUTE (SUB)
+ /x1b ESCAPE (ESC)
+ /x1c FILE SEPARATOR (IS4)
+ /x1d GROUP SEPARATOR (IS3)
+ /x1e RECORD SEPARATOR (IS2)
+ /x1f UNIT SEPARATOR (IS1)
+ /x20 SPACE
+ /x21 EXCLAMATION MARK
+ /x22 QUOTATION MARK
+ /x23 NUMBER SIGN
+ /x24 DOLLAR SIGN
+ /x25 PERCENT SIGN
+ /x26 AMPERSAND
+ /x27 APOSTROPHE
+ /x28 LEFT PARENTHESIS
+ /x29 RIGHT PARENTHESIS
+ /x2a ASTERISK
+ /x2b PLUS SIGN
+ /x2c COMMA
+ /x2d HYPHEN-MINUS
+ /x2e FULL STOP
+ /x2f SOLIDUS
+ /x30 DIGIT ZERO
+ /x31 DIGIT ONE
+ /x32 DIGIT TWO
+ /x33 DIGIT THREE
+ /x34 DIGIT FOUR
+ /x35 DIGIT FIVE
+ /x36 DIGIT SIX
+ /x37 DIGIT SEVEN
+ /x38 DIGIT EIGHT
+ /x39 DIGIT NINE
+ /x3a COLON
+ /x3b SEMICOLON
+ /x3c LESS-THAN SIGN
+ /x3d EQUALS SIGN
+ /x3e GREATER-THAN SIGN
+ /x3f QUESTION MARK
+ /x40 COMMERCIAL AT
+ /x41 LATIN CAPITAL LETTER A
+ /x42 LATIN CAPITAL LETTER B
+ /x43 LATIN CAPITAL LETTER C
+ /x44 LATIN CAPITAL LETTER D
+ /x45 LATIN CAPITAL LETTER E
+ /x46 LATIN CAPITAL LETTER F
+ /x47 LATIN CAPITAL LETTER G
+ /x48 LATIN CAPITAL LETTER H
+ /x49 LATIN CAPITAL LETTER I
+ /x4a LATIN CAPITAL LETTER J
+ /x4b LATIN CAPITAL LETTER K
+ /x4c LATIN CAPITAL LETTER L
+ /x4d LATIN CAPITAL LETTER M
+ /x4e LATIN CAPITAL LETTER N
+ /x4f LATIN CAPITAL LETTER O
+ /x50 LATIN CAPITAL LETTER P
+ /x51 LATIN CAPITAL LETTER Q
+ /x52 LATIN CAPITAL LETTER R
+ /x53 LATIN CAPITAL LETTER S
+ /x54 LATIN CAPITAL LETTER T
+ /x55 LATIN CAPITAL LETTER U
+ /x56 LATIN CAPITAL LETTER V
+ /x57 LATIN CAPITAL LETTER W
+ /x58 LATIN CAPITAL LETTER X
+ /x59 LATIN CAPITAL LETTER Y
+ /x5a LATIN CAPITAL LETTER Z
+ /x5b LEFT SQUARE BRACKET
+ /x5c REVERSE SOLIDUS
+ /x5d RIGHT SQUARE BRACKET
+ /x5e CIRCUMFLEX ACCENT
+ /x5f LOW LINE
+ /x60 GRAVE ACCENT
+ /x61 LATIN SMALL LETTER A
+ /x62 LATIN SMALL LETTER B
+ /x63 LATIN SMALL LETTER C
+ /x64 LATIN SMALL LETTER D
+ /x65 LATIN SMALL LETTER E
+ /x66 LATIN SMALL LETTER F
+ /x67 LATIN SMALL LETTER G
+ /x68 LATIN SMALL LETTER H
+ /x69 LATIN SMALL LETTER I
+ /x6a LATIN SMALL LETTER J
+ /x6b LATIN SMALL LETTER K
+ /x6c LATIN SMALL LETTER L
+ /x6d LATIN SMALL LETTER M
+ /x6e LATIN SMALL LETTER N
+ /x6f LATIN SMALL LETTER O
+ /x70 LATIN SMALL LETTER P
+ /x71 LATIN SMALL LETTER Q
+ /x72 LATIN SMALL LETTER R
+ /x73 LATIN SMALL LETTER S
+ /x74 LATIN SMALL LETTER T
+ /x75 LATIN SMALL LETTER U
+ /x76 LATIN SMALL LETTER V
+ /x77 LATIN SMALL LETTER W
+ /x78 LATIN SMALL LETTER X
+ /x79 LATIN SMALL LETTER Y
+ /x7a LATIN SMALL LETTER Z
+ /x7b LEFT CURLY BRACKET
+ /x7c VERTICAL LINE
+ /x7d RIGHT CURLY BRACKET
+ /x7e TILDE
+ /x7f DELETE (DEL)
+END CHARMAP
diff --git a/localedata/locales/POSIX b/localedata/locales/POSIX
index 7ec7f1c577..45f2fa0b31 100644
--- a/localedata/locales/POSIX
+++ b/localedata/locales/POSIX
@@ -97,6 +97,20 @@ END LC_CTYPE
 LC_COLLATE
 % This is the POSIX Locale definition for the LC_COLLATE category.
 % The order is the same as in the ASCII code set.
+% Values above () inserted in order, per Issue 7 TC2, +% XBD, 7.3.2, LC_COLLATE Category in the POSIX Locale: +% > All characters not explicitly listed here shall be inserted +% > in the character collation order after the listed characters +% > and shall be assigned unique primary weights. If the listed +% > characters have ASCII encoding, the other characters shall +% > be in ascending order according to their coded character set values +% Since Issue 7 TC2 (XBD, 6.2 Character Encoding): +% > The POSIX locale shall contain 256 single-byte characters [...] +% (cf. bug 663, 674). +% this is in contrast to previous issues, which limited the POSIX +% locale to the Portable Character Set (7-bit ASCII). +% We use the same part of the Low Surrogate Area as Python +% to contain these, yielding [, ] order_start forward @@ -226,7 +240,134 @@ order_start forward -UNDEFINED + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + order_end % END LC_COLLATE diff --git a/localedata/tests-mbwc/tgn_locdef.h b/localedata/tests-mbwc/tgn= _locdef.h index ace63e2c58..a65b4a8999 100644 --- a/localedata/tests-mbwc/tgn_locdef.h +++ b/localedata/tests-mbwc/tgn_locdef.h @@ -9,8 +9,8 @@ /* German locale with ISO-8859-1. */ #define TST_LOC_de "de_DE.ISO-8859-1" =20 -/* For US we use ANSI_X3.4-1968 (ASCII). */ -#define TST_LOC_enUS "en_US.ANSI_X3.4-1968" +/* For US we use ASCII. */ +#define TST_LOC_enUS "en_US.ASCII" =20 /* Japanese locale with EUC-JP. */ #define TST_LOC_eucJP "ja_JP.EUC-JP" diff --git a/localedata/tst-ctype.sh b/localedata/tst-ctype.sh index 136db31a73..3db480d11c 100755 --- a/localedata/tst-ctype.sh +++ b/localedata/tst-ctype.sh @@ -27,7 +27,7 @@ status=3D0 =20 # Run the test programs. 
 rm -f ${common_objpfx}localedata/tst-ctype.out
-for loc in C de_DE.ISO-8859-1 de_DE.UTF-8 en_US.ANSI_X3.4-1968 ja_JP.EUC-JP; do
+for loc in C de_DE.ISO-8859-1 de_DE.UTF-8 en_US.ASCII ja_JP.EUC-JP; do
   if test -f tst-ctype-$loc.in; then
     input=tst-ctype-$loc.in
   else
diff --git a/localedata/tst-langinfo.sh b/localedata/tst-langinfo.sh
index d4d20701ee..39b023a9e2 100755
--- a/localedata/tst-langinfo.sh
+++ b/localedata/tst-langinfo.sh
@@ -89,40 +89,40 @@ C RADIXCHAR .
 C THOUSEP ""
 C YESEXPR ^[yY]
 C NOEXPR ^[nN]
-en_US.ANSI_X3.4-1968 ABMON_1 Jan
-en_US.ANSI_X3.4-1968 ABMON_2 Feb
-en_US.ANSI_X3.4-1968 ABMON_3 Mar
-en_US.ANSI_X3.4-1968 ABMON_4 Apr
-en_US.ANSI_X3.4-1968 ABMON_5 May
-en_US.ANSI_X3.4-1968 ABMON_6 Jun
-en_US.ANSI_X3.4-1968 ABMON_7 Jul
-en_US.ANSI_X3.4-1968 ABMON_8 Aug
-en_US.ANSI_X3.4-1968 ABMON_9 Sep
-en_US.ANSI_X3.4-1968 ABMON_10 Oct
-en_US.ANSI_X3.4-1968 ABMON_11 Nov
-en_US.ANSI_X3.4-1968 ABMON_12 Dec
-en_US.ANSI_X3.4-1968 MON_1 January
-en_US.ANSI_X3.4-1968 MON_2 February
-en_US.ANSI_X3.4-1968 MON_3 March
-en_US.ANSI_X3.4-1968 MON_4 April
-en_US.ANSI_X3.4-1968 MON_5 May
-en_US.ANSI_X3.4-1968 MON_6 June
-en_US.ANSI_X3.4-1968 MON_7 July
-en_US.ANSI_X3.4-1968 MON_8 August
-en_US.ANSI_X3.4-1968 MON_9 September
-en_US.ANSI_X3.4-1968 MON_10 October
-en_US.ANSI_X3.4-1968 MON_11 November
-en_US.ANSI_X3.4-1968 MON_12 December
-en_US.ANSI_X3.4-1968 AM_STR AM
-en_US.ANSI_X3.4-1968 PM_STR PM
-en_US.ANSI_X3.4-1968 D_T_FMT "%a %d %b %Y %r %Z"
-en_US.ANSI_X3.4-1968 D_FMT "%m/%d/%Y"
-en_US.ANSI_X3.4-1968 T_FMT "%r"
-en_US.ANSI_X3.4-1968 T_FMT_AMPM "%I:%M:%S %p"
-en_US.ANSI_X3.4-1968 RADIXCHAR .
-en_US.ANSI_X3.4-1968 THOUSEP ,
-en_US.ANSI_X3.4-1968 YESEXPR ^[+1yY]
-en_US.ANSI_X3.4-1968 NOEXPR ^[-0nN]
+en_US.ASCII ABMON_1 Jan
+en_US.ASCII ABMON_2 Feb
+en_US.ASCII ABMON_3 Mar
+en_US.ASCII ABMON_4 Apr
+en_US.ASCII ABMON_5 May
+en_US.ASCII ABMON_6 Jun
+en_US.ASCII ABMON_7 Jul
+en_US.ASCII ABMON_8 Aug
+en_US.ASCII ABMON_9 Sep
+en_US.ASCII ABMON_10 Oct
+en_US.ASCII ABMON_11 Nov
+en_US.ASCII ABMON_12 Dec
+en_US.ASCII MON_1 January
+en_US.ASCII MON_2 February
+en_US.ASCII MON_3 March
+en_US.ASCII MON_4 April
+en_US.ASCII MON_5 May
+en_US.ASCII MON_6 June
+en_US.ASCII MON_7 July
+en_US.ASCII MON_8 August
+en_US.ASCII MON_9 September
+en_US.ASCII MON_10 October
+en_US.ASCII MON_11 November
+en_US.ASCII MON_12 December
+en_US.ASCII AM_STR AM
+en_US.ASCII PM_STR PM
+en_US.ASCII D_T_FMT "%a %d %b %Y %r %Z"
+en_US.ASCII D_FMT "%m/%d/%Y"
+en_US.ASCII T_FMT "%r"
+en_US.ASCII T_FMT_AMPM "%I:%M:%S %p"
+en_US.ASCII RADIXCHAR .
+en_US.ASCII THOUSEP ,
+en_US.ASCII YESEXPR ^[+1yY]
+en_US.ASCII NOEXPR ^[-0nN]
 en_US.ISO-8859-1 ABMON_1 Jan
 en_US.ISO-8859-1 ABMON_2 Feb
 en_US.ISO-8859-1 ABMON_3 Mar
diff --git a/localedata/tst-mbswcs6.c b/localedata/tst-mbswcs6.c
index ccf1c9d35a..1b3a43f8e8 100644
--- a/localedata/tst-mbswcs6.c
+++ b/localedata/tst-mbswcs6.c
@@ -63,7 +63,7 @@ main (void)
   res = do_test ("C");
   res |= do_test ("de_DE.ISO-8859-1");
   res |= do_test ("de_DE.UTF-8");
-  res |= do_test ("en_US.ANSI_X3.4-1968");
+  res |= do_test ("en_US.ASCII");
   res |= do_test ("ja_JP.EUC-JP");
   res |= do_test ("hr_HR.ISO-8859-2");
   //res |= do_test ("ru_RU.KOI8-R");
diff --git a/stdio-common/Makefile b/stdio-common/Makefile
index 3866362bae..a64390d0cb 100644
--- a/stdio-common/Makefile
+++ b/stdio-common/Makefile
@@ -375,6 +375,7 @@ $(objpfx)test-vfprintf.out: $(gen-locales)
 $(objpfx)tst-grouping.out: $(gen-locales)
 $(objpfx)tst-grouping2.out: $(gen-locales)
 $(objpfx)tst-grouping_iterator.out: $(gen-locales)
+$(objpfx)tst-printf-bz25691-mem.out: $(gen-locales)
 $(objpfx)tst-sprintf.out: $(gen-locales)
 $(objpfx)tst-sscanf.out: $(gen-locales)
 $(objpfx)tst-swprintf.out: $(gen-locales)
diff --git a/stdio-common/tst-printf-bz25691.c b/stdio-common/tst-printf-bz25691.c
index 44e9ea7d9d..c887b9962f 100644
--- a/stdio-common/tst-printf-bz25691.c
+++ b/stdio-common/tst-printf-bz25691.c
@@ -30,6 +30,8 @@
 static int
 do_test (void)
 {
+  setlocale(LC_CTYPE, "C.UTF-8");
+
   mtrace ();
 
   /* For 's' conversion specifier with 'l' modifier the array must be
diff --git a/wcsmbs/Makefile b/wcsmbs/Makefile
index 431136b9c9..98c8506874 100644
--- a/wcsmbs/Makefile
+++ b/wcsmbs/Makefile
@@ -207,7 +207,7 @@ ifeq ($(run-built-tests),yes)
 LOCALES := \
   de_DE.ISO-8859-1 \
   de_DE.UTF-8 \
-  en_US.ANSI_X3.4-1968 \
+  en_US.ASCII \
   fa_IR.UTF-8 \
   hr_HR.ISO-8859-2 \
   ja_JP.EUC-JP \
diff --git a/wcsmbs/tst-btowc.c b/wcsmbs/tst-btowc.c
index 1485076ca4..aee4a77136 100644
--- a/wcsmbs/tst-btowc.c
+++ b/wcsmbs/tst-btowc.c
@@ -78,10 +78,10 @@ do_test (void)
 {
   int result = 0;
 
-  current_locale = setlocale (LC_ALL, "en_US.ANSI_X3.4-1968");
+  current_locale = setlocale (LC_ALL, "en_US.ASCII");
   if (current_locale == NULL)
     {
-      puts ("cannot set locale \"en_US.ANSI_X3.4-1968\"");
+      puts ("cannot set locale \"en_US.ASCII\"");
       result = 1;
     }
   else
diff --git a/wcsmbs/wcsmbsload.c b/wcsmbs/wcsmbsload.c
index 61392e0b1e..e7d69ee4bf 100644
--- a/wcsmbs/wcsmbsload.c
+++ b/wcsmbs/wcsmbsload.c
@@ -33,10 +33,10 @@ static const struct __gconv_step to_wc =
   .__shlib_handle = NULL,
   .__modname = NULL,
   .__counter = INT_MAX,
-  .__from_name = (char *) "ANSI_X3.4-1968//TRANSLIT",
+  .__from_name = (char *) "ANSI_X3.4-1968",
   .__to_name = (char *) "INTERNAL",
-  .__fct = __gconv_transform_ascii_internal,
-  .__btowc_fct = __gconv_btowc_ascii,
+  .__fct = __gconv_transform_posix_internal,
+  .__btowc_fct = __gconv_btowc_posix,
   .__init_fct = NULL,
   .__end_fct = NULL,
   .__min_needed_from = 1,
@@ -53,8 +53,8 @@ static const struct __gconv_step to_mb =
   .__modname = NULL,
   .__counter = INT_MAX,
   .__from_name = (char *) "INTERNAL",
-  .__to_name = (char *) "ANSI_X3.4-1968//TRANSLIT",
-  .__fct = __gconv_transform_internal_ascii,
+  .__to_name = (char *) "ANSI_X3.4-1968",
+  .__fct = __gconv_transform_internal_posix,
   .__btowc_fct = NULL,
   .__init_fct = NULL,
   .__end_fct = NULL,
@@ -67,7 +67,9 @@ static const struct __gconv_step to_mb =
 };
 
 
-/* For the default locale we only have to handle ANSI_X3.4-1968.  */
+/* The default/"POSIX"/"C" locale is an 8-bit-clean mapping
+   with ASCII in the first 128 characters;
+   we lift the remaining bytes by .  */
 const struct gconv_fcts __wcsmbs_gconv_fcts_c =
 {
   .towc = (struct __gconv_step *) &to_wc,
-- 
2.39.2