Hi!

On Tue, Sep 06, 2022 at 04:19:01PM +0200, Florian Weimer wrote:
> * наб via Libc-alpha:
>
> > This is a trivial patch, largely duplicating the extant ASCII code
> >
> > There are two user-facing changes:
> >   * nl_langinfo(CODESET) is "POSIX" instead of "ANSI_X3.4-1968"
> >   * mbrtowc() and friends return b if b <= 0x7F else +b
> >
> > Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively:
> >   (a) is 1-byte, stateless, and contains 256 characters
> >   (b) which collate in byte order
> >   (c) the first 128 characters are equivalent to ASCII (like previous)
> > cf. https://www.austingroupbugs.net/view.php?id=663 for a summary of
> > changes to the standard;
> > in short, this means that mbrtowc() must never fail and must return
> >   b if b <= 0x7F else ab+c for all bytes b
> >   where c is some constant >=0x80
> >     and a is a positive integer constant
> >
> > By strategically picking c= we land at the tail-end of the
> > Unicode Low Surrogate Area at DC00-DFFF, described as
> > > Isolated surrogate code points have no interpretation;
> > > consequently, no character code charts or names lists
> > > are provided for this range.
> > and match musl
>
> We don't match Python and its surrogateescape encoding (PEP 838).
404?

> It maps invalid bytes in the 0x80…0xff range to U+DC80…U+DCFF.
(The same as musl.)

> It may make more sense to align with that.
With a=1 and c=, assuming it's as you say, we very much do?
  $ printf '\x80\xff' | output/elf/ld.so --library-path output/ output/iconv/iconv_prog -fPOSIX -tUCS4 | hd
  00000000  00 00 df 80 00 00 df ff                           |........|
  00000008

> Anyway, regarding mechanics, we'll need a new localedata/charmaps/POSIX
> charmap, I think.  This charmap then can be tested against the gconv
> converter.
Hm, the problem with that is that tst-tables -> tst-table ->
tst-table-from (and -to) convert by constructing a UTF-8 sequence,
and glibc rejects unpaired surrogates.
The output for tst-table-from UTF-8 is:
  ...
  0xED9FBE 0xD7FE
  0xED9FBF 0xD7FF
  0xEE8080 0xE000
  0xEE8081 0xE001
  ...
i.e. there's a gap for the surrogates; and, indeed, the charmap reads
  /xed/x9f/xbb HANGUL JONGSEONG PHIEUPH-THIEUTH
  % /xed/xa0/x80
  % /xed/xad/xbf
  % /xed/xae/x80
  % /xed/xaf/xbf
  % /xed/xb0/x80
  % /xed/xbf/xbf
  .. /xee/x80/x80
with the surrogate range commented out; this dates back to the
inclusion of the UTF-8 generator scripts in 2015
(4a4839c94a4c93ffc0d5b95c69a08b02a57007f2), and these exclusions are
deliberate (grep for surrog in localedata/unicode-gen/utf8_gen.py).

Given this limitation, expanding the charmap to ANSI_X3.4-1968 + ..
doesn't actually test much: having them as separate codepoints will
always fail tests, and dot-notation lines are ignored when generating
the comparison tables, so this particular type of test just proves that
POSIX is the same as ANSI_X3.4-1968 for the first 128 characters.

There's already an exhaustive iconv_prog-based testsuite
(cf. additions to iconv/tst-iconv_prog.sh), though.

> You should put the new converters into a separate file (not
> iconv/gconv_simple.c), then the s390x version will use that
> automatically.
Oh, of course! Moved to iconv/gconv_posix.c.

> > diff --git a/localedata/locales/POSIX b/localedata/locales/POSIX
> > index 7ec7f1c577..fc34a6abc1 100644
> > --- a/localedata/locales/POSIX
> > +++ b/localedata/locales/POSIX
> > @@ -97,6 +97,20 @@ END LC_CTYPE
> >  LC_COLLATE
> >  % This is the POSIX Locale definition for the LC_COLLATE category.
>
> Isn't this just the C locale?
Yes, C is defined to be POSIX.

> We don't have a separate file for that.
Yes, we very obviously do, seeing as this patch edits it?
Nothing consumes it AFAICT, though.

> > diff --git a/wcsmbs/wcsmbsload.c b/wcsmbs/wcsmbsload.c
> > index 0f0f55f9ed..f87099bcf5 100644
> > --- a/wcsmbs/wcsmbsload.c
> > +++ b/wcsmbs/wcsmbsload.c
> > @@ -33,10 +33,10 @@ static const struct __gconv_step to_wc =
> >    .__shlib_handle = NULL,
> >    .__modname = NULL,
> >    .__counter = INT_MAX,
> > -  .__from_name = (char *) "ANSI_X3.4-1968//TRANSLIT",
> > +  .__from_name = (char *) "POSIX",
> >    .__to_name = (char *) "INTERNAL",
> > -  .__fct = __gconv_transform_ascii_internal,
> > -  .__btowc_fct = __gconv_btwoc_ascii,
> > +  .__fct = __gconv_transform_posix_internal,
> > +  .__btowc_fct = __gconv_btwoc_posix,
> >    .__init_fct = NULL,
> >    .__end_fct = NULL,
> >    .__min_needed_from = 1,
> > @@ -53,8 +53,8 @@ static const struct __gconv_step to_mb =
> >    .__modname = NULL,
> >    .__counter = INT_MAX,
> >    .__from_name = (char *) "INTERNAL",
> > -  .__to_name = (char *) "ANSI_X3.4-1968//TRANSLIT",
> > -  .__fct = __gconv_transform_internal_ascii,
> > +  .__to_name = (char *) "POSIX",
> > +  .__fct = __gconv_transform_internal_posix,
> >    .__btowc_fct = NULL,
> >    .__init_fct = NULL,
> >    .__end_fct = NULL,
>
> This makes the comment on __wcsmbs_gconv_fcts_c in the same file
> obsolete.
Comment fixed.

> Thanks,
> Florian

New patchset in followup.

Best,
наб
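
For illustration, the byte <-> wide character mapping discussed above
(b if b <= 0x7F, otherwise a constant offset plus b, i.e. a=1) can be
sketched roughly as below. This is only a sketch under assumptions, not
the code in iconv/gconv_posix.c: the function names are hypothetical,
and the 0xDF00 offset is an assumption read off the iconv_prog output
above (0x80 -> 0000df80, 0xff -> 0000dfff).

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: offset inferred from the iconv_prog output quoted above
   (0x80 -> U+DF80, 0xFF -> U+DFFF); not taken from the patch itself. */
#define POSIX_HIGH_OFFSET 0xDF00u

/* Hypothetical helper: byte -> wide character.  Never fails:
   ASCII maps to itself, high bytes land in U+DF80..U+DFFF. */
uint32_t posix_byte_to_wc(unsigned char b)
{
    return b <= 0x7F ? (uint32_t) b : POSIX_HIGH_OFFSET + b;
}

/* Hypothetical helper: wide character -> byte, or -1 if the wide
   character has no representation in this 256-character set. */
int posix_wc_to_byte(uint32_t wc)
{
    if (wc <= 0x7F)
        return (int) wc;
    if (wc >= POSIX_HIGH_OFFSET + 0x80 && wc <= POSIX_HIGH_OFFSET + 0xFF)
        return (int) (wc - POSIX_HIGH_OFFSET);
    return -1;
}

/* Checks that every one of the 256 bytes survives a round trip;
   returns 1 on success (asserts otherwise). */
int posix_roundtrip_ok(void)
{
    for (unsigned int b = 0; b <= 0xFF; b++) {
        uint32_t wc = posix_byte_to_wc((unsigned char) b);
        assert(posix_wc_to_byte(wc) == (int) b);
    }
    return 1;
}
```

Calling posix_roundtrip_ok() exercises all 256 bytes; anything outside
ASCII and the assumed U+DF80..U+DFFF window (e.g. U+20AC) is rejected by
posix_wc_to_byte() with -1.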