From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: by sourceware.org (Postfix, from userid 2155)
        id B55003858D1E; Tue, 14 Feb 2023 12:09:53 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org B55003858D1E
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org;
        s=default; t=1676376593;
        bh=CVMvxak5U2CwQ3K+hlx6/q4wRus9Kv0ADlLuGUA/Cag=;
        h=From:To:Subject:Date:From;
        b=poKbi1BoNVHMgxDQHB8Ra2rDEr2WrjVIwrwLmGHyFiqjJoBlOl3YwdDcmVsX6SRED
         vDX6jSqnR4r8/PBVcStWRQlxhPrrPEesClAQ2j9jCOUOuaIWWhMLk1ZGPKr3YaXDad
         EXG0WyPT71990q4qbsoUqKrkUEvDj/ZYOQCx6EiE=
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
From: Corinna Vinschen
To: cygwin-cvs@sourceware.org
Subject: [newlib-cygwin/main] Cygwin: mbrtowi: define replacement for mbrtowc, returning UTF-32 value
X-Act-Checkin: newlib-cygwin
X-Git-Author: Corinna Vinschen
X-Git-Refname: refs/heads/main
X-Git-Oldrev: 210eca1b31090d4c93c22a3152f1faa795dfd775
X-Git-Newrev: 60c25da90d015f27c5697c6db7ab0557585d09aa
Message-Id: <20230214120953.B55003858D1E@sourceware.org>
Date: Tue, 14 Feb 2023 12:09:53 +0000 (GMT)
List-Id:

https://sourceware.org/git/gitweb.cgi?p=newlib-cygwin.git;h=60c25da90d015f27c5697c6db7ab0557585d09aa

commit 60c25da90d015f27c5697c6db7ab0557585d09aa
Author:     Corinna Vinschen
AuthorDate: Tue Feb 14 12:20:20 2023 +0100
Commit:     Corinna Vinschen
CommitDate: Tue Feb 14 12:20:20 2023 +0100

    Cygwin: mbrtowi: define replacement for mbrtowc, returning UTF-32 value

    Given how UTF-16 isn't capable to hold all Unicode chars in a
    single wchar_t, we need a function returning a wint_t value
    representing a UTF-32 value for comparison functions.
    Fortunately the important wide character functions like
    towupper/towlower, isw, iswctype, etc, already take wint_t values
    and newlib handles them as UTF-32.

    If only we had switched wchar_t to 32 bit way back when... sigh.

    Signed-off-by: Corinna Vinschen

Diff:
---
 winsup/cygwin/local_includes/wchar.h |  4 ++++
 winsup/cygwin/strfuncs.cc            | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/winsup/cygwin/local_includes/wchar.h b/winsup/cygwin/local_includes/wchar.h
index b2ddd457568f..3d746c29b9bf 100644
--- a/winsup/cygwin/local_includes/wchar.h
+++ b/winsup/cygwin/local_includes/wchar.h
@@ -39,6 +39,10 @@ extern wctomb_f __utf8_wctomb;
 
 #define __WCTOMB (__get_current_locale ()->wctomb)
 
+/* replacement function for mbrtowc, returning a wint_t representing
+   a UTF-32 value. Defined in strfuncs.cc */
+extern wint_t mbrtowi (wint_t *, const char *, size_t, mbstate_t *);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/winsup/cygwin/strfuncs.cc b/winsup/cygwin/strfuncs.cc
index 0ab2290539a8..0b9d8ac1f639 100644
--- a/winsup/cygwin/strfuncs.cc
+++ b/winsup/cygwin/strfuncs.cc
@@ -112,6 +112,38 @@ transform_chars_af_unix (PWCHAR out, const char *path, __socklen_t len)
   return out;
 }
 
+/* replacement function for mbrtowc, returning a wint_t representing
+   a UTF-32 value. */
+extern "C" wint_t
+mbrtowi (wint_t *pwi, const char *s, size_t n, mbstate_t *ps)
+{
+  size_t len, len2;
+  wchar_t w1, w2;
+
+  len = mbrtowc (&w1, s, n, ps);
+  if (len == (size_t) -1 || len == (size_t) -2)
+    return len;
+  *pwi = w1;
+  /* Convert surrogate pair to wint_t value */
+  if (len > 0 && w1 >= 0xd800 && w1 <= 0xdbff)
+    {
+      s += len;
+      n -= len;
+      len2 = mbrtowc (&w2, s, n, ps);
+      if (len2 > 0 && w2 >= 0xdc00 && w2 <= 0xdfff)
+        {
+          len += len2;
+          *pwi = (((w1 & 0x3ff) << 10) | (w2 & 0x3ff)) + 0x10000;
+        }
+      else
+        {
+          len = (size_t) -1;
+          errno = EILSEQ;
+        }
+    }
+  return len;
+}
+
 /* The SJIS, JIS and eucJP conversion in newlib does not use UTF as
    wchar_t character representation. That's unfortunate for us since
    we require UTF for the OS. What we do here is to have our own