From: "johannes at sipsolutions dot net"
To: glibc-bugs@sourceware.org
Subject: [Bug locale/2373] Restrict UTF-8 to 17 planes, as required by RFC 3629
Date: Sat, 06 Jun 2020 21:53:42 +0000
X-Bugzilla-Product: glibc
X-Bugzilla-Component: locale
X-Bugzilla-Version: 2.3.6
X-Bugzilla-Severity: normal
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at sourceware dot org
X-Bugzilla-Flags: security-

https://sourceware.org/bugzilla/show_bug.cgi?id=2373

--- Comment #8 from Johannes Berg ---
Oh, ok. The original comment here seemed to imply that ISO was the last one to
hold out for more space than the others.

To carry over some discussion from the bug I originally filed (which was since
closed as duplicate in favour of this one):

This came up because Python does this conversion using mbstowcs() and/or
mbrtowc(), but then later goes to check that valid characters were returned.
The Python discussion is here: https://bugs.python.org/issue35883

Note that this isn't just about the range: the RFC also prohibits encoding the
code points reserved for surrogate pairs:

RFC 3629:
   The definition of UTF-8 prohibits encoding character numbers between
   U+D800 and U+DFFF, which are reserved for use with the UTF-16
   encoding form (as surrogate pairs) and do not directly represent
   characters.

(Python internally may actually allow using this in a UTF-8-like encoded
string [that they call utf-8b] to carry arbitrary bytes around.)
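
To make the two restrictions concrete, here is a minimal sketch of the kind of
check a caller currently has to do itself on each value it gets back from
mbrtowc(). The function and macro names are made up for illustration; this is
not the code Python or glibc actually uses, only the two RFC 3629 rules
(nothing above U+10FFFF, nothing in U+D800..U+DFFF) written out in C.

```c
#include <stdbool.h>
#include <stdint.h>

/* Highest code point RFC 3629 allows (last code point of plane 16). */
#define UNICODE_MAX_CODE_POINT 0x10FFFFu

/* Hypothetical helper: true if cp is in the range reserved for UTF-16
 * surrogate pairs, which UTF-8 must not encode. */
bool is_surrogate(uint32_t cp)
{
    return cp >= 0xD800u && cp <= 0xDFFFu;
}

/* Hypothetical helper: true if cp is a value that RFC 3629 permits a
 * UTF-8 sequence to represent. */
bool is_valid_utf8_code_point(uint32_t cp)
{
    return cp <= UNICODE_MAX_CODE_POINT && !is_surrogate(cp);
}
```

A caller would pass the decoded wchar_t, cast to uint32_t, after a successful
mbrtowc() call and treat a false result as a decoding error. If the decoder
enforced both rules itself, this extra pass would not be needed.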