From: Florian Weimer
To: Carlos O'Donell
Cc: libc-alpha@sourceware.org
Subject: Re: [PATCH v4 2/4] Update UTF-8 charmap processing.
References: <20210428130033.3196848-1-carlos@redhat.com> <20210428130033.3196848-3-carlos@redhat.com> <87mtthfakh.fsf@oldenburg.str.redhat.com>
Date: Fri, 30 Apr 2021 06:18:38 +0200
In-Reply-To: (Carlos O'Donell's message of "Thu, 29 Apr 2021 17:02:04 -0400")
Message-ID: <87h7joe76p.fsf@oldenburg.str.redhat.com>

* Carlos O'Donell:

> On 4/29/21 10:07 AM, Florian Weimer wrote:
>> * Carlos O'Donell:
>>
>>>  def convert_to_hex(code_point):
>>>      '''Converts a code point to a hexadecimal UTF-8 representation
>>> -    like /x**/x**/x**.'''
>>> -    # Getting UTF8 of Unicode characters.
>>> -    # In Python3, .encode('UTF-8') does not work for
>>> -    # surrogates. Therefore, we use this conversion table
>>> -    surrogates = {
>>> -        0xD800: '/xed/xa0/x80',
>>> -        0xDB7F: '/xed/xad/xbf',
>>> -        0xDB80: '/xed/xae/x80',
>>> -        0xDBFF: '/xed/xaf/xbf',
>>> -        0xDC00: '/xed/xb0/x80',
>>> -        0xDFFF: '/xed/xbf/xbf',
>>> -    }
>>> -    if code_point in surrogates:
>>> -        return surrogates[code_point]
>>> -    return ''.join([
>>> -        '/x{:02x}'.format(c) for c in chr(code_point).encode('UTF-8')
>>> -    ])
>>> +    ready for use in a locale character map specification e.g.
>>> +    /xc2/xaf for MACRON.
>>> +
>>> +    '''
>>> +    cp_locale = ''
>>> +    cp_bytes = chr(code_point).encode('UTF-8', 'surrogatepass')
>>> +    for byte in cp_bytes:
>>> +        cp_locale += ''.join('/x{:02x}'.format(byte))
>>> +    return cp_locale
>>
>> I think you should keep the list comprehension.  That ''.join() is
>> unnecessary.
>
> Like this?
>
>     return ''.join(['/x{:02x}'.format(c) \
>                     for c in chr(code_point).encode('UTF-8', 'surrogatepass')])
>
> (tested works fine and produces the same results)

Yes, exactly.  Thanks.  The patch should be fine with this.

Florian
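
For reference, a minimal sketch of how the function might read with the
agreed change folded in. It is reconstructed from the hunks quoted in this
thread and is not the committed patch; the docstring is shortened and the
loop variable name is an assumption.

```python
def convert_to_hex(code_point):
    '''Return the UTF-8 encoding of code_point as /x** escapes,
    e.g. /xc2/xaf for MACRON (U+00AF).'''
    # 'surrogatepass' lets lone surrogates (U+D800..U+DFFF) be encoded
    # instead of raising UnicodeEncodeError, which is what makes the
    # old hand-written surrogate table unnecessary.
    cp_bytes = chr(code_point).encode('UTF-8', 'surrogatepass')
    # Iterating over a bytes object yields ints, one per encoded byte.
    return ''.join(['/x{:02x}'.format(byte) for byte in cp_bytes])


print(convert_to_hex(0x00AF))  # /xc2/xaf
print(convert_to_hex(0xD800))  # /xed/xa0/x80, same as the removed table entry
```

The list comprehension sits inside brackets, so the trailing backslash from
the quoted version is not needed for line continuation.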