Message-ID: <706a1d98-1d50-ccfd-47ac-4b402fe19454@redhat.com>
Date: Mon, 4 Jul 2022 15:54:08 -0400
Subject: Re: [PATCH 0/5] Assume UTF-8 encoding for localedef input files
To: Florian Weimer, libc-alpha@sourceware.org
From: Carlos O'Donell
Organization: Red Hat
List-Id: Libc-alpha mailing list

On 5/19/22 17:06, Florian Weimer via Libc-alpha wrote:
> This is a backwards-compatible change because of two localedef bugs that
> cause bytes outside the ASCII range to produce unpredictable results:
>
> If char is signed, conversion from the assumed ISO-8859-1 input format
> to a UCS-4 codepoint does not produce the correct result.
>
> If the output character set is not overlapping ISO-8859-1 in the
> characters used in the locale, the required character set conversion
> is not applied.
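
To make the first bug concrete, here is a minimal sketch of the sign-extension
problem. It is not the code from linereader.c; the function names are
hypothetical and chosen only for illustration. It assumes the input byte is
ISO-8859-1, where the byte value is also the code point, and it models a
platform where plain char is signed by using signed char explicitly.

```c
#include <stdint.h>

/* Hypothetical helper, not glibc code: widens the byte directly.
   A negative value (any byte >= 0x80 stored in a signed char) is
   converted modulo 2^32, so the high bits of the result are all set.  */
uint32_t
latin1_to_ucs4_sign_extended (signed char c)
{
  return (uint32_t) c;
}

/* Hypothetical helper, not glibc code: goes through unsigned char
   first, so the result is always in the range 0x00..0xFF.  */
uint32_t
latin1_to_ucs4_masked (signed char c)
{
  return (uint32_t) (unsigned char) c;
}
```

For the ISO-8859-1 byte 0xE4, which a signed char holds as -28, the first
function returns 0xFFFFFFE4 while the second returns 0xE4 (U+00E4), the value
a UCS-4 conversion should produce.
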
>
> This is why I think we can switch to UTF-8 without impacting backwards
> compatibility, and there is no need for an option to restore the old
> behavior.

I can agree with that. In some sense I think the parsing of locale files is
something that can require developers and users to adjust the syntax, though
we'd like it to be backwards compatible. In this case it couldn't have worked,
since non-ASCII bytes already produced unpredictable results.

Thank you for working on this to make locales easier to use. I particularly
appreciate the example conversion of de_DE.

Overall the series looks good, and we should commit it ahead of glibc 2.36 so
we can get any new strings translated by the Translation Project (TP). In
particular, this series adds some error messages for the use of UTF-8 in the
locale sources.

Again, I really appreciate that this makes it easier for natural language
speakers to write, adjust, and review locale sources. In cases where
disambiguation is required, we still have the capacity to write it differently
if we need to.

This continues the earlier work to convert from U-codes to ASCII.

As when we last had this discussion, supporting the compilation of locale
sources on a system that lacks UTF-8 is no longer a requirement that we should
have for the library.

> Tested on i686-linux-gnu and x86_64-linux-gnu.
>
> Thanks,
> Florian
>
> Florian Weimer (5):
>   locale: Turn ADDC and ADDS into functions in linereader.c
>   locale: Fix signed char bug in lr_getc
>   locale: Introduce translate_unicode_codepoint into linereader.c
>   locale: localdef input files are now encoded in UTF-8
>   de_DE: Convert to UTF-8
>
>  NEWS                         |   4 +
>  locale/programs/linereader.c | 504 ++++++++++++++++++++++-------------
>  locale/programs/linereader.h |   2 +-
>  localedata/locales/de_DE     |  32 +--
>  4 files changed, 338 insertions(+), 204 deletions(-)
>
>
> base-commit: 2d5ec6692f5746ccb11db60976a6481ef8e9d74f

-- 
Cheers,
Carlos.