Message-ID: <4d7eb2e9-7a41-97ce-1041-795e266803ce@redhat.com>
Date: Tue, 2 Nov 2021 12:52:58 -0400
Subject: Re: [idea] Update ISO 14651 file in locales to the latest standard version
To: Florian Weimer, Alexander Bantyev
Cc: libc-help@sourceware.org, mfabian@redhat.com
References: <878rz0zwvu.fsf@mid.deneb.enyo.de>
From: Carlos O'Donell
Organization: Red Hat
In-Reply-To: <878rz0zwvu.fsf@mid.deneb.enyo.de>
List-Id: Libc-help mailing list

On 10/10/21 12:07, Florian Weimer wrote:
> * Alexander Bantyev:
>
>> The file localedef/locales/iso14651_t1_common is, as far as I can tell,
>> supposed to be taken from .
>> However,
>> the version in glibc repository is quite old (from 2016, I think) and is
>> missing some new Unicode codepoints. There have been new editions to the
>> standard, the newest being edition 6 from 2020:
>>
>>
>> Perhaps the file in the glibc repository can be updated to match the
>> latest standard?
>
> I think it's scary to update this file because it alters the result of
> bracket patterns in regular expressions. The file is no longer fully
> automatically generated, I think. Implementing rational ranges where
> it counts in glibc would be one way forward here.
>
> Cc:ing Mike and Carlos, who have more details.

(1) Where does glibc's ISO 14651 data come from?

We use ISO 14651 in glibc for collation weights.

We do not use ISO 14651 in glibc for collation element ordering (CEO).

(2) Is glibc's ISO 14651 data updated in an automated fashion?

No. Importing new ISO 14651 data is a manual and difficult process: the
collation tailorings of all existing locales have to be reviewed and
harmonized with the updates from ISO 14651.

(3) What about regexp ranges?

Regular expression ranges rely on "collation element ordering" (not
weights), so after importing ISO 14651 updates we must also update the
element orders to retain rational ranges that match the expectations of
English-language speakers, e.g. [a-z], [A-Z], and [0-9].

(4) When was the ISO 14651 data last updated for glibc?

In 2018 we updated to ISO 14651 4th Edition, which was harmonized with
Unicode 9.0.0.

We have not updated to 5th or 6th Edition yet.

I've filed the following bug to track this:

Bug 28528 - Update to ISO 14651 6th Edition 2020.
https://sourceware.org/bugzilla/show_bug.cgi?id=28528

Hopefully this answers your questions.

-- 
Cheers,
Carlos.
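
To illustrate the distinction drawn in (1) and (3) between collation
weights and collation element ordering, here is a minimal toy sketch in
Python. Everything in it is an assumption made for illustration: the
six-entry table, the two-level weights, and the helper names are
hypothetical and are not glibc's data or implementation. Plain code-point
comparison is used only as a stand-in for a "rational range".

```python
# Toy model only: this table is hypothetical and is NOT glibc's data.
# Each entry is (character, (primary weight, tertiary weight)), listed in
# a made-up collation element order that interleaves lower and upper case.
TOY_TABLE = [
    ("a", (1, 0)), ("A", (1, 1)),
    ("b", (2, 0)), ("B", (2, 1)),
    ("c", (3, 0)), ("C", (3, 1)),
]

# Weights are used for sorting strings.
WEIGHTS = dict(TOY_TABLE)
# Element order is the linear position of each element in the table.
ELEMENT_ORDER = {ch: pos for pos, (ch, _) in enumerate(TOY_TABLE)}


def sort_key(word):
    """Compare all primary weights first, then all tertiary weights."""
    return ([WEIGHTS[ch][0] for ch in word],
            [WEIGHTS[ch][1] for ch in word])


def in_range_by_element_order(ch, lo, hi):
    """Range membership decided by position in the element order."""
    return ELEMENT_ORDER[lo] <= ELEMENT_ORDER[ch] <= ELEMENT_ORDER[hi]


def in_range_by_code_point(ch, lo, hi):
    """Stand-in for a 'rational' range: plain code-point comparison."""
    return ord(lo) <= ord(ch) <= ord(hi)


if __name__ == "__main__":
    # Sorting consults only the weights.
    print(sorted(["Cab", "abc", "Abc", "bca"], key=sort_key))
    # Range membership consults only the element order (or code points).
    for ch in "abcABC":
        print(ch,
              in_range_by_element_order(ch, "a", "c"),
              in_range_by_code_point(ch, "a", "c"))
```

In this toy table the range a-c, taken by element order, also covers "A"
and "B" because the two cases are interleaved, while the code-point
version covers only "a", "b", and "c". That is the kind of surprise the
element orders have to be adjusted to avoid whenever new weights are
imported.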