From: Carlos O'Donell
To: libc-alpha@sourceware.org, fweimer@redhat.com
Subject: [PATCH v10 0/2] C.UTF-8
Date: Mon, 6 Sep 2021 00:15:55 -0400
Message-Id: <20210906041557.2470672-1-carlos@redhat.com>

The following changes implement a minimally sized C.UTF-8.

First we implement the 'codepoint_collation' directive. Then we implement C.UTF-8 with an LC_COLLATE that uses the 'codepoint_collation' directive to support using strcmp or wcscmp for collation, i.e. code point sorting.

The final C.UTF-8 is only ~396KiB, with the largest part, ~346KiB, in LC_CTYPE for all of Unicode.

v10 fixes a defect in the transbug.c test.

v9 is rebased against the changes to remove ISO-8859-1 characters from the bug-regex1.c test (69623c0db0a540f26ee537bae09446d3dcdf1f80).

v8 includes a NEWS entry for the updated C.UTF-8.
v7 fixed the regressions detected in Fedora Rawhide (https://bugzilla.redhat.com/show_bug.cgi?id=1986421) by generating identity tables for _NL_COLLATE_COLLSEQMB and _NL_COLLATE_COLLSEQWC to provide mappings for ASCII characters. This ensures that static applications using the new C.UTF-8 have a functioning fnmatch, regcomp, and regexec for ASCII ranges. This raises the size of LC_COLLATE from 92 to 1406 bytes. Valgrind reports no errors using the tables with C.UTF-8 under tst-fnmatch.

v7 also corrected collation sequence byte ordering on BE targets. I verified this by cross-building locales with localedef --big-endian and confirming that a C.UTF-8 built natively on s390x is the same as a C.UTF-8 built on x86_64 with --big-endian.

The fixes that were in v4 for nrules == 0 will be included in the next release of glibc, and when those are proven correct they can be backported to provide dynamic or newly compiled static applications with the ability to use all code points in ranges.

Carlos O'Donell (2):
  Add 'codepoint_collation' support for LC_COLLATE.
  Add generic C.UTF-8 locale (Bug 17318)

 NEWS                             |  10 +-
 iconv/Makefile                   |  22 +-
 iconv/tst-iconv9.c               |  87 +++++
 locale/C-collate-seq.c           | 101 ++++++
 locale/C-collate.c               |  78 +----
 locale/programs/ld-collate.c     |  36 +-
 locale/programs/locfile-kw.gperf |   1 +
 locale/programs/locfile-kw.h     | 299 ++++++++---------
 locale/programs/locfile-token.h  |   1 +
 localedata/C.UTF-8.in            | 157 +++++++++
 localedata/Makefile              |   2 +
 localedata/SUPPORTED             |   1 +
 localedata/locales/C             | 194 +++++++++++
 posix/Makefile                   |  16 +-
 posix/bug-regex1.c               |  20 ++
 posix/bug-regex19.c              |  22 +-
 posix/bug-regex4.c               |  25 ++
 posix/bug-regex6.c               |   2 +-
 posix/transbug.c                 |  24 +-
 posix/tst-fnmatch.input          | 549 ++++++++++++++++++++++++++++++-
 posix/tst-regcomp-truncated.c    |   1 +
 posix/tst-regex.c                |  25 +-
 22 files changed, 1414 insertions(+), 259 deletions(-)
 create mode 100644 iconv/tst-iconv9.c
 create mode 100644 locale/C-collate-seq.c
 create mode 100644 localedata/C.UTF-8.in
 create mode 100644 localedata/locales/C

--
2.31.1
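
For readers wondering why code point collation can be served by strcmp: in well-formed UTF-8, byte-wise comparison orders strings the same way as comparing their code points. The sketch below illustrates only that property. It is not taken from the patches and does not touch glibc's locale machinery; utf8_decode, codepoint_cmp, sign, and orders_agree are hypothetical helper names, and the decoder assumes well-formed input.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode one UTF-8 sequence at s into *cp and return the number of bytes
   consumed (0 at the terminating NUL).  Hypothetical helper: it assumes
   well-formed input and performs no validation.  */
size_t
utf8_decode (const unsigned char *s, uint32_t *cp)
{
  if (s[0] == 0)
    return 0;
  if (s[0] < 0x80)
    {
      *cp = s[0];
      return 1;
    }
  if ((s[0] & 0xE0) == 0xC0)
    {
      *cp = ((uint32_t) (s[0] & 0x1F) << 6) | (s[1] & 0x3F);
      return 2;
    }
  if ((s[0] & 0xF0) == 0xE0)
    {
      *cp = ((uint32_t) (s[0] & 0x0F) << 12)
            | ((uint32_t) (s[1] & 0x3F) << 6) | (s[2] & 0x3F);
      return 3;
    }
  *cp = ((uint32_t) (s[0] & 0x07) << 18) | ((uint32_t) (s[1] & 0x3F) << 12)
        | ((uint32_t) (s[2] & 0x3F) << 6) | (s[3] & 0x3F);
  return 4;
}

/* Compare two UTF-8 strings code point by code point.
   Returns -1, 0, or 1.  */
int
codepoint_cmp (const char *a, const char *b)
{
  const unsigned char *pa = (const unsigned char *) a;
  const unsigned char *pb = (const unsigned char *) b;
  for (;;)
    {
      uint32_t ca = 0, cb = 0;
      size_t na = utf8_decode (pa, &ca);
      size_t nb = utf8_decode (pb, &cb);
      if (na == 0 || nb == 0)
        return (na != 0) - (nb != 0);   /* The shorter string sorts first.  */
      if (ca != cb)
        return ca < cb ? -1 : 1;
      pa += na;
      pb += nb;
    }
}

/* Reduce a comparison result to -1, 0, or 1.  */
int
sign (int v)
{
  return (v > 0) - (v < 0);
}

/* Return 1 when strcmp and the code point comparison agree on the
   ordering of a and b, 0 otherwise.  */
int
orders_agree (const char *a, const char *b)
{
  return sign (strcmp (a, b)) == sign (codepoint_cmp (a, b));
}
```

Calling orders_agree on any pair of well-formed UTF-8 strings, for example "z" and "\xC3\xA9" (U+00E9), should return 1, which is the reason a code-point-ordered LC_COLLATE does not need large collation tables.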