From: Jason Merrill
Date: Sun, 18 Jul 2021 15:23:59 -0700
Subject: Re: rust frontend and UTF-8/unicode processing/properties
To: Ian Lance Taylor
Cc: Mark Wielaard, gcc Mailing List, gcc-rust@gcc.gnu.org

On Sun, Jul 18, 2021 at 1:13 PM Ian Lance Taylor via Gcc wrote:

> On Sun, Jul 18, 2021 at 6:23 AM Mark Wielaard wrote:
> >
> > For the gcc rust frontend I was thinking of importing a couple of
> > gnulib modules to help with UTF-8 processing, conversion to/from
> > unicode codepoints and determining various properties of those
> > codepoints. But it seems gcc doesn't yet have any gnulib modules
> > imported, and maybe other frontends already have helpers for this that
> > the gcc rust frontend could reuse.
> >
> > Rust only accepts valid UTF-8 encoded source files, which may or may
> > not start with a UTF-8 BOM character. Whitespace is any codepoint with
> > the Pattern_White_Space property. Identifiers can start with any
> > codepoint with the XID_start property followed by zero or more
> > codepoints with the XID_continue property. It isn't required, but
> > highly desirable to detect confusable identifiers according to
> > tr39/Confusable_Detection.
> >
> > Other names might be constrained to Alphabetic and/or Number categories
> > (Nd, Nl, No), textual types can only contain Unicode Scalar Values
> > (any Unicode codepoint except high and low surrogates),
> > strings in source code can contain unicode escapes (24 bit, up to 6
> > hex digit codepoints) but are internally stored as UTF-8 (and must not
> > encode any surrogates).
> >
> > Do other gcc frontends handle any of the above already in a way that
> > might be reusable for other frontends?
>
> I don't know that this is particularly helpful, but the Go frontend
> has this kind of code in gcc/go/gofrontend/lex.cc. E.g.,
> Lex::fetch_char, Lex::advance_one_utf8_char, unicode_space,
> unicode_digits, unicode_letters, Lex::is_unicode_space, etc. But you
> probably won't be able to use the code directly, and the code in the
> gofrontend directory is also shared with GoLLVM so it can't trivially
> be moved.
>

I believe the UTF-8 handling for the C family front ends is all in libcpp;
I don't think it's factored in a way to be useful to other front ends.

Jason
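
For illustration only, here is a rough C++ sketch of the input checks
listed in the quoted message: skip an optional BOM, accept only valid
UTF-8, and produce only Unicode Scalar Values (no surrogates). It is not
taken from libcpp or from gofrontend/lex.cc; the function name
decode_utf8 is hypothetical, and property lookups such as
Pattern_White_Space or XID_start are deliberately left out since they
need Unicode data tables.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper, not part of GCC: decode a whole buffer into Unicode
// scalar values, or return std::nullopt if the input is not valid UTF-8.
std::optional<std::vector<char32_t>> decode_utf8(std::string_view src)
{
  // Skip an optional UTF-8 BOM (EF BB BF) at the start of the buffer.
  if (src.size() >= 3 && src.substr(0, 3) == "\xEF\xBB\xBF")
    src.remove_prefix(3);

  // Smallest code point that legitimately needs a sequence of each length;
  // anything below it is an overlong encoding.
  static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};

  std::vector<char32_t> out;
  std::size_t i = 0;
  while (i < src.size())
    {
      unsigned char b0 = static_cast<unsigned char>(src[i]);
      std::size_t len;
      char32_t cp;
      if (b0 < 0x80)                { len = 1; cp = b0; }
      else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
      else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
      else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
      else return std::nullopt;     // stray continuation or invalid lead byte

      if (src.size() - i < len)
        return std::nullopt;        // sequence truncated at end of input

      for (std::size_t k = 1; k < len; ++k)
        {
          unsigned char b = static_cast<unsigned char>(src[i + k]);
          if ((b & 0xC0) != 0x80)
            return std::nullopt;    // expected a continuation byte
          cp = (cp << 6) | (b & 0x3F);
        }

      if (cp < min_for_len[len])
        return std::nullopt;        // overlong encoding
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return std::nullopt;        // surrogates are not scalar values
      if (cp > 0x10FFFF)
        return std::nullopt;        // beyond the Unicode range

      out.push_back(cp);
      i += len;
    }
  return out;
}
```

A lexer built on something like this could then classify each decoded
code point (whitespace, identifier start, identifier continue) with
whatever property tables the frontend ends up using.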