From: Ian Lance Taylor
Date: Sun, 18 Jul 2021 13:12:01 -0700
Subject: Re: rust frontend and UTF-8/unicode processing/properties
To: Mark Wielaard
Cc: gcc@gcc.gnu.org, gcc-rust@gcc.gnu.org

On Sun, Jul 18, 2021 at 6:23 AM Mark Wielaard wrote:
>
> For the gcc rust frontend I was thinking of importing a couple of
> gnulib modules to help with UTF-8 processing, conversion to/from
> unicode codepoints and determining various properties of those
> codepoints. But it seems gcc doesn't yet have any gnulib modules
> imported, and maybe other frontends already have helpers for this that
> the gcc rust frontend could reuse.
>
> Rust only accepts valid UTF-8 encoded source files, which may or may
> not start with a UTF-8 BOM character. Whitespace is any codepoint with
> the Pattern_White_Space property. Identifiers can start with any
> codepoint with the XID_Start property plus zero or more codepoints with
> the XID_Continue property. It isn't required, but highly desirable to
> detect confusable identifiers according to tr39/Confusable_Detection.
>
> Other names might be constrained to Alphabetic and/or Number categories
> (Nd, Nl, No), textual types can only contain Unicode Scalar Values
> (any Unicode codepoint except high and low surrogates),
> strings in source code can contain unicode escapes (24 bit, up to 6
> hex digits per codepoint) but are internally stored as UTF-8 (and must
> not encode any surrogates).
>
> Do other gcc frontends handle any of the above already in a way that
> might be reusable for other frontends?

I don't know that this is particularly helpful, but the Go frontend
has this kind of code in gcc/go/gofrontend/lex.cc.  E.g.,
Lex::fetch_char, Lex::advance_one_utf8_char, unicode_space,
unicode_digits, unicode_letters, Lex::is_unicode_space, etc.
But you probably won't be able to use the code directly, and the code
in the gofrontend directory is also shared with GoLLVM, so it can't
trivially be moved.

Ian
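
For illustration only, here is a rough sketch of the decoding step
discussed above: read one UTF-8 sequence at a time, reject anything
that is not a Unicode Scalar Value, and skip an optional leading BOM.
It is written in Python to keep it short. It is not taken from lex.cc
or from gnulib, and the names decode_one, decode_source and
is_scalar_value are made up for this example. Property lookups such as
Pattern_White_Space or XID_Start would need generated tables and are
not covered here.

```python
from typing import List, Tuple

BOM = 0xFEFF  # U+FEFF, encoded in UTF-8 as EF BB BF


def is_scalar_value(cp: int) -> bool:
    """True for any code point except the surrogates U+D800..U+DFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def decode_one(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode one UTF-8 sequence starting at data[pos].

    Returns (code point, position of the next sequence). Raises
    ValueError on truncated, overlong, surrogate or out-of-range input.
    """
    if pos >= len(data):
        raise ValueError("unexpected end of input")
    b0 = data[pos]
    if b0 < 0x80:
        return b0, pos + 1
    # The lead byte gives the sequence length; 'minimum' is the smallest
    # code point that legitimately needs that many bytes.
    if 0xC2 <= b0 <= 0xDF:
        length, cp, minimum = 2, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 <= 0xEF:
        length, cp, minimum = 3, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 <= 0xF4:
        length, cp, minimum = 4, b0 & 0x07, 0x10000
    else:
        raise ValueError("invalid lead byte 0x%02x at offset %d" % (b0, pos))
    if pos + length > len(data):
        raise ValueError("truncated sequence at offset %d" % pos)
    for b in data[pos + 1:pos + length]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte at offset %d" % pos)
        cp = (cp << 6) | (b & 0x3F)
    if cp < minimum or not is_scalar_value(cp):
        raise ValueError("overlong or non-scalar sequence at offset %d" % pos)
    return cp, pos + length


def decode_source(data: bytes) -> List[int]:
    """Decode a whole source buffer, dropping one leading BOM if present."""
    out: List[int] = []
    pos = 0
    while pos < len(data):
        cp, pos = decode_one(data, pos)
        out.append(cp)
    if out and out[0] == BOM:
        del out[0]
    return out
```

A lexer would presumably call something like decode_one in a loop and
classify each code point as it goes, rather than building a full list
as decode_source does; the list form just makes the sketch easy to
try out.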