Date: Thu, 23 Sep 2021 22:53:30 +0200
From: Mark Wielaard
To: Arthur Cohen
Cc: Philip Herron, gcc-rust@gcc.gnu.org, Thomas Schwinge
Subject: byte/char string representation (Was: [PATCH] Fix byte char and byte string lexing code)

On Thu, Sep 23, 2021 at 04:10:59PM +0200, Arthur Cohen wrote:
> > Something I was thinking about outside of the scope of that patch was
> > about the utf8 how do they get represented? Is it some kind of wchar_t?
>
> Do you mean in C++ or in rustc? In rustc, they are represented as Unicode
> Scalar Values which are 4 bytes wide.
>
> From the docs over here: [https://doc.rust-lang.org/std/primitive.char.html]
>
> So I'm assuming they could be represented as `int32_t`s which would also
> make sense for the check

Yes, for Rust characters a 32-bit type (I would pick uint32_t or maybe
char32_t) makes sense, since chars in Rust are (almost) equal to Unicode
code points (technically only 21 bits are used). Strictly speaking a char
is a Unicode scalar value, which excludes the high-surrogate and
low-surrogate code points, so the only valid values are 0x0 to 0xD7FF
and 0xE000 to 0x10FFFF. We should not use the C type wchar_t, because
wchar_t is implementation-defined and can be 16 or 32 bits wide.
See also https://doc.rust-lang.org/reference/types/textual.html

But UTF-8 strings are made up of u8 code units ("utf8 chars"). You need
1 to 4 of them to encode a code point.
https://en.wikipedia.org/wiki/UTF-8
We can use C++ strings made up of (8-bit) chars for that.

Our lexer should make sure we only accept valid Rust characters or
UTF-8 sequences.

Note that the above doesn't hold for "byte chars" (b'A') or "byte
strings" (b"abc"). Those are really just u8 values or [u8] arrays which
hold bytes (0x0 to 0xff).

We currently get the type for byte strings wrong. We pretend they are
&str, but they really should be a reference to a u8 array (&[u8; N]).
I tried to fix that with the following:

diff --git a/gcc/rust/typecheck/rust-hir-type-check-expr.h b/gcc/rust/typecheck/rust-hir-type-check-expr.h
index fe8973a4d81..b0dd1c3ff2c 100644
--- a/gcc/rust/typecheck/rust-hir-type-check-expr.h
+++ b/gcc/rust/typecheck/rust-hir-type-check-expr.h
@@ -609,15 +609,42 @@ public:
        break;
 
        case HIR::Literal::LitType::BYTE_STRING: {
-         /* We just treat this as a string, but it really is an arraytype of
-            u8. It isn't in UTF-8, but really just a byte array. */
-         TyTy::BaseType *base = nullptr;
-         auto ok = context->lookup_builtin ("str", &base);
+         /* This is an arraytype of u8 reference (&[u8;size]). It isn't in
+            UTF-8, but really just a byte array. Code to construct the array
+            reference copied from ArrayElemsValues and ArrayType. */
+         TyTy::BaseType *u8;
+         auto ok = context->lookup_builtin ("u8", &u8);
          rust_assert (ok);
 
+         auto crate_num = mappings->get_current_crate ();
+         Analysis::NodeMapping mapping (crate_num, UNKNOWN_NODEID,
+                                        mappings->get_next_hir_id (crate_num),
+                                        UNKNOWN_LOCAL_DEFID);
+
+         /* Capacity is the size of the string (number of chars).
+            It is a constant, but we fold it to get a Bexpression. */
+         std::string capacity_str = std::to_string (expr.as_string ().size ());
+         HIR::LiteralExpr literal_capacity (mapping, capacity_str,
+                                            HIR::Literal::LitType::INT,
+                                            PrimitiveCoreType::CORETYPE_USIZE,
+                                            expr.get_locus ());
+
+         // mark the type for this implicit node
+         context->insert_type (mapping,
+                               new TyTy::USizeType (mapping.get_hirid ()));
+
+         Bexpression *capacity
+           = ConstFold::ConstFoldExpr::fold (&literal_capacity);
+
+         TyTy::ArrayType *array
+           = new TyTy::ArrayType (expr.get_mappings ().get_hirid (), capacity,
+                                  TyTy::TyVar (u8->get_ref ()));
+
+         context->insert_type (expr.get_mappings (), array);
+
          infered
            = new TyTy::ReferenceType (expr.get_mappings ().get_hirid (),
-                                      TyTy::TyVar (base->get_ref ()), false);
+                                      TyTy::TyVar (array->get_ref ()), false);
        }
        break;

But that looks more complicated than is probably necessary, and it
doesn't work. When the type checker wants to print this type,
ReferenceType.as_string () goes into a loop for some reason.

Can anybody see what is wrong with the above?

Cheers,

Mark
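
Addendum: the scalar value range and the 1-to-4 byte UTF-8 encoding
described above could be checked roughly as follows. This is only a
minimal sketch in plain C++; the names is_unicode_scalar_value and
utf8_encoded_length are hypothetical and are not existing gccrs
functions, and how the lexer would actually call such helpers is left
open.

```cpp
#include <cstdint>

// Hypothetical helper (not part of gccrs): returns true if cp is a
// Unicode scalar value, i.e. a code point in 0x0..0x10FFFF that is
// not in the surrogate range 0xD800..0xDFFF.
inline bool
is_unicode_scalar_value (std::uint32_t cp)
{
  return cp <= 0xD7FF || (cp >= 0xE000 && cp <= 0x10FFFF);
}

// Hypothetical helper: number of u8 code units needed to encode cp
// in UTF-8, or 0 if cp is not a valid scalar value.
inline int
utf8_encoded_length (std::uint32_t cp)
{
  if (!is_unicode_scalar_value (cp))
    return 0;
  if (cp <= 0x7F)
    return 1; // ASCII
  if (cp <= 0x7FF)
    return 2;
  if (cp <= 0xFFFF)
    return 3;
  return 4; // 0x10000..0x10FFFF
}
```

A uint32_t is used here rather than wchar_t for the reason given above:
its width is fixed, so the range checks behave the same on every host.
Byte chars and byte strings would not go through these checks at all,
since they hold arbitrary bytes rather than scalar values.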