Hi Mark,

This is really useful information. Will this mean that the lexer token will need to represent strings differently as well, or is the std::string in the lexer still OK?

The change you made above has a problem: reference types such as arrays are forms of what Rust calls covariant types, since they might contain an inference variable, so they require a lookup to determine the base type. It's likely there is a reference cycle here, and the change will not be correct for type-checking purposes. The design of the type system is purely about Rust type checking and type inference. So, for example, this change will break the case of:

```
let a: str = "test";
```

since the TypePath of str can't know the size of the expected array at compilation time, and the error message will end up as something like "expected str got [i8, 4]".

As for the string implementation, I did some experimentation this morning and it looks as though strings in Rust are a kind of slice that we still need to support. For example, you can see the implicit struct for the slice in this gdb session:

```
Temporary breakpoint 1, test::main () at test.rs:8
8         let a = "Hello World %i\n";
(gdb) n
9         let b = a as *const str;
(gdb) p a
$1 = "Hello World %i\n"
(gdb) p b
No symbol 'b' in current context
(gdb) n
10        let c = b as *const i8;
(gdb) p b
$2 = *const str {data_ptr: 0x555555588000 "Hello World %i\n\000", length: 15}
(gdb) p b.data_ptr
$3 = (*mut u8) 0x555555588000 "Hello World %i\n\000"
```

So to me, this is something that we will need to do in the backend with a bunch of implicit code and types. It seems as though for this Rust code:

```
unsafe {
    let a: &str = "Hello World %i\n";
    let b = a as *const str;
    let c = b as *const i8;
    printf(c, 123);
}
```

we would be creating something like:

```
struct Str {
    data_ptr: i8*;
    length: usize;
}

const char *const_string = "hello world %i\n";
Str& a = &Str{data_ptr: const_string, length: 15};
Str* b = a;
const unsigned char* c = b->data_ptr;
```

I think we
should be able to fix this when we get slices working in the compiler. What do you think?

--Phil

On Thu, 23 Sept 2021 at 21:53, Mark Wielaard wrote:
> On Thu, Sep 23, 2021 at 04:10:59PM +0200, Arthur Cohen wrote:
> > > Something I was thinking about outside of the scope of that patch was
> > > about the utf8 how do they get represented? Is it some kind of wchar_t?
> >
> > Do you mean in C++ or in rustc? In rustc, they are represented as Unicode
> > Scalar Values which are 4 bytes wide.
> >
> > From the docs over here: [https://doc.rust-lang.org/std/primitive.char.html]
> >
> > So I'm assuming they could be represented as `int32_t`s which would also
> > make sense for the check
>
> Yes, for rust characters a 32bit type (I would pick uint32_t or maybe
> char32_t) makes sense, since chars in rust are (almost) equal to
> unicode code points (technically only 21 bits are used). But not
> really: it is a Unicode scalar value, which excludes high-surrogate
> and low-surrogate code points, so the only valid values are 0x0 to
> 0xD7FF and 0xE000 to 0x10FFFF.
>
> We should not use the C type wchar_t, because wchar_t is
> implementation defined and can be 16 or 32 bits.
>
> See also https://doc.rust-lang.org/reference/types/textual.html
>
> But utf8 strings are made up of u8 "utf8 chars". You need 1 to 4 utf8
> chars to encode a code point. https://en.wikipedia.org/wiki/UTF-8
> We can use C++ strings made up of (8 bit) chars for that.
>
> Our lexer should make sure we only accept valid rust characters or
> utf-8 sequences.
>
> Note that the above doesn't hold for "byte chars" (b'A') or "byte
> strings" (b"abc"). Those are really just u8 or [u8] arrays which hold
> bytes (0x0 to 0xff).
>
> We currently get the type for byte strings wrong. We pretend they are
> &str, but they really should be &[u8].
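As a quick sanity check of both of those points against rustc itself (an illustrative snippet only, nothing gccrs-specific):

```rust
fn main() {
    // A byte-string literal is a reference to a fixed-size u8 array,
    // &[u8; N] -- not a &str.
    let b: &[u8; 3] = b"abc";
    assert_eq!(b, &[0x61u8, 0x62, 0x63]);

    // A char is a 32-bit Unicode scalar value: the surrogate range
    // 0xD800..=0xDFFF is excluded and 0x10FFFF is the maximum.
    assert_eq!(std::mem::size_of::<char>(), 4);
    assert!(char::from_u32(0xD7FF).is_some());
    assert!(char::from_u32(0xD800).is_none());
    assert!(char::from_u32(0xDFFF).is_none());
    assert!(char::from_u32(0x10FFFF).is_some());
    assert!(char::from_u32(0x110000).is_none());
}
```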
>
> I tried to fix that with the following:
>
> diff --git a/gcc/rust/typecheck/rust-hir-type-check-expr.h b/gcc/rust/typecheck/rust-hir-type-check-expr.h
> index fe8973a4d81..b0dd1c3ff2c 100644
> --- a/gcc/rust/typecheck/rust-hir-type-check-expr.h
> +++ b/gcc/rust/typecheck/rust-hir-type-check-expr.h
> @@ -609,15 +609,42 @@ public:
>        break;
>
>        case HIR::Literal::LitType::BYTE_STRING: {
> -       /* We just treat this as a string, but it really is an arraytype of
> -          u8. It isn't in UTF-8, but really just a byte array. */
> -       TyTy::BaseType *base = nullptr;
> -       auto ok = context->lookup_builtin ("str", &base);
> +       /* This is an arraytype of u8 reference (&[u8;size]). It isn't in
> +          UTF-8, but really just a byte array. Code to construct the array
> +          reference copied from ArrayElemsValues and ArrayType. */
> +       TyTy::BaseType *u8;
> +       auto ok = context->lookup_builtin ("u8", &u8);
>         rust_assert (ok);
>
> +       auto crate_num = mappings->get_current_crate ();
> +       Analysis::NodeMapping mapping (crate_num, UNKNOWN_NODEID,
> +                                      mappings->get_next_hir_id (crate_num),
> +                                      UNKNOWN_LOCAL_DEFID);
> +
> +       /* Capacity is the size of the string (number of chars).
> +          It is a constant, but we fold it to get a Bexpression. */
> +       std::string capacity_str = std::to_string (expr.as_string ().size ());
> +       HIR::LiteralExpr literal_capacity (mapping, capacity_str,
> +                                          HIR::Literal::LitType::INT,
> +                                          PrimitiveCoreType::CORETYPE_USIZE,
> +                                          expr.get_locus ());
> +
> +       // mark the type for this implicit node
> +       context->insert_type (mapping,
> +                             new TyTy::USizeType (mapping.get_hirid ()));
> +
> +       Bexpression *capacity
> +         = ConstFold::ConstFoldExpr::fold (&literal_capacity);
> +
> +       TyTy::ArrayType *array
> +         = new TyTy::ArrayType (expr.get_mappings ().get_hirid (), capacity,
> +                                TyTy::TyVar (u8->get_ref ()));
> +
> +       context->insert_type (expr.get_mappings (), array);
> +
>         infered
>           = new TyTy::ReferenceType (expr.get_mappings ().get_hirid (),
> -                                    TyTy::TyVar (base->get_ref ()), false);
> +                                    TyTy::TyVar (array->get_ref ()), false);
>       }
>       break;
>
> But that looks more complicated than is probably necessary and it
> doesn't work. When the type checker wants to print this type
> ReferenceType.as_string () goes into a loop for some reason.
>
> Can anybody see what is wrong with the above?
>
> Cheers,
>
> Mark
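
For what it's worth, the fat-pointer layout from the gdb session earlier in the thread can also be observed from rustc directly (an illustrative sketch, not a suggested implementation):

```rust
use std::mem::size_of;

fn main() {
    let a: &str = "Hello World %i\n";
    // &str is a fat pointer {data_ptr, length}: twice the size of a
    // thin pointer.
    assert_eq!(size_of::<&str>(), 2 * size_of::<*const u8>());
    assert_eq!(a.len(), 15);

    // Casting the fat raw pointer to a thin pointer discards the
    // length and keeps just data_ptr, as in the gdb session.
    let b = a as *const str;
    let c = b as *const u8;
    assert_eq!(unsafe { *c }, b'H');
}
```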