From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ed1-x534.google.com (mail-ed1-x534.google.com [IPv6:2a00:1450:4864:20::534]) by sourceware.org (Postfix) with ESMTPS id F31D03858402 for ; Fri, 24 Sep 2021 11:01:54 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org F31D03858402 Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=embecosm.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=embecosm.com Received: by mail-ed1-x534.google.com with SMTP id y89so23817689ede.2 for ; Fri, 24 Sep 2021 04:01:54 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=embecosm.com; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=8VIemgSmIn2dHu1vMNwCBtkN9JnWhwyUgjM6Jej4HR0=; b=SgHcu3NLG5W/NFwYmxsvaXhGqQLyEMSiCEjsVm4TR3kAV1z9ThLOv9KhD+8YzGjtPT 0xKRQJmkSlrVrquW3qLScHV6ZqdPHOtMtp8kaY7g/hsIi4WafjylBra519f+vzOvFMzJ V5PfFzGnGUjeDTf2xNimKamuPwTLlRcyuhjb59Wo+Lmr56XBX/g/X/GlxaU7HG6knCbb d+JhB8EYdeh0wDENXoLn/5gFAnf8iGoPLuysPqyomDzOT6VlYuFXuLE8bpnCpT+0mYTe A+wXSjuxHmwY4pBF/V2qzAkRr1tsMMjHX4cgJyhekgJ4pAZeXtM411xsLo4Oc2IDX6Kp eISg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=8VIemgSmIn2dHu1vMNwCBtkN9JnWhwyUgjM6Jej4HR0=; b=kMCPcIDb50VlPHHkg7VxG/I+7oBitIA5KnngwcVMztG4aDhrHaAeY1bQpOd/q1IDKx HNLQkhE85ScbebyyQcG9cw4FwfqkHXpy01PwSeRaaHCQq1I46j2b3enQCei4UEB8qscL mM0+ElRVKjlt1f2a6+PCWxx+RNfuqopNExW679IyM9AfJYIXxoEk1lV9qdGaP6LDf/0i N9352zjp5lPtDNY0Akaa1x+IfvuU+agP2mhXWgB3QpAsv2yhSpaG+WEXi0BfNTWaaxNB aVbFEgqVmR0Y+1fZz7fKMRqzHkYt37lAoGOTntux/L0J+07aDEVlPL8FCF4BHFFXve/l L+Sw== X-Gm-Message-State: AOAM53283JouFYVr6B4kbGM6cdBHKVM8nzbRR9MlonSA/XTZB92bety3 YMgR9ca8Ws1mlWadWuHmPAvU+MfHQ/B919fk2rz6oQ== X-Google-Smtp-Source: ABdhPJxrwT41U+4zv67L5yUTHZhJrd9QJUH/F5Ny7gZpzIYbnFytmfr/dEqzyxyYL/BT3VYR+iHgsslQ5vIQxOIAe5Q= X-Received: by 2002:aa7:da93:: with SMTP id q19mr4403206eds.206.1632481313948; Fri, 24 Sep 2021 04:01:53 -0700 (PDT) MIME-Version: 1.0 References: <20210921225430.166550-1-mark@klomp.org> <87k0j9ym7r.fsf@euler.schwinge.homeip.net> In-Reply-To: From: Philip Herron Date: Fri, 24 Sep 2021 12:01:42 +0100 Message-ID: Subject: Re: byte/char string representation (Was: [PATCH] Fix byte char and byte string lexing code) To: Mark Wielaard Cc: Arthur Cohen , gcc-rust@gcc.gnu.org, Thomas Schwinge Content-Type: multipart/alternative; boundary="000000000000d2176f05ccbbace9" X-Spam-Status: No, score=-9.2 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, HTML_MESSAGE, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP, WEIRD_PORT autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gcc-rust@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: gcc-rust mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 24 Sep 2021 11:01:57 -0000 --000000000000d2176f05ccbbace9 Content-Type: text/plain; charset="UTF-8" Hi Mark, This is really useful information, will this mean that the lexer token will need to represent strings differently as well? Or is the std::string in the lexer still ok? The change you made above has the problem that reference types like, arrays are forms of what rust calls covariant types since they might contain an inference variable, so they require lookup to determine the base type. Its likely there is a reference cycle here. Though this change will not be correct for type checking purposes. The design of the type system is purely about rust type checking and inferring types. So for example this change will break the case of: ``` let a:str = "test"; ``` Since the TypePath of str can't know the size of the expected array at compilation time. And the error message will end up with something like "expected str got [i8, 4]"; As for the string implementation, I did some experimentation this morning and it looks as though strings in rust are a kind of slice that we still need to support so, for example, you can see in this gdb session the implicit struct for the slice: ``` Temporary breakpoint 1, test::main () at test.rs:8 8 let a = "Hello World %i\n"; (gdb) n 9 let b = a as *const str; (gdb) p a $1 = "Hello World %i\n" (gdb) p b No symbol 'b' in current context (gdb) n 10 let c = b as *const i8; (gdb) p b $2 = *const str {data_ptr: 0x555555588000 "Hello World %i\n\000", length: 15} (gdb) p b.data_ptr $3 = (*mut u8) 0x555555588000 "Hello World %i\n\000" ``` So to me, this is something that we will need to do in the backend with a bunch of implicit code and types, it seems as though for this rust code: ``` unsafe { let a:&str = "Hello World %i\n"; let b = a as *const str; let c = b as *const i8; printf(c, 123); } ``` we would be creating something like: ``` struct Str { data_ptr: i8*; length: usize; } const char *const_string = "hello world %i\n"; Str& a = &Slice{data_ptr: const_string, l5}; Str* b = a; const unsigned char* c = b->data_ptr; ``` I think we should be able to fix this when we get slices working in the compiler. What do you think? --Phil On Thu, 23 Sept 2021 at 21:53, Mark Wielaard wrote: > On Thu, Sep 23, 2021 at 04:10:59PM +0200, Arthur Cohen wrote: > > > Something I was thinking about outside of the scope of that patch was > > about the utf8 how do they get represented? Is it some kind of wchar_t? > > > > Do you mean in C++ or in rustc? In rustc, they are represented as Unicode > > Scalar Values which are 4 bytes wide. > > > > From the docs over here: [ > https://doc.rust-lang.org/std/primitive.char.html] > > > > So I'm assuming they could be represented as `int32_t`s which would also > > make sense for the check > > Yes, for rust characters a 32bit type (I would pick uint32_t or maybe > char32_t) makes sense, since chars in rust are (almost) equal to > unicode code points (technically only 21 bits are used). But not > really, it is a Unicode scalar value, which excludes high-surrogate > and low-surrogate code points and so the only valid values are 0x0 to > 0xD7FF and 0xE000 to 0x10FFFF. > > We should not use the C type wchar_t, because wchar_t is > implementation defined and can be 16 or 32 bits. > > See also https://doc.rust-lang.org/reference/types/textual.html > > But utf8 strings are made up of u8 "utf8 chars". You need 1 to 4 utf8 > chars to encode a code point. https://en.wikipedia.org/wiki/UTF-8 > We can use c++ strings made up of (8 bit) chars for that. > > Our lexer should make sure we only accept valid rust characters or > utf-8 sequences. > > Note that the above doesn't hold for "byte chars" (b'A') or "byte > strings" (b"abc"). Those are really just u8 or [u8] arrays which hold > bytes (0x0 to 0xff). > > We currently get the type for byte strings wrong. We pretend they are > &str, but they really should be &[u8]. > > I tried to fix that with the following: > > diff --git a/gcc/rust/typecheck/rust-hir-type-check-expr.h > b/gcc/rust/typecheck/rust-hir-type-check-expr.h > index fe8973a4d81..b0dd1c3ff2c 100644 > --- a/gcc/rust/typecheck/rust-hir-type-check-expr.h > +++ b/gcc/rust/typecheck/rust-hir-type-check-expr.h > @@ -609,15 +609,42 @@ public: > break; > > case HIR::Literal::LitType::BYTE_STRING: { > - /* We just treat this as a string, but it really is an arraytype > of > - u8. It isn't in UTF-8, but really just a byte array. */ > - TyTy::BaseType *base = nullptr; > - auto ok = context->lookup_builtin ("str", &base); > + /* This is an arraytype of u8 reference (&[u8;size]). It isn't in > + UTF-8, but really just a byte array. Code to construct the > array > + reference copied from ArrayElemsValues and ArrayType. */ > + TyTy::BaseType *u8; > + auto ok = context->lookup_builtin ("u8", &u8); > rust_assert (ok); > > + auto crate_num = mappings->get_current_crate (); > + Analysis::NodeMapping mapping (crate_num, UNKNOWN_NODEID, > + mappings->get_next_hir_id > (crate_num), > + UNKNOWN_LOCAL_DEFID); > + > + /* Capacity is the size of the string (number of chars). > + It is a constant, but for fold it to get a BExpression. */ > + std::string capacity_str = std::to_string (expr.as_string > ().size ()); > + HIR::LiteralExpr literal_capacity (mapping, capacity_str, > + HIR::Literal::LitType::INT, > + > PrimitiveCoreType::CORETYPE_USIZE, > + expr.get_locus ()); > + > + // mark the type for this implicit node > + context->insert_type (mapping, > + new TyTy::USizeType (mapping.get_hirid > ())); > + > + Bexpression *capacity > + = ConstFold::ConstFoldExpr::fold (&literal_capacity); > + > + TyTy::ArrayType *array > + = new TyTy::ArrayType (expr.get_mappings ().get_hirid (), > capacity, > + TyTy::TyVar (u8->get_ref ())); > + > + context->insert_type (expr.get_mappings (), array); > + > infered > = new TyTy::ReferenceType (expr.get_mappings ().get_hirid (), > - TyTy::TyVar (base->get_ref ()), > false); > + TyTy::TyVar (array->get_ref ()), > false); > } > break; > > But that looks more complicated than is probably necessary and it > doesn't work. When the type checker wants to print this type > ReferenceType.as_string () goes into a loop for some reason. > > Can anybody see what is wrong with the above? > > Cheers, > > Mark > > --000000000000d2176f05ccbbace9 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable

Hi Mark,
This is really useful information, will this mean that the lex= er token will need to represent strings differently as well? Or is the std:= :string in the lexer still ok?

The change you made= above has the problem that reference types like, arrays are forms of what = rust calls covariant types since they might contain an inference variable, = so they require lookup to determine the base type. Its likely there is a re= ference cycle here. Though this change will not be correct for type checkin= g purposes. The design of the type system is purely about rust type checkin= g and inferring types. So for example this change will break the case of:

```
=C2=A0 let a:str =3D "test";```

Since the TypePath of str can't know= the size of the expected array at compilation time. And the error message = will end up with something like "expected str got [i8, 4]"; As fo= r the string implementation, I did some experimentation this morning and it= looks as though strings in rust are a kind of slice that we still need to = support so, for example, you can see in this gdb session the implicit struc= t for the slice:

```
Temporary breakpoint 1, test::main () a= t test.rs:8
8 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 let a =3D "Hello World %i\n";
(gdb) n=
9 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 let b =3D a as *cons= t str;
(gdb) p a
$1 =3D "Hello World %i\n"
(gdb) p b
= No symbol 'b' in current context
(gdb) n
10 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0let c =3D b as *const i8;
(gdb) p b
$2= =3D *const str {data_ptr: 0x555555588000 "Hello World %i\n\000",= length: 15}
(gdb) p b.data_ptr
$3 =3D (*mut u8) 0x555555588000 "= ;Hello World %i\n\000"
```

So to me, this is some= thing that we will need to do in the backend with a bunch of implicit code = and types, it seems as though for this rust code:

```
unsafe {=C2=A0 =C2=A0 =C2=A0 =C2=A0 let a:&str =3D "Hello World %i\n"= ;;
=C2=A0 =C2=A0 =C2=A0 =C2=A0 let b =3D a as *const str;
=C2=A0 =C2= =A0 =C2=A0 =C2=A0 let c =3D b as *const i8;

=C2=A0 =C2=A0 =C2=A0 =C2= =A0 printf(c, 123);
=C2=A0 =C2=A0 }
```

we would be creating = something like:

```
struct Str {
= =C2=A0 data_ptr: i8*;
=C2=A0 length: usize;
}
=

const char *const_string =3D=C2=A0 "hello world %i= \n";
Str& a =3D &Slice{data_ptr: const_string, l= 5};
Str* b =3D a;
const unsigned char* c =3D b->data= _ptr;
```

I think we should be able = to fix this when we get slices working in the compiler.

W= hat do you think?

--Phil

=
On Thu, 23= Sept 2021 at 21:53, Mark Wielaard <ma= rk@klomp.org> wrote:
On Thu, Sep 23, 2021 at 04:10:59PM +0200, Arthur Cohen wrote: > > Something I was thinking about outside of the scope of that patch= was
> about the utf8 how do they get represented? Is it some kind of wchar_t= ?
>
> Do you mean in C++ or in rustc? In rustc, they are represented as Unic= ode
> Scalar Values which are 4 bytes wide.
>
> From the docs over here: [https://doc.rust-lan= g.org/std/primitive.char.html]
>
> So I'm assuming they could be represented as `int32_t`s which woul= d also
> make sense for the check

Yes, for rust characters a 32bit type (I would pick uint32_t or maybe
char32_t) makes sense, since chars in rust are (almost) equal to
unicode code points (technically only 21 bits are used). But not
really, it is a Unicode scalar value, which excludes high-surrogate
and low-surrogate code points and so the only valid values are 0x0 to
0xD7FF and 0xE000 to 0x10FFFF.

We should not use the C type wchar_t, because wchar_t is
implementation defined and can be 16 or 32 bits.

See also https://doc.rust-lang.org/reference/t= ypes/textual.html

But utf8 strings are made up of u8 "utf8 chars". You need 1 to 4 = utf8
chars to encode a code point. https://en.wikipedia.org/wiki/UTF-8=
We can use c++ strings made up of (8 bit) chars for that.

Our lexer should make sure we only accept valid rust characters or
utf-8 sequences.

Note that the above doesn't hold for "byte chars" (b'A= 9;) or "byte
strings" (b"abc"). Those are really just u8 or [u8] arrays w= hich hold
bytes (0x0 to 0xff).

We currently get the type for byte strings wrong. We pretend they are
&str, but they really should be &[u8].

I tried to fix that with the following:

diff --git a/gcc/rust/typecheck/rust-hir-type-check-expr.h b/gcc/rust/typec= heck/rust-hir-type-check-expr.h
index fe8973a4d81..b0dd1c3ff2c 100644
--- a/gcc/rust/typecheck/rust-hir-type-check-expr.h
+++ b/gcc/rust/typecheck/rust-hir-type-check-expr.h
@@ -609,15 +609,42 @@ public:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 break;

=C2=A0 =C2=A0 =C2=A0 =C2=A0 case HIR::Literal::LitType::BYTE_STRING: {
-=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0/* We just treat this as a string, but i= t really is an arraytype of
-=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 u8. It isn't in UTF-8, but r= eally just a byte array.=C2=A0 */
-=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0TyTy::BaseType *base =3D nullptr;
-=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0auto ok =3D context->lookup_builtin (= "str", &base);
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0/* This is an arraytype of u8 reference = (&[u8;size]). It isn't in
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 UTF-8, but really just a byte ar= ray. Code to construct the array
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 reference copied from ArrayElems= Values and ArrayType. */
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0TyTy::BaseType *u8;
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0auto ok =3D context->lookup_builtin (= "u8", &u8);
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 rust_assert (ok);

+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0auto crate_num =3D mappings->get_curr= ent_crate ();
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Analysis::NodeMapping mapping (crate_num= , UNKNOWN_NODEID,
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 mappings= ->get_next_hir_id (crate_num),
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 UNKNOWN_= LOCAL_DEFID);
+
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0/* Capacity is the size of the string (n= umber of chars).
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 It is a constant, but for fold i= t to get a BExpression.=C2=A0 */
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0std::string capacity_str =3D std::to_str= ing (expr.as_string ().size ());
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0HIR::LiteralExpr literal_capacity (mappi= ng, capacity_str,
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 HIR::Literal::LitType::INT,
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 PrimitiveCoreType::CORETYPE_USIZE,
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 expr.get_locus ());
+
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0// mark the type for this implicit node<= br> +=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0context->insert_type (mapping,
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0new TyTy::USizeType (mapping.get_hiri= d ()));
+
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Bexpression *capacity
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=3D ConstFold::ConstFoldExpr::fol= d (&literal_capacity);
+
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0TyTy::ArrayType *array
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=3D new TyTy::ArrayType (expr.get= _mappings ().get_hirid (), capacity,
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 TyTy::TyVar (u8->get_ref (= )));
+
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0context->insert_type (expr.get_mappin= gs (), array);
+
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 infered
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =3D new TyTy::ReferenceType (expr= .get_mappings ().get_hirid (),
-=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 TyTy::TyVar (ba= se->get_ref ()), false);
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 TyTy::TyVar (ar= ray->get_ref ()), false);
=C2=A0 =C2=A0 =C2=A0 =C2=A0 }
=C2=A0 =C2=A0 =C2=A0 =C2=A0 break;

But that looks more complicated than is probably necessary and it
doesn't work. When the type checker wants to print this type
ReferenceType.as_string () goes into a loop for some reason.

Can anybody see what is wrong with the above?

Cheers,

Mark

--000000000000d2176f05ccbbace9--