* [PATCH] Fix byte char and byte string lexing code @ 2021-09-21 22:54 Mark Wielaard 2021-09-22 9:48 ` Thomas Schwinge 0 siblings, 1 reply; 10+ messages in thread From: Mark Wielaard @ 2021-09-21 22:54 UTC (permalink / raw) To: gcc-rust; +Cc: Mark Wielaard There were two warnings in lexer parse_byte_char and parse_byte_string code for arches with signed chars: rust-lex.cc: In member function ‘Rust::TokenPtr Rust::Lexer::parse_byte_char(Location)’: rust-lex.cc:1564:21: warning: comparison is always false due to limited range of data type [-Wtype-limits] 1564 | if (byte_char > 127) | ~~~~~~~~~~^~~~~ rust-lex.cc: In member function ‘Rust::TokenPtr Rust::Lexer::parse_byte_string(Location)’: rust-lex.cc:1639:27: warning: comparison is always false due to limited range of data type [-Wtype-limits] 1639 | if (output_char > 127) | ~~~~~~~~~~~~^~~~~ The fix would be to cast to an unsigned char before the comparison. But that is actually wrong, and would produce the following errors parsing a byte char or byte string: bytecharstring.rs:3:14: error: ‘byte char’ ‘�’ out of range 3 | let _bc = b'\x80'; | ^ bytecharstring.rs:4:14: error: character ‘�’ in byte string out of range 4 | let _bs = b"foo\x80bar"; | ^ Both byte chars and byte strings may contain up to \xFF (255) characters. It is utf-8 chars or strings that can only Remove the faulty check and add a new testcase bytecharstring.rs that checks byte chars and strings do accept > 127 hex char escapes, but utf-8 chars and strings reject such hex char escapes. --- https://code.wildebeest.org/git/user/mjw/gccrs/commit/?h=bytecharstring gcc/rust/lex/rust-lex.cc | 15 --------------- gcc/testsuite/rust/compile/bytecharstring.rs | 8 ++++++++ 2 files changed, 8 insertions(+), 15 deletions(-) create mode 100644 gcc/testsuite/rust/compile/bytecharstring.rs diff --git a/gcc/rust/lex/rust-lex.cc b/gcc/rust/lex/rust-lex.cc index 49b6b6d32a7..b70877be9ff 100644 --- a/gcc/rust/lex/rust-lex.cc +++ b/gcc/rust/lex/rust-lex.cc @@ -1559,13 +1559,6 @@ Lexer::parse_byte_char (Location loc) byte_char = std::get<0> (escape_length_pair); length += std::get<1> (escape_length_pair); - if (byte_char > 127) - { - rust_error_at (get_current_location (), - "%<byte char%> %<%c%> out of range", byte_char); - byte_char = 0; - } - current_char = peek_input (); if (current_char != '\'') @@ -1634,14 +1627,6 @@ Lexer::parse_byte_string (Location loc) else length += std::get<1> (escape_length_pair); - if (output_char > 127) - { - rust_error_at (get_current_location (), - "character %<%c%> in byte string out of range", - output_char); - output_char = 0; - } - if (output_char != 0) str += output_char; diff --git a/gcc/testsuite/rust/compile/bytecharstring.rs b/gcc/testsuite/rust/compile/bytecharstring.rs new file mode 100644 index 00000000000..9242e2c5a0b --- /dev/null +++ b/gcc/testsuite/rust/compile/bytecharstring.rs @@ -0,0 +1,8 @@ +fn main () +{ + let _bc = b'\x80'; + let _bs = b"foo\x80bar"; + + let _c = '\xef'; // { dg-error "out of range" } + let _s = "Foo\xEFBar"; // { dg-error "out of range" } +} -- 2.32.0 ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] Fix byte char and byte string lexing code 2021-09-21 22:54 [PATCH] Fix byte char and byte string lexing code Mark Wielaard @ 2021-09-22 9:48 ` Thomas Schwinge 2021-09-22 20:37 ` Mark Wielaard 0 siblings, 1 reply; 10+ messages in thread From: Thomas Schwinge @ 2021-09-22 9:48 UTC (permalink / raw) To: Mark Wielaard; +Cc: gcc-rust Hi Mark! On 2021-09-22T00:54:30+0200, Mark Wielaard <mark@klomp.org> wrote: > There were two warnings in lexer parse_byte_char and parse_byte_string > code for arches with signed chars: > > rust-lex.cc: In member function > ‘Rust::TokenPtr Rust::Lexer::parse_byte_char(Location)’: > rust-lex.cc:1564:21: warning: comparison is always false due to limited > range of data type [-Wtype-limits] > 1564 | if (byte_char > 127) > | ~~~~~~~~~~^~~~~ That's <https://github.com/Rust-GCC/gccrs/pull/343>. > rust-lex.cc: In member function > ‘Rust::TokenPtr Rust::Lexer::parse_byte_string(Location)’: > rust-lex.cc:1639:27: warning: comparison is always false due to limited > range of data type [-Wtype-limits] > 1639 | if (output_char > 127) > | ~~~~~~~~~~~~^~~~~ That's <https://github.com/Rust-GCC/gccrs/pull/344>. Both these related to <https://github.com/Rust-GCC/gccrs/issues/336> "GCC '--enable-bootstrap' build". > The fix would be to cast to an unsigned char before the comparison. > But that is actually wrong, and would produce the following errors > parsing a byte char or byte string: > > bytecharstring.rs:3:14: error: ‘byte char’ ‘�’ out of range > 3 | let _bc = b'\x80'; > | ^ > bytecharstring.rs:4:14: error: character ‘�’ in byte string out of range > 4 | let _bs = b"foo\x80bar"; > | ^ > > Both byte chars and byte strings may contain up to \xFF (255) > characters. It is utf-8 chars or strings that can only [truncated here -- but I understand what you mean] I think this does match my thoughts in <https://github.com/Rust-GCC/gccrs/pull/343#issuecomment-816214689>. > Remove the faulty check and add a new testcase bytecharstring.rs > that checks byte chars and strings do accept > 127 hex char > escapes, but utf-8 chars and strings reject such hex char escapes. > --- > > https://code.wildebeest.org/git/user/mjw/gccrs/commit/?h=bytecharstring Thanks, that's now: <https://github.com/Rust-GCC/gccrs/pull/687> "Fix byte char and byte string lexing code". Grüße Thomas ----------------- Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955 ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] Fix byte char and byte string lexing code 2021-09-22 9:48 ` Thomas Schwinge @ 2021-09-22 20:37 ` Mark Wielaard 2021-09-23 11:43 ` Philip Herron 0 siblings, 1 reply; 10+ messages in thread From: Mark Wielaard @ 2021-09-22 20:37 UTC (permalink / raw) To: Thomas Schwinge; +Cc: gcc-rust Hi Thomas, On Wed, Sep 22, 2021 at 11:48:56AM +0200, Thomas Schwinge wrote: > That's <https://github.com/Rust-GCC/gccrs/pull/343>. > [...] > That's <https://github.com/Rust-GCC/gccrs/pull/344>. Ah, sorry, I don't really track the github issues and had missed those. But good to see the analysis matches. > Both these related to <https://github.com/Rust-GCC/gccrs/issues/336> > "GCC '--enable-bootstrap' build". To make --enable-bootstrap possible (wow, that takes a long time...) you'll also need: https://gcc.gnu.org/pipermail/gcc-rust/2021-September/000178.html https://code.wildebeest.org/git/user/mjw/gccrs/commit/?h=rust-mangle-unreachable With this and that patch applied there are no more warnings building the rust frontend, so a --enable-bootstrap (-Werror) build completes successfully. > > https://code.wildebeest.org/git/user/mjw/gccrs/commit/?h=bytecharstring > > Thanks, that's now: <https://github.com/Rust-GCC/gccrs/pull/687> > "Fix byte char and byte string lexing code". Thanks, Mark ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] Fix byte char and byte string lexing code 2021-09-22 20:37 ` Mark Wielaard @ 2021-09-23 11:43 ` Philip Herron 2021-09-23 14:10 ` Arthur Cohen 0 siblings, 1 reply; 10+ messages in thread From: Philip Herron @ 2021-09-23 11:43 UTC (permalink / raw) To: Mark Wielaard; +Cc: Thomas Schwinge, gcc-rust [-- Attachment #1: Type: text/plain, Size: 1621 bytes --] Hi guys, Thanks, Thomas for raising the PR it is currently being merged. We will raise the next PR to get bootstrapable builds working which will be really nice there is a GitHub automation for that at the moment. Something I was thinking about outside of the scope of that patch was about the utf8 how do they get represented? Is it some kind of wchar_t? Thanks again --Phil On Wed, 22 Sept 2021 at 21:37, Mark Wielaard <mark@klomp.org> wrote: > Hi Thomas, > > On Wed, Sep 22, 2021 at 11:48:56AM +0200, Thomas Schwinge wrote: > > That's <https://github.com/Rust-GCC/gccrs/pull/343>. > > [...] > > That's <https://github.com/Rust-GCC/gccrs/pull/344>. > > Ah, sorry, I don't really track the github issues and had missed > those. But good to see the analysis matches. > > > Both these related to <https://github.com/Rust-GCC/gccrs/issues/336> > > "GCC '--enable-bootstrap' build". > > To make --enable-bootstrap possible (wow, that takes a long time...) > you'll also need: > https://gcc.gnu.org/pipermail/gcc-rust/2021-September/000178.html > > https://code.wildebeest.org/git/user/mjw/gccrs/commit/?h=rust-mangle-unreachable > > With this and that patch applied there are no more warnings building > the rust frontend, so a --enable-bootstrap (-Werror) build completes > successfully. > > > > > https://code.wildebeest.org/git/user/mjw/gccrs/commit/?h=bytecharstring > > > > Thanks, that's now: <https://github.com/Rust-GCC/gccrs/pull/687> > > "Fix byte char and byte string lexing code". > > Thanks, > > Mark > > -- > Gcc-rust mailing list > Gcc-rust@gcc.gnu.org > https://gcc.gnu.org/mailman/listinfo/gcc-rust > [-- Attachment #2: Type: text/html, Size: 3085 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] Fix byte char and byte string lexing code 2021-09-23 11:43 ` Philip Herron @ 2021-09-23 14:10 ` Arthur Cohen 2021-09-23 20:53 ` byte/char string representation (Was: [PATCH] Fix byte char and byte string lexing code) Mark Wielaard 0 siblings, 1 reply; 10+ messages in thread From: Arthur Cohen @ 2021-09-23 14:10 UTC (permalink / raw) To: Philip Herron; +Cc: Mark Wielaard, gcc-rust, Thomas Schwinge [-- Attachment #1: Type: text/plain, Size: 2499 bytes --] > Something I was thinking about outside of the scope of that patch was about the utf8 how do they get represented? Is it some kind of wchar_t? Do you mean in C++ or in rustc? In rustc, they are represented as Unicode Scalar Values which are 4 bytes wide. From the docs over here: [https://doc.rust-lang.org/std/primitive.char.html] So I'm assuming they could be represented as `int32_t`s which would also make sense for the check Cheers, Arthur Le jeu. 23 sept. 2021 à 13:43, Philip Herron <philip.herron@embecosm.com> a écrit : > Hi guys, > > Thanks, Thomas for raising the PR it is currently being merged. We will > raise the next PR to get bootstrapable builds working which will be really > nice there is a GitHub automation for that at the moment. > > Something I was thinking about outside of the scope of that patch was > about the utf8 how do they get represented? Is it some kind of wchar_t? > > Thanks again > > --Phil > > On Wed, 22 Sept 2021 at 21:37, Mark Wielaard <mark@klomp.org> wrote: > >> Hi Thomas, >> >> On Wed, Sep 22, 2021 at 11:48:56AM +0200, Thomas Schwinge wrote: >> > That's <https://github.com/Rust-GCC/gccrs/pull/343>. >> > [...] >> > That's <https://github.com/Rust-GCC/gccrs/pull/344>. >> >> Ah, sorry, I don't really track the github issues and had missed >> those. But good to see the analysis matches. >> >> > Both these related to <https://github.com/Rust-GCC/gccrs/issues/336> >> > "GCC '--enable-bootstrap' build". >> >> To make --enable-bootstrap possible (wow, that takes a long time...) >> you'll also need: >> https://gcc.gnu.org/pipermail/gcc-rust/2021-September/000178.html >> >> https://code.wildebeest.org/git/user/mjw/gccrs/commit/?h=rust-mangle-unreachable >> >> With this and that patch applied there are no more warnings building >> the rust frontend, so a --enable-bootstrap (-Werror) build completes >> successfully. >> >> > > >> https://code.wildebeest.org/git/user/mjw/gccrs/commit/?h=bytecharstring >> > >> > Thanks, that's now: <https://github.com/Rust-GCC/gccrs/pull/687> >> > "Fix byte char and byte string lexing code". >> >> Thanks, >> >> Mark >> >> -- >> Gcc-rust mailing list >> Gcc-rust@gcc.gnu.org >> https://gcc.gnu.org/mailman/listinfo/gcc-rust >> > -- > Gcc-rust mailing list > Gcc-rust@gcc.gnu.org > https://gcc.gnu.org/mailman/listinfo/gcc-rust > -- Arthur Cohen +33 6 10 15 73 74 cohenarthur.dev@gmail.com https://github.com/cohenarthur [-- Attachment #2: Type: text/html, Size: 4706 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread
* byte/char string representation (Was: [PATCH] Fix byte char and byte string lexing code) 2021-09-23 14:10 ` Arthur Cohen @ 2021-09-23 20:53 ` Mark Wielaard 2021-09-24 11:01 ` Philip Herron 0 siblings, 1 reply; 10+ messages in thread From: Mark Wielaard @ 2021-09-23 20:53 UTC (permalink / raw) To: Arthur Cohen; +Cc: Philip Herron, gcc-rust, Thomas Schwinge On Thu, Sep 23, 2021 at 04:10:59PM +0200, Arthur Cohen wrote: > > Something I was thinking about outside of the scope of that patch was > about the utf8 how do they get represented? Is it some kind of wchar_t? > > Do you mean in C++ or in rustc? In rustc, they are represented as Unicode > Scalar Values which are 4 bytes wide. > > From the docs over here: [https://doc.rust-lang.org/std/primitive.char.html] > > So I'm assuming they could be represented as `int32_t`s which would also > make sense for the check Yes, for rust characters a 32bit type (I would pick uint32_t or maybe char32_t) makes sense, since chars in rust are (almost) equal to unicode code points (technically only 21 bits are used). But not really, it is a Unicode scalar value, which excludes high-surrogate and low-surrogate code points and so the only valid values are 0x0 to 0xD7FF and 0xE000 to 0x10FFFF. We should not use the C type wchar_t, because wchar_t is implementation defined and can be 16 or 32 bits. See also https://doc.rust-lang.org/reference/types/textual.html But utf8 strings are made up of u8 "utf8 chars". You need 1 to 4 utf8 chars to encode a code point. https://en.wikipedia.org/wiki/UTF-8 We can use c++ strings made up of (8 bit) chars for that. Our lexer should make sure we only accept valid rust characters or utf-8 sequences. Note that the above doesn't hold for "byte chars" (b'A') or "byte strings" (b"abc"). Those are really just u8 or [u8] arrays which hold bytes (0x0 to 0xff). We currently get the type for byte strings wrong. We pretend they are &str, but they really should be &[u8]. I tried to fix that with the following: diff --git a/gcc/rust/typecheck/rust-hir-type-check-expr.h b/gcc/rust/typecheck/rust-hir-type-check-expr.h index fe8973a4d81..b0dd1c3ff2c 100644 --- a/gcc/rust/typecheck/rust-hir-type-check-expr.h +++ b/gcc/rust/typecheck/rust-hir-type-check-expr.h @@ -609,15 +609,42 @@ public: break; case HIR::Literal::LitType::BYTE_STRING: { - /* We just treat this as a string, but it really is an arraytype of - u8. It isn't in UTF-8, but really just a byte array. */ - TyTy::BaseType *base = nullptr; - auto ok = context->lookup_builtin ("str", &base); + /* This is an arraytype of u8 reference (&[u8;size]). It isn't in + UTF-8, but really just a byte array. Code to construct the array + reference copied from ArrayElemsValues and ArrayType. */ + TyTy::BaseType *u8; + auto ok = context->lookup_builtin ("u8", &u8); rust_assert (ok); + auto crate_num = mappings->get_current_crate (); + Analysis::NodeMapping mapping (crate_num, UNKNOWN_NODEID, + mappings->get_next_hir_id (crate_num), + UNKNOWN_LOCAL_DEFID); + + /* Capacity is the size of the string (number of chars). + It is a constant, but for fold it to get a BExpression. */ + std::string capacity_str = std::to_string (expr.as_string ().size ()); + HIR::LiteralExpr literal_capacity (mapping, capacity_str, + HIR::Literal::LitType::INT, + PrimitiveCoreType::CORETYPE_USIZE, + expr.get_locus ()); + + // mark the type for this implicit node + context->insert_type (mapping, + new TyTy::USizeType (mapping.get_hirid ())); + + Bexpression *capacity + = ConstFold::ConstFoldExpr::fold (&literal_capacity); + + TyTy::ArrayType *array + = new TyTy::ArrayType (expr.get_mappings ().get_hirid (), capacity, + TyTy::TyVar (u8->get_ref ())); + + context->insert_type (expr.get_mappings (), array); + infered = new TyTy::ReferenceType (expr.get_mappings ().get_hirid (), - TyTy::TyVar (base->get_ref ()), false); + TyTy::TyVar (array->get_ref ()), false); } break; But that looks more complicated than is probably necessary and it doesn't work. When the type checker wants to print this type ReferenceType.as_string () goes into a loop for some reason. Can anybody see what is wrong with the above? Cheers, Mark ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: byte/char string representation (Was: [PATCH] Fix byte char and byte string lexing code) 2021-09-23 20:53 ` byte/char string representation (Was: [PATCH] Fix byte char and byte string lexing code) Mark Wielaard @ 2021-09-24 11:01 ` Philip Herron 2021-09-25 11:53 ` Mark Wielaard 0 siblings, 1 reply; 10+ messages in thread From: Philip Herron @ 2021-09-24 11:01 UTC (permalink / raw) To: Mark Wielaard; +Cc: Arthur Cohen, gcc-rust, Thomas Schwinge [-- Attachment #1: Type: text/plain, Size: 6835 bytes --] Hi Mark, This is really useful information, will this mean that the lexer token will need to represent strings differently as well? Or is the std::string in the lexer still ok? The change you made above has the problem that reference types like, arrays are forms of what rust calls covariant types since they might contain an inference variable, so they require lookup to determine the base type. Its likely there is a reference cycle here. Though this change will not be correct for type checking purposes. The design of the type system is purely about rust type checking and inferring types. So for example this change will break the case of: ``` let a:str = "test"; ``` Since the TypePath of str can't know the size of the expected array at compilation time. And the error message will end up with something like "expected str got [i8, 4]"; As for the string implementation, I did some experimentation this morning and it looks as though strings in rust are a kind of slice that we still need to support so, for example, you can see in this gdb session the implicit struct for the slice: ``` Temporary breakpoint 1, test::main () at test.rs:8 8 let a = "Hello World %i\n"; (gdb) n 9 let b = a as *const str; (gdb) p a $1 = "Hello World %i\n" (gdb) p b No symbol 'b' in current context (gdb) n 10 let c = b as *const i8; (gdb) p b $2 = *const str {data_ptr: 0x555555588000 "Hello World %i\n\000", length: 15} (gdb) p b.data_ptr $3 = (*mut u8) 0x555555588000 "Hello World %i\n\000" ``` So to me, this is something that we will need to do in the backend with a bunch of implicit code and types, it seems as though for this rust code: ``` unsafe { let a:&str = "Hello World %i\n"; let b = a as *const str; let c = b as *const i8; printf(c, 123); } ``` we would be creating something like: ``` struct Str { data_ptr: i8*; length: usize; } const char *const_string = "hello world %i\n"; Str& a = &Slice{data_ptr: const_string, l5}; Str* b = a; const unsigned char* c = b->data_ptr; ``` I think we should be able to fix this when we get slices working in the compiler. What do you think? --Phil On Thu, 23 Sept 2021 at 21:53, Mark Wielaard <mark@klomp.org> wrote: > On Thu, Sep 23, 2021 at 04:10:59PM +0200, Arthur Cohen wrote: > > > Something I was thinking about outside of the scope of that patch was > > about the utf8 how do they get represented? Is it some kind of wchar_t? > > > > Do you mean in C++ or in rustc? In rustc, they are represented as Unicode > > Scalar Values which are 4 bytes wide. > > > > From the docs over here: [ > https://doc.rust-lang.org/std/primitive.char.html] > > > > So I'm assuming they could be represented as `int32_t`s which would also > > make sense for the check > > Yes, for rust characters a 32bit type (I would pick uint32_t or maybe > char32_t) makes sense, since chars in rust are (almost) equal to > unicode code points (technically only 21 bits are used). But not > really, it is a Unicode scalar value, which excludes high-surrogate > and low-surrogate code points and so the only valid values are 0x0 to > 0xD7FF and 0xE000 to 0x10FFFF. > > We should not use the C type wchar_t, because wchar_t is > implementation defined and can be 16 or 32 bits. > > See also https://doc.rust-lang.org/reference/types/textual.html > > But utf8 strings are made up of u8 "utf8 chars". You need 1 to 4 utf8 > chars to encode a code point. https://en.wikipedia.org/wiki/UTF-8 > We can use c++ strings made up of (8 bit) chars for that. > > Our lexer should make sure we only accept valid rust characters or > utf-8 sequences. > > Note that the above doesn't hold for "byte chars" (b'A') or "byte > strings" (b"abc"). Those are really just u8 or [u8] arrays which hold > bytes (0x0 to 0xff). > > We currently get the type for byte strings wrong. We pretend they are > &str, but they really should be &[u8]. > > I tried to fix that with the following: > > diff --git a/gcc/rust/typecheck/rust-hir-type-check-expr.h > b/gcc/rust/typecheck/rust-hir-type-check-expr.h > index fe8973a4d81..b0dd1c3ff2c 100644 > --- a/gcc/rust/typecheck/rust-hir-type-check-expr.h > +++ b/gcc/rust/typecheck/rust-hir-type-check-expr.h > @@ -609,15 +609,42 @@ public: > break; > > case HIR::Literal::LitType::BYTE_STRING: { > - /* We just treat this as a string, but it really is an arraytype > of > - u8. It isn't in UTF-8, but really just a byte array. */ > - TyTy::BaseType *base = nullptr; > - auto ok = context->lookup_builtin ("str", &base); > + /* This is an arraytype of u8 reference (&[u8;size]). It isn't in > + UTF-8, but really just a byte array. Code to construct the > array > + reference copied from ArrayElemsValues and ArrayType. */ > + TyTy::BaseType *u8; > + auto ok = context->lookup_builtin ("u8", &u8); > rust_assert (ok); > > + auto crate_num = mappings->get_current_crate (); > + Analysis::NodeMapping mapping (crate_num, UNKNOWN_NODEID, > + mappings->get_next_hir_id > (crate_num), > + UNKNOWN_LOCAL_DEFID); > + > + /* Capacity is the size of the string (number of chars). > + It is a constant, but for fold it to get a BExpression. */ > + std::string capacity_str = std::to_string (expr.as_string > ().size ()); > + HIR::LiteralExpr literal_capacity (mapping, capacity_str, > + HIR::Literal::LitType::INT, > + > PrimitiveCoreType::CORETYPE_USIZE, > + expr.get_locus ()); > + > + // mark the type for this implicit node > + context->insert_type (mapping, > + new TyTy::USizeType (mapping.get_hirid > ())); > + > + Bexpression *capacity > + = ConstFold::ConstFoldExpr::fold (&literal_capacity); > + > + TyTy::ArrayType *array > + = new TyTy::ArrayType (expr.get_mappings ().get_hirid (), > capacity, > + TyTy::TyVar (u8->get_ref ())); > + > + context->insert_type (expr.get_mappings (), array); > + > infered > = new TyTy::ReferenceType (expr.get_mappings ().get_hirid (), > - TyTy::TyVar (base->get_ref ()), > false); > + TyTy::TyVar (array->get_ref ()), > false); > } > break; > > But that looks more complicated than is probably necessary and it > doesn't work. When the type checker wants to print this type > ReferenceType.as_string () goes into a loop for some reason. > > Can anybody see what is wrong with the above? > > Cheers, > > Mark > > [-- Attachment #2: Type: text/html, Size: 8893 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: byte/char string representation (Was: [PATCH] Fix byte char and byte string lexing code) 2021-09-24 11:01 ` Philip Herron @ 2021-09-25 11:53 ` Mark Wielaard 2021-09-30 10:46 ` Philip Herron 0 siblings, 1 reply; 10+ messages in thread From: Mark Wielaard @ 2021-09-25 11:53 UTC (permalink / raw) To: Philip Herron; +Cc: Arthur Cohen, gcc-rust, Thomas Schwinge Hi Philip, On Fri, Sep 24, 2021 at 12:01:42PM +0100, Philip Herron wrote: > This is really useful information, will this mean that the lexer token will > need to represent strings differently as well? Or is the std::string in the > lexer still ok? I think the respresentation as std::string is fine. As long as we don't mix std::strings between different types (byte strings may contain sequences of chars that aren't valid utf-8 sequenecs). > The change you made above has the problem that reference types like, arrays > are forms of what rust calls covariant types since they might contain an > inference variable, so they require lookup to determine the base type. Its > likely there is a reference cycle here. Though this change will not be > correct for type checking purposes. The design of the type system is purely > about rust type checking and inferring types. OK, so how do I represent an reference to an array type that doesn't contain any inference variables? When we see a b"hello" byte string that is the same as seeing &[b'h', b'e', b'l', b'l', b'o'] which is the same as seeing &[0x68u8, 0x65u8, 0x6cu8, 0x6cu8, 0x6fu8]; So we know this is &[u8;5] and if we write: let a = b"hello"; We want to infer that a has type &[u8;5]. > So for example this change will break the case of: > > ``` > let a:str = "test"; > ``` > > Since the TypePath of str can't know the size of the expected array at > compilation time. And the error message will end up with something like > "expected str got [i8, 4]"; Right, but that is for "proper strings". It is somewhat unfortunate that Rust calls byte strings also "strings", but they really aren't. b"abc" is static array of u8, not a &str (containing utf-8). I have to think about the slicing of "proper strings", which sound more complicated than slicing of byte strings, because I don't think you want to chop up a utf-8 sequence. For now I would simply try to get the type of byte strings like b"test" correct. Cheers, Mark ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: byte/char string representation (Was: [PATCH] Fix byte char and byte string lexing code) 2021-09-25 11:53 ` Mark Wielaard @ 2021-09-30 10:46 ` Philip Herron 2021-10-03 22:04 ` Mark Wielaard 0 siblings, 1 reply; 10+ messages in thread From: Philip Herron @ 2021-09-30 10:46 UTC (permalink / raw) To: Mark Wielaard; +Cc: Arthur Cohen, gcc-rust, Thomas Schwinge [-- Attachment #1: Type: text/plain, Size: 2963 bytes --] Hi Mark, Thanks for clarifying this, I was getting mixed up between normal str's and byte strings. Your patch was 99% of the way there to fix the type resolution so I finished it off for you: https://github.com/Rust-GCC/gccrs/pull/698/files The missing piece was that References and Array's are a type of covariant type so that an array type can look like this: [_, capacity], so the inference variable here is the variant so that we need to make sure it has its own implicit mapping id. You just needed to create one more mapping to get that implicit id so that the reference type similarly doesn't get into a loop of looking up itself. Creating implicit types like this could be made easier, so we should likely add some helpers for this scenario. Let me know what you think. Thanks --Phil On Sat, 25 Sept 2021 at 12:53, Mark Wielaard <mark@klomp.org> wrote: > Hi Philip, > > On Fri, Sep 24, 2021 at 12:01:42PM +0100, Philip Herron wrote: > > This is really useful information, will this mean that the lexer token > will > > need to represent strings differently as well? Or is the std::string in > the > > lexer still ok? > > I think the respresentation as std::string is fine. As long as we > don't mix std::strings between different types (byte strings may > contain sequences of chars that aren't valid utf-8 sequenecs). > > > The change you made above has the problem that reference types like, > arrays > > are forms of what rust calls covariant types since they might contain an > > inference variable, so they require lookup to determine the base type. > Its > > likely there is a reference cycle here. Though this change will not be > > correct for type checking purposes. The design of the type system is > purely > > about rust type checking and inferring types. > > OK, so how do I represent an reference to an array type that doesn't > contain any inference variables? When we see a b"hello" byte string > that is the same as seeing &[b'h', b'e', b'l', b'l', b'o'] which is > the same as seeing &[0x68u8, 0x65u8, 0x6cu8, 0x6cu8, 0x6fu8]; > > So we know this is &[u8;5] and if we write: > > let a = b"hello"; > > We want to infer that a has type &[u8;5]. > > > So for example this change will break the case of: > > > > ``` > > let a:str = "test"; > > ``` > > > > Since the TypePath of str can't know the size of the expected array at > > compilation time. And the error message will end up with something like > > "expected str got [i8, 4]"; > > Right, but that is for "proper strings". It is somewhat unfortunate > that Rust calls byte strings also "strings", but they really > aren't. b"abc" is static array of u8, not a &str (containing utf-8). > > I have to think about the slicing of "proper strings", which sound > more complicated than slicing of byte strings, because I don't think > you want to chop up a utf-8 sequence. For now I would simply try to > get the type of byte strings like b"test" correct. > > Cheers, > > Mark > > [-- Attachment #2: Type: text/html, Size: 3842 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: byte/char string representation (Was: [PATCH] Fix byte char and byte string lexing code) 2021-09-30 10:46 ` Philip Herron @ 2021-10-03 22:04 ` Mark Wielaard 0 siblings, 0 replies; 10+ messages in thread From: Mark Wielaard @ 2021-10-03 22:04 UTC (permalink / raw) To: Philip Herron; +Cc: Arthur Cohen, gcc-rust, Thomas Schwinge Hi Philip, On Thu, Sep 30, 2021 at 11:46:30AM +0100, Philip Herron wrote: > Your patch was 99% of the way there to fix the type > resolution so I finished it off for you: > > https://github.com/Rust-GCC/gccrs/pull/698/files And you added the actual backend code needed! So I'll probably say I was maybe 45% there :) This is really useful. It works for me and will enable writing a couple of real tests for byte strings (like we have for str already, that aren't just lexer/parser checks, but which can actually check the contents of the byte strings. > The missing piece was that References and Array's are a type of covariant > type so that an array type can look like this: [_, capacity], so the > inference variable here is the variant so that we need to make sure it has > its own implicit mapping id. You just needed to create one more mapping to > get that implicit id so that the reference type similarly doesn't get into > a loop of looking up itself. Creating implicit types like this could be > made easier, so we should likely add some helpers for this scenario. Thanks, my knowledge of mappings was a bit fuzzy. So even though we know our own array type is fixed, we still need to attach a unique mapping to match against other types which might not be fixed (yet). There is indeed a lot of extra work you need to do and that you can easily get wrong. Creating an empty mapping is a lot of work and it is easy to get the details right. An new_empty_mapping helper method would be great. Maybe it could even be hidden inside the expression or type constructor. I do have a question about why we also create these mappings for the capacity. The capacity is always a fixed usize number. Not just in this case, but always since it is constant expression in a const context. So any comparison of the length will always be a simple constant usize number check. Is using mappings not overkill in that case? We use a Bexpression for the capacity, created by ConstFold::ConstFoldExpr::fold, shouldn't we check it actually folded correctly? Currently when it cannot we just leave it in an error state till it is too late. e.g pub fn main() { let _a: &[u8; 4 / 0]; _a = b"test"; } Results in: rust1: internal compiler error: in const_size_val_to_string, at rust/rust-gcc.cc:210 0x794e67 Gcc_backend::const_size_val_to_string[abi:cxx11](Bexpression*) /home/mark/src/gccrs/gcc/rust/rust-gcc.cc:210 0x794e67 Gcc_backend::const_size_val_to_string[abi:cxx11](Bexpression*) /home/mark/src/gccrs/gcc/rust/rust-gcc.cc:208 0x8c51ef Rust::TyTy::ArrayType::capacity_string[abi:cxx11]() const /home/mark/src/gccrs/gcc/rust/typecheck/rust-tyty.cc:1136 0x8c51ef Rust::TyTy::ArrayType::as_string[abi:cxx11]() const /home/mark/src/gccrs/gcc/rust/typecheck/rust-tyty.cc:1129 0x8d0c8b Rust::TyTy::BaseCoercionRules::visit(Rust::TyTy::ArrayType&) /home/mark/src/gccrs/gcc/rust/typecheck/rust-tyty-coercion.h:166 [...] So it seems we really should check for is_error_expr earlier. And maybe just get the integer value and store/compare that instead of the Bexpression? P.S. Don't let this stop you from committing the code as is, it does work. We can rework it afterwards if the above comments make any sense. Cheers, Mark ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2021-10-03 22:04 UTC | newest] Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-09-21 22:54 [PATCH] Fix byte char and byte string lexing code Mark Wielaard 2021-09-22 9:48 ` Thomas Schwinge 2021-09-22 20:37 ` Mark Wielaard 2021-09-23 11:43 ` Philip Herron 2021-09-23 14:10 ` Arthur Cohen 2021-09-23 20:53 ` byte/char string representation (Was: [PATCH] Fix byte char and byte string lexing code) Mark Wielaard 2021-09-24 11:01 ` Philip Herron 2021-09-25 11:53 ` Mark Wielaard 2021-09-30 10:46 ` Philip Herron 2021-10-03 22:04 ` Mark Wielaard
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).