public inbox for gcc-rust@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH] Fix byte char and byte string lexing code
@ 2021-09-21 22:54 Mark Wielaard
  2021-09-22  9:48 ` Thomas Schwinge
  0 siblings, 1 reply; 10+ messages in thread
From: Mark Wielaard @ 2021-09-21 22:54 UTC (permalink / raw)
  To: gcc-rust; +Cc: Mark Wielaard

There were two warnings in lexer parse_byte_char and parse_byte_string
code for arches with signed chars:

rust-lex.cc: In member function
             ‘Rust::TokenPtr Rust::Lexer::parse_byte_char(Location)’:
rust-lex.cc:1564:21: warning: comparison is always false due to limited
                     range of data type [-Wtype-limits]
 1564 |       if (byte_char > 127)
      |           ~~~~~~~~~~^~~~~
rust-lex.cc: In member function
             ‘Rust::TokenPtr Rust::Lexer::parse_byte_string(Location)’:
rust-lex.cc:1639:27: warning: comparison is always false due to limited
                     range of data type [-Wtype-limits]
 1639 |           if (output_char > 127)
      |               ~~~~~~~~~~~~^~~~~

The fix would be to cast to an unsigned char before the comparison.
But that is actually wrong, and would produce the following errors
parsing a byte char or byte string:

bytecharstring.rs:3:14: error: ‘byte char’ ‘�’ out of range
    3 |   let _bc = b'\x80';
      |              ^
bytecharstring.rs:4:14: error: character ‘�’ in byte string out of range
    4 |   let _bs = b"foo\x80bar";
      |              ^

Both byte chars and byte strings may contain up to \xFF (255)
characters. It is utf-8 chars or strings that can only

Remove the faulty check and add a new testcase bytecharstring.rs
that checks byte chars and strings do accept > 127 hex char
escapes, but utf-8 chars and strings reject such hex char escapes.
---

https://code.wildebeest.org/git/user/mjw/gccrs/commit/?h=bytecharstring

 gcc/rust/lex/rust-lex.cc                     | 15 ---------------
 gcc/testsuite/rust/compile/bytecharstring.rs |  8 ++++++++
 2 files changed, 8 insertions(+), 15 deletions(-)
 create mode 100644 gcc/testsuite/rust/compile/bytecharstring.rs

diff --git a/gcc/rust/lex/rust-lex.cc b/gcc/rust/lex/rust-lex.cc
index 49b6b6d32a7..b70877be9ff 100644
--- a/gcc/rust/lex/rust-lex.cc
+++ b/gcc/rust/lex/rust-lex.cc
@@ -1559,13 +1559,6 @@ Lexer::parse_byte_char (Location loc)
       byte_char = std::get<0> (escape_length_pair);
       length += std::get<1> (escape_length_pair);
 
-      if (byte_char > 127)
-	{
-	  rust_error_at (get_current_location (),
-			 "%<byte char%> %<%c%> out of range", byte_char);
-	  byte_char = 0;
-	}
-
       current_char = peek_input ();
 
       if (current_char != '\'')
@@ -1634,14 +1627,6 @@ Lexer::parse_byte_string (Location loc)
 	  else
 	    length += std::get<1> (escape_length_pair);
 
-	  if (output_char > 127)
-	    {
-	      rust_error_at (get_current_location (),
-			     "character %<%c%> in byte string out of range",
-			     output_char);
-	      output_char = 0;
-	    }
-
 	  if (output_char != 0)
 	    str += output_char;
 
diff --git a/gcc/testsuite/rust/compile/bytecharstring.rs b/gcc/testsuite/rust/compile/bytecharstring.rs
new file mode 100644
index 00000000000..9242e2c5a0b
--- /dev/null
+++ b/gcc/testsuite/rust/compile/bytecharstring.rs
@@ -0,0 +1,8 @@
+fn main ()
+{
+  let _bc = b'\x80';
+  let _bs = b"foo\x80bar";
+
+  let _c = '\xef';        // { dg-error "out of range" }
+  let _s = "Foo\xEFBar";  // { dg-error "out of range" }
+}
-- 
2.32.0


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Fix byte char and byte string lexing code
  2021-09-21 22:54 [PATCH] Fix byte char and byte string lexing code Mark Wielaard
@ 2021-09-22  9:48 ` Thomas Schwinge
  2021-09-22 20:37   ` Mark Wielaard
  0 siblings, 1 reply; 10+ messages in thread
From: Thomas Schwinge @ 2021-09-22  9:48 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: gcc-rust

Hi Mark!

On 2021-09-22T00:54:30+0200, Mark Wielaard <mark@klomp.org> wrote:
> There were two warnings in lexer parse_byte_char and parse_byte_string
> code for arches with signed chars:
>
> rust-lex.cc: In member function
>              ‘Rust::TokenPtr Rust::Lexer::parse_byte_char(Location)’:
> rust-lex.cc:1564:21: warning: comparison is always false due to limited
>                      range of data type [-Wtype-limits]
>  1564 |       if (byte_char > 127)
>       |           ~~~~~~~~~~^~~~~

That's <https://github.com/Rust-GCC/gccrs/pull/343>.

> rust-lex.cc: In member function
>              ‘Rust::TokenPtr Rust::Lexer::parse_byte_string(Location)’:
> rust-lex.cc:1639:27: warning: comparison is always false due to limited
>                      range of data type [-Wtype-limits]
>  1639 |           if (output_char > 127)
>       |               ~~~~~~~~~~~~^~~~~

That's <https://github.com/Rust-GCC/gccrs/pull/344>.

Both these related to <https://github.com/Rust-GCC/gccrs/issues/336>
"GCC '--enable-bootstrap' build".

> The fix would be to cast to an unsigned char before the comparison.
> But that is actually wrong, and would produce the following errors
> parsing a byte char or byte string:
>
> bytecharstring.rs:3:14: error: ‘byte char’ ‘�’ out of range
>     3 |   let _bc = b'\x80';
>       |              ^
> bytecharstring.rs:4:14: error: character ‘�’ in byte string out of range
>     4 |   let _bs = b"foo\x80bar";
>       |              ^
>
> Both byte chars and byte strings may contain up to \xFF (255)
> characters. It is utf-8 chars or strings that can only [truncated here -- but I understand what you mean]

I think this does match my thoughts in
<https://github.com/Rust-GCC/gccrs/pull/343#issuecomment-816214689>.

> Remove the faulty check and add a new testcase bytecharstring.rs
> that checks byte chars and strings do accept > 127 hex char
> escapes, but utf-8 chars and strings reject such hex char escapes.
> ---
>
> https://code.wildebeest.org/git/user/mjw/gccrs/commit/?h=bytecharstring

Thanks, that's now: <https://github.com/Rust-GCC/gccrs/pull/687>
"Fix byte char and byte string lexing code".


Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Fix byte char and byte string lexing code
  2021-09-22  9:48 ` Thomas Schwinge
@ 2021-09-22 20:37   ` Mark Wielaard
  2021-09-23 11:43     ` Philip Herron
  0 siblings, 1 reply; 10+ messages in thread
From: Mark Wielaard @ 2021-09-22 20:37 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: gcc-rust

Hi Thomas,

On Wed, Sep 22, 2021 at 11:48:56AM +0200, Thomas Schwinge wrote:
> That's <https://github.com/Rust-GCC/gccrs/pull/343>.
> [...] 
> That's <https://github.com/Rust-GCC/gccrs/pull/344>.

Ah, sorry, I don't really track the github issues and had missed
those. But good to see the analysis matches.

> Both these related to <https://github.com/Rust-GCC/gccrs/issues/336>
> "GCC '--enable-bootstrap' build".

To make --enable-bootstrap possible (wow, that takes a long time...)
you'll also need:
https://gcc.gnu.org/pipermail/gcc-rust/2021-September/000178.html
https://code.wildebeest.org/git/user/mjw/gccrs/commit/?h=rust-mangle-unreachable

With this and that patch applied there are no more warnings building
the rust frontend, so a --enable-bootstrap (-Werror) build completes
successfully.

> > https://code.wildebeest.org/git/user/mjw/gccrs/commit/?h=bytecharstring
> 
> Thanks, that's now: <https://github.com/Rust-GCC/gccrs/pull/687>
> "Fix byte char and byte string lexing code".

Thanks,

Mark


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Fix byte char and byte string lexing code
  2021-09-22 20:37   ` Mark Wielaard
@ 2021-09-23 11:43     ` Philip Herron
  2021-09-23 14:10       ` Arthur Cohen
  0 siblings, 1 reply; 10+ messages in thread
From: Philip Herron @ 2021-09-23 11:43 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: Thomas Schwinge, gcc-rust

[-- Attachment #1: Type: text/plain, Size: 1621 bytes --]

Hi guys,

Thanks, Thomas for raising the PR it is currently being merged. We will
raise the next PR to get bootstrapable builds working which will be really
nice there is a GitHub automation for that at the moment.

Something I was thinking about outside of the scope of that patch was about
the utf8 how do they get represented? Is it some kind of wchar_t?

Thanks again

--Phil

On Wed, 22 Sept 2021 at 21:37, Mark Wielaard <mark@klomp.org> wrote:

> Hi Thomas,
>
> On Wed, Sep 22, 2021 at 11:48:56AM +0200, Thomas Schwinge wrote:
> > That's <https://github.com/Rust-GCC/gccrs/pull/343>.
> > [...]
> > That's <https://github.com/Rust-GCC/gccrs/pull/344>.
>
> Ah, sorry, I don't really track the github issues and had missed
> those. But good to see the analysis matches.
>
> > Both these related to <https://github.com/Rust-GCC/gccrs/issues/336>
> > "GCC '--enable-bootstrap' build".
>
> To make --enable-bootstrap possible (wow, that takes a long time...)
> you'll also need:
> https://gcc.gnu.org/pipermail/gcc-rust/2021-September/000178.html
>
> https://code.wildebeest.org/git/user/mjw/gccrs/commit/?h=rust-mangle-unreachable
>
> With this and that patch applied there are no more warnings building
> the rust frontend, so a --enable-bootstrap (-Werror) build completes
> successfully.
>
> > >
> https://code.wildebeest.org/git/user/mjw/gccrs/commit/?h=bytecharstring
> >
> > Thanks, that's now: <https://github.com/Rust-GCC/gccrs/pull/687>
> > "Fix byte char and byte string lexing code".
>
> Thanks,
>
> Mark
>
> --
> Gcc-rust mailing list
> Gcc-rust@gcc.gnu.org
> https://gcc.gnu.org/mailman/listinfo/gcc-rust
>

[-- Attachment #2: Type: text/html, Size: 3085 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Fix byte char and byte string lexing code
  2021-09-23 11:43     ` Philip Herron
@ 2021-09-23 14:10       ` Arthur Cohen
  2021-09-23 20:53         ` byte/char string representation (Was: [PATCH] Fix byte char and byte string lexing code) Mark Wielaard
  0 siblings, 1 reply; 10+ messages in thread
From: Arthur Cohen @ 2021-09-23 14:10 UTC (permalink / raw)
  To: Philip Herron; +Cc: Mark Wielaard, gcc-rust, Thomas Schwinge

[-- Attachment #1: Type: text/plain, Size: 2499 bytes --]

> Something I was thinking about outside of the scope of that patch was
about the utf8 how do they get represented? Is it some kind of wchar_t?

Do you mean in C++ or in rustc? In rustc, they are represented as Unicode
Scalar Values which are 4 bytes wide.

From the docs over here: [https://doc.rust-lang.org/std/primitive.char.html]

So I'm assuming they could be represented as `int32_t`s which would also
make sense for the check

Cheers,
Arthur

Le jeu. 23 sept. 2021 à 13:43, Philip Herron <philip.herron@embecosm.com> a
écrit :

> Hi guys,
>
> Thanks, Thomas for raising the PR it is currently being merged. We will
> raise the next PR to get bootstrapable builds working which will be really
> nice there is a GitHub automation for that at the moment.
>
> Something I was thinking about outside of the scope of that patch was
> about the utf8 how do they get represented? Is it some kind of wchar_t?
>
> Thanks again
>
> --Phil
>
> On Wed, 22 Sept 2021 at 21:37, Mark Wielaard <mark@klomp.org> wrote:
>
>> Hi Thomas,
>>
>> On Wed, Sep 22, 2021 at 11:48:56AM +0200, Thomas Schwinge wrote:
>> > That's <https://github.com/Rust-GCC/gccrs/pull/343>.
>> > [...]
>> > That's <https://github.com/Rust-GCC/gccrs/pull/344>.
>>
>> Ah, sorry, I don't really track the github issues and had missed
>> those. But good to see the analysis matches.
>>
>> > Both these related to <https://github.com/Rust-GCC/gccrs/issues/336>
>> > "GCC '--enable-bootstrap' build".
>>
>> To make --enable-bootstrap possible (wow, that takes a long time...)
>> you'll also need:
>> https://gcc.gnu.org/pipermail/gcc-rust/2021-September/000178.html
>>
>> https://code.wildebeest.org/git/user/mjw/gccrs/commit/?h=rust-mangle-unreachable
>>
>> With this and that patch applied there are no more warnings building
>> the rust frontend, so a --enable-bootstrap (-Werror) build completes
>> successfully.
>>
>> > >
>> https://code.wildebeest.org/git/user/mjw/gccrs/commit/?h=bytecharstring
>> >
>> > Thanks, that's now: <https://github.com/Rust-GCC/gccrs/pull/687>
>> > "Fix byte char and byte string lexing code".
>>
>> Thanks,
>>
>> Mark
>>
>> --
>> Gcc-rust mailing list
>> Gcc-rust@gcc.gnu.org
>> https://gcc.gnu.org/mailman/listinfo/gcc-rust
>>
> --
> Gcc-rust mailing list
> Gcc-rust@gcc.gnu.org
> https://gcc.gnu.org/mailman/listinfo/gcc-rust
>


-- 
Arthur Cohen
+33 6 10 15 73 74
cohenarthur.dev@gmail.com
https://github.com/cohenarthur

[-- Attachment #2: Type: text/html, Size: 4706 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* byte/char string representation (Was: [PATCH] Fix byte char and byte string lexing code)
  2021-09-23 14:10       ` Arthur Cohen
@ 2021-09-23 20:53         ` Mark Wielaard
  2021-09-24 11:01           ` Philip Herron
  0 siblings, 1 reply; 10+ messages in thread
From: Mark Wielaard @ 2021-09-23 20:53 UTC (permalink / raw)
  To: Arthur Cohen; +Cc: Philip Herron, gcc-rust, Thomas Schwinge

On Thu, Sep 23, 2021 at 04:10:59PM +0200, Arthur Cohen wrote:
> > Something I was thinking about outside of the scope of that patch was
> about the utf8 how do they get represented? Is it some kind of wchar_t?
> 
> Do you mean in C++ or in rustc? In rustc, they are represented as Unicode
> Scalar Values which are 4 bytes wide.
> 
> From the docs over here: [https://doc.rust-lang.org/std/primitive.char.html]
> 
> So I'm assuming they could be represented as `int32_t`s which would also
> make sense for the check

Yes, for rust characters a 32bit type (I would pick uint32_t or maybe
char32_t) makes sense, since chars in rust are (almost) equal to
unicode code points (technically only 21 bits are used). But not
really, it is a Unicode scalar value, which excludes high-surrogate
and low-surrogate code points and so the only valid values are 0x0 to
0xD7FF and 0xE000 to 0x10FFFF.

We should not use the C type wchar_t, because wchar_t is
implementation defined and can be 16 or 32 bits.

See also https://doc.rust-lang.org/reference/types/textual.html

But utf8 strings are made up of u8 "utf8 chars". You need 1 to 4 utf8
chars to encode a code point. https://en.wikipedia.org/wiki/UTF-8
We can use c++ strings made up of (8 bit) chars for that.

Our lexer should make sure we only accept valid rust characters or
utf-8 sequences.

Note that the above doesn't hold for "byte chars" (b'A') or "byte
strings" (b"abc"). Those are really just u8 or [u8] arrays which hold
bytes (0x0 to 0xff).

We currently get the type for byte strings wrong. We pretend they are
&str, but they really should be &[u8].

I tried to fix that with the following:

diff --git a/gcc/rust/typecheck/rust-hir-type-check-expr.h b/gcc/rust/typecheck/rust-hir-type-check-expr.h
index fe8973a4d81..b0dd1c3ff2c 100644
--- a/gcc/rust/typecheck/rust-hir-type-check-expr.h
+++ b/gcc/rust/typecheck/rust-hir-type-check-expr.h
@@ -609,15 +609,42 @@ public:
 	break;
 
 	case HIR::Literal::LitType::BYTE_STRING: {
-	  /* We just treat this as a string, but it really is an arraytype of
-	     u8. It isn't in UTF-8, but really just a byte array.  */
-	  TyTy::BaseType *base = nullptr;
-	  auto ok = context->lookup_builtin ("str", &base);
+	  /* This is an arraytype of u8 reference (&[u8;size]). It isn't in
+	     UTF-8, but really just a byte array. Code to construct the array
+	     reference copied from ArrayElemsValues and ArrayType. */
+	  TyTy::BaseType *u8;
+	  auto ok = context->lookup_builtin ("u8", &u8);
 	  rust_assert (ok);
 
+	  auto crate_num = mappings->get_current_crate ();
+	  Analysis::NodeMapping mapping (crate_num, UNKNOWN_NODEID,
+					 mappings->get_next_hir_id (crate_num),
+					 UNKNOWN_LOCAL_DEFID);
+
+	  /* Capacity is the size of the string (number of chars).
+	     It is a constant, but for fold it to get a BExpression.  */
+	  std::string capacity_str = std::to_string (expr.as_string ().size ());
+	  HIR::LiteralExpr literal_capacity (mapping, capacity_str,
+					     HIR::Literal::LitType::INT,
+					     PrimitiveCoreType::CORETYPE_USIZE,
+					     expr.get_locus ());
+
+	  // mark the type for this implicit node
+	  context->insert_type (mapping,
+				new TyTy::USizeType (mapping.get_hirid ()));
+
+	  Bexpression *capacity
+	    = ConstFold::ConstFoldExpr::fold (&literal_capacity);
+
+	  TyTy::ArrayType *array
+	    = new TyTy::ArrayType (expr.get_mappings ().get_hirid (), capacity,
+				   TyTy::TyVar (u8->get_ref ()));
+
+	  context->insert_type (expr.get_mappings (), array);
+
 	  infered
 	    = new TyTy::ReferenceType (expr.get_mappings ().get_hirid (),
-				       TyTy::TyVar (base->get_ref ()), false);
+				       TyTy::TyVar (array->get_ref ()), false);
 	}
 	break;
 
But that looks more complicated than is probably necessary and it
doesn't work. When the type checker wants to print this type
ReferenceType.as_string () goes into a loop for some reason.

Can anybody see what is wrong with the above?

Cheers,

Mark


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: byte/char string representation (Was: [PATCH] Fix byte char and byte string lexing code)
  2021-09-23 20:53         ` byte/char string representation (Was: [PATCH] Fix byte char and byte string lexing code) Mark Wielaard
@ 2021-09-24 11:01           ` Philip Herron
  2021-09-25 11:53             ` Mark Wielaard
  0 siblings, 1 reply; 10+ messages in thread
From: Philip Herron @ 2021-09-24 11:01 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: Arthur Cohen, gcc-rust, Thomas Schwinge

[-- Attachment #1: Type: text/plain, Size: 6835 bytes --]

Hi Mark,

This is really useful information, will this mean that the lexer token will
need to represent strings differently as well? Or is the std::string in the
lexer still ok?

The change you made above has the problem that reference types like, arrays
are forms of what rust calls covariant types since they might contain an
inference variable, so they require lookup to determine the base type. Its
likely there is a reference cycle here. Though this change will not be
correct for type checking purposes. The design of the type system is purely
about rust type checking and inferring types. So for example this change
will break the case of:

```
  let a:str = "test";
```

Since the TypePath of str can't know the size of the expected array at
compilation time. And the error message will end up with something like
"expected str got [i8, 4]"; As for the string implementation, I did some
experimentation this morning and it looks as though strings in rust are a
kind of slice that we still need to support so, for example, you can see in
this gdb session the implicit struct for the slice:

```
Temporary breakpoint 1, test::main () at test.rs:8
8               let a = "Hello World %i\n";
(gdb) n
9               let b = a as *const str;
(gdb) p a
$1 = "Hello World %i\n"
(gdb) p b
No symbol 'b' in current context
(gdb) n
10              let c = b as *const i8;
(gdb) p b
$2 = *const str {data_ptr: 0x555555588000 "Hello World %i\n\000", length:
15}
(gdb) p b.data_ptr
$3 = (*mut u8) 0x555555588000 "Hello World %i\n\000"
```

So to me, this is something that we will need to do in the backend with a
bunch of implicit code and types, it seems as though for this rust code:

```
unsafe {
        let a:&str = "Hello World %i\n";
        let b = a as *const str;
        let c = b as *const i8;

        printf(c, 123);
    }
```

we would be creating something like:

```
struct Str {
  data_ptr: i8*;
  length: usize;
}

const char *const_string =  "hello world %i\n";
Str& a = &Slice{data_ptr: const_string, l5};
Str* b = a;
const unsigned char* c = b->data_ptr;
```

I think we should be able to fix this when we get slices working in the
compiler.

What do you think?

--Phil

On Thu, 23 Sept 2021 at 21:53, Mark Wielaard <mark@klomp.org> wrote:

> On Thu, Sep 23, 2021 at 04:10:59PM +0200, Arthur Cohen wrote:
> > > Something I was thinking about outside of the scope of that patch was
> > about the utf8 how do they get represented? Is it some kind of wchar_t?
> >
> > Do you mean in C++ or in rustc? In rustc, they are represented as Unicode
> > Scalar Values which are 4 bytes wide.
> >
> > From the docs over here: [
> https://doc.rust-lang.org/std/primitive.char.html]
> >
> > So I'm assuming they could be represented as `int32_t`s which would also
> > make sense for the check
>
> Yes, for rust characters a 32bit type (I would pick uint32_t or maybe
> char32_t) makes sense, since chars in rust are (almost) equal to
> unicode code points (technically only 21 bits are used). But not
> really, it is a Unicode scalar value, which excludes high-surrogate
> and low-surrogate code points and so the only valid values are 0x0 to
> 0xD7FF and 0xE000 to 0x10FFFF.
>
> We should not use the C type wchar_t, because wchar_t is
> implementation defined and can be 16 or 32 bits.
>
> See also https://doc.rust-lang.org/reference/types/textual.html
>
> But utf8 strings are made up of u8 "utf8 chars". You need 1 to 4 utf8
> chars to encode a code point. https://en.wikipedia.org/wiki/UTF-8
> We can use c++ strings made up of (8 bit) chars for that.
>
> Our lexer should make sure we only accept valid rust characters or
> utf-8 sequences.
>
> Note that the above doesn't hold for "byte chars" (b'A') or "byte
> strings" (b"abc"). Those are really just u8 or [u8] arrays which hold
> bytes (0x0 to 0xff).
>
> We currently get the type for byte strings wrong. We pretend they are
> &str, but they really should be &[u8].
>
> I tried to fix that with the following:
>
> diff --git a/gcc/rust/typecheck/rust-hir-type-check-expr.h
> b/gcc/rust/typecheck/rust-hir-type-check-expr.h
> index fe8973a4d81..b0dd1c3ff2c 100644
> --- a/gcc/rust/typecheck/rust-hir-type-check-expr.h
> +++ b/gcc/rust/typecheck/rust-hir-type-check-expr.h
> @@ -609,15 +609,42 @@ public:
>         break;
>
>         case HIR::Literal::LitType::BYTE_STRING: {
> -         /* We just treat this as a string, but it really is an arraytype
> of
> -            u8. It isn't in UTF-8, but really just a byte array.  */
> -         TyTy::BaseType *base = nullptr;
> -         auto ok = context->lookup_builtin ("str", &base);
> +         /* This is an arraytype of u8 reference (&[u8;size]). It isn't in
> +            UTF-8, but really just a byte array. Code to construct the
> array
> +            reference copied from ArrayElemsValues and ArrayType. */
> +         TyTy::BaseType *u8;
> +         auto ok = context->lookup_builtin ("u8", &u8);
>           rust_assert (ok);
>
> +         auto crate_num = mappings->get_current_crate ();
> +         Analysis::NodeMapping mapping (crate_num, UNKNOWN_NODEID,
> +                                        mappings->get_next_hir_id
> (crate_num),
> +                                        UNKNOWN_LOCAL_DEFID);
> +
> +         /* Capacity is the size of the string (number of chars).
> +            It is a constant, but for fold it to get a BExpression.  */
> +         std::string capacity_str = std::to_string (expr.as_string
> ().size ());
> +         HIR::LiteralExpr literal_capacity (mapping, capacity_str,
> +                                            HIR::Literal::LitType::INT,
> +
> PrimitiveCoreType::CORETYPE_USIZE,
> +                                            expr.get_locus ());
> +
> +         // mark the type for this implicit node
> +         context->insert_type (mapping,
> +                               new TyTy::USizeType (mapping.get_hirid
> ()));
> +
> +         Bexpression *capacity
> +           = ConstFold::ConstFoldExpr::fold (&literal_capacity);
> +
> +         TyTy::ArrayType *array
> +           = new TyTy::ArrayType (expr.get_mappings ().get_hirid (),
> capacity,
> +                                  TyTy::TyVar (u8->get_ref ()));
> +
> +         context->insert_type (expr.get_mappings (), array);
> +
>           infered
>             = new TyTy::ReferenceType (expr.get_mappings ().get_hirid (),
> -                                      TyTy::TyVar (base->get_ref ()),
> false);
> +                                      TyTy::TyVar (array->get_ref ()),
> false);
>         }
>         break;
>
> But that looks more complicated than is probably necessary and it
> doesn't work. When the type checker wants to print this type
> ReferenceType.as_string () goes into a loop for some reason.
>
> Can anybody see what is wrong with the above?
>
> Cheers,
>
> Mark
>
>

[-- Attachment #2: Type: text/html, Size: 8893 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: byte/char string representation (Was: [PATCH] Fix byte char and byte string lexing code)
  2021-09-24 11:01           ` Philip Herron
@ 2021-09-25 11:53             ` Mark Wielaard
  2021-09-30 10:46               ` Philip Herron
  0 siblings, 1 reply; 10+ messages in thread
From: Mark Wielaard @ 2021-09-25 11:53 UTC (permalink / raw)
  To: Philip Herron; +Cc: Arthur Cohen, gcc-rust, Thomas Schwinge

Hi Philip,

On Fri, Sep 24, 2021 at 12:01:42PM +0100, Philip Herron wrote:
> This is really useful information, will this mean that the lexer token will
> need to represent strings differently as well? Or is the std::string in the
> lexer still ok?

I think the respresentation as std::string is fine. As long as we
don't mix std::strings between different types (byte strings may
contain sequences of chars that aren't valid utf-8 sequenecs).

> The change you made above has the problem that reference types like, arrays
> are forms of what rust calls covariant types since they might contain an
> inference variable, so they require lookup to determine the base type. Its
> likely there is a reference cycle here. Though this change will not be
> correct for type checking purposes. The design of the type system is purely
> about rust type checking and inferring types.

OK, so how do I represent an reference to an array type that doesn't
contain any inference variables? When we see a b"hello" byte string
that is the same as seeing &[b'h', b'e', b'l', b'l', b'o'] which is
the same as seeing &[0x68u8, 0x65u8, 0x6cu8, 0x6cu8, 0x6fu8];

So we know this is &[u8;5] and if we write:

let a = b"hello";

We want to infer that a has type &[u8;5].

> So for example this change will break the case of:
> 
> ```
>   let a:str = "test";
> ```
> 
> Since the TypePath of str can't know the size of the expected array at
> compilation time. And the error message will end up with something like
> "expected str got [i8, 4]";

Right, but that is for "proper strings". It is somewhat unfortunate
that Rust calls byte strings also "strings", but they really
aren't. b"abc" is static array of u8, not a &str (containing utf-8).

I have to think about the slicing of "proper strings", which sound
more complicated than slicing of byte strings, because I don't think
you want to chop up a utf-8 sequence. For now I would simply try to
get the type of byte strings like b"test" correct.

Cheers,

Mark


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: byte/char string representation (Was: [PATCH] Fix byte char and byte string lexing code)
  2021-09-25 11:53             ` Mark Wielaard
@ 2021-09-30 10:46               ` Philip Herron
  2021-10-03 22:04                 ` Mark Wielaard
  0 siblings, 1 reply; 10+ messages in thread
From: Philip Herron @ 2021-09-30 10:46 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: Arthur Cohen, gcc-rust, Thomas Schwinge

[-- Attachment #1: Type: text/plain, Size: 2963 bytes --]

Hi Mark,

Thanks for clarifying this, I was getting mixed up between normal str's and
byte strings. Your patch was 99% of the way there to fix the type
resolution so I finished it off for you:

https://github.com/Rust-GCC/gccrs/pull/698/files

The missing piece was that References and Array's are a type of covariant
type so that an array type can look like this: [_, capacity], so the
inference variable here is the variant so that we need to make sure it has
its own implicit mapping id. You just needed to create one more mapping to
get that implicit id so that the reference type similarly doesn't get into
a loop of looking up itself. Creating implicit types like this could be
made easier, so we should likely add some helpers for this scenario.

Let me know what you think.

Thanks

--Phil

On Sat, 25 Sept 2021 at 12:53, Mark Wielaard <mark@klomp.org> wrote:

> Hi Philip,
>
> On Fri, Sep 24, 2021 at 12:01:42PM +0100, Philip Herron wrote:
> > This is really useful information, will this mean that the lexer token
> will
> > need to represent strings differently as well? Or is the std::string in
> the
> > lexer still ok?
>
> I think the respresentation as std::string is fine. As long as we
> don't mix std::strings between different types (byte strings may
> contain sequences of chars that aren't valid utf-8 sequenecs).
>
> > The change you made above has the problem that reference types like,
> arrays
> > are forms of what rust calls covariant types since they might contain an
> > inference variable, so they require lookup to determine the base type.
> Its
> > likely there is a reference cycle here. Though this change will not be
> > correct for type checking purposes. The design of the type system is
> purely
> > about rust type checking and inferring types.
>
> OK, so how do I represent an reference to an array type that doesn't
> contain any inference variables? When we see a b"hello" byte string
> that is the same as seeing &[b'h', b'e', b'l', b'l', b'o'] which is
> the same as seeing &[0x68u8, 0x65u8, 0x6cu8, 0x6cu8, 0x6fu8];
>
> So we know this is &[u8;5] and if we write:
>
> let a = b"hello";
>
> We want to infer that a has type &[u8;5].
>
> > So for example this change will break the case of:
> >
> > ```
> >   let a:str = "test";
> > ```
> >
> > Since the TypePath of str can't know the size of the expected array at
> > compilation time. And the error message will end up with something like
> > "expected str got [i8, 4]";
>
> Right, but that is for "proper strings". It is somewhat unfortunate
> that Rust calls byte strings also "strings", but they really
> aren't. b"abc" is static array of u8, not a &str (containing utf-8).
>
> I have to think about the slicing of "proper strings", which sound
> more complicated than slicing of byte strings, because I don't think
> you want to chop up a utf-8 sequence. For now I would simply try to
> get the type of byte strings like b"test" correct.
>
> Cheers,
>
> Mark
>
>

[-- Attachment #2: Type: text/html, Size: 3842 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: byte/char string representation (Was: [PATCH] Fix byte char and byte string lexing code)
  2021-09-30 10:46               ` Philip Herron
@ 2021-10-03 22:04                 ` Mark Wielaard
  0 siblings, 0 replies; 10+ messages in thread
From: Mark Wielaard @ 2021-10-03 22:04 UTC (permalink / raw)
  To: Philip Herron; +Cc: Arthur Cohen, gcc-rust, Thomas Schwinge

Hi Philip,

On Thu, Sep 30, 2021 at 11:46:30AM +0100, Philip Herron wrote:
> Your patch was 99% of the way there to fix the type
> resolution so I finished it off for you:
> 
> https://github.com/Rust-GCC/gccrs/pull/698/files

And you added the actual backend code needed! So I'll probably say I
was maybe 45% there :) This is really useful. It works for me and will
enable writing a couple of real tests for byte strings (like we have
for str already, that aren't just lexer/parser checks, but which can
actually check the contents of the byte strings.

> The missing piece was that References and Array's are a type of covariant
> type so that an array type can look like this: [_, capacity], so the
> inference variable here is the variant so that we need to make sure it has
> its own implicit mapping id. You just needed to create one more mapping to
> get that implicit id so that the reference type similarly doesn't get into
> a loop of looking up itself. Creating implicit types like this could be
> made easier, so we should likely add some helpers for this scenario.

Thanks, my knowledge of mappings was a bit fuzzy. So even though we
know our own array type is fixed, we still need to attach a unique
mapping to match against other types which might not be fixed (yet).

There is indeed a lot of extra work you need to do and that you can
easily get wrong. Creating an empty mapping is a lot of work and it is
easy to get the details right. An new_empty_mapping helper method
would be great. Maybe it could even be hidden inside the expression or
type constructor.

I do have a question about why we also create these mappings for the
capacity. The capacity is always a fixed usize number. Not just in
this case, but always since it is constant expression in a const
context. So any comparison of the length will always be a simple
constant usize number check. Is using mappings not overkill in that
case?

We use a Bexpression for the capacity, created by
ConstFold::ConstFoldExpr::fold, shouldn't we check it actually folded
correctly? Currently when it cannot we just leave it in an error state
till it is too late.

e.g

pub fn main()
{
  let _a: &[u8; 4 / 0];
  _a = b"test";
}

Results in:

rust1: internal compiler error: in const_size_val_to_string, at rust/rust-gcc.cc:210
0x794e67 Gcc_backend::const_size_val_to_string[abi:cxx11](Bexpression*)
	/home/mark/src/gccrs/gcc/rust/rust-gcc.cc:210
0x794e67 Gcc_backend::const_size_val_to_string[abi:cxx11](Bexpression*)
	/home/mark/src/gccrs/gcc/rust/rust-gcc.cc:208
0x8c51ef Rust::TyTy::ArrayType::capacity_string[abi:cxx11]() const
	/home/mark/src/gccrs/gcc/rust/typecheck/rust-tyty.cc:1136
0x8c51ef Rust::TyTy::ArrayType::as_string[abi:cxx11]() const
	/home/mark/src/gccrs/gcc/rust/typecheck/rust-tyty.cc:1129
0x8d0c8b Rust::TyTy::BaseCoercionRules::visit(Rust::TyTy::ArrayType&)
	/home/mark/src/gccrs/gcc/rust/typecheck/rust-tyty-coercion.h:166
[...]

So it seems we really should check for is_error_expr earlier.  And
maybe just get the integer value and store/compare that instead of the
Bexpression?

P.S. Don't let this stop you from committing the code as is, it does
work. We can rework it afterwards if the above comments make any
sense.

Cheers,

Mark


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-10-03 22:04 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-21 22:54 [PATCH] Fix byte char and byte string lexing code Mark Wielaard
2021-09-22  9:48 ` Thomas Schwinge
2021-09-22 20:37   ` Mark Wielaard
2021-09-23 11:43     ` Philip Herron
2021-09-23 14:10       ` Arthur Cohen
2021-09-23 20:53         ` byte/char string representation (Was: [PATCH] Fix byte char and byte string lexing code) Mark Wielaard
2021-09-24 11:01           ` Philip Herron
2021-09-25 11:53             ` Mark Wielaard
2021-09-30 10:46               ` Philip Herron
2021-10-03 22:04                 ` Mark Wielaard

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).