From: Mark Wielaard <mark@klomp.org>
To: Arthur Cohen <cohenarthur.dev@gmail.com>
Cc: Philip Herron <philip.herron@embecosm.com>,
gcc-rust@gcc.gnu.org, Thomas Schwinge <thomas@codesourcery.com>
Subject: byte/char string representation (Was: [PATCH] Fix byte char and byte string lexing code)
Date: Thu, 23 Sep 2021 22:53:30 +0200
Message-ID: <YUzpSop/pE8TVKlh@wildebeest.org>
In-Reply-To: <CAKNo5ARVKRpr378OsNRP5xU4QxpL9f98qtEbdGLSikUALfTQ+Q@mail.gmail.com>
On Thu, Sep 23, 2021 at 04:10:59PM +0200, Arthur Cohen wrote:
> > Something I was thinking about outside of the scope of that patch was
> > about the utf8 how do they get represented? Is it some kind of wchar_t?
>
> Do you mean in C++ or in rustc? In rustc, they are represented as Unicode
> Scalar Values which are 4 bytes wide.
>
> From the docs over here: [https://doc.rust-lang.org/std/primitive.char.html]
>
> So I'm assuming they could be represented as `int32_t`s which would also
> make sense for the check
Yes, for Rust characters a 32-bit type (I would pick uint32_t or maybe
char32_t) makes sense, since chars in Rust are almost equal to
Unicode code points (technically only 21 bits are used). Strictly
speaking a char is a Unicode scalar value, which excludes the
high-surrogate and low-surrogate code points, so the only valid values
are 0x0 to 0xD7FF and 0xE000 to 0x10FFFF.
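
A minimal sketch of that range check, assuming the lexer has already
decoded the value into a uint32_t. The helper name
is_unicode_scalar_value is hypothetical and not something that exists
in the gccrs tree:

```cpp
#include <cstdint>

// Hypothetical helper: true if cp is a Unicode scalar value, i.e. a code
// point that is neither a surrogate (0xD800..0xDFFF) nor above 0x10FFFF.
constexpr bool
is_unicode_scalar_value (uint32_t cp)
{
  return cp <= 0xD7FF || (cp >= 0xE000 && cp <= 0x10FFFF);
}

static_assert (is_unicode_scalar_value (0x41), "ASCII is valid");
static_assert (!is_unicode_scalar_value (0xD800), "surrogates are rejected");
static_assert (!is_unicode_scalar_value (0x110000), "too large is rejected");
```

Taking a uint32_t rather than a wchar_t keeps the check independent of
the host's wide character width.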
We should not use the C type wchar_t, because wchar_t is
implementation defined and can be 16 or 32 bits.
See also https://doc.rust-lang.org/reference/types/textual.html
But utf8 strings are made up of u8 "utf8 chars". You need 1 to 4 utf8
chars to encode a code point. https://en.wikipedia.org/wiki/UTF-8
We can use c++ strings made up of (8 bit) chars for that.
Our lexer should make sure we only accept valid rust characters or
utf-8 sequences.
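
To illustrate the 1 to 4 byte encoding, here is a rough sketch of
appending one scalar value to a std::string as UTF-8. The function name
append_utf8 is made up for this example and the code is not taken from
the gccrs lexer:

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: append the UTF-8 encoding of cp to out.
// Returns false (and appends nothing) for surrogates and for values
// above 0x10FFFF, since those are not Unicode scalar values.
inline bool
append_utf8 (std::string &out, uint32_t cp)
{
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return false;

  if (cp < 0x80)
    {
      // 1 byte: 0xxxxxxx
      out += static_cast<char> (cp);
    }
  else if (cp < 0x800)
    {
      // 2 bytes: 110xxxxx 10xxxxxx
      out += static_cast<char> (0xC0 | (cp >> 6));
      out += static_cast<char> (0x80 | (cp & 0x3F));
    }
  else if (cp < 0x10000)
    {
      // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
      out += static_cast<char> (0xE0 | (cp >> 12));
      out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char> (0x80 | (cp & 0x3F));
    }
  else
    {
      // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
      out += static_cast<char> (0xF0 | (cp >> 18));
      out += static_cast<char> (0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char> (0x80 | (cp & 0x3F));
    }
  return true;
}
```

For example, U+20AC (the euro sign) takes the three-byte branch and
is stored as the bytes 0xE2 0x82 0xAC.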
Note that the above doesn't hold for "byte chars" (b'A') or "byte
strings" (b"abc"). Those are really just u8 or [u8] arrays which hold
bytes (0x0 to 0xff).
We currently get the type for byte strings wrong. We pretend they are
&str, but they really should be &[u8].
I tried to fix that with the following:
diff --git a/gcc/rust/typecheck/rust-hir-type-check-expr.h b/gcc/rust/typecheck/rust-hir-type-check-expr.h
index fe8973a4d81..b0dd1c3ff2c 100644
--- a/gcc/rust/typecheck/rust-hir-type-check-expr.h
+++ b/gcc/rust/typecheck/rust-hir-type-check-expr.h
@@ -609,15 +609,42 @@ public:
break;
case HIR::Literal::LitType::BYTE_STRING: {
- /* We just treat this as a string, but it really is an arraytype of
- u8. It isn't in UTF-8, but really just a byte array. */
- TyTy::BaseType *base = nullptr;
- auto ok = context->lookup_builtin ("str", &base);
+ /* This is an arraytype of u8 reference (&[u8;size]). It isn't in
+ UTF-8, but really just a byte array. Code to construct the array
+ reference copied from ArrayElemsValues and ArrayType. */
+ TyTy::BaseType *u8;
+ auto ok = context->lookup_builtin ("u8", &u8);
rust_assert (ok);
+ auto crate_num = mappings->get_current_crate ();
+ Analysis::NodeMapping mapping (crate_num, UNKNOWN_NODEID,
+ mappings->get_next_hir_id (crate_num),
+ UNKNOWN_LOCAL_DEFID);
+
+ /* Capacity is the size of the string (number of chars).
+ It is a constant, but we fold it to get a Bexpression. */
+ std::string capacity_str = std::to_string (expr.as_string ().size ());
+ HIR::LiteralExpr literal_capacity (mapping, capacity_str,
+ HIR::Literal::LitType::INT,
+ PrimitiveCoreType::CORETYPE_USIZE,
+ expr.get_locus ());
+
+ // mark the type for this implicit node
+ context->insert_type (mapping,
+ new TyTy::USizeType (mapping.get_hirid ()));
+
+ Bexpression *capacity
+ = ConstFold::ConstFoldExpr::fold (&literal_capacity);
+
+ TyTy::ArrayType *array
+ = new TyTy::ArrayType (expr.get_mappings ().get_hirid (), capacity,
+ TyTy::TyVar (u8->get_ref ()));
+
+ context->insert_type (expr.get_mappings (), array);
+
infered
= new TyTy::ReferenceType (expr.get_mappings ().get_hirid (),
- TyTy::TyVar (base->get_ref ()), false);
+ TyTy::TyVar (array->get_ref ()), false);
}
break;
But that looks more complicated than is probably necessary, and it
doesn't work. When the type checker wants to print this type,
ReferenceType.as_string () goes into a loop for some reason.
Can anybody see what is wrong with the above?
Cheers,
Mark