From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 23 Sep 2021 22:53:30 +0200
From: Mark Wielaard <mark@klomp.org>
To: Arthur Cohen <cohenarthur.dev@gmail.com>
Cc: Philip Herron <philip.herron@embecosm.com>, gcc-rust@gcc.gnu.org,
 Thomas Schwinge <thomas@codesourcery.com>
Subject: byte/char string representation (Was: [PATCH] Fix byte char and byte
 string lexing code)
Message-ID: <YUzpSop/pE8TVKlh@wildebeest.org>
References: <20210921225430.166550-1-mark@klomp.org>
 <87k0j9ym7r.fsf@euler.schwinge.homeip.net>
 <YUuUEnzWnUMmjFhg@wildebeest.org>
 <CAB2u+n2vGM32BKmb0yJnZMJirxPtKFzximSH8YU4M_SiKpHOnw@mail.gmail.com>
 <CAKNo5ARVKRpr378OsNRP5xU4QxpL9f98qtEbdGLSikUALfTQ+Q@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAKNo5ARVKRpr378OsNRP5xU4QxpL9f98qtEbdGLSikUALfTQ+Q@mail.gmail.com>

On Thu, Sep 23, 2021 at 04:10:59PM +0200, Arthur Cohen wrote:
> > Something I was thinking about outside of the scope of that patch was
> > about the utf8 how do they get represented? Is it some kind of wchar_t?
> 
> Do you mean in C++ or in rustc? In rustc, they are represented as Unicode
> Scalar Values which are 4 bytes wide.
> 
> From the docs over here: [https://doc.rust-lang.org/std/primitive.char.html]
> 
> So I'm assuming they could be represented as `int32_t`s which would also
> make sense for the check

Yes, for rust characters a 32-bit type (I would pick uint32_t or maybe
char32_t) makes sense, since chars in rust are almost the same as
Unicode code points (which need only 21 bits). Strictly speaking a
rust char is a Unicode scalar value, which excludes the high-surrogate
and low-surrogate code points, so the only valid values are 0x0 to
0xD7FF and 0xE000 to 0x10FFFF.
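
A minimal sketch of that range check, in case it helps the discussion.
The helper name is_unicode_scalar_value is made up for illustration and
is not an existing gccrs function:

```cpp
#include <cstdint>

// Hypothetical helper, not part of gccrs: true if C is a Unicode scalar
// value, i.e. a code point outside the surrogate range 0xD800..0xDFFF
// and not beyond 0x10FFFF.
constexpr bool
is_unicode_scalar_value (std::uint32_t c)
{
  return c <= 0xD7FF || (c >= 0xE000 && c <= 0x10FFFF);
}

static_assert (is_unicode_scalar_value (0x41), "'A' is a valid char");
static_assert (!is_unicode_scalar_value (0xD800), "surrogates are excluded");
static_assert (!is_unicode_scalar_value (0x110000), "beyond U+10FFFF");
```

Something like this could be applied to the value of a char literal or
a \u{...} escape before it is accepted.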

We should not use the C type wchar_t, because wchar_t is
implementation defined and can be 16 or 32 bits.
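
As a small sketch of the alternative: char32_t is specified to have the
same size as std::uint_least32_t, so it always has room for the 21 bits
we need. The alias name rust_codepoint below is hypothetical, not a
gccrs name:

```cpp
#include <climits>
#include <cstdint>

// Hypothetical alias, not a gccrs name: storage for one rust char.
using rust_codepoint = char32_t;

// Unlike wchar_t, char32_t is guaranteed to be at least 32 bits wide.
static_assert (sizeof (rust_codepoint) * CHAR_BIT >= 32,
	       "char32_t holds at least 32 bits");
static_assert (sizeof (rust_codepoint) == sizeof (std::uint_least32_t),
	       "char32_t has the size of uint_least32_t");
```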

See also https://doc.rust-lang.org/reference/types/textual.html

But utf8 strings are made up of u8 code units ("utf8 chars"). You need
1 to 4 of them to encode a code point. https://en.wikipedia.org/wiki/UTF-8
We can use C++ strings made up of (8-bit) chars for that.
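
To make the 1 to 4 byte encoding concrete, here is a rough sketch that
appends one code point to a std::string. The function name append_utf8
is an assumption for illustration, not something that exists in gccrs:

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper, not part of gccrs: append the UTF-8 encoding of the
// code point C to OUT.  Returns false (and appends nothing) if C is not a
// Unicode scalar value.
bool
append_utf8 (std::uint32_t c, std::string &out)
{
  if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return false;

  if (c < 0x80)
    out += static_cast<char> (c); // 1 byte: 0xxxxxxx
  else if (c < 0x800)
    {
      // 2 bytes: 110xxxxx 10xxxxxx
      out += static_cast<char> (0xC0 | (c >> 6));
      out += static_cast<char> (0x80 | (c & 0x3F));
    }
  else if (c < 0x10000)
    {
      // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
      out += static_cast<char> (0xE0 | (c >> 12));
      out += static_cast<char> (0x80 | ((c >> 6) & 0x3F));
      out += static_cast<char> (0x80 | (c & 0x3F));
    }
  else
    {
      // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
      out += static_cast<char> (0xF0 | (c >> 18));
      out += static_cast<char> (0x80 | ((c >> 12) & 0x3F));
      out += static_cast<char> (0x80 | ((c >> 6) & 0x3F));
      out += static_cast<char> (0x80 | (c & 0x3F));
    }
  return true;
}
```

For example, U+20AC (the euro sign) would come out as the three bytes
0xE2 0x82 0xAC.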

Our lexer should make sure we only accept valid rust characters or
utf-8 sequences.
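
A sketch of what such a check could look like for a whole string, under
the assumption that the lexer has the raw bytes in a std::string. The
name is_valid_utf8 is hypothetical and this is not how the gccrs lexer
is currently written:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of gccrs: true if S is well-formed UTF-8,
// i.e. every sequence is complete, in shortest form, and decodes to a
// Unicode scalar value.
bool
is_valid_utf8 (const std::string &s)
{
  std::size_t i = 0;
  while (i < s.size ())
    {
      unsigned char b = static_cast<unsigned char> (s[i]);
      std::size_t len;
      std::uint32_t c, min;
      if (b < 0x80) { len = 1; c = b; min = 0; }
      else if ((b & 0xE0) == 0xC0) { len = 2; c = b & 0x1F; min = 0x80; }
      else if ((b & 0xF0) == 0xE0) { len = 3; c = b & 0x0F; min = 0x800; }
      else if ((b & 0xF8) == 0xF0) { len = 4; c = b & 0x07; min = 0x10000; }
      else
	return false; // stray continuation byte or invalid lead byte

      if (s.size () - i < len)
	return false; // truncated sequence

      for (std::size_t k = 1; k < len; k++)
	{
	  unsigned char cont = static_cast<unsigned char> (s[i + k]);
	  if ((cont & 0xC0) != 0x80)
	    return false; // not a 10xxxxxx continuation byte
	  c = (c << 6) | (cont & 0x3F);
	}

      // Reject overlong encodings, surrogates and values past 0x10FFFF.
      if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
	return false;
      i += len;
    }
  return true;
}
```

The min check is what rejects overlong forms such as 0xC0 0x80, which
would otherwise decode to 0x0.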

Note that the above doesn't hold for "byte chars" (b'A') or "byte
strings" (b"abc"). A byte char is really just a u8 and a byte string
is an array of u8, both holding plain bytes (0x0 to 0xff).

We currently get the type for byte strings wrong. We pretend they are
&str, but they really should be a reference to a u8 array, &[u8; N].

I tried to fix that with the following:

diff --git a/gcc/rust/typecheck/rust-hir-type-check-expr.h b/gcc/rust/typecheck/rust-hir-type-check-expr.h
index fe8973a4d81..b0dd1c3ff2c 100644
--- a/gcc/rust/typecheck/rust-hir-type-check-expr.h
+++ b/gcc/rust/typecheck/rust-hir-type-check-expr.h
@@ -609,15 +609,42 @@ public:
 	break;
 
 	case HIR::Literal::LitType::BYTE_STRING: {
-	  /* We just treat this as a string, but it really is an arraytype of
-	     u8. It isn't in UTF-8, but really just a byte array.  */
-	  TyTy::BaseType *base = nullptr;
-	  auto ok = context->lookup_builtin ("str", &base);
+	  /* This is an arraytype of u8 reference (&[u8;size]). It isn't in
+	     UTF-8, but really just a byte array. Code to construct the array
+	     reference copied from ArrayElemsValues and ArrayType. */
+	  TyTy::BaseType *u8;
+	  auto ok = context->lookup_builtin ("u8", &u8);
 	  rust_assert (ok);
 
+	  auto crate_num = mappings->get_current_crate ();
+	  Analysis::NodeMapping mapping (crate_num, UNKNOWN_NODEID,
+					 mappings->get_next_hir_id (crate_num),
+					 UNKNOWN_LOCAL_DEFID);
+
+	  /* Capacity is the size of the string (number of chars).
+	     It is a constant, but we fold it to get a Bexpression.  */
+	  std::string capacity_str = std::to_string (expr.as_string ().size ());
+	  HIR::LiteralExpr literal_capacity (mapping, capacity_str,
+					     HIR::Literal::LitType::INT,
+					     PrimitiveCoreType::CORETYPE_USIZE,
+					     expr.get_locus ());
+
+	  // mark the type for this implicit node
+	  context->insert_type (mapping,
+				new TyTy::USizeType (mapping.get_hirid ()));
+
+	  Bexpression *capacity
+	    = ConstFold::ConstFoldExpr::fold (&literal_capacity);
+
+	  TyTy::ArrayType *array
+	    = new TyTy::ArrayType (expr.get_mappings ().get_hirid (), capacity,
+				   TyTy::TyVar (u8->get_ref ()));
+
+	  context->insert_type (expr.get_mappings (), array);
+
 	  infered
 	    = new TyTy::ReferenceType (expr.get_mappings ().get_hirid (),
-				       TyTy::TyVar (base->get_ref ()), false);
+				       TyTy::TyVar (array->get_ref ()), false);
 	}
 	break;
 
But that looks more complicated than is probably necessary, and it
doesn't work. When the type checker wants to print this type,
ReferenceType.as_string () goes into a loop for some reason.

Can anybody see what is wrong with the above?

Cheers,

Mark