From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <mark@klomp.org>
Received: from gnu.wildebeest.org (wildebeest.demon.nl [212.238.236.112])
 by sourceware.org (Postfix) with ESMTPS id 0E5E63858404
 for <gcc-rust@gcc.gnu.org>; Sat, 25 Sep 2021 11:53:10 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 0E5E63858404
Authentication-Results: sourceware.org;
 dmarc=none (p=none dis=none) header.from=klomp.org
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=klomp.org
Received: from reform (deer0x0a.wildebeest.org [172.31.17.140])
 (using TLSv1.2 with cipher ADH-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by gnu.wildebeest.org (Postfix) with ESMTPSA id 78D6230008AE;
 Sat, 25 Sep 2021 13:53:08 +0200 (CEST)
Received: by reform (Postfix, from userid 1000)
 id 234752E80BE3; Sat, 25 Sep 2021 13:53:08 +0200 (CEST)
Date: Sat, 25 Sep 2021 13:53:08 +0200
From: Mark Wielaard <mark@klomp.org>
To: Philip Herron <philip.herron@embecosm.com>
Cc: Arthur Cohen <cohenarthur.dev@gmail.com>, gcc-rust@gcc.gnu.org,
 Thomas Schwinge <thomas@codesourcery.com>
Subject: Re: byte/char string representation (Was: [PATCH] Fix byte char and
 byte string lexing code)
Message-ID: <YU8NpNfIgnJRoxbi@wildebeest.org>
References: <20210921225430.166550-1-mark@klomp.org>
 <87k0j9ym7r.fsf@euler.schwinge.homeip.net>
 <YUuUEnzWnUMmjFhg@wildebeest.org>
 <CAB2u+n2vGM32BKmb0yJnZMJirxPtKFzximSH8YU4M_SiKpHOnw@mail.gmail.com>
 <CAKNo5ARVKRpr378OsNRP5xU4QxpL9f98qtEbdGLSikUALfTQ+Q@mail.gmail.com>
 <YUzpSop/pE8TVKlh@wildebeest.org>
 <CAB2u+n11kpd0KwsZZu6cXCsqHcALmFRfQOiQA=L1NRW8-faCjQ@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAB2u+n11kpd0KwsZZu6cXCsqHcALmFRfQOiQA=L1NRW8-faCjQ@mail.gmail.com>
X-Spam-Status: No, score=-5.1 required=5.0 tests=BAYES_00, KAM_DMARC_STATUS,
 SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc-rust@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: gcc-rust mailing list <gcc-rust.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-rust>,
 <mailto:gcc-rust-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-rust/>
List-Post: <mailto:gcc-rust@gcc.gnu.org>
List-Help: <mailto:gcc-rust-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-rust>,
 <mailto:gcc-rust-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Sat, 25 Sep 2021 11:53:11 -0000

Hi Philip,

On Fri, Sep 24, 2021 at 12:01:42PM +0100, Philip Herron wrote:
> This is really useful information, will this mean that the lexer token will
> need to represent strings differently as well? Or is the std::string in the
> lexer still ok?

I think the representation as std::string is fine, as long as we
don't mix std::strings between different types (byte strings may
contain sequences of chars that aren't valid utf-8 sequences).
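
To make that concrete, here is a minimal sketch in plain Rust (not
gccrs code; the helper name is_valid_utf8 is made up for illustration)
showing a byte string literal whose contents cannot be a &str:

```rust
// Hypothetical helper: does this byte sequence decode as UTF-8?
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

fn main() {
    // Plain ASCII bytes are also valid UTF-8.
    assert!(is_valid_utf8(b"hello"));
    // \xff can never appear in well-formed UTF-8, but a byte string
    // literal accepts it without complaint.
    assert!(!is_valid_utf8(b"\xff\xfe"));
}
```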

> The change you made above has the problem that reference types like, arrays
> are forms of what rust calls covariant types since they might contain an
> inference variable, so they require lookup to determine the base type. Its
> likely there is a reference cycle here. Though this change will not be
> correct for type checking purposes. The design of the type system is purely
> about rust type checking and inferring types.

OK, so how do I represent a reference to an array type that doesn't
contain any inference variables? When we see a b"hello" byte string
that is the same as seeing &[b'h', b'e', b'l', b'l', b'o'] which is
the same as seeing &[0x68u8, 0x65u8, 0x6cu8, 0x6cu8, 0x6fu8].

So we know this is &[u8;5] and if we write:

let a = b"hello";

We want to infer that a has type &[u8;5].
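
As a small sketch of what that means in plain Rust (the helper
takes_five is hypothetical, it only exists to pin down the expected
type), all three spellings should be accepted where a &[u8; 5] is
required and should compare equal:

```rust
// Hypothetical helper: only accepts a reference to a 5-element u8 array.
fn takes_five(bytes: &[u8; 5]) -> usize {
    bytes.len()
}

fn main() {
    let a = b"hello";
    let b = &[b'h', b'e', b'l', b'l', b'o'];
    let c = &[0x68u8, 0x65u8, 0x6cu8, 0x6cu8, 0x6fu8];

    // No annotation on `a`: the literal itself has type &[u8; 5].
    assert_eq!(takes_five(a), 5);
    assert_eq!(a, b);
    assert_eq!(b, c);
}
```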

> So for example this change will break the case of:
> 
> ```
>   let a:str = "test";
> ```
> 
> Since the TypePath of str can't know the size of the expected array at
> compilation time. And the error message will end up with something like
> "expected str got [i8, 4]";

Right, but that is for "proper strings". It is somewhat unfortunate
that Rust calls byte strings also "strings", but they really
aren't. b"abc" is a reference to a static array of u8, not a &str
(containing utf-8).

I have to think about the slicing of "proper strings", which sounds
more complicated than slicing of byte strings, because I don't think
you want to chop up a utf-8 sequence. For now I would simply try to
get the type of byte strings like b"test" correct.
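
To illustrate the difference, a rough sketch in plain Rust (the helper
checked_prefix is made up for this example): slicing a str in the
middle of a multi-byte sequence is refused, while the underlying bytes
can be cut anywhere.

```rust
// Hypothetical helper: returns the prefix only if `end` falls on a
// char boundary, otherwise None.
fn checked_prefix(s: &str, end: usize) -> Option<&str> {
    s.get(0..end)
}

fn main() {
    // U+00E9 takes two bytes in UTF-8 (0xc3 0xa9), so this str has
    // five chars but six bytes.
    let s = "h\u{e9}llo";
    assert_eq!(s.len(), 6);

    // Byte offset 2 is inside the two-byte sequence.
    assert_eq!(checked_prefix(s, 2), None);
    assert_eq!(checked_prefix(s, 3), Some("h\u{e9}"));

    // The same cut on the raw bytes is fine.
    let bytes = s.as_bytes();
    assert_eq!(&bytes[0..2], &[0x68u8, 0xc3u8][..]);
}
```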

Cheers,

Mark