From: Philip Herron
Date: Thu, 30 Sep 2021 11:46:30 +0100
Subject: Re: byte/char string representation (Was: [PATCH] Fix byte char and byte string lexing code)
To: Mark Wielaard
Cc: Arthur Cohen, gcc-rust@gcc.gnu.org, Thomas Schwinge
References: <20210921225430.166550-1-mark@klomp.org> <87k0j9ym7r.fsf@euler.schwinge.homeip.net>
List-Id: gcc-rust mailing list

Hi Mark,

Thanks for clarifying this; I was getting mixed up between normal strs and byte strings. Your patch was 99% of the way there to fix the type resolution, so I finished it off for you:

https://github.com/Rust-GCC/gccrs/pull/698/files

The missing piece was that references and arrays are covariant types: an array type can look like [_, capacity], where the inference variable (the _) is the part that varies, so we need to make sure it has its own implicit mapping id. You just needed to create one more mapping to get that implicit id, so that the reference type similarly doesn't get into a loop of looking itself up. Creating implicit types like this could be made easier, so we should probably add some helpers for this scenario.

Let me know what you think.

Thanks

--Phil

On Sat, 25 Sept 2021 at 12:53, Mark Wielaard <mark@klomp.org> wrote:
Hi Philip,

On Fri, Sep 24, 2021 at 12:01:42PM +0100, Philip Herron wrote:
> This is really useful information, will this mean that the lexer token will
> need to represent strings differently as well? Or is the std::string in the
> lexer still ok?

I think the representation as std::string is fine, as long as we
don't mix std::strings between different types (byte strings may
contain sequences of chars that aren't valid UTF-8 sequences).
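
To illustrate the point about byte strings, here is a minimal Rust sketch (plain Rust, not gccrs code; the function names are made up for illustration). It shows that a byte string literal may hold bytes that are not valid UTF-8, which is why its contents should not be mixed with those of a str:

```rust
/// A byte string literal may contain arbitrary bytes.
/// The byte 0xff never occurs in valid UTF-8.
/// (Hypothetical helper, for illustration only.)
fn non_utf8_bytes() -> &'static [u8; 2] {
    b"\xff\xfe"
}

/// Returns true if the given bytes form a valid UTF-8 sequence.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
```

Calling is_valid_utf8(non_utf8_bytes()) returns false, while is_valid_utf8(b"hello") returns true.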

> The change you made above has the problem that reference types, like arrays,
> are forms of what Rust calls covariant types since they might contain an
> inference variable, so they require lookup to determine the base type. It's
> likely there is a reference cycle here. Though this change will not be
> correct for type checking purposes. The design of the type system is purely
> about Rust type checking and inferring types.

OK, so how do I represent a reference to an array type that doesn't
contain any inference variables? When we see a b"hello" byte string,
that is the same as seeing &[b'h', b'e', b'l', b'l', b'o'], which is
the same as seeing &[0x68u8, 0x65u8, 0x6cu8, 0x6cu8, 0x6fu8].

So we know this is &[u8;5] and if we write:

let a = b"hello";

We want to infer that a has type &[u8;5].
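
The equivalence described above can be checked with a short Rust sketch. This is plain Rust, not gccrs internals, and the function names are hypothetical. The explicit `&[u8; 5]` annotations only compile if each form really has that type:

```rust
/// Returns the byte string literal with its type spelled out explicitly.
/// (Hypothetical helper, for illustration only.)
fn hello_bytes() -> &'static [u8; 5] {
    // No annotation on `a`: the compiler infers &[u8; 5] from the literal.
    let a = b"hello";
    a
}

/// Checks that the three spellings from the text denote the same value.
fn forms_agree() -> bool {
    let from_literal: &[u8; 5] = b"hello";
    let from_byte_chars: &[u8; 5] = &[b'h', b'e', b'l', b'l', b'o'];
    let from_hex: &[u8; 5] = &[0x68u8, 0x65u8, 0x6cu8, 0x6cu8, 0x6fu8];
    from_literal == from_byte_chars && from_byte_chars == from_hex
}
```

In hello_bytes the binding `a` carries no annotation, so returning it as `&'static [u8; 5]` type-checks only because that is the type inferred for the literal.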

> So for example this change will break the case of:
>
> ```
>   let a:str = "test";
> ```
>
> Since the TypePath of str can't know the size of the expected array at
> compilation time, the error message will end up with something like
> "expected str got [i8, 4]";

Right, but that is for "proper strings". It is somewhat unfortunate
that Rust also calls byte strings "strings", but they really
aren't. b"abc" is a reference to a static array of u8, not a &str
(containing UTF-8).

I have to think about the slicing of "proper strings", which sounds
more complicated than slicing of byte strings, because I don't think
you want to chop up a UTF-8 sequence. For now I would simply try to
get the type of byte strings like b"test" correct.
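
As a rough illustration of why slicing "proper strings" is harder, the following Rust sketch (plain Rust with hypothetical helper names, not a proposal for the gccrs implementation) contrasts the two cases. A str slice has to land on UTF-8 character boundaries, whereas a byte slice only needs to be in bounds:

```rust
/// Slices a str by byte range. Returns None if the range is out of bounds
/// or if either end falls inside a multi-byte UTF-8 sequence.
fn checked_str_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

/// Slices raw bytes by range. Only a bounds check is needed here;
/// there is no notion of a character boundary.
fn byte_slice(bytes: &[u8], start: usize, end: usize) -> Option<&[u8]> {
    bytes.get(start..end)
}
```

For the string "h\u{e9}llo", where U+00E9 is encoded as the two bytes 0xc3 0xa9, the range 0..2 ends inside that encoding, so checked_str_slice returns None, while byte_slice on the same bytes returns the two bytes 0x68 0xc3.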

Cheers,

Mark
