UTF8 in identifiers

public inbox for gcc@gcc.gnu.org

* UTF8 in identifiers
@ 1998-05-01 12:29 Martin von Loewis
  1998-05-01 14:59 ` Per Bothner
  0 siblings, 1 reply; 4+ messages in thread
From: Martin von Loewis @ 1998-05-01 12:29 UTC
  To: egcs

After reading gxxint, I see that the mangling format already provides
for Unicode in identifiers. This is good.

However, I wonder whether it might be better to choose a different
solution: mangle Unicode characters as UTF-8. This would give a couple
of advantages:
- It applies to C as well. The mangling approach cannot be extended to
  C without introducing ambiguities. However, C9X has the same kind of
  Unicode support as ISO C++.
- It is more compact. The current mangling consumes 5 bytes per
  character, plus one per identifier.
- It supports \U escapes as well. This is not an important issue, as
  the current mangling can be extended to \U escapes, and because
  those escapes will be rare in the next few years.
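
To make the compactness point concrete, here is a rough size comparison
in Python. It takes the figures above at face value (5 bytes per
non-ASCII character, plus 1 byte per identifier that contains any) and
compares them with the length of the UTF-8 encoding. The function names
are made up for illustration, and the count only covers characters that
fit in 16 bits; treat it as a sketch, not as a description of what the
compiler actually emits.

```python
def escaped_size(name: str) -> int:
    """Size in bytes under the assumed current scheme.

    Assumption: each non-ASCII character costs 5 bytes and an identifier
    containing at least one such character costs 1 extra byte.
    """
    non_ascii = sum(1 for ch in name if ord(ch) > 0x7F)
    ascii_count = len(name) - non_ascii
    marker = 1 if non_ascii else 0
    return ascii_count + 5 * non_ascii + marker


def utf8_size(name: str) -> int:
    """Size in bytes when the identifier is simply stored as UTF-8."""
    return len(name.encode("utf-8"))


for name in ("counter", "zähler", "π"):
    print(name, escaped_size(name), utf8_size(name))
```

For plain ASCII names the two sizes are identical; the difference only
shows up once non-ASCII characters are present.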

There is one drawback, of course: UTF-8 is illegal in most assemblers.
So in order to implement this, I would have to start with binutils. As
far as I can tell, only gas needs to be changed - ld already handles 8
bit characters in symbols.

On platforms where the GNU binutils are not used, one would still have
to go with the current mechanism, so I would put UTF8_IN_IDENTIFIERS
into gcc, similar to DOLLAR_IN_IDENTIFIERS.
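
A minimal sketch of how such a switch might behave, written in Python
rather than as real gcc code. The flag name mirrors the proposed
UTF8_IN_IDENTIFIERS; the function emit_symbol and the "_xxxx" fallback
escape are hypothetical stand-ins, not the format documented in gxxint.

```python
# Hypothetical stand-in for the proposed per-target setting.
UTF8_IN_IDENTIFIERS = True


def emit_symbol(name: str, utf8_ok: bool = UTF8_IN_IDENTIFIERS) -> bytes:
    """Return the bytes that would be written to the assembler file."""
    if utf8_ok:
        # Assembler accepts 8-bit symbol names: pass UTF-8 through.
        return name.encode("utf-8")
    # Fallback for assemblers that only accept ASCII symbol names.
    # Placeholder escape (underscore plus 4 hex digits); it is not
    # collision-free and only illustrates the shape of the idea.
    parts = []
    for ch in name:
        code = ord(ch)
        if code < 0x80:
            parts.append(ch)
        elif code <= 0xFFFF:
            parts.append("_%04x" % code)
        else:
            raise ValueError("character outside the 16-bit range")
    return "".join(parts).encode("ascii")
```

With the flag on, emit_symbol("zähler") yields the UTF-8 bytes of the
name; with utf8_ok=False it yields an ASCII-only spelling instead.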

Before starting to work on it, I'd like to know what people think
about this proposal.

TIA,
Martin


* Re: UTF8 in identifiers
  1998-05-01 12:29 UTF8 in identifiers Martin von Loewis
@ 1998-05-01 14:59 ` Per Bothner
  1998-05-06  6:22   ` Gerald Pfeifer
  0 siblings, 1 reply; 4+ messages in thread
From: Per Bothner @ 1998-05-01 14:59 UTC
  To: Martin von Loewis; +Cc: egcs

> After reading gxxint, I see that the mangling format already provides
> for Unicode in identifiers.

Well, it is not really implemented yet, except perhaps some bits and pieces.

> However, I wonder whether it might be better to choose a different
> solution: mangle Unicode characters as UTF-8.

Yes, that is a better solution, and I have been tempted towards it.

The problem, as you point out:

> There is one drawback, of course: UTF-8 is illegal in most assemblers.

Extending gas would not be that difficult.  The problem is:  Can we
require gas?  I don't think we are ready for that:  We need to use
gcc on targets to which gas has not been ported yet.

> - It supports \U escapes as well. This is not an important issue, as
>   the current mangling can be extended to \U escapes, and because
>   those escapes will be rare in the next few years.

I'm not sure what you mean.  You use \U escapes to specify Unicode
characters in *source code* (and possibly assembly code); it has
nothing to do with mangling.

> Before starting to work on it, I'd like to know what people think
> about this proposal.

Well, whether or not we use UTF8 for mangling, I still think we
should make gas UTF8-aware, even if we don't immediately make the
compiler take advantage of it.

I would like:
1) Make sure gas and bfd are 8-bit clean for identifier names.
2) Agree that source characters with the high-bit set are interpreted as UTF-8.
3) Change gas to handle \uXXXX and \UXXXXXXXX escapes in names,
and to generate the corresponding UTF-8 sequence.
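
As a rough illustration of point 3, the following Python sketch expands
\uXXXX and \UXXXXXXXX escapes in a symbol name and returns the UTF-8
bytes. The function name expand_name_escapes is invented for this
example; it says nothing about how gas itself would be structured.

```python
import re

# \u takes exactly 4 hex digits, \U exactly 8.
_NAME_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def expand_name_escapes(name: str) -> bytes:
    """Replace \\u and \\U escapes in a symbol name, then encode as UTF-8."""

    def to_char(match):
        digits = match.group(1) or match.group(2)
        # chr() raises ValueError for values above U+10FFFF.
        return chr(int(digits, 16))

    return _NAME_ESCAPE.sub(to_char, name).encode("utf-8")
```

A name written as z\u00e4hler and a name that already contains the
character directly end up as the same byte sequence, which is the
property points 2 and 3 together are after.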

And one more small feature:

4) \ followed by a non-alphanumeric and not inside a string literal
means that the following character is treated as part of an identifier
(i.e. as if it were a letter).
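
A small Python sketch of the quoting rule in point 4. It assumes that
ordinary identifier characters are letters, digits and underscore,
which is a simplification; scan_identifier is a hypothetical helper,
not an existing gas routine.

```python
def scan_identifier(text: str, pos: int = 0):
    """Scan an identifier starting at pos.

    Returns (identifier, index of the first character not consumed).
    A backslash followed by a non-alphanumeric character adds that
    character to the identifier as if it were a letter.
    """
    chars = []
    i = pos
    while i < len(text):
        ch = text[i]
        if ch.isalnum() or ch == "_":
            chars.append(ch)
            i += 1
        elif ch == "\\" and i + 1 < len(text) and not text[i + 1].isalnum():
            chars.append(text[i + 1])
            i += 2
        else:
            break
    return "".join(chars), i
```

For example, scanning foo\+bar followed by a space gives the
identifier foo+bar and stops at the space. A backslash followed by a
letter (such as \u) is deliberately left alone here, since that case
belongs to the escape handling in point 3.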

One other suggestion:

5) In a string literal, a \u or \U escape generates a UTF-8 sequence,
while an octal or hex escape generates a single byte with the specified value.
Thus "\u00FF" translates to { 0xC3, 0xBF } while "\xFF" or "\377"
translate to { 0xFF }.  This is different from what Java does (whose
Strings are Unicode strings), but seems to make sense for byte strings.
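
The distinction in point 5 can be sketched in Python as follows. The
helper string_literal_bytes is hypothetical; it only handles \u, \U,
\x and octal escapes, ignores everything else a real assembler would
have to deal with (such as an escaped backslash), and limits \x to two
hex digits as a simplifying assumption.

```python
import re

_STRING_ESCAPE = re.compile(
    r"\\u([0-9A-Fa-f]{4})"      # \uXXXX     -> UTF-8 sequence
    r"|\\U([0-9A-Fa-f]{8})"     # \UXXXXXXXX -> UTF-8 sequence
    r"|\\x([0-9A-Fa-f]{1,2})"   # \xHH       -> one raw byte
    r"|\\([0-7]{1,3})"          # \ooo       -> one raw byte
)


def string_literal_bytes(body: str) -> bytes:
    """Translate the body of a string literal into the bytes to emit."""
    out = bytearray()
    pos = 0
    for m in _STRING_ESCAPE.finditer(body):
        # Text between escapes is encoded as UTF-8.
        out += body[pos:m.start()].encode("utf-8")
        u4, u8, hx, octal = m.groups()
        if u4 or u8:
            out += chr(int(u4 or u8, 16)).encode("utf-8")
        elif hx:
            out.append(int(hx, 16))
        else:
            # Octal values above 0o377 are cut to the low byte here,
            # which is an arbitrary choice for this sketch.
            out.append(int(octal, 8) & 0xFF)
        pos = m.end()
    out += body[pos:].encode("utf-8")
    return bytes(out)
```

With this, \u00FF produces the two bytes 0xC3 0xBF, while \xFF and
\377 each produce the single byte 0xFF, matching the example above.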

	--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner


* Re: UTF8 in identifiers
  1998-05-01 14:59 ` Per Bothner
@ 1998-05-06  6:22   ` Gerald Pfeifer
  1998-05-06  9:19     ` Per Bothner
  0 siblings, 1 reply; 4+ messages in thread
From: Gerald Pfeifer @ 1998-05-06  6:22 UTC
  To: Per Bothner; +Cc: Martin von Loewis, egcs

On Fri, 1 May 1998, Per Bothner wrote:
> Extending gas would not be that difficult.  The problem is:  Can we
> require gas?  I don't think we are ready for that:  We need to use
> gcc on targets to which gas has not been ported yet.

I think egcs shouldn't become too expensive in terms of prerequisites. 

IMHO a regular "user" should be able to build egcs without having to
install gmake, gas/binutils and the like, _unless_ the native tools
are really seriously broken.

Gerald
-- 
Gerald Pfeifer (Jerry)      Vienna University of Technology
pfeifer@dbai.tuwien.ac.at   http://www.dbai.tuwien.ac.at/~pfeifer/



* Re: UTF8 in identifiers
  1998-05-06  6:22   ` Gerald Pfeifer
@ 1998-05-06  9:19     ` Per Bothner
  0 siblings, 0 replies; 4+ messages in thread
From: Per Bothner @ 1998-05-06  9:19 UTC
  To: Gerald Pfeifer; +Cc: egcs

> IMHO a regular "user" should be able to build egcs without having to
> install gmake, gas/binutils and the like, _unless_ the native tools
> are really seriously broken.

But it might be acceptable to say: "If you want identifiers with non-ASCII
(Unicode) characters, you need to install gas".

	--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner

