* UTF8 in identifiers
@ 1998-05-01 12:29 Martin von Loewis
1998-05-01 14:59 ` Per Bothner
0 siblings, 1 reply; 4+ messages in thread
From: Martin von Loewis @ 1998-05-01 12:29 UTC (permalink / raw)
To: egcs
After reading gxxint, I see that the mangling format already provides
for Unicode in identifiers. This is good.
However, I wonder whether it might be better to choose a different
solution: mangle Unicode characters as UTF-8. This would give a couple
of advantages:
- It applies to C as well. The mangling approach cannot be extended to
C without introducing ambiguities. However, C9X has the same kind of
Unicode support as ISO C++.
- It is more compact. The current mangling consumes 5 bytes per
character, plus one per identifier.
- It supports \U escapes as well. This is not an important issue, as
the current mangling can be extended to \U escapes, and because
those escapes will be rare in the next few years.
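To put rough numbers on the compactness point, here is a minimal sketch in C. It takes the figures above at face value and assumes they mean: ASCII characters stay one byte, each non-ASCII character costs 5 bytes, and an identifier containing any non-ASCII character pays 1 extra byte. The exact gxxint encoding is not reproduced, the function names are made up for illustration, and the UTF-8 lengths follow the modern 4-byte limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bytes UTF-8 needs for one code point (1 to 4), or 0 if the
   value is a surrogate or lies beyond U+10FFFF. */
size_t utf8_len(uint32_t cp)
{
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp < 0x10000) return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

/* Bytes used if every character of the identifier is emitted as UTF-8. */
size_t cost_utf8(const uint32_t *id, size_t n)
{
    size_t total = 0;
    assert(id != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += utf8_len(id[i]);
    return total;
}

/* Bytes under the ASSUMED reading of the current scheme: 1 per ASCII
   character, 5 per non-ASCII character, plus 1 per identifier that
   contains at least one non-ASCII character. */
size_t cost_current(const uint32_t *id, size_t n)
{
    size_t total = 0;
    int has_non_ascii = 0;
    assert(id != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (id[i] < 0x80) {
            total += 1;
        } else {
            total += 5;
            has_non_ascii = 1;
        }
    }
    return total + (has_non_ascii ? 1 : 0);
}
```

For the five-character identifier größe (code points 'g', 'r', U+00F6, U+00DF, 'e'), cost_utf8 gives 7 bytes and cost_current gives 14 under these assumptions.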
There is one drawback, of course: UTF-8 is illegal in most assemblers.
So in order to implement this, I would have to start with binutils. As
far as I can tell, only gas needs to be changed; ld already handles
8-bit characters in symbols.
On platforms where the GNU binutils are not used, one would still have
to go with the current mechanism, so I would put UTF8_IN_IDENTIFIERS
into gcc, similar to DOLLAR_IN_IDENTIFIERS.
Before starting to work on it, I'd like to know what people think
about this proposal.
TIA,
Martin
* Re: UTF8 in identifiers
1998-05-01 12:29 UTF8 in identifiers Martin von Loewis
@ 1998-05-01 14:59 ` Per Bothner
1998-05-06 6:22 ` Gerald Pfeifer
0 siblings, 1 reply; 4+ messages in thread
From: Per Bothner @ 1998-05-01 14:59 UTC (permalink / raw)
To: Martin von Loewis; +Cc: egcs
> After reading gxxint, I see that the mangling format already provides
> for Unicode in identifiers.
Well, it is not really implemented yet, except perhaps some bits and pieces.
> However, I wonder whether it might be better to choose a different
> solution: mangle Unicode characters as UTF-8.
Yes, that is a better solution, and I have been tempted towards it.
The problem, as you point out:
> There is one drawback, of course: UTF-8 is illegal in most assemblers.
Extending gas would not be that difficult. The problem is: Can we
require gas? I don't think we are ready for that: We need to use
Gcc on targets to which Gas has not been ported yet.
> - It supports \U escapes as well. This is not an important issue, as
> the current mangling can be extended to \U escapes, and because
> those escapes will be rare in the next few years.
I'm not sure what you mean. You use \U escapes to specify Unicode
characters in *source code* (and possibly assembly code); it has
nothing to do with mangling.
> Before starting to work on it, I'd like to know what people think
> about this proposal.
Well, whether or not we use UTF8 for mangling, I still think we
should make gas UTF8-aware, even if we don't immediately make the
compiler take advantage of it.
I would like:
1) Make sure gas and bfd are 8-bit clean for identifier names.
2) Agree that source characters with the high-bit set are interpreted as UTF-8.
3) Change Gas to handle \uXXXX and \UXXXXXXXX escapes in names,
and to generate corresponding UTF-8 sequence.
And one more small feature:
4) \ followed by a non-alphanumeric character, outside a string literal,
means that the following character is treated as part of an identifier
(i.e. as if it were a letter).
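As a rough illustration of points 1, 2 and 4, a name scanner along these lines might look like the sketch below. This is not gas code: scan_name and is_name_byte are hypothetical names, the set of plain name characters (letters, digits, '_', '.', '$') is an assumption, and the rule that a name may not start with a digit is ignored. Bytes with the high bit set are passed through untouched on the assumption that they belong to UTF-8 sequences, and \u or \U escapes (point 3) are left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Bytes accepted unescaped in a name under this sketch: ASCII letters,
   digits, '_', '.', '$', and any byte with the high bit set (assumed
   to be part of a UTF-8 sequence). */
static int is_name_byte(unsigned char c)
{
    return c >= 0x80 || isalnum(c) || c == '_' || c == '.' || c == '$';
}

/* Scan one name starting at src and copy it, NUL-terminated, into out
   (capacity cap bytes).  A backslash followed by a non-alphanumeric
   byte makes that byte part of the name.  Returns the number of source
   bytes consumed; 0 means no name starts here or it did not fit. */
size_t scan_name(const char *src, char *out, size_t cap)
{
    size_t i = 0, n = 0;
    assert(src != NULL && out != NULL && cap > 0);
    for (;;) {
        unsigned char c = (unsigned char)src[i];
        if (c == '\\' && src[i + 1] != '\0'
            && !isalnum((unsigned char)src[i + 1])) {
            c = (unsigned char)src[i + 1];   /* escaped punctuation */
            i += 2;
        } else if (is_name_byte(c)) {
            i += 1;
        } else {
            break;                           /* end of the name */
        }
        if (n + 1 >= cap)
            return 0;                        /* name too long for out */
        out[n++] = (char)c;
    }
    out[n] = '\0';
    return i;
}
```

With this sketch, scanning the text foo\+bar consumes 8 source bytes and yields the 7-byte name foo+bar.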
One other suggestion:
5) In a string literal, a \u or \U escape generates a UTF-8 sequence,
while an octal or hex escape generates a single byte with the specified value.
Thus "\u00FF" translates to { 0xC3, 0xBF } while "\xFF" or "\377"
translate to { 0xFF }. This is different from what Java does (whose
Strings are Unicode strings), but seems to make sense for byte strings.
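A minimal sketch of suggestion 5 in C, to make the distinction concrete. It is an illustration only, not gas code: decode_literal, read_digits and utf8_encode are hypothetical names, \x is assumed to take at most two hex digits, other escapes such as \n are rejected for brevity, and the encoder follows the modern 4-byte UTF-8 limit. The same code-point-to-UTF-8 step is what point 3 would need for names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out (room for 4 bytes).
   Returns the byte count, or 0 for a surrogate or a value > U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0;
    size_t len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    if (len == 1) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    /* Continuation bytes carry 6 bits each, filled from the end. */
    for (size_t i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    /* Lead byte: marker 0xC0, 0xE0 or 0xF0 plus the remaining bits. */
    out[0] = (unsigned char)(((0xF00u >> len) & 0xFF) | cp);
    return len;
}

/* Value of a hex digit, or -1 if c is not one. */
static int hex_val(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Read between min and max digits of base 8 or 16 at *p, advancing *p.
   Returns 0 on success, -1 if fewer than min digits were found. */
static int read_digits(const char **p, int base, int min, int max,
                       uint32_t *val)
{
    int n = 0;
    *val = 0;
    while (n < max) {
        int v = hex_val((unsigned char)**p);
        if (v < 0 || v >= base)
            break;
        *val = *val * (uint32_t)base + (uint32_t)v;
        (*p)++;
        n++;
    }
    return n >= min ? 0 : -1;
}

/* Decode the body of a string literal (quotes already removed) into
   bytes.  Returns the byte count, or (size_t)-1 on a malformed or
   unsupported escape or if out (capacity cap) is too small. */
size_t decode_literal(const char *src, unsigned char *out, size_t cap)
{
    size_t n = 0;
    assert(src != NULL && out != NULL);
    while (*src != '\0') {
        unsigned char buf[4];
        size_t len = 1;
        uint32_t v = 0;
        if (*src != '\\') {
            buf[0] = (unsigned char)*src++;
        } else if (src[1] == 'u' || src[1] == 'U') {
            int digits = (src[1] == 'u') ? 4 : 8;
            src += 2;
            if (read_digits(&src, 16, digits, digits, &v) != 0)
                return (size_t)-1;
            len = utf8_encode(v, buf);   /* code point -> UTF-8 bytes */
            if (len == 0)
                return (size_t)-1;
        } else if (src[1] == 'x') {
            src += 2;
            if (read_digits(&src, 16, 1, 2, &v) != 0)
                return (size_t)-1;
            buf[0] = (unsigned char)v;   /* one raw byte, no UTF-8 */
        } else if (src[1] >= '0' && src[1] <= '7') {
            src += 1;
            if (read_digits(&src, 8, 1, 3, &v) != 0 || v > 0xFF)
                return (size_t)-1;
            buf[0] = (unsigned char)v;   /* one raw byte, no UTF-8 */
        } else {
            return (size_t)-1;           /* other escapes: not handled */
        }
        if (n + len > cap)
            return (size_t)-1;
        for (size_t i = 0; i < len; i++)
            out[n++] = buf[i];
    }
    return n;
}
```

Called on the literal body \u00FF this writes the two bytes 0xC3 0xBF, while \xFF and \377 each write the single byte 0xFF, matching the example above.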
--Per Bothner
Cygnus Solutions bothner@cygnus.com http://www.cygnus.com/~bothner
* Re: UTF8 in identifiers
1998-05-01 14:59 ` Per Bothner
@ 1998-05-06 6:22 ` Gerald Pfeifer
1998-05-06 9:19 ` Per Bothner
0 siblings, 1 reply; 4+ messages in thread
From: Gerald Pfeifer @ 1998-05-06 6:22 UTC (permalink / raw)
To: Per Bothner; +Cc: Martin von Loewis, egcs
On Fri, 1 May 1998, Per Bothner wrote:
> Extending gas would not be that difficult. The problem is: Can we
> require gas? I don't think we are ready for that: We need to use
> Gcc on targets to which Gas has not been ported yet.
I think egcs shouldn't become too expensive in terms of prerequisites.
IMHO a regular "user" should be able to build egcs without having to
install gmake, gas/binutils and the like, _unless_ the native tools
are really seriously broken.
Gerald
--
Gerald Pfeifer (Jerry) Vienna University of Technology
pfeifer@dbai.tuwien.ac.at http://www.dbai.tuwien.ac.at/~pfeifer/