public inbox for gcc@gcc.gnu.org
* revised proposal for GCC and non-Ascii source files
@ 1998-12-28 17:59 Paul Eggert
  1998-12-29  1:38 ` Martin von Loewis
                   ` (4 more replies)
  0 siblings, 5 replies; 31+ messages in thread
From: Paul Eggert @ 1998-12-28 17:59 UTC (permalink / raw)
  To: rms, zack, bothner, amylaar, martin, gcc2, egcs

Here is a revised version of the proposal I sent on 12-21 in reaction
to martin's proposed patch for GCC and UTF-8.  I've tried to accommodate
everyone's comments and concerns, by making the following changes:

 a. Assembler text is always Ascii; see (6) and (10) below.

 b. Assembler identifiers are translated to UTF-8 only if the new
    -funify-names option is in effect; see (8) and (9) below.

 c. There is no preprocessor directive specifying the charset, as this
    causes too many conceptual and implementation problems; see (R1) below.

I also added more detail (e.g. the new GNUC charset) and a rationale.

FIXME: the .uXXXX, .UXXXXXXXX and .xXX escape sequences described in
(9) and (10) below are reported to not work for C++ mangled names; I
don't fully understand the problem, though, so I haven't fixed this.

----------

 1. Determining the input charset.

    The input character set FOO can be specified by the new `-charset
    FOO' compile-time option.  The default charset is determined from
    the locale, which is specified in the usual way with the LC_ALL,
    LC_CTYPE, and LANG environment variables.  To determine the
    default charset from the locale, GCC uses setlocale (LC_CTYPE, "")
    and nl_langinfo (CODESET) if these two functions are available and
    succeed; otherwise the default charset is GNUC.

 2. In the GNUC charset, each input byte is a character, each
    non-Ascii byte is allowed in an identifier, string, or comment,
    and each \u and \U escape is equivalent to the corresponding UTF-8
    multibyte sequence.

 3. For non-GNUC charsets, GCC uses the compilation host's iconv
    function to determine character boundaries.

 4. If the compilation host lacks iconv, GCC supports only the GNUC
    charset; however, if the installer wants to build a compiler that
    knows about foreign encodings (e.g. for cross-compilation), we
    supply an easy way to use glibc's iconv.  We can remove the
    existing local_mblen function and friends, as they're no longer
    needed.

 5. GCC translates each \u and \U escape in a string to a character in
    the input charset.  For non-GNUC charsets, the translation uses
    iconv; hence if no character corresponds to the \u or \U escape,
    GCC translates it to the same substitute character that iconv uses.

 6. After the translation in (5) (and after processing the other
    escapes like \n), GCC copies the contents of strings straight
    through to the assembler.  As with GCC 2.8, GCC uses backslashes
    to escape string bytes like \ and ", and bytes with values greater
    than 127.

 7. For diagnostics (and for all identifier output other than
    assembler), GCC translates \u and \U escapes in identifiers to the
    default charset using iconv; hence iconv's substitute character is
    used for untranslatable escapes.

 8. The internal charset of assembler identifiers is either UTF-8 or
    the input charset, depending on the value of the new -funify-names
    option (with inverse -fno-unify-names).  The default value of this
    option depends on the platform and the language; it is on for Java
    regardless of platform, and off for C and C++ on GNU platforms.
    This option controls how identifiers (after any name mangling) are
    canonicalized and translated to assembler identifiers internally.

 9. With -funify-names, identifiers (including their \u and \U
    escapes) are translated to UTF-8 internally; if the input charset
    is not a subset of UTF-8, any extra information is lost.  With
    -fno-unify-names, each \u and \U escape in identifiers is
    translated to the input charset, if the corresponding character
    exists; otherwise, it is canonicalized by converting all its
    hexadecimal digits to upper case and by converting `\U0000XXXX' to
    `\uXXXX', and is then made safe for the assembler by translating
    the leading `\' to `.'.

10. After the translation described in (9), assembler identifiers are
    output with escapes.  The escape for any byte outside the set
    $.0-9A-Z_a-z (Ascii) is `.xXX', where XX is the byte's lower case
    hexadecimal code.
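
To make (10) concrete, here is a minimal sketch in C of the escaping it
describes.  This is an illustration only, not proposed GCC code, and the
function name is invented:

#include <stdio.h>

/* Write NAME to OUT with the `.xXX' escapes of (10): bytes in the set
   $.0-9A-Z_a-z pass through; every other byte becomes `.x' followed by
   its two-digit lower case hexadecimal code.  */
static void
output_escaped_asm_name (const char *name, FILE *out)
{
  const unsigned char *p;

  for (p = (const unsigned char *) name; *p; p++)
    {
      if (*p == '$' || *p == '.' || *p == '_'
          || (*p >= '0' && *p <= '9')
          || (*p >= 'A' && *p <= 'Z')
          || (*p >= 'a' && *p <= 'z'))
        putc (*p, out);
      else
        fprintf (out, ".x%02x", *p);
    }
}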

----------

Properties of this proposal:

 A. The assembly language output is always Ascii.

 B. The assembler needn't know about encodings.

 C. If the -funify-names option is in effect, you can link together
    source files written in different locales even if their identifiers
    contain non-Ascii characters.

----------

Rationale (numbers like `R1' correspond to proposal numbers like `1' above):

 R0.  Why don't we just standardize on UTF-8?

      Currently, most text files do not use UTF-8, and many important
      tools (including Emacs 20.3) do not support UTF-8.  On the other
      hand, many text files and tools use encodings like ISO 8859 and
      Shift-JIS that are incompatible with UTF-8.  For some time to come,
      non-UTF-8 encodings will remain in widespread use, and hence GCC
      should support them if feasible.

 R1.  Why is there no `#pragma charset FOO' or `_Pragma ("charset FOO")'?

      If a _Pragma ("charset FOO") directive is in the expansion of a
      macro, either directly or indirectly, the charset of the rest of
      that macro expansion would be undefined, since it would be read
      in one charset but macro-processed in another.  A similar
      problem would occur if a charset pragma is in an ignored section
      of text -- i.e. it is #ifdef'ed or #if'ed out, or it is in a
      macro argument that is not used.  To be portable, a section of
      text with undefined charset would have to use only characters
      from the "C" charset, and would not be able to use \u or \U
      escapes if the interpretation of those escapes affects the
      meaning of the program.  These rules would be tricky to
      implement and, worse, would be hard to explain.

      Another problem with having directives specify charset is that
      if you translate a source file from one charset to another, you
      have to remember to update its charset directives.  (A similar
      problem occurs no matter what method is used to specify charset,
      of course, so this particular objection is not fatal.)

 R2.  Why isn't the GNUC charset UTF-8?

      The GNUC charset is more permissive than UTF-8: it allows any
      encoding that does not use Ascii bytes within multibyte
      characters.  This includes not only UTF-8, but also ISO 8859 and
      EUC.  Hence by default GCC will handle many popular encodings
      without any need for the user to specify an encoding.  If the
      default encoding were UTF-8, GCC would have to reject most valid
      programs that used non-UTF-8 encodings, which would mean that
      more users would have to worry about encodings.

 R3a. Why must GCC worry about character boundaries in non-GNUC charsets?

      Some non-GNUC multibyte charsets (e.g. Shift-JIS) contain Ascii bytes
      within multibyte characters.

 R3b. Why use iconv and not mblen to determine multibyte character boundaries?

      GCC must use iconv to translate characters, since the mblen
      family cannot translate.  It is more consistent to use iconv to
      also determine character boundaries; this avoids configuration
      problems where iconv and mblen inadvertently disagree.  For
      example, iconv is configured by charset name, whereas mblen is
      configured by locale name, and it's possible for the two
      configurations to be inconsistent.

 R9.  GCC normally uses iconv to translate \u and \U escapes to the
      input charset.  Why doesn't it do this for identifiers when the
      -fno-unify-names option is in effect?

      Draft C9x requires that, for example, \u00b5 (MICRO SIGN) and
      \u00b7 (MIDDLE DOT) must be distinct identifiers, even if the
      input charset cannot represent those two characters.  If GCC
      used iconv to translate those two escapes, it could translate
      them both to the same substitute character.

R10.  Why aren't assembler identifiers output as-is, instead of being escaped?

      Many assemblers do not allow identifiers that contain UTF-8 or
      other encodings.  It is low priority for GCC to make non-Ascii
      assembly-language identifiers easy to read; it is simpler and more
      portable for GCC to use an Ascii encoding for such identifiers.

      Perhaps some hosts will use a different convention, and will
      require non-escaped assembler identifiers; if so, we'll modify
      GCC to follow the host convention as needed.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1998-12-28 17:59 revised proposal for GCC and non-Ascii source files Paul Eggert
@ 1998-12-29  1:38 ` Martin von Loewis
  1998-12-29  1:39 ` Martin von Loewis
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 31+ messages in thread
From: Martin von Loewis @ 1998-12-29  1:38 UTC (permalink / raw)
  To: eggert; +Cc: rms, zack, bothner, amylaar, gcc2, egcs

> 10. After the translation described in (9), assembler identifiers are
>     output with escapes.  The escape for any byte outside the set
>     $.0-9A-Z_a-z (Ascii) is `.xXX', where XX is the byte's lower case
>     hexadecimal code.

What do we do with platforms that have NO_DOT_IN_LABELS?

IMHO, they either lose, or we try $. If they also have
NO_DOLLAR_IN_LABELS, they should definitely lose (i.e. identifiers
with funny characters are rejected).

Martin

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1998-12-28 17:59 revised proposal for GCC and non-Ascii source files Paul Eggert
  1998-12-29  1:38 ` Martin von Loewis
@ 1998-12-29  1:39 ` Martin von Loewis
  1998-12-29  5:53   ` Paul Eggert
  1998-12-29  1:50 ` Martin von Loewis
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 31+ messages in thread
From: Martin von Loewis @ 1998-12-29  1:39 UTC (permalink / raw)
  To: eggert; +Cc: rms, zack, bothner, amylaar, gcc2, egcs


> Here is a revised version of the proposal I sent on 12-21 in reaction
> to martin's proposed patch for GCC and UTF-8.

Thanks for this very elaborate proposal. It sounds good to me. I have
some concerns, which I'll split into separate messages.

> FIXME: the .uXXXX, .UXXXXXXXX and .xXX escape sequences described in
> (9) and (10) below are reported to not work for C++ mangled names; I
> don't fully understand the problem, though, so I haven't fixed this.

The problem really is the choice of escape character. The requirement
simply is: Different C++ objects need to have different assembler
names. Consider

class F{
  static int u00C0;
};

This is mangled as '_1F.u00C0'. Originally, I thought I could create
another C++ object with the same mangled name (containing À), but this
is not the case.

There still is a conflict with

extern "C" void _1F\u00C0();

Maybe it is OK to reserve this name for the implementation...

Regards,
Martin

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1998-12-28 17:59 revised proposal for GCC and non-Ascii source files Paul Eggert
  1998-12-29  1:38 ` Martin von Loewis
  1998-12-29  1:39 ` Martin von Loewis
@ 1998-12-29  1:50 ` Martin von Loewis
  1998-12-29  5:41   ` Paul Eggert
  1998-12-30  5:21 ` Richard Stallman
  1998-12-30 14:58 ` Zack Weinberg
  4 siblings, 1 reply; 31+ messages in thread
From: Martin von Loewis @ 1998-12-29  1:50 UTC (permalink / raw)
  To: eggert; +Cc: rms, zack, bothner, amylaar, gcc2, egcs

>  3. For non-GNUC charsets, GCC uses the compilation host's iconv
>     function to determine character boundaries.

How do you do that? Looking at the iconv(3) documentation, I could not
find out how this could work.

Also, how do you detect that you are in ASCII mode in a MBCS encoding?
Consider

void <Escape to JIS X0201>how_many_<yen sign>();<ESCAPE to ASCII>

IMHO, this is illegal: the parens after the function name are not
ASCII characters. However, iconv (and mblen) will tell us that the
character has a single byte, and it will look like ASCII '('.

Regards,
Martin

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1998-12-29  1:50 ` Martin von Loewis
@ 1998-12-29  5:41   ` Paul Eggert
  0 siblings, 0 replies; 31+ messages in thread
From: Paul Eggert @ 1998-12-29  5:41 UTC (permalink / raw)
  To: martin; +Cc: rms, zack, bothner, amylaar, gcc2, egcs

   Date: Tue, 29 Dec 1998 10:45:38 +0100
   From: Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de>

   >  3. For non-GNUC charsets, GCC uses the compilation host's iconv
   >     function to determine character boundaries.

   How do you do that?

Ooops.  I thought I could ask iconv to translate one byte, and if that
reports an incomplete character then two bytes, and so forth.  But I
just tried to write code to do this and you're correct, it doesn't
work (at least on Solaris 2.6) since e.g. for ISO-2022 iconv reports a
complete ESC character when given the single ESC byte at the start of
a shift sequence, which is not what is wanted here.

Hence we can't use iconv to find character boundaries; we must use
mbrlen.  Unfortunately, there is no way to go from the charset name to
the locale (which mbrlen needs).  Therefore, GCC must be given the
locale name rather than the charset name.  E.g. on Solaris 2.6 the
user will specify an option like `-ctype ja' instead of `-charset eucJP'.

Among other things, this means that the `GNUC charset' in my proposal
will have to become the `GNUC locale'.

   Also, how do you detect that you are in ASCII mode in a MBCS
   encoding?

I think we can use mbsinit for that.  However, to avoid further
embarrassment, I will write some code to test this idea before putting
out the next revision of my proposal.
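
For concreteness, here is a minimal sketch of the kind of check in
question, using the Amendment-1/C9x restartable functions mbrlen and
mbsinit.  It is an illustration only (not the code promised above), and
the function name is invented:

#include <string.h>
#include <wchar.h>

/* Walk the LEN bytes of BUF with mbrlen, in the current LC_CTYPE
   locale, and count the multibyte characters they form; return -1 on
   an invalid sequence.  A single byte can be treated as a plain Ascii
   byte only when mbsinit reports that we are back in the initial shift
   state after scanning it.  */
static long
count_mb_chars (const char *buf, size_t len)
{
  mbstate_t state;
  size_t i = 0, n;
  long count = 0;

  memset (&state, 0, sizeof state);  /* start in the initial shift state */
  while (i < len)
    {
      n = mbrlen (buf + i, len - i, &state);
      if (n == (size_t) -1)
        return -1;                   /* invalid multibyte sequence */
      if (n == (size_t) -2)
        break;                       /* incomplete character at end of BUF */
      if (n == 0)
        n = 1;                       /* embedded null byte */
      i += n;
      count++;
    }
  /* If BUF ends inside a shift sequence, report an error as well.  */
  return mbsinit (&state) ? count : -1;
}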

If the host does not support mbrlen, mbsinit, and iconv, then GCC will
support only the GNUC locale.  For cross-compilation, we can provide a
way to substitute glibc mbrlen, mbsinit, and iconv if the host doesn't
have those functions but we need them to compile for the target.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1998-12-29  1:39 ` Martin von Loewis
@ 1998-12-29  5:53   ` Paul Eggert
  1998-12-29  6:22     ` Martin von Loewis
  0 siblings, 1 reply; 31+ messages in thread
From: Paul Eggert @ 1998-12-29  5:53 UTC (permalink / raw)
  To: martin; +Cc: rms, zack, bothner, amylaar, gcc2, egcs

   Date: Tue, 29 Dec 1998 10:34:39 +0100
   From: Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de>

   class F{
     static int u00C0;
   };
   This is mangled as '_1F.u00C0'....  There still is a conflict with
   extern "C" void _1F\u00C0();

OK, suppose we translate `\u00C0' to `..u00c0' instead?  If we use two
dots, does that avoid collisions with C++ name mangling?  If that
doesn't work, perhaps you can suggest an escape sequence that does work,
e.g. by using a name that can be reserved by the implementation
(e.g. `.__u00c0').

   Date: Tue, 29 Dec 1998 10:37:52 +0100
   From: Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de>

   What do we do with platforms that have NO_DOT_IN_LABELS?

   IMHO, they either lose, or we try $. If they also have
   NO_DOLLAR_IN_LABELS, they should definitely lose (i.e. identifiers
   with funny characters are rejected).

What does C++ name mangling do on platforms with NO_DOT_IN_LABELS
and/or NO_DOLLAR_IN_LABELS?  If C++ name mangling tries `.', then `$',
and then fails, then non-ASCII name mangling should be able to do the
same thing.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1998-12-29  5:53   ` Paul Eggert
@ 1998-12-29  6:22     ` Martin von Loewis
  1998-12-31 13:55       ` Paul Eggert
  0 siblings, 1 reply; 31+ messages in thread
From: Martin von Loewis @ 1998-12-29  6:22 UTC (permalink / raw)
  To: eggert; +Cc: rms, zack, bothner, amylaar, gcc2, egcs

> OK, suppose we translate `\u00C0' to `..u00c0' instead?

That would work, at first glance.

> If that doesn't work, perhaps you can suggest an escape sequence
> that does work,

This is exactly the problem: I could not come up with a good solution;
all 'funny' characters are already taken.

> What does C++ name mangling do on platforms with NO_DOT_IN_LABELS
> and/or NO_DOLLAR_IN_LABELS?  If C++ name mangling tries `.', then `$',
> and then fails, then non-ASCII name mangling should be able to do the
> same thing.

C++ tries '$', then '.', then '_'. Because the latter may give
conflicts, we put __static_ in front of the entire identifier
(i.e. __static_1F_u00C0).

Regards,
Martin

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1998-12-28 17:59 revised proposal for GCC and non-Ascii source files Paul Eggert
                   ` (2 preceding siblings ...)
  1998-12-29  1:50 ` Martin von Loewis
@ 1998-12-30  5:21 ` Richard Stallman
  1998-12-31 15:55   ` Paul Eggert
  1998-12-30 14:58 ` Zack Weinberg
  4 siblings, 1 reply; 31+ messages in thread
From: Richard Stallman @ 1998-12-30  5:21 UTC (permalink / raw)
  To: eggert; +Cc: zack, bothner, amylaar, martin, gcc2, egcs

Your proposal seems pretty good, but it's missing one very important
feature: a warning when the locale affects the result but was not
explicitly specified.  That warning is very important.  Depending on
the environment is unreliable, so we need to discourage people from
doing that *before* they learn the hard way.


Your arguments about "#pragma charset" are based on obscure ways of
using such a construct.  That is letting the tail wag the dog.
Instead why not just clip off the tail?

	  If a _Pragma ("charset FOO") directive is in the expansion of a
	  macro,

Solution A: use #charset rather than #pragma charset.
Then the issue of _Pragma does not even arise.

Solution B: use #pragma charset but don't allow _Pragma ("charset FOO").

      A similar
	  problem would occur if a charset pragma is in an ignored section
	  of text

We could simply forbid this usage and say that #charset must be used
at the beginning of the file only.  Nothing but whitespace (including
comments) should be allowed preceding #charset.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1998-12-28 17:59 revised proposal for GCC and non-Ascii source files Paul Eggert
                   ` (3 preceding siblings ...)
  1998-12-30  5:21 ` Richard Stallman
@ 1998-12-30 14:58 ` Zack Weinberg
  1998-12-31 14:28   ` Paul Eggert
  1998-12-31 15:13   ` problems with C9x's _Pragma Paul Eggert
  4 siblings, 2 replies; 31+ messages in thread
From: Zack Weinberg @ 1998-12-30 14:58 UTC (permalink / raw)
  To: Paul Eggert; +Cc: rms, bothner, amylaar, martin, gcc2, egcs, zack

I only have a couple comments:

- C9x CD2 unambiguously says \u and \U escapes are to be treated as Unicode. 
It also disallows these escapes for certain ranges of Unicode which
encompass all of 7-bit ASCII.  That being so, I propose to encode \u and \U
in UTF-8 always.  This can be done regardless of the availability of
translation libraries.  Assuming cpp and cc1 will take any character with
the high bit set in an identifier, we need only add parsing support to cpp
to make this work.

The only issue is unification of a \u escape for symbol X with the same
symbol natively represented in the input encoding.  I'm not sure what the
right way to deal with that is.

- #pragma charset is easily implementable in cpplib (not sure about cccp)
provided we accept constraints on #pragma/_Pragma().  I posted some lengthy
discussion of this last week, but to sum: pragmas affecting the preprocessor
(which this is) cannot be expressed with _Pragma() at all, and neither
#pragma nor _Pragma() can appear in a position that is inconvenient to the
parser -- I think that will translate to "must look like a C
statement-or-declaration".

zw

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1998-12-29  6:22     ` Martin von Loewis
@ 1998-12-31 13:55       ` Paul Eggert
  1999-01-31 23:58         ` Zack Weinberg
  1999-01-31 23:58         ` Martin v. Loewis
  0 siblings, 2 replies; 31+ messages in thread
From: Paul Eggert @ 1998-12-31 13:55 UTC (permalink / raw)
  To: martin; +Cc: rms, zack, bothner, amylaar, gcc2, egcs

   Date: Tue, 29 Dec 1998 15:14:53 +0100
   From: Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de>

   C++ tries '$', then '.', then '_'. Because the latter may give
   conflicts, we put __static_ in front of the entire identifier
   (i.e. __static_1F_u00C0).

OK, here's a revised version of the name Ascization part of that
proposal, which should be safe in the presence of C++ name mangling.
Sorry, it's a bit complicated, but I don't see a way to simplify it
without introducing other disadvantages.


 9. With -funify-names, identifiers (including their UCNs)
    are translated to UTF-8 internally; if the input ctype
    is not a subset of UTF-8, any extra information is lost, and if
    the input ctype is GNUC only UCNs are translated.  With
    -fno-unify-names, each UCN in identifiers is translated to the
    input ctype internally, if the corresponding character
    exists; otherwise, it is canonicalized by converting all its
    hexadecimal digits to upper case and by converting `\U0000XXXX' to
    `\uXXXX'.  Two identifiers are considered to be the same if and only
    if their internal representations are identical.

10. After the translation described in (9), assembler identifiers are
    output with escape bytes if necessary.  If an assembler identifier
    contains characters that are not allowed by the platform, the
    following steps are done (in order) before the identifier is
    output:

     . Each instance of the escape byte is doubled.  The escape byte
       is `.' if `.' and `$' are both allowed in identifiers, and is
       `V' otherwise.

     . The `\' of each UCN is replaced by the escape byte.
       This can occur only if -fno-unify-names is in effect, since
       -funify-names never puts UCNs into an internal identifier.

     . Each byte in a disallowed character is replaced by its 2-digit
       hexadecimal code in upper case, prefixed by the escape byte.

     . If the escape byte is `V', then `__9V_' is prepended to the
       entire identifier.

    Normally the Ascii bytes $.0-9A-Z_a-z are allowed in assembler
    identifiers, but some platforms prohibit `$' or `.', and some
    platforms allow any nonzero byte.

----------

For example, suppose

 . The input encoding is ISO 8859-1.
 . @ represents the character MICRO SIGN (Unicode `00B5', UTF-8 `C2 B5',
   ISO 8859-1 `B5').
 . # represents the character GEORGIAN CAPITAL LETTER AN (Unicode `10A0',
   UTF-8 `E1 82 A0', no ISO 8859-1 representation).

If -fno-unify-names is in effect, the identifier
`p@q\u00b5r\U000010a0sVt' is represented internally as
`p@q@r\u10A0sVt'; note that the internal @ uses a one-byte ISO 8859-1
representation.  If @ and \ are not allowed by the platform, the
identifier output is `p.B5q.B5r.u10A0sVt' if the escape character
is `.', and is `__9V_pVB5qVB5rVu10A0sVVt' otherwise.

If -funify-names is in effect, the same identifier is represented
internally as `p@q@r#sVt'; note that the internal @ and # use
multibyte UTF-8 representations.  If @ and # are not allowed by the
platform, the identifier output is `p.C2.B5q.C2.B5r.E1.82.A0sVt' if
the escape character is `.', and is
`__9V_pVC2VB5qVC2VB5rVE1V82VA0sVVt' otherwise.
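
For illustration, the steps in (10) can be sketched in C as follows.
This is not GCC code; the function names are invented, DOT_OK and
DOLLAR_OK stand for whatever platform test is really used, the "any
nonzero byte" case is not modelled, and the identifier's multibyte
characters are assumed to contain no Ascii bytes:

#include <stdio.h>

static int
allowed_byte (unsigned char c, int dot_ok, int dollar_ok)
{
  return (c == '_'
          || (c >= '0' && c <= '9')
          || (c >= 'A' && c <= 'Z')
          || (c >= 'a' && c <= 'z')
          || (dot_ok && c == '.')
          || (dollar_ok && c == '$'));
}

static void
output_ascized_name (const char *name, int dot_ok, int dollar_ok, FILE *out)
{
  const unsigned char *p;
  int esc = (dot_ok && dollar_ok) ? '.' : 'V';
  int needs_escaping = 0;

  for (p = (const unsigned char *) name; *p; p++)
    if (!allowed_byte (*p, dot_ok, dollar_ok))
      needs_escaping = 1;

  if (!needs_escaping)
    {
      fputs (name, out);        /* identifiers with only allowed bytes
                                   are output unchanged */
      return;
    }

  if (esc == 'V')
    fputs ("__9V_", out);       /* step 4: prefix when `V' is the escape */
  for (p = (const unsigned char *) name; *p; p++)
    {
      if (*p == esc)
        {
          putc (esc, out);      /* step 1: double each escape byte */
          putc (esc, out);
        }
      else if (*p == '\\')
        putc (esc, out);        /* step 2: `\' of a UCN becomes the escape */
      else if (allowed_byte (*p, dot_ok, dollar_ok))
        putc (*p, out);
      else
        fprintf (out, "%c%02X", esc, *p);  /* step 3: hex-code the byte */
    }
}

Feeding it the internal identifiers from the examples above reproduces
the outputs shown there.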


----------

Properties of this proposal: ...

 D. With the GNUC ctype, or any UTF-8 ctype, -funify-names has no effect.

 E. The internal form of assembler identifiers can never contain a null
    byte, and therefore `.00' (or `V00' if the escape byte is `V') can
    never appear in an assembler identifier.

----------

Rationale ...

R10b. Why does the escape convention for outputting identifiers in Ascii
      depend on whether the assembler allows `$' in identifiers?

      C++ name mangling uses `$' if available, otherwise `.' if
      available, otherwise `_' (prepending `__static_' to the entire
      identifier).  If `$' not available but `.' is available, then
      identifier Ascization must not use `.', as that would clash with
      C++ name mangling.  If `$' and `.' are both available, it is
      safe for Ascization to use `.', since C++ name mangling uses `$'
      in that case.  It is preferable to use `.' if it is available,
      since that avoids any possible clashes with user identifiers.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1998-12-30 14:58 ` Zack Weinberg
@ 1998-12-31 14:28   ` Paul Eggert
  1998-12-31 15:13   ` problems with C9x's _Pragma Paul Eggert
  1 sibling, 0 replies; 31+ messages in thread
From: Paul Eggert @ 1998-12-31 14:28 UTC (permalink / raw)
  To: zack; +Cc: rms, bothner, amylaar, martin, gcc2, egcs, zack

   Date: Wed, 30 Dec 1998 17:58:17 -0500
   From: Zack Weinberg <zack@rabi.columbia.edu>

   - C9x CD2 unambiguously says \u and \U escapes are to be treated as
   Unicode.  It also disallows these escapes for certain ranges of
   Unicode which encompass all of 7-bit ASCII.  That being so, I
   propose to encode \u and \U in UTF-8 always.

For UTF-8 locales this is reasonable, but it doesn't work for
non-UTF-8 locales, which is the main point of my revised proposal.
For such locales, your proposal would map UCNs to one encoding, while
the rest of the program would continue to use an incompatible encoding
-- but this would render UCNs useless, as you normally can't mix
encodings like that.

   This can be done regardless of the availability of translation
   libraries.

If translation libraries are not available, then GCC can and should
fall back to something along the lines that you propose.  I call this
fallback behavior the ``GNUC charset'' (soon to be revised to the
``GNUC ctype'') in my proposal.

   pragmas affecting the preprocessor (which this is) cannot be
   expressed with _Pragma() at all, and neither #pragma nor _Pragma()
   can appear in a position that is inconvenient to the parser -- I
   think that will translate to "must look like a C
   statement-or-declaration".

I missed your message (as I don't get egcs) but will look into it in
the egcs archives.  However, even ``must look like a C
statement-or-declaration'' is overly optimistic, since a
charset-changing pragma can change the interpretation of comments.  I
see real problems in implementing a charset-changing pragma without
placing draconian restrictions on it.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: problems with C9x's _Pragma
  1998-12-30 14:58 ` Zack Weinberg
  1998-12-31 14:28   ` Paul Eggert
@ 1998-12-31 15:13   ` Paul Eggert
  1 sibling, 0 replies; 31+ messages in thread
From: Paul Eggert @ 1998-12-31 15:13 UTC (permalink / raw)
  To: zack; +Cc: gcc2, egcs

   From: Zack Weinberg (zack@rabi.columbia.edu)
   Date: Mon, 21 Dec 1998 23:43:24 -0500 

   Theoretically, this code is legal:

   int bar(void);
   int foo(void)
   {
     return _Pragma("something") bar();
   }

Yes, and there's no problem with that example, as it is equivalent to
the following:

int bar(void);
int foo(void)
{
  return
# pragma something
  bar();
}

which GCC already allows.

   Also, _Pragma must be recognized even if I do something like this:

   #define LP (
   #define RP )

   _Pragma LP "string" RP

The way I read draft C9x, this is not required.  Admittedly the draft is
muddy in this area.  This topic has already come up in comp.std.c and
my impression is that the committee will fix any ambiguity in the draft.


   I'd like to propose we place two constraints on #pragma and _Pragma():

   [1] _Pragma() must appear as a separate statement, with a semicolon
   after it.

This doesn't match draft C9x, which says that e.g.
`_Pragma ("STDC FP_CONTRACT DEFAULT")' must work without a semicolon.
I don't see why the semicolon is needed.

   It can appear where statements or declarations are legal.

If we're talking about the STDC pragmas, this is more generous than
what draft C9x requires.  Draft C9x requires that a STDC pragma appear
either outside external declarations, or at the start of a compound
statement.  So you would allow

if (x == 0) _Pragma ("STDC FP_CONTRACT DEFAULT");

whereas draft C9x would not.  I'm not sure that it's wise to relax the
draft C9x restrictions for the STDC pragmas, as the above code (which has no
effect, since the scope of the pragma is just the then-part of the if)
probably doesn't mean what the user intended.

Conversely, if we're talking about non-STDC pragmas, I don't see why
the restriction is needed.  Other compilers do not have this
restriction for similar constructs; e.g. I've heard that

float __pragma (__mycall) FloatAdd (float n1, float n2);

is allowed by Watcom C, and similar _Pragma expressions might be
useful for GCC.  To some extent, my impression is that _Pragma is the
draft C9x way of doing GCC's __attribute__; it might be useful at some point
to put in a pragma that has an effect equivalent to __attribute__.

   #pragma must appear where it would have been legal if it were written
   the other way, in addition to being valid as a preprocessing
   directive.

This sounds backwards; by definition, _Pragma is translated to
#pragma, not the other way around.  Wherever #pragma is valid, the
corresponding _Pragma must be valid.


   [2] Pragmas that affect the preprocessor cannot be written using
   _Pragma().

This is a reasonable restriction for some pragmas (e.g. one that
affects the character set) but I don't see why it's reasonable for
others (e.g. `#pragma implementation').  Also, any such restriction
should be phrased in terms of #pragma, not _Pragma, since the draft
C9x specifies the latter in terms of the former.

   No supported or proposed pragmas are broken by these constraints.

Hmm, not even `#pragma implementation'?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1998-12-30  5:21 ` Richard Stallman
@ 1998-12-31 15:55   ` Paul Eggert
  0 siblings, 0 replies; 31+ messages in thread
From: Paul Eggert @ 1998-12-31 15:55 UTC (permalink / raw)
  To: rms; +Cc: zack, bothner, amylaar, martin, gcc2, egcs

   Date: Wed, 30 Dec 1998 08:24:40 -0500
   From: Richard Stallman <rms@gnu.org>

   Your proposal seems pretty good, but it's missing one very important
   feature.  That is the warning if the locale has an effect on the
   result and was not explicitly specified....

   Solution A: use #charset rather than #pragma charset....
   We could simply forbid this usage and say that #charset must be used
   at the beginning of the file only.  Nothing but whitespace (including
   comments) should be allowed preceding #charset.

Thanks for reminding me about the diagnostic, and your solution about
#charset sounds reasonable.  I'll include the following new text in
the next version of the proposal.  It has a couple of extra ideas:

* There are some restrictions on the comments that can precede the
  `#ctype' directive.  They have to be valid GNUC comments, and thus
  cannot use arbitrary characters from the desired charset.

* If you want to disable the warnings, you can do so with an explicit
  `-ctype ""' option or with an explicit `#ctype ""' directive; this
  gets the ctype from the environment without any warnings.

----------

 1. Determining the input ctype.

    The input character type FOO can be specified by the new `-ctype
    FOO' compile-time option or the new `#ctype "FOO"' directive.  GCC
    processes this option by invoking the functions setlocale
    (LC_CTYPE, "FOO") and nl_langinfo (CODESET), and reports an error
    if these two functions do not both succeed.  If "FOO" is the empty
    string, this procedure obtains the ctype from the locale in the
    usual way; this disables the warnings specified in (11).

    The default ctype is determined from the locale, which
    is specified in the usual way with the LC_ALL, LC_CTYPE, and LANG
    environment variables; if the environment variables are not set,
    or if the setlocale and nl_langinfo functions do not both succeed,
    the default ctype is GNUC.

    The `#ctype "FOO"' directive must be at the start of its source
    file, preceded only by white space and comments that are valid
    when interpreted with the GNUC ctype.  In practice, it's safe to
    put non-GNUC encodings into // comments before the #ctype
    directive, but /* comments are not safe, because some multibyte
    encodings use `*' and `/' bytes.

    ...

11. If no -ctype option or #ctype directive is specified, and GCC's
    behavior differs from what it would have been with the GNUC ctype,
    then GCC issues a warning.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1999-01-31 23:58             ` Paul Eggert
@ 1999-01-31 23:58               ` Horst von Brand
  1999-01-31 23:58                 ` Paul Eggert
  1999-01-31 23:58                 ` Martin v. Loewis
  0 siblings, 2 replies; 31+ messages in thread
From: Horst von Brand @ 1999-01-31 23:58 UTC (permalink / raw)
  To: Paul Eggert; +Cc: zack, martin, rms, bothner, amylaar, gcc2, egcs


Paul Eggert <eggert@twinsun.com> said:
> Horst von Brand <vonbrand@sleipnir.valparaiso.cl> said:
>    It's confusing enough to have to handle different charsets, and then
>    different charsets in the same file, switching in the middle! Maybe it
>    _can_ be done, but I'd vote it _shouldn't_ be done.

> Good point, and this suggests another problem with #ctype: it doesn't
> work well with #include.  For example, suppose we have:
> 
> 	main.c:
> 		#ctype "ja_JP.PCK" // Shift-JIS in Solaris 7
> 		#include "myfile.h"
> 		char s[] = S;
> 
> 	myfile.h:
> 		#ctype "ja" // EUC in Solaris 7
> 		#define S "some EUC string"
> 
> This will be difficult to implement, as it will mean we'll need to
> support mixed-charset translation, which would have many problems
> (e.g. how do you concatenate identifiers with different ctypes?).
> Also, it'd be nearly impossible to explain -- e.g. should `s' use
> Shift-JIS or EUC in the example above?
> 
> To avoid these problems, #ctype would have to be allowed only at the
> start of the compilation unit.

And then you force the #include'd (and also the ld'd, etc) files to
_always_ use that same charset, which tends to negate the whole thing,
AFAIKS.

Has anybody else implemented this kind of stuff? What are the ideas
floating around? I'd guess everybody will just go ahead, shut their eyes
and use _one_ charset on their systems (that's what Unicode was invented
for, isn't it?). In that case, the easy way out is to assume that... or are
there fundamental reasons that _force_ the concurrent use of incompatible
charsets? To have a bunch of libraries, one for each charset, is _not_
nice... but it could be handled by some kind of shim that just translates
the symbol names and calls the "one and true" library.

Another way out (I know next to nothing about charsets, so bear with me):
Just disallow (for the time being?) non-ASCII (or non-Unicode + assorted
restrictions?) external identifiers. Note that some time back external
identifiers in C were only guaranteed to be significant in one case and
to six characters (I suspect that's where the mythical 'n' in umount(2)
got lost ;-)  Once
everybody uses the same charset, relax the restriction.

> The more I think about the #ctype directive, the less I like it -- it
> has so many funny restrictions, and its operands are so unportable.
> Perhaps it'd be better if we support only a -ctype option, at least at
> first.  We can add a #ctype directive later if the need arises.

You'd have to record that value in the *.o, *.a, *.so files to check they
are compatible when linking... and that means control over the _whole_
toolchain. But if you have control over the whole toolchain, simpler
solutions seem possible. If you don't, there is not much you can do.

Just MHO, barging in on a discussion I haven't followed.
-- 
Horst von Brand                             vonbrand@sleipnir.valparaiso.cl
Casilla 9G, Viña del Mar, Chile                               +56 32 672616

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1999-01-31 23:58         ` Martin v. Loewis
  1999-01-31 23:58           ` Paul Eggert
@ 1999-01-31 23:58           ` Per Bothner
  1999-01-31 23:58             ` Joe Buck
  1 sibling, 1 reply; 31+ messages in thread
From: Per Bothner @ 1999-01-31 23:58 UTC (permalink / raw)
  To: Martin v. Loewis; +Cc: gcc2, egcs

> we currently get '_._7myclass'. Under your proposal, we get
> '_.._7myclass'. This is a link-incompatibility with current code.

There is no need to consider "link-incompatibility with current code"
as a requirement or even a goal for C++, only for C.  The intention
is that C++ ABI will change in the next year or so, anyway.
This is the purpose of the -fnew-abi flag.  However, we should
resolve these issues before we freeze the "new ABI", which we
have to do before we make it the default.

	--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1999-01-31 23:58             ` Zack Weinberg
  1999-01-31 23:58               ` Martin v. Loewis
@ 1999-01-31 23:58               ` Paul Eggert
  1 sibling, 0 replies; 31+ messages in thread
From: Paul Eggert @ 1999-01-31 23:58 UTC (permalink / raw)
  To: zack; +Cc: martin, rms, bothner, amylaar, gcc2, egcs

   Date: Mon, 04 Jan 1999 16:15:46 -0500
   From: Zack Weinberg <zack@rabi.columbia.edu>

   The preprocessor may have trouble with native extended chars as
   first characters, but if we can get isalpha() to cooperate, it
   should work.

There should be no trouble with isalpha cooperating, since GCC
shouldn't use isalpha to detect identifiers.

To detect C alphanumerics, GCC must use its own function or table.
GCC shouldn't use isalpha unless we want it to detect alphanumeric
bytes in the current locale, which is normally not what is wanted.
It might be OK for GCC to use isalpha for some obscure low-level stuff,
e.g. detecting whether an MS-DOS file name has a drive specifier.
But GCC shouldn't use isalpha for detecting identifiers, because
isalpha doesn't work with multibyte chars.

Parts of GCC already do the right thing here.  E.g. cccp.c uses
is_idchar instead of isalpha.  Some other parts of GCC use isalpha
where they shouldn't, though, and we'll have to fix these parts if we
want gcc to work with multibyte chars properly.
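
As an illustration of the kind of table meant here (cccp.c's is_idchar
is the real thing), a sketch follows.  Treating every byte with the
high bit set as an identifier character corresponds to the GNUC
handling of non-Ascii bytes; the names are invented:

/* A locale-independent identifier-character table: the usual Ascii
   identifier characters, `$' as the familiar GCC extension, and any
   non-Ascii byte.  */
static unsigned char id_char[256];

static void
init_id_char (void)
{
  int c;

  for (c = 0; c < 256; c++)
    id_char[c] = (c == '_' || c == '$'
                  || (c >= '0' && c <= '9')
                  || (c >= 'A' && c <= 'Z')
                  || (c >= 'a' && c <= 'z')
                  || c >= 0x80);   /* any byte with the high bit set */
}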

I already did most of this job for the back end and the C front end,
in the internationalization patch that has been applied to GCC2:

ftp://alpha.gnu.org/gnu/testgcc-980705-intl.patch.gz

However, this patch didn't cover the C++ front end, and there are
probably some other places that I missed that we'll just have to find
if and when they come up.  Also, this patch hasn't been fully applied
to EGCS yet, and there's undoubtedly some integration work needed
there since GCC2 and EGCS have diverged.  At some point I'd like to
get it working properly with EGCS as well as GCC2.  I understand that
a merge is in progress and so I'll wait for it to finish before
hacking any further on isalpha problems.

   This raises the issue of how we tell native extended character X
   from native ASCII character %.  I'm beginning to suspect we need
   the more general locale information, not just the charset.

I reluctantly agree.  Another reason we need the locale info is
because iconv (which needs only the charset) doesn't let us determine
character boundaries reliably; we need mbrlen and mbsinit for that,
and they need the LC_CTYPE locale.

This is why I'm changing the word `charset' to `ctype' in my next
version of the proposal for non-Ascii source files.  GCC will need to
know the LC_CTYPE locale (which I am calling the `ctype'), and from
that GCC can use nl_langinfo (CODESET) to determine the charset.
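
For instance, a minimal sketch of that last step, assuming setlocale and
nl_langinfo are available on the host (the function name is invented):

#include <locale.h>
#include <langinfo.h>

/* Given a ctype (LC_CTYPE locale) name, or "" for the environment's
   locale, return the host's name for its charset, or a null pointer if
   the host does not know that ctype.  */
static const char *
charset_from_ctype (const char *ctype)
{
  if (setlocale (LC_CTYPE, ctype) == NULL)
    return NULL;
  return nl_langinfo (CODESET);
}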

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1999-01-31 23:58           ` Martin v. Loewis
@ 1999-01-31 23:58             ` Zack Weinberg
  1999-01-31 23:58               ` Martin v. Loewis
  1999-01-31 23:58               ` Paul Eggert
  0 siblings, 2 replies; 31+ messages in thread
From: Zack Weinberg @ 1999-01-31 23:58 UTC (permalink / raw)
  To: Martin v. Loewis; +Cc: eggert, rms, bothner, amylaar, gcc2, egcs

On Fri, 1 Jan 1999 22:49:03 +0100, "Martin v. Loewis" wrote:
>> - Can we forbid UCNs and native extended characters as the first character
>> of an identifier?
>
>For conformance, we must allow UCNs as first characters. Since native
>extended characters are implementation-defined, we could define such
>a restriction.

My initial idea of how to do UCNs was completely wrong.  It's not a problem
to take them as first characters.  The preprocessor may have trouble with
native extended chars as first characters, but if we can get isalpha() to
cooperate, it should work.  c-lex I don't know about.

>Even though extended characters are exactly that
>(extended), they still might use the same bytes as the basic character
>set in an MBCS encoding. For example, there could be an encoding where
>
><some escape>*/
>
>forms a valid extended character, which is different from both '*' and
>'/'. I believe ISO-2022-JP is such an encoding (or similar cases can
>probably be constructed).

Yuck.  OK, I'm convinced we need to know the charset out-of-band.

This raises the issue of how we tell native extended character X from native
ASCII character %.  I'm beginning to suspect we need the more general locale
information, not just the charset.

zw

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1999-01-31 23:58         ` Martin v. Loewis
@ 1999-01-31 23:58           ` Paul Eggert
  1999-01-31 23:58           ` Per Bothner
  1 sibling, 0 replies; 31+ messages in thread
From: Paul Eggert @ 1999-01-31 23:58 UTC (permalink / raw)
  To: martin; +Cc: rms, zack, bothner, amylaar, gcc2, egcs

   Date: Fri, 1 Jan 1999 12:18:12 +0100
   From: "Martin v. Loewis" <martin@mira.isdn.cs.tu-berlin.de>

   class myclass{ public: ~myclass(); };

   we currently get '_._7myclass'. Under your proposal, we get
   '_.._7myclass'.

No, because under my proposal `.' is doubled only for assembler
identifiers that contain characters that are not allowed by the
platform.  If we currently get `_._7myclass', then it must be allowed
by the platform, so it won't be affected by Ascization.

   Also, due to some misconfiguration in current gcc, many platforms
   define NO_DOLLAR_IN_LABEL even though their assemblers could handle it
   (e.g. on Linux).

This misconfiguration should get fixed, and there will be motivation
to do this if my proposal is adopted (currently my impression is that
there's not much point to fixing it).  If it's not fixed, then GCC
would fall back on `V', which is almost as good; but `.' is preferable
if it works.

Perhaps some platforms will allow all nonzero bytes in identifiers, in
which case Ascization won't affect identifiers.  It would be OK with
me if GNU/Linux adopted this policy, but I believe RMS is dubious
about it.

   So 'V' would be default escape. This is bad for functions that
   contain 'V' in their identifier.

This shouldn't be a problem.  If the identifier contains only the
usual Ascii letters and digits, it won't be munged at all under my
proposal.  It gets munged only if it contains one or more characters
not allowed by the platform.

I'll add the following comment to the proposal, to help clarify this point:

	Suppose `.' is allowed in assembler identifiers.  Then the C++ mangled
	identifier `_._7Vyclass' is not affected by (9) and (10) and is output
	unscathed, since it contains only traditional C characters (plus `.'
	or `$'), and all its characters are allowed by the platform.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1999-01-31 23:58             ` Zack Weinberg
@ 1999-01-31 23:58               ` Martin v. Loewis
  1999-01-31 23:58               ` Paul Eggert
  1 sibling, 0 replies; 31+ messages in thread
From: Martin v. Loewis @ 1999-01-31 23:58 UTC (permalink / raw)
  To: zack; +Cc: eggert, rms, bothner, amylaar, gcc2, egcs

> This raises the issue of how we tell native extended character X from native
> ASCII character %.  I'm beginning to suspect we need the more general locale
> information, not just the charset.

Paul suggested that you can test whether you are in the initial state
when using the multibyte functions. If that is true, and if we assume
that the initial state is ASCII (as mandated by the C and C++
standards), we have a test whether a single byte we just saw really is
from the base character set.

Furthermore, the standards require that we are in the initial state
after each identifier. So we have good reason to reject

<escape>printf<funny characters>("Hello world");<escape to ASCII>

The escape back to ASCII must occur right after <funny characters>,
and should probably count as part of the identifier.

Regards,
Martin

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1999-01-31 23:58                   ` Horst von Brand
@ 1999-01-31 23:58                     ` Martin v. Loewis
  0 siblings, 0 replies; 31+ messages in thread
From: Martin v. Loewis @ 1999-01-31 23:58 UTC (permalink / raw)
  To: vonbrand; +Cc: eggert, egcs

> Suppose in encoding A I use "xyz" as a symbol name, which just so happens
> to be what encoding B gives you for "uvw"... and now I link a file with
> encoding B using "uvw" (which is totally different from "xyz") against the
> library generated under encoding A. Sure, the probability of this happening
> is remote, but...

And of course, you'll lose. The question is whether this will be
acceptable to the users. There is no real experience with that. Since
being restricted to ASCII was somewhat acceptable all the years, we'll
have to see.

IMHO, this just means that everybody will switch to Unicode in the
long run, which will reduce that risk. However, I agree that gcc
should not favour a particular encoding; that's not our business.

Regards,
Martin

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1999-01-31 23:58               ` Horst von Brand
@ 1999-01-31 23:58                 ` Paul Eggert
  1999-01-31 23:58                   ` Horst von Brand
  1999-01-31 23:58                 ` Martin v. Loewis
  1 sibling, 1 reply; 31+ messages in thread
From: Paul Eggert @ 1999-01-31 23:58 UTC (permalink / raw)
  To: vonbrand; +Cc: zack, martin, rms, bothner, amylaar, gcc2, egcs

   Date: Sat, 02 Jan 1999 13:00:58 -0400
   From: Horst von Brand <vonbrand@sleipnir.valparaiso.cl>

   Has anybody else implemented this kind of stuff?

As far as I know, no one has yet published an implementation of C9x
CD2.  There is some prior art that uses similar ideas, but the C
committee has added some quirks.

   I'd guess everybody will just go ahead, shut their eyes and use
   _one_ charset on their systems

Most likely this will be true for identifiers -- i.e. identifiers will
stick to A-Za-z0-9_ (and sometimes $ and .) for some time to come, as
RMS said.

However, for strings this is not true even now.  Most
internationalized code puts non-Ascii strings in separate message
files, but there is a reasonable amount of quick-and-dirty code that
puts EUC (or even Shift-JIS!) into comments and strings.

   are there fundamental reasons that _force_ the concurrent use of
   incompatible charsets?

With comments and strings, established audiences already use
Shift-JIS, EUC, or whatever, and are unlikely to change in the near
future due to understandable inertia.

With identifiers, it's too early to say -- there's little experience.
It's possible that identifiers will go the way of strings; another
possibility is that platforms will use UTF-8 in identifiers.

   Another way out (I know next to nothing about charsets, so bear with me):
   Just disallow (for the time being?) non-ASCII (or non-Unicode + assorted
   restrictions?) external identifiers.

There are two related issues here:

* What characters are allowed in C identifiers?  Here draft C9x (and
  the C++ standard) are quite clear: many, many different characters
  are allowed, including most Kanji characters.  GCC must support
  this, even on platforms that do not have the Kanji characters,
  because the user can always enter the UCN equivalent of the Kanji
  character on such platforms.  The standards require that such characters
  be allowed in external identifiers.

* How should untraditional characters be encoded in external
  identifiers?  One possibility is to use a multibyte encoding
  (e.g. one could use UTF-8, where the three-byte sequence with codes
  E1, 82 and A0 represents \u10a0); another is to use some sort of
  Ascii escape sequence (e.g. `.E1.82.A0' or `.u10A0' for \u10a0).
  The former assumes multibyte character support in the assembler,
  linker, debugger, and all other programs that examine assembler or
  object files; the latter assumes much less, though it's a pain to
  use unless the debugger and some other tools can unescape the
  sequences.  The current proposal allows either possibility, at the
  platform's option.
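
A minimal sketch of the UTF-8 byte computation mentioned in the second
item above, for illustration only (the function name is invented, and
values beyond the 4-byte range are not handled):

/* Encode the Unicode value C of a UCN as UTF-8 in BUF and return the
   number of bytes written.  */
static int
ucn_to_utf8 (unsigned long c, unsigned char *buf)
{
  if (c < 0x80)
    {
      buf[0] = c;
      return 1;
    }
  if (c < 0x800)
    {
      buf[0] = 0xC0 | (c >> 6);
      buf[1] = 0x80 | (c & 0x3F);
      return 2;
    }
  if (c < 0x10000)
    {
      buf[0] = 0xE0 | (c >> 12);
      buf[1] = 0x80 | ((c >> 6) & 0x3F);
      buf[2] = 0x80 | (c & 0x3F);
      return 3;                 /* \u10a0 comes out as E1 82 A0 */
    }
  buf[0] = 0xF0 | (c >> 18);
  buf[1] = 0x80 | ((c >> 12) & 0x3F);
  buf[2] = 0x80 | ((c >> 6) & 0x3F);
  buf[3] = 0x80 | (c & 0x3F);
  return 4;
}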
  

   > Perhaps it'd be better if we support only a -ctype option...

   You'd have to record that value in the *.o, *.a, *.so files to check they
   are compatible when linking.

Why is this necessary?  If you use incompatible ctypes, the files
won't link.

I suppose that one might record the encoding in each object file, as
an extra check, but it would mean extra locale processing in the
linker.  E.g. the linker would have to know that Ascii is compatible
with both ISO 8859-1 and UTF-8, but that ISO 8859-1 and UTF-8 are
incompatible.  But one of our design goals is to have the linker be
able to work independently of the ctype.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1999-01-31 23:58           ` Horst von Brand
@ 1999-01-31 23:58             ` Paul Eggert
  1999-01-31 23:58               ` Horst von Brand
  0 siblings, 1 reply; 31+ messages in thread
From: Paul Eggert @ 1999-01-31 23:58 UTC (permalink / raw)
  To: vonbrand; +Cc: zack, martin, rms, bothner, amylaar, gcc2, egcs

   Date: Fri, 01 Jan 1999 18:20:07 -0400
   From: Horst von Brand <vonbrand@sleipnir.valparaiso.cl>

   It's confusing enough to have to handle different charsets, and then
   different charsets in the same file, switching in the middle! Maybe it
   _can_ be done, but I'd vote it _shouldn't_ be done.

Good point, and this suggests another problem with #ctype: it doesn't
work well with #include.  For example, suppose we have:

	main.c:
		#ctype "ja_JP.PCK" // Shift-JIS in Solaris 7
		#include "myfile.h"
		char s[] = S;

	myfile.h:
		#ctype "ja" // EUC in Solaris 7
		#define S "some EUC string"

This will be difficult to implement, as it will mean we'll need to
support mixed-charset translation, which would have many problems
(e.g. how do you concatenate identifiers with different ctypes?).
Also, it'd be nearly impossible to explain -- e.g. should `s' use
Shift-JIS or EUC in the example above?

To avoid these problems, #ctype would have to be allowed only at the
start of the compilation unit.

The more I think about the #ctype directive, the less I like it -- it
has so many funny restrictions, and its operands are so unportable.
Perhaps it'd be better if we support only a -ctype option, at least at
first.  We can add a #ctype directive later if the need arises.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1998-12-31 13:55       ` Paul Eggert
@ 1999-01-31 23:58         ` Zack Weinberg
  1999-01-31 23:58           ` Martin v. Loewis
                             ` (2 more replies)
  1999-01-31 23:58         ` Martin v. Loewis
  1 sibling, 3 replies; 31+ messages in thread
From: Zack Weinberg @ 1999-01-31 23:58 UTC (permalink / raw)
  To: Paul Eggert; +Cc: martin, rms, bothner, amylaar, gcc2, egcs

On Thu, 31 Dec 1998 13:54:19 -0800 (PST), Paul Eggert wrote:

>OK, here's a revised version of the name Ascization part of that
>proposal, which should be safe in the presence of C++ name mangling.
>Sorry, it's a bit complicated, but I don't see a way to simplify it
>without introducing other disadvantages.

Just a couple issues with this:

- Can we forbid UCNs and native extended characters as the first character
of an identifier?  This would substantially reduce the amount of changes to
the parser.

- C9x says UCNs (and presumably extended characters) may not designate any
character in the required source character set.  This means that there is no
problem with recognizing comments or strings even when we don't know the
source charset; and therefore I see no particular difficulties with
#charset/#pragma charset in the middle of the file.  It just takes effect
after the line it appears on.  (I'll respond separately to your comments on
#pragma.)

- I don't like allowing arbitrary ASCII non-required-charset symbols, such
as '@', in identifiers (this seems to be implicitly permitted in your
proposal).  It's easiest to allow arbitrary non-ASCII symbols in
identifiers, but we know what '@' is.

- What do you do with an identifier that has 'V' in it already?

zw

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1999-01-31 23:58                 ` Paul Eggert
@ 1999-01-31 23:58                   ` Horst von Brand
  1999-01-31 23:58                     ` Martin v. Loewis
  0 siblings, 1 reply; 31+ messages in thread
From: Horst von Brand @ 1999-01-31 23:58 UTC (permalink / raw)
  To: Paul Eggert; +Cc: egcs


Paul Eggert <eggert@twinsun.com> said:
>    Date: Sat, 02 Jan 1999 13:00:58 -0400
>    From: Horst von Brand <vonbrand@sleipnir.valparaiso.cl>

[...]

>    > Perhaps it'd be better if we support only a -ctype option...
> 
>    You'd have to record that value in the *.o, *.a, *.so files to check they
>    are compatible when linking.
> 
> Why is this necessary?  If you use incompatible ctypes, the files
> won't link.

Suppose in encoding A I use "xyz" as a symbol name, which just so happens
to be what encoding B gives you for "uvw"... and now I link a file with
encoding B using "uvw" (which is totally different from "xyz") against the
library generated under encoding A. Sure, the probability of this happening
is remote, but...

> I suppose that one might record the encoding in each object file, as
> an extra check, but it would mean extra locale processing in the
> linker.  E.g. the linker would have to know that Ascii is compatible
> with both ISO 8859-1 and UTF-8, but that ISO 8859-1 and UTF-8 are
> incompatible.  But one of our design goals is to have the linker to be
> able to work independently of the ctype.

Hope that works out.
-- 
Horst von Brand                             vonbrand@sleipnir.valparaiso.cl
Casilla 9G, Viña del Mar, Chile                               +56 32 672616

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1999-01-31 23:58             ` Joe Buck
@ 1999-01-31 23:58               ` Jeffrey A Law
  0 siblings, 0 replies; 31+ messages in thread
From: Jeffrey A Law @ 1999-01-31 23:58 UTC (permalink / raw)
  To: Joe Buck; +Cc: Per Bothner, martin, gcc2, egcs

  In message <199901022036.MAA02868@atrus.synopsys.com> you write:
  > 
  > > > we currently get '_._7myclass'. Under your proposal, we get
  > > > '_.._7myclass'. This is a link-incompatibility with current code.
  > > 
  > > There is no need to consider "link-incompatibility with current code"
  > > as a requirement or even a goal for C++, only for C.  The intention
  > > is that C++ ABI will change in the next year or so, anyway.
  > 
  > Just the same, we should avoid making symbols unnecessarily long.
  > They just eat disk space, and one of the benefits of -fnew-abi is that
  > it makes mangled names substantially shorter in many cases.
Not only do they eat disk space, but a variety of system utilities
(assemblers and linkers in particular) blow chunks when presented with very
long names.

The GNU utilities tend to handle such situations reasonably well, but the
GNU utilities aren't always available on every platform.

jeff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1999-01-31 23:58               ` Horst von Brand
  1999-01-31 23:58                 ` Paul Eggert
@ 1999-01-31 23:58                 ` Martin v. Loewis
  1 sibling, 0 replies; 31+ messages in thread
From: Martin v. Loewis @ 1999-01-31 23:58 UTC (permalink / raw)
  To: vonbrand; +Cc: eggert, zack, rms, bothner, amylaar, gcc2, egcs

> Has anybody else implemented this kind of stuff? What are the ideas
> floating around?

For string and character literals, non-ASCII support has always been in
the compilers in some form (usually without support for multibyte
strings); the bytes were just copied through as-is.

For non-ASCII in identifiers, I believe this is actually a first in
computing history. Java (i.e. SunSoft or whoever) committed to Unicode and
Unicode only; that is the origin of the \u escapes, right?

The C standard very strongly suggests using the 2-byte \u escapes for
Unicode: it says that a \u character counts as six bytes in the 31-byte
limit for external identifiers, and \U counts as 10 bytes :-)
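
A made-up example of what that counting rule means in practice, assuming a
compiler that already implements the draft's UCNs:

/* Spelled with one \u escape, this external identifier uses
   3 + 6 + 6 = 15 of the 31 significant characters, although it names
   only 10 characters.  */
extern int caf\u00E9_total;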

> In that case, the easy way out is to assume that... or are there
> fundamental reasons that _force_ the concurrent use of incompatible
> charsets?

There are fundamental reasons to support more than one charset, not
necessarily concurrently. The main reason is that Unicode is not
(yet?) universally accepted.

Whether this means we need *simultaneous* use of different charsets is
still an open question.

> Just disallow (for the time being?) non-ASCII (or non-Unicode + assorted
> restrictions?) external identifiers.

This would basically put us back to where we are right now. We can
have that approach for free :-)

Regards,
Martin

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1998-12-31 13:55       ` Paul Eggert
  1999-01-31 23:58         ` Zack Weinberg
@ 1999-01-31 23:58         ` Martin v. Loewis
  1999-01-31 23:58           ` Paul Eggert
  1999-01-31 23:58           ` Per Bothner
  1 sibling, 2 replies; 31+ messages in thread
From: Martin v. Loewis @ 1999-01-31 23:58 UTC (permalink / raw)
  To: eggert; +Cc: rms, zack, bothner, amylaar, gcc2, egcs

>      . Each instance of the escape byte is doubled.  The escape byte
>        is `.' if `.' and `$' are both allowed in identifiers, and is
>        `V' otherwise.

I assume this applies to C++ mangled names? Not good. If we have

class myclass {
public:
  ~myclass();
};

we currently get '_._7myclass'. Under your proposal, we get
'_.._7myclass'. This is a link-incompatibility with current code.

Also, due to some misconfiguration in current gcc, many platforms
define NO_DOLLAR_IN_LABEL even though their assemblers could handle it
(e.g. on Linux). So 'V' would be the default escape. This is bad for
functions that contain 'V' in their identifier.
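
To spell out where the extra character in '_.._7myclass' comes from, here
is a rough sketch of the doubling step as I read the quoted rule (my own
illustration, not GCC code):

#include <stdio.h>

/* Double every occurrence of the escape byte in an identifier, as the
   quoted rule prescribes; OUT must have room for twice the input.  */
static void
double_escape_byte (const char *id, char esc, char *out)
{
  for (; *id; id++)
    {
      *out++ = *id;
      if (*id == esc)
        *out++ = esc;
    }
  *out = '\0';
}

int
main (void)
{
  char buf[64];
  double_escape_byte ("_._7myclass", '.', buf);
  printf ("%s\n", buf);   /* prints _.._7myclass */
  return 0;
}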

Regards,
Martin

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1999-01-31 23:58         ` Zack Weinberg
  1999-01-31 23:58           ` Martin v. Loewis
  1999-01-31 23:58           ` Horst von Brand
@ 1999-01-31 23:58           ` Paul Eggert
  2 siblings, 0 replies; 31+ messages in thread
From: Paul Eggert @ 1999-01-31 23:58 UTC (permalink / raw)
  To: zack; +Cc: martin, rms, bothner, amylaar, gcc2, egcs

   Date: Fri, 01 Jan 1999 14:04:14 -0500
   From: Zack Weinberg <zack@rabi.columbia.edu>

   - Can we forbid UCNs and native extended characters as the first character
   of an identifier?

We could forbid native extended characters (since they're an
implementation extension) but we can't forbid the UCNs allowed by
Annex I, since the standard requires that we support them.  I don't
think it's wise to forbid the native extended characters, since this
will be a real hardship for people who want to use non-Ascii
identifiers.

   - C9x says UCNs (and presumably extended characters) may not designate any
   character in the required source character set.  This means that there is no
   problem with recognizing comments or strings even when we don't know the
   source charset;

This is true only if we forbid native extended characters that can be
confused with end of string or comment.  This restriction would be
unreasonable for e.g. Shift-JIS, which has multibyte characters that
contain `\' and `*' bytes.  This is the motivation for requiring that
#ctype be at the start of the file (possibly preceded by ``safe''
comments).
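
Here is a small illustration of the Shift-JIS case; the character between
the first pair of quotes is KATAKANA LETTER SO, whose Shift-JIS encoding
is, if I remember the tables correctly, the byte pair 0x83 0x5C, and that
trailing 0x5C is the same byte as `\':

#include <stdio.h>

/* Scan a source line the way a charset-ignorant lexer would: a byte
   equal to '\\' is taken to escape the byte after it.  */
static int
naive_string_end (const char *s, int start)
{
  int i;
  for (i = start; s[i]; i++)
    {
      if (s[i] == '\\')
        i++;
      else if (s[i] == '"')
        return i;
    }
  return -1;
}

int
main (void)
{
  /* The two bytes between the first pair of quotes, 0x83 0x5C, form one
     Shift-JIS character whose trailing byte happens to be '\\'.  */
  const char sjis_line[] = "s = \"\x83\x5C\"; t = \"x\";";

  printf ("a charset-ignorant scanner closes the string at offset %d "
          "(it really closes at offset 7)\n",
          naive_string_end (sjis_line, 5));
  return 0;
}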

   - What do you do with an identifier that has 'V' in it already?

I leave it alone, unless the identifier also has a non-Ascii character in it.

   - I don't like allowing arbitrary ASCII non-required-charset symbols, such
   as '@', in identifiers (this seems to be implicitly permitted in your
   proposal).

I was using `@' to denote MICRO SIGN (Unicode `00B5'); I didn't want
to put that character in email since it might have gotten munged.  I
agree that only non-Ascii characters should be allowed in identifiers
(other than the Ascii characters already allowed); I'll add the
following clarifying note to (3):

Each non-Ascii character is allowed in an identifier, string, or comment.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1999-01-31 23:58         ` Zack Weinberg
  1999-01-31 23:58           ` Martin v. Loewis
@ 1999-01-31 23:58           ` Horst von Brand
  1999-01-31 23:58             ` Paul Eggert
  1999-01-31 23:58           ` Paul Eggert
  2 siblings, 1 reply; 31+ messages in thread
From: Horst von Brand @ 1999-01-31 23:58 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: Paul Eggert, martin, rms, bothner, amylaar, gcc2, egcs

Zack Weinberg <zack@rabi.columbia.edu> said:
> On Thu, 31 Dec 1998 13:54:19 -0800 (PST), Paul Eggert wrote:
> >OK, here's a revised version of the name Ascization part of that
> >proposal, which should be safe in the presence of C++ name mangling.
> >Sorry, it's a bit complicated, but I don't see a way to simplify it
> >without introducing other disadvantages.

> Just a couple issues with this:

> - Can we forbid UCNs and native extended characters as the first character
> of an identifier?  This would substantially reduce the number of changes to
> the parser.

I think rather arbitrary restrictions like that will hurt a lot in the
long run.

> - C9x says UCNs (and presumably extended characters) may not designate any
> character in the required source character set.  This means that there is no
> problem with recognizing comments or strings even when we don't know the
> source charset; and therefore I see no particular difficulties with
> #charset/#pragma charset in the middle of the file.  It just takes effect
> after the line it appears on.  (I'll respond separately to your comments on
> #pragma.)

It's confusing enough to have to handle different charsets at all, let
alone different charsets in the same file, switching in the middle! Maybe
it _can_ be done, but I'd vote that it _shouldn't_ be done.

> - I don't like allowing arbitrary ASCII non-required-charset symbols, such
> as '@', in identifiers (this seems to be implicitly permitted in your
> proposal).  It's easiest to allow arbitrary non-ASCII symbols in
> identifiers, but we know what '@' is.

Either allow non-{letter,digit,underscore} characters in identifiers
uniformly, or just disallow them all. What you say sounds reasonable as
long as ASCII (or a close variant) is the standard; think about the future
when all this is just a dim memory...
-- 
Horst von Brand                             vonbrand@sleipnir.valparaiso.cl
Casilla 9G, Viña del Mar, Chile                               +56 32 672616

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1999-01-31 23:58           ` Per Bothner
@ 1999-01-31 23:58             ` Joe Buck
  1999-01-31 23:58               ` Jeffrey A Law
  0 siblings, 1 reply; 31+ messages in thread
From: Joe Buck @ 1999-01-31 23:58 UTC (permalink / raw)
  To: Per Bothner; +Cc: martin, gcc2, egcs

> > we currently get '_._7myclass'. Under your proposal, we get
> > '_.._7myclass'. This is a link-incompatibility with current code.
> 
> There is no need to consider "link-incompatibility with current code"
> as a requirement or even a goal for C++, only for C.  The intention
> is that C++ ABI will change in the next year or so, anyway.

Just the same, we should avoid making symbols unnecessarily long.
They just eat disk space, and one of the benefits of -fnew-abi is that
it makes mangled names substantially shorter in many cases.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: revised proposal for GCC and non-Ascii source files
  1999-01-31 23:58         ` Zack Weinberg
@ 1999-01-31 23:58           ` Martin v. Loewis
  1999-01-31 23:58             ` Zack Weinberg
  1999-01-31 23:58           ` Horst von Brand
  1999-01-31 23:58           ` Paul Eggert
  2 siblings, 1 reply; 31+ messages in thread
From: Martin v. Loewis @ 1999-01-31 23:58 UTC (permalink / raw)
  To: zack; +Cc: eggert, rms, bothner, amylaar, gcc2, egcs

> - Can we forbid UCNs and native extended characters as the first character
> of an identifier?

For conformance, we must allow UCNs as first characters. Since native
extended characters are implementation-defined, we could define such
a restriction.
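
For example, a conforming compiler has to accept something like the
following (a made-up name; I believe both characters are in the ranges the
Annex allows):

double \u00C5ngstr\u00F6m = 1e-10;   /* UCN as the very first character */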

> - C9x says UCNs (and presumably extended characters) may not designate any
> character in the required source character set.  This means that there is no
> problem with recognizing comments or strings even when we don't know the
> source charset

There is a problem. Even though extended characters are exactly that
(extended), they still might use the same bytes as the basic character
set in an MBCS encoding. For example, there could be an encoding where

<some escape>*/

forms a valid extended character, which is different from both '*' and
'/'. I believe ISO-2022-JP is such an encoding (or similar cases can
probably be constructed).
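
A small, self-contained demonstration of that effect; the particular byte
pair is hypothetical, since I have not checked whether it is actually
assigned in JIS X 0208:

#include <stdio.h>
#include <string.h>

int
main (void)
{
  /* Between ESC $ B and ESC ( B, every pair of bytes in 0x21..0x7E is
     one two-byte character, so the '*' '/' bytes inside that span are
     not a comment terminator; a byte-oriented scanner that does not
     track the shift state still sees them as one.  */
  const char line[] = "x = 1; /* \033$B*/\033(B end of comment */ y = 2;";
  const char *start = strstr (line, "/*");
  const char *naive_close = strstr (start + 2, "*/");

  printf ("byte-oriented scanner closes the comment at offset %d\n",
          (int) (naive_close - line));
  return 0;
}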

> - I don't like allowing arbitrary ASCII non-required-charset symbols, such
> as '@', in identifiers (this seems to be implicitly permitted in your
> proposal).  It's easiest to allow arbitrary non-ASCII symbols in
> identifiers, but we know what '@' is.

I believe '@' was only a placeholder for µ (MICRO SIGN).

> - What do you do with an identifier that has 'V' in it already?

Very good question.

Regards,
Martin

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~1999-01-31 23:58 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
1998-12-28 17:59 revised proposal for GCC and non-Ascii source files Paul Eggert
1998-12-29  1:38 ` Martin von Loewis
1998-12-29  1:39 ` Martin von Loewis
1998-12-29  5:53   ` Paul Eggert
1998-12-29  6:22     ` Martin von Loewis
1998-12-31 13:55       ` Paul Eggert
1999-01-31 23:58         ` Zack Weinberg
1999-01-31 23:58           ` Martin v. Loewis
1999-01-31 23:58             ` Zack Weinberg
1999-01-31 23:58               ` Martin v. Loewis
1999-01-31 23:58               ` Paul Eggert
1999-01-31 23:58           ` Horst von Brand
1999-01-31 23:58             ` Paul Eggert
1999-01-31 23:58               ` Horst von Brand
1999-01-31 23:58                 ` Paul Eggert
1999-01-31 23:58                   ` Horst von Brand
1999-01-31 23:58                     ` Martin v. Loewis
1999-01-31 23:58                 ` Martin v. Loewis
1999-01-31 23:58           ` Paul Eggert
1999-01-31 23:58         ` Martin v. Loewis
1999-01-31 23:58           ` Paul Eggert
1999-01-31 23:58           ` Per Bothner
1999-01-31 23:58             ` Joe Buck
1999-01-31 23:58               ` Jeffrey A Law
1998-12-29  1:50 ` Martin von Loewis
1998-12-29  5:41   ` Paul Eggert
1998-12-30  5:21 ` Richard Stallman
1998-12-31 15:55   ` Paul Eggert
1998-12-30 14:58 ` Zack Weinberg
1998-12-31 14:28   ` Paul Eggert
1998-12-31 15:13   ` problems with C9x's _Pragma Paul Eggert

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).