Implementing Universal Character Names in identifiers

public inbox for gcc-patches@gcc.gnu.org

* Implementing Universal Character Names in identifiers
@ 2002-10-27 23:15 Martin v. Löwis
  2002-10-27 23:47 ` Fergus Henderson
                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Martin v. Löwis @ 2002-10-27 23:15 UTC
  To: gcc-patches

This patch implements UCNs in cpplib. It does so by converting the
UCN to UTF-8, putting the UTF-8 bytes into the internal
representation of the identifier.
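
To illustrate the conversion step, here is a minimal standalone sketch.
It is not the code from the patch: the name ucn_to_utf8 is hypothetical,
and it assumes the modern limit of four bytes per character, rejecting
surrogates and values above U+10FFFF.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the patch: encode code point CP as
   UTF-8 into OUT, which must have room for at least 4 bytes.  Returns
   the number of bytes written, or 0 if CP is a surrogate or lies
   above U+10FFFF.  */
static size_t
ucn_to_utf8 (unsigned long cp, unsigned char *out)
{
  if (cp <= 0x7f)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp <= 0x7ff)
    {
      out[0] = (unsigned char) (0xc0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3f));
      return 2;
    }
  if (cp <= 0xffff)
    {
      if (cp >= 0xd800 && cp <= 0xdfff)
        return 0;		/* Surrogates are not characters.  */
      out[0] = (unsigned char) (0xe0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
      out[2] = (unsigned char) (0x80 | (cp & 0x3f));
      return 3;
    }
  if (cp <= 0x10ffff)
    {
      out[0] = (unsigned char) (0xf0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3f));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
      out[3] = (unsigned char) (0x80 | (cp & 0x3f));
      return 4;
    }
  return 0;
}
```

With this sketch, \u00e4 becomes the two bytes 0xc3 0xa4, and those bytes
are what would end up in the identifier's spelling.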

The back-ends will transparently output the UTF-8 identifiers into the
assembler file. If GNU as is used (or any other assembler supporting
non-ASCII identifiers), these UTF-8 strings will be copied transparently
into the object file. If the assembler does not support UTF-8, it
will produce a diagnostic.

As a result of this strategy, UCNs are now allowed in all places
mandated by the relevant standards, i.e. both in C99 and C++, and in
all identifiers, including macro names.
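
As a small illustration of the kind of source this is meant to accept
(a sketch only; the names are made up for the example, and it assumes a
compiler that accepts UCNs in identifiers):

```c
/* A macro whose name contains a UCN (U+00E4, "a" with diaeresis).  */
#define M\u00e4x(a, b) ((a) > (b) ? (a) : (b))

/* A function whose name contains a UCN (U+00F6, "o" with diaeresis).
   Both names are hypothetical and exist only for this example.  */
static int
gr\u00f6sser (int a, int b)
{
  return M\u00e4x (a, b);
}
```

Both U+00E4 and U+00F6 fall inside the Latin ranges accepted by
identifier_ucs_p in the patch below.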

Regards,
Martin

2002-10-27  Martin v. Löwis  <loewis@informatik.hu-berlin.de>

	* c-lex.c (is_extended_char, utf8_extend_token): Remove.
	* cpplex.c (identifier_ucs_p, utf8_extend_token, 
	utf8_to_char): New functions.
	(parse_slow): Add utf8 parameter. Parse UCS names.
	(parse_identifier, parse_number): Adjust.
	(_cpp_lex_direct): Parse UCS names.
	(cpp_output_token): Print UCS names.
	* cpplib.h (NODE_UTF8): New flag.

Index: c-lex.c
===================================================================
RCS file: /cvsroot/gcc/gcc/gcc/c-lex.c,v
retrieving revision 1.190
diff -c -p -r1.190 c-lex.c
*** c-lex.c	16 Sep 2002 16:36:31 -0000	1.190
--- c-lex.c	27 Oct 2002 17:35:33 -0000
*************** cb_undef (pfile, line, node)
*** 356,669 ****
  			 (const char *) NODE_NAME (node));
  }
  
- #if 0 /* not yet */
- /* Returns nonzero if C is a universal-character-name.  Give an error if it
-    is not one which may appear in an identifier, as per [extendid].
- 
-    Note that extended character support in identifiers has not yet been
-    implemented.  It is my personal opinion that this is not a desirable
-    feature.  Portable code cannot count on support for more than the basic
-    identifier character set.  */
- 
- static inline int
- is_extended_char (c)
-      int c;
- {
- #ifdef TARGET_EBCDIC
-   return 0;
- #else
-   /* ASCII.  */
-   if (c < 0x7f)
-     return 0;
- 
-   /* None of the valid chars are outside the Basic Multilingual Plane (the
-      low 16 bits).  */
-   if (c > 0xffff)
-     {
-       error ("universal-character-name '\\U%08x' not valid in identifier", c);
-       return 1;
-     }
-   
-   /* Latin */
-   if ((c >= 0x00c0 && c <= 0x00d6)
-       || (c >= 0x00d8 && c <= 0x00f6)
-       || (c >= 0x00f8 && c <= 0x01f5)
-       || (c >= 0x01fa && c <= 0x0217)
-       || (c >= 0x0250 && c <= 0x02a8)
-       || (c >= 0x1e00 && c <= 0x1e9a)
-       || (c >= 0x1ea0 && c <= 0x1ef9))
-     return 1;
- 
-   /* Greek */
-   if ((c == 0x0384)
-       || (c >= 0x0388 && c <= 0x038a)
-       || (c == 0x038c)
-       || (c >= 0x038e && c <= 0x03a1)
-       || (c >= 0x03a3 && c <= 0x03ce)
-       || (c >= 0x03d0 && c <= 0x03d6)
-       || (c == 0x03da)
-       || (c == 0x03dc)
-       || (c == 0x03de)
-       || (c == 0x03e0)
-       || (c >= 0x03e2 && c <= 0x03f3)
-       || (c >= 0x1f00 && c <= 0x1f15)
-       || (c >= 0x1f18 && c <= 0x1f1d)
-       || (c >= 0x1f20 && c <= 0x1f45)
-       || (c >= 0x1f48 && c <= 0x1f4d)
-       || (c >= 0x1f50 && c <= 0x1f57)
-       || (c == 0x1f59)
-       || (c == 0x1f5b)
-       || (c == 0x1f5d)
-       || (c >= 0x1f5f && c <= 0x1f7d)
-       || (c >= 0x1f80 && c <= 0x1fb4)
-       || (c >= 0x1fb6 && c <= 0x1fbc)
-       || (c >= 0x1fc2 && c <= 0x1fc4)
-       || (c >= 0x1fc6 && c <= 0x1fcc)
-       || (c >= 0x1fd0 && c <= 0x1fd3)
-       || (c >= 0x1fd6 && c <= 0x1fdb)
-       || (c >= 0x1fe0 && c <= 0x1fec)
-       || (c >= 0x1ff2 && c <= 0x1ff4)
-       || (c >= 0x1ff6 && c <= 0x1ffc))
-     return 1;
- 
-   /* Cyrillic */
-   if ((c >= 0x0401 && c <= 0x040d)
-       || (c >= 0x040f && c <= 0x044f)
-       || (c >= 0x0451 && c <= 0x045c)
-       || (c >= 0x045e && c <= 0x0481)
-       || (c >= 0x0490 && c <= 0x04c4)
-       || (c >= 0x04c7 && c <= 0x04c8)
-       || (c >= 0x04cb && c <= 0x04cc)
-       || (c >= 0x04d0 && c <= 0x04eb)
-       || (c >= 0x04ee && c <= 0x04f5)
-       || (c >= 0x04f8 && c <= 0x04f9))
-     return 1;
- 
-   /* Armenian */
-   if ((c >= 0x0531 && c <= 0x0556)
-       || (c >= 0x0561 && c <= 0x0587))
-     return 1;
- 
-   /* Hebrew */
-   if ((c >= 0x05d0 && c <= 0x05ea)
-       || (c >= 0x05f0 && c <= 0x05f4))
-     return 1;
- 
-   /* Arabic */
-   if ((c >= 0x0621 && c <= 0x063a)
-       || (c >= 0x0640 && c <= 0x0652)
-       || (c >= 0x0670 && c <= 0x06b7)
-       || (c >= 0x06ba && c <= 0x06be)
-       || (c >= 0x06c0 && c <= 0x06ce)
-       || (c >= 0x06e5 && c <= 0x06e7))
-     return 1;
- 
-   /* Devanagari */
-   if ((c >= 0x0905 && c <= 0x0939)
-       || (c >= 0x0958 && c <= 0x0962))
-     return 1;
- 
-   /* Bengali */
-   if ((c >= 0x0985 && c <= 0x098c)
-       || (c >= 0x098f && c <= 0x0990)
-       || (c >= 0x0993 && c <= 0x09a8)
-       || (c >= 0x09aa && c <= 0x09b0)
-       || (c == 0x09b2)
-       || (c >= 0x09b6 && c <= 0x09b9)
-       || (c >= 0x09dc && c <= 0x09dd)
-       || (c >= 0x09df && c <= 0x09e1)
-       || (c >= 0x09f0 && c <= 0x09f1))
-     return 1;
- 
-   /* Gurmukhi */
-   if ((c >= 0x0a05 && c <= 0x0a0a)
-       || (c >= 0x0a0f && c <= 0x0a10)
-       || (c >= 0x0a13 && c <= 0x0a28)
-       || (c >= 0x0a2a && c <= 0x0a30)
-       || (c >= 0x0a32 && c <= 0x0a33)
-       || (c >= 0x0a35 && c <= 0x0a36)
-       || (c >= 0x0a38 && c <= 0x0a39)
-       || (c >= 0x0a59 && c <= 0x0a5c)
-       || (c == 0x0a5e))
-     return 1;
- 
-   /* Gujarati */
-   if ((c >= 0x0a85 && c <= 0x0a8b)
-       || (c == 0x0a8d)
-       || (c >= 0x0a8f && c <= 0x0a91)
-       || (c >= 0x0a93 && c <= 0x0aa8)
-       || (c >= 0x0aaa && c <= 0x0ab0)
-       || (c >= 0x0ab2 && c <= 0x0ab3)
-       || (c >= 0x0ab5 && c <= 0x0ab9)
-       || (c == 0x0ae0))
-     return 1;
- 
-   /* Oriya */
-   if ((c >= 0x0b05 && c <= 0x0b0c)
-       || (c >= 0x0b0f && c <= 0x0b10)
-       || (c >= 0x0b13 && c <= 0x0b28)
-       || (c >= 0x0b2a && c <= 0x0b30)
-       || (c >= 0x0b32 && c <= 0x0b33)
-       || (c >= 0x0b36 && c <= 0x0b39)
-       || (c >= 0x0b5c && c <= 0x0b5d)
-       || (c >= 0x0b5f && c <= 0x0b61))
-     return 1;
- 
-   /* Tamil */
-   if ((c >= 0x0b85 && c <= 0x0b8a)
-       || (c >= 0x0b8e && c <= 0x0b90)
-       || (c >= 0x0b92 && c <= 0x0b95)
-       || (c >= 0x0b99 && c <= 0x0b9a)
-       || (c == 0x0b9c)
-       || (c >= 0x0b9e && c <= 0x0b9f)
-       || (c >= 0x0ba3 && c <= 0x0ba4)
-       || (c >= 0x0ba8 && c <= 0x0baa)
-       || (c >= 0x0bae && c <= 0x0bb5)
-       || (c >= 0x0bb7 && c <= 0x0bb9))
-     return 1;
- 
-   /* Telugu */
-   if ((c >= 0x0c05 && c <= 0x0c0c)
-       || (c >= 0x0c0e && c <= 0x0c10)
-       || (c >= 0x0c12 && c <= 0x0c28)
-       || (c >= 0x0c2a && c <= 0x0c33)
-       || (c >= 0x0c35 && c <= 0x0c39)
-       || (c >= 0x0c60 && c <= 0x0c61))
-     return 1;
- 
-   /* Kannada */
-   if ((c >= 0x0c85 && c <= 0x0c8c)
-       || (c >= 0x0c8e && c <= 0x0c90)
-       || (c >= 0x0c92 && c <= 0x0ca8)
-       || (c >= 0x0caa && c <= 0x0cb3)
-       || (c >= 0x0cb5 && c <= 0x0cb9)
-       || (c >= 0x0ce0 && c <= 0x0ce1))
-     return 1;
- 
-   /* Malayalam */
-   if ((c >= 0x0d05 && c <= 0x0d0c)
-       || (c >= 0x0d0e && c <= 0x0d10)
-       || (c >= 0x0d12 && c <= 0x0d28)
-       || (c >= 0x0d2a && c <= 0x0d39)
-       || (c >= 0x0d60 && c <= 0x0d61))
-     return 1;
- 
-   /* Thai */
-   if ((c >= 0x0e01 && c <= 0x0e30)
-       || (c >= 0x0e32 && c <= 0x0e33)
-       || (c >= 0x0e40 && c <= 0x0e46)
-       || (c >= 0x0e4f && c <= 0x0e5b))
-     return 1;
- 
-   /* Lao */
-   if ((c >= 0x0e81 && c <= 0x0e82)
-       || (c == 0x0e84)
-       || (c == 0x0e87)
-       || (c == 0x0e88)
-       || (c == 0x0e8a)
-       || (c == 0x0e0d)
-       || (c >= 0x0e94 && c <= 0x0e97)
-       || (c >= 0x0e99 && c <= 0x0e9f)
-       || (c >= 0x0ea1 && c <= 0x0ea3)
-       || (c == 0x0ea5)
-       || (c == 0x0ea7)
-       || (c == 0x0eaa)
-       || (c == 0x0eab)
-       || (c >= 0x0ead && c <= 0x0eb0)
-       || (c == 0x0eb2)
-       || (c == 0x0eb3)
-       || (c == 0x0ebd)
-       || (c >= 0x0ec0 && c <= 0x0ec4)
-       || (c == 0x0ec6))
-     return 1;
- 
-   /* Georgian */
-   if ((c >= 0x10a0 && c <= 0x10c5)
-       || (c >= 0x10d0 && c <= 0x10f6))
-     return 1;
- 
-   /* Hiragana */
-   if ((c >= 0x3041 && c <= 0x3094)
-       || (c >= 0x309b && c <= 0x309e))
-     return 1;
- 
-   /* Katakana */
-   if ((c >= 0x30a1 && c <= 0x30fe))
-     return 1;
- 
-   /* Bopmofo */
-   if ((c >= 0x3105 && c <= 0x312c))
-     return 1;
- 
-   /* Hangul */
-   if ((c >= 0x1100 && c <= 0x1159)
-       || (c >= 0x1161 && c <= 0x11a2)
-       || (c >= 0x11a8 && c <= 0x11f9))
-     return 1;
- 
-   /* CJK Unified Ideographs */
-   if ((c >= 0xf900 && c <= 0xfa2d)
-       || (c >= 0xfb1f && c <= 0xfb36)
-       || (c >= 0xfb38 && c <= 0xfb3c)
-       || (c == 0xfb3e)
-       || (c >= 0xfb40 && c <= 0xfb41)
-       || (c >= 0xfb42 && c <= 0xfb44)
-       || (c >= 0xfb46 && c <= 0xfbb1)
-       || (c >= 0xfbd3 && c <= 0xfd3f)
-       || (c >= 0xfd50 && c <= 0xfd8f)
-       || (c >= 0xfd92 && c <= 0xfdc7)
-       || (c >= 0xfdf0 && c <= 0xfdfb)
-       || (c >= 0xfe70 && c <= 0xfe72)
-       || (c == 0xfe74)
-       || (c >= 0xfe76 && c <= 0xfefc)
-       || (c >= 0xff21 && c <= 0xff3a)
-       || (c >= 0xff41 && c <= 0xff5a)
-       || (c >= 0xff66 && c <= 0xffbe)
-       || (c >= 0xffc2 && c <= 0xffc7)
-       || (c >= 0xffca && c <= 0xffcf)
-       || (c >= 0xffd2 && c <= 0xffd7)
-       || (c >= 0xffda && c <= 0xffdc)
-       || (c >= 0x4e00 && c <= 0x9fa5))
-     return 1;
- 
-   error ("universal-character-name '\\u%04x' not valid in identifier", c);
-   return 1;
- #endif
- }
- 
- /* Add the UTF-8 representation of C to the token_buffer.  */
- 
- static void
- utf8_extend_token (c)
-      int c;
- {
-   int shift, mask;
- 
-   if      (c <= 0x0000007f)
-     {
-       extend_token (c);
-       return;
-     }
-   else if (c <= 0x000007ff)
-     shift = 6, mask = 0xc0;
-   else if (c <= 0x0000ffff)
-     shift = 12, mask = 0xe0;
-   else if (c <= 0x001fffff)
-     shift = 18, mask = 0xf0;
-   else if (c <= 0x03ffffff)
-     shift = 24, mask = 0xf8;
-   else
-     shift = 30, mask = 0xfc;
- 
-   extend_token (mask | (c >> shift));
-   do
-     {
-       shift -= 6;
-       extend_token ((unsigned char) (0x80 | (c >> shift)));
-     }
-   while (shift);
- }
- #endif
  \f
  int
  c_lex (value)
--- 356,361 ----
Index: cpplex.c
===================================================================
RCS file: /cvsroot/gcc/gcc/gcc/cpplex.c,v
retrieving revision 1.215
diff -c -p -r1.215 cpplex.c
*** cpplex.c	26 Sep 2002 22:25:12 -0000	1.215
--- cpplex.c	27 Oct 2002 17:35:33 -0000
*************** static void adjust_column PARAMS ((cpp_r
*** 71,77 ****
  static int skip_whitespace PARAMS ((cpp_reader *, cppchar_t));
  static cpp_hashnode *parse_identifier PARAMS ((cpp_reader *));
  static uchar *parse_slow PARAMS ((cpp_reader *, const uchar *, int,
! 				  unsigned int *));
  static void parse_number PARAMS ((cpp_reader *, cpp_string *, int));
  static int unescaped_terminator_p PARAMS ((cpp_reader *, const uchar *));
  static void parse_string PARAMS ((cpp_reader *, cpp_token *, cppchar_t));
--- 71,77 ----
  static int skip_whitespace PARAMS ((cpp_reader *, cppchar_t));
  static cpp_hashnode *parse_identifier PARAMS ((cpp_reader *));
  static uchar *parse_slow PARAMS ((cpp_reader *, const uchar *, int,
! 				  unsigned int *, unsigned int *));
  static void parse_number PARAMS ((cpp_reader *, cpp_string *, int));
  static int unescaped_terminator_p PARAMS ((cpp_reader *, const uchar *));
  static void parse_string PARAMS ((cpp_reader *, cpp_token *, cppchar_t));
*************** static tokenrun *next_tokenrun PARAMS ((
*** 86,91 ****
--- 86,95 ----
  
  static unsigned int hex_digit_value PARAMS ((unsigned int));
  static _cpp_buff *new_buff PARAMS ((size_t));
+ static bool identifier_ucs_p PARAMS ((cpp_reader *, cppchar_t));
+ static void utf8_extend_token PARAMS ((struct obstack *, int));
+ static cppchar_t utf8_to_char PARAMS((const unsigned char **));
+ 
  
  /* Utility routine:
  
*************** trigraph_p (pfile)
*** 161,166 ****
--- 165,529 ----
    return accept;
  }
  
+ /* Returns nonzero if C is a universal-character-name.  Give an error if it
+    is not one which may appear in an identifier, as per [extendid].
+ 
+    Note that extended character support in identifiers has not yet been
+    implemented.  It is my personal opinion that this is not a desirable
+    feature.  Portable code cannot count on support for more than the basic
+    identifier character set.  */
+ 
+ static bool
+ identifier_ucs_p (pfile, c)
+      cpp_reader *pfile;
+      cppchar_t c;
+ {
+ #ifdef TARGET_EBCDIC
+   return 0;
+ #else
+   /* ASCII.  */
+   if (c < 0x7f)
+     return 0;
+ 
+   /* None of the valid chars are outside the Basic Multilingual Plane (the
+      low 16 bits).  */
+   if (c > 0xffff)
+     {
+       cpp_error_with_line (pfile, DL_ERROR,
+                            pfile->line, 1, /* XXX */
+                            "universal-character-name '\\U%08x' not valid in identifier", (int)c);
+       return 0;
+     }
+   
+   /* Latin */
+   if ((c >= 0x00c0 && c <= 0x00d6)
+       || (c >= 0x00d8 && c <= 0x00f6)
+       || (c >= 0x00f8 && c <= 0x01f5)
+       || (c >= 0x01fa && c <= 0x0217)
+       || (c >= 0x0250 && c <= 0x02a8)
+       || (c >= 0x1e00 && c <= 0x1e9a)
+       || (c >= 0x1ea0 && c <= 0x1ef9))
+     return 1;
+ 
+   /* Greek */
+   if ((c == 0x0384)
+       || (c >= 0x0388 && c <= 0x038a)
+       || (c == 0x038c)
+       || (c >= 0x038e && c <= 0x03a1)
+       || (c >= 0x03a3 && c <= 0x03ce)
+       || (c >= 0x03d0 && c <= 0x03d6)
+       || (c == 0x03da)
+       || (c == 0x03dc)
+       || (c == 0x03de)
+       || (c == 0x03e0)
+       || (c >= 0x03e2 && c <= 0x03f3)
+       || (c >= 0x1f00 && c <= 0x1f15)
+       || (c >= 0x1f18 && c <= 0x1f1d)
+       || (c >= 0x1f20 && c <= 0x1f45)
+       || (c >= 0x1f48 && c <= 0x1f4d)
+       || (c >= 0x1f50 && c <= 0x1f57)
+       || (c == 0x1f59)
+       || (c == 0x1f5b)
+       || (c == 0x1f5d)
+       || (c >= 0x1f5f && c <= 0x1f7d)
+       || (c >= 0x1f80 && c <= 0x1fb4)
+       || (c >= 0x1fb6 && c <= 0x1fbc)
+       || (c >= 0x1fc2 && c <= 0x1fc4)
+       || (c >= 0x1fc6 && c <= 0x1fcc)
+       || (c >= 0x1fd0 && c <= 0x1fd3)
+       || (c >= 0x1fd6 && c <= 0x1fdb)
+       || (c >= 0x1fe0 && c <= 0x1fec)
+       || (c >= 0x1ff2 && c <= 0x1ff4)
+       || (c >= 0x1ff6 && c <= 0x1ffc))
+     return 1;
+ 
+   /* Cyrillic */
+   if ((c >= 0x0401 && c <= 0x040d)
+       || (c >= 0x040f && c <= 0x044f)
+       || (c >= 0x0451 && c <= 0x045c)
+       || (c >= 0x045e && c <= 0x0481)
+       || (c >= 0x0490 && c <= 0x04c4)
+       || (c >= 0x04c7 && c <= 0x04c8)
+       || (c >= 0x04cb && c <= 0x04cc)
+       || (c >= 0x04d0 && c <= 0x04eb)
+       || (c >= 0x04ee && c <= 0x04f5)
+       || (c >= 0x04f8 && c <= 0x04f9))
+     return 1;
+ 
+   /* Armenian */
+   if ((c >= 0x0531 && c <= 0x0556)
+       || (c >= 0x0561 && c <= 0x0587))
+     return 1;
+ 
+   /* Hebrew */
+   if ((c >= 0x05d0 && c <= 0x05ea)
+       || (c >= 0x05f0 && c <= 0x05f4))
+     return 1;
+ 
+   /* Arabic */
+   if ((c >= 0x0621 && c <= 0x063a)
+       || (c >= 0x0640 && c <= 0x0652)
+       || (c >= 0x0670 && c <= 0x06b7)
+       || (c >= 0x06ba && c <= 0x06be)
+       || (c >= 0x06c0 && c <= 0x06ce)
+       || (c >= 0x06e5 && c <= 0x06e7))
+     return 1;
+ 
+   /* Devanagari */
+   if ((c >= 0x0905 && c <= 0x0939)
+       || (c >= 0x0958 && c <= 0x0962))
+     return 1;
+ 
+   /* Bengali */
+   if ((c >= 0x0985 && c <= 0x098c)
+       || (c >= 0x098f && c <= 0x0990)
+       || (c >= 0x0993 && c <= 0x09a8)
+       || (c >= 0x09aa && c <= 0x09b0)
+       || (c == 0x09b2)
+       || (c >= 0x09b6 && c <= 0x09b9)
+       || (c >= 0x09dc && c <= 0x09dd)
+       || (c >= 0x09df && c <= 0x09e1)
+       || (c >= 0x09f0 && c <= 0x09f1))
+     return 1;
+ 
+   /* Gurmukhi */
+   if ((c >= 0x0a05 && c <= 0x0a0a)
+       || (c >= 0x0a0f && c <= 0x0a10)
+       || (c >= 0x0a13 && c <= 0x0a28)
+       || (c >= 0x0a2a && c <= 0x0a30)
+       || (c >= 0x0a32 && c <= 0x0a33)
+       || (c >= 0x0a35 && c <= 0x0a36)
+       || (c >= 0x0a38 && c <= 0x0a39)
+       || (c >= 0x0a59 && c <= 0x0a5c)
+       || (c == 0x0a5e))
+     return 1;
+ 
+   /* Gujarati */
+   if ((c >= 0x0a85 && c <= 0x0a8b)
+       || (c == 0x0a8d)
+       || (c >= 0x0a8f && c <= 0x0a91)
+       || (c >= 0x0a93 && c <= 0x0aa8)
+       || (c >= 0x0aaa && c <= 0x0ab0)
+       || (c >= 0x0ab2 && c <= 0x0ab3)
+       || (c >= 0x0ab5 && c <= 0x0ab9)
+       || (c == 0x0ae0))
+     return 1;
+ 
+   /* Oriya */
+   if ((c >= 0x0b05 && c <= 0x0b0c)
+       || (c >= 0x0b0f && c <= 0x0b10)
+       || (c >= 0x0b13 && c <= 0x0b28)
+       || (c >= 0x0b2a && c <= 0x0b30)
+       || (c >= 0x0b32 && c <= 0x0b33)
+       || (c >= 0x0b36 && c <= 0x0b39)
+       || (c >= 0x0b5c && c <= 0x0b5d)
+       || (c >= 0x0b5f && c <= 0x0b61))
+     return 1;
+ 
+   /* Tamil */
+   if ((c >= 0x0b85 && c <= 0x0b8a)
+       || (c >= 0x0b8e && c <= 0x0b90)
+       || (c >= 0x0b92 && c <= 0x0b95)
+       || (c >= 0x0b99 && c <= 0x0b9a)
+       || (c == 0x0b9c)
+       || (c >= 0x0b9e && c <= 0x0b9f)
+       || (c >= 0x0ba3 && c <= 0x0ba4)
+       || (c >= 0x0ba8 && c <= 0x0baa)
+       || (c >= 0x0bae && c <= 0x0bb5)
+       || (c >= 0x0bb7 && c <= 0x0bb9))
+     return 1;
+ 
+   /* Telugu */
+   if ((c >= 0x0c05 && c <= 0x0c0c)
+       || (c >= 0x0c0e && c <= 0x0c10)
+       || (c >= 0x0c12 && c <= 0x0c28)
+       || (c >= 0x0c2a && c <= 0x0c33)
+       || (c >= 0x0c35 && c <= 0x0c39)
+       || (c >= 0x0c60 && c <= 0x0c61))
+     return 1;
+ 
+   /* Kannada */
+   if ((c >= 0x0c85 && c <= 0x0c8c)
+       || (c >= 0x0c8e && c <= 0x0c90)
+       || (c >= 0x0c92 && c <= 0x0ca8)
+       || (c >= 0x0caa && c <= 0x0cb3)
+       || (c >= 0x0cb5 && c <= 0x0cb9)
+       || (c >= 0x0ce0 && c <= 0x0ce1))
+     return 1;
+ 
+   /* Malayalam */
+   if ((c >= 0x0d05 && c <= 0x0d0c)
+       || (c >= 0x0d0e && c <= 0x0d10)
+       || (c >= 0x0d12 && c <= 0x0d28)
+       || (c >= 0x0d2a && c <= 0x0d39)
+       || (c >= 0x0d60 && c <= 0x0d61))
+     return 1;
+ 
+   /* Thai */
+   if ((c >= 0x0e01 && c <= 0x0e30)
+       || (c >= 0x0e32 && c <= 0x0e33)
+       || (c >= 0x0e40 && c <= 0x0e46)
+       || (c >= 0x0e4f && c <= 0x0e5b))
+     return 1;
+ 
+   /* Lao */
+   if ((c >= 0x0e81 && c <= 0x0e82)
+       || (c == 0x0e84)
+       || (c == 0x0e87)
+       || (c == 0x0e88)
+       || (c == 0x0e8a)
+       || (c == 0x0e8d)
+       || (c >= 0x0e94 && c <= 0x0e97)
+       || (c >= 0x0e99 && c <= 0x0e9f)
+       || (c >= 0x0ea1 && c <= 0x0ea3)
+       || (c == 0x0ea5)
+       || (c == 0x0ea7)
+       || (c == 0x0eaa)
+       || (c == 0x0eab)
+       || (c >= 0x0ead && c <= 0x0eb0)
+       || (c == 0x0eb2)
+       || (c == 0x0eb3)
+       || (c == 0x0ebd)
+       || (c >= 0x0ec0 && c <= 0x0ec4)
+       || (c == 0x0ec6))
+     return 1;
+ 
+   /* Georgian */
+   if ((c >= 0x10a0 && c <= 0x10c5)
+       || (c >= 0x10d0 && c <= 0x10f6))
+     return 1;
+ 
+   /* Hiragana */
+   if ((c >= 0x3041 && c <= 0x3094)
+       || (c >= 0x309b && c <= 0x309e))
+     return 1;
+ 
+   /* Katakana */
+   if ((c >= 0x30a1 && c <= 0x30fe))
+     return 1;
+ 
+   /* Bopomofo */
+   if ((c >= 0x3105 && c <= 0x312c))
+     return 1;
+ 
+   /* Hangul */
+   if ((c >= 0x1100 && c <= 0x1159)
+       || (c >= 0x1161 && c <= 0x11a2)
+       || (c >= 0x11a8 && c <= 0x11f9))
+     return 1;
+ 
+   /* CJK Unified Ideographs */
+   if ((c >= 0xf900 && c <= 0xfa2d)
+       || (c >= 0xfb1f && c <= 0xfb36)
+       || (c >= 0xfb38 && c <= 0xfb3c)
+       || (c == 0xfb3e)
+       || (c >= 0xfb40 && c <= 0xfb41)
+       || (c >= 0xfb42 && c <= 0xfb44)
+       || (c >= 0xfb46 && c <= 0xfbb1)
+       || (c >= 0xfbd3 && c <= 0xfd3f)
+       || (c >= 0xfd50 && c <= 0xfd8f)
+       || (c >= 0xfd92 && c <= 0xfdc7)
+       || (c >= 0xfdf0 && c <= 0xfdfb)
+       || (c >= 0xfe70 && c <= 0xfe72)
+       || (c == 0xfe74)
+       || (c >= 0xfe76 && c <= 0xfefc)
+       || (c >= 0xff21 && c <= 0xff3a)
+       || (c >= 0xff41 && c <= 0xff5a)
+       || (c >= 0xff66 && c <= 0xffbe)
+       || (c >= 0xffc2 && c <= 0xffc7)
+       || (c >= 0xffca && c <= 0xffcf)
+       || (c >= 0xffd2 && c <= 0xffd7)
+       || (c >= 0xffda && c <= 0xffdc)
+       || (c >= 0x4e00 && c <= 0x9fa5))
+     return 1;
+ 
+   cpp_error_with_line (pfile, DL_ERROR,
+                        pfile->line, 1, /* XXX */
+                        "universal-character-name '\\u%04x' not valid in identifier", c);
+   return 0;
+ #endif
+ }
+ 
+ /* Add the UTF-8 representation of C to the token_buffer.  */
+ 
+ static void
+ utf8_extend_token (stack, c)
+      struct obstack *stack;
+      int c;
+ {
+   int shift, mask;
+ 
+   if      (c <= 0x0000007f)
+     {
+       obstack_1grow (stack, c);
+       return;
+     }
+   else if (c <= 0x000007ff)
+     shift = 6, mask = 0xc0;
+   else if (c <= 0x0000ffff)
+     shift = 12, mask = 0xe0;
+   else if (c <= 0x001fffff)
+     shift = 18, mask = 0xf0;
+   else if (c <= 0x03ffffff)
+     shift = 24, mask = 0xf8;
+   else
+     shift = 30, mask = 0xfc;
+ 
+   obstack_1grow (stack, mask | (c >> shift));
+   do
+     {
+       shift -= 6;
+       obstack_1grow (stack, (unsigned char) (0x80 | ((c >> shift) & 0x3f)));
+     }
+   while (shift);
+ }
+ 
+ static cppchar_t
+ utf8_to_char (pos)
+      const unsigned char **pos;
+ {
+   cppchar_t result = 0;
+   const unsigned char *s = *pos;
+   if (*s < 128)
+     {
+       result = *s;
+       *pos += 1;
+     }
+   else if (*s < 0xc0)
+     {
+       /* Cannot occur as first byte */
+       abort();
+     }
+   else if (*s < 0xE0)
+     {
+       result = ((s[0] & 0x1f) << 6) + (s[1] & 0x3f);
+       *pos += 2;
+     }
+   else if (*s < 0xF0)
+     {
+       result =
+         ((s[0] & 0xf) << 12) +
+         ((s[1] & 0x3f) << 6) +
+         (s[2] & 0x3f);
+       *pos += 3;
+     }
+   else if (*s < 0xF8)
+     {
+       result =
+         ((s[0] & 0x7) << 18) +
+         ((s[1] & 0x3f) << 12) +
+         ((s[2] & 0x3f) << 6) +
+         (s[3] & 0x3f);
+       *pos += 4;
+     }
+   else
+     {
+       /* Other codes are reserved. */
+       abort ();
+     }
+   return result;
+ }
+ 
  /* Skips any escaped newlines introduced by '?' or a '\\', assumed to
     lie in buffer->cur[-1].  Returns the next byte, which will be in
     buffer->cur[-1].  This routine performs preprocessing stages 1 and
*************** parse_identifier (pfile)
*** 451,461 ****
    /* Check for slow-path cases.  */
    if (*cur == '?' || *cur == '\\' || *cur == '$')
      {
!       unsigned int len;
  
!       base = parse_slow (pfile, cur, 0, &len);
        result = (cpp_hashnode *)
  	ht_lookup (pfile->hash_table, base, len, HT_ALLOCED);
      }
    else
      {
--- 814,826 ----
    /* Check for slow-path cases.  */
    if (*cur == '?' || *cur == '\\' || *cur == '$')
      {
!       unsigned int len, utf8;
  
!       base = parse_slow (pfile, cur, 0, &len, &utf8);
        result = (cpp_hashnode *)
  	ht_lookup (pfile->hash_table, base, len, HT_ALLOCED);
+       if (utf8)
+         result->flags |= NODE_UTF8;
      }
    else
      {
*************** parse_identifier (pfile)
*** 493,503 ****
     pointer to the token's NUL-terminated spelling in permanent
     storage, and sets PLEN to its length.  */
  static uchar *
! parse_slow (pfile, cur, number_p, plen)
       cpp_reader *pfile;
       const uchar *cur;
       int number_p;
       unsigned int *plen;
  {
    cpp_buffer *buffer = pfile->buffer;
    const uchar *base = buffer->cur - 1;
--- 858,869 ----
     pointer to the token's NUL-terminated spelling in permanent
     storage, and sets PLEN to its length.  */
  static uchar *
! parse_slow (pfile, cur, number_p, plen, utf8)
       cpp_reader *pfile;
       const uchar *cur;
       int number_p;
       unsigned int *plen;
+      unsigned int *utf8;
  {
    cpp_buffer *buffer = pfile->buffer;
    const uchar *base = buffer->cur - 1;
*************** parse_slow (pfile, cur, number_p, plen)
*** 516,523 ****
--- 882,906 ----
    prevc = cur[-1];
    c = *cur++;
    buffer->cur = cur;
+   *utf8 = 0;
    for (;;)
      {
+       if (c == '\\' && (*buffer->cur == 'u'
+                         || *buffer->cur == 'U'))
+         {
+           cur = buffer->cur - 1;
+           c = *buffer->cur++;
+           if (maybe_read_ucs (pfile, &buffer->cur, buffer->rlimit, &c) == 0
+               && identifier_ucs_p (pfile, c))
+             {
+               utf8_extend_token (stack, c);
+               c = *buffer->cur++;
+               *utf8 = 1;
+               continue;
+             }
+           buffer->cur = cur;
+           c = *buffer->cur++;
+         }
        /* Potential escaped newline?  */
        buffer->backup_to = buffer->cur - 1;
        if (c == '?' || c == '\\')
*************** parse_number (pfile, number, leading_per
*** 570,575 ****
--- 953,959 ----
       int leading_period;
  {
    const uchar *cur;
+   unsigned int ignored;
  
    /* Fast-path loop.  Skim over a normal number.
       N.B. ISIDNUM does not include $.  */
*************** parse_number (pfile, number, leading_per
*** 579,585 ****
  
    /* Check for slow-path cases.  */
    if (*cur == '?' || *cur == '\\' || *cur == '$')
!     number->text = parse_slow (pfile, cur, 1 + leading_period, &number->len);
    else
      {
        const uchar *base = pfile->buffer->cur - 1;
--- 963,970 ----
  
    /* Check for slow-path cases.  */
    if (*cur == '?' || *cur == '\\' || *cur == '$')
!     number->text = parse_slow (pfile, cur, 1 + leading_period,
!                                &number->len, &ignored);
    else
      {
        const uchar *base = pfile->buffer->cur - 1;
*************** _cpp_lex_direct (pfile)
*** 1025,1031 ****
        if (c == '?')
  	result->type = CPP_QUERY;
        else if (c == '\\')
! 	goto random_char;
        else
  	goto trigraph;
        break;
--- 1410,1434 ----
        if (c == '?')
  	result->type = CPP_QUERY;
        else if (c == '\\')
!         {
!           const unsigned char *pos = buffer->cur;
!           
!           c = *buffer->cur++;
!           if ((c == 'u' || c == 'U')
!               && maybe_read_ucs (pfile, &buffer->cur,
!                                  buffer->rlimit, &c) == 0
!               && identifier_ucs_p (pfile, c))
!             {
!               buffer->cur = pos;
!               goto start_ident;
!             }
!           else
!             {
!               c = '\\';
!               buffer->cur = pos;
!               goto random_char;
!             }
!         }
        else
  	goto trigraph;
        break;
*************** cpp_output_token (token, fp)
*** 1503,1509 ****
  
      spell_ident:
      case SPELL_IDENT:
!       fwrite (NODE_NAME (token->val.node), 1, NODE_LEN (token->val.node), fp);
      break;
  
      case SPELL_NUMBER:
--- 1906,1937 ----
  
      spell_ident:
      case SPELL_IDENT:
!       if ((token->val.node->flags & NODE_UTF8) == 0)
!         fwrite (NODE_NAME (token->val.node), 1, NODE_LEN (token->val.node), fp);
!       else
!         {
!           const unsigned char *s = NODE_NAME (token->val.node);
!           int len = NODE_LEN (token->val.node);
!           while (len > 0)
!             {
!               if (*s < 128)
!                 {
!                   fwrite (s, 1, 1, fp);
!                   s++;
!                   len--;
!                 }
!               else
!                 {
!                   const unsigned char *old = s;
!                   cppchar_t code = utf8_to_char (&s);
!                   if (code < 0x10000)
!                     fprintf (fp, "\\u%.4x", code);
!                   else
!                     fprintf (fp, "\\U%.8x", code);
!                   len -= s - old;
!                 }
!             }
!         }
      break;
  
      case SPELL_NUMBER:
Index: cpplib.h
===================================================================
RCS file: /cvsroot/gcc/gcc/gcc/cpplib.h,v
retrieving revision 1.237
diff -c -p -r1.237 cpplib.h
*** cpplib.h	26 Sep 2002 22:25:12 -0000	1.237
--- cpplib.h	27 Oct 2002 17:35:33 -0000
*************** extern const char *progname;
*** 443,448 ****
--- 443,449 ----
  #define NODE_DIAGNOSTIC (1 << 3)	/* Possible diagnostic when lexed.  */
  #define NODE_WARN	(1 << 4)	/* Warn if redefined or undefined.  */
  #define NODE_DISABLED	(1 << 5)	/* A disabled macro.  */
+ #define NODE_UTF8       (1 << 6)        /* Node has UTF-8 bytes in it */
  
  /* Different flavors of hash node.  */
  enum node_type

* Re: Implementing Universal Character Names in identifiers
  2002-10-27 23:15 Implementing Universal Character Names in identifiers Martin v. Löwis
@ 2002-10-27 23:47 ` Fergus Henderson
  2002-10-28  0:11   ` Martin v. Löwis
  2002-10-27 23:51 ` Zack Weinberg
  2002-11-07  0:09 ` Neil Booth
  2 siblings, 1 reply; 30+ messages in thread
From: Fergus Henderson @ 2002-10-27 23:47 UTC
  To: Martin v. Löwis; +Cc: gcc-patches

On 28-Oct-2002, Martin v. Löwis <loewis@informatik.hu-berlin.de> wrote:
> Index: cpplex.c
...
> +    Note that extended character support in identifiers has not yet been
> +    implemented.  It is my personal opinion that this is not a desirable
> +    feature.  Portable code cannot count on support for more than the basic
> +    identifier character set.  */

Your patch makes that comment out-of-date, doesn't it?

-- 
Fergus Henderson <fjh@cs.mu.oz.au>  |  "I have always known that the pursuit
The University of Melbourne         |  of excellence is a lethal habit"
WWW: <http://www.cs.mu.oz.au/~fjh>  |     -- the last words of T. S. Garp.

* Re: Implementing Universal Character Names in identifiers
  2002-10-27 23:15 Implementing Universal Character Names in identifiers Martin v. Löwis
  2002-10-27 23:47 ` Fergus Henderson
@ 2002-10-27 23:51 ` Zack Weinberg
  2002-10-28  0:53   ` Martin v. Löwis
  2002-11-07  0:09 ` Neil Booth
  2 siblings, 1 reply; 30+ messages in thread
From: Zack Weinberg @ 2002-10-27 23:51 UTC
  To: Martin v. Löwis; +Cc: gcc-patches

On Mon, Oct 28, 2002 at 08:15:39AM +0100, Martin v. Löwis wrote:
> This patch implements UCNs in cpplib. It does so by converting the
> UCN to UTF-8, putting the UTF-8 bytes into the internal
> representation of the identifier.

This is the right general idea.  Thank you for the patch.

It would be worthwhile - as a separate patch, mind - to add support
for extended characters written in bare UTF-8 in identifiers.  My plan
for general extended-character-encoding support is to convert to UTF-8
and process that representation; that plus iconv plus some glue and
heuristics will get us most of the way there.

You want to look closely at what is currently done for UCNs in wide
character constants and string literals.  I'm pretty sure it's wrong,
and I would appreciate suggestions.

> The back-ends will transparently output the UTF-8 identifiers into the
> assembler file. If GNU as is used (or any other assembler supporting
> non-ASCII identifiers), these UTF-8 strings will be copied transparently
> into the object file. If the assembler does not support UTF-8, it
> will produce a diagnostic.

I thought we had some sort of encoding schema for assemblers that
don't support UTF-8?  How does this interact with the C++ ABI? 

We should normalize identifiers before entering them in the symbol
table, and for output; otherwise there will be great confusion.  That
needs to happen as part of the initial patch.

...
> + /* Returns nonzero if C is a universal-character-name.  Give an error if it
> +    is not one which may appear in an identifier, as per [extendid].
> + 
> +    Note that extended character support in identifiers has not yet been
> +    implemented.  It is my personal opinion that this is not a desirable
> +    feature.  Portable code cannot count on support for more than the basic
> +    identifier character set.  */

(1) This routine belongs in libiberty, as part of the safe-ctype.h interface.
(2) Isn't this comment now inaccurate?  You just did implement
    extended characters in identifiers.
(3) The ranges need to be updated from the latest Unicode standard,
    and the standard version noted in commentary.

Due to the size of this routine, and the concerns with the rest of
your change, please submit a patch that does just that, all by itself;
that will get in easily, and then we can iterate on the rest of it.

> +   else if (*s < 0xc0)
> +     {
> +       /* Cannot occur as first byte */
> +       abort();
> +     }

Don't use abort in cpplib; use cpp_error (pfile, DL_ICE, ...).
Further, this can happen as a result of ill-formed user input, can't
it?  Therefore this should be a plain error, not an ICE.
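
To make the point concrete, here is a minimal sketch of a decoder that
reports ill-formed input to its caller instead of aborting.  The name
utf8_decode and its signature are hypothetical, not the utf8_to_char
from the patch.  The sketch does not reject overlong encodings; a real
implementation probably should.

```c
#include <stddef.h>

/* Hypothetical decoder: read one UTF-8 sequence from S, of which LEN
   bytes are available, and store the code point in *CODE.  Returns
   the number of bytes consumed, or 0 if the input is ill-formed, so
   that the caller can issue an ordinary diagnostic.  */
static size_t
utf8_decode (const unsigned char *s, size_t len, unsigned long *code)
{
  size_t n, i;
  unsigned long c;

  if (len == 0)
    return 0;
  if (s[0] < 0x80)
    {
      *code = s[0];
      return 1;
    }
  if (s[0] < 0xC0)
    return 0;           /* A continuation byte cannot come first.  */
  if (s[0] < 0xE0)
    {
      n = 2;
      c = s[0] & 0x1F;
    }
  else if (s[0] < 0xF0)
    {
      n = 3;
      c = s[0] & 0x0F;
    }
  else if (s[0] < 0xF8)
    {
      n = 4;
      c = s[0] & 0x07;
    }
  else
    return 0;           /* 0xF8..0xFF never start a sequence.  */
  if (len < n)
    return 0;           /* Truncated sequence.  */
  for (i = 1; i < n; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;       /* Not a continuation byte.  */
      c = (c << 6) | (s[i] & 0x3F);
    }
  *code = c;
  return n;
}
```

A zero return lets the lexer treat a stray byte such as 0x80 as a user
error rather than an internal compiler error.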

>     /* Check for slow-path cases.  */
>     if (*cur == '?' || *cur == '\\' || *cur == '$')
> !     number->text = parse_slow (pfile, cur, 1 + leading_period,
> !                                &number->len, &ignored);

I don't think the UTF8 flag should be ignored at this point.  Consider
what happens if we get

  asdf ## 12\u03F8

-- that is valid, and needs to turn into a single CPP_NAME token with
the UTF8 flag set.  It seems safe to me to carry around the UTF8 bit
on all CPP_NUMBER tokens.  Naturally, cpp_classify_number should
categorize such numbers as CPP_N_INVALID (allowing digits outside the
basic source character set strikes me as a bad idea).

>       spell_ident:
>       case SPELL_IDENT:
> !       if ((token->val.node->flags & NODE_UTF8) == 0)
> !         fwrite (NODE_NAME (token->val.node), 1, NODE_LEN (token->val.node), fp);
> !       else
> !         {
> !           const unsigned char *s = NODE_NAME (token->val.node);
> !           int len = NODE_LEN (token->val.node);
> !           for (; len; len--)
> !             {
> !               if (*s < 128)
> !                 {
> !                   fwrite (s, 1, 1, fp);
> !                   s++;
> !                   len--;
> !                 }
> !               else
> !                 {
> !                   const unsigned char *old = s;
> !                   cppchar_t code = utf8_to_char (&s);
> !                   if (code < 0x10000)
> !                     fprintf (fp, "\\u%.4x", code);
> !                   else
> !                     fprintf (fp, "\\U%.8x", code);
> !                   len += s - old;
> !                 }
> !             }
> !         }

Please find a more efficient way to accomplish this.  This code is
already *the* bottleneck for textual preprocessing.  (For instance, if
you implement support for raw UTF8 as input encoding, we can just
splat out the identifier as is.)

zw

* Re: Implementing Universal Character Names in identifiers
  2002-10-27 23:47 ` Fergus Henderson
@ 2002-10-28  0:11   ` Martin v. Löwis
  0 siblings, 0 replies; 30+ messages in thread
From: Martin v. Löwis @ 2002-10-28  0:11 UTC (permalink / raw)
  To: Fergus Henderson; +Cc: gcc-patches

Fergus Henderson <fjh@cs.mu.oz.au> writes:

> > +    Note that extended character support in identifiers has not yet been
> > +    implemented.  It is my personal opinion that this is not a desirable
> > +    feature.  Portable code cannot count on support for more than the basic
> > +    identifier character set.  */
> 
> Your patch makes that comment out-of-date, doesn't it?

Oops, right, this is *not* my personal opinion, but that of whoever I
copied this from :-)

Thanks for noticing,
Martin

* Re: Implementing Universal Character Names in identifiers
  2002-10-27 23:51 ` Zack Weinberg
@ 2002-10-28  0:53   ` Martin v. Löwis
  2002-10-28  1:30     ` Fergus Henderson
                       ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Martin v. Löwis @ 2002-10-28  0:53 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: gcc-patches

Zack Weinberg <zack@codesourcery.com> writes:

> It would be worthwhile - as a separate patch, mind - to add support
> for extended characters written in bare UTF-8 in identifiers.  

I completely agree.

> My plan for general extended-character-encoding support is to
> convert to UTF-8 and process that representation; that plus iconv
> plus some glue and heuristics will get us most of the way there.

Notice that this might be difficult to incorporate into the
parser. Parsing extended characters will require maintenance of a
shift state (mbstate_t); iconv does not directly expose the
mbstate. So you have to carefully keep the mbstate_t and the iconv_t
synchronized.

Alternatively, you could even use iconv to split the input into
individual characters, and then perform parsing on the iconv result
(conversion to UTF-8 might be appropriate); but that would be a
significant change.

> You want to look closely at what is currently done for UCNs in wide
> character constants and string literals.  I'm pretty sure it's wrong,
> and I would appreciate suggestions.

As for the preprocessor, it looks quite right to me; also, the output
is right, assuming gcc implies ISO 10646 for wchar_t on all platforms
(which is a sensible choice, and correct for GNU systems).

> I thought we had some sort of encoding scheme for assemblers that
> don't support UTF-8?  How does this interact with the C++ ABI?

The C++ ABI left this open; the current recommendation (which is not
normative) is to use UTF-8 unless something else is specified by the
vendor. Encoding schemes don't really work for C, and add complexity
for C++.

> We should normalize identifiers before entering them in the symbol
> table, and for output; otherwise there will be great confusion.
> That needs to happen as part of the initial patch.

I am now of the opinion that encoding schemes (other than UTF-8) should
not be used. Compatibility with Java might be an issue; it might be
necessary to special-case extern "Java" identifiers in the C++
front-end. I could add that to the patch - although I would prefer if
the Java API would change.

As for assemblers that don't allow UTF-8 in source code: I'd rather
disable the feature for those assemblers than try to find a
solution - this allows for compatibility should the vendor decide on
this matter later.

The tricky part is how to determine whether UTF-8 is supported in
assembler output: initially, I'd just assume that GNU as supports it,
and no other assembler does; this can then be extended as support on
other systems becomes possible.

The next question is where to block unacceptable identifiers: in
cpplib, or later? Later might be better since, at least for C++,
supporting this in Java identifiers might be desirable, plus you could
use it in macro names even if the assembler does not support it.

If UTF-8 identifiers must be rejected (or converted) in the language
front-ends, how can I efficiently determine whether an identifier uses
UTF-8? Can I use deprecated_flag on IDENTIFIERs for that?

Assuming this is all agreeable, I'll try to revise the patch
appropriately.

> (1) This routine belongs in libiberty, as part of the safe-ctype.h
> interface.

Really? The list of characters is quite specific to the language (and
perhaps even the language revision). I haven't even checked whether
the lists of acceptable characters are the same in C++98 and C99.

> (2) Isn't this comment now inaccurate?  You just did implement
>     extended characters in identifiers.

Yes, right :-(

> (3) The ranges need to be updated from the latest Unicode standard,
>     and the standard version noted in commentary.

No. They are mandated by the language specification. For C++, see
Annex E. For C99, see Annex D (unfortunately, I can't, since I don't
have the final copy of C99). C++ claims to have copied the table from
PDTR 10176, C from TR 10176.

*If* my C99 draft is accurate, then there are differences between
these two tables: e.g. in C99, U+00AA (FEMININE ORDINAL INDICATOR)
is acceptable in an identifier; in C++98, it is not.
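
One way to cope with per-language lists is a single range table with a
bitmask of the languages that accept each range.  The sketch below is
only an illustration: all names are hypothetical, and the table is a
three-row excerpt, not the full lists.  The U+00AA row reflects the
difference described above; the two shared Latin ranges are believed
to be in both annexes but should be checked against them.

```c
#include <stddef.h>

/* Hypothetical language bits.  */
enum ucn_lang { UCN_C99 = 1, UCN_CXX98 = 2 };

struct ucn_range
{
  unsigned long lo, hi;         /* Inclusive code point range.  */
  unsigned langs;               /* Mask of languages accepting it.  */
};

/* Excerpt for illustration only.  The complete lists are in C99
   Annex D and C++98 Annex E.  */
static const struct ucn_range ucn_ranges[] = {
  { 0x00AA, 0x00AA, UCN_C99 },              /* C99 only.  */
  { 0x00C0, 0x00D6, UCN_C99 | UCN_CXX98 },
  { 0x00D8, 0x00F6, UCN_C99 | UCN_CXX98 },
};

/* Return nonzero if code point C may appear in an identifier of
   language LANG, according to the excerpt above.  */
static int
ucn_valid_in_identifier (unsigned long c, enum ucn_lang lang)
{
  size_t i;

  for (i = 0; i < sizeof ucn_ranges / sizeof ucn_ranges[0]; i++)
    if (c >= ucn_ranges[i].lo && c <= ucn_ranges[i].hi)
      return (ucn_ranges[i].langs & lang) != 0;
  return 0;
}
```

A full table would be sorted so that a binary search could replace the
linear scan.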

> Due to the size of this routine, and the concerns with the rest of
> your change, please submit a patch that does just that, all by itself;
> that will get in easily, and then we can iterate on the rest of it.

I will do that, when the issue of per-language tables has been
settled.

> Don't use abort in cpplib; use cpp_error (pfile, DL_ICE, ...).
> Further, this can happen as a result of ill-formed user input, can't
> it?  Therefore this should be a plain error, not an ICE.

Right, will fix.

> >     /* Check for slow-path cases.  */
> >     if (*cur == '?' || *cur == '\\' || *cur == '$')
> > !     number->text = parse_slow (pfile, cur, 1 + leading_period,
> > !                                &number->len, &ignored);
> 
> I don't think the UTF8 flag should be ignored at this point.  Consider
> what happens if we get
> 
>   asdf ## 12\u03F8
> 
> -- that is valid, and needs to turn into a single CPP_NAME token with
> the UTF8 flag set.  It seems safe to me to carry around the UTF8 bit
> on all CPP_NUMBER tokens.  

Ah, right. I missed that nondigit includes universal-character-names.

> Naturally, cpp_classify_number should categorize such numbers as
> CPP_N_INVALID (allowing digits outside the basic source character
> set strikes me as a bad idea).

Please educate me: is this taking the target language into account? If
not, there is nothing wrong with that token, as a pp-token.

> Please find a more efficient way to accomplish this.  This code is
> already *the* bottleneck for textual preprocessing.  (For instance, if
> you implement support for raw UTF8 as input encoding, we can just
> splat out the identifier as is.)

Is that necessary? Few tokens will ever have the flag set, and the
only part where I added overhead is the test for the flag.

Regards,
Martin

* Re: Implementing Universal Character Names in identifiers
  2002-10-28  0:53   ` Martin v. Löwis
@ 2002-10-28  1:30     ` Fergus Henderson
  2002-10-28  2:26     ` Joseph S. Myers
  2002-10-28 10:39     ` Zack Weinberg
  2 siblings, 0 replies; 30+ messages in thread
From: Fergus Henderson @ 2002-10-28  1:30 UTC (permalink / raw)
  To: Martin v. Löwis; +Cc: Zack Weinberg, gcc-patches

On 28-Oct-2002, Martin v. Löwis <loewis@informatik.hu-berlin.de> wrote:
> > (3) The ranges need to be updated from the latest Unicode standard,
> >     and the standard version noted in commentary.
> 
> No. They are mandated by the language specification. For C++, see
> Annex E. For C99, see Annex D (unfortunately, I can't, since I don't
> have the final copy of C99). C++ claims to have copied the table from
> PDTR 10176, C from TR 10176.
> 
> *If* my C99 draft is accurate, then there are differences between
> these two tables: e.g. in C99, U+00AA (FEMININE ORDINAL INDICATOR)
> is acceptable in an identifier; in C++98, it is not.

Your C99 draft is accurate in this respect.
I have the final C99 standard and it includes U+00AA.

-- 
Fergus Henderson <fjh@cs.mu.oz.au>  |  "I have always known that the pursuit
The University of Melbourne         |  of excellence is a lethal habit"
WWW: <http://www.cs.mu.oz.au/~fjh>  |     -- the last words of T. S. Garp.

* Re: Implementing Universal Character Names in identifiers
  2002-10-28  0:53   ` Martin v. Löwis
  2002-10-28  1:30     ` Fergus Henderson
@ 2002-10-28  2:26     ` Joseph S. Myers
  2002-10-28  3:29       ` Martin v. Löwis
  2002-10-28 10:39     ` Zack Weinberg
  2 siblings, 1 reply; 30+ messages in thread
From: Joseph S. Myers @ 2002-10-28  2:26 UTC (permalink / raw)
  To: Martin v. Löwis; +Cc: Zack Weinberg, gcc-patches

On 28 Oct 2002, Martin v. Löwis wrote:

> Really? The list of characters is quite specific to the language (and
> perhaps even the language revision). I haven't even checked whether
> the lists of acceptable characters are the same in C++98 and C99.

They are definitely different.  For C++, see issues 131 (for a typo in the
list) and 248 (for the list being based on an old draft).

-- 
Joseph S. Myers
jsm28@cam.ac.uk

* Re: Implementing Universal Character Names in identifiers
  2002-10-28  2:26     ` Joseph S. Myers
@ 2002-10-28  3:29       ` Martin v. Löwis
  0 siblings, 0 replies; 30+ messages in thread
From: Martin v. Löwis @ 2002-10-28  3:29 UTC (permalink / raw)
  To: Joseph S. Myers; +Cc: Zack Weinberg, gcc-patches

"Joseph S. Myers" <jsm28@cam.ac.uk> writes:

> They are definitely different.  For C++, see issues 131 (for a typo in the
> list) and 248 (for the list being based on an old draft).

Can you explain the comment in issue 248 "extensible, as is the case
in C99"? I have the impression that the list is fixed in C99, as well...

Regards,
Martin

* Re: Implementing Universal Character Names in identifiers
  2002-10-28  0:53   ` Martin v. Löwis
  2002-10-28  1:30     ` Fergus Henderson
  2002-10-28  2:26     ` Joseph S. Myers
@ 2002-10-28 10:39     ` Zack Weinberg
  2002-10-28 10:53       ` Joseph S. Myers
                         ` (4 more replies)
  2 siblings, 5 replies; 30+ messages in thread
From: Zack Weinberg @ 2002-10-28 10:39 UTC (permalink / raw)
  To: Martin v. Löwis; +Cc: gcc-patches, java

On Mon, Oct 28, 2002 at 09:53:35AM +0100, Martin v. Löwis wrote:
> > My plan for general extended-character-encoding support is to
> > convert to UTF-8 and process that representation; that plus iconv
> > plus some glue and heuristics will get us most of the way there.
> 
> Notice that this might be difficult to incorporate into the
> parser. Parsing extended characters will require maintenance of a
> shift state (mbstate_t); iconv does not directly expose the
> mbstate. So you have to carefully keep the mbstate_t and the iconv_t
> synchronized.
> 
> Alternatively, you could even use iconv to split the input into
> individual characters, and then perform parsing on the iconv result
> (conversion to UTF-8 might be appropriate); but that would be a
> significant change.

The plan is to implement this /as if/ the entire file is run through
iconv(3) conversion to UTF-8 before the parser ever sees it.  (The
actual implementation may do it on the fly when the parser encounters
nonwhitespace characters outside the 0x20-0x7f range.)  I won't be
using any of the <wchar.h> interfaces.

http://gcc.gnu.org/projects/cpplib.html#charset contains some
discussion of the plan - comments would be appreciated.

> > You want to look closely at what is currently done for UCNs in wide
> > character constants and string literals.  I'm pretty sure it's wrong,
> > and I would appreciate suggestions.
> 
> As for the preprocessor, it looks quite right to me; also, the output
> is right, assuming gcc implies ISO 10646 for wchar_t on all platforms
> (which is a sensible choice, and correct for GNU systems).

Glad to hear.  ISO10646 for wchar_t is not universally correct, but we
can do no better at present.

> > We should normalize identifiers before entering them in the symbol
> > table, and for output; otherwise there will be great confusion.
> > That needs to happen as part of the initial patch.

What you wrote in response to this is interesting but doesn't address
the issue of Unicode normalization of identifiers.  It sounds more
like an extended discussion of the previous point.  I'm talking about
the process described in UAX 15 (http://www.unicode.org/unicode/reports/tr15/)
and in particular annex 7 of that document ("Programming Language
Identifiers").

> The C++ ABI left this open; the current recommendation (which is not
> normative) is to use UTF-8 unless something else is specified by the
> vendor. Encoding schemes don't really work for C, and add complexity
> for C++.
> 
> I am now of the opinion that encoding schemes (other than UTF-8) should
> not be used. Compatibility with Java might be an issue; it might be
> necessary to special-case extern "Java" identifiers in the C++
> front-end. I could add that to the patch - although I would prefer if
> the Java API would change.
> 
> As for assemblers that don't allow UTF-8 in source code: I'd rather
> disable the feature for those assemblers than try to find a
> solution - this allows for compatibility should the vendor decide on
> this matter later.
>
> The tricky part is how to determine whether UTF-8 is supported in
> assembler output: initially, I'd just assume that GNU as supports it,
> and no other assembler does; this can then be extended as support on
> other systems becomes possible.

This all seems entirely reasonable, but please do communicate with the
Java folks about their requirements.  I've added java@ to the cc list.

> The next question is where to block unacceptable identifiers: in
> cpplib, or later? Later might be better since, at least for C++,
> supporting this in Java identifiers might be desirable, plus you could
> use it in macro names even if the assembler does not support it.

At the language level, yes, and perhaps only for identifiers that will
map to assembly symbols.  I would suggest copying your NODE_UTF8 bit
into a flag on IDENTIFIER_NODEs and then checking that in make_decl_rtl.

> If UTF-8 identifiers must be rejected (or converted) in the language
> front-ends, how can I efficiently determine whether an identifier uses
> UTF-8? Can I use deprecated_flag on IDENTIFIERs for that?

If there is no other use for that bit in IDENTIFIER_NODEs, then yes
(make sure to document it appropriately).

I think it would be more appropriate to call the bit
"USES_EXTENDED_CHARACTERS" rather than "UTF8" as technically 7-bit
ASCII is UTF8 too.  It will be referenced rarely enough that a long,
meaningful name is best.
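
A small sketch of why the longer name fits: the property being
recorded is "contains a byte outside 7-bit ASCII", which is easy to
compute from the UTF-8 spelling.  The function name is hypothetical;
in practice the lexer would set the flag once while reading the
identifier rather than rescanning it.

```c
#include <stddef.h>

/* Hypothetical check: return nonzero if the UTF-8 spelling NAME of
   length LEN contains any character outside 7-bit ASCII.  */
static int
uses_extended_characters (const unsigned char *name, size_t len)
{
  size_t i;

  for (i = 0; i < len; i++)
    if (name[i] >= 0x80)
      return 1;         /* Part of a multibyte UTF-8 sequence.  */
  return 0;
}
```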

> > (1) This routine belongs in libiberty, as part of the safe-ctype.h
> > interface.
> 
> Really? The list of characters is quite specific to the language (and
> perhaps even the language revision). I haven't even checked whether
> the lists of acceptable characters are the same in C++98 and C99.
> 
> > (3) The ranges need to be updated from the latest Unicode standard,
> >     and the standard version noted in commentary.
> 
> No. They are mandated by the language specification. For C++, see
> Annex E. For C99, see Annex D (unfortunately, I can't, since I don't
> have the final copy of C99). C++ claims to have copied the table from
> PDTR 10176, C from TR 10176.
> 
> *If* my C99 draft is accurate, then there are differences between
> these two tables: e.g. in C99, U+00AA (FEMININE ORDINAL INDICATOR)
> is acceptable in an identifier; in C++98, it is not.

Ugh.  IMO, this is a defect in both standards - they should simply
reference UAX15a7 and be done with it.  It's been around since 1998,
so they don't really have an excuse for not using it.

I suggest:

 - In libiberty, provide interfaces that implement UAX15.  On
   reflection, this should be a new <unicode.h> interface set, not
   tacked onto <safe-ctype.h>.

 - In cpplib, provide routines that validate individual identifiers
   against the precise lists in C99 and C++98.

 - GCC enforces the precise lists in C99 and C++98 only in -pedantic
   mode.

 - We file a couple of Defect Reports.

> > Naturally, cpp_classify_number should categorize such numbers as
> > CPP_N_INVALID (allowing digits outside the basic source character
> > set strikes me as a bad idea).
> 
> Please educate me: is this taking the target language into account? If
> not, there is nothing wrong with that token, as a pp-token.

cpp_classify_number is used in the conversion from pp-tokens to
tokens.  While extended characters are valid nondigits and therefore
valid in pp-number tokens, they are not valid in phase 7's
integer-constants and floating-constants (see C99 6.4.4).

Incidentally, references to standard sections in commentary should be
of the form "C++98 Annex E [extendid]" not just "[extendid]".  If
you're referring to a specific paragraph, write e.g. "C99 6.4.2.1p1"
as "6.4.2.1.1" is ambiguous.  (Alas, C99 doesn't have section name tags.)

> > Please find a more efficient way to accomplish this.  This code is
> > already *the* bottleneck for textual preprocessing.  (For instance, if
> > you implement support for raw UTF8 as input encoding, we can just
> > splat out the identifier as is.)
> 
> Is that necessary? Few tokens will ever have the flag set, and the
> only part where I added overhead is the test for the flag.

I am not sure about "few" some years down the road, when people start
_using_ the ability to write identifiers in their own languages.  In
any case, using "fwrite(ptr, 1, 1, file)" is just silly when
"putc(*ptr, file)" will do.
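
For illustration, a minimal sketch of the output loop using putc for
the ASCII bytes and a single length counter (the end pointer), which
avoids the double bookkeeping of "len" in the quoted hunk.  The
function name is hypothetical, and the sketch assumes the identifier
is well-formed UTF-8, as the lexer would have produced it.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical sketch: write the UTF-8 identifier S of length LEN to
   FP, spelling each non-ASCII character as a \u or \U escape.
   Assumes S is well-formed UTF-8.  */
static void
write_ident_as_ucn (const unsigned char *s, size_t len, FILE *fp)
{
  const unsigned char *end = s + len;

  while (s < end)
    {
      unsigned long code;
      size_t n;

      if (*s < 0x80)
        {
          putc (*s++, fp);      /* Common case: plain ASCII.  */
          continue;
        }

      /* The lead byte gives the sequence length and the top bits.  */
      if (*s < 0xE0)
        {
          n = 2;
          code = *s & 0x1F;
        }
      else if (*s < 0xF0)
        {
          n = 3;
          code = *s & 0x0F;
        }
      else
        {
          n = 4;
          code = *s & 0x07;
        }
      s++;

      /* Fold in the continuation bytes, never reading past END.  */
      while (--n > 0 && s < end)
        code = (code << 6) | (*s++ & 0x3F);

      if (code < 0x10000)
        fprintf (fp, "\\u%04lx", code);
      else
        fprintf (fp, "\\U%08lx", code);
    }
}
```

Given the bytes 'a', 0xCF, 0xB8, 'b', this writes a\u03f8b.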

zw

* Re: Implementing Universal Character Names in identifiers
  2002-10-28 10:39     ` Zack Weinberg
@ 2002-10-28 10:53       ` Joseph S. Myers
  2002-10-29  1:39       ` Martin v. Löwis
                         ` (3 subsequent siblings)
  4 siblings, 0 replies; 30+ messages in thread
From: Joseph S. Myers @ 2002-10-28 10:53 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: Martin v. Löwis, gcc-patches, java

On Mon, 28 Oct 2002, Zack Weinberg wrote:

> What you wrote in response to this is interesting but doesn't address
> the issue of Unicode normalization of identifiers.  It sounds more
> like an extended discussion of the previous point.  I'm talking about
> the process described in UAX 15 (http://www.unicode.org/unicode/reports/tr15/)
> and in particular annex 7 of that document ("Programming Language
> Identifiers").

I don't think there's anything in the language standards to permit
normalization to NFC as described there.  (It could be done in "phase 0"  
for UTF-8 in the input file, like we ignore whitespace at end of line, but
not for UCNs.  And do we really want to build in the large character
tables required for normalization?)

>  - In cpplib, provide routines that validate individual identifiers
>    against the precise lists in C99 and C++98.
> 
>  - GCC enforces the precise lists in C99 and C++98 only in -pedantic
>    mode.

There's still the typo in the C++98 list that's a recognised Defect that
should be corrected (following existing practice of implementing
resolutions to Defect Reports before they make it into a TC).  But 
non-pedantic should use the current Unicode ranges of identifier 
characters for both languages.

-- 
Joseph S. Myers
jsm28@cam.ac.uk

* Re: Implementing Universal Character Names in identifiers
  2002-10-28 10:39     ` Zack Weinberg
  2002-10-28 10:53       ` Joseph S. Myers
@ 2002-10-29  1:39       ` Martin v. Löwis
  2002-10-29 12:04       ` Joseph S. Myers
                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 30+ messages in thread
From: Martin v. Löwis @ 2002-10-29  1:39 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: gcc-patches, java

Zack Weinberg <zack@codesourcery.com> writes:

> http://gcc.gnu.org/projects/cpplib.html#charset contains some
> discussion of the plan - comments would be appreciated.

That all sounds good. For mangling, I'd take a step back, though:
if the assembler does not support UTF-8, you lose.

As for Java: gcj mangles, say, \u0388 as __U388_.

> What you wrote in response to this is interesting but doesn't
> address the issue of Unicode normalization of identifiers.  It
> sounds more like an extended discussion of the previous point.  I'm
> talking about the process described in UAX 15
> (http://www.unicode.org/unicode/reports/tr15/) and in particular
> annex 7 of that document ("Programming Language Identifiers").

I see. For the characters allowed in C99, normalization is almost a
non-issue, since nearly every identifier will already be in NFC.

One exception is the 20 characters that have a different canonical
equivalent (e.g. U+1f71 or U+2126); that might be a defect in ISO TR
10176 (they should not be allowed in identifiers - even though TR#15
allows them as well).

The other exception is the 80 characters that have a non-zero combining
class; those might need to be re-ordered under a normal form.

I agree with Joseph that we are not entitled to perform normalization
of the identifiers - there is nothing in the standard that says that
\u1f71 is the same identifier as \u03ac.

So *if* normalization matters, I think we should require the input to
be already normalized, and refuse compilation if it isn't. This could
be accomplished with a small database: we need to ban additional
characters, and we need to record the combining class for those 80
combining characters.
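
The combining-class half of such a check could look roughly like the
sketch below.  All names are hypothetical, and the two table entries
are well-known combining marks chosen for illustration; they are not
claimed to be among the 80 characters mentioned above.  The rule is
the canonical ordering rule: within a run of characters with non-zero
combining class, the classes must never decrease.

```c
#include <stddef.h>

struct ccc_entry
{
  unsigned long code;           /* Code point.  */
  unsigned char ccc;            /* Canonical combining class.  */
};

/* Illustrative excerpt; a real table would cover every permitted
   character with a non-zero combining class.  */
static const struct ccc_entry ccc_table[] = {
  { 0x0301, 230 },              /* COMBINING ACUTE ACCENT */
  { 0x0323, 220 },              /* COMBINING DOT BELOW */
};

/* Return the combining class of C, or 0 if C is not in the table.  */
static unsigned char
combining_class (unsigned long c)
{
  size_t i;

  for (i = 0; i < sizeof ccc_table / sizeof ccc_table[0]; i++)
    if (ccc_table[i].code == c)
      return ccc_table[i].ccc;
  return 0;
}

/* Return 1 if the N code points in CP are in canonical order, 0 if
   some combining mark follows one of a higher class.  */
static int
in_canonical_order (const unsigned long *cp, size_t n)
{
  unsigned char prev = 0;
  size_t i;

  for (i = 0; i < n; i++)
    {
      unsigned char cc = combining_class (cp[i]);

      if (cc != 0 && cc < prev)
        return 0;
      prev = cc;
    }
  return 1;
}
```

An identifier that fails this check would be rejected rather than
silently re-ordered.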

> Ugh.  IMO, this is a defect in both standards - they should simply
> reference UAX15a7 and be done with it.  It's been around since 1998,
> so they don't really have an excuse for not using it.

I completely disagree. The real problem with normalization (until
Unicode 3.1, 2001-05-16) is that it depends on the version of the
Unicode database: different normalization algorithms might normalize
the same string into different code point sequences. This is really
bad, and has been fixed, by setting Unicode 3.1 as the composition
version.

Even with that change, C++ would *still* have needed to pick a
particular version of the Unicode database, or else programs using
extended identifiers would not be portable across conforming
implementations.

So I think the approach of ISO TR 10176 is really sensible: it is
*better* than UAX15a7, as it gives a precise guideline, instead of a
wishy-washy one.

It is unfortunate that C++ would use a draft of that, but there is
really nobody to blame for that, I guess.

>  - GCC enforces the precise lists in C99 and C++98 only in -pedantic
>    mode.

I think GCC should be really careful when extending the language, and
there should be really good reasons for doing so. In the specific
case, I would require a real user with a real need before accepting
more characters than mandated by the language spec.

Merely saying that "C++ does not allow the FEMININE ORDINAL INDICATOR"
is not good enough. The user would need to say: I want the identifier
Internet\u00aa - which won't happen, because (to my understanding),
the FEMININE ORDINAL INDICATOR only applies to numbers, and it could
not be used there, anyway. I believe the same holds for many of the
other characters that would be allowed under UAX15a7 but are currently
banned - nobody would use them in identifiers even if they could.

So I very much doubt that the need for more characters than allowed in
C++98 will show up in the near future. When it does, the standards might
have been revised, so we just update the implementation. If some users
do have a real need, we can still reconsider, and maybe incorporate
the Unicode 4.5 database (in 2006).

> I am not sure about "few" some years down the road, when people start
> _using_ the ability to write identifiers in their own languages.  In
> any case, using "fwrite(ptr, 1, 1, file)" is just silly when
> "putc(*ptr, file)" will do.

Oops, yes. On the general remark: I'm not sure how to significantly
improve performance. I think cpplib should output \u00c0 if the input
was \u00c0 - and perhaps even if the input was in a native character
set. So direct copying to the output will not be appropriate.

Regards,
Martin

* Re: Implementing Universal Character Names in identifiers
  2002-10-28 10:39     ` Zack Weinberg
  2002-10-28 10:53       ` Joseph S. Myers
  2002-10-29  1:39       ` Martin v. Löwis
@ 2002-10-29 12:04       ` Joseph S. Myers
  2002-10-31 11:08       ` Tom Tromey
  2002-11-10 10:39       ` Neil Booth
  4 siblings, 0 replies; 30+ messages in thread
From: Joseph S. Myers @ 2002-10-29 12:04 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: Martin v. Löwis, gcc-patches, java

On Mon, 28 Oct 2002, Zack Weinberg wrote:

> Ugh.  IMO, this is a defect in both standards - they should simply
> reference UAX15a7 and be done with it.  It's been around since 1998,
> so they don't really have an excuse for not using it.

The normative references are to ISO 10646, not Unicode.  I think that the
normalization forms are part of Unicode only, not ISO 10646, and the ISO
rules may well not permit a normative reference to Unicode.

(And it's still necessary to reference a particular version, and even
normative references to particular versions of ISO standards seem to have
problems.  We were assured at a UK C Panel meeting (by someone at the
level in the UK corresponding to SC22, I think) that when C99 was approved
and C90 became superseded and withdrawn, the normative references in
ISO/IEC 14882 (C++98) to C90 automagically became ones to the superseding
standard, even though such a change is manifestly broken and not intended
by the C++ standard.)

-- 
Joseph S. Myers
jsm28@cam.ac.uk

* Re: Implementing Universal Character Names in identifiers
  2002-10-28 10:39     ` Zack Weinberg
                         ` (2 preceding siblings ...)
  2002-10-29 12:04       ` Joseph S. Myers
@ 2002-10-31 11:08       ` Tom Tromey
  2002-11-01  1:41         ` Martin v. Löwis
  2002-11-10 10:39       ` Neil Booth
  4 siblings, 1 reply; 30+ messages in thread
From: Tom Tromey @ 2002-10-31 11:08 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: Martin v. Löwis, gcc-patches, java

>>>>> "Zack" == Zack Weinberg <zack@codesourcery.com> writes:

Zack> This all seems entirely reasonable, but please do communicate
Zack> with the Java folks about their requirements.  I've added java@
Zack> to the cc list.

For Java there are two issues that I know of.

One is how non-ascii characters are mangled in symbol names.  We have
something in place now, but I don't think we have a strong requirement
for a particular approach.  If something else is preferred for C++, I
imagine we could change gcj for compatibility.  Note that we don't yet
make ABI stability promises about gcj's output.

The second issue is that of representing Java method and variable
names in C++.  We generate C++ header files from Java .class files,
and the user can make Java method calls, etc, from C++.  So if a Java
method or field has a name containing a non-ascii character, we want
to be able to represent that compatibly in a C++ header.

I assume this is solved by emitting \u escapes in the .h file.  I
haven't really looked into it.  (We haven't ever seen a bug report for
this, so it has had a very low priority.)

Tom

* Re: Implementing Universal Character Names in identifiers
  2002-10-31 11:08       ` Tom Tromey
@ 2002-11-01  1:41         ` Martin v. Löwis
  2002-11-01 11:17           ` Tom Tromey
  0 siblings, 1 reply; 30+ messages in thread
From: Martin v. Löwis @ 2002-11-01  1:41 UTC (permalink / raw)
  To: tromey; +Cc: Zack Weinberg, gcc-patches, java

Tom Tromey <tromey@redhat.com> writes:

> One is how non-ascii characters are mangled in symbol names.  We have
> something in place now, but I don't think we have a strong requirement
> for a particular approach.  If something else is preferred for C++, I
> imagine we could change gcj for compatibility.  Note that we don't yet
> make ABI stability promises about gcj's output.

That sounds good. Are you currently making use of non-ASCII
identifiers anywhere? If not, would it be acceptable to not provide
them on platforms that lack assembler capabilities?

My plan would be to use UTF-8 in object files everywhere *unless* the
system vendor defines a different mapping (for C99). To my knowledge,
this escape clause would apply on no system at the moment.
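For illustration, a minimal sketch of the code-point-to-UTF-8 step that
this plan relies on. The name ucs_to_utf8 is hypothetical and is not
taken from the patch, and the sketch assumes the range is limited to
U+10FFFF with surrogates rejected.

```c
#include <stddef.h>

/* Illustrative helper (hypothetical name): encode the code point C as
   UTF-8 into BUF, which must have room for 4 bytes.  Returns the number
   of bytes written, or 0 if C is a surrogate or above U+10FFFF.  */
size_t
ucs_to_utf8 (unsigned long c, unsigned char *buf)
{
  if (c < 0x80)
    {
      buf[0] = (unsigned char) c;
      return 1;
    }
  if (c < 0x800)
    {
      buf[0] = (unsigned char) (0xC0 | (c >> 6));
      buf[1] = (unsigned char) (0x80 | (c & 0x3F));
      return 2;
    }
  if (c >= 0xD800 && c <= 0xDFFF)
    return 0;                   /* Surrogates are not characters.  */
  if (c < 0x10000)
    {
      buf[0] = (unsigned char) (0xE0 | (c >> 12));
      buf[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      buf[2] = (unsigned char) (0x80 | (c & 0x3F));
      return 3;
    }
  if (c <= 0x10FFFF)
    {
      buf[0] = (unsigned char) (0xF0 | (c >> 18));
      buf[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
      buf[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      buf[3] = (unsigned char) (0x80 | (c & 0x3F));
      return 4;
    }
  return 0;
}
```

So an identifier containing \u00c0 would carry the two bytes 0xC3 0x80
in the assembler file and in the object file's symbol table.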

[generating C++ header files]
> I assume this is solved by emitting \u escapes in the .h file.

Correct, that should work fine.

Regards,
Martin

* Re: Implementing Universal Character Names in identifiers
  2002-11-01  1:41         ` Martin v. Löwis
@ 2002-11-01 11:17           ` Tom Tromey
  2002-11-01 11:57             ` Martin v. Löwis
  0 siblings, 1 reply; 30+ messages in thread
From: Tom Tromey @ 2002-11-01 11:17 UTC (permalink / raw)
  To: Martin v. Löwis; +Cc: Zack Weinberg, gcc-patches, java

>>>>> "Martin" == Martin v Löwis <loewis@informatik.hu-berlin.de> writes:

Martin> That sounds good. Are you currently making use of non-ASCII
Martin> identifiers anywhere? If not, would it be acceptable to not
Martin> provide them on platforms that lack assembler capabilities?

We don't use non-ascii identifiers in libgcj.

However, the ability to use these is part of the Java language
specification.  So we have a strong preference for supporting them on
all platforms.  We already do that by mangling the identifiers when we
see a non-ascii character.

We're only concerned with compatibility with g++ here.  It doesn't
matter to us whether C does or does not mangle symbols on a given
platform.

Tom

* Re: Implementing Universal Character Names in identifiers
  2002-11-01 11:17           ` Tom Tromey
@ 2002-11-01 11:57             ` Martin v. Löwis
  2002-11-01 14:56               ` Tom Tromey
  0 siblings, 1 reply; 30+ messages in thread
From: Martin v. Löwis @ 2002-11-01 11:57 UTC (permalink / raw)
  To: tromey; +Cc: Zack Weinberg, gcc-patches, java

Tom Tromey <tromey@redhat.com> writes:

> However, the ability to use these is part of the Java language
> specification.  So we have a strong preference for supporting them on
> all platforms.  We already do that by mangling the identifiers when we
> see a non-ascii character.

There are many things in the Java language specification that gcj does
not do, or does not do equally well on all systems. Since you can get
GNU binutils for all systems, you can fulfill this specific
requirement everywhere if needed.

> We're only concerned with compatibility with g++ here.  It doesn't
> matter to us whether C does or does not mangle symbols on a given
> platform.

OTOH, I would not like to see two different approaches for g++
depending on the platform, and compatibility with C *is* important for
g++ - so I'd let g++ be guided rather by the C requirements than by
the Java requirements.

Actually, the requirement is the same for C, C++ and Java: UCNs are
part of the language specification. I can't derive from that a
preference to have them somehow work on all systems, not if you must
simultaneously impose additional restrictions (such as reserving more
identifiers), or risk forward compatibility (if the system vendor
decides to solve this issue differently).

Regards,
Martin

* Re: Implementing Universal Character Names in identifiers
  2002-11-01 11:57             ` Martin v. Löwis
@ 2002-11-01 14:56               ` Tom Tromey
  2002-11-01 14:59                 ` Andrew Pinski
  2002-11-03  6:05                 ` Martin v. Löwis
  0 siblings, 2 replies; 30+ messages in thread
From: Tom Tromey @ 2002-11-01 14:56 UTC (permalink / raw)
  To: Martin v. Löwis; +Cc: Zack Weinberg, gcc-patches, java

>>>>> "Martin" == Martin v Löwis <loewis@informatik.hu-berlin.de> writes:

>> However, the ability to use these is part of the Java language
>> specification.  So we have a strong preference for supporting them on
>> all platforms.  We already do that by mangling the identifiers when we
>> see a non-ascii character.

Martin> There are many things in the Java language specification that
Martin> gcj does not do, or does not do equally well on all systems.

I don't understand this point.  It is true that we don't implement
everything perfectly.  But that doesn't imply that we're willing to
implement fewer things well.

Martin> Since you can get GNU binutils for all systems

Is that really true?

Martin> OTOH, I would not like to see two different approaches for g++
Martin> depending on the platform, and compatibility with C *is*
Martin> important for g++ - so I'd let g++ be guided rather by the C
Martin> requirements than by the Java requirements.

This seems reasonable.  I think we're unlikely to see a bug report if
this change is made.  However, if one comes in, it would clearly be a
regression.

Tom

* Re: Implementing Universal Character Names in identifiers
  2002-11-01 14:56               ` Tom Tromey
@ 2002-11-01 14:59                 ` Andrew Pinski
  2002-11-03  6:08                   ` Martin v. Löwis
  2002-11-03  6:05                 ` Martin v. Löwis
  1 sibling, 1 reply; 30+ messages in thread
From: Andrew Pinski @ 2002-11-01 14:59 UTC (permalink / raw)
  To: tromey; +Cc: Martin v. Löwis, Zack Weinberg, gcc-patches, java


On Friday, Nov 1, 2002, at 14:49 US/Pacific, Tom Tromey wrote:

>>>>>> "Martin" == Martin v Löwis <loewis@informatik.hu-berlin.de> writes:
>
>>> However, the ability to use these is part of the Java language
>>> specification.  So we have a strong preference for supporting them on
>>> all platforms.  We already do that by mangling the identifiers when we
>>> see a non-ascii character.
>
> Martin> There are many things in the Java language specification that
> Martin> gcj does not do, or does not do equally well on all systems.
>
> I don't understand this point.  It is true that we don't implement
> everything perfectly.  But that doesn't imply that we're willing to
> implement fewer things well.
>
> Martin> Since you can get GNU binutils for all systems
>
> Is that really true?

No, it is not true, because you cannot get a recent version of GNU
binutils for Darwin.  You can get a heavily modified old version for
Darwin, though.

Thanks,
Andrew Pinski

* Re: Implementing Universal Character Names in identifiers
  2002-11-01 14:56               ` Tom Tromey
  2002-11-01 14:59                 ` Andrew Pinski
@ 2002-11-03  6:05                 ` Martin v. Löwis
  1 sibling, 0 replies; 30+ messages in thread
From: Martin v. Löwis @ 2002-11-03  6:05 UTC (permalink / raw)
  To: tromey; +Cc: Zack Weinberg, gcc-patches, java

Tom Tromey <tromey@redhat.com> writes:

> I don't understand this point.  It is true that we don't implement
> everything perfectly.  But that doesn't imply that we're willing to
> implement fewer things well.

Ok. I'll revise my patch so that gcj uses C++ compatibility on systems
with sufficient assembler support, and the current Java mapping on
other systems. Should some of these other systems gain the necessary
assembler support at some time, gcj would break its ABI (for UCNs).

> Martin> Since you can get GNU binutils for all systems
> 
> Is that really true?

No, but I think it is true for the majority of platforms on which
people would want to use the feature.

Regards,
Martin

* Re: Implementing Universal Character Names in identifiers
  2002-11-01 14:59                 ` Andrew Pinski
@ 2002-11-03  6:08                   ` Martin v. Löwis
  0 siblings, 0 replies; 30+ messages in thread
From: Martin v. Löwis @ 2002-11-03  6:08 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: tromey, Zack Weinberg, gcc-patches, java

Andrew Pinski <pinskia@physics.uc.edu> writes:

> > Martin> Since you can get GNU binutils for all systems
> >
> > Is that really true?
> 
> No, it is not true, because you cannot get a recent version of GNU
> binutils for Darwin.  You can get a heavily modified old version for
> Darwin, though.

My statement might not be true in general, but Darwin is a bad
example.

The assembler that Apple ships with Darwin *is* a GNU assembler, and
(probably by coincidence) it does support UTF-8 in identifiers.

Regards,
Martin

* Re: Implementing Universal Character Names in identifiers
  2002-10-27 23:15 Implementing Universal Character Names in identifiers Martin v. Löwis
  2002-10-27 23:47 ` Fergus Henderson
  2002-10-27 23:51 ` Zack Weinberg
@ 2002-11-07  0:09 ` Neil Booth
  2002-11-07  0:12   ` Neil Booth
  2002-11-07  1:01   ` Martin v. Löwis
  2 siblings, 2 replies; 30+ messages in thread
From: Neil Booth @ 2002-11-07  0:09 UTC (permalink / raw)
  To: Martin v. Löwis; +Cc: gcc-patches

Martin v. Löwis wrote:-

> This patch implements UCNs in cpplib. It does so by converting the
> UCN to UTF-8, putting the UTF-8 bytes into the internal
> representation of the identifier.
> 
> The back-ends will transparently output the UTF-8 identifiers into the
> assembler file. If GNU as is used (or any other assembler supporting
> non-ASCII identifiers), these UTF-8 strings will be copied transparently
> into the object file. If the assembler does not support UTF-8, it
> will produce a diagnostic.
> 
> As a result of this strategy, UCNs are now allowed in all places
> mandated by the relevant standards, i.e. both in C99 and C++, and in
> all identifiers, including macro names.
> 
> Regards,
> Martin
> 
> 2002-10-27  Martin v. Löwis  <loewis@informatik.hu-berlin.de>
> 
> 	* c-lex.c (is_extended_char, utf8_extend_token): Remove.
> 	* cpplex.c (identifier_ucs_p, utf8_extend_token, 
> 	utf8_to_char): New functions.
> 	(parse_slow): Add utf8 parameter. Parse UCS names.
> 	(parse_identifier, parse_number): Adjust.
> 	(_cpp_lex_direct): Parse UCS names.
> 	(cpp_output_token): Print UCS names.
> 	* cpplib.h (NODE_UTF8): New flag.

It would be nice if you could handle escaped newline issues in
the UCS; I don't think your patch does that.  I think it's a bit
painful, and is one of the reasons I'd not added support for them
yet.  It would be easier if there was a prescan of phases 1 and 2
(a logical line at a time) of translation, which Zack and I
keep wondering whether to do or not.

Also, as a QOI issue I'd like token pasting to work for UCS's,
though the standard does not require it.  Does your patch handle
that?

Thanks,

Neil.

* Re: Implementing Universal Character Names in identifiers
  2002-11-07  0:09 ` Neil Booth
@ 2002-11-07  0:12   ` Neil Booth
  2002-11-07  1:01   ` Martin v. Löwis
  1 sibling, 0 replies; 30+ messages in thread
From: Neil Booth @ 2002-11-07  0:12 UTC (permalink / raw)
  To: Martin v. Löwis; +Cc: gcc-patches

Neil Booth wrote:-

> Also, as a QOI issue I'd like token pasting to work for UCS's,
> though the standard does not require it.  Does your patch handle
> that?

b.t.w. it wouldn't surprise me if you get this for free, given how
pasting is now implemented.

Neil.

* Re: Implementing Universal Character Names in identifiers
  2002-11-07  0:09 ` Neil Booth
  2002-11-07  0:12   ` Neil Booth
@ 2002-11-07  1:01   ` Martin v. Löwis
  2002-11-07  1:11     ` Neil Booth
  1 sibling, 1 reply; 30+ messages in thread
From: Martin v. Löwis @ 2002-11-07  1:01 UTC (permalink / raw)
  To: Neil Booth; +Cc: gcc-patches

Neil Booth <neil@daikokuya.co.uk> writes:

> It would be nice if you could handle escaped newline issues in
> the UCS; I don't think your patch does that.  

You mean, like

\u00\
c0

? This is undefined behaviour, in 2.1/1.2, and I think it should be an
error. It is an error indeed in my patch; the compiler reports

non-hex digit '\' in universal-character-name

> Also, as a QOI issue I'd like token pasting to work for UCS's,
> though the standard does not require it.  Does your patch handle
> that?

You mean, like

#define Foo(x,y) x##y

void Foo(bar\u00, c0){}

I think this *must* be an error; it's not an option to accept it,
since bar\u00 is not a token.

Regards,
Martin

* Re: Implementing Universal Character Names in identifiers
  2002-11-07  1:01   ` Martin v. Löwis
@ 2002-11-07  1:11     ` Neil Booth
  2002-11-07  1:47       ` Martin v. Löwis
  0 siblings, 1 reply; 30+ messages in thread
From: Neil Booth @ 2002-11-07  1:11 UTC (permalink / raw)
  To: Martin v. Löwis; +Cc: gcc-patches

Martin v. Löwis wrote:-

> > It would be nice if you could handle escaped newline issues in
> > the UCS; I don't think your patch does that.  
> 
> You mean, like
> 
> \u00\
> c0
> 
> ? This is undefined behaviour, in 2.1/1.2, and I think it should be an
> error. It is an error indeed in my patch; the compiler reports
> 
> non-hex digit '\' in universal-character-name

We should definitely accept it.  Why should UCNs be different from
everything else?  I can see that C++ calls it undefined behaviour, but
C99 appears to require it.   It's also important, to me at least, from
a QOI perspective.

> > Also, as a QOI issue I'd like token pasting to work for UCS's,
> > though the standard does not require it.  Does your patch handle
> > that?
> 
> You mean, like
> 
> #define Foo(x,y) x##y
> 
> void Foo(bar\u00, c0){}
> 
> I think this *must* be an error; it's not an option to accept it,
> since bar\u00 is not a token.

A backslash is a token; so is u00c0.  Your example is indeed an
error, but was not what I had in mind.  I suspect pasting just works,
anyway.

Neil.

* Re: Implementing Universal Character Names in identifiers
  2002-11-07  1:11     ` Neil Booth
@ 2002-11-07  1:47       ` Martin v. Löwis
  2002-11-07 11:40         ` Neil Booth
  0 siblings, 1 reply; 30+ messages in thread
From: Martin v. Löwis @ 2002-11-07  1:47 UTC (permalink / raw)
  To: Neil Booth; +Cc: gcc-patches

Neil Booth <neil@daikokuya.co.uk> writes:

> We should definitely accept it.  Why should UCNs be different from
> everything else?  I can see that C++ calls it undefined behaviour, but
> C99 appears to require it.   It's also important, to me at least, from
> a QOI perspective.

I don't think C99 requires it. 5.1.1.2/1.2 says

# If, as a result, a character sequence that matches the syntax of a
# universal character name is produced, the behavior is undefined.

I think UCNs are rightfully different from nearly everything else;
they are quite similar to multi-byte characters. If you have an
escaped newline in the middle of a multi-byte character, you would not
expect concatenation to create a new multi-byte character, either,
would you?

I cannot see any important use cases for such a
feature. Implementations are allowed to reject this case, and it
simplifies the implementation to reject it, so I can see really no
reason to make life more complicated than necessary. Producing an
error now still gives the opportunity to provide an extension later.

Notice that the compiler deliberately abstains from giving undefined
behaviour a well-defined meaning in some cases, to point out
portability issues. Users often complain that GCC provides too many
extensions, so I think every single extension must be judged very
carefully.

> A backslash is a token; so is u00c0.  Your example is indeed an
> error, but was not what I had in mind.  I suspect pasting just works,
> anyway.

Can you please give an example for what you had in mind?

Regards,
Martin

* Re: Implementing Universal Character Names in identifiers
  2002-11-07  1:47       ` Martin v. Löwis
@ 2002-11-07 11:40         ` Neil Booth
  2002-11-08  3:51           ` Martin v. Löwis
  0 siblings, 1 reply; 30+ messages in thread
From: Neil Booth @ 2002-11-07 11:40 UTC (permalink / raw)
  To: Martin v. Löwis; +Cc: gcc-patches

Martin v. Löwis wrote:-

> I think UCNs are rightfully different from nearly everything else;
> they are quite similar to multi-byte characters. If you have an
> escaped newline in the middle of a multi-byte character, you would not
> expect concatenation to create a new multi-byte character, either,
> would you?

It's not analogous.  The idiom is that an escaped newline between
characters in the source character set is invisible.  You are
proposing breaking that idiom.  Multibyte chars are single chars
in the source character set, and so your counterexample does not apply.
The UCN stuff is really a phase 3 thing; escaped newlines are phase 2.

I really want this implemented in whatever patch goes in.  It's not
hard to do; instead of reading chars directly through a pointer,
call get_effective_char() instead, like the other parts of cpplex.c do.
It handles skipping the escaped newlines, if any.

> I cannot see any important use cases for such a
> feature. Implementations are allowed to reject this case, and it
> simplifies the implementation to reject it, so I can see really no
> reason to make life more complicated than necessary. Producing an
> error now still gives the opportunity to provide an extension later.

EDG accepts escaped newlines in UCNs; I've just tried it, so it's not
without precedent.

> > A backslash is a token; so is u00c0.  Your example is indeed an
> > error, but was not what I had in mind.  I suspect pasting just works,
> > anyway.
> 
> Can you please give an example for what you had in mind?

#define f(x, y) x ## y
f(\, uc00c0)

Which reminds me, the anti-accidental-paste code might need an extra
line or two.

Neil.

* Re: Implementing Universal Character Names in identifiers
  2002-11-07 11:40         ` Neil Booth
@ 2002-11-08  3:51           ` Martin v. Löwis
  2002-11-08 11:45             ` Neil Booth
  0 siblings, 1 reply; 30+ messages in thread
From: Martin v. Löwis @ 2002-11-08  3:51 UTC (permalink / raw)
  To: Neil Booth; +Cc: gcc-patches

Neil Booth <neil@daikokuya.co.uk> writes:

> I really want this implemented in whatever patch goes in.  It's not
> hard to do; instead of reading chars directly through a pointer,
> call get_effective_char() instead, like the other parts of cpplex.c do.
> It handles skipping the escaped newlines, if any.

That is a complex change. I have to change maybe_read_ucs to use
get_effective_char. Since get_effective_char gets the position from
the reader, I have to remove the char** arguments from maybe_read_ucs.
This, in turn, means that all callers must change their calling
conventions. In particular, cpp_parse_escape cannot use the pstr/limit
approach anymore. However, I cannot see how I can change it, since
it sometimes parses things that do not come from a reader.

I will need further advice before being able to carry out this
change. 

Regards,
Martin

* Re: Implementing Universal Character Names in identifiers
  2002-11-08  3:51           ` Martin v. Löwis
@ 2002-11-08 11:45             ` Neil Booth
  0 siblings, 0 replies; 30+ messages in thread
From: Neil Booth @ 2002-11-08 11:45 UTC (permalink / raw)
  To: Martin v. Löwis; +Cc: gcc-patches

Martin v. Löwis wrote:-

> Neil Booth <neil@daikokuya.co.uk> writes:
> 
> > I really want this implemented in whatever patch goes in.  It's not
> > hard to do; instead of reading chars directly through a pointer,
> > call get_effective_char() instead, like the other parts of cpplex.c do.
> > It handles skipping the escaped newlines, if any.
> 
> That is a complex change. I have to change maybe_read_ucs to use
> get_effective_char. Since get_effective_char gets the position from
> the reader, I have to remove the char** arguments from maybe_read_ucs.
> This, in turn, means that all callers must change their calling
> conventions. In particular, cpp_parse_escape cannot use the pstr/limit
> approach anymore. However, I cannot see how I can change it, since
> it sometimes parses things that do not come from a reader.

I'll take a look; I don't think it's that hard.

Neil.

* Re: Implementing Universal Character Names in identifiers
  2002-10-28 10:39     ` Zack Weinberg
                         ` (3 preceding siblings ...)
  2002-10-31 11:08       ` Tom Tromey
@ 2002-11-10 10:39       ` Neil Booth
  2002-11-11  8:36         ` Martin v. Löwis
  4 siblings, 1 reply; 30+ messages in thread
From: Neil Booth @ 2002-11-10 10:39 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: Martin v. Löwis, gcc-patches, java

Zack Weinberg wrote:-

> Ugh.  IMO, this is a defect in both standards - they should simply
> reference UAX15a7 and be done with it.  It's been around since 1998,
> so they don't really have an excuse for not using it.
> 
> I suggest:
> 
>  - In libiberty, provide interfaces that implement UAX15.  On
>    reflection, this should be a new <unicode.h> interface set, not
>    tacked onto <safe-ctype.h>.
> 
>  - In cpplib, provide routines that validate individual identifiers
>    against the precise lists in C99 and C++98.

Martin,

Are you going to do this part?  It would be a good start.  We could
do with a function that confirms whether a number is in the ranges
specified by the standards (separating the two if necessary, although
IMO that is pedantry in extremis).  We also need something like
ucs_digit_p(), since a UCS digit cannot start an identifier (something
I think you missed in your patch).
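Something along these lines, as a rough sketch. The table holds only a
few sample ranges, assumed to come from the "Digits" list in C99
Annex D; it is not the full list and should be checked against the
standard. The struct and the helper in_ranges are hypothetical names.

```c
#include <stddef.h>

struct ucs_range { unsigned long lo, hi; };

/* Sample entries only, sorted by code point.  Assumed to match part of
   the "Digits" list in C99 Annex D; the real table would be longer.  */
static const struct ucs_range sample_digit_ranges[] = {
  { 0x0660, 0x0669 },
  { 0x06F0, 0x06F9 },
  { 0x0966, 0x096F },
  { 0x09E6, 0x09EF },
  { 0x0E50, 0x0E59 },
};

/* Binary search for C in the sorted, non-overlapping table R of N
   entries.  Returns 1 if C lies in some range, 0 otherwise.  */
static int
in_ranges (unsigned long c, const struct ucs_range *r, size_t n)
{
  size_t lo = 0, hi = n;

  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;

      if (c < r[mid].lo)
        hi = mid;
      else if (c > r[mid].hi)
        lo = mid + 1;
      else
        return 1;
    }
  return 0;
}

/* Nonzero if C is a UCS digit, which may not start an identifier.  */
int
ucs_digit_p (unsigned long c)
{
  return in_ranges (c, sample_digit_ranges,
                    sizeof sample_digit_ranges / sizeof sample_digit_ranges[0]);
}
```

The same in_ranges helper could serve the per-standard tables of
permitted identifier characters, with the lexer rejecting an
identifier whose first character satisfies ucs_digit_p.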

I've got something reasonable for the lexer I think; the best thing is
not to force it to use maybe_read_ucs().  I'm still waiting for
assignment issues to be resolved with my new employer, unfortunately.

Neil.

* Re: Implementing Universal Character Names in identifiers
  2002-11-10 10:39       ` Neil Booth
@ 2002-11-11  8:36         ` Martin v. Löwis
  0 siblings, 0 replies; 30+ messages in thread
From: Martin v. Löwis @ 2002-11-11  8:36 UTC (permalink / raw)
  To: Neil Booth; +Cc: Zack Weinberg, gcc-patches, java

Neil Booth <neil@daikokuya.co.uk> writes:

> >  - In libiberty, provide interfaces that implement UAX15.  On
> >    reflection, this should be a new <unicode.h> interface set, not
> >    tacked onto <safe-ctype.h>.
> > 
> >  - In cpplib, provide routines that validate individual identifiers
> >    against the precise lists in C99 and C++98.
> 
> Are you going to do this part?  It would be a good start.  We could
> do with a function that confirms whether a number is in the ranges
> specified by the standards (separating the two if necessary, although
> IMO that is pedantry in extremis).  

I'm still not sure why this should be in libiberty. The list of
acceptable characters is quite specific to the preprocessor.

However, I am working on updating this function for C99.

> We also need something like ucs_digit_p(), since a UCS digit cannot
> start an identifier (something I think you missed in your patch).

It's not an issue for C++: it does not allow UCS digits in an
identifier (nor does it allow what C99 calls "Special characters" -
I'm not even certain whether C99 intends to allow them in identifiers,
and if so, whether at arbitrary positions).

Regards,
Martin
