[PATCH] utf-16 and utf-32 support in C and C++

public inbox for gcc-patches@gcc.gnu.org

From: Kris Van Hees @ 2008-03-13 19:33 UTC (permalink / raw)
  To: gcc-patches

Oracle has a full copyright assignment in place with the FSF.

This patch provides an implementation for support of UTF-16 and UTF-32
character data types in C and C++, based on the ISO/IEC draft technical
report for C (ISO/IEC JTC1 SC22 WG14 N1040) and the proposal for C++
(ISO/IEC JTC1 SC22 WG21 N2249).  Neither proposal defines a specific
encoding for UTF-16.  This implementation uses the target endianness
to determine whether UTF-16BE or UTF-16LE will be used.
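
As an illustration of the endianness point above, here is a minimal sketch
(not part of the patch) that inspects how a UTF-16 code unit is laid out in
memory.  It assumes a C11 compiler that provides <uchar.h>, 8-bit bytes, a
16-bit char16_t, and a native build where host and target are the same.
The function names are made up for this example.

```c
#include <stdint.h>
#include <string.h>
#include <uchar.h>

/* Returns 1 if the host stores the low-order byte of a 16-bit value first.  */
int host_is_little_endian(void)
{
    const uint16_t probe = 0x0102;
    unsigned char first;

    memcpy(&first, &probe, 1);
    return first == 0x02;
}

/* Returns 1 if the code unit for 'A' (0x0041) is stored low byte first,
   i.e. the in-memory layout of the literal matches UTF-16LE.  */
int utf16_literal_is_le(void)
{
    static const char16_t text[] = u"A";
    unsigned char first;

    memcpy(&first, text, 1);
    return first == 0x41;
}
```

On a native build the two functions are expected to agree: a little-endian
machine ends up with UTF-16LE data and a big-endian machine with UTF-16BE.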

Support is added for the following wide character datatypes (internal
for C, primitive types for C++) with the given underlying data types:

	char16_t		short unsigned int
	char32_t		unsigned int

Support is added to the tokenizer to accept the following new character
and string literal notations:

	u'c-char-sequence'	char16_t character literal (UTF-16)
	U'c-char-sequence'	char32_t character literal (UTF-32)

	u"s-char-sequence"	array of char16_t (UTF-16)
	U"s-char-sequence"	array of char32_t (UTF-32)
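
A short sketch of the four notations in use, written against the C11
spelling of these features (<uchar.h>), which is an assumption on top of
this patch.  The variable and helper names are illustrative only.

```c
#include <stddef.h>
#include <uchar.h>

/* One character literal of each new kind (U+00E9).  */
const char16_t e_acute16 = u'\u00e9';
const char32_t e_acute32 = U'\u00e9';

/* One string literal of each new kind; each array holds four characters
   plus the terminating zero code unit.  */
const char16_t cafe16[] = u"caf\u00e9";
const char32_t cafe32[] = U"caf\u00e9";

/* Counts code units up to (not including) the terminating zero.  */
size_t u16_length(const char16_t *s)
{
    size_t n = 0;

    while (s[n] != 0)
        n++;
    return n;
}

size_t u32_length(const char32_t *s)
{
    size_t n = 0;

    while (s[n] != 0)
        n++;
    return n;
}
```

Because every character here is inside the BMP, the UTF-16 and UTF-32
strings contain the same number of code units.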

The aforementioned proposals do not specifically state what should be
done when a UTF-16 (char16_t) character literal contains a 32-bit
universal character (\Unnnnnnnn).  This implementation will issue an
error about the constant being too long.
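
The reason such a literal cannot fit is that UTF-16 represents code points
above U+FFFF as a surrogate pair, i.e. two 16-bit code units.  The sketch
below shows the standard UTF-16 arithmetic; it is not code from the patch,
and the function names are hypothetical.

```c
#include <stdint.h>

/* Number of UTF-16 code units needed for a code point: 1 inside the BMP,
   2 (a surrogate pair) above it.  */
int utf16_units_needed(uint_least32_t cp)
{
    return cp > 0xFFFF ? 2 : 1;
}

/* Splits CP, which must be in the range U+10000..U+10FFFF, into its high
   and low surrogates.  */
void utf16_encode_pair(uint_least32_t cp,
                       uint_least16_t *high, uint_least16_t *low)
{
    uint_least32_t v = cp - 0x10000;

    *high = (uint_least16_t) (0xD800 + (v >> 10));
    *low = (uint_least16_t) (0xDC00 + (v & 0x3FF));
}
```

For example, U+1F600 needs two code units (0xD83D, 0xDE00), so it cannot
be the value of a single char16_t character literal.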

Support is added to the C parser and the C++ parser to handle the
following concatenations of string literals:

	 "a" u"b"	-> u"ab"
	u"a"  "b"	-> u"ab"
	u"a" u"b"	-> u"ab"

	 "a" U"b"	-> U"ab"
	U"a"  "b"	-> U"ab"
	U"a" U"b"	-> U"ab"

The proposals do not exclude the implementation of additional rules
for concatenation.  This implementation also accepts the following
concatenations.  The rationale is that the concatenation of strings of
different kinds shall result in a string of the kind that ranks highest
in the ascending order: char - char16_t - char32_t - wchar_t.

	u"a" U"b"	-> U"ab"
	U"a" u"b"	-> U"ab"
	u"a" L"b"	-> L"ab"
	L"a" u"b"	-> L"ab"
	U"a" L"b"	-> L"ab"
	L"a" U"b"	-> L"ab"
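
The concatenation rules listed above amount to taking the maximum over an
ordering of the string kinds.  The following is a simplified model of that
rule, not the code in the patch; the enum and function names are invented
for this sketch, and <uchar.h> from C11 is assumed for the one real literal.

```c
#include <uchar.h>

/* Hypothetical model of the string kinds, listed in the ascending order
   used by the concatenation rule: char - char16_t - char32_t - wchar_t.  */
enum string_kind { KIND_CHAR, KIND_CHAR16, KIND_CHAR32, KIND_WCHAR };

/* Kind of the string that results from concatenating LEFT and RIGHT:
   whichever of the two ranks higher in the ordering.  */
enum string_kind concat_kind(enum string_kind left, enum string_kind right)
{
    return left > right ? left : right;
}

/* A concatenation from the first table: the unprefixed literal takes on
   the u prefix, so this is the same as u"ab".  */
const char16_t joined[] = u"a" "b";
```

The mixed u/U/L combinations are deliberately not written as real literals
here, since they are an extension of this implementation and other
compilers may reject them.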

Changes were also needed in some parts of the tokenizer and the parser
to change the existing logic from distinguishing between non-wide and
wide character to supporting characters of varying widths.

Testcases:
----------
This patch adds testcases for all functionality described above.  The
test cases ensure that the literals are parsed correctly, and that the
resulting values are correct.  The tests also ensure that the width of
the character literals is correct.  All combinations of string
concatenation are exercised as well.  Finally, tests were added to
ensure that errors are flagged for empty characters (u'' and U''),
warnings for constants that are too long (u'ab', U'ab' and u"\Unnnnnnnn"
where \Unnnnnnnn is outside the BMP), and warnings for implicit truncation
of values (char16_t c = U'\Unnnnnnnn' or char32_t c = u'\Unnnnnnnn'
where \Unnnnnnnn is outside the BMP).
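
To make the truncation case concrete, the sketch below models what is lost
when a value outside the BMP is stored in a 16-bit char16_t.  It masks
explicitly instead of relying on an implicit conversion, so it does not
trigger or reproduce the compiler warning itself.  The helper names are
hypothetical and <uchar.h> from C11 is assumed.

```c
#include <uchar.h>

/* Returns 1 if CP can be stored in a 16-bit code unit without loss.  */
int fits_in_char16(char32_t cp)
{
    return cp <= 0xFFFF;
}

/* Models storing CP in a 16-bit char16_t: only the low 16 bits survive.  */
char16_t narrow_to_char16(char32_t cp)
{
    return (char16_t) (cp & 0xFFFF);
}
```

For U+1F600 only 0xF600 would remain, which is a different character;
that loss is what the truncation warning is meant to point out.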

ChangeLog entries:
------------------
libcpp/ChangeLog:
2008-03-13  Kris Van Hees <kris.van.hees@oracle.com>

        * include/cpp-id-data.h (UC): Was U, conflicts with U"..." literal.
        * include/cpplib.h (CHAR16, CHAR32, STRING16, STRING32): New tokens.
        (cpp_interpret_string): Update prototype.
        (cpp_interpret_string_notranslate): Idem.
        * charset.c (init_iconv_desc): New width member in cset_converter.
        (cpp_init_iconv): Add support for char{16,32}_cset_desc.
        (convert_ucn): Idem.
        (emit_numeric_escape): Idem.
        (convert_hex): Idem.
        (convert_oct): Idem.
        (convert_escape): Idem.
        (convertor_for_type): New function.
        (cpp_interpret_string): Use convertor_for_type, support u and U prefix.
        (cpp_interpret_string_notranslate): Match changed prototype.
        (wide_str_to_charconst): Use convertor_for_type.
        (cpp_interpret_charconst): Add support for CPP_CHAR{16,32}.
        * directives.c (linemarker_dir): Macro U changed to UC.
        (parse_include): Idem.
        (register_pragma_1): Idem.
        (restore_registered_pragmas): Idem.
        (get__Pragma_string): Support CPP_STRING{16,32}.
        * expr.c (eval_token): Support CPP_CHAR{16,32}.
        * internal.h (struct cset_converter) <width>: New field.
        (struct cpp_reader) <char16_cset_desc>: Idem.
        (struct cpp_reader) <char32_cset_desc>: Idem.
        * lex.c (digraph_spellings): Macro U changed to UC.
        (OP, TK): Idem.
        (lex_string): Add support for u'...', U'...', u"..." and U"...".
        (_cpp_lex_direct): Idem.
        * macro.c (_cpp_builtin_macro_text): Macro U changed to UC.
        (stringify_arg): Support CPP_CHAR{16,32} and CPP_STRING{16,32}.

gcc/ChangeLog:
2008-03-13  Kris Van Hees <kris.van.hees@oracle.com>
          
        * c-common.c (CHAR16_TYPE, CHAR32_TYPE): New macros.
        (fname_as_string): Match updated cpp_interpret_string prototype.
        (fix_string_type): Support char16_t* and char32_t*.
        (c_common_nodes_and_builtins): Add char16_t and char32_t (and
        derivative) nodes.
        (c_parse_error): Support CPP_CHAR{16,32}.
        * c-common.h (RID_CHAR16, RID_CHAR32): New elements. 
        (enum c_tree_index) <CTI_CHAR16_TYPE, CTI_SIGNED_CHAR16_TYPE,
        CTI_UNSIGNED_CHAR16_TYPE, CTI_CHAR32_TYPE, CTI_SIGNED_CHAR32_TYPE,
        CTI_UNSIGNED_CHAR32_TYPE, CTI_CHAR16_ARRAY_TYPE,
        CTI_CHAR32_ARRAY_TYPE>: New elements.
        (char16_type_node, signed_char16_type_node, unsigned_char16_type_node,
        char32_type_node, signed_char32_type_node, unsigned_char32_type_node,
        char16_array_type_node, char32_array_type_node): New defines.
        * c-lex.c (cb_ident): Match updated cpp_interpret_string prototype.
        (c_lex_with_flags): Support CPP_CHAR{16,32} and CPP_STRING{16,32}.
        (lex_string): Support CPP_STRING{16,32}, match updated
        cpp_interpret_string and cpp_interpret_string_notranslate prototypes.
        (lex_charconst): Support CPP_CHAR{16,32}.
        * c-parser.c (c_parser_postfix_expression): Support CPP_CHAR{16,32}
        and CPP_STRING{16,32}.

gcc/cp/ChangeLog:
2008-03-13  Kris Van Hees <kris.van.hees@oracle.com>

        * parser.c (cp_lexer_next_token_is_decl_specifier_ke): Support
        RID_CHAR{16,32}.
        (cp_lexer_print_token): Support CPP_STRING{16,32}.
        (cp_parser_is_string_literal): Idem.
        (cp_parser_string_literal): Idem.
        (cp_parser_primary_expression): Support CPP_CHAR{16,32} and
        CPP_STRING{16,32}.
        (cp_parser_simple_type_specifier): Support RID_CHAR{16,32}. 
        * tree.c (char_type_p): Support char16_t and char32_t as char types.

gcc/testsuite/ChangeLog:
2008-03-13  Kris Van Hees <kris.van.hees@oracle.com>

        Tests for char16_t and char32_t support.
        * g++.dg/other/utf16-1.C: New
        * g++.dg/other/utf16-2.C: New
        * g++.dg/other/utf16-3.C: New
        * g++.dg/other/utf16-4.C: New
        * g++.dg/other/utf32-1.C: New
        * g++.dg/other/utf32-2.C: New
        * g++.dg/other/utf32-3.C: New
        * g++.dg/other/utf32-4.C: New
        * gcc.dg/utf16-1.c: New
        * gcc.dg/utf16-2.c: New
        * gcc.dg/utf16-3.c: New
        * gcc.dg/utf16-4.c: New
        * gcc.dg/utf32-1.c: New
        * gcc.dg/utf32-2.c: New
        * gcc.dg/utf32-3.c: New
        * gcc.dg/utf32-4.c: New

Bootstrapping and testing:
--------------------------
The source tree was built on the following platforms (target == host):

	i686-linux
	x86_64-linux
	ppc64-linux

Builds were done for both the unpatched tree and the patched tree, and
testsuite (make -k check) summary results were verified to be identical,
except for the added tests in the patched tree.  This was done to ensure
that the patch does not introduce regressions.

Index: gcc/c-lex.c
===================================================================
--- gcc/c-lex.c	(revision 133117)
+++ gcc/c-lex.c	(working copy)
@@ -174,7 +174,7 @@ cb_ident (cpp_reader * ARG_UNUSED (pfile
     {
       /* Convert escapes in the string.  */
       cpp_string cstr = { 0, 0 };
-      if (cpp_interpret_string (pfile, str, 1, &cstr, false))
+      if (cpp_interpret_string (pfile, str, 1, &cstr, CPP_STRING))
 	{
 	  ASM_OUTPUT_IDENT (asm_out_file, (const char *) cstr.text);
 	  free (CONST_CAST (unsigned char *, cstr.text));
@@ -361,6 +361,8 @@ c_lex_with_flags (tree *value, location_
 
 	    case CPP_STRING:
 	    case CPP_WSTRING:
+	    case CPP_STRING16:
+	    case CPP_STRING32:
 	      type = lex_string (tok, value, true, true);
 	      break;
 
@@ -410,11 +412,15 @@ c_lex_with_flags (tree *value, location_
 
     case CPP_CHAR:
     case CPP_WCHAR:
+    case CPP_CHAR16:
+    case CPP_CHAR32:
       *value = lex_charconst (tok);
       break;
 
     case CPP_STRING:
     case CPP_WSTRING:
+    case CPP_STRING16:
+    case CPP_STRING32:
       if ((lex_flags & C_LEX_RAW_STRINGS) == 0)
 	{
 	  type = lex_string (tok, value, false,
@@ -822,12 +828,12 @@ interpret_fixed (const cpp_token *token,
   return value;
 }
 
-/* Convert a series of STRING and/or WSTRING tokens into a tree,
-   performing string constant concatenation.  TOK is the first of
-   these.  VALP is the location to write the string into.  OBJC_STRING
-   indicates whether an '@' token preceded the incoming token.
+/* Convert a series of STRING, WSTRING, STRING16 and/or STRING32 tokens
+   into a tree, performing string constant concatenation.  TOK is the
+   first of these.  VALP is the location to write the string into.
+   OBJC_STRING indicates whether an '@' token preceded the incoming token.
    Returns the CPP token type of the result (CPP_STRING, CPP_WSTRING,
-   or CPP_OBJC_STRING).
+   CPP_STRING32, CPP_STRING16, or CPP_OBJC_STRING).
 
    This is unfortunately more work than it should be.  If any of the
    strings in the series has an L prefix, the result is a wide string
@@ -842,19 +848,16 @@ static enum cpp_ttype
 lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
 {
   tree value;
-  bool wide = false;
   size_t concats = 0;
   struct obstack str_ob;
   cpp_string istr;
+  enum cpp_ttype type = tok->type;
 
   /* Try to avoid the overhead of creating and destroying an obstack
      for the common case of just one string.  */
   cpp_string str = tok->val.str;
   cpp_string *strs = &str;
 
-  if (tok->type == CPP_WSTRING)
-    wide = true;
-
  retry:
   tok = cpp_get_token (parse_in);
   switch (tok->type)
@@ -873,10 +876,21 @@ lex_string (const cpp_token *tok, tree *
       break;
 
     case CPP_WSTRING:
-      wide = true;
-      /* FALLTHROUGH */
+      type = CPP_WSTRING;
+      goto concat;
+
+    case CPP_STRING32:
+      if (type != CPP_WSTRING)
+	type = CPP_STRING32;
+      goto concat;
+
+    case CPP_STRING16:
+      if (type == CPP_STRING)
+	type = CPP_STRING16;
+      goto concat;
 
     case CPP_STRING:
+  concat:
       if (!concats)
 	{
 	  gcc_obstack_init (&str_ob);
@@ -899,7 +913,7 @@ lex_string (const cpp_token *tok, tree *
 
   if ((translate
        ? cpp_interpret_string : cpp_interpret_string_notranslate)
-      (parse_in, strs, concats + 1, &istr, wide))
+      (parse_in, strs, concats + 1, &istr, type))
     {
       value = build_string (istr.len, (const char *) istr.text);
       free (CONST_CAST (unsigned char *, istr.text));
@@ -909,22 +923,50 @@ lex_string (const cpp_token *tok, tree *
       /* Callers cannot generally handle error_mark_node in this context,
 	 so return the empty string instead.  cpp_interpret_string has
 	 issued an error.  */
-      if (wide)
-	value = build_string (TYPE_PRECISION (wchar_type_node)
-			      / TYPE_PRECISION (char_type_node),
-			      "\0\0\0");  /* widest supported wchar_t
-					     is 32 bits */
-      else
-	value = build_string (1, "");
+      switch (type) {
+	default:
+	case CPP_STRING:
+	  value = build_string (1, "");
+	  break;
+	case CPP_STRING16:
+	  value = build_string (TYPE_PRECISION (char16_type_node)
+				/ TYPE_PRECISION (char_type_node),
+				"\0");  /* char16_t is 16 bits */
+	  break;
+	case CPP_STRING32:
+	  value = build_string (TYPE_PRECISION (char32_type_node)
+				/ TYPE_PRECISION (char_type_node),
+				"\0\0\0");  /* char32_t is 32 bits */
+	  break;
+	case CPP_WSTRING:
+	  value = build_string (TYPE_PRECISION (wchar_type_node)
+				/ TYPE_PRECISION (char_type_node),
+				"\0\0\0");  /* widest supported wchar_t
+					       is 32 bits */
+	  break;
+      }
     }
 
-  TREE_TYPE (value) = wide ? wchar_array_type_node : char_array_type_node;
+  switch (type) {
+    default:
+    case CPP_STRING:
+      TREE_TYPE (value) = char_array_type_node;
+      break;
+    case CPP_STRING16:
+      TREE_TYPE (value) = char16_array_type_node;
+      break;
+    case CPP_STRING32:
+      TREE_TYPE (value) = char32_array_type_node;
+      break;
+    case CPP_WSTRING:
+      TREE_TYPE (value) = wchar_array_type_node;
+  }
   *valp = fix_string_type (value);
 
   if (concats)
     obstack_free (&str_ob, 0);
 
-  return objc_string ? CPP_OBJC_STRING : wide ? CPP_WSTRING : CPP_STRING;
+  return objc_string ? CPP_OBJC_STRING : type;
 }
 
 /* Converts a (possibly wide) character constant token into a tree.  */
@@ -941,6 +983,10 @@ lex_charconst (const cpp_token *token)
 
   if (token->type == CPP_WCHAR)
     type = wchar_type_node;
+  else if (token->type == CPP_CHAR32)
+    type = char32_type_node;
+  else if (token->type == CPP_CHAR16)
+    type = char16_type_node;
   /* In C, a character constant has type 'int'.
      In C++ 'char', but multi-char charconsts have type 'int'.  */
   else if (!c_dialect_cxx () || chars_seen > 1)
Index: gcc/cp/tree.c
===================================================================
--- gcc/cp/tree.c	(revision 133117)
+++ gcc/cp/tree.c	(working copy)
@@ -2474,6 +2474,8 @@ char_type_p (tree type)
   return (same_type_p (type, char_type_node)
 	  || same_type_p (type, unsigned_char_type_node)
 	  || same_type_p (type, signed_char_type_node)
+	  || same_type_p (type, char16_type_node)
+	  || same_type_p (type, char32_type_node)
 	  || same_type_p (type, wchar_type_node));
 }
 
Index: gcc/cp/parser.c
===================================================================
--- gcc/cp/parser.c	(revision 133117)
+++ gcc/cp/parser.c	(working copy)
@@ -556,6 +556,8 @@ cp_lexer_next_token_is_decl_specifier_ke
     case RID_TYPENAME:
       /* Simple type specifiers.  */
     case RID_CHAR:
+    case RID_CHAR16:
+    case RID_CHAR32:
     case RID_WCHAR:
     case RID_BOOL:
     case RID_SHORT:
@@ -789,6 +791,8 @@ cp_lexer_print_token (FILE * stream, cp_
       break;
 
     case CPP_STRING:
+    case CPP_STRING16:
+    case CPP_STRING32:
     case CPP_WSTRING:
       fprintf (stream, " \"%s\"", TREE_STRING_POINTER (token->u.value));
       break;
@@ -2033,7 +2037,10 @@ cp_parser_parsing_tentatively (cp_parser
 static bool
 cp_parser_is_string_literal (cp_token* token)
 {
-  return (token->type == CPP_STRING || token->type == CPP_WSTRING);
+  return (token->type == CPP_STRING ||
+	  token->type == CPP_STRING16 ||
+	  token->type == CPP_STRING32 ||
+	  token->type == CPP_WSTRING);
 }
 
 /* Returns nonzero if TOKEN is the indicated KEYWORD.  */
@@ -2861,11 +2868,11 @@ static tree
 cp_parser_string_literal (cp_parser *parser, bool translate, bool wide_ok)
 {
   tree value;
-  bool wide = false;
   size_t count;
   struct obstack str_ob;
   cpp_string str, istr, *strs;
   cp_token *tok;
+  enum cpp_ttype type;
 
   tok = cp_lexer_peek_token (parser->lexer);
   if (!cp_parser_is_string_literal (tok))
@@ -2874,6 +2881,8 @@ cp_parser_string_literal (cp_parser *par
       return error_mark_node;
     }
 
+  type = tok->type;
+
   /* Try to avoid the overhead of creating and destroying an obstack
      for the common case of just one string.  */
   if (!cp_parser_is_string_literal
@@ -2884,8 +2893,6 @@ cp_parser_string_literal (cp_parser *par
       str.text = (const unsigned char *)TREE_STRING_POINTER (tok->u.value);
       str.len = TREE_STRING_LENGTH (tok->u.value);
       count = 1;
-      if (tok->type == CPP_WSTRING)
-	wide = true;
 
       strs = &str;
     }
@@ -2900,8 +2907,24 @@ cp_parser_string_literal (cp_parser *par
 	  count++;
 	  str.text = (const unsigned char *)TREE_STRING_POINTER (tok->u.value);
 	  str.len = TREE_STRING_LENGTH (tok->u.value);
-	  if (tok->type == CPP_WSTRING)
-	    wide = true;
+
+	  switch (tok->type) {
+	    case CPP_STRING:
+	      break;
+	    case CPP_STRING16:
+	      if (type == CPP_STRING)
+		type = CPP_STRING16;
+
+	      break;
+	    case CPP_STRING32:
+	      if (type != CPP_WSTRING)
+		type = CPP_STRING32;
+
+	      break;
+	    case CPP_WSTRING:
+	      type = CPP_WSTRING;
+	      break;
+	  }
 
 	  obstack_grow (&str_ob, &str, sizeof (cpp_string));
 
@@ -2912,19 +2935,34 @@ cp_parser_string_literal (cp_parser *par
       strs = (cpp_string *) obstack_finish (&str_ob);
     }
 
-  if (wide && !wide_ok)
+  if (type != CPP_STRING && !wide_ok)
     {
       cp_parser_error (parser, "a wide string is invalid in this context");
-      wide = false;
+      type = CPP_STRING;
     }
 
   if ((translate ? cpp_interpret_string : cpp_interpret_string_notranslate)
-      (parse_in, strs, count, &istr, wide))
+      (parse_in, strs, count, &istr, type))
     {
       value = build_string (istr.len, (const char *)istr.text);
       free (CONST_CAST (unsigned char *, istr.text));
 
-      TREE_TYPE (value) = wide ? wchar_array_type_node : char_array_type_node;
+      switch (type) {
+	default:
+	case CPP_STRING:
+	  TREE_TYPE (value) = char_array_type_node;
+	  break;
+	case CPP_STRING16:
+	  TREE_TYPE (value) = char16_array_type_node;
+	  break;
+	case CPP_STRING32:
+	  TREE_TYPE (value) = char32_array_type_node;
+	  break;
+	case CPP_WSTRING:
+	  TREE_TYPE (value) = wchar_array_type_node;
+	  break;
+      }
+
       value = fix_string_type (value);
     }
   else
@@ -3079,6 +3117,8 @@ cp_parser_primary_expression (cp_parser 
 	   string-literal
 	   boolean-literal  */
     case CPP_CHAR:
+    case CPP_CHAR16:
+    case CPP_CHAR32:
     case CPP_WCHAR:
     case CPP_NUMBER:
       token = cp_lexer_consume_token (parser->lexer);
@@ -3130,6 +3170,8 @@ cp_parser_primary_expression (cp_parser 
       return token->u.value;
 
     case CPP_STRING:
+    case CPP_STRING16:
+    case CPP_STRING32:
     case CPP_WSTRING:
       /* ??? Should wide strings be allowed when parser->translate_strings_p
 	 is false (i.e. in attributes)?  If not, we can kill the third
@@ -10762,6 +10804,12 @@ cp_parser_simple_type_specifier (cp_pars
 	decl_specs->explicit_char_p = true;
       type = char_type_node;
       break;
+    case RID_CHAR16:
+      type = char16_type_node;
+      break;
+    case RID_CHAR32:
+      type = char32_type_node;
+      break;
     case RID_WCHAR:
       type = wchar_type_node;
       break;
Index: gcc/c-common.c
===================================================================
--- gcc/c-common.c	(revision 133117)
+++ gcc/c-common.c	(working copy)
@@ -66,6 +66,14 @@ cpp_reader *parse_in;		/* Declared in c-
 #define PID_TYPE "int"
 #endif
 
+#ifndef CHAR16_TYPE
+#define CHAR16_TYPE "short unsigned int"
+#endif
+
+#ifndef CHAR32_TYPE
+#define CHAR32_TYPE "unsigned int"
+#endif
+
 #ifndef WCHAR_TYPE
 #define WCHAR_TYPE "int"
 #endif
@@ -123,6 +131,13 @@ cpp_reader *parse_in;		/* Declared in c-
 	tree signed_wchar_type_node;
 	tree unsigned_wchar_type_node;
 
+	tree char16_type_node;
+	tree signed_char16_type_node;
+	tree unsigned_char16_type_node;
+	tree char32_type_node;
+	tree signed_char32_type_node;
+	tree unsigned_char32_type_node;
+
 	tree float_type_node;
 	tree double_type_node;
 	tree long_double_type_node;
@@ -174,6 +189,16 @@ cpp_reader *parse_in;		/* Declared in c-
 
 	tree wchar_array_type_node;
 
+   Type `char16_t[SOMENUMBER]' or something like it.
+   Used when a UTF-16 string literal is created.
+
+	tree char16_array_type_node;
+
+   Type `char32_t[SOMENUMBER]' or something like it.
+   Used when a UTF-32 string literal is created.
+
+	tree char32_array_type_node;
+
    Type `int ()' -- used for implicit declaration of functions.
 
 	tree default_function_type;
@@ -777,7 +802,7 @@ fname_as_string (int pretty_p)
   strname.text = (unsigned char *) namep;
   strname.len = len - 1;
 
-  if (cpp_interpret_string (parse_in, &strname, 1, &cstr, false))
+  if (cpp_interpret_string (parse_in, &strname, 1, &cstr, CPP_STRING))
     {
       XDELETEVEC (namep);
       return (const char *) cstr.text;
@@ -857,14 +882,28 @@ fname_decl (unsigned int rid, tree id)
 tree
 fix_string_type (tree value)
 {
-  const int wchar_bytes = TYPE_PRECISION (wchar_type_node) / BITS_PER_UNIT;
-  const int wide_flag = TREE_TYPE (value) == wchar_array_type_node;
+  const bool wide = TREE_TYPE (value)
+		    && TREE_TYPE (value) != char_array_type_node;
   int length = TREE_STRING_LENGTH (value);
   int nchars;
   tree e_type, i_type, a_type;
 
   /* Compute the number of elements, for the array type.  */
-  nchars = wide_flag ? length / wchar_bytes : length;
+  if (wide) {
+    if (TREE_TYPE (value) == char16_array_type_node) {
+      nchars = length / (TYPE_PRECISION (char16_type_node) / BITS_PER_UNIT);
+      e_type = char16_type_node;
+    } else if (TREE_TYPE (value) == char32_array_type_node) {
+      nchars = length / (TYPE_PRECISION (char32_type_node) / BITS_PER_UNIT);
+      e_type = char32_type_node;
+    } else {
+      nchars = length / (TYPE_PRECISION (wchar_type_node) / BITS_PER_UNIT);
+      e_type = wchar_type_node;
+    }
+  } else {
+    nchars = length;
+    e_type = char_type_node;
+  }
 
   /* C89 2.2.4.1, C99 5.2.4.1 (Translation limits).  The analogous
      limit in C++98 Annex B is very large (65536) and is not normative,
@@ -899,7 +938,6 @@ fix_string_type (tree value)
      construct the matching unqualified array type first.  The C front
      end does not require this, but it does no harm, so we do it
      unconditionally.  */
-  e_type = wide_flag ? wchar_type_node : char_type_node;
   i_type = build_index_type (build_int_cst (NULL_TREE, nchars - 1));
   a_type = build_array_type (e_type, i_type);
   if (c_dialect_cxx() || warn_write_strings)
@@ -3625,6 +3663,8 @@ c_define_builtins (tree va_list_ref_type
 void
 c_common_nodes_and_builtins (void)
 {
+  int char16_type_size;
+  int char32_type_size;
   int wchar_type_size;
   tree array_domain_type;
   tree va_list_ref_type_node;
@@ -3874,6 +3914,50 @@ c_common_nodes_and_builtins (void)
   wchar_array_type_node
     = build_array_type (wchar_type_node, array_domain_type);
 
+  /* Define 'char16_t', `signed char16_t' and `unsigned char16_t'.  */
+  char16_type_node = get_identifier (CHAR16_TYPE);
+  char16_type_node = TREE_TYPE (identifier_global_value (char16_type_node));
+  char16_type_size = TYPE_PRECISION (char16_type_node);
+  if (c_dialect_cxx ())
+    {
+      if (TYPE_UNSIGNED (char16_type_node))
+	char16_type_node = make_unsigned_type (char16_type_size);
+      else
+	char16_type_node = make_signed_type (char16_type_size);
+      record_builtin_type (RID_CHAR16, "char16_t", char16_type_node);
+    }
+  else
+    {
+      signed_char16_type_node = c_common_signed_type (char16_type_node);
+      unsigned_char16_type_node = c_common_unsigned_type (char16_type_node);
+    }
+
+  /* This is for UTF-16 string constants.  */
+  char16_array_type_node
+    = build_array_type (char16_type_node, array_domain_type);
+
+  /* Define 'char32_t', `signed char32_t' and `unsigned char32_t'.  */
+  char32_type_node = get_identifier (CHAR32_TYPE);
+  char32_type_node = TREE_TYPE (identifier_global_value (char32_type_node));
+  char32_type_size = TYPE_PRECISION (char32_type_node);
+  if (c_dialect_cxx ())
+    {
+      if (TYPE_UNSIGNED (char32_type_node))
+	char32_type_node = make_unsigned_type (char32_type_size);
+      else
+	char32_type_node = make_signed_type (char32_type_size);
+      record_builtin_type (RID_CHAR32, "char32_t", char32_type_node);
+    }
+  else
+    {
+      signed_char32_type_node = c_common_signed_type (char32_type_node);
+      unsigned_char32_type_node = c_common_unsigned_type (char32_type_node);
+    }
+
+  /* This is for UTF-32 string constants.  */
+  char32_array_type_node
+    = build_array_type (char32_type_node, array_domain_type);
+
   wint_type_node =
     TREE_TYPE (identifier_global_value (get_identifier (WINT_TYPE)));
 
@@ -6652,20 +6736,38 @@ c_parse_error (const char *gmsgid, enum 
 
   if (token == CPP_EOF)
     message = catenate_messages (gmsgid, " at end of input");
-  else if (token == CPP_CHAR || token == CPP_WCHAR)
+  else if (token == CPP_CHAR || token == CPP_WCHAR || token == CPP_CHAR16
+	   || token == CPP_CHAR32)
     {
       unsigned int val = TREE_INT_CST_LOW (value);
-      const char *const ell = (token == CPP_CHAR) ? "" : "L";
+      const char *prefix;
+
+      switch (token) {
+	default:
+	  prefix = "";
+	  break;
+	case CPP_WCHAR:
+	  prefix = "L";
+	  break;
+	case CPP_CHAR16:
+	  prefix = "u";
+	  break;
+	case CPP_CHAR32:
+	  prefix = "U";
+	  break;
+      }
+
       if (val <= UCHAR_MAX && ISGRAPH (val))
 	message = catenate_messages (gmsgid, " before %s'%c'");
       else
 	message = catenate_messages (gmsgid, " before %s'\\x%x'");
 
-      error (message, ell, val);
+      error (message, prefix, val);
       free (message);
       message = NULL;
     }
-  else if (token == CPP_STRING || token == CPP_WSTRING)
+  else if (token == CPP_STRING || token == CPP_WSTRING || token == CPP_STRING16
+	   || token == CPP_STRING32)
     message = catenate_messages (gmsgid, " before string constant");
   else if (token == CPP_NUMBER)
     message = catenate_messages (gmsgid, " before numeric constant");
Index: gcc/c-common.h
===================================================================
--- gcc/c-common.h	(revision 133117)
+++ gcc/c-common.h	(working copy)
@@ -85,7 +85,7 @@ enum rid
   RID_NEW,      RID_OFFSETOF, RID_OPERATOR,
   RID_THIS,     RID_THROW,    RID_TRUE,
   RID_TRY,      RID_TYPENAME, RID_TYPEID,
-  RID_USING,
+  RID_USING,    RID_CHAR16,   RID_CHAR32,
 
   /* casts */
   RID_CONSTCAST, RID_DYNCAST, RID_REINTCAST, RID_STATCAST,
@@ -143,6 +143,12 @@ extern GTY ((length ("(int) RID_MAX"))) 
 
 enum c_tree_index
 {
+    CTI_CHAR16_TYPE,
+    CTI_SIGNED_CHAR16_TYPE,
+    CTI_UNSIGNED_CHAR16_TYPE,
+    CTI_CHAR32_TYPE,
+    CTI_SIGNED_CHAR32_TYPE,
+    CTI_UNSIGNED_CHAR32_TYPE,
     CTI_WCHAR_TYPE,
     CTI_SIGNED_WCHAR_TYPE,
     CTI_UNSIGNED_WCHAR_TYPE,
@@ -155,6 +161,8 @@ enum c_tree_index
     CTI_WIDEST_UINT_LIT_TYPE,
 
     CTI_CHAR_ARRAY_TYPE,
+    CTI_CHAR16_ARRAY_TYPE,
+    CTI_CHAR32_ARRAY_TYPE,
     CTI_WCHAR_ARRAY_TYPE,
     CTI_INT_ARRAY_TYPE,
     CTI_STRING_TYPE,
@@ -190,6 +198,12 @@ struct c_common_identifier GTY(())
   struct cpp_hashnode node;
 };
 
+#define char16_type_node		c_global_trees[CTI_CHAR16_TYPE]
+#define signed_char16_type_node		c_global_trees[CTI_SIGNED_CHAR16_TYPE]
+#define unsigned_char16_type_node	c_global_trees[CTI_UNSIGNED_CHAR16_TYPE]
+#define char32_type_node		c_global_trees[CTI_CHAR32_TYPE]
+#define signed_char32_type_node		c_global_trees[CTI_SIGNED_CHAR32_TYPE]
+#define unsigned_char32_type_node	c_global_trees[CTI_UNSIGNED_CHAR32_TYPE]
 #define wchar_type_node			c_global_trees[CTI_WCHAR_TYPE]
 #define signed_wchar_type_node		c_global_trees[CTI_SIGNED_WCHAR_TYPE]
 #define unsigned_wchar_type_node	c_global_trees[CTI_UNSIGNED_WCHAR_TYPE]
@@ -206,6 +220,8 @@ struct c_common_identifier GTY(())
 #define truthvalue_false_node		c_global_trees[CTI_TRUTHVALUE_FALSE]
 
 #define char_array_type_node		c_global_trees[CTI_CHAR_ARRAY_TYPE]
+#define char16_array_type_node		c_global_trees[CTI_CHAR16_ARRAY_TYPE]
+#define char32_array_type_node		c_global_trees[CTI_CHAR32_ARRAY_TYPE]
 #define wchar_array_type_node		c_global_trees[CTI_WCHAR_ARRAY_TYPE]
 #define int_array_type_node		c_global_trees[CTI_INT_ARRAY_TYPE]
 #define string_type_node		c_global_trees[CTI_STRING_TYPE]
Index: gcc/c-parser.c
===================================================================
--- gcc/c-parser.c	(revision 133117)
+++ gcc/c-parser.c	(working copy)
@@ -5168,12 +5168,16 @@ c_parser_postfix_expression (c_parser *p
     {
     case CPP_NUMBER:
     case CPP_CHAR:
+    case CPP_CHAR16:
+    case CPP_CHAR32:
     case CPP_WCHAR:
       expr.value = c_parser_peek_token (parser)->value;
       expr.original_code = ERROR_MARK;
       c_parser_consume_token (parser);
       break;
     case CPP_STRING:
+    case CPP_STRING16:
+    case CPP_STRING32:
     case CPP_WSTRING:
       expr.value = c_parser_peek_token (parser)->value;
       expr.original_code = STRING_CST;
Index: libcpp/macro.c
===================================================================
--- libcpp/macro.c	(revision 133117)
+++ libcpp/macro.c	(working copy)
@@ -158,7 +158,7 @@ _cpp_builtin_macro_text (cpp_reader *pfi
 		  {
 		    cpp_errno (pfile, CPP_DL_WARNING,
 			"could not determine file timestamp");
-		    pbuffer->timestamp = U"\"??? ??? ?? ??:??:?? ????\"";
+		    pbuffer->timestamp = UC"\"??? ??? ?? ??:??:?? ????\"";
 		  }
 	      }
 	  }
@@ -256,8 +256,8 @@ _cpp_builtin_macro_text (cpp_reader *pfi
 	      cpp_errno (pfile, CPP_DL_WARNING,
 			 "could not determine date and time");
 		
-	      pfile->date = U"\"??? ?? ????\"";
-	      pfile->time = U"\"??:??:??\"";
+	      pfile->date = UC"\"??? ?? ????\"";
+	      pfile->time = UC"\"??:??:??\"";
 	    }
 	}
 
@@ -375,8 +375,10 @@ stringify_arg (cpp_reader *pfile, macro_
 	  continue;
 	}
 
-      escape_it = (token->type == CPP_STRING || token->type == CPP_WSTRING
-		   || token->type == CPP_CHAR || token->type == CPP_WCHAR);
+      escape_it = (token->type == CPP_STRING || token->type == CPP_CHAR
+		   || token->type == CPP_WSTRING || token->type == CPP_WCHAR
+		   || token->type == CPP_STRING32 || token->type == CPP_CHAR32
+		   || token->type == CPP_STRING16 || token->type == CPP_CHAR16);
 
       /* Room for each char being written in octal, initial space and
 	 final quote and NUL.  */
Index: libcpp/directives.c
===================================================================
--- libcpp/directives.c	(revision 133117)
+++ libcpp/directives.c	(working copy)
@@ -188,7 +188,7 @@ DIRECTIVE_TABLE
    did use this notation in its preprocessed output.  */
 static const directive linemarker_dir =
 {
-  do_linemarker, U"#", 1, KANDR, IN_I
+  do_linemarker, UC"#", 1, KANDR, IN_I
 };
 
 #define SEEN_EOL() (pfile->cur_token[-1].type == CPP_EOF)
@@ -689,7 +689,7 @@ parse_include (cpp_reader *pfile, int *p
       const unsigned char *dir;
 
       if (pfile->directive == &dtable[T_PRAGMA])
-	dir = U"pragma dependency";
+	dir = UC"pragma dependency";
       else
 	dir = pfile->directive->name;
       cpp_error (pfile, CPP_DL_ERROR, "#%s expects \"FILENAME\" or <FILENAME>",
@@ -1077,7 +1077,7 @@ register_pragma_1 (cpp_reader *pfile, co
 
   if (space)
     {
-      node = cpp_lookup (pfile, U space, strlen (space));
+      node = cpp_lookup (pfile, UC space, strlen (space));
       entry = lookup_pragma_entry (*chain, node);
       if (!entry)
 	{
@@ -1106,7 +1106,7 @@ register_pragma_1 (cpp_reader *pfile, co
     }
 
   /* Check for duplicates.  */
-  node = cpp_lookup (pfile, U name, strlen (name));
+  node = cpp_lookup (pfile, UC name, strlen (name));
   entry = lookup_pragma_entry (*chain, node);
   if (entry == NULL)
     {
@@ -1254,7 +1254,7 @@ restore_registered_pragmas (cpp_reader *
     {
       if (pe->is_nspace)
 	sd = restore_registered_pragmas (pfile, pe->u.space, sd);
-      pe->pragma = cpp_lookup (pfile, U *sd, strlen (*sd));
+      pe->pragma = cpp_lookup (pfile, UC *sd, strlen (*sd));
       free (*sd);
       sd++;
     }
@@ -1483,7 +1483,8 @@ get__Pragma_string (cpp_reader *pfile)
   string = get_token_no_padding (pfile);
   if (string->type == CPP_EOF)
     _cpp_backup_tokens (pfile, 1);
-  if (string->type != CPP_STRING && string->type != CPP_WSTRING)
+  if (string->type != CPP_STRING && string->type != CPP_WSTRING
+      && string->type != CPP_STRING32 && string->type != CPP_STRING16)
     return NULL;
 
   paren = get_token_no_padding (pfile);
Index: libcpp/include/cpplib.h
===================================================================
--- libcpp/include/cpplib.h	(revision 133117)
+++ libcpp/include/cpplib.h	(working copy)
@@ -123,10 +123,14 @@ struct _cpp_file;
 									\
   TK(CHAR,		LITERAL) /* 'char' */				\
   TK(WCHAR,		LITERAL) /* L'char' */				\
+  TK(CHAR16,		LITERAL) /* u'char' */				\
+  TK(CHAR32,		LITERAL) /* U'char' */				\
   TK(OTHER,		LITERAL) /* stray punctuation */		\
 									\
   TK(STRING,		LITERAL) /* "string" */				\
   TK(WSTRING,		LITERAL) /* L"string" */			\
+  TK(STRING16,		LITERAL) /* u"string" */			\
+  TK(STRING32,		LITERAL) /* U"string" */			\
   TK(OBJC_STRING,	LITERAL) /* @"string" - Objective-C */		\
   TK(HEADER_NAME,	LITERAL) /* <stdio.h> in #include */		\
 									\
@@ -703,10 +707,10 @@ extern cppchar_t cpp_interpret_charconst
 /* Evaluate a vector of CPP_STRING or CPP_WSTRING tokens.  */
 extern bool cpp_interpret_string (cpp_reader *,
 				  const cpp_string *, size_t,
-				  cpp_string *, bool);
+				  cpp_string *, enum cpp_ttype);
 extern bool cpp_interpret_string_notranslate (cpp_reader *,
 					      const cpp_string *, size_t,
-					      cpp_string *, bool);
+					      cpp_string *, enum cpp_ttype);
 
 /* Convert a host character constant to the execution character set.  */
 extern cppchar_t cpp_host_to_exec_charset (cpp_reader *, cppchar_t);
Index: libcpp/include/cpp-id-data.h
===================================================================
--- libcpp/include/cpp-id-data.h	(revision 133117)
+++ libcpp/include/cpp-id-data.h	(working copy)
@@ -22,7 +22,7 @@ Foundation, 51 Franklin Street, Fifth Fl
 typedef unsigned char uchar;
 #endif
 
-#define U (const unsigned char *)  /* Intended use: U"string" */
+#define UC (const unsigned char *)  /* Intended use: UC"string" */
 
 /* Chained list of answers to an assertion.  */
 struct answer GTY(())
Index: libcpp/expr.c
===================================================================
--- libcpp/expr.c	(revision 133117)
+++ libcpp/expr.c	(working copy)
@@ -691,6 +691,8 @@ eval_token (cpp_reader *pfile, const cpp
 
     case CPP_WCHAR:
     case CPP_CHAR:
+    case CPP_CHAR16:
+    case CPP_CHAR32:
       {
 	cppchar_t cc = cpp_interpret_charconst (pfile, token,
 						&temp, &unsignedp);
@@ -849,6 +851,8 @@ _cpp_parse_expr (cpp_reader *pfile)
 	case CPP_NUMBER:
 	case CPP_CHAR:
 	case CPP_WCHAR:
+	case CPP_CHAR16:
+	case CPP_CHAR32:
 	case CPP_NAME:
 	case CPP_HASH:
 	  if (!want_value)
Index: libcpp/internal.h
===================================================================
--- libcpp/internal.h	(revision 133117)
+++ libcpp/internal.h	(working copy)
@@ -48,6 +48,7 @@ struct cset_converter
 {
   convert_f func;
   iconv_t cd;
+  int width;
 };
 
 #define BITS_PER_CPPCHAR_T (CHAR_BIT * sizeof (cppchar_t))
@@ -399,6 +400,14 @@ struct cpp_reader
   struct cset_converter narrow_cset_desc;
 
   /* Descriptor for converting from the source character set to the
+     UTF-16 execution character set.  */
+  struct cset_converter char16_cset_desc;
+
+  /* Descriptor for converting from the source character set to the
+     UTF-32 execution character set.  */
+  struct cset_converter char32_cset_desc;
+
+  /* Descriptor for converting from the source character set to the
      wide execution character set.  */
   struct cset_converter wide_cset_desc;
 
Index: libcpp/lex.c
===================================================================
--- libcpp/lex.c	(revision 133117)
+++ libcpp/lex.c	(working copy)
@@ -39,10 +39,10 @@ struct token_spelling
 };
 
 static const unsigned char *const digraph_spellings[] =
-{ U"%:", U"%:%:", U"<:", U":>", U"<%", U"%>" };
+{ UC"%:", UC"%:%:", UC"<:", UC":>", UC"<%", UC"%>" };
 
-#define OP(e, s) { SPELL_OPERATOR, U s  },
-#define TK(e, s) { SPELL_ ## s,    U #e },
+#define OP(e, s) { SPELL_OPERATOR, UC s  },
+#define TK(e, s) { SPELL_ ## s,    UC #e },
 static const struct token_spelling token_spellings[N_TTYPES] = { TTYPE_TABLE };
 #undef OP
 #undef TK
@@ -611,8 +611,8 @@ create_literal (cpp_reader *pfile, cpp_t
 
 /* Lexes a string, character constant, or angle-bracketed header file
    name.  The stored string contains the spelling, including opening
-   quote and leading any leading 'L'.  It returns the type of the
-   literal, or CPP_OTHER if it was not properly terminated.
+   quote and any leading 'L', 'u' or 'U'.  It returns the type
+   of the literal, or CPP_OTHER if it was not properly terminated.
 
    The spelling is NUL-terminated, but it is not guaranteed that this
    is the first NUL since embedded NULs are preserved.  */
@@ -626,12 +626,17 @@ lex_string (cpp_reader *pfile, cpp_token
 
   cur = base;
   terminator = *cur++;
-  if (terminator == 'L')
+  if (terminator == 'L' || terminator == 'u' || terminator == 'U')
     terminator = *cur++;
   if (terminator == '\"')
-    type = *base == 'L' ? CPP_WSTRING: CPP_STRING;
+    type = *base == 'L' ? CPP_WSTRING
+			: *base == 'U' ? CPP_STRING32
+				       : *base == 'u' ? CPP_STRING16
+						      : CPP_STRING;
   else if (terminator == '\'')
-    type = *base == 'L' ? CPP_WCHAR: CPP_CHAR;
+    type = *base == 'L' ? CPP_WCHAR
+			: *base == 'U' ? CPP_CHAR32
+				       : *base == 'u' ? CPP_CHAR16 : CPP_CHAR;
   else
     terminator = '>', type = CPP_HEADER_NAME;
 
@@ -965,7 +970,9 @@ _cpp_lex_direct (cpp_reader *pfile)
       }
 
     case 'L':
-      /* 'L' may introduce wide characters or strings.  */
+    case 'u':
+    case 'U':
+      /* 'L', 'u' or 'U' may introduce wide characters or strings.  */
       if (*buffer->cur == '\'' || *buffer->cur == '"')
 	{
 	  lex_string (pfile, result, buffer->cur - 1);
@@ -977,12 +984,12 @@ _cpp_lex_direct (cpp_reader *pfile)
     case 'a': case 'b': case 'c': case 'd': case 'e': case 'f':
     case 'g': case 'h': case 'i': case 'j': case 'k': case 'l':
     case 'm': case 'n': case 'o': case 'p': case 'q': case 'r':
-    case 's': case 't': case 'u': case 'v': case 'w': case 'x':
+    case 's': case 't':           case 'v': case 'w': case 'x':
     case 'y': case 'z':
     case 'A': case 'B': case 'C': case 'D': case 'E': case 'F':
     case 'G': case 'H': case 'I': case 'J': case 'K':
     case 'M': case 'N': case 'O': case 'P': case 'Q': case 'R':
-    case 'S': case 'T': case 'U': case 'V': case 'W': case 'X':
+    case 'S': case 'T':           case 'V': case 'W': case 'X':
     case 'Y': case 'Z':
       result->type = CPP_NAME;
       {
Index: libcpp/charset.c
===================================================================
--- libcpp/charset.c	(revision 133117)
+++ libcpp/charset.c	(working copy)
@@ -642,6 +642,7 @@ init_iconv_desc (cpp_reader *pfile, cons
     {
       ret.func = convert_no_conversion;
       ret.cd = (iconv_t) -1;
+      ret.width = -1;
       return ret;
     }
 
@@ -655,6 +656,7 @@ init_iconv_desc (cpp_reader *pfile, cons
       {
 	ret.func = conversion_tab[i].func;
 	ret.cd = conversion_tab[i].fake_cd;
+	ret.width = -1;
 	return ret;
       }
 
@@ -663,6 +665,7 @@ init_iconv_desc (cpp_reader *pfile, cons
     {
       ret.func = convert_using_iconv;
       ret.cd = iconv_open (to, from);
+      ret.width = -1;
 
       if (ret.cd == (iconv_t) -1)
 	{
@@ -683,6 +686,7 @@ init_iconv_desc (cpp_reader *pfile, cons
 		 from, to);
       ret.func = convert_no_conversion;
       ret.cd = (iconv_t) -1;
+      ret.width = -1;
     }
   return ret;
 }
@@ -716,7 +720,17 @@ cpp_init_iconv (cpp_reader *pfile)
     wcset = default_wcset;
 
   pfile->narrow_cset_desc = init_iconv_desc (pfile, ncset, SOURCE_CHARSET);
+  pfile->narrow_cset_desc.width = CPP_OPTION (pfile, char_precision);
+  pfile->char16_cset_desc = init_iconv_desc (pfile,
+					     be ? "UTF-16BE" : "UTF-16LE",
+					     SOURCE_CHARSET);
+  pfile->char16_cset_desc.width = 16;
+  pfile->char32_cset_desc = init_iconv_desc (pfile,
+					     be ? "UTF-32BE" : "UTF-32LE",
+					     SOURCE_CHARSET);
+  pfile->char32_cset_desc.width = 32;
   pfile->wide_cset_desc = init_iconv_desc (pfile, wcset, SOURCE_CHARSET);
+  pfile->wide_cset_desc.width = CPP_OPTION (pfile, wchar_precision);
 }
 
 /* Destroy iconv(3) descriptors set up by cpp_init_iconv, if necessary.  */
@@ -1051,15 +1065,13 @@ _cpp_valid_ucn (cpp_reader *pfile, const
    An advanced pointer is returned.  Issues all relevant diagnostics.  */
 static const uchar *
 convert_ucn (cpp_reader *pfile, const uchar *from, const uchar *limit,
-	     struct _cpp_strbuf *tbuf, bool wide)
+	     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
 {
   cppchar_t ucn;
   uchar buf[6];
   uchar *bufp = buf;
   size_t bytesleft = 6;
   int rval;
-  struct cset_converter cvt
-    = wide ? pfile->wide_cset_desc : pfile->narrow_cset_desc;
   struct normalize_state nst = INITIAL_NORMALIZE_STATE;
 
   from++;  /* Skip u/U.  */
@@ -1086,14 +1098,15 @@ convert_ucn (cpp_reader *pfile, const uc
    function issues no diagnostics and never fails.  */
 static void
 emit_numeric_escape (cpp_reader *pfile, cppchar_t n,
-		     struct _cpp_strbuf *tbuf, bool wide)
+		     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
 {
-  if (wide)
+  size_t width = cvt.width;
+
+  if (width != CPP_OPTION(pfile, char_precision))
     {
       /* We have to render this into the target byte order, which may not
 	 be our byte order.  */
       bool bigend = CPP_OPTION (pfile, bytes_big_endian);
-      size_t width = CPP_OPTION (pfile, wchar_precision);
       size_t cwidth = CPP_OPTION (pfile, char_precision);
       size_t cmask = width_to_mask (cwidth);
       size_t nbwc = width / cwidth;
@@ -1136,12 +1149,11 @@ emit_numeric_escape (cpp_reader *pfile, 
    number.  You can, e.g. generate surrogate pairs this way.  */
 static const uchar *
 convert_hex (cpp_reader *pfile, const uchar *from, const uchar *limit,
-	     struct _cpp_strbuf *tbuf, bool wide)
+	     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
 {
   cppchar_t c, n = 0, overflow = 0;
   int digits_found = 0;
-  size_t width = (wide ? CPP_OPTION (pfile, wchar_precision)
-		  : CPP_OPTION (pfile, char_precision));
+  size_t width = cvt.width;
   size_t mask = width_to_mask (width);
 
   if (CPP_WTRADITIONAL (pfile))
@@ -1174,7 +1186,7 @@ convert_hex (cpp_reader *pfile, const uc
       n &= mask;
     }
 
-  emit_numeric_escape (pfile, n, tbuf, wide);
+  emit_numeric_escape (pfile, n, tbuf, cvt);
 
   return from;
 }
@@ -1187,12 +1199,11 @@ convert_hex (cpp_reader *pfile, const uc
    number.  */
 static const uchar *
 convert_oct (cpp_reader *pfile, const uchar *from, const uchar *limit,
-	     struct _cpp_strbuf *tbuf, bool wide)
+	     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
 {
   size_t count = 0;
   cppchar_t c, n = 0;
-  size_t width = (wide ? CPP_OPTION (pfile, wchar_precision)
-		  : CPP_OPTION (pfile, char_precision));
+  size_t width = cvt.width;
   size_t mask = width_to_mask (width);
   bool overflow = false;
 
@@ -1213,7 +1224,7 @@ convert_oct (cpp_reader *pfile, const uc
       n &= mask;
     }
 
-  emit_numeric_escape (pfile, n, tbuf, wide);
+  emit_numeric_escape (pfile, n, tbuf, cvt);
 
   return from;
 }
@@ -1224,7 +1235,7 @@ convert_oct (cpp_reader *pfile, const uc
    pointer.  Handles all relevant diagnostics.  */
 static const uchar *
 convert_escape (cpp_reader *pfile, const uchar *from, const uchar *limit,
-		struct _cpp_strbuf *tbuf, bool wide)
+		struct _cpp_strbuf *tbuf, struct cset_converter cvt)
 {
   /* Values of \a \b \e \f \n \r \t \v respectively.  */
 #if HOST_CHARSET == HOST_CHARSET_ASCII
@@ -1236,23 +1247,21 @@ convert_escape (cpp_reader *pfile, const
 #endif
 
   uchar c;
-  struct cset_converter cvt
-    = wide ? pfile->wide_cset_desc : pfile->narrow_cset_desc;
 
   c = *from;
   switch (c)
     {
       /* UCNs, hex escapes, and octal escapes are processed separately.  */
     case 'u': case 'U':
-      return convert_ucn (pfile, from, limit, tbuf, wide);
+      return convert_ucn (pfile, from, limit, tbuf, cvt);
 
     case 'x':
-      return convert_hex (pfile, from, limit, tbuf, wide);
+      return convert_hex (pfile, from, limit, tbuf, cvt);
       break;
 
     case '0':  case '1':  case '2':  case '3':
     case '4':  case '5':  case '6':  case '7':
-      return convert_oct (pfile, from, limit, tbuf, wide);
+      return convert_oct (pfile, from, limit, tbuf, cvt);
 
       /* Various letter escapes.  Get the appropriate host-charset
 	 value into C.  */
@@ -1312,6 +1321,26 @@ convert_escape (cpp_reader *pfile, const
   return from + 1;
 }
 \f
+/* TYPE is a token type.  The return value is the conversion needed to
+   convert from source to execution character set for the given type. */
+static struct cset_converter
+convertor_for_type (cpp_reader *pfile, enum cpp_ttype type)
+{
+  switch (type) {
+    default:
+	return pfile->narrow_cset_desc;
+    case CPP_CHAR16:
+    case CPP_STRING16:
+	return pfile->char16_cset_desc;
+    case CPP_CHAR32:
+    case CPP_STRING32:
+	return pfile->char32_cset_desc;
+    case CPP_WCHAR:
+    case CPP_WSTRING:
+	return pfile->wide_cset_desc;
+  }
+}
+
 /* FROM is an array of cpp_string structures of length COUNT.  These
    are to be converted from the source to the execution character set,
    escape sequences translated, and finally all are to be
@@ -1320,13 +1349,12 @@ convert_escape (cpp_reader *pfile, const
    false for failure.  */
 bool
 cpp_interpret_string (cpp_reader *pfile, const cpp_string *from, size_t count,
-		      cpp_string *to, bool wide)
+		      cpp_string *to,  enum cpp_ttype type)
 {
   struct _cpp_strbuf tbuf;
   const uchar *p, *base, *limit;
   size_t i;
-  struct cset_converter cvt
-    = wide ? pfile->wide_cset_desc : pfile->narrow_cset_desc;
+  struct cset_converter cvt = convertor_for_type (pfile, type);
 
   tbuf.asize = MAX (OUTBUF_BLOCK_SIZE, from->len);
   tbuf.text = XNEWVEC (uchar, tbuf.asize);
@@ -1335,7 +1363,7 @@ cpp_interpret_string (cpp_reader *pfile,
   for (i = 0; i < count; i++)
     {
       p = from[i].text;
-      if (*p == 'L') p++;
+      if (*p == 'L' || *p == 'u' || *p == 'U') p++;
       p++; /* Skip leading quote.  */
       limit = from[i].text + from[i].len - 1; /* Skip trailing quote.  */
 
@@ -1354,12 +1382,12 @@ cpp_interpret_string (cpp_reader *pfile,
 	  if (p == limit)
 	    break;
 
-	  p = convert_escape (pfile, p + 1, limit, &tbuf, wide);
+	  p = convert_escape (pfile, p + 1, limit, &tbuf, cvt);
 	}
     }
   /* NUL-terminate the 'to' buffer and translate it to a cpp_string
      structure.  */
-  emit_numeric_escape (pfile, 0, &tbuf, wide);
+  emit_numeric_escape (pfile, 0, &tbuf, cvt);
   tbuf.text = XRESIZEVEC (uchar, tbuf.text, tbuf.len);
   to->text = tbuf.text;
   to->len = tbuf.len;
@@ -1375,7 +1403,8 @@ cpp_interpret_string (cpp_reader *pfile,
    in a string, but do not perform character set conversion.  */
 bool
 cpp_interpret_string_notranslate (cpp_reader *pfile, const cpp_string *from,
-				  size_t count,	cpp_string *to, bool wide)
+				  size_t count,	cpp_string *to,
+				  enum cpp_ttype type ATTRIBUTE_UNUSED)
 {
   struct cset_converter save_narrow_cset_desc = pfile->narrow_cset_desc;
   bool retval;
@@ -1383,7 +1412,7 @@ cpp_interpret_string_notranslate (cpp_re
   pfile->narrow_cset_desc.func = convert_no_conversion;
   pfile->narrow_cset_desc.cd = (iconv_t) -1;
 
-  retval = cpp_interpret_string (pfile, from, count, to, wide);
+  retval = cpp_interpret_string (pfile, from, count, to, CPP_STRING);
 
   pfile->narrow_cset_desc = save_narrow_cset_desc;
   return retval;
@@ -1462,13 +1491,14 @@ narrow_str_to_charconst (cpp_reader *pfi
 /* Subroutine of cpp_interpret_charconst which performs the conversion
    to a number, for wide strings.  STR is the string structure returned
    by cpp_interpret_string.  PCHARS_SEEN and UNSIGNEDP are as for
-   cpp_interpret_charconst.  */
+   cpp_interpret_charconst.  TYPE is the token type.  */
 static cppchar_t
 wide_str_to_charconst (cpp_reader *pfile, cpp_string str,
-		       unsigned int *pchars_seen, int *unsignedp)
+		       unsigned int *pchars_seen, int *unsignedp,
+		       enum cpp_ttype type)
 {
   bool bigend = CPP_OPTION (pfile, bytes_big_endian);
-  size_t width = CPP_OPTION (pfile, wchar_precision);
+  size_t width = convertor_for_type (pfile, type).width;
   size_t cwidth = CPP_OPTION (pfile, char_precision);
   size_t mask = width_to_mask (width);
   size_t cmask = width_to_mask (cwidth);
@@ -1490,7 +1520,7 @@ wide_str_to_charconst (cpp_reader *pfile
   /* Wide character constants have type wchar_t, and a single
      character exactly fills a wchar_t, so a multi-character wide
      character constant is guaranteed to overflow.  */
-  if (off > 0)
+  if (str.len > nbwc * 2)
     cpp_error (pfile, CPP_DL_WARNING,
 	       "character constant too long for its type");
 
@@ -1518,20 +1548,21 @@ cpp_interpret_charconst (cpp_reader *pfi
 			 unsigned int *pchars_seen, int *unsignedp)
 {
   cpp_string str = { 0, 0 };
-  bool wide = (token->type == CPP_WCHAR);
+  bool wide = (token->type != CPP_CHAR);
   cppchar_t result;
 
-  /* an empty constant will appear as L'' or '' */
+  /* an empty constant will appear as L'', u'', U'' or '' */
   if (token->val.str.len == (size_t) (2 + wide))
     {
       cpp_error (pfile, CPP_DL_ERROR, "empty character constant");
       return 0;
     }
-  else if (!cpp_interpret_string (pfile, &token->val.str, 1, &str, wide))
+  else if (!cpp_interpret_string (pfile, &token->val.str, 1, &str, token->type))
     return 0;
 
   if (wide)
-    result = wide_str_to_charconst (pfile, str, pchars_seen, unsignedp);
+    result = wide_str_to_charconst (pfile, str, pchars_seen, unsignedp,
+				    token->type);
   else
     result = narrow_str_to_charconst (pfile, str, pchars_seen, unsignedp);
 


* [PATCH] utf-16 and utf-32 support in C and C++
  2008-03-13 19:33 [PATCH] utf-16 and utf-32 support in C and C++ Kris Van Hees
@ 2008-03-13 19:34 ` Kris Van Hees
  2008-03-13 19:56 ` Andrew Pinski
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Kris Van Hees @ 2008-03-13 19:34 UTC (permalink / raw)
  To: gcc-patches

Oracle has a full copyright assignment in place with the FSF.

This patch provides an implementation for support of UTF-16 and UTF-32
character data types in C and C++, based on the ISO/IEC draft technical
report for C (ISO/IEC JTC1 SC22 WG14 N1040) and the proposal for C++
(ISO/IEC JTC1 SC22 WG21 N2249).  Neither proposal defines a specific
encoding for UTF-16.  This implementation uses the target endianness
to determine whether UTF-16BE or UTF-16LE will be used.
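As an illustration of the byte-order choice described above, here is a
minimal sketch (not code from the patch) of how a single 16-bit code
unit would be laid out for a big-endian versus a little-endian target.
The helper name store_utf16_unit and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store one UTF-16 code unit into OUT (2 bytes) using the byte order
   of the target.  Hypothetical helper for illustration only; it is
   not part of the patch.  */
static void
store_utf16_unit (uint16_t unit, int big_endian_target, unsigned char out[2])
{
  assert (out != NULL);

  if (big_endian_target)
    {
      /* UTF-16BE: most significant byte first.  */
      out[0] = (unsigned char) (unit >> 8);
      out[1] = (unsigned char) (unit & 0xff);
    }
  else
    {
      /* UTF-16LE: least significant byte first.  */
      out[0] = (unsigned char) (unit & 0xff);
      out[1] = (unsigned char) (unit >> 8);
    }
}
```

For U+0041 this gives the bytes 00 41 on a big-endian target and 41 00
on a little-endian one.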

Support is added for the following wide character datatypes (internal
for C, primitive types for C++) with the given underlying data types:

	char16_t		short unsigned int
	char32_t		unsigned int

Support is added to the tokenizer to accept the following new character
and string literal notations:

	u'c-char-sequence'	char16_t character literal (UTF-16)
	U'c-char-sequence'	char32_t character literal (UTF-32)

	u"s-char-sequence"	array of char16_t (UTF-16)
	U"s-char-sequence"	array of char32_t (UTF-32)

The aforementioned proposals do not specifically state what should be
done when a UTF-16 (char16_t) character literal contains a universal
character name (\Unnnnnnnn) that lies outside the BMP and therefore
does not fit in a single 16-bit code unit.  This implementation will
issue a warning about the constant being too long for its type.
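To make the size problem concrete, the following sketch (again not code
from the patch) encodes a code point as UTF-16 and reports how many
16-bit code units it needs.  Code points above U+FFFF need a surrogate
pair, i.e. two code units, which is why they cannot be represented by a
single char16_t character literal.  The names utf16_encode and
fits_in_char16_literal are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode code point CP as UTF-16 into UNITS (room for 2 code units).
   Returns the number of code units written: 1 for the BMP, 2 (a
   surrogate pair) for U+10000..U+10FFFF.  Hypothetical helper.  */
static size_t
utf16_encode (uint32_t cp, uint16_t units[2])
{
  assert (cp <= 0x10FFFF);               /* Unicode range only.  */
  assert (cp < 0xD800 || cp > 0xDFFF);   /* Surrogates are not scalars.  */

  if (cp < 0x10000)
    {
      units[0] = (uint16_t) cp;
      return 1;
    }

  cp -= 0x10000;                                  /* 20 bits remain.  */
  units[0] = (uint16_t) (0xD800 | (cp >> 10));    /* High surrogate.  */
  units[1] = (uint16_t) (0xDC00 | (cp & 0x3FF));  /* Low surrogate.  */
  return 2;
}

/* A char16_t character literal holds exactly one code unit.  */
static int
fits_in_char16_literal (uint32_t cp)
{
  uint16_t units[2];
  return utf16_encode (cp, units) == 1;
}
```

With this sketch, U+20AC fits in one code unit, while U+1F600 becomes
the pair D83D DE00 and so would be diagnosed as too long in a u'...'
literal.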

Support is added to the C parser and the C++ parser to handle the
following concatenations of string literals:

	 "a" u"b"	-> u"ab"
	u"a"  "b"	-> u"ab"
	u"a" u"b"	-> u"ab"

	 "a" U"b"	-> U"ab"
	U"a"  "b"	-> U"ab"
	U"a" U"b"	-> U"ab"

The proposals do not exclude the implementation of additional rules
for concatenation.  This implementation also provides for the following
valid concatenations.  The rationale behind this choice is that the
concatenation of strings shall result in a string with the highest width,
according to the ascending order: char - char16_t - char32_t - wchar_t.

	u"a" U"b"	-> U"ab"
	U"a" u"b"	-> U"ab"
	u"a" L"b"	-> L"ab"
	L"a" u"b"	-> L"ab"
	U"a" L"b"	-> L"ab"
	L"a" U"b"	-> L"ab"

Changes were also needed in some parts of the tokenizer and the parser
to change the existing logic from distinguishing between non-wide and
wide character to supporting characters of varying widths.

Testcases:
----------
This patch adds testcases for all functionality described above.  The
test cases ensure that the literals are parsed correctly, and that the
resulting values are correct.  The tests also ensure that the width of
the character literals is correct.  All combinations of string
concatenation are exercised as well.  Finally, tests were added to
ensure that errors are flagged for empty characters (u'' and U''),
warnings for constants that are too long (u'ab', U'ab' and u'\Unnnnnnnn'
where \Unnnnnnnn is outside the BMP), and warnings for implicit truncation
of values (char16_t c = U'\Unnnnnnnn' or char32_t c = u'\Unnnnnnnn'
where \Unnnnnnnn is outside the BMP).

ChangeLog entries:
------------------
libcpp/ChangeLog:
2008-03-13  Kris Van Hees <kris.van.hees@oracle.com>

        * include/cpp-id-data.h (UC): Was U, conflicts with U"..." literal.
        * include/cpplib.h (CHAR16, CHAR32, STRING16, STRING32): New tokens.
        (cpp_interpret_string): Update prototype.
        (cpp_interpret_string_notranslate): Idem.
        * charset.c (init_iconv_desc): New width member in cset_converter.
        (cpp_init_iconv): Add support for char{16,32}_cset_desc.
        (convert_ucn): Idem.
        (emit_numeric_escape): Idem.
        (convert_hex): Idem.
        (convert_oct): Idem.
        (convert_escape): Idem.
        (convertor_for_type): New function.
        (cpp_interpret_string): Use convertor_for_type, support u and U prefix.
        (cpp_interpret_string_notranslate): Match changed prototype.
        (wide_str_to_charconst): Use convertor_for_type.
        (cpp_interpret_charconst): Add support for CPP_CHAR{16,32}.
        * directives.c (linemarker_dir): Macro U changed to UC.
        (parse_include): Idem.
        (register_pragma_1): Idem.
        (restore_registered_pragmas): Idem.
        (get__Pragma_string): Support CPP_STRING{16,32}.
        * expr.c (eval_token): Support CPP_CHAR{16,32}.
        * internal.h (struct cset_converter) <width>: New field.
        (struct cpp_reader) <char16_cset_desc>: Idem.
        (struct cpp_reader) <char32_cset_desc>: Idem.
        * lex.c (digraph_spellings): Macro U changed to UC.
        (OP, TK): Idem.
        (lex_string): Add support for u'...', U'...', u"..." and U"...".
        (_cpp_lex_direct): Idem.
        * macro.c (_cpp_builtin_macro_text): Macro U changed to UC.
        (stringify_arg): Support CPP_CHAR{16,32} and CPP_STRING{16,32}.

gcc/ChangeLog:
2008-03-13  Kris Van Hees <kris.van.hees@oracle.com>
          
        * c-common.c (CHAR16_TYPE, CHAR32_TYPE): New macros.
        (fname_as_string): Match updated cpp_interpret_string prototype.
        (fix_string_type): Support char16_t* and char32_t*.
        (c_common_nodes_and_builtins): Add char16_t and char32_t (and
        derivative) nodes.
        (c_parse_error): Support CPP_CHAR{16,32}.
        * c-common.h (RID_CHAR16, RID_CHAR32): New elements. 
        (enum c_tree_index) <CTI_CHAR16_TYPE, CTI_SIGNED_CHAR16_TYPE,
        CTI_UNSIGNED_CHAR16_TYPE, CTI_CHAR32_TYPE, CTI_SIGNED_CHAR32_TYPE,
        CTI_UNSIGNED_CHAR32_TYPE, CTI_CHAR16_ARRAY_TYPE,
        CTI_CHAR32_ARRAY_TYPE>: New elements.
        (char16_type_node, signed_char16_type_node, unsigned_char16_type_node,
        char32_type_node, signed_char32_type_node, char16_array_type_node,
        char32_array_type_node): New defines.
        * c-lex.c (cb_ident): Match updated cpp_interpret_string prototype.
        (c_lex_with_flags): Support CPP_CHAR{16,32} and CPP_STRING{16,32}.
        (lex_string): Support CPP_STRING{16,32}, match updated
        cpp_interpret_string and cpp_interpret_string_notranslate prototypes.
        (lex_charconst): Support CPP_CHAR{16,32}.
        * c-parser.c (c_parser_postfix_expression): Support CPP_CHAR{16,32}
        and CPP_STRING{16,32}.

gcc/cp/ChangeLog:
2008-03-13  Kris Van Hees <kris.van.hees@oracle.com>

        * parser.c (cp_lexer_next_token_is_decl_specifier_ke): Support
        RID_CHAR{16,32}.
        (cp_lexer_print_token): Support CPP_STRING{16,32}.
        (cp_parser_is_string_literal): Idem.
        (cp_parser_string_literal): Idem.
        (cp_parser_primary_expression): Support CPP_CHAR{16,32} and
        CPP_STRING{16,32}.
        (cp_parser_simple_type_specifier): Support RID_CHAR{16,32}. 
        * tree.c (char_type_p): Support char16_t and char32_t as char types.

gcc/testsuite/ChangeLog:
2008-03-13  Kris Van Hees <kris.van.hees@oracle.com>

        Tests for char16_t and char32_t support.
        * g++.dg/other/utf16-1.C: New
        * g++.dg/other/utf16-2.C: New
        * g++.dg/other/utf16-3.C: New
        * g++.dg/other/utf16-4.C: New
        * g++.dg/other/utf32-1.C: New
        * g++.dg/other/utf32-2.C: New
        * g++.dg/other/utf32-3.C: New
        * g++.dg/other/utf32-4.C: New
        * gcc.dg/utf16-1.c: New
        * gcc.dg/utf16-2.c: New
        * gcc.dg/utf16-3.c: New
        * gcc.dg/utf16-4.c: New
        * gcc.dg/utf32-1.c: New
        * gcc.dg/utf32-2.c: New
        * gcc.dg/utf32-3.c: New
        * gcc.dg/utf32-4.c: New

Bootstrapping and testing:
--------------------------
The source tree was built on the following platforms (target == host):

	i686-linux
	x86_64-linux
	ppc64-linux

Builds were done for both the unpatched tree and the patched tree, and
testsuite (make -k check) summary results were verified to be identical,
except for the added tests in the patched tree.  This was done to ensure
that the patch does not introduce regressions.

Index: gcc/c-lex.c
===================================================================
--- gcc/c-lex.c	(revision 133117)
+++ gcc/c-lex.c	(working copy)
@@ -174,7 +174,7 @@ cb_ident (cpp_reader * ARG_UNUSED (pfile
     {
       /* Convert escapes in the string.  */
       cpp_string cstr = { 0, 0 };
-      if (cpp_interpret_string (pfile, str, 1, &cstr, false))
+      if (cpp_interpret_string (pfile, str, 1, &cstr, CPP_STRING))
 	{
 	  ASM_OUTPUT_IDENT (asm_out_file, (const char *) cstr.text);
 	  free (CONST_CAST (unsigned char *, cstr.text));
@@ -361,6 +361,8 @@ c_lex_with_flags (tree *value, location_
 
 	    case CPP_STRING:
 	    case CPP_WSTRING:
+	    case CPP_STRING16:
+	    case CPP_STRING32:
 	      type = lex_string (tok, value, true, true);
 	      break;
 
@@ -410,11 +412,15 @@ c_lex_with_flags (tree *value, location_
 
     case CPP_CHAR:
     case CPP_WCHAR:
+    case CPP_CHAR16:
+    case CPP_CHAR32:
       *value = lex_charconst (tok);
       break;
 
     case CPP_STRING:
     case CPP_WSTRING:
+    case CPP_STRING16:
+    case CPP_STRING32:
       if ((lex_flags & C_LEX_RAW_STRINGS) == 0)
 	{
 	  type = lex_string (tok, value, false,
@@ -822,12 +828,12 @@ interpret_fixed (const cpp_token *token,
   return value;
 }
 
-/* Convert a series of STRING and/or WSTRING tokens into a tree,
-   performing string constant concatenation.  TOK is the first of
-   these.  VALP is the location to write the string into.  OBJC_STRING
-   indicates whether an '@' token preceded the incoming token.
+/* Convert a series of STRING, WSTRING, STRING16 and/or STRING32 tokens
+   into a tree, performing string constant concatenation.  TOK is the
+   first of these.  VALP is the location to write the string into.
+   OBJC_STRING indicates whether an '@' token preceded the incoming token.
    Returns the CPP token type of the result (CPP_STRING, CPP_WSTRING,
-   or CPP_OBJC_STRING).
+   CPP_STRING32, CPP_STRING16, or CPP_OBJC_STRING).
 
    This is unfortunately more work than it should be.  If any of the
    strings in the series has an L prefix, the result is a wide string
@@ -842,19 +848,16 @@ static enum cpp_ttype
 lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
 {
   tree value;
-  bool wide = false;
   size_t concats = 0;
   struct obstack str_ob;
   cpp_string istr;
+  enum cpp_ttype type = tok->type;
 
   /* Try to avoid the overhead of creating and destroying an obstack
      for the common case of just one string.  */
   cpp_string str = tok->val.str;
   cpp_string *strs = &str;
 
-  if (tok->type == CPP_WSTRING)
-    wide = true;
-
  retry:
   tok = cpp_get_token (parse_in);
   switch (tok->type)
@@ -873,10 +876,21 @@ lex_string (const cpp_token *tok, tree *
       break;
 
     case CPP_WSTRING:
-      wide = true;
-      /* FALLTHROUGH */
+      type = CPP_WSTRING;
+      goto concat;
+
+    case CPP_STRING32:
+      if (type != CPP_WSTRING)
+	type = CPP_STRING32;
+      goto concat;
+
+    case CPP_STRING16:
+      if (type == CPP_STRING)
+	type = CPP_STRING16;
+      goto concat;
 
     case CPP_STRING:
+  concat:
       if (!concats)
 	{
 	  gcc_obstack_init (&str_ob);
@@ -899,7 +913,7 @@ lex_string (const cpp_token *tok, tree *
 
   if ((translate
        ? cpp_interpret_string : cpp_interpret_string_notranslate)
-      (parse_in, strs, concats + 1, &istr, wide))
+      (parse_in, strs, concats + 1, &istr, type))
     {
       value = build_string (istr.len, (const char *) istr.text);
       free (CONST_CAST (unsigned char *, istr.text));
@@ -909,22 +923,50 @@ lex_string (const cpp_token *tok, tree *
       /* Callers cannot generally handle error_mark_node in this context,
 	 so return the empty string instead.  cpp_interpret_string has
 	 issued an error.  */
-      if (wide)
-	value = build_string (TYPE_PRECISION (wchar_type_node)
-			      / TYPE_PRECISION (char_type_node),
-			      "\0\0\0");  /* widest supported wchar_t
-					     is 32 bits */
-      else
-	value = build_string (1, "");
+      switch (type) {
+	default:
+	case CPP_STRING:
+	  value = build_string (1, "");
+	  break;
+	case CPP_STRING16:
+	  value = build_string (TYPE_PRECISION (char16_type_node)
+				/ TYPE_PRECISION (char_type_node),
+				"\0");  /* char16_t is 16 bits */
+	  break;
+	case CPP_STRING32:
+	  value = build_string (TYPE_PRECISION (char32_type_node)
+				/ TYPE_PRECISION (char_type_node),
+				"\0\0\0");  /* char32_t is 32 bits */
+	  break;
+	case CPP_WSTRING:
+	  value = build_string (TYPE_PRECISION (wchar_type_node)
+				/ TYPE_PRECISION (char_type_node),
+				"\0\0\0");  /* widest supported wchar_t
+					       is 32 bits */
+	  break;
+      }
     }
 
-  TREE_TYPE (value) = wide ? wchar_array_type_node : char_array_type_node;
+  switch (type) {
+    default:
+    case CPP_STRING:
+      TREE_TYPE (value) = char_array_type_node;
+      break;
+    case CPP_STRING16:
+      TREE_TYPE (value) = char16_array_type_node;
+      break;
+    case CPP_STRING32:
+      TREE_TYPE (value) = char32_array_type_node;
+      break;
+    case CPP_WSTRING:
+      TREE_TYPE (value) = wchar_array_type_node;
+  }
   *valp = fix_string_type (value);
 
   if (concats)
     obstack_free (&str_ob, 0);
 
-  return objc_string ? CPP_OBJC_STRING : wide ? CPP_WSTRING : CPP_STRING;
+  return objc_string ? CPP_OBJC_STRING : type;
 }
 
 /* Converts a (possibly wide) character constant token into a tree.  */
@@ -941,6 +983,10 @@ lex_charconst (const cpp_token *token)
 
   if (token->type == CPP_WCHAR)
     type = wchar_type_node;
+  else if (token->type == CPP_CHAR32)
+    type = char32_type_node;
+  else if (token->type == CPP_CHAR16)
+    type = char16_type_node;
   /* In C, a character constant has type 'int'.
      In C++ 'char', but multi-char charconsts have type 'int'.  */
   else if (!c_dialect_cxx () || chars_seen > 1)
Index: gcc/cp/tree.c
===================================================================
--- gcc/cp/tree.c	(revision 133117)
+++ gcc/cp/tree.c	(working copy)
@@ -2474,6 +2474,8 @@ char_type_p (tree type)
   return (same_type_p (type, char_type_node)
 	  || same_type_p (type, unsigned_char_type_node)
 	  || same_type_p (type, signed_char_type_node)
+	  || same_type_p (type, char16_type_node)
+	  || same_type_p (type, char32_type_node)
 	  || same_type_p (type, wchar_type_node));
 }
 
Index: gcc/cp/parser.c
===================================================================
--- gcc/cp/parser.c	(revision 133117)
+++ gcc/cp/parser.c	(working copy)
@@ -556,6 +556,8 @@ cp_lexer_next_token_is_decl_specifier_ke
     case RID_TYPENAME:
       /* Simple type specifiers.  */
     case RID_CHAR:
+    case RID_CHAR16:
+    case RID_CHAR32:
     case RID_WCHAR:
     case RID_BOOL:
     case RID_SHORT:
@@ -789,6 +791,8 @@ cp_lexer_print_token (FILE * stream, cp_
       break;
 
     case CPP_STRING:
+    case CPP_STRING16:
+    case CPP_STRING32:
     case CPP_WSTRING:
       fprintf (stream, " \"%s\"", TREE_STRING_POINTER (token->u.value));
       break;
@@ -2033,7 +2037,10 @@ cp_parser_parsing_tentatively (cp_parser
 static bool
 cp_parser_is_string_literal (cp_token* token)
 {
-  return (token->type == CPP_STRING || token->type == CPP_WSTRING);
+  return (token->type == CPP_STRING
+	  || token->type == CPP_STRING16
+	  || token->type == CPP_STRING32
+	  || token->type == CPP_WSTRING);
 }
 
 /* Returns nonzero if TOKEN is the indicated KEYWORD.  */
@@ -2861,11 +2868,11 @@ static tree
 cp_parser_string_literal (cp_parser *parser, bool translate, bool wide_ok)
 {
   tree value;
-  bool wide = false;
   size_t count;
   struct obstack str_ob;
   cpp_string str, istr, *strs;
   cp_token *tok;
+  enum cpp_ttype type;
 
   tok = cp_lexer_peek_token (parser->lexer);
   if (!cp_parser_is_string_literal (tok))
@@ -2874,6 +2881,8 @@ cp_parser_string_literal (cp_parser *par
       return error_mark_node;
     }
 
+  type = tok->type;
+
   /* Try to avoid the overhead of creating and destroying an obstack
      for the common case of just one string.  */
   if (!cp_parser_is_string_literal
@@ -2884,8 +2893,6 @@ cp_parser_string_literal (cp_parser *par
       str.text = (const unsigned char *)TREE_STRING_POINTER (tok->u.value);
       str.len = TREE_STRING_LENGTH (tok->u.value);
       count = 1;
-      if (tok->type == CPP_WSTRING)
-	wide = true;
 
       strs = &str;
     }
@@ -2900,8 +2907,24 @@ cp_parser_string_literal (cp_parser *par
 	  count++;
 	  str.text = (const unsigned char *)TREE_STRING_POINTER (tok->u.value);
 	  str.len = TREE_STRING_LENGTH (tok->u.value);
-	  if (tok->type == CPP_WSTRING)
-	    wide = true;
+
+	  switch (tok->type) {
+	    case CPP_STRING:
+	      break;
+	    case CPP_STRING16:
+	      if (type == CPP_STRING)
+		type = CPP_STRING16;
+
+	      break;
+	    case CPP_STRING32:
+	      if (type != CPP_WSTRING)
+		type = CPP_STRING32;
+
+	      break;
+	    case CPP_WSTRING:
+	      type = CPP_WSTRING;
+	      break;
+	  }
 
 	  obstack_grow (&str_ob, &str, sizeof (cpp_string));
 
@@ -2912,19 +2935,34 @@ cp_parser_string_literal (cp_parser *par
       strs = (cpp_string *) obstack_finish (&str_ob);
     }
 
-  if (wide && !wide_ok)
+  if (type != CPP_STRING && !wide_ok)
     {
       cp_parser_error (parser, "a wide string is invalid in this context");
-      wide = false;
+      type = CPP_STRING;
     }
 
   if ((translate ? cpp_interpret_string : cpp_interpret_string_notranslate)
-      (parse_in, strs, count, &istr, wide))
+      (parse_in, strs, count, &istr, type))
     {
       value = build_string (istr.len, (const char *)istr.text);
       free (CONST_CAST (unsigned char *, istr.text));
 
-      TREE_TYPE (value) = wide ? wchar_array_type_node : char_array_type_node;
+      switch (type) {
+	default:
+	case CPP_STRING:
+	  TREE_TYPE (value) = char_array_type_node;
+	  break;
+	case CPP_STRING16:
+	  TREE_TYPE (value) = char16_array_type_node;
+	  break;
+	case CPP_STRING32:
+	  TREE_TYPE (value) = char32_array_type_node;
+	  break;
+	case CPP_WSTRING:
+	  TREE_TYPE (value) = wchar_array_type_node;
+	  break;
+      }
+
       value = fix_string_type (value);
     }
   else
@@ -3079,6 +3117,8 @@ cp_parser_primary_expression (cp_parser 
 	   string-literal
 	   boolean-literal  */
     case CPP_CHAR:
+    case CPP_CHAR16:
+    case CPP_CHAR32:
     case CPP_WCHAR:
     case CPP_NUMBER:
       token = cp_lexer_consume_token (parser->lexer);
@@ -3130,6 +3170,8 @@ cp_parser_primary_expression (cp_parser 
       return token->u.value;
 
     case CPP_STRING:
+    case CPP_STRING16:
+    case CPP_STRING32:
     case CPP_WSTRING:
       /* ??? Should wide strings be allowed when parser->translate_strings_p
 	 is false (i.e. in attributes)?  If not, we can kill the third
@@ -10762,6 +10804,12 @@ cp_parser_simple_type_specifier (cp_pars
 	decl_specs->explicit_char_p = true;
       type = char_type_node;
       break;
+    case RID_CHAR16:
+      type = char16_type_node;
+      break;
+    case RID_CHAR32:
+      type = char32_type_node;
+      break;
     case RID_WCHAR:
       type = wchar_type_node;
       break;
Index: gcc/c-common.c
===================================================================
--- gcc/c-common.c	(revision 133117)
+++ gcc/c-common.c	(working copy)
@@ -66,6 +66,14 @@ cpp_reader *parse_in;		/* Declared in c-
 #define PID_TYPE "int"
 #endif
 
+#ifndef CHAR16_TYPE
+#define CHAR16_TYPE "short unsigned int"
+#endif
+
+#ifndef CHAR32_TYPE
+#define CHAR32_TYPE "unsigned int"
+#endif
+
 #ifndef WCHAR_TYPE
 #define WCHAR_TYPE "int"
 #endif
@@ -123,6 +131,13 @@ cpp_reader *parse_in;		/* Declared in c-
 	tree signed_wchar_type_node;
 	tree unsigned_wchar_type_node;
 
+	tree char16_type_node;
+	tree signed_char16_type_node;
+	tree unsigned_char16_type_node;
+	tree char32_type_node;
+	tree signed_char32_type_node;
+	tree unsigned_char32_type_node;
+
 	tree float_type_node;
 	tree double_type_node;
 	tree long_double_type_node;
@@ -174,6 +189,16 @@ cpp_reader *parse_in;		/* Declared in c-
 
 	tree wchar_array_type_node;
 
+   Type `char16_t[SOMENUMBER]' or something like it.
+   Used when a UTF-16 string literal is created.
+
+	tree char16_array_type_node;
+
+   Type `char32_t[SOMENUMBER]' or something like it.
+   Used when a UTF-32 string literal is created.
+
+	tree char32_array_type_node;
+
    Type `int ()' -- used for implicit declaration of functions.
 
 	tree default_function_type;
@@ -777,7 +802,7 @@ fname_as_string (int pretty_p)
   strname.text = (unsigned char *) namep;
   strname.len = len - 1;
 
-  if (cpp_interpret_string (parse_in, &strname, 1, &cstr, false))
+  if (cpp_interpret_string (parse_in, &strname, 1, &cstr, CPP_STRING))
     {
       XDELETEVEC (namep);
       return (const char *) cstr.text;
@@ -857,14 +882,28 @@ fname_decl (unsigned int rid, tree id)
 tree
 fix_string_type (tree value)
 {
-  const int wchar_bytes = TYPE_PRECISION (wchar_type_node) / BITS_PER_UNIT;
-  const int wide_flag = TREE_TYPE (value) == wchar_array_type_node;
+  const bool wide = TREE_TYPE (value)
+		    && TREE_TYPE (value) != char_array_type_node;
   int length = TREE_STRING_LENGTH (value);
   int nchars;
   tree e_type, i_type, a_type;
 
   /* Compute the number of elements, for the array type.  */
-  nchars = wide_flag ? length / wchar_bytes : length;
+  if (wide) {
+    if (TREE_TYPE (value) == char16_array_type_node) {
+      nchars = length / (TYPE_PRECISION (char16_type_node) / BITS_PER_UNIT);
+      e_type = char16_type_node;
+    } else if (TREE_TYPE (value) == char32_array_type_node) {
+      nchars = length / (TYPE_PRECISION (char32_type_node) / BITS_PER_UNIT);
+      e_type = char32_type_node;
+    } else {
+      nchars = length / (TYPE_PRECISION (wchar_type_node) / BITS_PER_UNIT);
+      e_type = wchar_type_node;
+    }
+  } else {
+    nchars = length;
+    e_type = char_type_node;
+  }
 
   /* C89 2.2.4.1, C99 5.2.4.1 (Translation limits).  The analogous
      limit in C++98 Annex B is very large (65536) and is not normative,
@@ -899,7 +938,6 @@ fix_string_type (tree value)
      construct the matching unqualified array type first.  The C front
      end does not require this, but it does no harm, so we do it
      unconditionally.  */
-  e_type = wide_flag ? wchar_type_node : char_type_node;
   i_type = build_index_type (build_int_cst (NULL_TREE, nchars - 1));
   a_type = build_array_type (e_type, i_type);
   if (c_dialect_cxx() || warn_write_strings)
@@ -3625,6 +3663,8 @@ c_define_builtins (tree va_list_ref_type
 void
 c_common_nodes_and_builtins (void)
 {
+  int char16_type_size;
+  int char32_type_size;
   int wchar_type_size;
   tree array_domain_type;
   tree va_list_ref_type_node;
@@ -3874,6 +3914,50 @@ c_common_nodes_and_builtins (void)
   wchar_array_type_node
     = build_array_type (wchar_type_node, array_domain_type);
 
+  /* Define `char16_t', `signed char16_t' and `unsigned char16_t'.  */
+  char16_type_node = get_identifier (CHAR16_TYPE);
+  char16_type_node = TREE_TYPE (identifier_global_value (char16_type_node));
+  char16_type_size = TYPE_PRECISION (char16_type_node);
+  if (c_dialect_cxx ())
+    {
+      if (TYPE_UNSIGNED (char16_type_node))
+	char16_type_node = make_unsigned_type (char16_type_size);
+      else
+	char16_type_node = make_signed_type (char16_type_size);
+      record_builtin_type (RID_CHAR16, "char16_t", char16_type_node);
+    }
+  else
+    {
+      signed_char16_type_node = c_common_signed_type (char16_type_node);
+      unsigned_char16_type_node = c_common_unsigned_type (char16_type_node);
+    }
+
+  /* This is for UTF-16 string constants.  */
+  char16_array_type_node
+    = build_array_type (char16_type_node, array_domain_type);
+
+  /* Define `char32_t', `signed char32_t' and `unsigned char32_t'.  */
+  char32_type_node = get_identifier (CHAR32_TYPE);
+  char32_type_node = TREE_TYPE (identifier_global_value (char32_type_node));
+  char32_type_size = TYPE_PRECISION (char32_type_node);
+  if (c_dialect_cxx ())
+    {
+      if (TYPE_UNSIGNED (char32_type_node))
+	char32_type_node = make_unsigned_type (char32_type_size);
+      else
+	char32_type_node = make_signed_type (char32_type_size);
+      record_builtin_type (RID_CHAR32, "char32_t", char32_type_node);
+    }
+  else
+    {
+      signed_char32_type_node = c_common_signed_type (char32_type_node);
+      unsigned_char32_type_node = c_common_unsigned_type (char32_type_node);
+    }
+
+  /* This is for UTF-32 string constants.  */
+  char32_array_type_node
+    = build_array_type (char32_type_node, array_domain_type);
+
   wint_type_node =
     TREE_TYPE (identifier_global_value (get_identifier (WINT_TYPE)));
 
@@ -6652,20 +6736,38 @@ c_parse_error (const char *gmsgid, enum 
 
   if (token == CPP_EOF)
     message = catenate_messages (gmsgid, " at end of input");
-  else if (token == CPP_CHAR || token == CPP_WCHAR)
+  else if (token == CPP_CHAR || token == CPP_WCHAR || token == CPP_CHAR16
+	   || token == CPP_CHAR32)
     {
       unsigned int val = TREE_INT_CST_LOW (value);
-      const char *const ell = (token == CPP_CHAR) ? "" : "L";
+      const char *prefix;
+
+      switch (token) {
+	default:
+	  prefix = "";
+	  break;
+	case CPP_WCHAR:
+	  prefix = "L";
+	  break;
+	case CPP_CHAR16:
+	  prefix = "u";
+	  break;
+	case CPP_CHAR32:
+	  prefix = "U";
+	  break;
+      }
+
       if (val <= UCHAR_MAX && ISGRAPH (val))
 	message = catenate_messages (gmsgid, " before %s'%c'");
       else
 	message = catenate_messages (gmsgid, " before %s'\\x%x'");
 
-      error (message, ell, val);
+      error (message, prefix, val);
       free (message);
       message = NULL;
     }
-  else if (token == CPP_STRING || token == CPP_WSTRING)
+  else if (token == CPP_STRING || token == CPP_WSTRING || token == CPP_STRING16
+	   || token == CPP_STRING32)
     message = catenate_messages (gmsgid, " before string constant");
   else if (token == CPP_NUMBER)
     message = catenate_messages (gmsgid, " before numeric constant");
Index: gcc/c-common.h
===================================================================
--- gcc/c-common.h	(revision 133117)
+++ gcc/c-common.h	(working copy)
@@ -85,7 +85,7 @@ enum rid
   RID_NEW,      RID_OFFSETOF, RID_OPERATOR,
   RID_THIS,     RID_THROW,    RID_TRUE,
   RID_TRY,      RID_TYPENAME, RID_TYPEID,
-  RID_USING,
+  RID_USING,    RID_CHAR16,   RID_CHAR32,
 
   /* casts */
   RID_CONSTCAST, RID_DYNCAST, RID_REINTCAST, RID_STATCAST,
@@ -143,6 +143,12 @@ extern GTY ((length ("(int) RID_MAX"))) 
 
 enum c_tree_index
 {
+    CTI_CHAR16_TYPE,
+    CTI_SIGNED_CHAR16_TYPE,
+    CTI_UNSIGNED_CHAR16_TYPE,
+    CTI_CHAR32_TYPE,
+    CTI_SIGNED_CHAR32_TYPE,
+    CTI_UNSIGNED_CHAR32_TYPE,
     CTI_WCHAR_TYPE,
     CTI_SIGNED_WCHAR_TYPE,
     CTI_UNSIGNED_WCHAR_TYPE,
@@ -155,6 +161,8 @@ enum c_tree_index
     CTI_WIDEST_UINT_LIT_TYPE,
 
     CTI_CHAR_ARRAY_TYPE,
+    CTI_CHAR16_ARRAY_TYPE,
+    CTI_CHAR32_ARRAY_TYPE,
     CTI_WCHAR_ARRAY_TYPE,
     CTI_INT_ARRAY_TYPE,
     CTI_STRING_TYPE,
@@ -190,6 +198,12 @@ struct c_common_identifier GTY(())
   struct cpp_hashnode node;
 };
 
+#define char16_type_node		c_global_trees[CTI_CHAR16_TYPE]
+#define signed_char16_type_node		c_global_trees[CTI_SIGNED_CHAR16_TYPE]
+#define unsigned_char16_type_node	c_global_trees[CTI_UNSIGNED_CHAR16_TYPE]
+#define char32_type_node		c_global_trees[CTI_CHAR32_TYPE]
+#define signed_char32_type_node		c_global_trees[CTI_SIGNED_CHAR32_TYPE]
+#define unsigned_char32_type_node	c_global_trees[CTI_UNSIGNED_CHAR32_TYPE]
 #define wchar_type_node			c_global_trees[CTI_WCHAR_TYPE]
 #define signed_wchar_type_node		c_global_trees[CTI_SIGNED_WCHAR_TYPE]
 #define unsigned_wchar_type_node	c_global_trees[CTI_UNSIGNED_WCHAR_TYPE]
@@ -206,6 +220,8 @@ struct c_common_identifier GTY(())
 #define truthvalue_false_node		c_global_trees[CTI_TRUTHVALUE_FALSE]
 
 #define char_array_type_node		c_global_trees[CTI_CHAR_ARRAY_TYPE]
+#define char16_array_type_node		c_global_trees[CTI_CHAR16_ARRAY_TYPE]
+#define char32_array_type_node		c_global_trees[CTI_CHAR32_ARRAY_TYPE]
 #define wchar_array_type_node		c_global_trees[CTI_WCHAR_ARRAY_TYPE]
 #define int_array_type_node		c_global_trees[CTI_INT_ARRAY_TYPE]
 #define string_type_node		c_global_trees[CTI_STRING_TYPE]
Index: gcc/c-parser.c
===================================================================
--- gcc/c-parser.c	(revision 133117)
+++ gcc/c-parser.c	(working copy)
@@ -5168,12 +5168,16 @@ c_parser_postfix_expression (c_parser *p
     {
     case CPP_NUMBER:
     case CPP_CHAR:
+    case CPP_CHAR16:
+    case CPP_CHAR32:
     case CPP_WCHAR:
       expr.value = c_parser_peek_token (parser)->value;
       expr.original_code = ERROR_MARK;
       c_parser_consume_token (parser);
       break;
     case CPP_STRING:
+    case CPP_STRING16:
+    case CPP_STRING32:
     case CPP_WSTRING:
       expr.value = c_parser_peek_token (parser)->value;
       expr.original_code = STRING_CST;
Index: libcpp/macro.c
===================================================================
--- libcpp/macro.c	(revision 133117)
+++ libcpp/macro.c	(working copy)
@@ -158,7 +158,7 @@ _cpp_builtin_macro_text (cpp_reader *pfi
 		  {
 		    cpp_errno (pfile, CPP_DL_WARNING,
 			"could not determine file timestamp");
-		    pbuffer->timestamp = U"\"??? ??? ?? ??:??:?? ????\"";
+		    pbuffer->timestamp = UC"\"??? ??? ?? ??:??:?? ????\"";
 		  }
 	      }
 	  }
@@ -256,8 +256,8 @@ _cpp_builtin_macro_text (cpp_reader *pfi
 	      cpp_errno (pfile, CPP_DL_WARNING,
 			 "could not determine date and time");
 		
-	      pfile->date = U"\"??? ?? ????\"";
-	      pfile->time = U"\"??:??:??\"";
+	      pfile->date = UC"\"??? ?? ????\"";
+	      pfile->time = UC"\"??:??:??\"";
 	    }
 	}
 
@@ -375,8 +375,10 @@ stringify_arg (cpp_reader *pfile, macro_
 	  continue;
 	}
 
-      escape_it = (token->type == CPP_STRING || token->type == CPP_WSTRING
-		   || token->type == CPP_CHAR || token->type == CPP_WCHAR);
+      escape_it = (token->type == CPP_STRING || token->type == CPP_CHAR
+		   || token->type == CPP_WSTRING || token->type == CPP_WCHAR
+		   || token->type == CPP_STRING32 || token->type == CPP_CHAR32
+		   || token->type == CPP_STRING16 || token->type == CPP_CHAR16);
 
       /* Room for each char being written in octal, initial space and
 	 final quote and NUL.  */
Index: libcpp/directives.c
===================================================================
--- libcpp/directives.c	(revision 133117)
+++ libcpp/directives.c	(working copy)
@@ -188,7 +188,7 @@ DIRECTIVE_TABLE
    did use this notation in its preprocessed output.  */
 static const directive linemarker_dir =
 {
-  do_linemarker, U"#", 1, KANDR, IN_I
+  do_linemarker, UC"#", 1, KANDR, IN_I
 };
 
 #define SEEN_EOL() (pfile->cur_token[-1].type == CPP_EOF)
@@ -689,7 +689,7 @@ parse_include (cpp_reader *pfile, int *p
       const unsigned char *dir;
 
       if (pfile->directive == &dtable[T_PRAGMA])
-	dir = U"pragma dependency";
+	dir = UC"pragma dependency";
       else
 	dir = pfile->directive->name;
       cpp_error (pfile, CPP_DL_ERROR, "#%s expects \"FILENAME\" or <FILENAME>",
@@ -1077,7 +1077,7 @@ register_pragma_1 (cpp_reader *pfile, co
 
   if (space)
     {
-      node = cpp_lookup (pfile, U space, strlen (space));
+      node = cpp_lookup (pfile, UC space, strlen (space));
       entry = lookup_pragma_entry (*chain, node);
       if (!entry)
 	{
@@ -1106,7 +1106,7 @@ register_pragma_1 (cpp_reader *pfile, co
     }
 
   /* Check for duplicates.  */
-  node = cpp_lookup (pfile, U name, strlen (name));
+  node = cpp_lookup (pfile, UC name, strlen (name));
   entry = lookup_pragma_entry (*chain, node);
   if (entry == NULL)
     {
@@ -1254,7 +1254,7 @@ restore_registered_pragmas (cpp_reader *
     {
       if (pe->is_nspace)
 	sd = restore_registered_pragmas (pfile, pe->u.space, sd);
-      pe->pragma = cpp_lookup (pfile, U *sd, strlen (*sd));
+      pe->pragma = cpp_lookup (pfile, UC *sd, strlen (*sd));
       free (*sd);
       sd++;
     }
@@ -1483,7 +1483,8 @@ get__Pragma_string (cpp_reader *pfile)
   string = get_token_no_padding (pfile);
   if (string->type == CPP_EOF)
     _cpp_backup_tokens (pfile, 1);
-  if (string->type != CPP_STRING && string->type != CPP_WSTRING)
+  if (string->type != CPP_STRING && string->type != CPP_WSTRING
+      && string->type != CPP_STRING32 && string->type != CPP_STRING16)
     return NULL;
 
   paren = get_token_no_padding (pfile);
Index: libcpp/include/cpplib.h
===================================================================
--- libcpp/include/cpplib.h	(revision 133117)
+++ libcpp/include/cpplib.h	(working copy)
@@ -123,10 +123,14 @@ struct _cpp_file;
 									\
   TK(CHAR,		LITERAL) /* 'char' */				\
   TK(WCHAR,		LITERAL) /* L'char' */				\
+  TK(CHAR16,		LITERAL) /* u'char' */				\
+  TK(CHAR32,		LITERAL) /* U'char' */				\
   TK(OTHER,		LITERAL) /* stray punctuation */		\
 									\
   TK(STRING,		LITERAL) /* "string" */				\
   TK(WSTRING,		LITERAL) /* L"string" */			\
+  TK(STRING16,		LITERAL) /* u"string" */			\
+  TK(STRING32,		LITERAL) /* U"string" */			\
   TK(OBJC_STRING,	LITERAL) /* @"string" - Objective-C */		\
   TK(HEADER_NAME,	LITERAL) /* <stdio.h> in #include */		\
 									\
@@ -703,10 +707,10 @@ extern cppchar_t cpp_interpret_charconst
 /* Evaluate a vector of CPP_STRING or CPP_WSTRING tokens.  */
 extern bool cpp_interpret_string (cpp_reader *,
 				  const cpp_string *, size_t,
-				  cpp_string *, bool);
+				  cpp_string *, enum cpp_ttype);
 extern bool cpp_interpret_string_notranslate (cpp_reader *,
 					      const cpp_string *, size_t,
-					      cpp_string *, bool);
+					      cpp_string *, enum cpp_ttype);
 
 /* Convert a host character constant to the execution character set.  */
 extern cppchar_t cpp_host_to_exec_charset (cpp_reader *, cppchar_t);
Index: libcpp/include/cpp-id-data.h
===================================================================
--- libcpp/include/cpp-id-data.h	(revision 133117)
+++ libcpp/include/cpp-id-data.h	(working copy)
@@ -22,7 +22,7 @@ Foundation, 51 Franklin Street, Fifth Fl
 typedef unsigned char uchar;
 #endif
 
-#define U (const unsigned char *)  /* Intended use: U"string" */
+#define UC (const unsigned char *)  /* Intended use: UC"string" */
 
 /* Chained list of answers to an assertion.  */
 struct answer GTY(())
Index: libcpp/expr.c
===================================================================
--- libcpp/expr.c	(revision 133117)
+++ libcpp/expr.c	(working copy)
@@ -691,6 +691,8 @@ eval_token (cpp_reader *pfile, const cpp
 
     case CPP_WCHAR:
     case CPP_CHAR:
+    case CPP_CHAR16:
+    case CPP_CHAR32:
       {
 	cppchar_t cc = cpp_interpret_charconst (pfile, token,
 						&temp, &unsignedp);
@@ -849,6 +851,8 @@ _cpp_parse_expr (cpp_reader *pfile)
 	case CPP_NUMBER:
 	case CPP_CHAR:
 	case CPP_WCHAR:
+	case CPP_CHAR16:
+	case CPP_CHAR32:
 	case CPP_NAME:
 	case CPP_HASH:
 	  if (!want_value)
Index: libcpp/internal.h
===================================================================
--- libcpp/internal.h	(revision 133117)
+++ libcpp/internal.h	(working copy)
@@ -48,6 +48,7 @@ struct cset_converter
 {
   convert_f func;
   iconv_t cd;
+  int width;
 };
 
 #define BITS_PER_CPPCHAR_T (CHAR_BIT * sizeof (cppchar_t))
@@ -399,6 +400,14 @@ struct cpp_reader
   struct cset_converter narrow_cset_desc;
 
   /* Descriptor for converting from the source character set to the
+     UTF-16 execution character set.  */
+  struct cset_converter char16_cset_desc;
+
+  /* Descriptor for converting from the source character set to the
+     UTF-32 execution character set.  */
+  struct cset_converter char32_cset_desc;
+
+  /* Descriptor for converting from the source character set to the
      wide execution character set.  */
   struct cset_converter wide_cset_desc;
 
Index: libcpp/lex.c
===================================================================
--- libcpp/lex.c	(revision 133117)
+++ libcpp/lex.c	(working copy)
@@ -39,10 +39,10 @@ struct token_spelling
 };
 
 static const unsigned char *const digraph_spellings[] =
-{ U"%:", U"%:%:", U"<:", U":>", U"<%", U"%>" };
+{ UC"%:", UC"%:%:", UC"<:", UC":>", UC"<%", UC"%>" };
 
-#define OP(e, s) { SPELL_OPERATOR, U s  },
-#define TK(e, s) { SPELL_ ## s,    U #e },
+#define OP(e, s) { SPELL_OPERATOR, UC s  },
+#define TK(e, s) { SPELL_ ## s,    UC #e },
 static const struct token_spelling token_spellings[N_TTYPES] = { TTYPE_TABLE };
 #undef OP
 #undef TK
@@ -611,8 +611,8 @@ create_literal (cpp_reader *pfile, cpp_t
 
 /* Lexes a string, character constant, or angle-bracketed header file
    name.  The stored string contains the spelling, including opening
-   quote and leading any leading 'L'.  It returns the type of the
-   literal, or CPP_OTHER if it was not properly terminated.
+   quote and any leading 'L', 'u' or 'U'.  It returns the type
+   of the literal, or CPP_OTHER if it was not properly terminated.
 
    The spelling is NUL-terminated, but it is not guaranteed that this
    is the first NUL since embedded NULs are preserved.  */
@@ -626,12 +626,17 @@ lex_string (cpp_reader *pfile, cpp_token
 
   cur = base;
   terminator = *cur++;
-  if (terminator == 'L')
+  if (terminator == 'L' || terminator == 'u' || terminator == 'U')
     terminator = *cur++;
   if (terminator == '\"')
-    type = *base == 'L' ? CPP_WSTRING: CPP_STRING;
+    type = *base == 'L' ? CPP_WSTRING
+			: *base == 'U' ? CPP_STRING32
+				       : *base == 'u' ? CPP_STRING16
+						      : CPP_STRING;
   else if (terminator == '\'')
-    type = *base == 'L' ? CPP_WCHAR: CPP_CHAR;
+    type = *base == 'L' ? CPP_WCHAR
+			: *base == 'U' ? CPP_CHAR32
+				       : *base == 'u' ? CPP_CHAR16 : CPP_CHAR;
   else
     terminator = '>', type = CPP_HEADER_NAME;
 
@@ -965,7 +970,9 @@ _cpp_lex_direct (cpp_reader *pfile)
       }
 
     case 'L':
-      /* 'L' may introduce wide characters or strings.  */
+    case 'u':
+    case 'U':
+      /* 'L', 'u' or 'U' may introduce wide characters or strings.  */
       if (*buffer->cur == '\'' || *buffer->cur == '"')
 	{
 	  lex_string (pfile, result, buffer->cur - 1);
@@ -977,12 +984,12 @@ _cpp_lex_direct (cpp_reader *pfile)
     case 'a': case 'b': case 'c': case 'd': case 'e': case 'f':
     case 'g': case 'h': case 'i': case 'j': case 'k': case 'l':
     case 'm': case 'n': case 'o': case 'p': case 'q': case 'r':
-    case 's': case 't': case 'u': case 'v': case 'w': case 'x':
+    case 's': case 't':           case 'v': case 'w': case 'x':
     case 'y': case 'z':
     case 'A': case 'B': case 'C': case 'D': case 'E': case 'F':
     case 'G': case 'H': case 'I': case 'J': case 'K':
     case 'M': case 'N': case 'O': case 'P': case 'Q': case 'R':
-    case 'S': case 'T': case 'U': case 'V': case 'W': case 'X':
+    case 'S': case 'T':           case 'V': case 'W': case 'X':
     case 'Y': case 'Z':
       result->type = CPP_NAME;
       {
Index: libcpp/charset.c
===================================================================
--- libcpp/charset.c	(revision 133117)
+++ libcpp/charset.c	(working copy)
@@ -642,6 +642,7 @@ init_iconv_desc (cpp_reader *pfile, cons
     {
       ret.func = convert_no_conversion;
       ret.cd = (iconv_t) -1;
+      ret.width = -1;
       return ret;
     }
 
@@ -655,6 +656,7 @@ init_iconv_desc (cpp_reader *pfile, cons
       {
 	ret.func = conversion_tab[i].func;
 	ret.cd = conversion_tab[i].fake_cd;
+	ret.width = -1;
 	return ret;
       }
 
@@ -663,6 +665,7 @@ init_iconv_desc (cpp_reader *pfile, cons
     {
       ret.func = convert_using_iconv;
       ret.cd = iconv_open (to, from);
+      ret.width = -1;
 
       if (ret.cd == (iconv_t) -1)
 	{
@@ -683,6 +686,7 @@ init_iconv_desc (cpp_reader *pfile, cons
 		 from, to);
       ret.func = convert_no_conversion;
       ret.cd = (iconv_t) -1;
+      ret.width = -1;
     }
   return ret;
 }
@@ -716,7 +720,17 @@ cpp_init_iconv (cpp_reader *pfile)
     wcset = default_wcset;
 
   pfile->narrow_cset_desc = init_iconv_desc (pfile, ncset, SOURCE_CHARSET);
+  pfile->narrow_cset_desc.width = CPP_OPTION (pfile, char_precision);
+  pfile->char16_cset_desc = init_iconv_desc (pfile,
+					     be ? "UTF-16BE" : "UTF-16LE",
+					     SOURCE_CHARSET);
+  pfile->char16_cset_desc.width = 16;
+  pfile->char32_cset_desc = init_iconv_desc (pfile,
+					     be ? "UTF-32BE" : "UTF-32LE",
+					     SOURCE_CHARSET);
+  pfile->char32_cset_desc.width = 32;
   pfile->wide_cset_desc = init_iconv_desc (pfile, wcset, SOURCE_CHARSET);
+  pfile->wide_cset_desc.width = CPP_OPTION (pfile, wchar_precision);
 }
 
 /* Destroy iconv(3) descriptors set up by cpp_init_iconv, if necessary.  */
@@ -1051,15 +1065,13 @@ _cpp_valid_ucn (cpp_reader *pfile, const
    An advanced pointer is returned.  Issues all relevant diagnostics.  */
 static const uchar *
 convert_ucn (cpp_reader *pfile, const uchar *from, const uchar *limit,
-	     struct _cpp_strbuf *tbuf, bool wide)
+	     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
 {
   cppchar_t ucn;
   uchar buf[6];
   uchar *bufp = buf;
   size_t bytesleft = 6;
   int rval;
-  struct cset_converter cvt
-    = wide ? pfile->wide_cset_desc : pfile->narrow_cset_desc;
   struct normalize_state nst = INITIAL_NORMALIZE_STATE;
 
   from++;  /* Skip u/U.  */
@@ -1086,14 +1098,15 @@ convert_ucn (cpp_reader *pfile, const uc
    function issues no diagnostics and never fails.  */
 static void
 emit_numeric_escape (cpp_reader *pfile, cppchar_t n,
-		     struct _cpp_strbuf *tbuf, bool wide)
+		     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
 {
-  if (wide)
+  size_t width = cvt.width;
+
+  if (width != CPP_OPTION (pfile, char_precision))
     {
       /* We have to render this into the target byte order, which may not
 	 be our byte order.  */
       bool bigend = CPP_OPTION (pfile, bytes_big_endian);
-      size_t width = CPP_OPTION (pfile, wchar_precision);
       size_t cwidth = CPP_OPTION (pfile, char_precision);
       size_t cmask = width_to_mask (cwidth);
       size_t nbwc = width / cwidth;
@@ -1136,12 +1149,11 @@ emit_numeric_escape (cpp_reader *pfile, 
    number.  You can, e.g. generate surrogate pairs this way.  */
 static const uchar *
 convert_hex (cpp_reader *pfile, const uchar *from, const uchar *limit,
-	     struct _cpp_strbuf *tbuf, bool wide)
+	     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
 {
   cppchar_t c, n = 0, overflow = 0;
   int digits_found = 0;
-  size_t width = (wide ? CPP_OPTION (pfile, wchar_precision)
-		  : CPP_OPTION (pfile, char_precision));
+  size_t width = cvt.width;
   size_t mask = width_to_mask (width);
 
   if (CPP_WTRADITIONAL (pfile))
@@ -1174,7 +1186,7 @@ convert_hex (cpp_reader *pfile, const uc
       n &= mask;
     }
 
-  emit_numeric_escape (pfile, n, tbuf, wide);
+  emit_numeric_escape (pfile, n, tbuf, cvt);
 
   return from;
 }
@@ -1187,12 +1199,11 @@ convert_hex (cpp_reader *pfile, const uc
    number.  */
 static const uchar *
 convert_oct (cpp_reader *pfile, const uchar *from, const uchar *limit,
-	     struct _cpp_strbuf *tbuf, bool wide)
+	     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
 {
   size_t count = 0;
   cppchar_t c, n = 0;
-  size_t width = (wide ? CPP_OPTION (pfile, wchar_precision)
-		  : CPP_OPTION (pfile, char_precision));
+  size_t width = cvt.width;
   size_t mask = width_to_mask (width);
   bool overflow = false;
 
@@ -1213,7 +1224,7 @@ convert_oct (cpp_reader *pfile, const uc
       n &= mask;
     }
 
-  emit_numeric_escape (pfile, n, tbuf, wide);
+  emit_numeric_escape (pfile, n, tbuf, cvt);
 
   return from;
 }
@@ -1224,7 +1235,7 @@ convert_oct (cpp_reader *pfile, const uc
    pointer.  Handles all relevant diagnostics.  */
 static const uchar *
 convert_escape (cpp_reader *pfile, const uchar *from, const uchar *limit,
-		struct _cpp_strbuf *tbuf, bool wide)
+		struct _cpp_strbuf *tbuf, struct cset_converter cvt)
 {
   /* Values of \a \b \e \f \n \r \t \v respectively.  */
 #if HOST_CHARSET == HOST_CHARSET_ASCII
@@ -1236,23 +1247,21 @@ convert_escape (cpp_reader *pfile, const
 #endif
 
   uchar c;
-  struct cset_converter cvt
-    = wide ? pfile->wide_cset_desc : pfile->narrow_cset_desc;
 
   c = *from;
   switch (c)
     {
       /* UCNs, hex escapes, and octal escapes are processed separately.  */
     case 'u': case 'U':
-      return convert_ucn (pfile, from, limit, tbuf, wide);
+      return convert_ucn (pfile, from, limit, tbuf, cvt);
 
     case 'x':
-      return convert_hex (pfile, from, limit, tbuf, wide);
+      return convert_hex (pfile, from, limit, tbuf, cvt);
       break;
 
     case '0':  case '1':  case '2':  case '3':
     case '4':  case '5':  case '6':  case '7':
-      return convert_oct (pfile, from, limit, tbuf, wide);
+      return convert_oct (pfile, from, limit, tbuf, cvt);
 
       /* Various letter escapes.  Get the appropriate host-charset
 	 value into C.  */
@@ -1312,6 +1321,26 @@ convert_escape (cpp_reader *pfile, const
   return from + 1;
 }
 \f
+/* TYPE is a token type.  The return value is the conversion needed to
+   convert from source to execution character set for the given type.  */
+static struct cset_converter
+convertor_for_type (cpp_reader *pfile, enum cpp_ttype type)
+{
+  switch (type) {
+    default:
+	return pfile->narrow_cset_desc;
+    case CPP_CHAR16:
+    case CPP_STRING16:
+	return pfile->char16_cset_desc;
+    case CPP_CHAR32:
+    case CPP_STRING32:
+	return pfile->char32_cset_desc;
+    case CPP_WCHAR:
+    case CPP_WSTRING:
+	return pfile->wide_cset_desc;
+  }
+}
+
 /* FROM is an array of cpp_string structures of length COUNT.  These
    are to be converted from the source to the execution character set,
    escape sequences translated, and finally all are to be
@@ -1320,13 +1349,12 @@ convert_escape (cpp_reader *pfile, const
    false for failure.  */
 bool
 cpp_interpret_string (cpp_reader *pfile, const cpp_string *from, size_t count,
-		      cpp_string *to, bool wide)
+		      cpp_string *to, enum cpp_ttype type)
 {
   struct _cpp_strbuf tbuf;
   const uchar *p, *base, *limit;
   size_t i;
-  struct cset_converter cvt
-    = wide ? pfile->wide_cset_desc : pfile->narrow_cset_desc;
+  struct cset_converter cvt = convertor_for_type (pfile, type);
 
   tbuf.asize = MAX (OUTBUF_BLOCK_SIZE, from->len);
   tbuf.text = XNEWVEC (uchar, tbuf.asize);
@@ -1335,7 +1363,7 @@ cpp_interpret_string (cpp_reader *pfile,
   for (i = 0; i < count; i++)
     {
       p = from[i].text;
-      if (*p == 'L') p++;
+      if (*p == 'L' || *p == 'u' || *p == 'U') p++;
       p++; /* Skip leading quote.  */
       limit = from[i].text + from[i].len - 1; /* Skip trailing quote.  */
 
@@ -1354,12 +1382,12 @@ cpp_interpret_string (cpp_reader *pfile,
 	  if (p == limit)
 	    break;
 
-	  p = convert_escape (pfile, p + 1, limit, &tbuf, wide);
+	  p = convert_escape (pfile, p + 1, limit, &tbuf, cvt);
 	}
     }
   /* NUL-terminate the 'to' buffer and translate it to a cpp_string
      structure.  */
-  emit_numeric_escape (pfile, 0, &tbuf, wide);
+  emit_numeric_escape (pfile, 0, &tbuf, cvt);
   tbuf.text = XRESIZEVEC (uchar, tbuf.text, tbuf.len);
   to->text = tbuf.text;
   to->len = tbuf.len;
@@ -1375,7 +1403,8 @@ cpp_interpret_string (cpp_reader *pfile,
    in a string, but do not perform character set conversion.  */
 bool
 cpp_interpret_string_notranslate (cpp_reader *pfile, const cpp_string *from,
-				  size_t count,	cpp_string *to, bool wide)
+				  size_t count,	cpp_string *to,
+				  enum cpp_ttype type ATTRIBUTE_UNUSED)
 {
   struct cset_converter save_narrow_cset_desc = pfile->narrow_cset_desc;
   bool retval;
@@ -1383,7 +1412,7 @@ cpp_interpret_string_notranslate (cpp_re
   pfile->narrow_cset_desc.func = convert_no_conversion;
   pfile->narrow_cset_desc.cd = (iconv_t) -1;
 
-  retval = cpp_interpret_string (pfile, from, count, to, wide);
+  retval = cpp_interpret_string (pfile, from, count, to, CPP_STRING);
 
   pfile->narrow_cset_desc = save_narrow_cset_desc;
   return retval;
@@ -1462,13 +1491,14 @@ narrow_str_to_charconst (cpp_reader *pfi
 /* Subroutine of cpp_interpret_charconst which performs the conversion
    to a number, for wide strings.  STR is the string structure returned
    by cpp_interpret_string.  PCHARS_SEEN and UNSIGNEDP are as for
-   cpp_interpret_charconst.  */
+   cpp_interpret_charconst.  TYPE is the token type.  */
 static cppchar_t
 wide_str_to_charconst (cpp_reader *pfile, cpp_string str,
-		       unsigned int *pchars_seen, int *unsignedp)
+		       unsigned int *pchars_seen, int *unsignedp,
+		       enum cpp_ttype type)
 {
   bool bigend = CPP_OPTION (pfile, bytes_big_endian);
-  size_t width = CPP_OPTION (pfile, wchar_precision);
+  size_t width = convertor_for_type (pfile, type).width;
   size_t cwidth = CPP_OPTION (pfile, char_precision);
   size_t mask = width_to_mask (width);
   size_t cmask = width_to_mask (cwidth);
@@ -1490,7 +1520,7 @@ wide_str_to_charconst (cpp_reader *pfile
   /* Wide character constants have type wchar_t, and a single
      character exactly fills a wchar_t, so a multi-character wide
      character constant is guaranteed to overflow.  */
-  if (off > 0)
+  if (str.len > nbwc * 2)
     cpp_error (pfile, CPP_DL_WARNING,
 	       "character constant too long for its type");
 
@@ -1518,20 +1548,21 @@ cpp_interpret_charconst (cpp_reader *pfi
 			 unsigned int *pchars_seen, int *unsignedp)
 {
   cpp_string str = { 0, 0 };
-  bool wide = (token->type == CPP_WCHAR);
+  bool wide = (token->type != CPP_CHAR);
   cppchar_t result;
 
-  /* an empty constant will appear as L'' or '' */
+  /* an empty constant will appear as L'', u'', U'' or '' */
   if (token->val.str.len == (size_t) (2 + wide))
     {
       cpp_error (pfile, CPP_DL_ERROR, "empty character constant");
       return 0;
     }
-  else if (!cpp_interpret_string (pfile, &token->val.str, 1, &str, wide))
+  else if (!cpp_interpret_string (pfile, &token->val.str, 1, &str, token->type))
     return 0;
 
   if (wide)
-    result = wide_str_to_charconst (pfile, str, pchars_seen, unsignedp);
+    result = wide_str_to_charconst (pfile, str, pchars_seen, unsignedp,
+				    token->type);
   else
     result = narrow_str_to_charconst (pfile, str, pchars_seen, unsignedp);
 

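A note on the new length check in wide_str_to_charconst.  The following
is a minimal sketch, not libcpp code.  It assumes (this is not shown in
the hunk above) that nbwc is the number of bytes in one code unit and
that the converted buffer ends with a NUL code unit, so a constant made
of a single code unit occupies exactly nbwc * 2 bytes.  The helper name
charconst_too_long is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of libcpp.  LEN is the byte length of
   the converted constant including its terminating NUL code unit;
   UNIT_BYTES is the size of one code unit (2 for UTF-16, 4 for UTF-32).
   Returns true when the constant holds more than one code unit.  */
static bool
charconst_too_long (size_t len, size_t unit_bytes)
{
  assert (unit_bytes > 0 && len % unit_bytes == 0);
  /* One code unit plus the terminator is exactly 2 * unit_bytes.  */
  return len > unit_bytes * 2;
}
```

Under those assumptions u'a' converts to 4 bytes and passes, while
u'\U00064321' converts to a surrogate pair plus terminator (6 bytes)
and is reported as too long; U'\U00064321' converts to 8 bytes and
passes.
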

* Re: [PATCH] utf-16 and utf-32 support in C and C++
  2008-03-13 19:33 [PATCH] utf-16 and utf-32 support in C and C++ Kris Van Hees
  2008-03-13 19:34 ` Kris Van Hees
@ 2008-03-13 19:56 ` Andrew Pinski
  2008-03-13 20:14   ` Kris Van Hees
  2008-03-13 19:59 ` Paul Koning
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: Andrew Pinski @ 2008-03-13 19:56 UTC (permalink / raw)
  To: Kris Van Hees; +Cc: gcc-patches

On Thu, Mar 13, 2008 at 12:32 PM, Kris Van Hees
<kris.van.hees@oracle.com> wrote:
>  This patch provides an implementation for support of UTF-16 and UTF-32
>  character data types in C and C++, based on the ISO/IEC draft technical
>  report for C (ISO/IEC JTC1 SC22 WG14 N1040) and the proposal for C++
>  (ISO/IEC JTC1 SC22 WG21 N2249).  Neither proposal defines a specific
>  encoding for UTF-16.  This implementation uses the target endianness
>  to determine whether UTF-16BE or UTF-16LE will be used.

I have a couple of questions about the ABI with this patch: how do
char16_t and char32_t get mangled for C++ code?  Is this documented
anywhere?  How does promotion work with these types in C++ and C, and
is this tested?  I remember reading the technical draft for C and it
mentioned that the size does not have to be exactly 16 (or 32) bits, so
it might be best if you added documentation to the extension page
about this extension.

I don't see any of the testcases attached.

Thanks,
Andrew Pinski
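
For what it is worth, the size and promotion questions can be probed
with a few lines of C11.  This is only a sketch: it uses uint_least16_t
and uint_least32_t from <stdint.h> as stand-ins for char16_t and
char32_t (an assumption; the patch itself uses short unsigned int and
unsigned int), and the function names are made up for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Width in bits of the stand-in for char16_t; may exceed 16.  */
static size_t
least16_bits (void)
{
  size_t bits = sizeof (uint_least16_t) * CHAR_BIT;
  assert (bits >= 16);
  return bits;
}

/* Width in bits of the stand-in for char32_t; may exceed 32.  */
static size_t
least32_bits (void)
{
  size_t bits = sizeof (uint_least32_t) * CHAR_BIT;
  assert (bits >= 32);
  return bits;
}

/* Reports how a 16-bit code unit behaves under the integer promotions:
   1 if it promotes to int, 2 if to unsigned int, 0 otherwise.
   Unary plus is used only to trigger the promotion.  */
static int
promoted_kind (void)
{
  uint_least16_t unit = 0x2029;
  return _Generic (+unit, int: 1, unsigned int: 2, default: 0);
}
```

On targets where int is wider than 16 bits one would expect
promoted_kind to report int, the same as for any unsigned short.
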


* Re: [PATCH] utf-16 and utf-32 support in C and C++
  2008-03-13 19:33 [PATCH] utf-16 and utf-32 support in C and C++ Kris Van Hees
  2008-03-13 19:34 ` Kris Van Hees
  2008-03-13 19:56 ` Andrew Pinski
@ 2008-03-13 19:59 ` Paul Koning
  2008-03-13 19:59   ` Andrew Pinski
  2008-03-14  1:18 ` Joseph S. Myers
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: Paul Koning @ 2008-03-13 19:59 UTC (permalink / raw)
  To: kris.van.hees; +Cc: gcc-patches

Is the u"foo" and U"foo" notation a standard?  Other modifiers on
literals (like 1L or 42U) are case insensitive.  I wonder if it
wouldn't be better for that to be true here as well.  Perhaps the
modifiers could be U"foo" and UL"foo" ?

	  paul


* Re: [PATCH] utf-16 and utf-32 support in C and C++
  2008-03-13 19:59 ` Paul Koning
@ 2008-03-13 19:59   ` Andrew Pinski
  0 siblings, 0 replies; 11+ messages in thread
From: Andrew Pinski @ 2008-03-13 19:59 UTC (permalink / raw)
  To: Paul Koning; +Cc: kris.van.hees, gcc-patches

On Thu, Mar 13, 2008 at 12:46 PM, Paul Koning <Paul_Koning@dell.com> wrote:
> Is the u"foo" and U"foo" notation a standard?

Yes, I know it seems weird but that is what the C draft technical
report and the C++ proposal both say.

I would agree that U and UL is better but I did not make this up.

--- Pinski
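
To make the case distinction concrete, here is a small sketch using the
C11 spellings of these literals (assuming the final syntax matches the
drafts; the helper names are hypothetical).  Unlike the 1L or 42U
suffixes, the lower-case and upper-case prefixes select different
element types.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Bits per element of a u"..." literal (UTF-16 code units).  */
static size_t
lower_u_unit_bits (void)
{
  return sizeof (u"foo"[0]) * CHAR_BIT;
}

/* Bits per element of a U"..." literal (UTF-32 code units).  */
static size_t
upper_U_unit_bits (void)
{
  return sizeof (U"foo"[0]) * CHAR_BIT;
}

/* Number of elements in u"foo", counting the terminating NUL.  */
static size_t
u_foo_elements (void)
{
  size_t n = sizeof (u"foo") / sizeof (u"foo"[0]);
  assert (n > 0);
  return n;
}
```
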


* Re: [PATCH] utf-16 and utf-32 support in C and C++
  2008-03-13 19:56 ` Andrew Pinski
@ 2008-03-13 20:14   ` Kris Van Hees
  2008-03-13 20:34     ` Kris Van Hees
  0 siblings, 1 reply; 11+ messages in thread
From: Kris Van Hees @ 2008-03-13 20:14 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: gcc-patches

On Thu, Mar 13, 2008 at 12:45:36PM -0700, Andrew Pinski wrote:
> On Thu, Mar 13, 2008 at 12:32 PM, Kris Van Hees
> <kris.van.hees@oracle.com> wrote:
> >  This patch provides an implementation for support of UTF-16 and UTF-32
> >  character data types in C and C++, based on the ISO/IEC draft technical
> >  report for C (ISO/IEC JTC1 SC22 WG14 N1040) and the proposal for C++
> >  (ISO/IEC JTC1 SC22 WG21 N2249).  Neither proposal defines a specific
> >  encoding for UTF-16.  This implementation uses the target endianness
> >  to determine whether UTF-16BE or UTF-16LE will be used.
> 
> I have a couple of questions about the ABI with this patch: how do
> char16_t and char32_t get mangled for C++ code?  Is this documented
> anywhere?  How does promotion work with these types in C++ and C, and
> is this tested?  I remember reading the technical draft for C and it
> mentioned that the size does not have to be exactly 16 (or 32) bits, so
> it might be best if you added documentation to the extension page
> about this extension.

Let me get back to you on this, because I probably should solve the
following first...

> I don't see any of the testcases attached.

Oops - that is a stupid mistake on my end.  Generated the diff without
having it include new files.  I'll correct that immediately.

	Kris
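
On the endianness point in the quoted description, a sketch of what
"target endianness" means for a single UTF-16 code unit may help.  It
is illustrative only: the helper names are hypothetical and it inspects
the host, not a cross-compilation target.  The value of a code unit is
the same everywhere; only the order of its bytes in memory differs
between UTF-16LE and UTF-16BE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian host, 0 on a big-endian one.  */
static int
host_is_little_endian (void)
{
  uint16_t probe = 1;
  unsigned char first;
  memcpy (&first, &probe, 1);
  return first == 1;
}

/* Returns byte INDEX (0 or 1) of UNIT as it is laid out in memory.  */
static unsigned char
utf16_unit_byte (uint16_t unit, size_t index)
{
  unsigned char bytes[sizeof unit];
  assert (index < sizeof unit);
  memcpy (bytes, &unit, sizeof unit);
  return bytes[index];
}
```

For U+2029 the first byte in memory is 0x29 on a little-endian host
and 0x20 on a big-endian one, while a comparison such as
unit == 0x2029 holds on both.
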


* Re: [PATCH] utf-16 and utf-32 support in C and C++
  2008-03-13 20:14   ` Kris Van Hees
@ 2008-03-13 20:34     ` Kris Van Hees
  2008-03-14  1:14       ` Joseph S. Myers
  0 siblings, 1 reply; 11+ messages in thread
From: Kris Van Hees @ 2008-03-13 20:34 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: gcc-patches

On Thu, Mar 13, 2008 at 03:58:54PM -0400, Kris Van Hees wrote:
> On Thu, Mar 13, 2008 at 12:45:36PM -0700, Andrew Pinski wrote:
> > On Thu, Mar 13, 2008 at 12:32 PM, Kris Van Hees
> > <kris.van.hees@oracle.com> wrote:
> > >  This patch provides an implementation for support of UTF-16 and UTF-32
> > >  character data types in C and C++, based on the ISO/IEC draft technical
> > >  report for C (ISO/IEC JTC1 SC22 WG14 N1040) and the proposal for C++
> > >  (ISO/IEC JTC1 SC22 WG21 N2249).  Neither proposal defines a specific
> > >  encoding for UTF-16.  This implementation uses the target endianness
> > >  to determine whether UTF-16BE or UTF-16LE will be used.
> >
> > I don't see any of the testcases attached.
> 
> Oops - that is a stupid mistake on my end.  Generated the diff without
> having it include new files.  I'll correct that immediately.
> 
> 	Kris

Here you go.  Sorry about that.
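
For readers of the tests: the Y1/Y2 values in the UTF-16 cases are the
surrogate pair for U+64321.  A minimal sketch of that arithmetic
follows; the function names are hypothetical and this is not the
conversion code in libcpp.

```c
#include <assert.h>
#include <stdint.h>

/* Number of UTF-16 code units needed for code point CP.  */
static int
utf16_units_needed (uint32_t cp)
{
  assert (cp <= 0x10FFFF);
  return cp >= 0x10000 ? 2 : 1;
}

/* First (high) surrogate for a code point outside the BMP.  */
static uint16_t
utf16_high_surrogate (uint32_t cp)
{
  assert (cp >= 0x10000 && cp <= 0x10FFFF);
  return (uint16_t) (0xD800 + ((cp - 0x10000) >> 10));
}

/* Second (low) surrogate for a code point outside the BMP.  */
static uint16_t
utf16_low_surrogate (uint32_t cp)
{
  assert (cp >= 0x10000 && cp <= 0x10FFFF);
  return (uint16_t) (0xDC00 + ((cp - 0x10000) & 0x3FF));
}
```

With cp = 0x64321 this gives 0xD950 and 0xDF21, matching Y1 and Y2 in
utf16-2.c.
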

Index: gcc/testsuite/gcc.dg/utf32-2.c
===================================================================
--- gcc/testsuite/gcc.dg/utf32-2.c	(revision 0)
+++ gcc/testsuite/gcc.dg/utf32-2.c	(revision 0)
@@ -0,0 +1,30 @@
+/* Contributed by Kris Van Hees <kris.van.hees@oracle.com> */
+/* Test the support for char32_t* string constants. */
+/* { dg-do run } */
+/* { dg-options "-std=c99 -Wall -Werror" } */
+
+typedef unsigned int char32_t;
+
+extern void abort (void);
+
+char32_t	*s0 = U"ab";
+char32_t	*s1 = U"a\u0024";
+char32_t	*s2 = U"a\u2029";
+char32_t	*s3 = U"a\U00064321";
+
+#define A	0x00000061
+#define B	0x00000062
+#define D	0x00000024
+#define X	0x00002029
+#define Y	0x00064321
+
+int main () {
+    if (s0[0] != A || s0[1] != B || s0[2] != 0x00000000)
+	abort();
+    if (s1[0] != A || s1[1] != D || s1[2] != 0x00000000)
+	abort();
+    if (s2[0] != A || s2[1] != X || s2[2] != 0x00000000)
+	abort();
+    if (s3[0] != A || s3[1] != Y || s3[2] != 0x00000000)
+	abort();
+}
Index: gcc/testsuite/gcc.dg/utf32-4.c
===================================================================
--- gcc/testsuite/gcc.dg/utf32-4.c	(revision 0)
+++ gcc/testsuite/gcc.dg/utf32-4.c	(revision 0)
@@ -0,0 +1,22 @@
+/* Contributed by Kris Van Hees <kris.van.hees@oracle.com> */
+/* Expected errors for char32_t character constants. */
+/* { dg-do compile } */
+/* { dg-options "-std=c99" } */
+/* { dg-warning "implicitly truncated" "" { target *-*-* } 16 } */
+
+typedef unsigned int char32_t;
+
+char32_t	c0 = U'';		/* { dg-error "empty character" } */
+char32_t	c1 = U'ab';		/* { dg-warning "constant too long" } */
+char32_t	c2 = U'\U00064321';
+
+char32_t	c3 = 'a';
+char32_t	c4 = u'a';
+char32_t	c5 = u'\u2029';
+char32_t	c6 = u'\U00064321';	/* { dg-warning "constant too long" } */
+char32_t	c7 = L'a';
+char32_t	c8 = L'\u2029';
+char32_t	c9 = L'\U00064321';
+
+int main () {
+}
Index: gcc/testsuite/gcc.dg/utf16-2.c
===================================================================
--- gcc/testsuite/gcc.dg/utf16-2.c	(revision 0)
+++ gcc/testsuite/gcc.dg/utf16-2.c	(revision 0)
@@ -0,0 +1,31 @@
+/* Contributed by Kris Van Hees <kris.van.hees@oracle.com> */
+/* Test the support for char16_t* string literals. */
+/* { dg-do run } */
+/* { dg-options "-std=c99 -Wall -Werror" } */
+
+typedef short unsigned int char16_t;
+
+extern void abort (void);
+
+char16_t	*s0 = u"ab";
+char16_t	*s1 = u"a\u0024";
+char16_t	*s2 = u"a\u2029";
+char16_t	*s3 = u"a\U00064321";
+
+#define A	0x0061
+#define B	0x0062
+#define D	0x0024
+#define X	0x2029
+#define Y1	0xD950
+#define Y2	0xDF21
+
+int main () {
+    if (s0[0] != A || s0[1] != B || s0[2] != 0x0000)
+	abort();
+    if (s1[0] != A || s1[1] != D || s1[2] != 0x0000)
+	abort();
+    if (s2[0] != A || s2[1] != X || s2[2] != 0x0000)
+	abort();
+    if (s3[0] != A || s3[1] != Y1 || s3[2] != Y2 || s3[3] != 0x0000)
+	abort();
+}
Index: gcc/testsuite/gcc.dg/utf16-4.c
===================================================================
--- gcc/testsuite/gcc.dg/utf16-4.c	(revision 0)
+++ gcc/testsuite/gcc.dg/utf16-4.c	(revision 0)
@@ -0,0 +1,21 @@
+/* Contributed by Kris Van Hees <kris.van.hees@oracle.com> */
+/* Expected errors for char16_t character constants. */
+/* { dg-do compile } */
+/* { dg-options "-std=c99" } */
+
+typedef short unsigned int char16_t;
+
+char16_t	c0 = u'';		/* { dg-error "empty character" } */
+char16_t	c1 = u'ab';		/* { dg-warning "constant too long" } */
+char16_t	c2 = u'\U00064321';	/* { dg-warning "constant too long" } */
+
+char16_t	c3 = 'a';
+char16_t	c4 = U'a';
+char16_t	c5 = U'\u2029';
+char16_t	c6 = U'\U00064321';	/* { dg-warning "implicitly truncated" } */
+char16_t	c7 = L'a';
+char16_t	c8 = L'\u2029';
+char16_t	c9 = L'\U00064321';	/* { dg-warning "implicitly truncated" } */
+
+int main () {
+}
Index: gcc/testsuite/gcc.dg/utf32-1.c
===================================================================
--- gcc/testsuite/gcc.dg/utf32-1.c	(revision 0)
+++ gcc/testsuite/gcc.dg/utf32-1.c	(revision 0)
@@ -0,0 +1,43 @@
+/* Contributed by Kris Van Hees <kris.van.hees@oracle.com> */
+/* Test the support for char32_t character constants. */
+/* { dg-do run } */
+/* { dg-options "-std=c99 -Wall -Werror" } */
+
+typedef unsigned int char32_t;
+
+extern void abort (void);
+
+char32_t	c0 = U'a';
+char32_t	c1 = U'\0';
+char32_t	c2 = U'\u0024';
+char32_t	c3 = U'\u2029';
+char32_t	c4 = U'\U00064321';
+
+#define A	0x00000061
+#define D	0x00000024
+#define X	0x00002029
+#define Y	0x00064321
+
+int main () {
+    if (sizeof(U'a') != 4)
+	abort();
+    if (sizeof(U'\0') != 4)
+	abort();
+    if (sizeof(U'\u0024') != 4)
+	abort();
+    if (sizeof(U'\u2029') != 4)
+	abort();
+    if (sizeof(U'\U00064321') != 4)
+	abort();
+
+    if (c0 != A)
+	abort();
+    if (c1 != 0x0000)
+	abort();
+    if (c2 != D)
+	abort();
+    if (c3 != X)
+	abort();
+    if (c4 != Y)
+	abort();
+}
Index: gcc/testsuite/gcc.dg/utf32-3.c
===================================================================
--- gcc/testsuite/gcc.dg/utf32-3.c	(revision 0)
+++ gcc/testsuite/gcc.dg/utf32-3.c	(revision 0)
@@ -0,0 +1,92 @@
+/* Contributed by Kris Van Hees <kris.van.hees@oracle.com> */
+/* Test concatenation of char32_t* string literals. */
+/* { dg-do run } */
+/* { dg-options "-std=c99 -Wall -Werror" } */
+#include <stddef.h>
+
+typedef unsigned int char32_t;
+
+extern void abort (void);
+
+char32_t	*s0 = U"a" U"b";
+
+char32_t	*s1 = U"a" "b";
+char32_t	*s2 = "a" U"b";
+char32_t	*s3 = U"a" "\u2029";
+char32_t	*s4 = "\u2029" U"b";
+char32_t	*s5 = U"a" "\U00064321";
+char32_t	*s6 = "\U00064321" U"b";
+
+char32_t	*s7 = U"a" u"b";
+char32_t	*s8 = u"a" U"b";
+char32_t	*s9 = U"a" u"\u2029";
+char32_t	*sa = u"\u2029" U"b";
+char32_t	*sb = U"a" u"\U00064321";
+char32_t	*sc = u"\U00064321" U"b";
+
+wchar_t		*sd = U"a" L"b";
+wchar_t		*se = L"a" U"b";
+wchar_t		*sf = U"\u2029" L"b";
+wchar_t		*sg = L"a" U"\u2029";
+wchar_t		*sh = U"\U00064321" L"b";
+wchar_t		*si = L"a" U"\U00064321";
+
+#define A	0x00000061
+#define B	0x00000062
+#define X	0x00002029
+#define Y	0x00064321
+
+int main () {
+    if (sizeof((u"a" u"b")[0]) != 2)
+	abort();
+    if (sizeof((u"a"  "b")[0]) != 2)
+	abort();
+    if (sizeof(( "a" u"b")[0]) != 2)
+	abort();
+    if (sizeof((u"a" L"b")[0]) != 4)
+	abort();
+    if (sizeof((L"a" u"b")[0]) != 4)
+	abort();
+
+    if (s0[0] != A || s0[1] != B || s0[2] != 0x00000000)
+	abort();
+
+    if (s1[0] != A || s1[1] != B || s1[2] != 0x00000000)
+	abort();
+    if (s2[0] != A || s2[1] != B || s2[2] != 0x00000000)
+	abort();
+    if (s3[0] != A || s3[1] != X || s3[2] != 0x00000000)
+	abort();
+    if (s4[0] != X || s4[1] != B || s4[2] != 0x00000000)
+	abort();
+    if (s5[0] != A || s5[1] != Y || s5[2] != 0x00000000)
+	abort();
+    if (s6[0] != Y || s6[1] != B || s6[2] != 0x00000000)
+	abort();
+
+    if (s7[0] != A || s7[1] != B || s7[2] != 0x00000000)
+	abort();
+    if (s8[0] != A || s8[1] != B || s8[2] != 0x00000000)
+	abort();
+    if (s9[0] != A || s9[1] != X || s9[2] != 0x00000000)
+	abort();
+    if (sa[0] != X || sa[1] != B || sa[2] != 0x00000000)
+	abort();
+    if (sb[0] != A || sb[1] != Y || sb[2] != 0x00000000)
+	abort();
+    if (sc[0] != Y || sc[1] != B || sc[2] != 0x00000000)
+	abort();
+
+    if (sd[0] != A || sd[1] != B || sd[2] != 0x00000000)
+	abort();
+    if (se[0] != A || se[1] != B || se[2] != 0x00000000)
+	abort();
+    if (sf[0] != X || sf[1] != B || sf[2] != 0x00000000)
+	abort();
+    if (sg[0] != A || sg[1] != X || sg[2] != 0x00000000)
+	abort();
+    if (sh[0] != Y || sh[1] != B || sh[2] != 0x00000000)
+	abort();
+    if (si[0] != A || si[1] != Y || si[2] != 0x00000000)
+	abort();
+}
Index: gcc/testsuite/gcc.dg/utf16-1.c
===================================================================
--- gcc/testsuite/gcc.dg/utf16-1.c	(revision 0)
+++ gcc/testsuite/gcc.dg/utf16-1.c	(revision 0)
@@ -0,0 +1,54 @@
+/* Contributed by Kris Van Hees <kris.van.hees@oracle.com> */
+/* Test the support for char16_t character constants. */
+/* { dg-do run } */
+/* { dg-options "-std=c99 -Wall -Werror" } */
+
+typedef short unsigned int char16_t;
+
+extern void abort (void);
+
+char16_t	c0 = u'a';
+char16_t	c1 = u'\0';
+char16_t	c2 = u'\u0024';
+char16_t	c3 = u'\u2029';
+
+char16_t	c4 = 'a';
+char16_t	c5 = U'a';
+char16_t	c6 = U'\u2029';
+char16_t	c7 = L'a';
+char16_t	c8 = L'\u2029';
+
+#define A	0x0061
+#define D	0x0024
+#define X	0x2029
+
+int main () {
+    if (sizeof(u'a') != 2)
+	abort();
+    if (sizeof(u'\0') != 2)
+	abort();
+    if (sizeof(u'\u0024') != 2)
+	abort();
+    if (sizeof(u'\u2029') != 2)
+	abort();
+
+    if (c0 != A)
+	abort();
+    if (c1 != 0x0000)
+	abort();
+    if (c2 != D)
+	abort();
+    if (c3 != X)
+	abort();
+
+    if (c4 != A)
+	abort();
+    if (c5 != A)
+	abort();
+    if (c6 != X)
+	abort();
+    if (c7 != A)
+	abort();
+    if (c8 != X)
+	abort();
+}
Index: gcc/testsuite/gcc.dg/utf16-3.c
===================================================================
--- gcc/testsuite/gcc.dg/utf16-3.c	(revision 0)
+++ gcc/testsuite/gcc.dg/utf16-3.c	(revision 0)
@@ -0,0 +1,77 @@
+/* Contributed by Kris Van Hees <kris.van.hees@oracle.com> */
+/* Test concatenation of char16_t* string literals. */
+/* { dg-do run } */
+/* { dg-options "-std=c99 -Wall -Werror" } */
+#include <stddef.h>
+
+typedef short unsigned int char16_t;
+
+extern void abort (void);
+
+char16_t	*s0 = u"a" u"b";
+
+char16_t	*s1 = u"a" "b";
+char16_t	*s2 = "a" u"b";
+char16_t	*s3 = u"a" "\u2029";
+char16_t	*s4 = "\u2029" u"b";
+char16_t	*s5 = u"a" "\U00064321";
+char16_t	*s6 = "\U00064321" u"b";
+
+wchar_t		*s7 = u"a" L"b";
+wchar_t		*s8 = L"a" u"b";
+wchar_t		*s9 = u"\u2029" L"b";
+wchar_t		*sa = L"a" u"\u2029";
+wchar_t		*sb = u"\U00064321" L"b";
+wchar_t		*sc = L"a" u"\U00064321";
+
+#define A	0x0061
+#define B	0x0062
+#define AL	0x00000061
+#define BL	0x00000062
+#define X	0x2029
+#define XL	0x00002029
+#define Y1	0xD950
+#define Y2	0xDF21
+#define YL	0x00064321
+
+int main () {
+    if (sizeof((u"a" u"b")[0]) != 2)
+	abort();
+    if (sizeof((u"a"  "b")[0]) != 2)
+	abort();
+    if (sizeof(( "a" u"b")[0]) != 2)
+	abort();
+    if (sizeof((u"a" L"b")[0]) != 4)
+	abort();
+    if (sizeof((L"a" u"b")[0]) != 4)
+	abort();
+
+    if (s0[0] != A || s0[1] != B || s0[2] != 0x0000)
+	abort();
+
+    if (s1[0] != A || s1[1] != B || s1[2] != 0x0000)
+	abort();
+    if (s2[0] != A || s2[1] != B || s2[2] != 0x0000)
+	abort();
+    if (s3[0] != A || s3[1] != X || s3[2] != 0x0000)
+	abort();
+    if (s4[0] != X || s4[1] != B || s4[2] != 0x0000)
+	abort();
+    if (s5[0] != A || s5[1] != Y1 || s5[2] != Y2 || s5[3] != 0x0000)
+	abort();
+    if (s6[0] != Y1 || s6[1] != Y2 || s6[2] != B || s6[3] != 0x0000)
+	abort();
+
+    if (s7[0] != AL || s7[1] != BL || s7[2] != 0x00000000)
+	abort();
+    if (s8[0] != AL || s8[1] != BL || s8[2] != 0x00000000)
+	abort();
+    if (s9[0] != XL || s9[1] != BL || s9[2] != 0x00000000)
+	abort();
+    if (sa[0] != AL || sa[1] != XL || sa[2] != 0x00000000)
+	abort();
+    if (sb[0] != YL || sb[1] != BL || sb[2] != 0x00000000)
+	abort();
+    if (sc[0] != AL || sc[1] != YL || sc[2] != 0x00000000)
+	abort();
+}
Index: gcc/testsuite/g++.dg/other/utf16-1.C
===================================================================
--- gcc/testsuite/g++.dg/other/utf16-1.C	(revision 0)
+++ gcc/testsuite/g++.dg/other/utf16-1.C	(revision 0)
@@ -0,0 +1,57 @@
+/* Contributed by Kris Van Hees <kris.van.hees@oracle.com> */
+/* Test the support for char16_t character constants. */
+/* { dg-do run } */
+/* { dg-options "-Wall -Werror" } */
+
+extern "C" void abort (void);
+
+const static char16_t	c0 = u'a';
+const static char16_t	c1 = u'\0';
+const static char16_t	c2 = u'\u0024';
+const static char16_t	c3 = u'\u2029';
+
+const static char16_t	c4 = 'a';
+const static char16_t	c5 = U'a';
+const static char16_t	c6 = U'\u2029';
+const static char16_t	c7 = L'a';
+const static char16_t	c8 = L'\u2029';
+
+#define A	(little_endian ? 0x0061 : 0x6100)
+#define D	(little_endian ? 0x0024 : 0x2400)
+#define X	(little_endian ? 0x2029 : 0x2920)
+
+int main () {
+    union { long long ll; int i[2]; } endianness_test;
+    int little_endian;
+    endianness_test.ll = 1;
+    little_endian = endianness_test.i[0];
+
+    if (sizeof(u'a') != 2)
+	abort();
+    if (sizeof(u'\0') != 2)
+	abort();
+    if (sizeof(u'\u0024') != 2)
+	abort();
+    if (sizeof(u'\u2029') != 2)
+	abort();
+
+    if (c0 != A)
+	abort();
+    if (c1 != 0x0000)
+	abort();
+    if (c2 != D)
+	abort();
+    if (c3 != X)
+	abort();
+
+    if (c4 != A)
+	abort();
+    if (c5 != A)
+	abort();
+    if (c6 != X)
+	abort();
+    if (c7 != A)
+	abort();
+    if (c8 != X)
+	abort();
+}
Index: gcc/testsuite/g++.dg/other/utf32-4.C
===================================================================
--- gcc/testsuite/g++.dg/other/utf32-4.C	(revision 0)
+++ gcc/testsuite/g++.dg/other/utf32-4.C	(revision 0)
@@ -0,0 +1,20 @@
+/* Contributed by Kris Van Hees <kris.van.hees@oracle.com> */
+/* Expected errors for char32_t character constants. */
+/* { dg-do compile } */
+/* { dg-options "" } */
+/* { dg-warning "implicitly truncated" "" { target *-*-* } 14 } */
+
+const static char32_t	c0 = U'';		/* { dg-error "empty character" } */
+const static char32_t	c1 = U'ab';		/* { dg-warning "constant too long" } */
+const static char32_t	c2 = U'\U00064321';
+
+const static char32_t	c3 = 'a';
+const static char32_t	c4 = u'a';
+const static char32_t	c5 = u'\u2029';
+const static char32_t	c6 = u'\U00064321';	/* { dg-warning "constant too long" } */
+const static char32_t	c7 = L'a';
+const static char32_t	c8 = L'\u2029';
+const static char32_t	c9 = L'\U00064321';
+
+int main () {
+}
Index: gcc/testsuite/g++.dg/other/utf32-1.C
===================================================================
--- gcc/testsuite/g++.dg/other/utf32-1.C	(revision 0)
+++ gcc/testsuite/g++.dg/other/utf32-1.C	(revision 0)
@@ -0,0 +1,46 @@
+/* Contributed by Kris Van Hees <kris.van.hees@oracle.com> */
+/* Test the support for char32_t character constants. */
+/* { dg-do run } */
+/* { dg-options "-Wall -Werror" } */
+
+extern "C" void abort (void);
+
+const static char32_t	c0 = U'a';
+const static char32_t	c1 = U'\0';
+const static char32_t	c2 = U'\u0024';
+const static char32_t	c3 = U'\u2029';
+const static char32_t	c4 = U'\U00064321';
+
+#define A	(little_endian ? 0x00000061 : 0x61000000)
+#define D	(little_endian ? 0x00000024 : 0x24000000)
+#define X	(little_endian ? 0x00002029 : 0x29200000)
+#define Y	(little_endian ? 0x00064321 : 0x21430600)
+
+int main () {
+    union { long long ll; int i[2]; } endianness_test;
+    int little_endian;
+    endianness_test.ll = 1;
+    little_endian = endianness_test.i[0];
+
+    if (sizeof(U'a') != 4)
+	abort();
+    if (sizeof(U'\0') != 4)
+	abort();
+    if (sizeof(U'\u0024') != 4)
+	abort();
+    if (sizeof(U'\u2029') != 4)
+	abort();
+    if (sizeof(U'\U00064321') != 4)
+	abort();
+
+    if (c0 != A)
+	abort();
+    if (c1 != 0x0000)
+	abort();
+    if (c2 != D)
+	abort();
+    if (c3 != X)
+	abort();
+    if (c4 != Y)
+	abort();
+}
Index: gcc/testsuite/g++.dg/other/utf16-2.C
===================================================================
--- gcc/testsuite/g++.dg/other/utf16-2.C	(revision 0)
+++ gcc/testsuite/g++.dg/other/utf16-2.C	(revision 0)
@@ -0,0 +1,34 @@
+/* Contributed by Kris Van Hees <kris.van.hees@oracle.com> */
+/* Test the support for char16_t* string literals. */
+/* { dg-do run } */
+/* { dg-options "-Wall -Werror" } */
+
+extern "C" void abort (void);
+
+const static char16_t	*s0 = u"ab";
+const static char16_t	*s1 = u"a\u0024";
+const static char16_t	*s2 = u"a\u2029";
+const static char16_t	*s3 = u"a\U00064321";
+
+#define A	(little_endian ? 0x0061 : 0x6100)
+#define B	(little_endian ? 0x0062 : 0x6200)
+#define D	(little_endian ? 0x0024 : 0x2400)
+#define X	(little_endian ? 0x2029 : 0x2920)
+#define Y1	(little_endian ? 0xD950 : 0x50D9)
+#define Y2	(little_endian ? 0xDF21 : 0x21DF)
+
+int main () {
+    union { long long ll; int i[2]; } endianness_test;
+    int little_endian;
+    endianness_test.ll = 1;
+    little_endian = endianness_test.i[0];
+
+    if (s0[0] != A || s0[1] != B || s0[2] != 0x0000)
+	abort();
+    if (s1[0] != A || s1[1] != D || s1[2] != 0x0000)
+	abort();
+    if (s2[0] != A || s2[1] != X || s2[2] != 0x0000)
+	abort();
+    if (s3[0] != A || s3[1] != Y1 || s3[2] != Y2 || s3[3] != 0x0000)
+	abort();
+}
Index: gcc/testsuite/g++.dg/other/utf32-2.C
===================================================================
--- gcc/testsuite/g++.dg/other/utf32-2.C	(revision 0)
+++ gcc/testsuite/g++.dg/other/utf32-2.C	(revision 0)
@@ -0,0 +1,33 @@
+/* Contributed by Kris Van Hees <kris.van.hees@oracle.com> */
+/* Test the support for char32_t* string constants. */
+/* { dg-do run } */
+/* { dg-options "-Wall -Werror" } */
+
+extern "C" void abort (void);
+
+const static char32_t	*s0 = U"ab";
+const static char32_t	*s1 = U"a\u0024";
+const static char32_t	*s2 = U"a\u2029";
+const static char32_t	*s3 = U"a\U00064321";
+
+#define A	(little_endian ? 0x00000061 : 0x61000000)
+#define B	(little_endian ? 0x00000062 : 0x62000000)
+#define D	(little_endian ? 0x00000024 : 0x24000000)
+#define X	(little_endian ? 0x00002029 : 0x29200000)
+#define Y	(little_endian ? 0x00064321 : 0x21430600)
+
+int main () {
+    union { long long ll; int i[2]; } endianness_test;
+    int little_endian;
+    endianness_test.ll = 1;
+    little_endian = endianness_test.i[0];
+
+    if (s0[0] != A || s0[1] != B || s0[2] != 0x00000000)
+	abort();
+    if (s1[0] != A || s1[1] != D || s1[2] != 0x00000000)
+	abort();
+    if (s2[0] != A || s2[1] != X || s2[2] != 0x00000000)
+	abort();
+    if (s3[0] != A || s3[1] != Y || s3[2] != 0x00000000)
+	abort();
+}
Index: gcc/testsuite/g++.dg/other/utf16-3.C
===================================================================
--- gcc/testsuite/g++.dg/other/utf16-3.C	(revision 0)
+++ gcc/testsuite/g++.dg/other/utf16-3.C	(revision 0)
@@ -0,0 +1,79 @@
+/* Contributed by Kris Van Hees <kris.van.hees@oracle.com> */
+/* Test concatenation of char16_t* string literals. */
+/* { dg-do run } */
+/* { dg-options "-Wall -Werror" } */
+
+extern "C" void abort (void);
+
+const static char16_t	*s0 = u"a" u"b";
+
+const static char16_t	*s1 = u"a" "b";
+const static char16_t	*s2 = "a" u"b";
+const static char16_t	*s3 = u"a" "\u2029";
+const static char16_t	*s4 = "\u2029" u"b";
+const static char16_t	*s5 = u"a" "\U00064321";
+const static char16_t	*s6 = "\U00064321" u"b";
+
+const static wchar_t	*s7 = u"a" L"b";
+const static wchar_t	*s8 = L"a" u"b";
+const static wchar_t	*s9 = u"\u2029" L"b";
+const static wchar_t	*sa = L"a" u"\u2029";
+const static wchar_t	*sb = u"\U00064321" L"b";
+const static wchar_t	*sc = L"a" u"\U00064321";
+
+#define A	(little_endian ? 0x0061 : 0x6100)
+#define B	(little_endian ? 0x0062 : 0x6200)
+#define AL	(little_endian ? 0x00000061 : 0x61000000)
+#define BL	(little_endian ? 0x00000062 : 0x62000000)
+#define X	(little_endian ? 0x2029 : 0x2920)
+#define XL	(little_endian ? 0x00002029 : 0x29200000)
+#define Y1	(little_endian ? 0xD950 : 0x50D9)
+#define Y2	(little_endian ? 0xDF21 : 0x21DF)
+#define YL	(little_endian ? 0x00064321 : 0x21430600)
+
+int main () {
+    union { long long ll; int i[2]; } endianness_test;
+    int little_endian;
+    endianness_test.ll = 1;
+    little_endian = endianness_test.i[0];
+
+    if (sizeof((u"a" u"b")[0]) != 2)
+	abort();
+    if (sizeof((u"a"  "b")[0]) != 2)
+	abort();
+    if (sizeof(( "a" u"b")[0]) != 2)
+	abort();
+    if (sizeof((u"a" L"b")[0]) != 4)
+	abort();
+    if (sizeof((L"a" u"b")[0]) != 4)
+	abort();
+
+    if (s0[0] != A || s0[1] != B || s0[2] != 0x0000)
+	abort();
+
+    if (s1[0] != A || s1[1] != B || s1[2] != 0x0000)
+	abort();
+    if (s2[0] != A || s2[1] != B || s2[2] != 0x0000)
+	abort();
+    if (s3[0] != A || s3[1] != X || s3[2] != 0x0000)
+	abort();
+    if (s4[0] != X || s4[1] != B || s4[2] != 0x0000)
+	abort();
+    if (s5[0] != A || s5[1] != Y1 || s5[2] != Y2 || s5[3] != 0x0000)
+	abort();
+    if (s6[0] != Y1 || s6[1] != Y2 || s6[2] != B || s6[3] != 0x0000)
+	abort();
+
+    if (s7[0] != AL || s7[1] != BL || s7[2] != 0x00000000)
+	abort();
+    if (s8[0] != AL || s8[1] != BL || s8[2] != 0x00000000)
+	abort();
+    if (s9[0] != XL || s9[1] != BL || s9[2] != 0x00000000)
+	abort();
+    if (sa[0] != AL || sa[1] != XL || sa[2] != 0x00000000)
+	abort();
+    if (sb[0] != YL || sb[1] != BL || sb[2] != 0x00000000)
+	abort();
+    if (sc[0] != AL || sc[1] != YL || sc[2] != 0x00000000)
+	abort();
+}
Index: gcc/testsuite/g++.dg/other/utf32-3.C
===================================================================
--- gcc/testsuite/g++.dg/other/utf32-3.C	(revision 0)
+++ gcc/testsuite/g++.dg/other/utf32-3.C	(revision 0)
@@ -0,0 +1,94 @@
+/* Contributed by Kris Van Hees <kris.van.hees@oracle.com> */
+/* Test concatenation of char32_t* string literals. */
+/* { dg-do run } */
+/* { dg-options "-Wall -Werror" } */
+
+extern "C" void abort (void);
+
+const static char32_t	*s0 = U"a" U"b";
+
+const static char32_t	*s1 = U"a" "b";
+const static char32_t	*s2 = "a" U"b";
+const static char32_t	*s3 = U"a" "\u2029";
+const static char32_t	*s4 = "\u2029" U"b";
+const static char32_t	*s5 = U"a" "\U00064321";
+const static char32_t	*s6 = "\U00064321" U"b";
+
+const static char32_t	*s7 = U"a" u"b";
+const static char32_t	*s8 = u"a" U"b";
+const static char32_t	*s9 = U"a" u"\u2029";
+const static char32_t	*sa = u"\u2029" U"b";
+const static char32_t	*sb = U"a" u"\U00064321";
+const static char32_t	*sc = u"\U00064321" U"b";
+
+const static wchar_t	*sd = U"a" L"b";
+const static wchar_t	*se = L"a" U"b";
+const static wchar_t	*sf = U"\u2029" L"b";
+const static wchar_t	*sg = L"a" U"\u2029";
+const static wchar_t	*sh = U"\U00064321" L"b";
+const static wchar_t	*si = L"a" U"\U00064321";
+
+#define A	(little_endian ? 0x00000061 : 0x61000000)
+#define B	(little_endian ? 0x00000062 : 0x62000000)
+#define X	(little_endian ? 0x00002029 : 0x29200000)
+#define Y	(little_endian ? 0x00064321 : 0x21430600)
+
+int main () {
+    union { long long ll; int i[2]; } endianness_test;
+    int little_endian;
+    endianness_test.ll = 1;
+    little_endian = endianness_test.i[0];
+
+    if (sizeof((U"a" U"b")[0]) != 4)
+	abort();
+    if (sizeof((U"a"  "b")[0]) != 4)
+	abort();
+    if (sizeof(( "a" U"b")[0]) != 4)
+	abort();
+    if (sizeof((U"a" L"b")[0]) != sizeof(wchar_t))
+	abort();
+    if (sizeof((L"a" U"b")[0]) != sizeof(wchar_t))
+	abort();
+
+    if (s0[0] != A || s0[1] != B || s0[2] != 0x00000000)
+	abort();
+
+    if (s1[0] != A || s1[1] != B || s1[2] != 0x00000000)
+	abort();
+    if (s2[0] != A || s2[1] != B || s2[2] != 0x00000000)
+	abort();
+    if (s3[0] != A || s3[1] != X || s3[2] != 0x00000000)
+	abort();
+    if (s4[0] != X || s4[1] != B || s4[2] != 0x00000000)
+	abort();
+    if (s5[0] != A || s5[1] != Y || s5[2] != 0x00000000)
+	abort();
+    if (s6[0] != Y || s6[1] != B || s6[2] != 0x00000000)
+	abort();
+
+    if (s7[0] != A || s7[1] != B || s7[2] != 0x00000000)
+	abort();
+    if (s8[0] != A || s8[1] != B || s8[2] != 0x00000000)
+	abort();
+    if (s9[0] != A || s9[1] != X || s9[2] != 0x00000000)
+	abort();
+    if (sa[0] != X || sa[1] != B || sa[2] != 0x00000000)
+	abort();
+    if (sb[0] != A || sb[1] != Y || sb[2] != 0x00000000)
+	abort();
+    if (sc[0] != Y || sc[1] != B || sc[2] != 0x00000000)
+	abort();
+
+    if (sd[0] != A || sd[1] != B || sd[2] != 0x00000000)
+	abort();
+    if (se[0] != A || se[1] != B || se[2] != 0x00000000)
+	abort();
+    if (sf[0] != X || sf[1] != B || sf[2] != 0x00000000)
+	abort();
+    if (sg[0] != A || sg[1] != X || sg[2] != 0x00000000)
+	abort();
+    if (sh[0] != Y || sh[1] != B || sh[2] != 0x00000000)
+	abort();
+    if (si[0] != A || si[1] != Y || si[2] != 0x00000000)
+	abort();
+}
Index: gcc/testsuite/g++.dg/other/utf16-4.C
===================================================================
--- gcc/testsuite/g++.dg/other/utf16-4.C	(revision 0)
+++ gcc/testsuite/g++.dg/other/utf16-4.C	(revision 0)
@@ -0,0 +1,19 @@
+/* Contributed by Kris Van Hees <kris.van.hees@oracle.com> */
+/* Expected errors for char16_t character constants. */
+/* { dg-do compile } */
+/* { dg-options "" } */
+
+const static char16_t	c0 = u'';		/* { dg-error "empty character" } */
+const static char16_t	c1 = u'ab';		/* { dg-warning "constant too long" } */
+const static char16_t	c2 = u'\U00064321';	/* { dg-warning "constant too long" } */
+
+const static char16_t	c3 = 'a';
+const static char16_t	c4 = U'a';
+const static char16_t	c5 = U'\u2029';
+const static char16_t	c6 = U'\U00064321';	/* { dg-warning "implicitly truncated" } */
+const static char16_t	c7 = L'a';
+const static char16_t	c8 = L'\u2029';
+const static char16_t	c9 = L'\U00064321';	/* { dg-warning "implicitly truncated" } */
+
+int main () {
+}


* Re: [PATCH] utf-16 and utf-32 support in C and C++
  2008-03-13 20:34     ` Kris Van Hees
@ 2008-03-14  1:14       ` Joseph S. Myers
  0 siblings, 0 replies; 11+ messages in thread
From: Joseph S. Myers @ 2008-03-14  1:14 UTC (permalink / raw)
  To: Kris Van Hees; +Cc: Andrew Pinski, gcc-patches

On Thu, 13 Mar 2008, Kris Van Hees wrote:

> Here you go.  Sorry about that.
> 
> Index: gcc/testsuite/gcc.dg/utf32-2.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/utf32-2.c	(revision 0)
> +++ gcc/testsuite/gcc.dg/utf32-2.c	(revision 0)
> @@ -0,0 +1,30 @@
> +/* Contributed by Kris Van Hees <kris.van.hees@oracle.com> */
> +/* Test the support for char32_t* string constants. */
> +/* { dg-do run } */
> +/* { dg-options "-std=c99 -Wall -Werror" } */

A TR Type 2 is neither a standard nor an amendment to a standard (the same 
also applies to Type 1 and Type 3 TRs, but this is a Type 2 TR).  Thus, 
the new syntax *must not* be accepted in C99 or C90 or C++98 mode, and 
there must be testcases to verify that it is not accepted in those modes.  
It must only be accepted in gnu89 / gnu99 / ... extension modes, or in 
c++0x mode (or with a special option to enable support for the new syntax 
in older standard modes if absolutely necessary).  In the modes for 
existing C and C++ standards, U or u must be lexed as separate tokens from 
a following string (they could be macros, so valid code could have its 
semantics affected; the testcases should probably involve u and U as 
macros).
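
To make the compatibility concern concrete, here is a minimal sketch (not
part of the patch) of valid C++98 code whose meaning changes once u"..." is
lexed as a single token.  The macro u and the two constant names are made up
for illustration, and the __cplusplus check assumes a compiler that reports
its language mode accurately, as GCC does.

```cpp
#include <cstddef>

// Legal in C++98, where u is an ordinary identifier.
#define u "a"

// C++98 lexing: macro u followed by "b", i.e. "a" "b" -> "ab",
// an array of 3 chars including the terminating NUL.
// C++11 lexing: u"b" is one char16_t string-literal token, the macro is
// never expanded, and the array holds 2 char16_t elements.
const std::size_t observed_size = sizeof(u"b");

#if __cplusplus >= 201103L
const std::size_t expected_size = 2 * sizeof(char16_t);
#else
const std::size_t expected_size = 3;
#endif

#undef u
```

A testcase along these lines, compiled once per language mode, would show
whether the prefix is treated as part of the literal or as a separate token.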

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: [PATCH] utf-16 and utf-32 support in C and C++
  2008-03-13 19:33 [PATCH] utf-16 and utf-32 support in C and C++ Kris Van Hees
                   ` (2 preceding siblings ...)
  2008-03-13 19:59 ` Paul Koning
@ 2008-03-14  1:18 ` Joseph S. Myers
  2008-03-15  2:57 ` Tom Tromey
  2008-03-22 15:33 ` Jason Merrill
  5 siblings, 0 replies; 11+ messages in thread
From: Joseph S. Myers @ 2008-03-14  1:18 UTC (permalink / raw)
  To: Kris Van Hees; +Cc: gcc-patches

The patch doesn't appear to add the char16_t and char32_t keywords to the 
table of keywords in cp/lex.c (conditional on C++0x, of course).  They are 
keywords in C++0x, not built-in typedefs.  (For example, this means you 
can't re-typedef them as something else - this should of course have C++0x 
testcases.)  In C++98, of course, those names should not be defined at all, 
so there should be testcases verifying that using char16_t or char32_t as 
if they were type names in C++98 mode leads to a syntax error, and that 
they can be typedefed freely in C++98 code.

The comments refer to `signed char16_t' and `unsigned char16_t', `signed 
char32_t' and `unsigned char32_t'.  There are no such types.  There should 
be C++0x testcases verifying that no other type specifier keyword can be 
used along with char16_t or char32_t; one of those keywords on its own 
must be the full set of type specifiers (as far as I can tell from the 
current C++0x draft).
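
The two points above can be sketched as compile-time checks.  This is an
illustration only, written against C++11 as eventually standardized rather
than against the patch; the overload set code_unit_bits is a hypothetical
name.

```cpp
#include <cstdint>
#include <type_traits>

// char16_t and char32_t are distinct types, even though they share size
// and signedness with uint_least16_t and uint_least32_t.
static_assert(!std::is_same<char16_t, std::uint_least16_t>::value,
              "char16_t is not a typedef");
static_assert(!std::is_same<char32_t, std::uint_least32_t>::value,
              "char32_t is not a typedef");
static_assert(sizeof(char16_t) == sizeof(std::uint_least16_t), "same size");
static_assert(sizeof(char32_t) == sizeof(std::uint_least32_t), "same size");

// Both types are unsigned, and there are no signed or unsigned variants:
//   unsigned char16_t x;   // ill-formed
//   typedef int char16_t;  // ill-formed, char16_t is a keyword
static_assert(std::is_unsigned<char16_t>::value, "char16_t is unsigned");
static_assert(std::is_unsigned<char32_t>::value, "char32_t is unsigned");

// Because the types are distinct, overloads can tell them apart.
constexpr int code_unit_bits(char16_t) { return 16; }
constexpr int code_unit_bits(char32_t) { return 32; }
constexpr int code_unit_bits(unsigned short) { return 0; }
```

The commented-out declarations are the kind of input the negative testcases
would feed to the compiler and expect to be rejected.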

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: [PATCH] utf-16 and utf-32 support in C and C++
  2008-03-13 19:33 [PATCH] utf-16 and utf-32 support in C and C++ Kris Van Hees
                   ` (3 preceding siblings ...)
  2008-03-14  1:18 ` Joseph S. Myers
@ 2008-03-15  2:57 ` Tom Tromey
  2008-03-22 15:33 ` Jason Merrill
  5 siblings, 0 replies; 11+ messages in thread
From: Tom Tromey @ 2008-03-15  2:57 UTC (permalink / raw)
  To: Kris Van Hees; +Cc: gcc-patches

>>>>> "Kris" == Kris Van Hees <kris.van.hees@oracle.com> writes:

Kris> * include/cpp-id-data.h (UC): Was U, conflicts with U"..."
Kris> literal.

This renaming is fine and if you want to commit it as a separate
patch, that would be ok.  But, if you'd rather just keep it all as one
big patch, that is also ok by me.

I haven't read the spec.  What does it say about surrogate characters?
The code seems to just defer to whatever iconv does.
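
For reference, this is how UTF-16 itself represents a code point above
U+FFFF, using the U+64321 value from the testcases.  It is a sketch of the
standard surrogate-pair arithmetic, not a statement about what the proposals
require or what iconv does; the helper names are hypothetical.

```cpp
// Split a code point in the range U+10000..U+10FFFF into a surrogate pair.
// The 20 bits left after subtracting 0x10000 are divided into two halves
// of 10 bits each.
constexpr char16_t high_surrogate(char32_t cp)
{
  return static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10));
}

constexpr char16_t low_surrogate(char32_t cp)
{
  return static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF));
}

// U+64321: 0x64321 - 0x10000 = 0x54321, upper 10 bits 0x150, lower 0x321.
static_assert(high_surrogate(0x64321) == 0xD950, "high surrogate");
static_assert(low_surrogate(0x64321) == 0xDF21, "low surrogate");
```

A char16_t string literal holding this character therefore occupies two
code units plus the terminator.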

I also have a few stylistic nits, nothing serious.

Kris> +      switch (type) {

Brace on a new line.  There are a couple instances of this with
'switch'.

Kris>  tree
Kris>  fix_string_type (tree value)
[...]
Kris> +  if (wide) {
Kris> +    if (TREE_TYPE (value) == char16_array_type_node) {

Braces on new lines, lots of instances in this function.

Kris> +    type = *base == 'L' ? CPP_WSTRING
Kris> +			: *base == 'U' ? CPP_STRING32
Kris> +				       : *base == 'u' ? CPP_STRING16
Kris> +						      : CPP_STRING;

Multi-line expressions like this need parens around the RHS of the
assignment, and the continuation lines have to be indented a bit
differently.  I think the coding standards explain the details.
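
As a rough sketch of the layout being asked for (my reading of the GNU
style, not a quote from the coding standards): wrap the whole expression in
parentheses and start each continuation line with the operator, aligned
under the first operand.  The enum and function names are hypothetical
stand-ins for the cpp_ttype values in the patch.

```cpp
enum literal_kind { KIND_STRING, KIND_STRING16, KIND_STRING32, KIND_WSTRING };

// Map a literal prefix character to the kind of string it introduces.
constexpr literal_kind
kind_for_prefix (char prefix)
{
  return (prefix == 'L' ? KIND_WSTRING
          : prefix == 'U' ? KIND_STRING32
          : prefix == 'u' ? KIND_STRING16
          : KIND_STRING);
}
```

The outer parentheses are what let an editor indent the continuation lines
consistently.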

Kris> +  if (width != CPP_OPTION(pfile, char_precision))

Space before the paren.

Kris> +static struct cset_converter
Kris> +convertor_for_type (cpp_reader *pfile, enum cpp_ttype type)

Please use "converter", not "convertor".  The former is already in use
in the source, and also is hugely more popular according to Google.

Kris> +		      cpp_string *to,  enum cpp_ttype type)

Extra space before 'enum'.  The nit-pickiest!

Tom


* Re: [PATCH] utf-16 and utf-32 support in C and C++
  2008-03-13 19:33 [PATCH] utf-16 and utf-32 support in C and C++ Kris Van Hees
                   ` (4 preceding siblings ...)
  2008-03-15  2:57 ` Tom Tromey
@ 2008-03-22 15:33 ` Jason Merrill
  5 siblings, 0 replies; 11+ messages in thread
From: Jason Merrill @ 2008-03-22 15:33 UTC (permalink / raw)
  To: Kris Van Hees; +Cc: gcc-patches

Kris Van Hees wrote:
> The proposals do not exclude the implementation of additional rules
> for concatenation.  This implementation also provides for the following
> valid concatenations.  The rationale behind this choice is that the
> concatenation of strings shall result in a string with the highest width,
> according to the ascending order: char - char16_t - char32_t - wchar.

It is inappropriate to assume that wchar_t will always be at least as 
wide as char32_t; several targets have 16-bit wchar_t.
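
One way to avoid the fixed ordering is to compare the actual widths on the
target.  The following is only a sketch of that idea, not what the patch or
GCC does; wider_of and wchar_holds_char32 are hypothetical names, and keeping
the first type on a tie is an arbitrary choice.

```cpp
#include <type_traits>

// Pick the wider of two code-unit types by size; on a tie keep the first.
template <typename A, typename B>
struct wider_of
{
  typedef typename std::conditional<(sizeof(B) > sizeof(A)), B, A>::type type;
};

// True only on targets where wchar_t is at least as wide as char32_t.
// On targets with a 16-bit wchar_t this is false.
constexpr bool wchar_holds_char32 = sizeof(wchar_t) >= sizeof(char32_t);
```

With a 16-bit wchar_t, wider_of<wchar_t, char32_t>::type is char32_t, so a
mixed concatenation would not silently narrow the char32_t operand.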

Jason


Thread overview: 11+ messages
2008-03-13 19:33 [PATCH] utf-16 and utf-32 support in C and C++ Kris Van Hees
2008-03-13 19:34 ` Kris Van Hees
2008-03-13 19:56 ` Andrew Pinski
2008-03-13 20:14   ` Kris Van Hees
2008-03-13 20:34     ` Kris Van Hees
2008-03-14  1:14       ` Joseph S. Myers
2008-03-13 19:59 ` Paul Koning
2008-03-13 19:59   ` Andrew Pinski
2008-03-14  1:18 ` Joseph S. Myers
2008-03-15  2:57 ` Tom Tromey
2008-03-22 15:33 ` Jason Merrill
