public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
* [RFC] Adjust output for strings in tree-pretty-print.c
@ 2008-05-19 13:59 FX
  2008-05-19 14:21 ` Manfred Hollstein
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: FX @ 2008-05-19 13:59 UTC (permalink / raw)
  To: GCC Development, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1954 bytes --]

Hi all,

The Fortran front-end now handles wide character strings
(UCS-4/UTF-32); for these, the string literals are emitted as strings
with the type of an array of unsigned 32-bit integers. The issue is
that tree-pretty-print.c, in pretty_print_string() assumes strings are
composed of chars and NUL-terminated. This fails, for example, if you
look at the tree dump for the following Fortran source file:

  subroutine foo
    call test(4_"I'm here!")
  end subroutine foo

you currently get:

  foo ()
  {
    test (&"I"[1]{lb: 1 sz: 4}, 9);

On my little-endian compiler, "I'm here!" is in UTF-32:
"I\0\0\0'\0\0\0m\0\0\0 \0\0\0h\0\0\0e\0\0\0r\0\0\0e\0\0\0!\0\0\0". So,
tree-pretty-print.c stops at the first '\0', and we get "l". To make
this work better, as STRING_CST's have an attached length
(TREE_STRING_LENGTH), I suggest using that to output the full string
length, instead of stopping at the first NUL character.

With that patch, the tree dump for the same Fortran source file looks like this:

  test (&"I\0\0\0\'\0\0\0m\0\0\0
\0\0\0h\0\0\0e\0\0\0r\0\0\0e\0\0\0!\0\0\0"[1]{lb: 1 sz: 4}, 9);

and the tree dump for the following C testcase:

  unsigned char *foo(void) { return "look\0here"; }

which was like this:

  return (unsigned char *) "look";

is now like this:

  return (unsigned char *) "look\0here\0";


Notice the added final '\0' in the C case; I don't know if it's bad to
have it there, but I don't see a way to not output it and still have
the correct output for Fortran (whose strings are not NUL-terminated).

Any comments? Is it OK to commit as is? It bootstraps and regtests
fine on x86_64-linux, with C and Fortran enabled, except for
gcc.dg/tree-ssa/builtin-{v,}{f,}printf-1.c which need their
scan-tree-dump patterns adjusted accordingly. If there is no
objection, I'll do that and build and regtest C++, objc and objc++ as
well before going ahead.

Thanks,
FX

-- 
FX Coudert
http://www.homepages.ucl.ac.uk/~uccafco/

[-- Attachment #2: wide_char_part6_gcc.diff --]
[-- Type: application/octet-stream, Size: 2095 bytes --]

Index: tree-pretty-print.c
===================================================================
--- tree-pretty-print.c	(revision 135494)
+++ tree-pretty-print.c	(working copy)
@@ -40,7 +40,7 @@ along with GCC; see the file COPYING3.  
 /* Local functions, macros and variables.  */
 static int op_prio (const_tree);
 static const char *op_symbol (const_tree);
-static void pretty_print_string (pretty_printer *, const char*);
+static void pretty_print_string (pretty_printer *, const char *, int len);
 static void print_call_name (pretty_printer *, const_tree);
 static void newline_and_indent (pretty_printer *, int);
 static void maybe_init_pretty_print (FILE *);
@@ -841,7 +841,8 @@ dump_generic_node (pretty_printer *buffe
 
     case STRING_CST:
       pp_string (buffer, "\"");
-      pretty_print_string (buffer, TREE_STRING_POINTER (node));
+      pretty_print_string (buffer, TREE_STRING_POINTER (node),
+			   TREE_STRING_LENGTH (node));
       pp_string (buffer, "\"");
       break;
 
@@ -2740,15 +2741,19 @@ print_call_name (pretty_printer *buffer,
 /* Parses the string STR and replaces new-lines by '\n', tabs by '\t', ...  */
 
 static void
-pretty_print_string (pretty_printer *buffer, const char *str)
+pretty_print_string (pretty_printer *buffer, const char *str, int len)
 {
   if (str == NULL)
     return;
 
-  while (*str)
+  while ((len--) > 0)
     {
       switch (str[0])
 	{
+	case '\0':
+	  pp_string (buffer, "\\0");
+	  break;
+
 	case '\b':
 	  pp_string (buffer, "\\b");
 	  break;
@@ -2785,8 +2790,6 @@ pretty_print_string (pretty_printer *buf
 	  pp_string (buffer, "\\'");
 	  break;
 
-	  /* No need to handle \0; the loop terminates on \0.  */
-
 	case '\1':
 	  pp_string (buffer, "\\1");
 	  break;
@@ -2816,7 +2819,14 @@ pretty_print_string (pretty_printer *buf
 	  break;
 
 	default:
-	  pp_character (buffer, str[0]);
+	  if (ISPRINT (str[0]))
+	    pp_character (buffer, str[0]);
+	  else
+	    {
+	      char s[5];
+	      sprintf (s, "\\x%02x", (unsigned)(unsigned char)str[0]);
+	      pp_string (buffer, s);
+	    }
 	  break;
 	}
       str++;

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Adjust output for strings in tree-pretty-print.c
  2008-05-19 13:59 [RFC] Adjust output for strings in tree-pretty-print.c FX
@ 2008-05-19 14:21 ` Manfred Hollstein
  2008-05-19 14:31 ` Jakub Jelinek
  2008-05-19 14:45 ` Paolo Bonzini
  2 siblings, 0 replies; 8+ messages in thread
From: Manfred Hollstein @ 2008-05-19 14:21 UTC (permalink / raw)
  To: FX; +Cc: GCC Development, gcc-patches

Hi there,

On Mon, 19 May 2008, 15:59:16 +0200, FX wrote:
> [...]
> Any comments? Is it OK to commit as is?

this may sound like nit-picking, but the length of a string cannot be
negative, so, I'd rather make the new parameter `len' an "unsigned int"
or even size_t.

HTH, cheers.

l8er
manfred

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Adjust output for strings in tree-pretty-print.c
  2008-05-19 13:59 [RFC] Adjust output for strings in tree-pretty-print.c FX
  2008-05-19 14:21 ` Manfred Hollstein
@ 2008-05-19 14:31 ` Jakub Jelinek
  2008-05-19 14:35   ` FX
  2008-05-19 14:45 ` Paolo Bonzini
  2 siblings, 1 reply; 8+ messages in thread
From: Jakub Jelinek @ 2008-05-19 14:31 UTC (permalink / raw)
  To: FX; +Cc: GCC Development, gcc-patches

On Mon, May 19, 2008 at 02:59:16PM +0100, FX wrote:
>   return (unsigned char *) "look\0here\0";
> 
> 
> Notice the added final '\0' in the C case; I don't know if it's bad to
> have it there, but I don't see a way to not output it and still have
> the correct output for Fortran (whose strings are not NUL-terminated).

Yes, it is bad to have it there for C/C++, please make sure the final
\0 isn't printed.

	Jakub

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Adjust output for strings in tree-pretty-print.c
  2008-05-19 14:31 ` Jakub Jelinek
@ 2008-05-19 14:35   ` FX
  0 siblings, 0 replies; 8+ messages in thread
From: FX @ 2008-05-19 14:35 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Development, gcc-patches

> Yes, it is bad to have it there for C/C++, please make sure the final
> \0 isn't printed.

The question is: how? If we limit the tree dumps to C/C++ semantics,
then there's no way they can express Fortran code (ie a string
litteral not terminated by a NUL), is there? Or do we make the
semantics of tree dumps depend on the language compiled? (which seems
a bad idea)

FX

-- 
FX Coudert
http://www.homepages.ucl.ac.uk/~uccafco/

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Adjust output for strings in tree-pretty-print.c
  2008-05-19 13:59 [RFC] Adjust output for strings in tree-pretty-print.c FX
  2008-05-19 14:21 ` Manfred Hollstein
  2008-05-19 14:31 ` Jakub Jelinek
@ 2008-05-19 14:45 ` Paolo Bonzini
  2008-05-19 15:00   ` FX
  2 siblings, 1 reply; 8+ messages in thread
From: Paolo Bonzini @ 2008-05-19 14:45 UTC (permalink / raw)
  To: FX; +Cc: GCC Development, gcc-patches


> Notice the added final '\0' in the C case; I don't know if it's bad to
> have it there, but I don't see a way to not output it and still have
> the correct output for Fortran (whose strings are not NUL-terminated).

I think the best thing to do is to have a langhook then.  I'm actually 
not sure that you want all those \0's in the Fortran front-end since the 
kind can be recovered from the {lb:1 sz:4} that is appended to the 
string.  Endianness issues may also appear.  Maybe you should call iconv 
in the langhook to get back to UTF-8, and print that representation instead.

Paolo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Adjust output for strings in tree-pretty-print.c
  2008-05-19 14:45 ` Paolo Bonzini
@ 2008-05-19 15:00   ` FX
  2008-05-19 15:08     ` Paolo Bonzini
  0 siblings, 1 reply; 8+ messages in thread
From: FX @ 2008-05-19 15:00 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: GCC Development, gcc-patches

> I think the best thing to do is to have a langhook then.

It seems a bit weird to have a langhook for a one-character
difference, but if there's a consensus on it, I'll go along.

> Endianness issues may also appear.  Maybe you should call iconv in the
> langhook to get back to UTF-8, and print that representation instead.

Endianness is already take care of, in the sense that the string is
encoded in the target's endianness already. However, that makes
calling iconv more difficult, because that has us going from target's
endianness to UTF-8, which will be a pain. As only developers do read
tree dumps, I think having a lot of \0 in them is fine.


FX

-- 
FX Coudert
http://www.homepages.ucl.ac.uk/~uccafco/

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Adjust output for strings in tree-pretty-print.c
  2008-05-19 15:00   ` FX
@ 2008-05-19 15:08     ` Paolo Bonzini
  2008-05-20 13:25       ` FX
  0 siblings, 1 reply; 8+ messages in thread
From: Paolo Bonzini @ 2008-05-19 15:08 UTC (permalink / raw)
  To: FX; +Cc: GCC Development, gcc-patches

FX wrote:
>> I think the best thing to do is to have a langhook then.
> 
> It seems a bit weird to have a langhook for a one-character
> difference, but if there's a consensus on it, I'll go along.

To me too, but I still maintain that it's better to print in UTF-8 
(which would make the langhook more useful).  The recent Unicode patches 
for C possibly could use the langhook too.

>> Endianness issues may also appear.  Maybe you should call iconv in the
>> langhook to get back to UTF-8, and print that representation instead.
> 
> Endianness is already take care of, in the sense that the string is
> encoded in the target's endianness already.

But for testing you want a standardized endianness.  Otherwise some 
targets will need to scan "I\0\0\0" and others will need to scan "\0\0\0I".

> However, that makes
> calling iconv more difficult, because that has us going from target's
> endianness to UTF-8, which will be a pain.

No, you can use UTF-32BE and UTF-32LE encodings.

Paolo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Adjust output for strings in tree-pretty-print.c
  2008-05-19 15:08     ` Paolo Bonzini
@ 2008-05-20 13:25       ` FX
  0 siblings, 0 replies; 8+ messages in thread
From: FX @ 2008-05-20 13:25 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: GCC Development, gcc-patches

> To me too, but I still maintain that it's better to print in UTF-8 (which
> would make the langhook more useful).  The recent Unicode patches for C
> possibly could use the langhook too.

OK, I need to focus on making progress and fix the current behaviour,
which is broken for gfortran (ie doesn't allow us to use tree dumps at
all). So, if people want to deal with Unicode tree dumps, that's fine
(though, as a matter of taste, I'd still prefer having the bit
pattern), but that's not the object of this patch.

To deal with the difference in behaviour between the C family and
Fortran, a langhook function is a bit overkill (it needs a callback
function which in turns calls pp_character to actually display things,
in order to be generic and usable in various parts of the middle-end),
I just added a flag (null_terminated_strings): it defaults to true,
but Fortran sets it to false. Depending on this value,
tree-pretty-print.c will behave the right way. I attach the
preliminary patch here, to know if it's OK to go ahead. If so, I'll
add documentation and submit for proper review.

Also, I'd like to know if it's deemed suitable to use that flag in
tree-dump.c (look for TREE_STRING_POINTER in that file). Currently,
calling debug_tree() on a STRING_CST will display null characters,
like "This is a string.\0", which is apparently unsuitable for C. So,
should I use the new flag here also?

> But for testing you want a standardized endianness.  Otherwise some targets
> will need to scan "I\0\0\0" and others will need to scan "\0\0\0I".

That's debatable. I prefer being able to see the byte sequence in the
tree dumps to make sure it's as expected.

FX

-- 
FX Coudert
http://www.homepages.ucl.ac.uk/~uccafco/

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2008-05-20 13:25 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-05-19 13:59 [RFC] Adjust output for strings in tree-pretty-print.c FX
2008-05-19 14:21 ` Manfred Hollstein
2008-05-19 14:31 ` Jakub Jelinek
2008-05-19 14:35   ` FX
2008-05-19 14:45 ` Paolo Bonzini
2008-05-19 15:00   ` FX
2008-05-19 15:08     ` Paolo Bonzini
2008-05-20 13:25       ` FX

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).