public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
From: Bernd Edlinger <bernd.edlinger@hotmail.de>
To: Richard Biener <rguenther@suse.de>
Cc: Richard Biener <richard.guenther@gmail.com>,
	"iant@golang.org"	<iant@golang.org>,
	GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] Make GO string literals properly NUL terminated
Date: Mon, 20 Aug 2018 15:59:00 -0000	[thread overview]
Message-ID: <AM5PR0701MB26574285F592D9E50308E9F6E4320@AM5PR0701MB2657.eurprd07.prod.outlook.com> (raw)
In-Reply-To: <alpine.LSU.2.20.1808201510470.16707@zhemvz.fhfr.qr>

On 08/20/18 15:19, Richard Biener wrote:
> On Mon, 20 Aug 2018, Bernd Edlinger wrote:
> 
>> On 08/20/18 13:01, Richard Biener wrote:
>>> On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>>>>
>>>>
>>>>
>>>> On 08/01/18 11:29, Richard Biener wrote:
>>>>>
>>>>> Hmm.  I think it would be nice if TREE_STRING_LENGTH would
>>>>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient
>>>>> for your check above.  Because the '\0' doesn't belong to the
>>>>> string.  Then build_string internally appends a '\0' outside
>>>>> of TREE_STRING_LENGTH.
>>>>>
>>>>
>>>> Hmm. Yes, but the outside-0 byte is just one byte, not a wide
>>>> character.
>>>
>>> That could be fixed though (a wide 0 is just N 0s).  Add a elsz = 1
>>> parameter to build_string and allocate as many extra 0s as needed.
>>>
>>>     There are STRING_CSTs which are not string literals,
>>>> for instance attribute tags, Pragmas, asm constrants, etc.
>>>> They use the '\0' outside, and have probably no TREE_TYPE.
>>>>
>>>>>
>>>>>> So I would like to be able to assume that the STRING_CST objects
>>>>>> are internally always generated properly by the front end.
>>>>>
>>>>> Yeah, I guess we need to define what "properly" is ;)
>>>>>
>>>> Yes.
>>>>
>>>>>> And that the ARRAY_TYPE of the string literal either has the
>>>>>> same length than the TREE_STRING_LENGTH or if it is shorter,
>>>>>> this is always exactly one (wide) character size less than TREE_STRING_LENGTH
>>>>>
>>>>> I think it should be always the same...
>>>>>
>>>>
>>>> One could not differentiate between "\0" without zero-termination
>>>> and "" with zero-termination, theoretically.
>>>
>>> Is that important?  Doesn't the C standard say how to parse string literals?
>>>
>>>> We also have char x[100] = "ab";
>>>> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100.
>>>> Of course one could create it with a TREE_STRING_LENGTH = 100,
>>>> but imagine char x[100000000000] = "ab"
>>>
>>> The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I
>>> hope matches "ab" and not 'x'.  If it matches 'x' then I'd rather have it
>>> unconditionally be [], thus incomplete (because the literals "size" depends
>>> on the context/LHS it is used on).
>>>
>>
>> Sorry, but I must say, it is not at all like that.
>>
>> If I compile x.c:
>> const char x[100] = "ab";
>>
>> and set a breakpoint at output_constant:
>>
>> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256,
>>       reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821
>> 4821	  if (size == 0 || flag_syntax_only)
>> (gdb) p size
>> $1 = 100
>> (gdb) call debug(exp)
>> "ab"
>> (gdb) p *exp
>> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = 1,
>> (gdb) p exp->typed.type->type_common.size_unit
>> $5 = (tree) 0x7ffff6ff9d80
>> (gdb) call debug(exp->typed.type->type_common.size_unit)
>> 100
>> (gdb) p exp->string.length
>> $6 = 3
>> (gdb) p exp->string.str[0]
>> $8 = 97 'a'
>> (gdb) p exp->string.str[1]
>> $9 = 98 'b'
>> (gdb) p exp->string.str[2]
>> $10 = 0 '\000'
>> (gdb) p exp->string.str[3]
>> $11 = 0 '\000'
>>
>>
>> This is an important property of string_cst objects, that is used in c_strlen:
>>
>> It folds c_strlen(&x[4]) directly to 0, because every byte beyond TREE_STRING_LENGTH
>> is guaranteed to be zero up to the type size.
>>
>> I would not have spent one thought on implementing an optimization like that,
>> but that's how it is right now.
> 
> Huh.  So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka
> they have zero-padding up to its type size.  I don't see this documented
> anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0"
> with appropriate TYPE_SIZE.
> 
> This is also a relatively new thing on trunk (coming out of the added
> mem_size parameter of string_constant).  That this looks at the STRING_CST
> type like
> 
>    if (TREE_CODE (array) == STRING_CST)
>      {
>        *ptr_offset = fold_convert (sizetype, offset);
>        if (mem_size)
>          *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array));
>        return array;
> 
> I'd call simply a bug.  As said, interpretation of STRING_CSTs should
> depend on the context.  And for
> 

This use of the TYPE_SIZE_UNIT was pre-existing to r263607
before that (but not in the gcc-8-branch) we had this in c_strlen:

   HOST_WIDE_INT maxelts = strelts;
   tree type = TREE_TYPE (src);
   if (tree size = TYPE_SIZE_UNIT (type))
     if (tree_fits_shwi_p (size))
       {
        maxelts = tree_to_uhwi (size);
        maxelts = maxelts / eltsize - 1;
       }

which I moved to string_constant and return the result through memsize.

It seems to be working somehow, and I'd bet removing that will immediately
break twenty or thirty of the strlenopt tests.

Obviously the tree string objects allow way too much variations,
and it has to be restricted in some way or another.

In the moment my approach is: look at what most string constants do
and add assertions to ensure that there are no exceptions.


> char a[4] = "abc";
> char b[5] = "abc";
> 
> we should better be able to share STRING_CSTs [you can see LTO
> sharing the nodes if you do b[4] but not when b[5] I suppose].
> 
>> All I want to do, is make sure that all string constants have the same look and feel
>> in the middle-end, and restrict the variations that are allowed by the current
>> implementation.
> 
> Sure, I understand that.  But I'd like to simplify things and not add
> complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide
> whether sth is 0-terminated.
> 

This is not only about 0-terminated. (*)

It is also about when you get a STRING_CST with a TREE_STRING_LENGTH,
there are places in the middle-end that assume that the object contains
_all_ bytes up to TREE_STRING_LENGTH without looking at TYPE_SIZE_UINT.
Those I want to protect.

Bernd.


*: I called the parameter memsize, and yes, I wrote: "If MEM_SIZE is zero,
the string is only returned when it is properly zero terminated.",
but maybe I should have written:
"If MEM_SIZE is zero, the string is only returned when the storage size
of the string object is at least TREE_STRING_LENGTH."
That's at least exactly what the check does.

> Richard.
> 
>>
>> Bernd.
>>
>>
>>>>>> The idea is to use this property of string literals where needed,
>>>>>> and check rigorously in varasm.c.
>>>>>>
>>>>>> Does that make sense?
>>>>>
>>>>> So if it is not the same then the excess character needs to be
>>>>> a (wide) NUL in your model?  ISTR your varasm.c patch didn't verify
>>>>> that.
>>>>>
>>>>
>>>> I think it does.
>>>>
>>>>
>>>> Bernd.
>>
> 

  reply	other threads:[~2018-08-20 15:59 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-07-31 12:14 Bernd Edlinger
2018-07-31 14:40 ` Ian Lance Taylor
2018-07-31 16:19   ` Bernd Edlinger
2018-07-31 20:18     ` Ian Lance Taylor
2018-08-01  7:01       ` Richard Biener
2018-08-01  7:44         ` Bernd Edlinger
2018-08-01  7:53           ` Richard Biener
2018-08-01  9:22             ` Bernd Edlinger
2018-08-01  9:30               ` Richard Biener
2018-08-01 13:05                 ` Bernd Edlinger
2018-08-20 11:02                   ` Richard Biener
2018-08-20 12:10                     ` Bernd Edlinger
2018-08-20 13:19                       ` Richard Biener
2018-08-20 15:59                         ` Bernd Edlinger [this message]
2018-08-21  5:30                           ` Bernd Edlinger
2018-08-21  8:33                           ` Richard Biener
2018-08-22  4:57                             ` Bernd Edlinger
2018-08-24 10:52                               ` Richard Biener
2018-08-24 11:49                                 ` Bernd Edlinger
2018-08-24 12:57                                   ` Richard Biener

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=AM5PR0701MB26574285F592D9E50308E9F6E4320@AM5PR0701MB2657.eurprd07.prod.outlook.com \
    --to=bernd.edlinger@hotmail.de \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=iant@golang.org \
    --cc=rguenther@suse.de \
    --cc=richard.guenther@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).