The treatment of null characters in C source files

public inbox for gcc@gcc.gnu.org

* The treatment of null characters in C source files
@ 1999-09-05 16:29 Zack Weinberg
  1999-09-05 17:07 ` Jeffrey A Law
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Zack Weinberg @ 1999-09-05 16:29 UTC (permalink / raw)
  To: gcc

Consider a source file such as

#include <stdio.h>

int main()
{
  puts ("hello^@ world");
}

where ^@ is a null character.  cccp passes null characters through to
the output and cc1 accepts them in strings.  All released versions of
gcc will therefore compile this without complaint, producing an
executable that prints "hello".

cpplib used to mangle input files with nulls in them.  The patch I
sent in on Friday (gcc-patches/1999-09/msg00158.html) makes it instead
emit a warning and ignore the null.  The above will produce

test.c:5:15: warning: ignoring ASCII NUL in input

and an executable that prints "hello world".
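
A minimal sketch of the "warn and ignore" behavior described above, assuming
the whole file is already in a buffer of known length.  This is not the code
from the patch; the function name strip_nuls and its signature are made up
for illustration.  It drops NUL bytes in place and reports each one in the
file:line:column style shown in the warning.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, not part of cpplib: remove NUL bytes from
   buf[0..len) in place, warn about each one on stderr, and return
   the new length.  Line and column numbers are 1-based.  */
size_t strip_nuls(char *buf, size_t len, const char *file)
{
    size_t out = 0;
    unsigned long line = 1, col = 1;

    for (size_t in = 0; in < len; in++) {
        char c = buf[in];
        if (c == '\0') {
            fprintf(stderr,
                    "%s:%lu:%lu: warning: ignoring ASCII NUL in input\n",
                    file, line, col);
            col++;              /* the NUL still occupied a column */
            continue;           /* but it is not copied to the output */
        }
        buf[out++] = c;
        if (c == '\n') {
            line++;
            col = 1;
        } else {
            col++;
        }
    }
    return out;
}
```

Because the length is passed explicitly, the scan does not stop at the first
NUL, which is what lets it see and report every one of them.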

The question is, is this an acceptable behavior change for the
compiler?  Making cpplib pass through nulls would be extremely
difficult, but someone might have a legitimate use for them.

zw

p.s. This has nothing to do with multibyte support.  I'm fully aware
that non-ASCII character sets may contain zero bytes which are not
null characters.  Currently cpplib supports only ASCII (unibyte strict
supersets such as Latin1 probably work if the extended characters are
confined to strings and comments).

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: The treatment of null characters in C source files
  1999-09-05 16:29 The treatment of null characters in C source files Zack Weinberg
@ 1999-09-05 17:07 ` Jeffrey A Law
  1999-09-05 17:38   ` Zack Weinberg
                     ` (2 more replies)
  1999-09-05 19:52 ` Alexandre Oliva
  1999-09-30 18:02 ` Zack Weinberg
  2 siblings, 3 replies; 24+ messages in thread
From: Jeffrey A Law @ 1999-09-05 17:07 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: gcc

  In message <199909052329.QAA14145@zack.bitmover.com> you write:
  > 
  > Consider a source file such as
  > 
  > #include <stdio.h>
  > 
  > int main()
  > {
  >   puts ("hello^@ world");
  > }
  > 
  > where ^@ is a null character.  cccp passes null characters through to
  > the output and cc1 accepts them in strings.  All released versions of
  > gcc will therefore compile this without complaint, producing an
  > executable that prints "hello".
  > 
  > cpplib used to mangle input files with nulls in them.  The patch I
  > sent in on Friday (gcc-patches/1999-09/msg00158.html) makes it instead
  > emit a warning and ignore the null.  The above will produce
  > 
  > test.c:5:15: warning: ignoring ASCII NUL in input
  > 
  > and an executable that prints "hello world".
  > 
  > The question is, is this an acceptable behavior change for the
  > compiler?  Making cpplib pass through nulls would be extremely
  > difficult, but someone might have a legitimate use for them.
No, I do not believe that is acceptable behavior.

It is perfectly legitimate for a string to have a null character in it.  If
you look hard you'll even find examples of this in gcc itself.

jeff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: The treatment of null characters in C source files
  1999-09-05 17:07 ` Jeffrey A Law
@ 1999-09-05 17:38   ` Zack Weinberg
  1999-09-30 18:02     ` Zack Weinberg
  1999-09-05 17:42   ` craig
  1999-09-30 18:02   ` Jeffrey A Law
  2 siblings, 1 reply; 24+ messages in thread
From: Zack Weinberg @ 1999-09-05 17:38 UTC (permalink / raw)
  To: law; +Cc: gcc

Jeffrey A Law wrote:
>   In message <199909052329.QAA14145@zack.bitmover.com> you write:
>   > 
>   > Consider a source file such as
>   > 
>   > #include <stdio.h>
>   > 
>   > int main()
>   > {
>   >   puts ("hello^@ world");
>   > }
>   > 
>   > where ^@ is a null character.  cccp passes null characters through to
>   > the output and cc1 accepts them in strings.  All released versions of
>   > gcc will therefore compile this without complaint, producing an
>   > executable that prints "hello".
>   > 
>   > cpplib used to mangle input files with nulls in them.  The patch I
>   > sent in on Friday (gcc-patches/1999-09/msg00158.html) makes it instead
>   > emit a warning and ignore the null.  The above will produce
>   > 
>   > test.c:5:15: warning: ignoring ASCII NUL in input
>   > 
>   > and an executable that prints "hello world".
>   > 
>   > The question is, is this an acceptable behavior change for the
>   > compiler?  Making cpplib pass through nulls would be extremely
>   > difficult, but someone might have a legitimate use for them.
> No, I do not believe that is acceptable behavior.
> 
> It is perfectly legitimate for a string to have a null character in it.  If
> you look hard you'll even find examples of this in gcc itself.

Um, I think you misunderstand.  It deals just fine with 
"hello\0 world", always has.  I'm talking about when a file has a
literal NUL character in it, a byte with all bits zero.
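
To illustrate the case that already works, here is a small sketch using the
\0 escape sequence.  The names greeting, greeting_bytes and greeting_strlen
are hypothetical.  The array keeps every byte of the literal, including the
embedded null, while string functions stop at the first null.

```c
#include <stddef.h>
#include <string.h>

/* The escape sequence \0 embeds a null character in the array;
   the source file itself contains no NUL byte.  */
static const char greeting[] = "hello\0 world";

/* Bytes stored in the array: 5 for "hello", 1 embedded null,
   6 for " world", and 1 terminating null.  */
size_t greeting_bytes(void)
{
    return sizeof greeting;
}

/* Length as seen by string functions, which stop at the first null.  */
size_t greeting_strlen(void)
{
    return strlen(greeting);
}
```

So puts (greeting) would print only "hello", even though the rest of the
literal is present in the object file.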

zw

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: The treatment of null characters in C source files
  1999-09-05 17:07 ` Jeffrey A Law
  1999-09-05 17:38   ` Zack Weinberg
@ 1999-09-05 17:42   ` craig
  1999-09-06  1:10     ` Jeffrey A Law
  1999-09-30 18:02     ` craig
  1999-09-30 18:02   ` Jeffrey A Law
  2 siblings, 2 replies; 24+ messages in thread
From: craig @ 1999-09-05 17:42 UTC (permalink / raw)
  To: law; +Cc: craig

>  > The question is, is this an acceptable behavior change for the
>  > compiler?  Making cpplib pass through nulls would be extremely
>  > difficult, but someone might have a legitimate use for them.
>No, I do not believe that is acceptable behavior.
>
>It is perfectly legitimate for a string to have a null character in it.  If
>you look hard you'll even find examples of this in gcc itself.

A C *string* (as a constant) may certainly have a null character in it,
as in "hello\000 there".

But, surely there's no requirement that a C *source file* be allowed to
have a null character in it.

Since, when printed or displayed in various canonical ways, a C source
file containing a NUL will look exactly like one without that NUL,
but can (apparently) behave differently, I recommend the warning
be preserved.

Programmers can always write \000 where they want NUL in a string, right?
And that prints/displays correctly pretty much all the time.

        tq vm, (burley)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: The treatment of null characters in C source files
  1999-09-05 16:29 The treatment of null characters in C source files Zack Weinberg
  1999-09-05 17:07 ` Jeffrey A Law
@ 1999-09-05 19:52 ` Alexandre Oliva
  1999-09-06 10:26   ` Joern Rennecke
  1999-09-30 18:02   ` Alexandre Oliva
  1999-09-30 18:02 ` Zack Weinberg
  2 siblings, 2 replies; 24+ messages in thread
From: Alexandre Oliva @ 1999-09-05 19:52 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: gcc

On Sep  5, 1999, Zack Weinberg <zack@bitmover.com> wrote:

> cpplib used to mangle input files with nulls in them.  The patch I
> sent in on Friday (gcc-patches/1999-09/msg00158.html) makes it instead
> emit a warning and ignore the null.

Couldn't you just arrange for it to be replaced with \000?

-- 
Alexandre Oliva http://www.dcc.unicamp.br/~oliva IC-Unicamp, Bra[sz]il
oliva@{dcc.unicamp.br,guarana.{org,com}} aoliva@{acm.org,computer.org}
oliva@{gnu.org,kaffe.org,{egcs,sourceware}.cygnus.com,samba.org}
** I may forward mail about projects to mailing lists; please use them

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: The treatment of null characters in C source files
  1999-09-05 17:42   ` craig
@ 1999-09-06  1:10     ` Jeffrey A Law
  1999-09-30 18:02       ` Jeffrey A Law
  1999-09-30 18:02     ` craig
  1 sibling, 1 reply; 24+ messages in thread
From: Jeffrey A Law @ 1999-09-06  1:10 UTC (permalink / raw)
  To: craig; +Cc: zack, gcc

  In message <19990906004151.7557.qmail@deer> you write:
  > A C *string* (as a constant) may certainly have a null character in it,
  > as in "hello\000 there".
Right.


  > But, surely there's no requirement that a C *source file* be allowed to
  > have a null character in it.
Oh, I misunderstood.  Sorry.  No clue what the standard says here.

Zack -- did you check the standard for any wording on this issue?  Ultimately
if it says anything we should follow it.

jeff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: The treatment of null characters in C source files
  1999-09-05 19:52 ` Alexandre Oliva
@ 1999-09-06 10:26   ` Joern Rennecke
  1999-09-30 18:02     ` Joern Rennecke
  1999-09-30 18:02   ` Alexandre Oliva
  1 sibling, 1 reply; 24+ messages in thread
From: Joern Rennecke @ 1999-09-06 10:26 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: zack, gcc

> Couldn't you just arrange for it to be replaced with \000?

It's not quite that easy - if you have "\^@", and you replace the ^@
single-mindedly, you'll get "\\000", which is something completely different.
So you'd have to check if there is an odd number of leading backslashes.
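
The parity check could be sketched roughly as follows.  The function name
preceded_by_odd_backslashes is hypothetical, and the sketch only answers the
question of whether the NUL at a given position is itself escaped; what to
emit in that case is left open.

```c
#include <stddef.h>

/* Hypothetical helper: return 1 if the byte at buf[pos] is immediately
   preceded by an odd number of consecutive backslashes, else 0.
   An odd count means the last backslash would escape that byte.  */
int preceded_by_odd_backslashes(const char *buf, size_t pos)
{
    size_t n = 0;

    /* Walk backwards over the run of backslashes ending at pos - 1.  */
    while (pos > n && buf[pos - n - 1] == '\\')
        n++;
    return (int)(n & 1);
}
```

A replacement pass would call this for each NUL it finds and substitute \000
blindly only when the result is 0.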

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: The treatment of null characters in C source files
  1999-09-06 13:46 John Marshall
  1999-09-07 11:37 ` Dave Brolley
@ 1999-09-30 18:02 ` John Marshall
  1 sibling, 0 replies; 24+ messages in thread
From: John Marshall @ 1999-09-30 18:02 UTC (permalink / raw)
  To: law; +Cc: gcc

>> But, surely there's no requirement that a C *source file* be allowed to
>> have a null character in it.
> Oh, I misunderstood.  Sorry.  No clue what the standard says here.

Section 5.2.1 (Character sets) requires the basic source character set to
have the usual bunch of alphanumerics and punctuation, and space, HT, VT,
FF, and "some way of indicating the end of each line of text".  Outside
of char and string literals and a few other places, encountering anything
else (eg, NUL) is undefined.  Inside a char constant or string literal:

	[...] members of the execution character set shall be represented
	by corresponding members of the source character set or by escape
	sequences [...]

The next sentence requires there to be a NUL character in the basic
execution character set, but not in the source one.

That section then refers you to 6.1.4 for string literals, which doesn't
really say anything about NULs, although there is an obliquely relevant
footnote:

	A character string literal need not be a string (sec 7.1.1), because
	a null character may be embedded in it by a \0 escape sequence.

So I don't think we're required to understand real NUL characters.
(I didn't look in the C++ standard, though.)

    John  "lawyers 'r' us :-)"

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: The treatment of null characters in C source files
  1999-09-07 11:37 ` Dave Brolley
  1999-09-07 12:27   ` Zack Weinberg
@ 1999-09-30 18:02   ` Dave Brolley
  1 sibling, 0 replies; 24+ messages in thread
From: Dave Brolley @ 1999-09-30 18:02 UTC (permalink / raw)
  To: John Marshall; +Cc: law, gcc

John Marshall wrote:

> Section 5.2.1 (Character sets) requires the basic source character set to
> have the usual bunch of alphanumerics and punctuation, and space, HT, VT,
> FF, and "some way of indicating the end of each line of text".  Outside
> of char and string literals and a few other places, encountering anything
> else (eg, NUL) is undefined.  Inside a char constant or string literal:
> So I don't think we're required to understand real NUL characters.
> (I didn't look in the C++ standard, though.)
>
>     John  "lawyers 'r' us :-)"

  The only other place that I know of with relevant verbiage is section 2.4 -
Preprocessing Tokens where it lists the various kinds of preprocessing tokens
and then says that "any other non whitespace character that can not be one of
the above" is a separate preprocessing token. This allows one to use all kinds
of random characters in macro definitions, or on #error directives (for
example). Of course such uses are somewhat obscure, but nonetheless are legal
(one would have to use the stringizing operator to make use of such a macro, for
example).

Anyway the point is that nul is not specifically mentioned, so I think it should
be treated like any other character which is not part of the basic source
character set, although I do think that a warning is in order since it does tend
to be more of a problem than other characters in the ways mentioned by several
previous posts. So my opinion is:

o Within a literal -- accepted with a warning. Zack's initial example should
print 'Hello'.
o On a directive -- treated as a separate pp-token (with a warning).
o In open text -- generates the same diagnostic as any other random character
not in the basic source character set.

Dave

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: The treatment of null characters in C source files
  1999-09-07 12:27   ` Zack Weinberg
  1999-09-07 12:40     ` Dave Brolley
  1999-09-07 12:44     ` Dave Brolley
@ 1999-09-30 18:02     ` Zack Weinberg
  2 siblings, 0 replies; 24+ messages in thread
From: Zack Weinberg @ 1999-09-30 18:02 UTC (permalink / raw)
  To: Dave Brolley; +Cc: gcc

Dave Brolley wrote:
> John Marshall wrote:
> 
> > Section 5.2.1 (Character sets) requires the basic source character set to
> > have the usual bunch of alphanumerics and punctuation, and space, HT, VT,
> > FF, and "some way of indicating the end of each line of text".  Outside
> > of char and string literals and a few other places, encountering anything
> > else (eg, NUL) is undefined.  Inside a char constant or string literal:
> > So I don't think we're required to understand real NUL characters.
> > (I didn't look in the C++ standard, though.)
> >
> >     John  "lawyers 'r' us :-)"
> 
>   The only other place that I know of with relevant verbiage is
> section 2.4 - Preprocessing Tokens where it lists the various kinds
> of preprocessing tokens and then says that "any other non whitespace
> character that can not be one of the above" is a separate
> preprocessing token. This allows one to use all kinds of random
> characters in macro definitions, or on #error directives (for
> example). Of course such uses are somewhat obscure, but nonetheless
> are legal (one would have to use the stringizing operator to make
> use of such a macro, for example).
> 
> Anyway the point is that nul is not specifically mentioned, so I
> think it should be treated like any other character which is not
> part of the basic source character set

This is not _required_ by the standard any more than we are required
to do something sensible with @ outside of strings.  Purely for ease
of implementation, I'm not eager to do this.  cccp is able to
handle nuls, so it is possible; however, cccp's interface is text
files.  cpplib's interface is C strings passed around in memory.  NUL
would have to be a separate token type, special cased all over the
code in both the library itself and its users.  I don't think the
complexity is worth it, compared to discarding them with a warning in
translation phase 1.

If someone had a use for NUL in source that couldn't be served by
\0, I would withdraw the objection.  As it stands, I consider
supporting NUL "properly" to be a waste of effort.

zw

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: The treatment of null characters in C source files
  1999-09-07 12:40     ` Dave Brolley
@ 1999-09-30 18:02       ` Dave Brolley
  0 siblings, 0 replies; 24+ messages in thread
From: Dave Brolley @ 1999-09-30 18:02 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: gcc

Zack Weinberg wrote:

> cpplib's interface is C strings passed around in memory.

Here lies the root of the problem!  I'll bet that changing this to be a 'pointer
and length' interface would improve performance by reducing the number of times
the 'strings' need to get scanned for the terminating nul. How do nul bytes in
multibyte characters get handled?
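
As a rough illustration of the "pointer and length" idea, here is a sketch
with a hypothetical struct span and helper span_count_nuls.  None of these
names come from cpplib.  Since the length travels with the pointer, a NUL
byte is just another byte and nothing has to scan for a terminator.

```c
#include <stddef.h>

/* Hypothetical pointer-and-length view of a piece of source text.  */
struct span {
    const char *ptr;   /* first byte of the text */
    size_t len;        /* number of bytes, NULs included */
};

/* Count the NUL bytes inside a span.  With a plain C string this
   question cannot even be asked, because the first NUL ends it.  */
size_t span_count_nuls(struct span s)
{
    size_t count = 0;

    for (size_t i = 0; i < s.len; i++)
        if (s.ptr[i] == '\0')
            count++;
    return count;
}
```

For the 12 bytes of "hello\0 world", span_count_nuls reports one NUL, whereas
strlen on the same pointer would stop after "hello".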

> If someone had a use for NUL in source that couldn't be served by
> \0, I would withdraw the objection.

True enough. I can't think of a legitimate use, other than nul bytes in multibyte
characters.

Dave

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: The treatment of null characters in C source files
  1999-09-07 12:44     ` Dave Brolley
@ 1999-09-30 18:02       ` Dave Brolley
  0 siblings, 0 replies; 24+ messages in thread
From: Dave Brolley @ 1999-09-30 18:02 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: gcc

Zack Weinberg wrote:

> >   The only other place that I know of with relevant verbiage is
> > section 2.4 - Preprocessing Tokens where it lists the various kinds
> > of preprocessing tokens and then says that "any other non whitespace
> > character that can not be one of the above" is a separate
> > preprocessing token. This allows one to use all kinds of random
> > characters in macro definitions, or on #error directives (for
> > example). Of course such uses are somewhat obscure, but nonetheless
> > are legal (one would have to use the stringizing operator to make
> > use of such a macro, for example).
> >
> >
>
> This is not _required_ by the standard any more than we are required
> to do something sensible with @ outside of strings.

The following code must compile cleanly:

  #define AT @
  #define STR(s) #s
  #define XSTR(s) STR(s)
  char *at = XSTR(AT);

also,

  #error brolley@cygnus.com

Must be accepted.

I'm not arguing in favour of jumping through hoops to handle nul, I'm just giving
examples of where other characters which are not in the basic source character
set must be handled cleanly (i.e. we are _required_ to handle them).

Dave

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: The treatment of null characters in C source files
  1999-09-06 13:46 John Marshall
@ 1999-09-07 11:37 ` Dave Brolley
  1999-09-07 12:27   ` Zack Weinberg
  1999-09-30 18:02   ` Dave Brolley
  1999-09-30 18:02 ` John Marshall
  1 sibling, 2 replies; 24+ messages in thread
From: Dave Brolley @ 1999-09-07 11:37 UTC (permalink / raw)
  To: John Marshall; +Cc: law, gcc

John Marshall wrote:

> Section 5.2.1 (Character sets) requires the basic source character set to
> have the usual bunch of alphanumerics and punctuation, and space, HT, VT,
> FF, and "some way of indicating the end of each line of text".  Outside
> of char and string literals and a few other places, encountering anything
> else (eg, NUL) is undefined.  Inside a char constant or string literal:
> So I don't think we're required to understand real NUL characters.
> (I didn't look in the C++ standard, though.)
>
>     John  "lawyers 'r' us :-)"

  The only other place that I know of with relevant verbiage is section 2.4 -
Preprocessing Tokens where it lists the various kinds of preprocessing tokens
and then says that "any other non whitespace character that can not be one of
the above" is a separate preprocessing token. This allows one to use all kinds
of random characters in macro definitions, or on #error directives (for
example). Of course such uses are somewhat obscure, but nonetheless are legal
(one would have to use the stringizing operator to make use of such a macro, for
example).
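
A minimal sketch of the stringizing case, assuming a preprocessor that
accepts a stray character such as @ as its own preprocessing token
(GCC does).  The names STRINGIZE and stray_token_text are made up for
illustration.

```c
/* Turn the macro argument into a string literal.  */
#define STRINGIZE(x) #x

/* '@' never reaches the compiler proper as a token: the preprocessor
   treats it as a one-character preprocessing token and stringizes it.  */
const char *
stray_token_text (void)
{
  return STRINGIZE (@);
}
```

Without the stringizing operator the @ would survive preprocessing as a
token that cannot be converted to a valid C token, and the compiler
would reject it.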

Anyway the point is that nul is not specifically mentioned, so I think it should
be treated like any other character which is not part of the basic source
character set, although I do think that a warning is in order since it does tend
to be more of a problem than other characters in the ways mentioned by several
previous posts. So my opinion is:

o Within a literal -- accepted with a warning. Zack's initial example should
print 'Hello'.
o On a directive -- treated as a separate pp-token (with a warning).
o In open text -- generates the same diagnostic as any other random character
not in the basic source character set.
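
The three cases above can be sketched as a small decision function.
This only restates the proposal; the enum and function names are
hypothetical and do not correspond to anything in cccp or cpplib.

```c
/* Where in the source the NUL byte was found.  */
enum nul_context { IN_LITERAL, ON_DIRECTIVE, IN_OPEN_TEXT };

/* What the preprocessor would do with it under this proposal.  */
enum nul_action
{
  KEEP_AND_WARN,            /* accepted inside the literal, with a warning */
  SEPARATE_TOKEN_AND_WARN,  /* its own pp-token, with a warning */
  STRAY_CHAR_DIAGNOSTIC     /* same diagnostic as any other stray character */
};

enum nul_action
nul_policy (enum nul_context ctx)
{
  switch (ctx)
    {
    case IN_LITERAL:
      return KEEP_AND_WARN;
    case ON_DIRECTIVE:
      return SEPARATE_TOKEN_AND_WARN;
    default:
      return STRAY_CHAR_DIAGNOSTIC;
    }
}
```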

Dave


* Re: The treatment of null characters in C source files
@ 1999-09-06 13:46 John Marshall
  1999-09-07 11:37 ` Dave Brolley
  1999-09-30 18:02 ` John Marshall
  0 siblings, 2 replies; 24+ messages in thread
From: John Marshall @ 1999-09-06 13:46 UTC (permalink / raw)
  To: law; +Cc: gcc

>> But, surely there's no requirement that a C *source file* be allowed to
>> have a null character in it.
> Oh, I mis-understood.  Sorry.  No clue what the standard says here.

Section 5.2.1 (Character sets) requires the basic source character set to
have the usual bunch of alphanumerics and punctuation, and space, HT, VT,
FF, and "some way of indicating the end of each line of text".  Outside
of char and string literals and a few other places, encountering anything
else (eg, NUL) is undefined.  Inside a char constant or string literal:

	[...] members of the execution character set shall be represented
	by corresponding members of the source character set or by escape
	sequences [...]

The next sentence requires there to be a NUL character in the basic
execution character set, but not in the source one.

That section then refers you to 6.1.4 for string literals, which doesn't
really say anything about NULs, although there is an obliquely relevant
footnote:

	A character string literal need not be a string (sec 7.1.1), because
	a null character may be embedded in it by a \0 escape sequence.

So I don't think we're required to understand real NUL characters.
(I didn't look in the C++ standard, though.)
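
To illustrate the footnote, here is a short sketch of a literal with an
embedded null written as an escape sequence.  The identifiers are
invented for the example; only the \0 behaviour comes from the
standard.

```c
#include <stddef.h>
#include <string.h>

/* An array of 13 chars: "hello", the escaped null, " world", and the
   terminating null that every string literal gets.  */
static const char literal[] = "hello\0 world";

/* Size of the whole array, embedded and terminating nulls included.  */
size_t
literal_size (void)
{
  return sizeof literal;        /* 13 */
}

/* Length as a C string, which stops at the first null.  */
size_t
literal_length (void)
{
  return strlen (literal);      /* 5 */
}
```

The literal is a 13-element array but only a 5-character string, which
is the distinction the footnote draws, and it needs no real NUL byte in
the source file.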

    John  "lawyers 'r' us :-)"


end of thread, other threads:[~1999-09-30 18:02 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
1999-09-05 16:29 The treatment of null characters in C source files Zack Weinberg
1999-09-05 17:07 ` Jeffrey A Law
1999-09-05 17:38   ` Zack Weinberg
1999-09-30 18:02     ` Zack Weinberg
1999-09-05 17:42   ` craig
1999-09-06  1:10     ` Jeffrey A Law
1999-09-30 18:02       ` Jeffrey A Law
1999-09-30 18:02     ` craig
1999-09-30 18:02   ` Jeffrey A Law
1999-09-05 19:52 ` Alexandre Oliva
1999-09-06 10:26   ` Joern Rennecke
1999-09-30 18:02     ` Joern Rennecke
1999-09-30 18:02   ` Alexandre Oliva
1999-09-30 18:02 ` Zack Weinberg
1999-09-06 13:46 John Marshall
1999-09-07 11:37 ` Dave Brolley
1999-09-07 12:27   ` Zack Weinberg
1999-09-07 12:40     ` Dave Brolley
1999-09-30 18:02       ` Dave Brolley
1999-09-07 12:44     ` Dave Brolley
1999-09-30 18:02       ` Dave Brolley
1999-09-30 18:02     ` Zack Weinberg
1999-09-30 18:02   ` Dave Brolley
1999-09-30 18:02 ` John Marshall
