public inbox for gcc@gcc.gnu.org
* RE: C++ lexer (GCC 3.1.1) requires knowledge of other C dialects
@ 2002-08-02  7:53 Robert Dewar
  2002-08-02  9:16 ` Gary Funck
  0 siblings, 1 reply; 12+ messages in thread
From: Robert Dewar @ 2002-08-02  7:53 UTC (permalink / raw)
  To: gary, gcc

>(though they are explicitly public domain with a BSD-like copyright notice).

That's nonsense. Something is either in the public domain or it is copyrighted.
It cannot be both. If it has a copyright notice, then it is not in the public
domain. People on this list, of all lists, should not misuse the phrase!

However, BSD is a perfectly acceptable Free Software license, so I don't
see a problem here from a legal point of view.

I still see a problem from a technical point of view. First, I think that
LL parser generators are inherently inferior to good LR parser generators.
If you want to use automatic parser generation, a modern LR parser
generator is a much better choice (which, by the way, excludes YACC and
BISON :-)

The allegation that a given parser generator cannot parse a given language
is always incorrect. Why? Because in practice what you do is to adjust the
grammar to be suitable for the generator. Such adjustment is always possible.
What you do is to superset the grammar so that it is acceptable to the 
generator and then let the semantic analyzer resolve differences. Nearly
all real languages are ambiguous, so some supersetting/approximation is
often necessary.

Now what you can say is that, given a language L and a parser generator P, the
damage done to the grammar of L to get it through P is excessive. For
example, if we have to approximate the C++ grammar as follows:

   CPP_Program ::= Character_Sequence
   Character_Sequence ::= character | character Character_Sequence

then that's bad, it leaves too much work for the semantic analyzer :-)
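
To make the supersetting idea concrete in less extreme terms, here is a small,
purely illustrative C sketch (hypothetical, not drawn from any real front end)
of the pattern: the parser accepts one production covering both a declaration
and a call for something like "T(x);", marks the node as ambiguous, and a
later semantic pass resolves it by asking whether the leading identifier
names a type.

   #include <stdio.h>
   #include <string.h>

   enum node_kind { NODE_AMBIGUOUS, NODE_DECLARATION, NODE_CALL };

   struct node
   {
     enum node_kind kind;
     const char *head;    /* identifier before the parenthesis */
     const char *inner;   /* identifier inside the parentheses */
   };

   /* "Parser": build the superset node without trying to decide.  */
   static struct node
   parse_decl_or_call (const char *head, const char *inner)
   {
     struct node n = { NODE_AMBIGUOUS, head, inner };
     return n;
   }

   /* "Semantic analysis": resolve against a toy symbol table of type names.  */
   static void
   resolve (struct node *n, const char **type_names, int n_types)
   {
     int i;
     n->kind = NODE_CALL;
     for (i = 0; i < n_types; i++)
       if (strcmp (n->head, type_names[i]) == 0)
         n->kind = NODE_DECLARATION;
   }

   int
   main (void)
   {
     const char *types[] = { "T" };
     struct node a = parse_decl_or_call ("T", "x");   /* "T(x);" */
     struct node b = parse_decl_or_call ("f", "x");   /* "f(x);" */

     resolve (&a, types, 1);
     resolve (&b, types, 1);

     printf ("%s\n", a.kind == NODE_DECLARATION ? "declaration" : "call");
     printf ("%s\n", b.kind == NODE_DECLARATION ? "declaration" : "call");
     return 0;
   }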

I still think that writing the parser by hand makes much better sense
for the reasons I stated previously.

P.S. I know my old Berkeley mailer is not generating proper headers for
some of you, and I am trying to find a suitable replacement that works in
my context. Sorry for the inconvenience.

* RE: C++ lexer (GCC 3.1.1) requires knowledge of other C dialects
@ 2002-08-02 11:19 Robert Dewar
  0 siblings, 0 replies; 12+ messages in thread
From: Robert Dewar @ 2002-08-02 11:19 UTC (permalink / raw)
  To: gary, gcc

> It is public-domain (not copyrighted), they ask that users preserve the
> header comments and cite the use of the tool in research reports and
> documentation.

That of course is just a courtesy request, with no force behind it. Once
you place something in the public domain, you lose all control over what
might be done to it.

* Re: C++ lexer (GCC 3.1.1) requires knowledge of other C dialects
@ 2002-08-01  3:21 Robert Dewar
  2002-08-01 11:59 ` Gary Funck
  0 siblings, 1 reply; 12+ messages in thread
From: Robert Dewar @ 2002-08-01  3:21 UTC (permalink / raw)
  To: gary, per; +Cc: gcc

<<There are tricks to solve most of these problems, but a
recursive-descent parser is just so much easier to understand, to debug,
and to handle special quirks that I much prefer it.  Yes, the actual
grammar isn't as cleanly separated, and the source is slightly more
verbose, but I much prefer it.
>>

When we use the phrase "recursive descent", what we mean in practice is
not a straight LL(1) parser. If we wanted a straight LL(1) parser, then
we could use an appropriate parser generator (though why would we do this
when LR(k) parser generators are clearly preferable?).

The point about recursive descent is that it is hand-written code that can
use any techniques it likes to parse the source. It can do a bit of
backtracking or extra lookahead if it wants, it can keep information on the
side and roam around the tree, etc. This is particularly useful for two
purposes (a small sketch combining both appears at the end of this message):

1. The grammar used does not have to be exactly LL(1) or LR(1) or anything
else. In GNAT, we take advantage of this to use almost exactly the grammar
in the Ada RM, which is certainly neither. But using the exact grammar in
the RM means that semantic analysis is much, much cleaner. Since, for a
language like Ada or C++, semantic analysis is far more complex than
parsing, that's a real advantage.

2. Error detection and correction is potentially much more effective. Now it
is true that YACC and BISON are perfectly awful examples of decades-old
junk parsing technology, and that automatic parsing technology has gotten
hugely better in the last 20 years (it's a mystery to me why people would
even think of using such antiquated tools). Nevertheless, even with the
best table-driven techniques (see, for example, the work of Gerry Fisher
in connection with the Ada/Ed project in the early-to-mid 1980s), I still
think you cannot match a hand-written parser for error detection and
recovery, and the GNAT parser is intended to demonstrate this.

One bottom line here is that parsers are a pretty simple part of the process
no matter how they are written. In the case of the Ada parser, it's a
relatively small and easy fraction of the total front end, and in fact, if
you look at the parser (the par-xxx.ad? files), much more than half the code
has to do with error detection and recovery, and that's the complicated part.

* C++ lexer (GCC 3.1.1) requires knowledge of other C dialects
@ 2002-07-31  8:10 Gary Funck
  2002-07-31 20:30 ` Neil Booth
  0 siblings, 1 reply; 12+ messages in thread
From: Gary Funck @ 2002-07-31  8:10 UTC (permalink / raw)
  To: Gcc Mailing List


While moving changes made to GCC to support an experimental dialect of C, known
as UPC, I ran into a problem: after adding the new language support in a
fashion similar to Objective-C, the C++ compiler was no longer able to properly
lex and parse C++ programs. It turns out that the difficulty is a result of the
method used in cp/lex.c to recognize reserved words:

   /* Table mapping from RID_* constants to yacc token numbers.
      Unfortunately we have to have entries for all the keywords in all
      three languages.  */
   const short rid_to_yy[RID_MAX] =
   {
     /* RID_STATIC */      SCSPEC,
     /* RID_UNSIGNED */    TYPESPEC,
     /* RID_LONG */        TYPESPEC,

The rid_to_yy table requires an entry for each reserved word in *all* supported
C dialects, and the table is ordered by increasing RID_* values.  A few
observations:

1) This dependency on other languages makes it more difficult to add a new
dialect and violates modularity.

2) If the dependency must exist, the job of adding another dialect would be
easier if, for example, a new type of *.def file were introduced under gcc
that generated both the values currently defined in c-common.h and the table
required by cp/lex.c.

3) At a minimum, would it be possible to add a consistency check to the C++
parser initialization routine (perhaps the number of elements in rid_to_yy
could be checked against the value of RID_MAX?) that aborts with a diagnostic
if something in the definition of rid_to_yy seems inconsistent or incorrect?
(A sketch of such a check follows this list.)

4) I haven't had a chance to read the new internals document yet, but if there
is a section on adding a new dialect, it should discuss this dependency between
C++ and the other C dialects.
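
Regarding (3) above, here is a rough sketch of what such a check might look
like. The rid_to_yy, RID_MAX, SCSPEC, and TYPESPEC names mirror c-common.h
and cp/lex.c, but the stand-in definitions and the check itself are only
illustrative, not existing GCC code:

   #include <stdio.h>
   #include <stdlib.h>

   /* Stand-ins for the real enum rid (c-common.h) and yacc token codes.  */
   enum rid { RID_STATIC, RID_UNSIGNED, RID_LONG, /* ... */ RID_MAX };
   enum yytok { SCSPEC = 258, TYPESPEC };

   /* Declared without an explicit bound, so a missing entry changes the
      element count instead of being silently zero-filled up to RID_MAX.  */
   static const short rid_to_yy[] =
   {
     /* RID_STATIC */      SCSPEC,
     /* RID_UNSIGNED */    TYPESPEC,
     /* RID_LONG */        TYPESPEC,
   };

   /* Called from parser initialization: stop if the table and the
      enumeration have drifted out of sync.  */
   static void
   check_rid_to_yy (void)
   {
     if (sizeof (rid_to_yy) / sizeof (rid_to_yy[0]) != (size_t) RID_MAX)
       {
         fprintf (stderr, "rid_to_yy is out of sync with enum rid\n");
         abort ();
       }
   }

   int
   main (void)
   {
     check_rid_to_yy ();
     return 0;
   }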



Thread overview: 12+ messages
2002-08-02  7:53 C++ lexer (GCC 3.1.1) requires knowledge of other C dialects Robert Dewar
2002-08-02  9:16 ` Gary Funck
2002-08-02 11:19 Robert Dewar
2002-08-01  3:21 Robert Dewar
2002-08-01 11:59 ` Gary Funck
2002-08-01 14:42   ` Joe Buck
2002-08-02  7:41     ` Gary Funck
2002-07-31  8:10 Gary Funck
2002-07-31 20:30 ` Neil Booth
2002-07-31 21:58   ` Gary Funck
2002-07-31 22:32     ` Neil Booth
2002-07-31 23:03     ` Stan Shebs
