public inbox for gcc@gcc.gnu.org
* RE: C++ lexer (GCC 3.1.1) requires knowledge of other C dialects
@ 2002-08-02 11:19 Robert Dewar
  0 siblings, 0 replies; 12+ messages in thread
From: Robert Dewar @ 2002-08-02 11:19 UTC (permalink / raw)
  To: gary, gcc

> It is public-domain (not copyrighted), they ask that users preserve the
> header comments and cite the use of the tool in research reports and
> documentation.

That of course is just a courtesy request, with no force behind it. Once
you place something in the public domain, you lose all control over what
might be done to it.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: C++ lexer (GCC 3.1.1) requires knowledge of other C dialects
  2002-08-02  7:53 Robert Dewar
@ 2002-08-02  9:16 ` Gary Funck
  0 siblings, 0 replies; 12+ messages in thread
From: Gary Funck @ 2002-08-02  9:16 UTC (permalink / raw)
  To: gcc



> -----Original Message-----
> From: Robert Dewar [mailto:dewar@gnat.com]
> Sent: Friday, August 02, 2002 7:54 AM
> To: gary@Intrepid.Com; gcc@gcc.gnu.org
> Subject: RE: C++ lexer (GCC 3.1.1) requires knowledge of other C
> dialects
>
>
> >(though they are explicitly public domain with a BSD-like
> copyright notice).
>
> That's nonsense. Something is either in the public domain or it
> is copyrighted.
> It cannot be both. If it has a copyright notice, then it is not
> in the public
> domain. People on this list of all lists should not misuse the phrase!
>

Thanks for the clarification. Here is their rights statement:
 http://www.antlr.org/rights.html
It is public-domain (not copyrighted), they ask that users preserve the
header comments and cite the use of the tool in research reports and
documentation.



> I still see a problem from a technical point of view. First I think that
> LL parser generators are inherently inferior to good LR parser generators.
> If you want to use automatic parser generation, a modern LR parser
> generator is a much better choice (that by the way excludes YACC and
> BISON :-)
>

This is an age-old debate. I've used both LALR and LL parser generators, and
prefer LL (well, modified LL, as implemented in PCCTS) partly because the
top-down nature of the parse is more intuitive, and actions added internally
to RHS rules do not run the risk of introducing new parse conflicts.
Generally, the resulting parsers are easier to understand and require fewer
tricks, though the need to perform left-factoring can make the resulting LL
grammar not as clear and intuitive as it might've been in its original
(non-factored) form.

Terence Parr's thesis discusses some of the pros/cons of using LL vs. LALR
and motivates the need for k>1 lookahead techniques in LL(k) parsers:
http://www.antlr.org/papers/parr.phd.thesis.ps.gz

I am interested in knowing: which LALR parser generators are superior to
bison/yacc?


> The allegation that a given parser generator cannot parse a given language
> is always incorrect. Why? Because in practice what you do is to adjust the
> grammar to be suitable for the generator. Such adjustment is
> always possible.

I agree. Regarding the statement in the ANTLR FAQ that it is impossible to
use ANTLR to build a C++ parser, I was going to say that this claim is
likely false. I think the author meant that some unpleasant damage would
have to be done to the C++ BNF reference grammar to get a working C++
parser through ANTLR.

There's some interesting work which improves the technology used to build
recursive descent parsers:
http://www.cs.rhul.ac.uk/research/languages/projects/grd.shtml

"GRD is a new style of parsing algorithm that allows you to prototype new
languages without regard to the underlying parsing technology. Once a
working prototype is obtained, GRD based parser generators can be used to
refine the language specification to improve parser performance.

The heart of GRD is a new backtracking recursive descent parsing algorithm
that can handle grammars requiring arbitrary amounts of lookahead and which
even behaves correctly when presented with ambiguous grammars. A faster
version uses a new property called follow determinism to decide when to
truncate the backtracking search."


>
> I still think that writing the parser by hand makes much better sense
> for the reasons I stated previously.
>

In my opinion, a parser built using a parser generator is generally easier
to understand, maintain, and extend than a hand built parser. I like the
fact that a parser generator will tell me if I've introduced ambiguities,
and point out where they are. I also generally like automatically generated
error recovery, as long as the parser generator gives me the tools to
augment it.


* RE: C++ lexer (GCC 3.1.1) requires knowledge of other C dialects
@ 2002-08-02  7:53 Robert Dewar
  2002-08-02  9:16 ` Gary Funck
  0 siblings, 1 reply; 12+ messages in thread
From: Robert Dewar @ 2002-08-02  7:53 UTC (permalink / raw)
  To: gary, gcc

>(though they are explicitly public domain with a BSD-like copyright notice).

That's nonsense. Something is either in the public domain or it is copyrighted.
It cannot be both. If it has a copyright notice, then it is not in the public
domain. People on this list of all lists should not misuse the phrase!

However, BSD is a perfectly acceptable Free Software license, so I don't
see a problem here from a legal point of view.

I still see a problem from a technical point of view. First I think that
LL parser generators are inherently inferior to good LR parser generators.
If you want to use automatic parser generation, a modern LR parser
generator is a much better choice (that by the way excludes YACC and
BISON :-)

The allegation that a given parser generator cannot parse a given language
is always incorrect. Why? Because in practice what you do is to adjust the
grammar to be suitable for the generator. Such adjustment is always possible.
What you do is to superset the grammar so that it is acceptable to the 
generator and then let the semantic analyzer resolve differences. Nearly
all real languages are ambiguous, so some supersetting/approximation is
often necessary.

Now what you can say is that given language L and parser generator P, the
damage done to the grammar of L to get it through P is excessive (for
example, if we have to approximate the C++ grammar as follows:

   CPP_Program ::= Character_Sequence
   Character_Sequence ::= character | character Character_Sequence

then that's bad, it leaves too much work for the semantic analyzer :-)

I still think that writing the parser by hand makes much better sense
for the reasons I stated previously.

P.S. I know my old Berkeley mailer is not generating proper headers for
some of you, and I am trying to find a suitable replacement that works in
my context. Sorry for the inconvenience.


* RE: C++ lexer (GCC 3.1.1) requires knowledge of other C dialects
  2002-08-01 14:42   ` Joe Buck
@ 2002-08-02  7:41     ` Gary Funck
  0 siblings, 0 replies; 12+ messages in thread
From: Gary Funck @ 2002-08-02  7:41 UTC (permalink / raw)
  To: gcc



> -----Original Message-----
> From: Joe Buck [mailto:Joe.Buck@synopsys.com]
> Sent: Thursday, August 01, 2002 2:42 PM
> To: Gary Funck
> Cc: gcc@gcc.gnu.org
> Subject: Re: C++ lexer (GCC 3.1.1) requires knowledge of other C
> dialects
>
>
>
> > Although this is a bit of philosophical debate (regarding the merits of
> > handbuilt vs. program-generated parsers), I'll point out that the ANTLR
> > parser generator (http://www.antlr.org) generates LL(k)
> recursive descent
> > parsers with additional features which include: [ many features ]
>
> However, the ANTLR FAQ explains that you can't write a C++ parser in ANTLR
> because of the complexities of the language.  They recommend hand-written.
>
> See http://www.jguru.com/faq/view.jsp?EID=531848
>

Yes, though further down that same page, they mention the following:
"I should mention that John Lilley produced a fairly decent C++ parser for
ANTLR's predecessor, PCCTS, but it required a modification to PCCTS and is
not compatible with latest ANTLR."

PCCTS is described here:
http://www.antlr.org/pccts133.html
which in turn refers to this implementation of a C++ grammar:
http://www.empathy.com/pccts/
and
http://www.empathy.com/pccts/download.html
and this interesting resource page
http://www.empathy.com/pccts/roskind.html
which contains Jim Roskind's implementation of a C++ parser written for an
LALR parser generator (yacc/bison), along with an analysis of the
difficulties encountered.

While we're on the topic, there is a backtracking implementation of
yacc, known as btyacc, which according to its authors has been used to build
a C++ parser:
http://www.siber.org/btyacc/

It turns out that ANTLR might not be suitable for building a parser for GCC
because ANTLR is written in Java, and neither ANTLR nor PCCTS is GPL'd
(though they are explicitly public domain with a BSD-like copyright notice).
I wasn't trying to imply that these would be the tool of choice, but simply
noting that these automatic parser generators meet many of the technical
objections mentioned by advocates of handbuilt recursive descent parsers. In
the sense that bison is a GPL'd implementation of YACC, there would be
nothing to prevent re-implementing (and extending) any of the tools
mentioned above in a form suitable for inclusion in GCC.





* Re: C++ lexer (GCC 3.1.1) requires knowledge of other C dialects
  2002-08-01 11:59 ` Gary Funck
@ 2002-08-01 14:42   ` Joe Buck
  2002-08-02  7:41     ` Gary Funck
  0 siblings, 1 reply; 12+ messages in thread
From: Joe Buck @ 2002-08-01 14:42 UTC (permalink / raw)
  To: Gary Funck; +Cc: gcc


> Although this is a bit of philosophical debate (regarding the merits of
> handbuilt vs. program-generated parsers), I'll point out that the ANTLR
> parser generator (http://www.antlr.org) generates LL(k) recursive descent
> parsers with additional features which include: [ many features ]

However, the ANTLR FAQ explains that you can't write a C++ parser in ANTLR
because of the complexities of the language.  They recommend hand-written.

See http://www.jguru.com/faq/view.jsp?EID=531848


* RE: C++ lexer (GCC 3.1.1) requires knowledge of other C dialects
  2002-08-01  3:21 Robert Dewar
@ 2002-08-01 11:59 ` Gary Funck
  2002-08-01 14:42   ` Joe Buck
  0 siblings, 1 reply; 12+ messages in thread
From: Gary Funck @ 2002-08-01 11:59 UTC (permalink / raw)
  To: gcc



> -----Original Message-----
> From: Robert Dewar [mailto:dewar@gnat.com]
> Sent: Thursday, August 01, 2002 3:21 AM
> To: gary@Intrepid.Com; per@bothner.com
> Cc: gcc@gcc.gnu.org
> Subject: Re: C++ lexer (GCC 3.1.1) requires knowledge of other C
> dialects
>
>
> <<There are tricks to solve most of these problems, but a
> recursive-descent parser is just so much easier to understand, to debug,
> and to handle special quirks that I much prefer it.  Yes, the actual
> grammar isn't as cleanly separated, and the source is slightly more
> verbose, but I much prefer it.
> >>
>
> When we use the phrase "recursive descent" what we mean in practice is
> not a straight LL(1) parser. If we wanted a straight LL(1) parser, then
> we could use an appropriate parser generator (though why would we do this
> when LR(k) parser generators are clearly preferable).
>
> The point about recursive descent is that it is hand written code that can
> use any techniques it likes to parse the source. It can do a bit of backup
> or extra look ahead if it wants, it can keep information on the side and
> roam around the tree etc. This is particularly useful for two purposes:
[...]

Although this is a bit of a philosophical debate (regarding the merits of
handbuilt vs. program-generated parsers), I'll point out that the ANTLR
parser generator (http://www.antlr.org) generates LL(k) recursive descent
parsers with additional features which include: (1) syntactic and semantic
predicates, which allow for arbitrary checks before an alternative is taken;
(2) approximate linear lookahead, which accepts a larger class of LL(k)
grammars; (3) automated error recovery, with support for manual, arbitrary
error recovery; (4) parsers generated in C, C++, and Java; and (5) built-in
support for building ASTs (Abstract Syntax Trees).


* Re: C++ lexer (GCC 3.1.1) requires knowledge of other C dialects
@ 2002-08-01  3:21 Robert Dewar
  2002-08-01 11:59 ` Gary Funck
  0 siblings, 1 reply; 12+ messages in thread
From: Robert Dewar @ 2002-08-01  3:21 UTC (permalink / raw)
  To: gary, per; +Cc: gcc

<<There are tricks to solve most of these problems, but a
recursive-descent parser is just so much easier to understand, to debug,
and to handle special quirks that I much prefer it.  Yes, the actual
grammar isn't as cleanly separated, and the source is slightly more
verbose, but I much prefer it.
>>

When we use the phrase "recursive descent" what we mean in practice is
not a straight LL(1) parser. If we wanted a straight LL(1) parser, then
we could use an appropriate parser generator (though why would we do this
when LR(k) parser generators are clearly preferable).

The point about recursive descent is that it is hand written code that can
use any techniques it likes to parse the source. It can do a bit of backup
or extra look ahead if it wants, it can keep information on the side and
roam around the tree etc. This is particularly useful for two purposes:

1. The grammar used does not have to be exactly LL(1) or LR(1) or anything
else. In GNAT, we take advantage of this to use almost exactly the grammar
in the Ada RM which is certainly neither. But using the exact grammar in
the RM means that semantic analysis is much much cleaner. Since for a 
language like Ada or C++, semantic analysis is far more complex than
parsing, that's a real advantage.

2. Error detection and correction is potentially much more effective. Now it
is true that YACC and BISON are perfectly awful examples of decades old
junk parsing technology, and that automatic parsing technology has got
hugely better in the last 20 years (it's a mystery to me why people would
even think of using such antiquated tools). Nevertheless, even with the
best table generated techniques (reference for example the work of Gerry
Fisher in connection with the Ada/Ed project, early -> mid 80's) I still
think you cannot match a hand written parser for error detection and
recovery, and the GNAT parser is intended to demonstrate this.

One bottom line here is that parsers are a pretty simple part of the process
no matter how written. In the case of the Ada parser, it's a relatively small
and easy fraction of the total front end, and actually if you look at the
parser (par-xxx.ad? files), much more than half the code has to do with
error detection and recovery, and that's the complicated part.


* Re: C++ lexer (GCC 3.1.1) requires knowledge of other C dialects
  2002-07-31 21:58   ` Gary Funck
  2002-07-31 22:32     ` Neil Booth
@ 2002-07-31 23:03     ` Stan Shebs
  1 sibling, 0 replies; 12+ messages in thread
From: Stan Shebs @ 2002-07-31 23:03 UTC (permalink / raw)
  To: Gary Funck; +Cc: Gcc Mailing List

Gary Funck wrote:

>-----Original Message-----
>From: Neil Booth [mailto:neil@daikokuya.co.uk]
>
>> We're slowly working on combining the C/C++/ObjC front ends into one
>> mahousive front end that handles them all.
>
>
>Will this new front-end still be YACC (bison) based? I've seen and heard
>about an effort to build a hand-crafted recursive descent parser. Although
>there may be some advantages to such an approach, I can see where adding new
>dialects of C might become problematic - it may be
>difficult to verify that new language constructs do not break the parser by
>introducing the need for new look-ahead sets, etc. The advantage of using an
>automated parser generator (like YACC, PCCTS, JYACC, etc.) is that these
>programs can determine if (and where) the grammar is ambiguous, and will
>automatically generate the required (and correct) look-ahead sets.
>
Yacc-class generators turn out to be fundamentally inadequate for
C++, and in retrospect it looks like it wasn't such a great idea to
produce a C++ compiler as a completely separate frontend from
the C compiler - over the years there has had to be a bunch of
duplicated work, duplicated bug reports, etc.  (Could Tiemann
have been sufficiently clairvoyant?  Perhaps not...)

To be honest, support for new dialects of C has not really been a
design requirement for the GCC frontends, and I think it would be
impractical to add such a requirement, too vague to specify.  From
my one bit of experience (AltiVec extensions to C and C++), yacc
wasn't really a help anyway, because the extensions had to be context
sensitive in various ways, and the parser changes proper were the
smallest part of the support code.

Stan



* Re: C++ lexer (GCC 3.1.1) requires knowledge of other C dialects
  2002-07-31 21:58   ` Gary Funck
@ 2002-07-31 22:32     ` Neil Booth
  2002-07-31 23:03     ` Stan Shebs
  1 sibling, 0 replies; 12+ messages in thread
From: Neil Booth @ 2002-07-31 22:32 UTC (permalink / raw)
  To: Gary Funck; +Cc: Gcc Mailing List

Gary Funck wrote:-

> Will this new front-end still be YACC (bison) based? I've seen and heard
> about an effort to build a hand-crafted recursive descent parser.

I think that is pretty much done for C++ as a first revision (see
cp/Attic/parser.c).  It will probably have some hooks / if statements to
handle C and ObjC in the future.  I have no idea what this might do for
maintainability, but IMO we've stretched bison/yacc way beyond what it can
reasonably do, and it shows (diagnostics are very poor, and the parsers are
convoluted, for a start).  I see little reason why a well-designed recursive
descent parser for all front ends can't be at least as maintainable.

I could be wrong, of course.

Neil.


* RE: C++ lexer (GCC 3.1.1) requires knowledge of other C dialects
  2002-07-31 20:30 ` Neil Booth
@ 2002-07-31 21:58   ` Gary Funck
  2002-07-31 22:32     ` Neil Booth
  2002-07-31 23:03     ` Stan Shebs
  0 siblings, 2 replies; 12+ messages in thread
From: Gary Funck @ 2002-07-31 21:58 UTC (permalink / raw)
  To: Gcc Mailing List



> -----Original Message-----
> From: Neil Booth [mailto:neil@daikokuya.co.uk]
> Sent: Wednesday, July 31, 2002 3:09 PM
> To: Gary Funck
> Cc: Gcc Mailing List
> Subject: Re: C++ lexer (GCC 3.1.1) requires knowledge of other C
> dialects
>
>
> Gary Funck wrote:-
>
> > The rid_to_yy table requires an entry for each reserved word in
> *all* supported
> > C dialects, and the table is ordered by increasing RID_* values.  A few
> > observations:
> >
> > 1) This dependency on other languages makes it more difficult
> to add a new
> > dialect and violates modularity.
>
> Agreed.  Any patch that might improve this without sacrificing
> performance is welcome.  However, I suspect that you might discover why
> things are done the way they are if you attempt such a patch.
>

I think that adding a reserved_id.def file containing essentially a union of
the information required by C, C++, and Objective-C (and other dialects)
wouldn't be too difficult to construct, and wouldn't impact performance,
but I haven't looked into the details.

> We're slowly working on combining the C/C++/ObjC front ends into one
> mahousive front end that handles them all.

Will this new front-end still be YACC (bison) based? I've seen and heard
about an effort to build a hand-crafted recursive descent parser. Although
there may be some advantages to such an approach, I can see where adding new
dialects of C might become problematic - it may be difficult to verify that
new language constructs do not break the parser by introducing the need for
new look-ahead sets, etc. The advantage of using an automated parser
generator (like YACC, PCCTS, JYACC, etc.) is that these programs can
determine if (and where) the grammar is ambiguous, and will automatically
generate the required (and correct) look-ahead sets.


* Re: C++ lexer (GCC 3.1.1) requires knowledge of other C dialects
  2002-07-31  8:10 Gary Funck
@ 2002-07-31 20:30 ` Neil Booth
  2002-07-31 21:58   ` Gary Funck
  0 siblings, 1 reply; 12+ messages in thread
From: Neil Booth @ 2002-07-31 20:30 UTC (permalink / raw)
  To: Gary Funck; +Cc: Gcc Mailing List

Gary Funck wrote:-

> The rid_to_yy table requires an entry for each reserved word in *all* supported
> C dialects, and the table is ordered by increasing RID_* values.  A few
> observations:
> 
> 1) This dependency on other languages makes it more difficult to add a new
> dialect and violates modularity.

Agreed.  Any patch that might improve this without sacrificing
performance is welcome.  However, I suspect that you might discover why
things are done the way they are if you attempt such a patch.

We're slowly working on combining the C/C++/ObjC front ends into one
mahousive front end that handles them all.  So anything other than a
clean fix of this is probably not worth it.  I'm not sure such a thing
exists.

Neil.


* C++ lexer (GCC 3.1.1) requires knowledge of other C dialects
@ 2002-07-31  8:10 Gary Funck
  2002-07-31 20:30 ` Neil Booth
  0 siblings, 1 reply; 12+ messages in thread
From: Gary Funck @ 2002-07-31  8:10 UTC (permalink / raw)
  To: Gcc Mailing List


While moving changes made to GCC to support an experimental dialect of C, known
as UPC, I ran into a problem: After adding the new language support in a
fashion similar to ObjC, the C++ compiler was no longer able to properly lex
and parse C++ programs. It turns out that the difficulty was the result of the
method used in cp/lex.c to recognize reserved words:

   481
   482  /* Table mapping from RID_* constants to yacc token numbers.
   483     Unfortunately we have to have entries for all the keywords in all
   484     three languages.  */
   485  const short rid_to_yy[RID_MAX] =
   486  {
   487    /* RID_STATIC */      SCSPEC,
   488    /* RID_UNSIGNED */    TYPESPEC,
   489    /* RID_LONG */        TYPESPEC,

The rid_to_yy table requires an entry for each reserved word in *all* supported
C dialects, and the table is ordered by increasing RID_* values.  A few
observations:

1) This dependency on other languages makes it more difficult to add a new
dialect and violates modularity.

2) If the dependency must exist, it would make the job of adding another
dialect easier if, for example, a new type of *.def file were introduced under
gcc which would both generate the values currently in c-common.h and fill in
the table required by cp/lex.c.

3) At a minimum, would it be possible to add a check in the C++ parser
initialization routine, which somehow checks for consistency (perhaps the
number of elements in rid_to_yy can be checked against the value of RID_MAX?),
and aborts with a diagnostic if something in the definition of rid_to_yy seems
inconsistent or incorrect?

4) I haven't had a chance to read the new internals document yet, but if there
is a section on adding a new dialect, it should discuss this dependency between
C++ and the other C dialects.



end of thread, other threads:[~2002-08-02 18:19 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-08-02 11:19 C++ lexer (GCC 3.1.1) requires knowledge of other C dialects Robert Dewar
  -- strict thread matches above, loose matches on Subject: below --
2002-08-02  7:53 Robert Dewar
2002-08-02  9:16 ` Gary Funck
2002-08-01  3:21 Robert Dewar
2002-08-01 11:59 ` Gary Funck
2002-08-01 14:42   ` Joe Buck
2002-08-02  7:41     ` Gary Funck
2002-07-31  8:10 Gary Funck
2002-07-31 20:30 ` Neil Booth
2002-07-31 21:58   ` Gary Funck
2002-07-31 22:32     ` Neil Booth
2002-07-31 23:03     ` Stan Shebs
