Parser rewritting

public inbox for archer@sourceware.org

* Parser rewritting
@ 2010-03-30 18:46 Sergio Durigan Junior
  2010-03-30 19:05 ` Chris Moller
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Sergio Durigan Junior @ 2010-03-30 18:46 UTC
  To: Project Archer

Hello!

As you may have noticed, in the last Archer meeting I brought up a topic for 
discussion: rewriting GDB's parser.  The current parser is written using 
Bison, and unfortunately it is insufficient to satisfy our current needs, 
especially for C++ productions.

With that in mind, Tom asked me to start this discussion on the mailing list 
to see what you think about it.  We decided to send an e-mail to the archer 
list first; this topic will eventually be discussed on the gdb list as 
well.

I am sorry I took so long to send this e-mail, but I was trying to come up 
with an initial plan to re-implement the parser.  I've been studying the 
GCC/G++ parsers in order to understand how they work, but I noticed that it 
would take some time for me to devise a good plan.  I also noticed that other 
people here have (much!!) more experience with parsers than I do, so why not 
present this idea and see what you think?

The initial idea (by Tom) would be to mimic the current structure of the G++ 
parser.  There is also another proposal (from Keith), but I don't know if he 
wants it to be listed here :-).  Feel free to post it, Keith!

Any more ideas?  Comments about the existing ideas are also welcome, of 
course.  Meanwhile, I'll continue studying this parser stuff and will try to 
propose something useful before long.

Regards,

-- 
Sergio Durigan Junior
Debugger Engineer
Red Hat Inc.

* Re: Parser rewritting
  2010-03-30 18:46 Parser rewritting Sergio Durigan Junior
@ 2010-03-30 19:05 ` Chris Moller
  2010-03-30 21:12   ` Tom Tromey
  2010-03-30 21:18 ` Tom Tromey
  2010-04-02  1:50 ` Chris Moller
  2 siblings, 1 reply; 14+ messages in thread
From: Chris Moller @ 2010-03-30 19:05 UTC
  To: Sergio Durigan Junior; +Cc: Project Archer

There are a couple of antlr C++ parsers available:

http://hg.netbeans.org/main/file/tip/cnd.modelimpl/src/org/netbeans/modules/cnd/modelimpl/parser/cppparser.g

http://www.antlr.org/grammar/1198064893071/CPP_parser_v_3.2.zip

as well as a C++ preprocessor:

http://hg.netbeans.org/main/file/tip/cnd.apt/src/org/netbeans/modules/cnd/apt/impl/support/aptlexer.g

I don't know how easy/hard they'd be to adapt to use in GDB, but they 
might be worth looking at.  And I'm just about certain it would be 
easier to use them than to write a whole new parser--antlr is a lot less 
weird than bison.

There's an antlr package in brewroot.

* Re: Parser rewritting
  2010-03-30 19:05 ` Chris Moller
@ 2010-03-30 21:12   ` Tom Tromey
  2010-04-04  8:50     ` Dodji Seketeli
  0 siblings, 1 reply; 14+ messages in thread
From: Tom Tromey @ 2010-03-30 21:12 UTC
  To: Chris Moller; +Cc: Sergio Durigan Junior, Project Archer

>>>>> "Chris" == Chris Moller <cmoller@redhat.com> writes:

Chris> There are a couple of antlr C++ parsers available:
Chris> http://hg.netbeans.org/main/file/tip/cnd.modelimpl/src/org/netbeans/modules/cnd/modelimpl/parser/cppparser.g

We can't generally reuse code like this due to copyright assignment
requirements.

Chris> And I'm just about certain it would be easier to use them than to
Chris> write a whole new parser--antlr is a lot less weird than bison.

My preferred route is to hand-write a recursive descent parser, mimicking
the structure of the existing code in g++.

I think directly sharing code is impractical due to impedance mismatch
between gdb and g++ internals.  Also our goals are slightly different,
in that in gdb we only need to parse expressions, we want a single
parser for C and C++, and finally gdb must implement certain language
extensions.

Using a parser generator may be ok, but I think there are benefits to
following an existing parser.  Also, parsers like the one in g++ are
simpler to debug.  (Of course, maybe that is a problem we should solve
as well :-)
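
To make the hand-written recursive descent style concrete, here is a minimal
sketch, assuming a toy grammar of integer literals, + - * / and parentheses.
There is one C function per grammar rule, which is why such parsers are easy
to step through in a debugger.  All names (struct parser, parse_primary,
parse_and_eval, ...) are hypothetical and are not taken from gdb or g++, and
the sketch evaluates directly instead of building gdb's expression
representation.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical parser state: an input cursor plus an error flag.  */
struct parser { const char *p; int error; };

static long parse_expression (struct parser *ps);

static void
skip_ws (struct parser *ps)
{
  while (isspace ((unsigned char) *ps->p))
    ps->p++;
}

/* primary: NUMBER | '(' expression ')'  */
static long
parse_primary (struct parser *ps)
{
  long v = 0;

  skip_ws (ps);
  if (*ps->p == '(')
    {
      ps->p++;
      v = parse_expression (ps);
      skip_ws (ps);
      if (*ps->p == ')')
        ps->p++;
      else
        ps->error = 1;          /* Missing close paren.  */
    }
  else if (isdigit ((unsigned char) *ps->p))
    {
      while (isdigit ((unsigned char) *ps->p))
        v = v * 10 + (*ps->p++ - '0');
    }
  else
    ps->error = 1;              /* Neither a number nor '('.  */
  return v;
}

/* term: primary (('*' | '/') primary)*  */
static long
parse_term (struct parser *ps)
{
  long rhs, v = parse_primary (ps);
  char op;

  for (;;)
    {
      skip_ws (ps);
      op = *ps->p;
      if (op != '*' && op != '/')
        return v;
      ps->p++;
      rhs = parse_primary (ps);
      if (op == '/' && rhs == 0)
        ps->error = 1;          /* Division by zero.  */
      else
        v = (op == '*') ? v * rhs : v / rhs;
    }
}

/* expression: term (('+' | '-') term)*  */
static long
parse_expression (struct parser *ps)
{
  long rhs, v = parse_term (ps);
  char op;

  for (;;)
    {
      skip_ws (ps);
      op = *ps->p;
      if (op != '+' && op != '-')
        return v;
      ps->p++;
      rhs = parse_term (ps);
      v = (op == '+') ? v + rhs : v - rhs;
    }
}

/* Entry point: returns 0 and stores the value in *RESULT on success,
   or -1 if TEXT is not a complete, valid expression.  */
int
parse_and_eval (const char *text, long *result)
{
  struct parser ps;
  long v;

  assert (text != NULL && result != NULL);
  ps.p = text;
  ps.error = 0;
  v = parse_expression (&ps);
  skip_ws (&ps);
  if (ps.error || *ps.p != '\0')
    return -1;
  *result = v;
  return 0;
}
```

Under these assumptions, parse_and_eval ("1 + 2 * 3", &v) succeeds with v
set to 7, while "1 +" and "(1" are rejected.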

Tom

* Re: Parser rewritting
  2010-03-30 18:46 Parser rewritting Sergio Durigan Junior
  2010-03-30 19:05 ` Chris Moller
@ 2010-03-30 21:18 ` Tom Tromey
  2010-03-30 22:20   ` Keith Seitz
  2010-04-02  1:50 ` Chris Moller
  2 siblings, 1 reply; 14+ messages in thread
From: Tom Tromey @ 2010-03-30 21:18 UTC
  To: Sergio Durigan Junior; +Cc: Project Archer

>>>>> "Sergio" == Sergio Durigan Junior <sergiodj@redhat.com> writes:

Sergio> The current parser is written using Bison, and unfortunately it
Sergio> is insufficient to satisfy our current needs, especially for C++
Sergio> productions.

A few particulars...

We ran into some problems with the function-like cast notation.  I think
those are probably fixable, by not differentiating different kinds of
names here, but we think there will be more problems.  E.g., I suspect
we'll run into problems when we get rid of the template name hack in the
lexer.

Also, there is no good way in bison to disable a production only when
the parsing language is C++.  You can play games by returning different
tokens in different modes, or you can run a preprocessor on the grammar,
but both of those are pretty ugly.
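
For comparison, in a hand-written parser a language-specific production is
just a runtime check.  A minimal sketch, assuming a toy "name" rule in which
the '::' form is accepted only when the language is C++; enum lang,
scan_identifier and parse_name are hypothetical names, not gdb's.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical language selector; gdb's real language machinery differs.  */
enum lang { LANG_C, LANG_CPLUS };

/* identifier: [A-Za-z_][A-Za-z0-9_]*
   Returns the number of characters consumed, or 0 if none match.  */
static size_t
scan_identifier (const char *s)
{
  size_t n = 0;

  if (isalpha ((unsigned char) s[0]) || s[0] == '_')
    for (n = 1; isalnum ((unsigned char) s[n]) || s[n] == '_'; n++)
      ;
  return n;
}

/* name: identifier            (C and C++)
       | name '::' identifier  (C++ only)
   Returns 1 if TEXT is entirely a valid name in LANGUAGE, else 0.  */
int
parse_name (const char *text, enum lang language)
{
  size_t n;

  assert (text != NULL);
  n = scan_identifier (text);
  if (n == 0)
    return 0;
  text += n;

  /* The C++-only production is guarded by a plain runtime test, so no
     token games or grammar preprocessing are needed.  */
  while (language == LANG_CPLUS && text[0] == ':' && text[1] == ':')
    {
      n = scan_identifier (text + 2);
      if (n == 0)
        return 0;
      text += 2 + n;
    }
  return *text == '\0';
}
```

With this sketch, "ns::foo" is accepted for LANG_CPLUS and rejected for
LANG_C, while plain "foo" is accepted for both.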

Tom

* Re: Parser rewritting
  2010-03-30 21:18 ` Tom Tromey
@ 2010-03-30 22:20   ` Keith Seitz
  2010-03-30 22:59     ` Tom Tromey
  0 siblings, 1 reply; 14+ messages in thread
From: Keith Seitz @ 2010-03-30 22:20 UTC
  To: Project Archer

On 03/30/2010 02:18 PM, Tom Tromey wrote:
> Also, there is no good way in bison to disable a production only when
> the parsing language is C++.  You can play games by returning different
> tokens in different modes, or you can run a preprocessor on the grammar,
> but both of those are pretty ugly.

Do we really need to worry about C vs C++? How dangerous would it be to 
simply assume C++? [I know there is a subtle difference between the two, 
I just wonder whether it would matter that much in usage to warrant 
treating the two differently/independently.]

I also worry more about three other areas that might influence 
design/implementation decisions:

1) Java? Okay, we could probably work around this by using the current 
parser for Java (ick!) [Do we even consider adding Java to the mix worth 
it? I don't, but that's just my opinion...]

2) Linespec re-evaluation: Let's face it, a number of us have had to 
deal with problems in linespec.c, and we all know it's a nightmare. 
Anyone (else) interested in moving to expression-based linespec processing?

3) Symbol table cleanups: I get a sinking feeling that the symbol table 
API may need some work before any attempt at writing a new parser may be 
started.

Specifically, when a symbol lookup happens, we should get ALL matching 
symbols, not just the first one found. [Maybe that's just me?] I know 
this was a constant barricade when trying to implement overload 
resolution in the parser. And to this day, we cannot implement overload 
resolution on a non-class function. A nice side-effect of this: it would 
help with symbol completion.

Heck, I might even just settle for something that says there are 
multiple matches...
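
A minimal sketch of what such a lookup could look like, assuming a toy flat
symbol table.  struct toy_symbol and lookup_all_symbols are hypothetical and
are not gdb's symbol table API; the point is only that the caller gets every
match, or at least a count of them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy symbol; gdb's real struct symbol is far richer.  */
struct toy_symbol
{
  const char *name;       /* Unqualified name.  */
  const char *signature;  /* Parameter list, to tell overloads apart.  */
};

/* Collect every symbol in TABLE named NAME.  Up to MAX matches are
   stored in OUT (which may be NULL if MAX is 0).  The return value is
   the total number of matches, which may exceed MAX, so a caller can
   at least learn that a name is ambiguous.  */
size_t
lookup_all_symbols (const struct toy_symbol *table, size_t table_len,
                    const char *name,
                    const struct toy_symbol **out, size_t max)
{
  size_t i, found = 0;

  assert (table != NULL && name != NULL);
  for (i = 0; i < table_len; i++)
    if (strcmp (table[i].name, name) == 0)
      {
        if (out != NULL && found < max)
          out[found] = &table[i];
        found++;
      }
  return found;
}
```

Calling it with OUT set to NULL and MAX set to 0 gives the cheaper "are there
multiple matches?" query; passing a real array gives the full candidate set
that overload resolution or completion would want.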

Keith

* Re: Parser rewritting
  2010-03-30 22:20   ` Keith Seitz
@ 2010-03-30 22:59     ` Tom Tromey
  2010-03-31  2:01       ` Matt Rice
  0 siblings, 1 reply; 14+ messages in thread
From: Tom Tromey @ 2010-03-30 22:59 UTC
  To: Keith Seitz; +Cc: Project Archer

>>>>> "Keith" == Keith Seitz <keiths@redhat.com> writes:

Keith> Do we really need to worry about C vs C++? How dangerous would it be
Keith> to simply assume C++? [I know there is a subtle difference between the
Keith> two, I just wonder whether it would matter that much in usage to
Keith> warrant treating the two differently/independently.]

There are plenty of unsubtle distinctions as well, like all the
additional operator names in C++.  This we have to handle, though of
course we already have an adequate solution here.

Offhand I don't know if there are productions which would cause
confusion if enabled in C.  Maybe not.  It still seems less potentially
confusing and perhaps mildly more future-proof to follow each language
spec relatively closely.

Keith> 1) Java? Okay, we could probably work around this by using the current
Keith> parser for java (ick!) [Do we even consider adding java to the mix
Keith> worth it? I don't, but that's just my opinion...]

Let's leave Java alone.  It is "good enough" and really reworking it
isn't our mandate.

If we were going to really consider merging another language into this
effort, I would say ObjC, which currently has its own fork of c-exp.y,
minus most of the bug fixes from the last couple of years.  But even
there, I would rather have somebody knowledgeable and interested in ObjC
do it.

Keith> 2) Linespec re-evaluation: Let's face it, a number of us have had to
Keith> deal with problems in linespec.c, and we all know it's a
Keith> nightmare. Anyone (else) interested in moving to expressions-based
Keith> linespec processing?

Yeah, I think we need a better parser in linespec.c, but I see that as
mostly orthogonal.  Maybe we would need a second entry point to each
language's expression parser to let us ask for just a "function name"
production, but otherwise I don't think there is a big overlap.  This
can easily be retrofitted into the bison-based parsers if needed.

Keith> 3) Symbol table cleanups: I get a sinking feeling that the symbol
Keith> table API may need some work before any attempt at writing a new
Keith> parser may be started.

Keith> Specifically, when a symbol lookup happens, we should get ALL matching
Keith> symbols, not just the first one found. [Maybe that's just me?]

I tend to agree with this idea, though I haven't thought through all the
ramifications.

But this can also be done independently, I think.  The overload
resolution stuff is largely done at evaluation time, not in the parser
(which makes sense if you want to choose different overloads depending
on the value of a convenience variable, which doesn't have a static
type).  So here we would need the symbol table change and perhaps an IR
change -- but not, I think, a parser change.
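
A minimal sketch of why this belongs at evaluation time, assuming toy values
that carry a runtime type tag the way a convenience variable does.  The names
(struct toy_value, struct toy_overload, resolve_overload) are hypothetical,
and real overload resolution also ranks conversions; this only does exact
matching on a single parameter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical runtime type tags; gdb's struct type is far richer.  */
enum value_kind { KIND_INT, KIND_DOUBLE };

/* A value whose type is only known at evaluation time, like a
   convenience variable.  */
struct toy_value
{
  enum value_kind kind;
  union { long i; double d; } u;
};

/* One overload candidate, described by its single parameter type.  */
struct toy_overload
{
  const char *signature;
  enum value_kind param;
};

/* Pick the candidate whose parameter type matches ARG's current type.
   Returns NULL if no candidate matches.  */
const struct toy_overload *
resolve_overload (const struct toy_overload *cands, size_t n,
                  const struct toy_value *arg)
{
  size_t i;

  assert (cands != NULL && arg != NULL);
  for (i = 0; i < n; i++)
    if (cands[i].param == arg->kind)
      return &cands[i];
  return NULL;
}
```

The same parsed call can therefore bind to f(int) at one moment and to
f(double) after the variable has been reassigned, which a parse-time
decision could not do.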

IMO, the first goal for a rewrite of the parser should just be feature
parity.  It is just changing how we express the parser, from bison to
(say) recursive descent.  Then we can start adding features, fixing
bugs, and moving hacks out of the lexer and into the parser.

Tom

* Re: Parser rewritting
  2010-03-30 22:59     ` Tom Tromey
@ 2010-03-31  2:01       ` Matt Rice
  0 siblings, 0 replies; 14+ messages in thread
From: Matt Rice @ 2010-03-31  2:01 UTC
  To: Tom Tromey; +Cc: Keith Seitz, Project Archer

On Tue, Mar 30, 2010 at 3:59 PM, Tom Tromey <tromey@redhat.com> wrote:
>>>>>> "Keith" == Keith Seitz <keiths@redhat.com> writes:
>
> Keith> 1) Java? Okay, we could probably work around this by using the current
> Keith> parser for java (ick!) [Do we even consider adding java to the mix
> Keith> worth it? I don't, but that's just my opinion...]
>
> Let's leave Java alone.  It is "good enough" and really reworking it
> isn't our mandate.
>
> If we were going to really consider merging another language into this
> effort, I would say ObjC, which currently has its own fork of c-exp.y,
> minus most of the bug fixes from the last couple of years.  But even
> there, I would rather have somebody knowledgeable and interested in ObjC
> do it.
>

I would agree with Tom; ObjC is a strict superset of C, so it would be
a lot easier to bolt on top of the new C parser.
A unified parser could also have good implications for me:
having been debugging ObjC++ code, it is quite a pain to have to split
up expressions and 'set language' in the middle of the split-up
expression.
So I would be willing to put some time into getting ObjC working on
what you guys come up with, and keeping an eye on your progress with
this in mind.  It's not really something I'd expect you guys to undertake
just for fun.

So first I need to start making test cases of the things the ObjC
parser currently handles,
then ObjC++ cases it doesn't currently handle.

With any luck the differences between C and C++ will also be
applicable, and adding ObjC support to the parser will not add
unforeseen issues (I won't really hold my breath on that until I see
it...).  If that is not the case, adding a second set of problems now
won't get you guys any closer to your goal, while having the first set
of problems solved by your parser would surely help when dealing with
the second set from the ObjC perspective.

* Re: Parser rewritting
  2010-03-30 18:46 Parser rewritting Sergio Durigan Junior
  2010-03-30 19:05 ` Chris Moller
  2010-03-30 21:18 ` Tom Tromey
@ 2010-04-02  1:50 ` Chris Moller
  2010-04-08 19:21   ` Tom Tromey
  2 siblings, 1 reply; 14+ messages in thread
From: Chris Moller @ 2010-04-02  1:50 UTC
  To: Sergio Durigan Junior; +Cc: Project Archer

On 03/30/10 14:46, Sergio Durigan Junior wrote:
> Hello!
>
> As you may have noticed, in the last Archer meeting I brought a topic into
> discussion: the rewriting of the GDB's parser.  The current parser is written
> using Bison, and unfortunately it is insufficient to satisfy our current
> needs, especially for C++ productions.
>
> With that in mind, Tom asked me to start this discussion in the mailing-list
> to see what you think about it.  We decided to send an e-mail to the archer
> list at first; this topic will eventually be discussed at the gdb list as
> well.
>    

A lot of years ago I wrote a fairly elaborate parser using 
antlr--definitely a cool tool and I recommend you consider it.  It's a 
predicated LL(*) parser generator--the "predicated" bit making it 
possible, among other things, to handle the context-dependent bits of  
C/C++ grammar.

Just as an example, I've attached a rudimentary antlr grammar that 
parses a subset of C/C++ decls--if you look, you'll see that the rules 
look a lot like the specifications in the C++ standard, and in fact 
started out as a cut'n'paste of those specs.  Also, if you look in the 
grammar for "is_cpp," you can see how rule predicates can be used to have 
the parser do different things depending on circumstances.

Anyway, it's probably worth considering.

(In addition to the .g file attached, I wrote a couple of other .c and .h 
files that make it all work.  I'll make them available if anyone wants them.)

Chris


[-- Attachment #2: CPPparser.g --]
[-- Type: text/plain, Size: 8503 bytes --]

grammar CPPparser;

options
{
    language = C;
    backtrack = true;
}

@header
{
#include "pd.h"
}

decl_specifier
    @init {
        bzero(&data_type, sizeof(data_type));
        data_type.data_type = DATA_TYPE_TYPE_DESC;
    }
    @after {
        print_data (&data_type);
    }
    :
      storage_class_specifier* WS?
      type_specifier?
      function_specifier?
      FRIEND?
      TYPEDEF?
      CONSTEXPR?
    /* | alignment_specifier */
    ;

storage_class_specifier :
        REGISTER
        {
            if (type_desc.storage_class == STORAGE_CLASS_NONE)
                type_desc.storage_class = STORAGE_CLASS_REGISTER;
            else fprintf (stderr, "Storage class already set.\n");
        }
    |   STATIC
        {
            if (type_desc.storage_class == STORAGE_CLASS_NONE)
                type_desc.storage_class = STORAGE_CLASS_STATIC;
            else fprintf (stderr, "Storage class already set.\n");
        }
    |   THREAD_LOCAL
    |   EXTERN
        {
            if (type_desc.storage_class == STORAGE_CLASS_NONE)
                type_desc.storage_class = STORAGE_CLASS_EXTERN;
            else fprintf (stderr, "Storage class already set.\n");
        }
    |   MUTABLE
    ;

type_specifier 
        @init {
            type_desc.type = TYPE_CODE_UNSET;
            type_desc.size = -1;
            type_desc.nr_longs   = 0;
            type_desc.nosign_bit = 1;
            type_desc.signed_bit = 0;
        } 
    :
        ( ({!is_cpp}? simple_type_specifier
            | {is_cpp}? simple_type_specifier_cpp) WS?)+
       | class_specifier
    /* | enum_specifier */
    /* | elaborated_type_specifier */
    /* | typename_specifier */
    | cv_qualifier*
    ;

simple_type_specifier :
        /*   nested_name_specifier? type_name */ 
        /* | nested_name_specifier TEMPLATE type_name */ 
      CHAR
        {
            type_desc.type = TYPE_CODE_INT;
            type_desc.size = sizeof(char);
        }
    | WCHAR_T
        {
            type_desc.type = TYPE_CODE_INT;
            type_desc.size = sizeof(wchar_t);
        }
        /*
    | BOOL
        {
            type_desc.type = TYPE_CODE_INT;
            type_desc.size = sizeof(bool);
        }
    */
    | SHORT
        {
            type_desc.type = TYPE_CODE_INT;
            type_desc.size = sizeof(short);
        }
    | INT
        {
            type_desc.type = TYPE_CODE_INT;
            switch (type_desc.nr_longs) {
                case 0: type_desc.size = sizeof(int); break;
                case 1: type_desc.size = sizeof(long int); break;
                case 2: type_desc.size = sizeof(long long int); break;
            }
        }
    | LONG        
        {
            if (type_desc.type == TYPE_CODE_UNSET)
                type_desc.type = TYPE_CODE_INT;
            if (type_desc.nr_longs < 2) type_desc.nr_longs++;
            switch (type_desc.nr_longs) {
                case 0: type_desc.size = sizeof(int); break;
                case 1: type_desc.size = sizeof(long int); break;
                case 2: type_desc.size = sizeof(long long int); break;
            }
        }
    | SIGNED
        {
            type_desc.nosign_bit = 0;
            type_desc.signed_bit = 1;
        }
    | UNSIGNED
        {
            type_desc.nosign_bit = 0;
            type_desc.signed_bit = 0;
        }
    | FLOAT
        {
            type_desc.type = TYPE_CODE_FLT;
            type_desc.size = sizeof(float);
        }
    | DOUBLE
        {
            type_desc.type = TYPE_CODE_FLT;
            type_desc.size = (type_desc.nr_longs > 0) ? sizeof(long double)
                : sizeof(double);
        }
    | VOID
        {
            type_desc.type = TYPE_CODE_VOID;
        }
    | AUTO
        {
        }
        /*  | decltype ( expression) */
    ;


simple_type_specifier_cpp :
        /*   nested_name_specifier? type_name */ 
        /* | nested_name_specifier TEMPLATE type_name */ 
      Char
        {
            type_desc.type = TYPE_CODE_INT;
            type_desc.size = sizeof(char);
        }
    | Wchar_t
        {
            type_desc.type = TYPE_CODE_INT;
            type_desc.size = sizeof(wchar_t);
        }
        /*
    | Bool
        {
            type_desc.type = TYPE_CODE_INT;
            type_desc.size = sizeof(bool);
        }
    */
    | Short
        {
            type_desc.type = TYPE_CODE_INT;
            type_desc.size = sizeof(short);
        }
    | Int
        {
            type_desc.type = TYPE_CODE_INT;
            switch (type_desc.nr_longs) {
                case 0: type_desc.size = sizeof(int); break;
                case 1: type_desc.size = sizeof(long int); break;
                case 2: type_desc.size = sizeof(long long int); break;
            }
        }
    | Long        
        {
            if (type_desc.type == TYPE_CODE_UNSET)
                type_desc.type = TYPE_CODE_INT;
            if (type_desc.nr_longs < 2) type_desc.nr_longs++;
            switch (type_desc.nr_longs) {
                case 0: type_desc.size = sizeof(int); break;
                case 1: type_desc.size = sizeof(long int); break;
                case 2: type_desc.size = sizeof(long long int); break;
            }
        }
    | Signed
        {
            type_desc.nosign_bit = 0;
            type_desc.signed_bit = 1;
        }
    | Unsigned
        {
            type_desc.nosign_bit = 0;
            type_desc.signed_bit = 0;
        }
    | Float
        {
            type_desc.type = TYPE_CODE_FLT;
            type_desc.size = sizeof(float);
        }
    | Double
        {
            type_desc.type = TYPE_CODE_FLT;
            type_desc.size = (type_desc.nr_longs > 0) ? sizeof(long double)
                : sizeof(double);
        }
    | Void
        {
            type_desc.type = TYPE_CODE_VOID;
        }
    | Auto
        {
        }
        /*  | decltype ( expression) */
    ;


class_specifier : class_head '{' member_specification* '}'
    ;

class_head :
        class_key identifier?
        /* | nested_name_specifier identifier base_clause? */
        /* | nested_name_specifier? simple_template_id base_clause? */
    ;

member_specification:
      type_specifier identifier initialiser? ';'
    | scope_specifier ':'
    ;
        

class_key :
      CLASS
    | STRUCT
    | UNION
    ;

scope_specifier :
      PRIVATE
    | PUBLIC
    | PROTECTED
    ;

initialiser : '='
      numeric
        /* | string */
        /* | array */
    ;
numeric :
      FIXED
    | FLOATING
    | EXPO
    ;

identifier : ALPHAI ALPHAC ;

/*
type_name:
      class_name
      enum_name
      typedef_name
    ;
*/


cv_qualifier :
      CONST
    | VOLATILE
    ;
        
function_specifier :
      INLINE
    | VIRTUAL
    | EXPLICIT
    ;

/* Literals for decl_specifier. */
FRIEND       : 'friend' ;
TYPEDEF      : 'typedef' ;
CONSTEXPR    : 'constexpr' ;

/* Literals for storage_class_specifier. */
REGISTER     : 'register' ;
STATIC       : 'static' ;
THREAD_LOCAL : 'thread_local' ;
EXTERN       : 'extern' ;
MUTABLE      : 'mutable' ;

/* Literals for function_specifier. */
INLINE       : 'inline' ;
VIRTUAL      : 'virtual' ;
EXPLICIT     : 'explicit' ;

/* Literals for simple_type_specifier. */
CHAR         : 'char' ;
WCHAR_T      : 'wchar_t' ;
BOOL         : 'bool' ;
SHORT        : 'short' ;
INT          : 'int' ;
LONG         : 'long' ;
SIGNED       : 'signed' ;
UNSIGNED     : 'unsigned' ;
FLOAT        : 'float' ;
DOUBLE       : 'double' ;
VOID         : 'void' ;
AUTO         : 'auto' ;

/* Literals for simple_type_specifier_cpp. */
Char         : 'Char' ;
Wchar_t      : 'Wchar_t' ;
Bool         : 'Bool' ;
Short        : 'Short' ;
Int          : 'Int' ;
Long         : 'Long' ;
Signed       : 'Signed' ;
Unsigned     : 'Unsigned' ;
Float        : 'Float' ;
Double       : 'Double' ;
Void         : 'Void' ;
Auto         : 'Auto' ;

/* Literals for cv_qualifier. */
CONST        : 'const' ;
VOLATILE     : 'volatile' ;

/* Literals for class_key. */
CLASS        : 'class' ;
STRUCT       : 'struct' ;
UNION        : 'union' ;

/* Literals for scope_specifier. */
PRIVATE      : 'private' ;
PUBLIC       : 'public' ;
PROTECTED    : 'protected' ;

SIGN     : ('+' | '-') ;
INTEGER  : ('0'..'9') ;
FIXED    : INTEGER+ ;
FLOATING : INTEGER '.' INTEGER* ;
EXPO     : INTEGER ('.' INTEGER*)? ('e' | 'E') SIGN? INTEGER ;

ALPHAI   : ('a'..'z' | 'A'..'Z' | '_') ;
ALPHAC   : (ALPHAI | INTEGER)* ;
NEWLINE  : '\r' ? '\n' ;
WS       : (' ' |'\t' |'\n' |'\r' )* /*{ SKIP(); }*/ ;

* Re: Parser rewritting
  2010-03-30 21:12   ` Tom Tromey
@ 2010-04-04  8:50     ` Dodji Seketeli
  2010-04-08 19:28       ` Tom Tromey
  0 siblings, 1 reply; 14+ messages in thread
From: Dodji Seketeli @ 2010-04-04  8:50 UTC
  To: Tom Tromey; +Cc: Chris Moller, Sergio Durigan Junior, Project Archer

On Tue, Mar 30, 2010 at 03:12:26PM -0600, Tom Tromey wrote:

[...]

> Chris> There are a couple of antlr C++ parsers available:
> Chris> http://hg.netbeans.org/main/file/tip/cnd.modelimpl/src/org/netbeans/modules/cnd/modelimpl/parser/cppparser.g
> 
> We can't generally reuse code like this due to copyright assignment
> requirements.

Would the copyright assignment requirements prevent us from trying to 
reuse, say, Clang? Maybe one could think about providing a C API on top 
of Clang and consider Clang as an external dependency? If not, then my 
point is to mention it explicitly and make sure we considered the 
option and ruled it out based on sound reasons.

[...]

> My preferred route is to hand-write a recursive descent parser, mimicking
> the structure of the existing code in g++.
> 
> I think directly sharing code is impractical due to impedance mismatch
> between gdb and g++ internals.  Also our goals are slightly different,
> in that in gdb we only need to parse expressions, we want a single
> parser for C and C++, and finally gdb must implement certain language
> extensions.

I understand that this minimal parser is meant to stay simple (e.g. no
preprocessing support, very minimal error reporting if any at all, no 
semantic analysis, etc.), but still, if we can't reuse Clang, would 
it be possible to devise this new "minimal parser" as an independent,
reusable library with its own dejagnu-free testsuite?
Maybe other projects would be interested in using (and extending) 
something like that.

        Dodji

* Re: Parser rewritting
  2010-04-02  1:50 ` Chris Moller
@ 2010-04-08 19:21   ` Tom Tromey
  2010-04-08 20:21     ` Chris Moller
  0 siblings, 1 reply; 14+ messages in thread
From: Tom Tromey @ 2010-04-08 19:21 UTC
  To: Chris Moller; +Cc: Sergio Durigan Junior, Project Archer

Chris> A lot of years ago I wrote a fairly elaborate parser using
Chris> antlr--definitely a cool tool and I recommend you consider it.

One thing to ensure is that the antlr output is GPL-compatible.
If not, we can't use it.

Chris> Just as an example, I've attached a rudimentary antlr grammar that
Chris> parses a subset of C/C++ decls

We only need expressions.

Chris> Anyway, it's probably worth considering.

While I still think it makes the most sense to mimic g++, I am open to
other solutions that are powerful enough.

Another thing worth considering is bison's GLR mode.  This has the
advantage that we wouldn't actually need to rewrite the whole parser, we
could just start by tweaking it.

Using tools that generate code is problematic in GDB, because people
complain about every new dependency.  Even requiring bison will probably
generate complaints, because AFAIK some people still do their builds
with byacc.  Maybe we could check in the generated code, though.

Tom

* Re: Parser rewritting
  2010-04-04  8:50     ` Dodji Seketeli
@ 2010-04-08 19:28       ` Tom Tromey
  2010-04-10 22:05         ` Jim Blandy
  0 siblings, 1 reply; 14+ messages in thread
From: Tom Tromey @ 2010-04-08 19:28 UTC
  To: Dodji Seketeli; +Cc: Chris Moller, Sergio Durigan Junior, Project Archer

>>>>> "Dodji" == Dodji Seketeli <dodji@redhat.com> writes:

Dodji> Would the copyright assignment requirements prevent us from trying to 
Dodji> reuse, say, Clang? Maybe one could think about providing a C api on top 
Dodji> of Clang and consider Clang as an external dependency?

This can be done; after all, we do it with Python :)

A new external dependency always causes trouble, though.  Look through
the archives to see the discussions around expat, python, and libiconv.
A required external dependency will be trouble.

Anyway, I suspect the impedance mismatch problem holds equally for clang.
It is probably worth verifying that.

Dodji> I understand that this minimal parser is meant to stay simple, e.g. no
Dodji> preprocessing support, very minimal error reporting if any at all, no 
Dodji> semantic analysis etc, but still, if we can't re-use Clang, then would 
Dodji> it be possible to devise this new "minimal parser" as an independent,
Dodji> reusable library with its own dejagnu-free testsuite?
Dodji> Maybe other projects might be interested in using (and extending) 
Dodji> something like that.

I'm not opposed to this but I don't want to slow down our progress to
make a library.

Tom

* Re: Parser rewritting
  2010-04-08 19:21   ` Tom Tromey
@ 2010-04-08 20:21     ` Chris Moller
  0 siblings, 0 replies; 14+ messages in thread
From: Chris Moller @ 2010-04-08 20:21 UTC
  To: Tom Tromey; +Cc: Sergio Durigan Junior, Project Archer

On 04/08/10 15:21, Tom Tromey wrote:
> Chris>  A lot of years ago I wrote a fairly elaborate parser using
> Chris>  antlr--definitely a cool tool and I recommend you consider it.
>
> One thing to ensure is that the antlr output is GPL-compatible.
> If not, we can't use it.
>    

antlr.org says that ANTLR itself is under "The BSD License," which looks 
to me like a small subset of GPLv2, but IANAL.  I couldn't find anything 
about licensing for the generated code.

http://www.antlr.org/wiki/display/Mantra/License

> Chris>  Just as an example, I've attached a rudimentary antlr grammar that
> Chris>  parses a subset of C/C++ decls
>
> We only need expressions.
>
> Chris>  Anyway, it's probably worth considering.
>
> While I still think it makes the most sense to mimic g++, I am open to
> other solutions that are powerful enough.
>
> Another thing worth considering is bison's GLR mode.  This has the
> advantage that we wouldn't actually need to rewrite the whole parser, we
> could just start by tweaking it.
>
> Using tools that generate code is problematic in GDB, because people
> complain about every new dependency.  Even requiring bison will probably
> generate complaints, because AFAIK some people still do their builds
> with byacc.  Maybe we could check in the generated code, though.
>    

With one exception, ANTLR, including v3, is available under at least 
Fedora--I don't know about RHEL.  The exception is the v3 C target-language 
support, which I had to install separately, but I expect it could be 
included in the antlrv3 package.

The generated code is kinda big.  The source for the antlr C/C++ 
expression parser I wrote totals 737 lines, about 500 of which are C 
support code--the antlr grammar is only 239 lines.  But those 239 lines 
get turned into about 8800 lines of combined lexer and parser.

> Tom
>    

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Parser rewritting
  2010-04-08 19:28       ` Tom Tromey
@ 2010-04-10 22:05         ` Jim Blandy
  2010-04-10 22:11           ` Jim Blandy
  0 siblings, 1 reply; 14+ messages in thread
From: Jim Blandy @ 2010-04-10 22:05 UTC (permalink / raw)
  To: Tom Tromey
  Cc: Dodji Seketeli, Chris Moller, Sergio Durigan Junior, Project Archer

On Thu, Apr 8, 2010 at 12:28 PM, Tom Tromey <tromey@redhat.com> wrote:
> I'm not opposed to this but I don't want to slow down our progress to
> make a library.

For what it's worth, isolating a complex component like this makes it
much easier to write unit tests for it.

As an experiment, I did my recent work on Google Breakpad --- a new
symbol dumper for Linux that converts DWARF debugging info and CFI to
Breakpad's own textual format, corresponding extensions to the parser
for that data, and stack walkers for x86, x86_64, and ARM ---
following a discipline of providing full code coverage and branch
coverage (each branch has to be both taken and not taken) with unit
tests for each separable component.  It slowed me down quite a bit ---
I spent more time writing tests than code.  But except for cases where
I misunderstood the spec, I have also not had any bugs yet in ~5500
non-comment lines of code.  Or, more precisely, I had lots of bugs ---
some days I could have stayed in bed and not lost ground --- but none
of them got committed.  This full rewrite of the debugging info
dumper, along with pretty deep surgery on the stack walker, is running on our
production crash-handling servers (crash-stats.mozilla.com), and the
transition has been painless.

What made this possible, though, was that each piece could be taken in
isolation and driven from the Google C++ Test Framework.  It was easy
for me to directly check the results of the parser in isolation, not
the results of the command-line interpreter's dispatching, the
parsing, the symbol table lookup (and thus the debug info readers),
the evaluator, and the printer.  The tests were fast to run, so I
would run them at pretty much every point the code could be
expected to behave, during the development process.
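
To make the "parser in isolation" point concrete, here is a minimal
sketch of what such a test might look like.  Everything in it is
hypothetical: ToyParser, rejects, and run_parser_tests are made-up
names, the grammar is a toy with only '+', '*', digits and
parentheses, and plain assert stands in for the Google C++ Test
Framework macros.  It is not GDB code and not the Breakpad code.  The
parser returns its tree as a prefix string, so the tests can check
the parse result directly without any symbol lookup, evaluator, or
printer.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical toy parser: turns "1+2*3" into the prefix string
// "(+ 1 (* 2 3))".  It depends on nothing but the input text.
class ToyParser {
public:
    explicit ToyParser(const std::string& text) : text_(text), pos_(0) {}

    // Parse the whole input; throws std::runtime_error if it is malformed.
    std::string parse() {
        std::string tree = parse_sum();
        if (pos_ != text_.size())
            throw std::runtime_error("trailing input");
        return tree;
    }

private:
    // sum := product ('+' product)*
    std::string parse_sum() {
        std::string left = parse_product();
        while (peek() == '+') {
            ++pos_;
            std::string right = parse_product();
            left = "(+ " + left + " " + right + ")";
        }
        return left;
    }

    // product := primary ('*' primary)*
    std::string parse_product() {
        std::string left = parse_primary();
        while (peek() == '*') {
            ++pos_;
            std::string right = parse_primary();
            left = "(* " + left + " " + right + ")";
        }
        return left;
    }

    // primary := digit+ | '(' sum ')'
    std::string parse_primary() {
        if (peek() == '(') {
            ++pos_;
            std::string inner = parse_sum();
            if (peek() != ')')
                throw std::runtime_error("expected ')'");
            ++pos_;
            return inner;
        }
        std::string digits;
        while (std::isdigit(static_cast<unsigned char>(peek())))
            digits += text_[pos_++];
        if (digits.empty())
            throw std::runtime_error("expected number");
        return digits;
    }

    // Next character, or '\0' at end of input.
    char peek() const { return pos_ < text_.size() ? text_[pos_] : '\0'; }

    std::string text_;
    std::size_t pos_;
};

// Test helper: true if the parser rejects the input with an exception.
bool rejects(const std::string& text) {
    try {
        ToyParser(text).parse();
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}

// Each test drives the parser alone.  The error cases exercise the
// "throw" side of branches that the valid inputs leave untaken.
void run_parser_tests() {
    assert(ToyParser("1+2*3").parse() == "(+ 1 (* 2 3))");
    assert(ToyParser("(1+2)*3").parse() == "(* (+ 1 2) 3)");
    assert(rejects("(1+2"));  // missing ')'
    assert(rejects("1+"));    // missing operand
}
```

Calling run_parser_tests() from a small main is enough to run these;
since nothing else has to be set up, a run takes essentially no time,
which is what makes it practical to re-run after every change.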

As I say, it wasn't quick.  But it also means that my next project can
actually have my full attention, because I'm not spreading that
debugging effort across the next year, based on ill-defined,
occasionally reproducible bug reports.

Anyway, what this message comes down to is, "But, but, unit testing! Wow!"

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Parser rewritting
  2010-04-10 22:05         ` Jim Blandy
@ 2010-04-10 22:11           ` Jim Blandy
  0 siblings, 0 replies; 14+ messages in thread
From: Jim Blandy @ 2010-04-10 22:11 UTC (permalink / raw)
  To: Tom Tromey
  Cc: Dodji Seketeli, Chris Moller, Sergio Durigan Junior, Project Archer

On Sat, Apr 10, 2010 at 3:04 PM, Jim Blandy <jimb@red-bean.com> wrote:
> But except for cases where
> I misunderstood the spec, I have also not had any bugs yet in ~5500
> non-comment lines of code.

That's non-test lines.  There are apparently ~10k lines of non-comment
test code.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2010-04-10 22:11 UTC | newest]

Thread overview: 14+ messages
2010-03-30 18:46 Parser rewritting Sergio Durigan Junior
2010-03-30 19:05 ` Chris Moller
2010-03-30 21:12   ` Tom Tromey
2010-04-04  8:50     ` Dodji Seketeli
2010-04-08 19:28       ` Tom Tromey
2010-04-10 22:05         ` Jim Blandy
2010-04-10 22:11           ` Jim Blandy
2010-03-30 21:18 ` Tom Tromey
2010-03-30 22:20   ` Keith Seitz
2010-03-30 22:59     ` Tom Tromey
2010-03-31  2:01       ` Matt Rice
2010-04-02  1:50 ` Chris Moller
2010-04-08 19:21   ` Tom Tromey
2010-04-08 20:21     ` Chris Moller
