From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 8553 invoked by alias); 26 Oct 2004 00:03:35 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 8541 invoked from network); 26 Oct 2004 00:03:32 -0000 Received: from unknown (HELO mail.codesourcery.com) (65.74.133.9) by sourceware.org with SMTP; 26 Oct 2004 00:03:32 -0000 Received: (qmail 21430 invoked from network); 26 Oct 2004 00:03:31 -0000 Received: from taltos.codesourcery.com (zack@66.92.218.83) by mail.codesourcery.com with DES-CBC3-SHA encrypted SMTP; 26 Oct 2004 00:03:31 -0000 Received: by taltos.codesourcery.com (sSMTP sendmail emulation); Mon, 25 Oct 2004 17:03:31 -0700 To: "Joseph S. Myers" Cc: gcc-patches@gcc.gnu.org Subject: Re: New C parser [patch] References: From: Zack Weinberg Date: Tue, 26 Oct 2004 00:32:00 -0000 In-Reply-To: (Joseph S. Myers's message of "Sun, 24 Oct 2004 20:19:05 +0000 (UTC)") Message-ID: <873c02e46k.fsf@codesourcery.com> User-Agent: Gnus/5.110003 (No Gnus v0.3) Emacs/21.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-SW-Source: 2004-10/txt/msg02137.txt.bz2 "Joseph S. Myers" writes: > This second slightly revised patch version fixes the lexer lookahead > strategy not to look ahead one token until required, so improving the > correspondence of diagnostic locations with the new parser to those > with the old parser. I'm generally quite happy with this patch. Could you enumerate all the places where 2-token lookahead is needed, and give an opinion as to how many of those could be eliminated through future cleanups? Some kibitzes follow... > +int yydebug; Where is this used? > +void > +c_parse_init (void) > +{ > + init_reswords (); > +} Seems a little silly to have two functions one of which does nothing but call the other. > +/* A lexer with up to two tokens of look-ahead; more are not needed > + for C. */ > +typedef struct c_lexer GTY (()) > +{ > + /* The look-ahead tokens. */ > + c_token tokens[2]; > + /* How many look-ahead tokens are available (0, 1 or 2). */ > + int tokens_avail; > +} c_lexer; Unlike C++, the lexer abstraction seems superfluous. By folding this into the parser structure, you would avoid an extra pointer dereference on every lexer operation. If you want to keep the distinction between parser and lexer, suggest embedding this struct within the parser struct. Come to think, I could do that to the C++ parser too. (makes note) This is especially a good idea since ... > +/* A parser structure recording information about the state and > + context of parsing. */ > +typedef struct c_parser GTY(()) > +{ > + /* The lexer. */ > + c_lexer *lexer; > + /* True if a syntax error is being recovered from; false otherwise. > + c_parser_error sets this flag. It should clear this flag when > + enough tokens have been consumed to recover from the error. */ > + BOOL_BITFIELD error : 1; > +} c_parser; the parser structure has so little in it, and you could save a word of memory by using a short int for the tokens_avail field. This only helps if you don't keep the separate lexer structure, because of tail padding. > +/* Issue a diagnostic of the form > + FILE:LINE: MESSAGE before TOKEN > + where TOKEN is the next token in the input stream of PARSER. > + MESSAGE (specified by the caller) is usually of the form "expected > + OTHER-TOKEN". > + > + Do not issue a diagnostic if still recovering from an error. > + > + ??? This is taken from the C++ parser, but building up messages in > + this way is not i18n-friendly and some other approach should be > + used. */ As a tangential note, the C++ parser does it this way because the C parser does it this way, and the C parser did it this way because Bison's error reporting logic more or less forced it to. Once your new parser goes in, it will be easy to clean all of this up. > +/* If the next token is the indicated keyword, consume it. Otherwise, > + issue the error MSGID. Returns true if found, false otherwise. */ > + > +static bool > +c_parser_require_keyword (c_parser *parser, > + enum rid keyword, > + const char *msgid) Please put this next to c_parser_require. > +static void > +c_parser_translation_unit (c_parser *parser) > +{ > + if (c_lexer_next_token_is (parser->lexer, CPP_EOF)) > + { > + if (pedantic) > + pedwarn ("ISO C forbids an empty source file"); > + } > + else > + { > + while (c_lexer_next_token_is_not (parser->lexer, CPP_EOF)) > + { > + void *obstack_position = obstack_alloc (&parser_obstack, 0); > + ggc_collect (); > + c_parser_external_declaration (parser); > + obstack_free (&parser_obstack, obstack_position); > + } > + } > +} If control reaches the else block, we know the first token is not CPP_EOF, so the while can be turned into a do-while; I do not think the compiler can do that itself. The obstack_alloc call can and should be hoisted out of the loop, since your goal is to flush the entire obstack between external declarations. > + int ext; > + SAVE_EXT_FLAGS (ext); It is not obvious from the name that SAVE_EXT_FLAGS disables diagnostics for extensions. Also, it would be nice if the macro returned a value (make it into an inline function maybe?) so that this could be written int old_ext_flags = disable_extension_diagnostics (); ... restore_extension_diagnostics (old_ext_flags); Also, (due to the recursive call to c_parser_external_declaration) you accept arbitrary numbers of __extension__ keywords; is this really intended? (I see the grammar as commented does allow it.) > +static void > +c_parser_asm_definition (c_parser *parser) > +{ > + tree asm_str = c_parser_simple_asm_expr (parser); > + if (asm_str) > + assemble_asm (asm_str); > + c_parser_skip_until_found (parser, CPP_SEMICOLON, "expected ';'"); > +} This isn't your fault, but in the modern compiler, calling assemble_anything directly from the parser is wrong. The only way asm() at top level can continue to make sense is if we pass it along to cgraph, and somehow get cgraph to preserve the global order of functions and asm()s. [Mind you, I would not have any problem with dropping this feature.] > + c_parser_error (parser, "expected ',' or ';'"); > + break; > + } > + } > + else > + { > + c_parser_error (parser, > + "expected ':', ',', ';' or '__attribute__'"); > + break; These want %<%>ifying. There are several other cases - I'm not going to call them all out explicitly. > + /* ??? The old parser accepted wide string literals here, but do we > + want to? */ The C++ parser does not accept wide string literals in any form of asm() string, so I support not accepting them in C either. I strongly recommend you grab the C++ parser's logic for context-dependent string constant translation, by the way, so that the horrible kludge used for C can go away and the C++ parser can stop duplicating the get-a-sequence-of-strings-concatenate-and-translate-them routine. > + attrib: > + empty > + any-word > + any-word ( identifier ) > + any-word ( identifier , nonempty-expr-list ) > + any-word ( expr-list ) > + > + where the "identifier" must not be declared as a type, and > + "any-word" may be any identifier (including one declared as a > + type), a reserved word storage class specifier, type specifier or > + type qualifier. ??? This still leaves out most reserved keywords > + (following the old parser), shouldn't we include them, and why not > + allow identifiers declared as types to start the arguments? */ What's really going on is that there is a finite set of attribute keywords which, in the "any-word" context, have that meaning and no other. This set really ought to be tagged explicitly by the lexer. The grammar inside the parentheses has more constraints than you have here, but it's also highly dependent on the specific attribute; maybe we should have individual attribute-parser functions, as we do for pragmas. > + ??? Allowing old-style [n] designators has as a side-effect (copied > + from the old parser) allowing ".member" without "=" as a > + designator, but perhaps as this has never been documented (and only > + worked from 2.95.x onwards) it could be freely removed. I think I prefer allowing ".member" without "=" as a designator, until and unless we decide to drop support for all of this grammar that's not C99-ish. In fact, I think I would prefer that the extension be regularized down to (a) allowing the omission of the "=" from any C99-style designator, and (b) allowing "identifier:" as meaning the same thing as ".identifier =". Unless that would introduce additional grammar ambiguities, but I don't think it would. It's not obvious to me from code or comment - we do allow "[ value ... value ] = initializer", yes? (That might be standardizable independently from the rest of the extensions in this area.) > + /* Two cases cannot and do not have line numbers associated: If stmt > + is degenerate, such as "2;", then stmt is an INTEGER_CST, which > + cannot hold line numbers. But that's OK because the statement > + will either be changed to a MODIFY_EXPR during gimplification of > + the statement expr, or discarded. If stmt was compound, but > + without new variables, we will have skipped the creation of a > + BIND and will have a bare STATEMENT_LIST. But that's OK because > + (recursively) all of the component statements should already have > + line numbers assigned. */ Can we arrange to scrap all no-op statements very early - "2;" other than as the last statement of a statement expr, and so on? > + ??? In accordance with the old parser, the declaration may be a > + nested function, which is then rejected in check_for_loop_decls, > + but does it make any sense for this to be included in the grammar? First reaction: Barf. Second reaction: It would be nice to pitch out invalid declarations early. check_for_loop_decls is potentially an expensive operation. > +/* The actual parser and external interface. */ > + > +static GTY (()) c_parser *the_parser; Is this global used for anything other than GGC, and does it actually need to be allocated under the garbage collector? zw