public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
* change the location_t type, DECL_SOURCE_LINE etc
@ 2002-12-02 15:43 Per Bothner
  2002-12-02 16:06 ` Gabriel Dos Reis
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Per Bothner @ 2002-12-02 15:43 UTC (permalink / raw)
  To: gcc

I've been looking at the location_t type, which is used (primarily)
to represent the line-number of a tree_decl.  There are two problems
I'd like to address:
(1) Each location_t is 8 bytes (on 32-bit hosts) or 16 bytes (in
64-bit hosts).  This is a substantial amount of space when there
are many declarations.
(2) We don't store column numbers.

The following proposed change addresses both issues. It is somewhat
tentative, and I haven't finished the implementation.  However, I'd
like to explain the idea before I go much further.

To reduce space usage we have to replace the filename by an index
in a 'line_table'.  We also have to use bit fields in order to squeeze
everything into 32 bits.  This means a limited number of different
file names, and a limited range on line and column numbers.  However,
it is not quite a limited as the 14 and 8 suggest, since we can
have multiple 'line_table' entries for the same filename, but with
different line and column base values.  Thus:

/* The data structure used to record a location in a translation unit.  */
struct location_s GTY (())
{
   /* An index into line_table. */
   unsigned int file_index : 10;

   /* The source line, relative to line_table[file_index].line_base. */
   unsigned int rline : 14;

   /* The source column, relative to line_table[file_index].column_base. */
   unsigned int rcolumn : 8;
};
typedef struct location_s location_t;

#define LOCATION_FILE(LOCATION) (line_table[(LOCATION).file_index].file)
#define LOCATION_LINE(LOCATION) \
   (line_table[(LOCATION).file_index].line_base + (LOCATION).rline)
#define LOCATION_COLUMN(LOCATION) \
   (line_table[(LOCATION).file_index].column_base + (LOCATION).rcolumn)

struct line_table_entry
{
   /* The name of the source file for locations using this entry.  */
   const char *file;

   /* Value (normally 0) to add to rline for locations using this entry. */
   int line_base;

   /* Value (normally 0) to add to rcolumn for locations using this 
entry. */
   int column_base;

   /* Index of next element of line_table for file; -1 if none. */
   int next_entry;
};
struct line_table_entry line_table[1024] = {{NULL, 0, 0, -1}};

For simplicity this version uses a pre-allocated full-size line_table.
Since file_index is 10 bits, we can handle a maximum of 1024 entries,
I've used a statically allocated array.  This may not be a great
idea:  It uses 16kB of memory, which is trivial for large compilations,
but perhaps it is a bit wasteful.  (Also, the pre-allocation
of 'unknown_location_index' as line_table entry 0 forces the entire
table into data rather than bss; this is relatively easy to fix.)

The main function is 'get_location', which takes a file/line/column
triple and returns a location_t, finding a suitable 'line-table' entry.

/* Return a location_t for the given arguments. */

struct location_s
get_location (file, line, column)
      const char *file;
      int line;
      int column;
{
   struct location_s result;
   struct line_table_entry *entry = &line_table[most_recent_line_entry];
   if (ENTRY_MATCHES (entry, file, line, column))
     goto ok; /* Fast case. */
   else
     {
       /* Slow case - the most_recent_line_entry didn't match. */
       int next, step;
       int hash = PRIMARY_HASH (file);
       int start = hash & 1023;
       entry = &line_table[start];
       if (! (FILE_MATCHES (entry, file)))
	{
	  step = SEARCH_STEP (file, hash);
	  next = start;
	  for (;;)
	    {
	      if (ENTRY_AVAIL (entry))
		goto new;
	      next = (next + step) & 1023;
	      if (next == start)
		return get_nospace_location (file, line, column);
	      entry = &line_table[next];
	      if (FILE_MATCHES (entry, file))
		break;
	    }
	}
       /* At this point FILE_MATCH (entry, file) is true. */
       for (;;)
	{
	  if (LINE_COL_MATCHES (entry, line, column))
	    goto ok;
	  next = entry->next_entry;
	  if (next < 0) break;
	  entry = &line_table[next];
	}
       step = SEARCH_STEP (file, hash);
       next = start;
       for (;;)
	{
	  next = (next + step) & 1023;
	  if (ENTRY_AVAIL (&line_table[next]))
	    {
	      entry->next_entry = next;
	      entry = &line_table[next];
	      goto new;
	    }
	  if (next == start)
	    return get_nospace_location (file, line, column);
	}
     }
  new:
   entry->line_base = line & ~ LINE_MAX;
   entry->column_base = column & ~ COL_MAX;
   entry->next_entry = -1;
   entry->file = IDENTIFIER_POINTER (get_identifier (file));
  ok:
   most_recent_line_entry = entry - line_table;
   result.file_index = most_recent_line_entry;
   result.rline = line - entry->line_base;
   result.rcolumn = line - entry->column_base;
   return result;
}

It uses various (non-public) helper macros:

/* The index of the most recently uses line_table entry.  A cache. */
static int most_recent_line_entry;

#define LINE_MAX ((1<<14)-1)
#define COL_MAX ((1<<8)-1)

#define ENTRY_MATCHES(entry, file, line, column) \
   (FILE_MATCHES (entry, file) && LINE_COL_MATCHES (entry, line, column))
#define FILE_MATCHES(entry, file) \
   (entry->file == file \
    || (entry->file != NULL && file != NULL && strcmp (entry->file, 
file) == 0))
#define LINE_COL_MATCHES(entry, line, column) \
   (entry->line_base <= line && line - entry->line_base <= LINE_MAX \
    && entry->column_base <= column && column - entry->column_base <= 
COL_MAX)

#define ENTRY_AVAIL(entry) ((entry)->next_entry == 0)

There is a slight chance the line_table will fill up; the function
get_nospace_location sort-of handles this case, by using
the pre-allocated entry for "unknown location".

static int unknown_location_index = 0;

/* Emergency handling if line_table is full. */

static struct location_s
get_nospace_location (const char *file, int line, int column)
{
   static int emitted_too_many_files_warning;
   struct location_s result;
   if (! emitted_too_many_files_warning)
     {
       emitted_too_many_files_warning = 1;
       warning ("too many files or lines for line number information");
     }
   /* See if ignoring column number helps. */
   if (column < 0 || column > COL_MAX)
     return get_location (file, line, 0);
   result.file_index = unknown_location_index;
   result.rline = line >= 0 && line <= LINE_MAX ? line : 0;
   result.rcolumn = column;
   return result;
}

Then of course we need to modify these definitions in tree.h:

#define DECL_SOURCE_LOCATION(NODE) (DECL_CHECK (NODE)->decl.locus)
#define DECL_SOURCE_FILE(NODE) LOCATION_FILE ((DECL_SOURCE_LOCATION (NODE)))
#define DECL_SOURCE_LINE(NODE) LOCATION_LINE ((DECL_SOURCE_LOCATION (NODE)))
#define DECL_SOURCE_COLUMN(NODE) LOCATION_COLUMN ((DECL_SOURCE_LOCATION 
(NODE)))
#define SET_DECL_SOURCE_FILE_LINE(NODE, FILE, LINE) \
   (DECL_CHECK (NODE)->decl.locus = get_location (FILE, LINE, 0))
#define INPUT_LOCATION() get_location (input_filename, lineno, 0)

At which point it's just a tedious matter of fixing places in the source
that use DECL_SOURCE_FILE and DECL_SOURCE_LINE as lvalues ...  I've
mostly done that.

This patch does not change EXPR_WITH_FILE_LOCATION or the related
EXPR_WFL macros (which use an expression's complexity node).  I
haven't dealt with Java's DECL_SOURCE_LINE_FIRST and DECL_SOURCE_LINE_LAST.
It doesn't actually set the column numbers - that's up to each front-end.

Comments?  Is this a good idea?

-- 
	--Per Bothner
per@bothner.com   http://www.bothner.com/per/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: change the location_t type, DECL_SOURCE_LINE etc
  2002-12-02 15:43 change the location_t type, DECL_SOURCE_LINE etc Per Bothner
@ 2002-12-02 16:06 ` Gabriel Dos Reis
  2002-12-02 23:19 ` Neil Booth
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Gabriel Dos Reis @ 2002-12-02 16:06 UTC (permalink / raw)
  To: Per Bothner; +Cc: gcc

Per Bothner <per@bothner.com> writes:

| I've been looking at the location_t type, which is used (primarily)
| to represent the line-number of a tree_decl.  There are two problems
| I'd like to address:
| (1) Each location_t is 8 bytes (on 32-bit hosts) or 16 bytes (in
| 64-bit hosts).  This is a substantial amount of space when there
| are many declarations.
| (2) We don't store column numbers.
| 
| The following proposed change addresses both issues. It is somewhat
| tentative, and I haven't finished the implementation.  However, I'd
| like to explain the idea before I go much further.
| 
| To reduce space usage we have to replace the filename by an index
| in a 'line_table'.  We also have to use bit fields in order to squeeze
| everything into 32 bits.  This means a limited number of different
| file names, and a limited range on line and column numbers.  However,
| it is not quite a limited as the 14 and 8 suggest, since we can
| have multiple 'line_table' entries for the same filename, but with
| different line and column base values.  Thus:

This idea has been in the "air" for a long time (primarily championed
by Neil).  We agreed to implement it for 3.4.  Neil may have more to
say on that.

Note however that, to have effective saving, we would have to use
location_t in lieu of ad hoc 'const char*, int' pairs used here and
there thoughout the compiler -- that is a work postponed for 3.4.

-- Gaby

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: change the location_t type, DECL_SOURCE_LINE etc
  2002-12-02 15:43 change the location_t type, DECL_SOURCE_LINE etc Per Bothner
  2002-12-02 16:06 ` Gabriel Dos Reis
@ 2002-12-02 23:19 ` Neil Booth
  2002-12-04 15:16   ` Per Bothner
  2002-12-03 18:36 ` Fergus Henderson
  2002-12-04 16:29 ` Per Bothner
  3 siblings, 1 reply; 12+ messages in thread
From: Neil Booth @ 2002-12-02 23:19 UTC (permalink / raw)
  To: Per Bothner; +Cc: gcc

Per Bothner wrote:-

> I've been looking at the location_t type, which is used (primarily)
> to represent the line-number of a tree_decl.  There are two problems
> I'd like to address:
> (1) Each location_t is 8 bytes (on 32-bit hosts) or 16 bytes (in
> 64-bit hosts).  This is a substantial amount of space when there
> are many declarations.
> (2) We don't store column numbers.

This is great; we've been planning to do this for a while.
But please use (or modify if you have to, though you shouldn't need
to) line-map.c; it provides the line -> (file, line) mapping already
for cpplib; I've wanted to get the whole compiler using it for a while.

Also, we've discussed before squeezing line and col information into
32-bits and it was rejected by rth, Fergus and others.  The only
thing that would work would be a "logical character" offset in the
translation unit, but that's a much larger and more dubious change
IMO.  Can we please stick to 32-bit lines and 16-bit columns,
at least for now?

> static int unknown_location_index = 0;

FYI line-map.c and cpplib reserve line zero for this.

> Comments?  Is this a good idea?

Most certainly.

Neil.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: change the location_t type, DECL_SOURCE_LINE etc
  2002-12-02 15:43 change the location_t type, DECL_SOURCE_LINE etc Per Bothner
  2002-12-02 16:06 ` Gabriel Dos Reis
  2002-12-02 23:19 ` Neil Booth
@ 2002-12-03 18:36 ` Fergus Henderson
  2002-12-04 11:22   ` Per Bothner
  2002-12-04 16:29 ` Per Bothner
  3 siblings, 1 reply; 12+ messages in thread
From: Fergus Henderson @ 2002-12-03 18:36 UTC (permalink / raw)
  To: Per Bothner; +Cc: gcc

On 02-Dec-2002, Per Bothner <per@bothner.com> wrote:
> To reduce space usage we have to replace the filename by an index
> in a 'line_table'.  We also have to use bit fields in order to squeeze
> everything into 32 bits.  This means a limited number of different
> file names, and a limited range on line and column numbers.  However,
> it is not quite a limited as the 14 and 8 suggest, since we can
> have multiple 'line_table' entries for the same filename, but with
> different line and column base values.  Thus:
> 
> /* The data structure used to record a location in a translation unit.  */
> struct location_s GTY (())
> {
>    /* An index into line_table. */
>    unsigned int file_index : 10;
> 
>    /* The source line, relative to line_table[file_index].line_base. */
>    unsigned int rline : 14;
> 
>    /* The source column, relative to line_table[file_index].column_base. */
>    unsigned int rcolumn : 8;
> };

The C code generated by the Mercury compiler is going to overflow
those limits, I think.

Currently the longest C file generated by the Mercury compiler
is about 600,000 lines long (20,000,000 bytes).
The longest line is about 2300 characters.
There are up to about 750 header files.

-- 
Fergus Henderson <fjh@cs.mu.oz.au>  |  "I have always known that the pursuit
The University of Melbourne         |  of excellence is a lethal habit"
WWW: <http://www.cs.mu.oz.au/~fjh>  |     -- the last words of T. S. Garp.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: change the location_t type, DECL_SOURCE_LINE etc
  2002-12-03 18:36 ` Fergus Henderson
@ 2002-12-04 11:22   ` Per Bothner
  2002-12-04 21:28     ` Fergus Henderson
  0 siblings, 1 reply; 12+ messages in thread
From: Per Bothner @ 2002-12-04 11:22 UTC (permalink / raw)
  To: Fergus Henderson; +Cc: gcc

Fergus Henderson wrote:
> The C code generated by the Mercury compiler is going to overflow
> those limits, I think.

> Currently the longest C file generated by the Mercury compiler
> is about 600,000 lines long (20,000,000 bytes).

That means that file might need to grab ceiling(600,000/(1<<14))
= 37 number of file_indexes.  There leaves still 986 available.

> The longest line is about 2300 characters.

Column numbers would probably not be very useful for generated C code.

> There are up to about 750 header files.

Perhaps file_index should be increased to 11 or 12 bits, and rline
reduced to 13 or 12 bits.  Or reduce the bits for rcolumn - it will
often just be 0 anyway.  That allows more flexibility.

In any case, it appears to me that line numbers in such C code are
useless except to debug the Mercury compiler itself.  Line numbers
relative to the C code would be useless for debugging Mercury code.
I assume you emit #line directives, in which case the size of
the C code is irrelevant.

If we start storing column numbers, I think i makes sense to disable
column numbers after a #line directive.
-- 
	--Per Bothner
per@bothner.com   http://www.bothner.com/per/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: change the location_t type, DECL_SOURCE_LINE etc
  2002-12-02 23:19 ` Neil Booth
@ 2002-12-04 15:16   ` Per Bothner
  2002-12-04 16:19     ` Neil Booth
  0 siblings, 1 reply; 12+ messages in thread
From: Per Bothner @ 2002-12-04 15:16 UTC (permalink / raw)
  To: Neil Booth; +Cc: gcc

Neil Booth wrote:

> This is great; we've been planning to do this for a while.
> But please use (or modify if you have to, though you shouldn't need
> to) line-map.c; it provides the line -> (file, line) mapping already
> for cpplib; I've wanted to get the whole compiler using it for a while.

I'm don't understand what you're suggesting.  It appears to me that
the line-map stuctues are solving a different problem tham I am,
which is how to compactly represent file/line[/column] pairs[/triples].

> Also, we've discussed before squeezing line and col information into
> 32-bits and it was rejected by rth, Fergus and others.  The only
> thing that would work would be a "logical character" offset in the
> translation unit, but that's a much larger and more dubious change
> IMO.  Can we please stick to 32-bit lines and 16-bit columns,
> at least for now?

I really want to pack file/line[/column] pairs[/triples] into 32 bits.
However, I have a suggestion for a more complex but less size-limited
encoding it port shortly.
-- 
	--Per Bothner
per@bothner.com   http://www.bothner.com/per/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: change the location_t type, DECL_SOURCE_LINE etc
  2002-12-04 15:16   ` Per Bothner
@ 2002-12-04 16:19     ` Neil Booth
  2002-12-04 19:44       ` Per Bothner
  0 siblings, 1 reply; 12+ messages in thread
From: Neil Booth @ 2002-12-04 16:19 UTC (permalink / raw)
  To: Per Bothner; +Cc: gcc

Per Bothner wrote:-

> Neil Booth wrote:
> 
> >This is great; we've been planning to do this for a while.
> >But please use (or modify if you have to, though you shouldn't need
> >to) line-map.c; it provides the line -> (file, line) mapping already
> >for cpplib; I've wanted to get the whole compiler using it for a while.
> 
> I'm don't understand what you're suggesting.  It appears to me that
> the line-map stuctues are solving a different problem tham I am,
> which is how to compactly represent file/line[/column] pairs[/triples].

line-maps compress file/line into 32 bits.  They don't hold column
info; since when I suggested it before it wasn't popular.  The 32-bit
line is convertible into a file / line combination when necessary (which
at the moment is during diagnostics, and at the front-end / CPP
interface, since the front ends don't use the line maps yet.  If they
did, it would only be necessary during asm output, and everything else
could use the 32-bit number).

I added other stuff to it; it also handles -H dumps and file stack
printing from diagnostics; using line-map for this meant these two
features became available for preprocessed input in addition to normal
input.

What don't you understand about them?

Neil.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: change the location_t type, DECL_SOURCE_LINE etc
  2002-12-02 15:43 change the location_t type, DECL_SOURCE_LINE etc Per Bothner
                   ` (2 preceding siblings ...)
  2002-12-03 18:36 ` Fergus Henderson
@ 2002-12-04 16:29 ` Per Bothner
  2002-12-04 20:03   ` Per Bothner
  3 siblings, 1 reply; 12+ messages in thread
From: Per Bothner @ 2002-12-04 16:29 UTC (permalink / raw)
  To: Per Bothner; +Cc: gcc

This is a revised proposal for compressing file/line/[column]
numbers into 32 bits, but it can handle very large files.

Motivation/goals:
* Shrink location_t down to 32 bits.
* Handle column numbers, at least for reasonably-sized files.
* Handle huge files, though not necessarily with column numbers.
* Handle a large but not huge (specifically 4095) number of source files.

Observations/strategy:
Most C source files (judging by 'wc gcc/*.c') have fewer than 4k lines.
Very few (none in gcc/*.c) have more than 16k lines.
Most lines have fewer than 128 columns; almost all have fewer than
255 columns.  So we will "optimize" for the normal case, using 12 bits
for line numbers and 8 bits for column numbers.  That leaves us 12 bits
for a "file index". A few files (especially machine-ganerated ones) are
"pathologically large".  For these, we can drop the column numbers, 
giving us 20 bits to play with, which allows files over 10^6 lines.
This is almost always enough, but occasionally a file will be even
bigger.  We can handle those by allocating multiple file indexes, with
different line number bases.

struct location_s
{
   /* An index into line_table. */
   unsigned int file_index : 12;

   /* Encoding of line and column number. */
   unsigned int linecol : 20;
};

struct line_table_s
{
   const char *file;

   unsigned int line_base : 12;
   /* If column_base == 0xFF, then we don't store column numbers.
      In that case the linecol == the line number, adjusted by line_base.
      Otherwise linecol = (line : 12, column : 8). */
   unsigned int column_base : 8;

   /* Used to chain entries for the same file. */
   unsigned int next_extry : 12;
} line_table[4096];

#define LOCATION_FILE(LOCATION) \
   (line_table[(LOCATION).file_index].file)
#define LOCATION_LINE(LOCATION) \
   (line_table[(LOCATION).file_index].column_base == 0xFF \
    ? (LOCATION).linecol + (1 << 20) * 
line_table[(LOCATION).file_index].line_base \
    : ((LOCATION).linecol >> 8) + (1 << 12) * 
line_table[(LOCATION).file_index].line_base)
#define LOCATION_COLUMN(LOCATION) \
   (line_table[(LOCATION).file_index].column_base == 0xFF ? 0 \
    : ((LOCATION).linecol & 0xFF) + (1 << 8) * 
line_table[(LOCATION).file_index].column_base)

Notes:
The maximum representable column number is 254*256+255 or 65279.
However, to avoid allocating too many line_table entries for the same
file, I suggest ignoring the column number (treat it as 0) if either
column >= 2^10 (1024) or line >= 2^16 (65536).

A single entry in the line_table can handle either: line numbers up to
2^20-1 or 1048575, without column numbers (i.e. column_base == 0xFF),
or line numbers up to 2^12-1 or 4095 plus column numbers up to 255.
The maximum representable line_size without column numbers is
2^20*(2^12-1)+2^20-1, which is the same as 2^32-1.  Of course, if all
those lines are actually used (even just one every 2^20 lines) that
would use up all the entries in the line_table.  So we can handle any
combination of up to 2^12 normal-sized files (up to 2^20-1 lines or
2^12-1 lines and 2^8-1 column) or a single file with 2^32-1 lines or
two files with up to 2^32 lines and 2^10 normal-sized files or any
other combination.
-- 
	--Per Bothner
per@bothner.com   http://www.bothner.com/per/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: change the location_t type, DECL_SOURCE_LINE etc
  2002-12-04 16:19     ` Neil Booth
@ 2002-12-04 19:44       ` Per Bothner
  2002-12-04 23:09         ` Neil Booth
  0 siblings, 1 reply; 12+ messages in thread
From: Per Bothner @ 2002-12-04 19:44 UTC (permalink / raw)
  To: Neil Booth; +Cc: gcc

Neil Booth wrote:

> line-maps compress file/line into 32 bits.

I've looked at the code.  I *think* I understand the idea, but
as they don't seem to be used much, even in cpplib, I may be
confused.

One problem is that it isn't obvious what lines numbers are logical,
and which are physical.  I can figure it by tracing calls in
cpplib, but a simple typedef would help.  For example:

typedef unsigned int file_line_t;

The goal of course would be for file_line_t to replace location_t,
so perhaps we could just use location_t.  However, file_line_t is
perhaps less ambiguous.

extern const struct line_map *lookup_line
   PARAMS ((struct line_maps *, file_line_t));

 > They don't hold column info; since when I suggested it before it
 > wasn't popular.

I really would like column numbers.  I also think it is very awkward
when cpplib takes a (file_line_t, column)-pair.  It is much cleaner
to fold the column number into the file_line_t.

Suggestion:  To struct line_map add:

   /* Number of column in a file_line_t which are column numbers. */
   int column_bits;

Then SOURCE_LINE becomes:
#define SOURCE_LINE(MAP, LINE) \
   (((LINE) + (MAP)->to_line - (MAP)->from_line) >> (MAP)->column_bits)
and we get:
#define SOURCE_COLUMN(MAP, LINE) \
   (((LINE) + (MAP)->to_line - (MAP)->from_line)
     & ((1 << (MAP)->column_bits) - 1))

Of course add_line_map would have to take a column_bits parameter,
which leads to the issue of what to set it to.  I suggest a default
of 8, for front-ends that generate column-numbers, and 0 for
front-ends that don't, or when the line-number gets too high.

If a file_line_t is needed in a situation where the current
physical column number is more than 255, we have the option of
making another call to add_line_map with a higher column_bits,
or ignoring the column numbers (set it to 0).  If the current
line numbes starts getting really high, we might similarly want
to reset column_bits to 0, so we don't "use up" the file_line_t
values too quickly.

Another problem:  Currently, the struct line_maps is local to a
cpp_reader.  If it is going to replace location_t, then of course
it needs to be global.  Perhaps cpp_create_reader could take an
extra parameter, which is the line_maps set to use?

-- 
	--Per Bothner
per@bothner.com   http://www.bothner.com/per/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: change the location_t type, DECL_SOURCE_LINE etc
  2002-12-04 16:29 ` Per Bothner
@ 2002-12-04 20:03   ` Per Bothner
  0 siblings, 0 replies; 12+ messages in thread
From: Per Bothner @ 2002-12-04 20:03 UTC (permalink / raw)
  To: Per Bothner; +Cc: gcc

Ignore the proposal below for now, until Neil and I figure
out if we can add column number support to the line-map stuff.
Perhaps that can solve the goal below, while avoid the hard
limit on the number of files - plus making use of the
existing code.  Let's see what Neil says.

Per Bothner wrote:

> This is a revised proposal for compressing file/line/[column]
> numbers into 32 bits, but it can handle very large files.
>
> Motivation/goals:
> * Shrink location_t down to 32 bits.
> * Handle column numbers, at least for reasonably-sized files.
> * Handle huge files, though not necessarily with column numbers.
> * Handle a large but not huge (specifically 4095) number of source files.


-- 
	--Per Bothner
per@bothner.com   http://www.bothner.com/per/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: change the location_t type, DECL_SOURCE_LINE etc
  2002-12-04 11:22   ` Per Bothner
@ 2002-12-04 21:28     ` Fergus Henderson
  0 siblings, 0 replies; 12+ messages in thread
From: Fergus Henderson @ 2002-12-04 21:28 UTC (permalink / raw)
  To: Per Bothner; +Cc: gcc

On 04-Dec-2002, Per Bothner <per@bothner.com> wrote:
> Fergus Henderson wrote:
> > The longest line is about 2300 characters.
> 
> Column numbers would probably not be very useful for generated C code.

I suspect it would be the other way around.
The wider the lines are, the more useful column numbers are,
and files with very wide lines are usually automatically generated.

The times I really wish gcc supported column numbers is when debugging
errors in macro-intensive code.  The trouble comes when you get an
error at the point of the macro invocation, but have no idea which part
of the body of the macro is causing the problem.  Typically what I do
then is generate the `.i' file, remove the #line directives, rename
it to `tmp.c' and recompile it.  Then I find the error message points
to a 2000-character line, so I edit the tmp.c file again, piping the
relevent function through `indent' to split up long lines.  (I'd run
the whole file through `indent', except that the program doesn't seem
to cope with files containing very long lines...)  Finally after a bit
of this sort of hacking I'm able to figure out which part of the `.i'
file is causing the error, and can then trace it back to the offending
macro in the source code.  There has to be a better way...

> In any case, it appears to me that line numbers in such C code are
> useless except to debug the Mercury compiler itself.

Not entirely.  Mercury allows users to write in-line C code in their
Mercury source files.  So the generated C file contains both
code which is automatically generated by the Mercury compiler,
and also C code fragments that the user wrote which are copied verbatim
into the generated C file.  (In this respect it's similar to what happens
with lex and yacc.)  Sometimes users make mistakes such as putting
unterminated comments or string literals in their C code fragments
or defining macro names which happen to conflict with identifiers used
in the generated code.  This can lead to error messages showing up
in the generated code.  Users who are debugging such problems need
to see which part of the generated code had problems in order to then
track down which part of their own code was the ultimate cause.

> Line numbers
> relative to the C code would be useless for debugging Mercury code.
> I assume you emit #line directives, in which case the size of
> the C code is irrelevant.

We do emit #line directives by default.
However, we don't emit a #line directives for the generated code --
instead we only emit #line directives before and after the user-written
C code fragments.  The #line directive after such a fragment just
resets the filename and line number back to the physical line in the file.

The reason that we do this is that the line numbers are only being used
for C compiler error messages, not for gdb.  If there are C errors in
the generated code, it's much better to point the user at the generated C
code which had a problem rather than pointing them at the corresponding
part of their Mercury source, since the problem can only be debugged
at the C level not the Mercury level.

> If we start storing column numbers, I think i makes sense to disable
> column numbers after a #line directive.

There are three different uses of #line directives:

	(1) immediately before a line of code copied verbatim from the
	    user's source file

	(2) immediately before a line of generated code which
	    conceptually corresponds to some part of the user's source file,
	    but which has been munged in some way (e.g. translated from
	    a different language)

	(3) immediately before a line of generated code which does not
	    correspond to any part of the user's source file,
	    to undo the effects of previous #line directives
	    and reset the logical file and line so that they
	    point back to the physical file again.

In case (2), it makes sense to disable column numbers.
In cases (1) and (3), it does not.

Case (3) can be detected by checking whether the line number and file name
in the #line directive match the physical file and line number; if they do,
column numbers shold be (re)enabled.

Unfortunately the C #line directive syntax provides no way to distinguish
cases (1) and (2).  I suppose it is best to disable column numbers in
cases (1)/(2) since it is better to provide less information in case (1)
than to provide misleading information in case (2).  In the long term
it might be worth considering some extension to the C #line directive
syntax to distinguish case (1).

-- 
Fergus Henderson <fjh@cs.mu.oz.au>  |  "I have always known that the pursuit
The University of Melbourne         |  of excellence is a lethal habit"
WWW: <http://www.cs.mu.oz.au/~fjh>  |     -- the last words of T. S. Garp.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: change the location_t type, DECL_SOURCE_LINE etc
  2002-12-04 19:44       ` Per Bothner
@ 2002-12-04 23:09         ` Neil Booth
  0 siblings, 0 replies; 12+ messages in thread
From: Neil Booth @ 2002-12-04 23:09 UTC (permalink / raw)
  To: Per Bothner; +Cc: gcc

Per Bothner wrote:-

> I've looked at the code.  I *think* I understand the idea, but
> as they don't seem to be used much, even in cpplib, I may be
> confused.

Sorry, I thought the comments in the header were enough, though
I admit when I looked at them when this thread started it wasn't
entirely obvious (it's over 1 year since I wrote the code).
I could go over them, or you could reword them to clarify what wasn't
clear to you.

The reason they appear to not be being used it that they are only
needed when crossing the boundary from line->(file, line), which doesn't
happen often like I said.  Everything in cpplib uses only logical lines.
At present, translation is only required for diagnostics and when a token
enters the front end.  The latter case is simplified too, because the
front end just remembers the current map, and uses that to convert to a
physical line number without a binary search.  The current map is updated
only when entering / leaving a file, or by #line, through a call-back.
See comment below about efficiency.

This latter use would be unnecessary too if / when the front-ends
started using them; the boundary would pass to the front-end/back-end
interface, and if the back-ends use them, ultimately to asm output.

> One problem is that it isn't obvious what lines numbers are logical,
> and which are physical.  I can figure it by tracing calls in
> cpplib, but a simple typedef would help.  For example:
> 
> typedef unsigned int file_line_t;

Yes, this would be useful.

> I really would like column numbers.  I also think it is very awkward
> when cpplib takes a (file_line_t, column)-pair.  It is much cleaner
> to fold the column number into the file_line_t.

I would like column numbers too.  However, since the rest of the
compiler wasn't supporting it, and getting the rest of the compiler
to use line-maps as they are isn't entirely easy, I kinda welched.

The main problem getting the compiler to use line-maps is global abuse
of reads (and in some cases writes!) to the "lineno" and "filename"
variables.  These all need to go.  Ultimately we could remove the
ad-hoc file stack implemented in toplev.c.

> Suggestion:  To struct line_map add:
> 
>   /* Number of column in a file_line_t which are column numbers. */
>   int column_bits;
> 
> Then SOURCE_LINE becomes:
> #define SOURCE_LINE(MAP, LINE) \
>   (((LINE) + (MAP)->to_line - (MAP)->from_line) >> (MAP)->column_bits)
> and we get:
> #define SOURCE_COLUMN(MAP, LINE) \
>   (((LINE) + (MAP)->to_line - (MAP)->from_line)
>     & ((1 << (MAP)->column_bits) - 1))
> 
> Of course add_line_map would have to take a column_bits parameter,
> which leads to the issue of what to set it to.  I suggest a default
> of 8, for front-ends that generate column-numbers, and 0 for
> front-ends that don't, or when the line-number gets too high.

That's a neat idea.  We could use say 12 bits for column numbers (my
preferred default) initially, and reduce it if / when the line numbers
get too big.  That would require 256K lines, which I'd guess 99.9%
of files get nowhere near.

You'd need code in the populator of the linemaps to recognize crossing
the boundary somehow.  In cpplib, this would be handle_newline() I
think.

Of course, the line maps have no business storing the column themselves;
they just map logical to physical lines.  The columns would be embedded
in the 32-bit logical_line_t used by the front ends.

I'd like to reserve column number zero to mean "no column stored" or
similar; that would be appropriate for at least the C front ends until
they start using the information.

However, I've just realized that this would mean that location
information would not be monotonic increasing across such a
number-of-bits-change boundary.  This may or may not be a problem.

> If a file_line_t is needed in a situation where the current
> physical column number is more than 255, we have the option of
> making another call to add_line_map with a higher column_bits,
> or ignoring the column numbers (set it to 0).  If the current
> line numbes starts getting really high, we might similarly want
> to reset column_bits to 0, so we don't "use up" the file_line_t
> values too quickly.

Yup.  I wrote the above before reading your paragraph properly.

Fancy implementing something like you've described?

> Another problem:  Currently, the struct line_maps is local to a
> cpp_reader.  If it is going to replace location_t, then of course
> it needs to be global.  Perhaps cpp_create_reader could take an
> extra parameter, which is the line_maps set to use?

Of course; this is just an effect of cpplib insisting on reentrancy.
We do something similar for the global identifier hash table.

Note that the line maps are quite efficient; normally the only map being
used is the most recent one, so no binary lookup is even necessary.
They'd be even more efficient if the whole compiler used them; lookups
and quick translations would be even less frequent.

Full binary lookup only tends to be used in diagnostics to refer to
prior locations.

Neil.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2002-12-05  7:09 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-12-02 15:43 change the location_t type, DECL_SOURCE_LINE etc Per Bothner
2002-12-02 16:06 ` Gabriel Dos Reis
2002-12-02 23:19 ` Neil Booth
2002-12-04 15:16   ` Per Bothner
2002-12-04 16:19     ` Neil Booth
2002-12-04 19:44       ` Per Bothner
2002-12-04 23:09         ` Neil Booth
2002-12-03 18:36 ` Fergus Henderson
2002-12-04 11:22   ` Per Bothner
2002-12-04 21:28     ` Fergus Henderson
2002-12-04 16:29 ` Per Bothner
2002-12-04 20:03   ` Per Bothner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).