From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-514057-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 108042 invoked by alias); 19 Nov 2019 17:30:57 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 108034 invoked by uid 89); 19 Nov 2019 17:30:56 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-22.6 required=5.0 tests=AWL,BAYES_00,GIT_PATCH_0,GIT_PATCH_1,GIT_PATCH_2,GIT_PATCH_3,KAM_SHORT,SPF_PASS autolearn=ham version=3.3.1 spammy=consolidation, paperwork, visually, conclude
X-HELO: us-smtp-1.mimecast.com
Received: from us-smtp-delivery-1.mimecast.com (HELO us-smtp-1.mimecast.com) (207.211.31.120) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 19 Nov 2019 17:30:51 +0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;	s=mimecast20190719; t=1574184646;	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:	 content-transfer-encoding:content-transfer-encoding:	 in-reply-to:in-reply-to:references:references;	bh=+enmcw9HQPuftZTpgzKP68oQ80uFyjPc+PZG0fwhgXY=;	b=XU56hAs9SEqDbiJW59IDE9pdA8pmTHfGGOvJVnCBT2A4N2ezDgaxfaoiehrzLqxJvG7ZlN	Tqb1d0hWC6x5jMmVgotyMpFOhytnriz43r3pl7fyqdiyJH8jSDvSHCOeFIEsv5SA8HRQYH	vbY1n/lVoh2bEeRUdkqYVxrk8SLbKIY=
Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-237-7Tw6tzpUP7ioMu_rEuQUlQ-1; Tue, 19 Nov 2019 12:30:42 -0500
Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.phx2.redhat.com [10.5.11.16])	(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))	(No client certificate requested)	by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 595C8107B7F1;	Tue, 19 Nov 2019 17:30:41 +0000 (UTC)
Received: from ovpn-116-76.phx2.redhat.com (ovpn-116-76.phx2.redhat.com [10.3.116.76])	by smtp.corp.redhat.com (Postfix) with ESMTP id 564984DA48;	Tue, 19 Nov 2019 17:30:40 +0000 (UTC)
Message-ID: <8a30a5a30078714f399822b0513b070af59c3e88.camel@redhat.com>
Subject: Re: [PATCH] Multibyte awareness for diagnostics (PR 49973)
From: David Malcolm <dmalcolm@redhat.com>
To: Lewis Hyatt <lhyatt@gmail.com>, gcc-patches@gcc.gnu.org
Cc: joseph@codesourcery.com
Date: Tue, 19 Nov 2019 17:38:00 -0000
In-Reply-To: <20190927204144.GA86720@ldh.local>
References: <20190926201639.GA82807@ldh.local>	 <alpine.DEB.2.21.1909262040450.14912@digraph.polyomino.org.uk>	 <20190927204144.GA86720@ldh.local>
User-Agent: Evolution 3.32.4 (3.32.4-1.fc30)
MIME-Version: 1.0
X-Mimecast-Spam-Score: 0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-IsSubscribed: yes
X-SW-Source: 2019-11/txt/msg01850.txt.bz2

On Fri, 2019-09-27 at 16:41 -0400, Lewis Hyatt wrote:
> On Thu, Sep 26, 2019 at 08:46:56PM +0000, Joseph Myers wrote:
> > On Thu, 26 Sep 2019, Lewis Hyatt wrote:
> >=20
> > > A couple notes:=20
> > >     - In order to avoid any portability problems with wchar_t,
> > > the
> > > equivalent of wcwidth() from libc is implemented in-house.
> >=20
> > I'm uneasy about contrib/gen_wcwidth.cpp doing the generation using
> > host=20
> > libc's wcwidth.  The effect is that libcpp/generated_cpp_wcwidth.h
> > is=20
> > *not* reproducible unless you know exactly what host (libc version,
> > locale=20
> > used when running the program, distribution patches to libc and
> > locale=20
> > data) was used to run the program.  I think we need a generator
> > that works=20
> > from Unicode data in some way so we can explicitly say what version
> > of the=20
> > (unmodified) Unicode data was used.
>=20
> Here is a revised patch that hopefully addresses your concerns. I
> borrowed the
> relevant Python code for parsing Unicode's data files from glibc,
> then added a
> new script that parses the locale data they output into the same sort
> of simply
> searchable tables I was creating before. The new generated table is
> very close
> to the old one, but there are some differences due to improvements
> that have
> been made to glibc recently, affecting 200 or so codepoints. The
> procedure for
> updating GCC's wcwidth would then be the following:
>=20
> -The three Unicode data files live in contrib/unicode/
> {EastAsianWidth.txt,PropList.txt,UnicodeData.txt} and can be updated
> at any
> time when Unicode changes them.
>=20
> -glibc's processing logic lives in two Python scripts in
> contrib/unicode/from_glibc and these would ideally be updated when
> glibc makes
> updates. It seems they occasionally put some manual overrides, etc.,
> based on
> feedback and bug reports. (These are the verbatim scripts from glibc
> with no
> changes, so they need only be copied over.)
>=20
> -contrib/unicode/gen_wcwidth.py runs the glibc code, using GCC's
> Unicode data
> files as inputs, and produces the necessary tables for
> libcpp/generated_cpp_wcwidth.h.
>=20
> Hope that sounds better. This way it is relatively straightforward to
> keep in
> sync with glibc (which seems desirable to me anyway), but is also
> always
> reproducible.
>=20
> Note: I did not include the three large unicode data files in this
> emailed
> patch, although they would be committed as part of the patch
> presumably.
> They are available here:
> ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
> ftp://ftp.unicode.org/Public/UNIDATA/EastAsianWidth.txt
> ftp://ftp.unicode.org/Public/UNIDATA/PropList.txt
>=20
> The rest of the patch is unchanged from before, other than one
> comment updated
> to reflect the new situation, and charset.c rebased to current trunk.
>=20
> Thank you for taking the time to review this.
>=20
> -Lewis

Thanks for posting this patch; I'm sorry about how long it's taken me
to review it.

BTW, have you done GCC contributor paperwork?
  https://gcc.gnu.org/contribute.html#legal

> diff --git a/contrib/unicode/from_glibc/unicode_utils.py b/contrib/unicod=
e/from_glibc/unicode_utils.py
> new file mode 100644
> index 00000000000..a9e94cce418
> --- /dev/null
> +++ b/contrib/unicode/from_glibc/unicode_utils.py

[...snip...]

I'll leave it for Joseph to comment on whether this approach satisifies
his concerns; I'll focus on the diagnostic-show-locus.c changes.

I'm assuming that all of the Python code is Python 3, rather than 2
(I see python 3 shebangs and python3-style-print, so it looks good from
that POV).  It appears that there's no need for the build machine to
have Python, it's only needed when maintainers are refreshing the data
from glibc.

If we're going with this approach, and redistributing those unicode data
files as part of our repo and tarballs, do we need some kind of
copyright/license statement that spells out the situation?  What does
glibc do for this?

> diff --git a/contrib/unicode/from_glibc/utf8_gen.py b/contrib/unicode/fro=
m_glibc/utf8_gen.py
> new file mode 100755
> index 00000000000..0e5583cd259
> --- /dev/null
> +++ b/contrib/unicode/from_glibc/utf8_gen.py

[...snip...]

> diff --git a/contrib/unicode/gen_wcwidth.py b/contrib/unicode/gen_wcwidth=
.py
> new file mode 100755
> index 00000000000..02b28bcedcf
> --- /dev/null
> +++ b/contrib/unicode/gen_wcwidth.py

[...snip...]

If we're going with this approach (which I'll leave to Joseph), perhaps
this directory should have a brief README (covering much of the material
you've mentioned in the email with the patch, talking about syncing with
glibc, regenerating the data, etc).

> diff --git a/gcc/diagnostic-show-locus.c b/gcc/diagnostic-show-locus.c
> index 4d563dda8f4..7a5bd36d962 100644
> --- a/gcc/diagnostic-show-locus.c
> +++ b/gcc/diagnostic-show-locus.c
> @@ -30,6 +30,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "gcc-rich-location.h"
>  #include "selftest.h"
>  #include "selftest-diagnostic.h"
> +#include "cpplib.h"
>=20=20
>  #ifdef HAVE_TERMIOS_H
>  # include <termios.h>
> @@ -112,7 +113,29 @@ class colorizer
>    const char *m_stop_color;
>  };
>=20=20
> -/* A point within a layout_range; similar to an expanded_location,
> +/* In order to handle multibyte sources properly, all of this logic need=
s to be
> +   aware of the distinction between the number of bytes and the number of
> +   display columns occupied by a character.  One or the other is more us=
eful
> +   depending on the context.  For instance, in order to output the caret=
 at the
> +   correct location, we need to count display columns; in order to color=
ize a
> +   source line, we need to count the bytes.  All locations are provided =
to us
> +   as byte counts.  We augment these with the display column so that it =
can be
> +   used when need.  This is not the most efficient way to do things sinc=
e it
> +   requires looping over the whole line each time, but it should be fine=
 for
> +   the purpose of outputting diagnostics.  */

The patch uses a bool in many places to indicate what kind of column is
being referred to.  I think an enum would be more typesafe and
self-documenting - perhaps:
  "enum column_unit" with values CU_BYTES, CU_DISPLAY_COLUMN?  or
  "enum column_kind" with values CK_BYTES, CK_DISPLAY_COLUMN?
I wonder if this might eventually grow a third value, representing the
a count of unicode characters, but that's out of scope for this patch.

I should confess that it took me a while to realize the whole
multi-column display thing (I had an "aha" moment, then felt rather
foolish, given that I'd been playing with the examples in the PR; it was
on reading through the ASCII art in the new selftests and going "huh"
that I had my epiphany on the problem your patch is solving).

I think this file could use a high-level introductory comment at the top
talking about the various meanings of "column".  I liked the two example
code points you used below, so perhaps have a comment up at the top
talking about the distinction between byte vs display column, using
those code points as examples (and a plain ASCII character, by way of
contrast).  Perhaps have the comment describing the enum be the big
introductory comment.

It would be good for that introductory comment to have a copy of the
ASCII art you used in the selftests below, or similar.


> +class exploc_with_display_col : public expanded_location
> +{
> + public:
> +  exploc_with_display_col (const expanded_location &exploc)
> +    : expanded_location (exploc),
> +      m_display_col (location_compute_display_column (exploc)) {}
> +
> +  int m_display_col;
> +};

OK

> +
> +/* A point within a layout_range; similar to an exploc_with_display_col,
>     but after filtering on file.  */
>=20=20
>  class layout_point
> @@ -120,10 +143,17 @@ class layout_point
>   public:
>    layout_point (const expanded_location &exploc)
>    : m_line (exploc.line),
> -    m_column (exploc.column) {}
> +    m_column (exploc.column),
> +    m_display_col (location_compute_display_column (exploc)) {}
> +
> +  int get_col (bool use_display) const
> +    {
> +      return use_display ? m_display_col : m_column;
> +    }

enum here.


>    linenum_type m_line;
>    int m_column;
> +  int m_display_col;

Does it simplify things if this is an array accessed via the enum?
(not sure, just a thought that occurred to me)

>  };
>=20=20
>  /* A class for use by "class layout" below: a filtered location_range.  =
*/
> @@ -138,7 +168,7 @@ class layout_range
>  		unsigned original_idx,
>  		const range_label *label);
>=20=20
> -  bool contains_point (linenum_type row, int column) const;
> +  bool contains_point (linenum_type row, int column, bool use_display) c=
onst;

enum

>    bool intersects_line_p (linenum_type row) const;
>=20=20
>    layout_point m_start;
> @@ -157,6 +187,17 @@ struct line_bounds
>  {
>    int m_first_non_ws;
>    int m_last_non_ws;
> +
> +  void convert_to_display_cols (char_span line)
> +  {
> +    m_first_non_ws =3D cpp_byte_column_to_display_column (line.get_buffe=
r (),
> +							line.length (),
> +							m_first_non_ws);
> +
> +    m_last_non_ws =3D cpp_byte_column_to_display_column (line.get_buffer=
 (),
> +						       line.length (),
> +						       m_last_non_ws);
> +  }
>  };
>=20=20
>  /* A range of contiguous source lines within a layout (e.g. "lines 5-10"
> @@ -284,6 +325,7 @@ class layout
>    get_state_at_point (/* Inputs.  */
>  		      linenum_type row, int column,
>  		      int first_non_ws, int last_non_ws,
> +		      bool use_display,

enum

>  		      /* Outputs.  */
>  		      point_state *out_state);
>=20=20
> @@ -298,7 +340,7 @@ class layout
>    diagnostic_context *m_context;
>    pretty_printer *m_pp;
>    location_t m_primary_loc;
> -  expanded_location m_exploc;
> +  exploc_with_display_col m_exploc;
>    colorizer m_colorizer;
>    bool m_colorize_source_p;
>    bool m_show_labels_p;
> @@ -472,10 +514,15 @@ layout_range::layout_range (const expanded_location=
 *start_exploc,
>     - 'w' indicates a point within the range
>     - 'F' indicates the finish of the range (which is
>  	 within it).
> -   - 'a' indicates a subsequent point *after* the range.  */
> +   - 'a' indicates a subsequent point *after* the range.
> +
> +   USE_DISPLAY controls whether we check the byte column or
> +   the display column; one or the other is more convenient
> +   depending on the context.  */

FWIW it looks like this is only used by layout::get_state_at_point,
which in turn is used by layout::print_source_line for colorization,
and by layout::print_annotation_line for printing carets/underlines;
looks like they're using false and true respectively.

>  bool
> -layout_range::contains_point (linenum_type row, int column) const
> +layout_range::contains_point (linenum_type row, int column,
> +			      bool use_display) const

enum again

>  {
>    gcc_assert (m_start.m_line <=3D m_finish.m_line);
>    /* ...but the equivalent isn't true for the columns;
> @@ -491,7 +538,7 @@ layout_range::contains_point (linenum_type row, int c=
olumn) const
>      /* On same line as start of range (corresponding
>         to line 02 in example A and line 03 in example B).  */
>      {
> -      if (column < m_start.m_column)
> +      if (column < m_start.get_col (use_display))
>  	/* Points on the starting line of the range, but
>  	   before the column in which it begins.  */
>  	return false;
> @@ -505,7 +552,7 @@ layout_range::contains_point (linenum_type row, int c=
olumn) const
>  	{
>  	  /* This is a single-line range.  */
>  	  gcc_assert (row =3D=3D m_finish.m_line);
> -	  return column <=3D m_finish.m_column;
> +	  return column <=3D m_finish.get_col (use_display);
>  	}
>      }
>=20=20
> @@ -530,7 +577,7 @@ layout_range::contains_point (linenum_type row, int c=
olumn) const
>=20=20
>    gcc_assert (row =3D=3D  m_finish.m_line);
>=20=20
> -  return column <=3D m_finish.m_column;
> +  return column <=3D m_finish.get_col (use_display);
>  }
>=20=20
>  /* Does this layout_range contain any part of line ROW?  */
> @@ -574,20 +621,23 @@ test_layout_range_for_single_point ()
>=20=20
>    /* Tests for layout_range::contains_point.  */
>=20=20
> -  /* Before the line. */
> -  ASSERT_FALSE (point.contains_point (6, 1));
> +  for (int use_display =3D 0; use_display <=3D 1; ++use_display)
> +    {
> +      /* Before the line.  */
> +      ASSERT_FALSE (point.contains_point (6, 1, use_display));

[...snip...]

Here you generalize the layout_range tests to iterate over both
meanings of "column".

If I'm reading things right, implicit here is that the layout_point
ctors within the layout range are now calling:
  m_display_col (location_compute_display_column (exploc))
which in this selftest is looking for a file named "test.c",
presumably not finding it, and hitting the case of a NULL "line"
char_span.

So if there happens to be a test.c in the current directory
containing the "right" characters, this selftest could break.

Previously it's never mattered to this selftest whether or not
there was an actual test.c, so it might be good to modify it
to use temp_source_file (and maybe even to have some multicolumn
chars in it, though that might be taking things too far).

> @@ -642,40 +695,43 @@ test_layout_range_for_multiple_lines ()

[...snip...]

Similar comments as for the single_point case.

> @@ -687,8 +743,8 @@ test_layout_range_for_multiple_lines ()
>=20=20
>  #endif /* #if CHECKING_P */
>=20=20
> -/* Given a source line LINE of length LINE_WIDTH, determine the width
> -   without any trailing whitespace.  */
> +/* Given a source line LINE of length LINE_WIDTH bytes, determine the wi=
dth
> +   (in bytes, not display cols) without any trailing whitespace.  */
>=20=20
>  static int
>  get_line_width_without_trailing_whitespace (const char *line, int line_w=
idth)

Why is get_line_width_without_trailing_whitespace done in bytes?
It's used for calculating the maximum number of printed columns, to try
to cope with extra wide source lines, offsetting things to fit within the w=
idth
of the user's terminal.

> @@ -897,17 +953,35 @@ layout::layout (diagnostic_context * context,
>       will be adjusted accordingly.  */
>    size_t max_width =3D m_context->caret_max_width;
>    char_span line =3D location_get_source_line (m_exploc.file, m_exploc.l=
ine);
> -  if (line && (size_t)m_exploc.column <=3D line.length ())
> +  if (line && max_width)
>      {
> -      size_t right_margin =3D CARET_LINE_MARGIN;
> -      size_t column =3D m_exploc.column;
> -      if (m_show_line_numbers_p)
> -	column +=3D m_linenum_width + 2;
> -      right_margin =3D MIN (line.length () - column, right_margin);
> -      right_margin =3D max_width - right_margin;
> -      if (line.length () >=3D max_width && column > right_margin)
> -	m_x_offset =3D column - right_margin;
> -      gcc_assert (m_x_offset >=3D 0);
> +      size_t column =3D m_exploc.m_display_col;
> +      int line_width
> +	=3D get_line_width_without_trailing_whitespace (line.get_buffer (),
> +						      line.length ());
> +      size_t eol =3D cpp_display_width (line.get_buffer (), line_width);
> +      const size_t eol_before_linenum =3D eol;
> +
> +      if (column <=3D eol)
> +	{
> +	  if (m_show_line_numbers_p)
> +	    {
> +	      column +=3D m_linenum_width + 2;
> +	      eol +=3D m_linenum_width + 2;
> +	    }
> +	  size_t right_margin =3D CARET_LINE_MARGIN;
> +	  right_margin =3D MIN (eol - column, right_margin);
> +	  right_margin =3D max_width - right_margin;
> +	  /* Note: if right_margin > max_width, we end up failing this next
> +	     check due to wrapping, and we don't offset anything.  Otherwise we
> +	     would conclude we can't output the line at all.  */
> +	  if (eol >=3D max_width && column > right_margin)
> +	    {
> +	      m_x_offset =3D column - right_margin;
> +	      m_x_offset =3D MIN (m_x_offset, (int) eol_before_linenum - 1);
> +	    }
> +	  gcc_assert (m_x_offset >=3D 0);
> +	}
>      }
>=20=20
>    if (context->show_ruler_p)
> @@ -1252,7 +1326,9 @@ layout::calculate_line_spans ()
>  /* Print line ROW of source code, potentially colorized at any ranges, a=
nd
>     populate *LBOUNDS_OUT.
>     LINE is the source line (not necessarily 0-terminated) and LINE_WIDTH
> -   is its width.  */
> +   is its width.  This function deals only with byte offsets, not display
> +   columns; m_x_offset must be converted from display to byte units.  In
> +   particular, LINE_WIDTH and LBOUNDS_OUT are in bytes.  */
>=20=20
>  void
>  layout::print_source_line (linenum_type row, const char *line, int line_=
width,
> @@ -1264,7 +1340,10 @@ layout::print_source_line (linenum_type row, const=
 char *line, int line_width,
>       whitespace.  */
>    line_width =3D get_line_width_without_trailing_whitespace (line,
>  							   line_width);
> -  line +=3D m_x_offset;
> +
> +  const int x_offset_bytes
> +    =3D cpp_display_column_to_byte_column (line, line_width, m_x_offset);
> +  line +=3D x_offset_bytes;
>=20=20
>    if (m_show_line_numbers_p)
>      {
> @@ -1278,7 +1357,7 @@ layout::print_source_line (linenum_type row, const =
char *line, int line_width,
>    int first_non_ws =3D INT_MAX;
>    int last_non_ws =3D 0;
>    int column;
> -  for (column =3D 1 + m_x_offset; column <=3D line_width; column++)
> +  for (column =3D 1 + x_offset_bytes; column <=3D line_width; column++)
>      {
>        /* Assuming colorization is enabled for the caret and underline
>  	 characters, we may also colorize the associated characters
> @@ -1298,6 +1377,8 @@ layout::print_source_line (linenum_type row, const =
char *line, int line_width,
>  	  point_state state;
>  	  in_range_p =3D get_state_at_point (row, column,
>  					   0, INT_MAX,
> +					   false, /* Using bytes, not display
> +						     columns, here.  */
>  					   &state);
>  	  if (in_range_p)
>  	    m_colorizer.set_range (state.range_idx);
> @@ -1360,12 +1441,13 @@ layout::start_annotation_line (char margin_char) =
const
>  }
>=20=20
>  /* Print a line consisting of the caret/underlines for the given
> -   source line.  */
> +   source line.  This function works with display columns, rather than b=
yte
> +   counts; in particular, LBOUNDS should be in display column units.  */
>=20=20
>  void
>  layout::print_annotation_line (linenum_type row, const line_bounds lboun=
ds)
>  {
> -  int x_bound =3D get_x_bound_for_row (row, m_exploc.column,
> +  int x_bound =3D get_x_bound_for_row (row, m_exploc.m_display_col,
>  				     lbounds.m_last_non_ws);
>=20=20
>    start_annotation_line ();
> @@ -1378,6 +1460,7 @@ layout::print_annotation_line (linenum_type row, co=
nst line_bounds lbounds)
>        in_range_p =3D get_state_at_point (row, column,
>  				       lbounds.m_first_non_ws,
>  				       lbounds.m_last_non_ws,
> +				       true, /* Using display units.  */
>  				       &state);
>        if (in_range_p)
>  	{
> @@ -1415,9 +1498,11 @@ class line_label
>  public:
>    line_label (int state_idx, int column, label_text text)
>    : m_state_idx (state_idx), m_column (column),
> -    m_text (text), m_length (strlen (text.m_buffer)),
> -    m_label_line (0)
> -  {}
> +    m_text (text), m_label_line (0)
> +  {
> +    const int bytes =3D strlen (text.m_buffer);
> +    m_length =3D cpp_display_width (text.m_buffer, bytes);
> +  }

Please rename m_length to m_display_width, given that it's changing meaning.


[...snip...]
> @@ -1723,7 +1808,7 @@ layout::annotation_line_showed_range_p (linenum_typ=
e line, int start_column,
>=20=20
>     and is thus printed as desired.  */
>=20=20
> -/* A range of columns within a line.  */
> +/* A range of (byte or display) columns within a line.  */
>=20=20
>  class column_range
>  {
> @@ -1743,32 +1828,51 @@ public:
>    int finish;
>  };
>=20=20
> -/* Get the range of columns that HINT would affect.  */
> -
> +/* Get the range of bytes or display columns that HINT would affect.  */
>  static column_range
> -get_affected_columns (const fixit_hint *hint)
> +get_affected_range (const fixit_hint *hint, bool use_display)
>  {

enum again.


[...snip...]

> @@ -1825,7 +1938,8 @@ public:
>=20=20
>    /* The text to be inserted/used as replacement.  */
>    char *m_text;
> -  size_t m_len;
> +  size_t m_bytes;

How about:

  size_t m_byte_length; /* not including 0-terminator.  */

?

> +  int m_display_cols;
>    size_t m_alloc_sz;
>  };
>=20=20
> @@ -1850,8 +1964,8 @@ void
>  correction::ensure_terminated ()
>  {
>    /* 0-terminate the buffer.  */
> -  gcc_assert (m_len < m_alloc_sz);
> -  m_text[m_len] =3D '\0';
> +  gcc_assert (m_bytes < m_alloc_sz);
> +  m_text[m_bytes] =3D '\0';
>  }
>=20=20
>  /* A list of corrections affecting a particular line.
> @@ -1913,7 +2027,8 @@ source_line::source_line (const char *filename, int=
 line)
>  void
>  line_corrections::add_hint (const fixit_hint *hint)
>  {
> -  column_range affected_columns =3D get_affected_columns (hint);
> +  column_range affected_bytes =3D get_affected_range (hint, false);
> +  column_range affected_columns =3D get_affected_range (hint, true);

enum again.

>    column_range printed_columns =3D get_printed_columns (hint);
>=20=20
>    /* Potentially consolidate.  */

[...snip...]

> @@ -1947,7 +2062,7 @@ line_corrections::add_hint (const fixit_hint *hint)
>  	      /* Consolidate into the last correction:
>  		 add a no-op "replace" of the "between" text, and
>  		 add the text from the new hint.  */
> -	      int old_len =3D last_correction->m_len;
> +	      int old_len =3D last_correction->m_bytes;

Maybe rename to old_len, new_len to old_byte_len, new_byte_len?

>  	      gcc_assert (old_len >=3D 0);
>  	      int between_len =3D between.finish + 1 - between.start;
>  	      gcc_assert (between_len >=3D 0);
> @@ -1961,19 +2076,24 @@ line_corrections::add_hint (const fixit_hint *hin=
t)
>  	      last_correction->overwrite (old_len + between_len,
>  					  char_span (hint->get_string (),
>  						     hint->get_length ()));
> -	      last_correction->m_len =3D new_len;
> +	      last_correction->m_bytes =3D new_len;
>  	      last_correction->ensure_terminated ();
> +	      last_correction->m_affected_bytes.finish
> +		=3D affected_bytes.finish;
>  	      last_correction->m_affected_columns.finish
>  		=3D affected_columns.finish;
> +	      int prev_display_cols =3D last_correction->m_display_cols;
> +	      last_correction->compute_display_cols ();
>  	      last_correction->m_printed_columns.finish
> -		+=3D between_len + hint->get_length ();
> +		+=3D last_correction->m_display_cols - prev_display_cols;
>  	      return;
>  	    }
>  	}
>      }
>=20=20
>    /* If no consolidation happened, add a new correction instance.  */
> -  m_corrections.safe_push (new correction (affected_columns,
> +  m_corrections.safe_push (new correction (affected_bytes,
> +					   affected_columns,
>  					   printed_columns,
>  					   hint->get_string (),
>  					   hint->get_length ()));

[...snip...]

> @@ -2072,12 +2192,14 @@ layout::print_newline ()
>  /* Return true if (ROW/COLUMN) is within a range of the layout.
>     If it returns true, OUT_STATE is written to, with the
>     range index, and whether we should draw the caret at
> -   (ROW/COLUMN) (as opposed to an underline).  */
> +   (ROW/COLUMN) (as opposed to an underline).  USE_DISPLAY controls
> +   whether all inputs and outputs are in bytes or display column units. =
 */
>=20=20
>  bool
>  layout::get_state_at_point (/* Inputs.  */
>  			    linenum_type row, int column,
>  			    int first_non_ws, int last_non_ws,
> +			    bool use_display,

enum again.

>  			    /* Outputs.  */
>  			    point_state *out_state)
>  {

[...snip...]

> @@ -2846,6 +2972,560 @@ test_diagnostic_show_locus_one_liner (const line_=
table_case &case_)
>    test_one_liner_labels ();
>  }
>=20=20
> +/* Version of all one-liner tests exercising multibyte awareness.  For
> +   simplicity we stick to using two multibyte characters in the test, U+=
1F602
> +   =3D=3D "\xf0\x9f\x98\x82", which uses 4 bytes and 2 display columns, =
and U+03C0
> +   =3D=3D "\xcf\x80", which uses 2 bytes and 1 display column.  Note: al=
l of the
> +   below asserts would be easier to read if we used UTF-8 directly in the
> +   string constants, but it seems better not to demand the host compiler
> +   support this, when it isn't otherwise necessary.  Instead, whenever an
> +   extended character appears in a string, we put a line break after it =
so that
> +   all succeeding characters can appear visually at the correct display =
column.
> +
> +   All of these work on the following 1-line source file:
> +
> +     .0000000001111111111222222   display
> +     .1234567890123456789012345   columns
> +     "SS_foo =3D P_bar.SS_fieldP;\n"
> +     .0000000111111111222222223   byte
> +     .1356789012456789134567891   columns
> +
> +   which is set up by test_diagnostic_show_locus_one_liner and calls
> +   them.  Here SS represents the two display columns for the U+1F602 emo=
ji and
> +   P represents the one display column for the U+03C0 pi symbol.  */
> +

[...snip...]

Thanks for exercising all this with selftests.

Presumably this involved a big copy-and-paste from the existing
selftests.

How did you generate the expected output for the various _utf8
selftests?  Was it a lot of tedious manual editing, or is there a handy
way to do this? (I'm nervous about how much work it will be to update
these if e.g. we want to experiment with new ways of printing fix-it
hints)

> +/* Test of labeling the ranges within a rich_location.  */
> +
> +static void
> +test_one_liner_labels_utf8 ()
> +{
> +  location_t foo
> +    =3D make_location (linemap_position_for_column (line_table, 1),
> +		     linemap_position_for_column (line_table, 1),
> +		     linemap_position_for_column (line_table, 8));
> +  location_t bar
> +    =3D make_location (linemap_position_for_column (line_table, 12),
> +		     linemap_position_for_column (line_table, 12),
> +		     linemap_position_for_column (line_table, 17));
> +  location_t field
> +    =3D make_location (linemap_position_for_column (line_table, 19),
> +		     linemap_position_for_column (line_table, 19),
> +		     linemap_position_for_column (line_table, 30));
> +
> +  /* Example where all the labels fit on one line.  */
> +  {
> +    text_range_label label0
> +      ("\xcf\x80\xcf\x80\xcf\x80\xcf\x80\xcf\x80\xcf\x80");
> +    text_range_label label1
> +      ("\xf0\x9f\x98\x82\xf0\x9f\x98\x82\xcf\x80");
> +    text_range_label label2
> +      ("\xf0\x9f\x98\x82\xcf\x80\xf0\x9f\x98\x82\xf0\x9f\x98\x82\xcf\x80"
> +       "\xcf\x80");

Can you add a comment describing these labels.
In particular the label placement code is meant to ensure that labels
don't overlap, so I think this is adding coverage that we're computing
using display columns when placing labels, since otherwise foo's label
would need to be pushed onto a new line to avoid overlapping the other
labels.

> +    gcc_rich_location richloc (foo, &label0);
> +    richloc.add_range (bar, SHOW_RANGE_WITHOUT_CARET, &label1);
> +    richloc.add_range (field, SHOW_RANGE_WITHOUT_CARET, &label2);
> +
> +    {
> +      test_diagnostic_context dc;
> +      diagnostic_show_locus (&dc, &richloc, DK_ERROR);
> +      ASSERT_STREQ ("\n"
> +		    " \xf0\x9f\x98\x82"
> +		       "_foo =3D \xcf\x80"
> +			       "_bar.\xf0\x9f\x98\x82"
> +				      "_field\xcf\x80"
> +					     ";\n"
> +		    " ^~~~~~   ~~~~~ ~~~~~~~~~\n"
> +		    " |        |     |\n"
> +		    " \xcf\x80\xcf\x80\xcf\x80\xcf\x80\xcf\x80\xcf\x80"
> +			   "   \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xcf\x80"
> +				   " \xf0\x9f\x98\x82\xcf\x80\xf0\x9f\x98\x82"
> +					 "\xf0\x9f\x98\x82\xcf\x80\xcf\x80\n",
> +		    pp_formatted_text (dc.printer));
> +    }
> +
> +    /* Verify that we can disable label-printing.  */
> +    {
> +      test_diagnostic_context dc;
> +      dc.show_labels_p =3D false;
> +      diagnostic_show_locus (&dc, &richloc, DK_ERROR);
> +      ASSERT_STREQ ("\n"
> +		    " \xf0\x9f\x98\x82"
> +		       "_foo =3D \xcf\x80"
> +			       "_bar.\xf0\x9f\x98\x82"
> +				      "_field\xcf\x80"
> +					     ";\n"
> +		    " ^~~~~~   ~~~~~ ~~~~~~~~~\n",
> +		    pp_formatted_text (dc.printer));
> +    }

This part of the test might be redundant.

[...snip...]

> +  }
> +
> +  /* Example of boundary conditions: label 0 and 1 have just enough clea=
rance,
> +     but label 1 just touches label 2.  */
> +  {
> +    text_range_label label0 ("aaaaa\xf0\x9f\x98\x82\xcf\x80");
> +    text_range_label label1 ("bb\xf0\x9f\x98\x82\xf0\x9f\x98\x82");
> +    text_range_label label2 ("c");
> +    gcc_rich_location richloc (foo, &label0);
> +    richloc.add_range (bar, SHOW_RANGE_WITHOUT_CARET, &label1);
> +    richloc.add_range (field, SHOW_RANGE_WITHOUT_CARET, &label2);
> +
> +    test_diagnostic_context dc;
> +    diagnostic_show_locus (&dc, &richloc, DK_ERROR);
> +    ASSERT_STREQ ("\n"
> +		  " \xf0\x9f\x98\x82"
> +		     "_foo =3D \xcf\x80"
> +			     "_bar.\xf0\x9f\x98\x82"
> +				    "_field\xcf\x80"
> +					   ";\n"
> +		  " ^~~~~~   ~~~~~ ~~~~~~~~~\n"
> +		  " |        |     |\n"
> +		  " |        |     c\n"
> +		  " aaaaa\xf0\x9f\x98\x82\xcf\x80"
> +			   " bb\xf0\x9f\x98\x82\xf0\x9f\x98\x82\n",
> +		  pp_formatted_text (dc.printer));

It's hard to tell from the escaped expected string, but presumably this
matches the comment about boundary conditions, right? (compared with
the ASCII case).

[...snip...]

> @@ -3221,13 +3901,16 @@ test_overlapped_fixit_printing (const line_table_=
case &case_)
>      /* Unit-test the line_corrections machinery.  */
>      ASSERT_EQ (3, richloc.get_num_fixit_hints ());
>      const fixit_hint *hint_0 =3D richloc.get_fixit_hint (0);
> -    ASSERT_EQ (column_range (12, 12), get_affected_columns (hint_0));
> +    ASSERT_EQ (column_range (12, 12), get_affected_range (hint_0, false)=
);
> +    ASSERT_EQ (column_range (12, 12), get_affected_range (hint_0, true));
enums again

[...snip...]
=20
> @@ -3407,10 +4330,10 @@ test_overlapped_fixit_printing_2 (const line_tabl=
e_case &case_)
>      /* These fixits should be accepted; they can't be consolidated.  */
>      ASSERT_EQ (2, richloc.get_num_fixit_hints ());
>      const fixit_hint *hint_0 =3D richloc.get_fixit_hint (0);
> -    ASSERT_EQ (column_range (23, 22), get_affected_columns (hint_0));
> +    ASSERT_EQ (column_range (23, 22), get_affected_range (hint_0, false)=
);
>      ASSERT_EQ (column_range (23, 23), get_printed_columns (hint_0));
>      const fixit_hint *hint_1 =3D richloc.get_fixit_hint (1);
> -    ASSERT_EQ (column_range (21, 20), get_affected_columns (hint_1));
> +    ASSERT_EQ (column_range (21, 20), get_affected_range (hint_1, false)=
);
>      ASSERT_EQ (column_range (21, 21), get_printed_columns (hint_1));

enums again.

[...snip...]

> diff --git a/gcc/diagnostic.c b/gcc/diagnostic.c
> index 96b6fa30052..8638fbebb2d 100644
> --- a/gcc/diagnostic.c
> +++ b/gcc/diagnostic.c
> @@ -346,9 +346,13 @@ diagnostic_get_location_text (diagnostic_context *co=
ntext,
>    const char *locus_cs =3D colorize_start (pp_show_color (pp), "locus");
>    const char *locus_ce =3D colorize_stop (pp_show_color (pp));
>    const char *file =3D s.file ? s.file : progname;
> -  int line =3D strcmp (file, N_("<built-in>")) ? s.line : 0;
> -  int col =3D context->show_column ? s.column : 0;
> -
> +  int line =3D 0;
> +  int col =3D 0;
> +  if (strcmp (file, N_("<built-in>")))
> +    {
> +      line =3D s.line;
> +      col =3D context->show_column ? location_compute_display_column (s)=
 : 0;
> +    }

Why does the patch use the display column here?

Ideally it would be the count of unicode characters, but I think we want
to preserve the current behavior of using a byte offset.

>    const char *line_col =3D maybe_line_and_column (line, col);
>    return build_message_string ("%s%s%s:%s", locus_cs, file,
>  			       line_col, locus_ce);
> diff --git a/gcc/input.c b/gcc/input.c
> index 00301ef68dd..d2d99000b84 100644
> --- a/gcc/input.c
> +++ b/gcc/input.c
> @@ -908,6 +908,18 @@ make_location (location_t caret, source_range src_ra=
nge)
>    return COMBINE_LOCATION_DATA (line_table, pure_loc, src_range, NULL);
>  }
>=20=20
> +int
> +location_compute_display_column (expanded_location exploc)
> +{
> +  if (!(exploc.file && exploc.line && exploc.column))
> +    return exploc.column;
> +  char_span line =3D location_get_source_line (exploc.file, exploc.line);
> +  /* If line is NULL, this function returns exploc.column which is the
> +     desired fallback.  */
> +  return cpp_byte_column_to_display_column (line.get_buffer (), line.len=
gth (),
> +					    exploc.column);
> +}

This new function needs a leading comment (in particular spelling out the
"file not found" error-handling).

>  /* Dump statistics to stderr about the memory usage of the line_table
>     set of line maps.  This also displays some statistics about macro
>     expansion.  */
> @@ -3590,6 +3602,51 @@ test_line_offset_overflow ()
>    ASSERT_NE (ordmap_a, ordmap_b);
>  }
>=20=20
> +void test_cpp_utf8 ()
> +{
> +  /* Verify that wcwidth of invalid UTF-8 or control bytes is 1.  */
> +  {
> +    int w_bad =3D cpp_display_width ("\xf0!\x9f!\x98!\x82!", 8);
> +    ASSERT_EQ (8, w_bad);
> +    int w_ctrl =3D cpp_display_width ("\r\t\n\v\0\1", 6);
> +    ASSERT_EQ (6, w_ctrl);
> +  }
> +
> +  /* Verify that wcwidth of valid UTF-8 is as expected.  */
> +  {
> +    const int w_pi =3D cpp_display_width ("\xcf\x80", 2);
> +    ASSERT_EQ (1, w_pi);
> +    const int w_emoji =3D cpp_display_width ("\xf0\x9f\x98\x82", 4);
> +    ASSERT_EQ (2, w_emoji);
> +    const int w_ascii =3D cpp_display_width ("GCC", 3);
> +    ASSERT_EQ (3, w_ascii);
> +    const int w_mixed
> +      =3D cpp_display_width ("\xcf\x80 =3D 3.14 \xf0\x9f\x98\x82 \x9f!",=
 17);
> +    ASSERT_EQ (14, w_mixed);
> +  }
> +
> +  /* Verify that cpp_byte_column_to_display_column can go past the end,
> +     and similar edge cases.  */
> +  {
> +    const char *str =3D "\xcf\x80 abc";
> +    ASSERT_EQ (5, cpp_display_width (str, 6));
> +    ASSERT_EQ (105, cpp_byte_column_to_display_column (str, 6, 106));
> +    ASSERT_EQ (10000, cpp_byte_column_to_display_column (NULL, 0, 10000)=
);
> +    ASSERT_EQ (0, cpp_byte_column_to_display_column (NULL, 10000, 0));
> +  }
> +
> +  /* Verify that cpp_display_column_to_byte_column can go past the end,
> +   and similar edge cases.  */
> +  {
> +    const char *str =3D "\xf0\x9f\x98\x82 \xf0\x9f\x98\x82 hello";
> +    ASSERT_EQ (4, cpp_display_column_to_byte_column (str, 15, 2));
> +    ASSERT_EQ (15, cpp_display_column_to_byte_column (str, 15, 11));
> +    ASSERT_EQ (115, cpp_display_column_to_byte_column (str, 15, 111));
> +    ASSERT_EQ (10000, cpp_display_column_to_byte_column (NULL, 0, 10000)=
);
> +    ASSERT_EQ (0, cpp_display_column_to_byte_column (NULL, 10000, 0));

I think the two "str" consts could use comments showing the numbering
within them, like.

Maybe a brute force test for inverse, something like:

for (int display_column =3D 0; display_column < 20; display_column++)
  {
    int byte_column =3D cpp_display_column_to_byte_column (str, 15, display=
_column);
    ASSERT_EQ (cpp_byte_column_to_display_column (str, 15, byte_column),
               display_column);
  }

or similar?

What happens if you request a display column that's in the middle of a
character?  It feels like we ought to have selftest coverage for that.

[...snip...]

> diff --git a/libcpp/charset.c b/libcpp/charset.c
> index 39af77a554a..d1bdff095eb 100644
> --- a/libcpp/charset.c
> +++ b/libcpp/charset.c

[...snip...]

> +/* Our own version of wcwidth().  We don't use the actual wcwidth() in g=
libc,
> +   because that will inspect the user's locale, and in particular in an =
ASCII
> +   locale, it will not return anything useful for extended characters.  =
But GCC
> +   in other respects (see e.g. _cpp_default_encoding()) behaves as if
> +   everything is UTF-8.  We also make some tweaks that are useful for th=
e way
> +   GCC needs to use this data, e.g. tabs and other control characters sh=
ould be
> +   treated as having width 1.  The lookup tables are generated from
> +   contrib/unicode/gen_wcwidth.py and were made by simply calling glibc
> +   wcwidth() on all codepoints, then applying the small tweaks.  These t=
ables
> +   are not highly optimized, but for the present purpose of outputting
> +   diagnostics, they are sufficient.  */
> +
> +#include "generated_cpp_wcwidth.h"
> +int cpp_wcwidth (cppchar_t c)
> +{
> +  if (__builtin_expect (c <=3D wcwidth_range_ends[0], true))
> +    return wcwidth_widths[0];
> +
> +  /* Binary search the tables.  */
> +  int begin =3D 1;
> +  static const int end
> +      =3D sizeof wcwidth_range_ends / sizeof (*wcwidth_range_ends);
> +  int len =3D end - begin;
> +  do
> +    {
> +      int half =3D len/2;
> +      int middle =3D begin + half;
> +      if (c > wcwidth_range_ends[middle])
> +	{
> +	  begin =3D middle + 1;
> +	  len -=3D half + 1;
> +	}
> +      else
> +	len =3D half;
> +    } while (len);
> +
> +  if (__builtin_expect (begin !=3D end, true))
> +    return wcwidth_widths[begin];
> +  return 1;
> +}

Please can you add some unit-testing for this function in selftest form to =
input.c
(i.e. testing a few specific code points).

[...snip...]

Again, thanks for this patch, and sorry again for the delay in reviewing it.

Dave