[PATCH] RFC: On-demand locations within string-literals

public inbox for gcc-patches@gcc.gnu.org

* [PATCH] RFC: On-demand locations within string-literals
@ 2016-07-08 21:22 David Malcolm
  2016-07-20 19:38 ` David Malcolm
  2016-07-23 21:36 ` [PATCH] RFC: " Martin Sebor
  0 siblings, 2 replies; 61+ messages in thread
From: David Malcolm @ 2016-07-08 21:22 UTC
  To: gcc-patches; +Cc: David Malcolm

This patch implements precise tracking of source locations for the
individual chars within string literals, so that we can e.g. underline
specific ranges in -Wformat diagnostics.

It should also enable fixing PR inline-asm/57950 ("wrong line numbers
in error messages for inline assembler statements").

I posted a much earlier version of this here:
  "[PATCH 17/22] libcpp: add location tracking within string literals"
    https://gcc.gnu.org/ml/gcc-patches/2015-09/msg00744.html
and:
  "[PATCH 18/22] Track locations within string literals in tree_string"
    https://gcc.gnu.org/ml/gcc-patches/2015-09/msg00743.html
In that old approach, I attempted to capture the location data during
parsing, storing it within a new cpp_string_location class, accessed
by a new TREE_STRING_LOCATION field of STRING_CST.

Doing so would add a pointer to every string literal, and mean storing the
data somewhere (unless we only store it for the "interesting" cases
in a hash somewhere).

Manu implemented an alternative "on-demand" approach in c-format.c
(r223470), which locates the relevant line in the source file and
effectively re-lexes the literal, thus avoiding having to store anything.
That implementation ("location_column_from_byte_offset" in c-format.c)
has a simplified lexer that doesn't support all possible literals:

https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=d5a2ddc76a109258297ff345957c35cb50116c94#patch2

In particular, it doesn't support concatenation or macros (amongst other
things).

In the following patch, I've taken the on-demand idea, and reimplemented
it within libcpp's string literal lexer, where the generation of
source-location information is an optional extra aspect of
cpp_interpret_string.
It's disabled during regular lexing, but it's available through an
interface in input.{c|h} which can rerun the libcpp code and capture
the per-char source_ranges for when we need to issue a diagnostic.

This has the advantage that we share code with the libcpp string
literal lexer, rather than trying to duplicate it, and thus it can handle
everything the "real" lexer can (as it *is* the real lexer).
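To illustrate the on-demand idea (this is not code from the patch), here
is a minimal standalone sketch that re-lexes a narrow string literal from
its source line and yields one column range per execution character.
The names col_range and relex_literal are hypothetical; the real work is
done inside libcpp's cpp_interpret_string machinery, and this sketch
ignores prefixes, UCNs, raw strings and concatenation.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a per-character source range:
// 1-based columns, inclusive at both ends.
struct col_range { int start; int finish; };

// Re-lex the narrow string literal whose opening quote is at 1-based
// column START_COL of LINE.  Return one range per execution character,
// or an empty vector if no well-formed literal is found there.
// Only \x, octal and single-character escapes are handled.
std::vector<col_range>
relex_literal (const std::string &line, int start_col)
{
  assert (start_col >= 1);
  std::vector<col_range> ranges;
  std::size_t i = start_col - 1;	// index of the opening quote
  if (i >= line.size () || line[i] != '"')
    return ranges;

  for (++i; i < line.size () && line[i] != '"'; )
    {
      std::size_t first = i;
      if (line[i] == '\\')
	{
	  ++i;
	  if (i >= line.size ())
	    return {};		// backslash at end of line
	  if (line[i] == 'x')
	    {
	      // Hex escape: consume every following hex digit.
	      ++i;
	      while (i < line.size ()
		     && std::isxdigit ((unsigned char) line[i]))
		++i;
	    }
	  else if (line[i] >= '0' && line[i] <= '7')
	    {
	      // Octal escape: consume at most three octal digits.
	      int n = 0;
	      while (n < 3 && i < line.size ()
		     && line[i] >= '0' && line[i] <= '7')
		{
		  ++i;
		  ++n;
		}
	    }
	  else
	    ++i;		// single-character escape such as \n or \"
	}
      else
	++i;			// ordinary character
      // I is now one past the last byte consumed, so as a 1-based
      // column it names the final column of this character.
      ranges.push_back ({ (int) first + 1, (int) i });
    }

  if (i >= line.size ())
    return {};			// unterminated literal
  return ranges;
}
```

For the line used in the hex selftest, the escaped "\x35" becomes a
single character spanning columns 15-18, and the characters after it are
shifted accordingly.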

To handle concatenation, the patch adds some extra data storage:
every time a string concatenation happens in c-lex.c, it stores
the locations of the component tokens in a hash_map, keyed by
the spelling location of the start of the first token
(see class string_concat_db in input.h).

Hence it's only storing extra data for string concatenations,
not for simple string literals.
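As a rough standalone model of that storage (not the patch's class, which
is GC-allocated and canonicalizes its key via get_key_loc), the database
can be thought of as a map from the first token's location to the
locations of all of the tokens.  The names loc_t and concat_db_sketch
below are hypothetical, and std::unordered_map stands in for GCC's
hash_map.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

typedef unsigned int loc_t;	// hypothetical stand-in for location_t

// Simplified model of a string-concatenation database: only literals
// that were actually concatenated get an entry.
class concat_db_sketch
{
 public:
  // Record a concatenation of two or more tokens.  LOCS holds the
  // location of each token; the first one is used as the key.
  void record (const std::vector<loc_t> &locs)
  {
    assert (locs.size () > 1);
    m_table[locs[0]] = locs;
  }

  // Return the locations of all tokens making up the literal that
  // starts at LOC.  A literal with no recorded concatenation is
  // treated as a single token at LOC.
  std::vector<loc_t> lookup (loc_t loc) const
  {
    auto it = m_table.find (loc);
    if (it == m_table.end ())
      return std::vector<loc_t> (1, loc);
    return it->second;
  }

  std::size_t size () const { return m_table.size (); }

 private:
  std::unordered_map<loc_t, std::vector<loc_t>> m_table;
};
```

Note that a lookup keyed on anything other than the first token of a
concatenation falls back to the single-token case, which matches how the
patch keys entries only on the initial token.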

This approach also handles macros.

I have followup patches in progress (to c-format.c) that make it use
the new location information to underline bad format strings, and
provide fix-it hints for the format code that should have been
used, for PR c/64955 ("RFE: have -Wformat suggest the correct format
string to use").

Unfortunately this doesn't yet work with the C++ frontend;
the EXPR_LOCATION for the ADDR_EXPR wrapping the literals is
currently UNKNOWN_LOCATION, and this also gets overwritten
by the CALL_EXPR's location due to this in gimplify.c:

  /* FIXME diagnostics: This will mess up gcc.dg/Warray-bounds.c.  */
  /* Make sure arguments have the same location as the function call
     itself.  */
  protected_set_expr_location (*arg_p, call_location);

from 489c40889c8be89bd5bed4b166974f8c1e01e4ee (aka r140917):

+2008-10-06  Aldy Hernandez  <aldyh@redhat.com>
+
+       * gimplify.c (gimplify_arg): Add location argument.  Use it.
+       (gimplify_call_expr): Pass location to gimplify_arg.
+       (gimplify_modify_expr_to_memcpy): Same.
+       (gimplify_modify_expr_to_memset): Same.

which seems to be due to debug information:
  https://gcc.gnu.org/ml/gcc-patches/2008-10/msg00191.html

So this isn't quite ready yet.

Also, this patch currently makes the assumption (in charset.c)
that there's a 1:1 correspondence between bytes in the source
character set and bytes in the execution character set.  This can
be the case if both are, say, UTF-8, but might not hold in
general.
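As a small illustration of when that assumption would break (a sketch
with hypothetical helper names, not code from the patch): if the source
is UTF-8 and the execution character set were a single-byte encoding such
as Latin-1, any non-ASCII character would occupy more bytes in the source
than in the converted string, so byte offsets in one would no longer line
up with byte offsets in the other.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count the code points in UTF8 by counting the bytes that are not
// continuation bytes (10xxxxxx).  Assumes well-formed UTF-8.
std::size_t
count_code_points (const std::string &utf8)
{
  std::size_t n = 0;
  for (unsigned char c : utf8)
    if ((c & 0xC0) != 0x80)
      ++n;
  return n;
}

// Return true if every source byte of UTF8 would map to exactly one
// byte in a hypothetical single-byte execution character set, which
// holds only when each code point is encoded as a single byte.
bool
bytes_map_one_to_one (const std::string &utf8)
{
  return count_code_points (utf8) == utf8.size ();
}
```

A pure-ASCII literal such as "0123456789" satisfies the check, whereas a
literal containing U+00E9 (two bytes, 0xC3 0xA9, in UTF-8) does not.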

The source char set is UTF-8 or UTF-EBCDIC, and safe-ctype.c has:

# if HOST_CHARSET == HOST_CHARSET_EBCDIC
  #error "FIXME: write tables for EBCDIC"

so presumably we don't actually have any hosts that support EBCDIC
(do we?); as far as I can tell, we only currently support UTF-8
as the source char set.

Similarly, do we support any targets for which the execution
character set is *not* UTF-8?

Other notes:

- this patch is on top of
  "[PATCH] input.c: add lexing selftests and a test matrix for line_table states"
    https://gcc.gnu.org/ml/gcc-patches/2016-06/msg01340.html
and uses the test matrix idea there to exercise the lexing
under lots of interesting situations.

- string_concat_db has a bit more indirection than I'd like,
but this was necessary in order to get gengtype to work.

- the older approach (storing locations during initial lexing),
had a reasonably compact representation, storing runs of equal
columns-per-char, but it was bit-rotted by the range-packing
optimization of r230331.  FWIW I updated it, and there's a
working version of that idea at:
https://dmalcolm.fedorapeople.org/gcc/2016-07-01/rich-errors-gcc7-v15/0006-FIXME-Location-tracking-within-string-literals.patch
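For readers unfamiliar with the "runs of equal columns-per-char" idea
mentioned in the last note, here is a guess at its general shape as a
standalone sketch.  The names width_run, compress_widths and start_column
are hypothetical, and the representation in the older patch may well
differ in detail.

```cpp
#include <cassert>
#include <vector>

// A run of NUM_CHARS consecutive execution characters, each of which
// occupies COLS_PER_CHAR columns in the source.
struct width_run { int num_chars; int cols_per_char; };

// Run-length encode a list of per-character source widths.
std::vector<width_run>
compress_widths (const std::vector<int> &widths)
{
  std::vector<width_run> runs;
  for (int w : widths)
    {
      if (!runs.empty () && runs.back ().cols_per_char == w)
	runs.back ().num_chars++;
      else
	runs.push_back ({ 1, w });
    }
  return runs;
}

// Return the column at which character IDX starts, given that the
// first character starts at START_COL, or -1 if IDX is out of range.
int
start_column (const std::vector<width_run> &runs, int start_col, int idx)
{
  int col = start_col;
  for (const width_run &r : runs)
    {
      if (idx < r.num_chars)
	return col + idx * r.cols_per_char;
      col += r.num_chars * r.cols_per_char;
      idx -= r.num_chars;
    }
  return -1;
}
```

For a literal like "01234\x35 789", the ten characters have widths
1,1,1,1,1,4,1,1,1,1, which compress to three runs; a plain literal with
no escapes compresses to a single run regardless of its length.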

Successfully bootstrapped&regrtested on x86_64-pc-linux-gnu.

Thoughts?

gcc/c-family/ChangeLog:
	* c-common.c (g_string_concat_db): New global.
	* c-common.h (g_string_concat_db): New declaration.
	* c-lex.c (lex_string): When concatenating strings, capture the
	locations of all tokens using a new obstack, and record the
	concatenation locations within g_string_concat_db.
	* c-opts.c (c_common_init_options): Construct g_string_concat_db
	on the ggc-heap.

gcc/ChangeLog:
	* input.c (string_concat::string_concat): New constructor.
	(string_concat_db::string_concat_db): New constructor.
	(string_concat_db::record_string_concatenation): New method.
	(string_concat_db::get_string_concatenation): New method.
	(string_concat_db::get_key_loc): New method.
	(class auto_cpp_string_vec): New class.
	(get_substring_ranges_for_loc): New function.
	(get_source_range_for_substring): New function.
	(get_num_source_ranges_for_substring): New function.
	(test_builtins): Likewise.
	(struct selftest::lexer_test): New struct.
	(selftest::lexer_test::lexer_test): New constructor.
	(selftest::lexer_test::lexer_test): New constructor.
	(selftest::lexer_test::~lexer_test): New destructor.
	(selftest::lexer_test::get_token): New method.
	(selftest::assert_char_at_range): New function.
	(ASSERT_CHAR_AT_RANGE): New macro.
	(selftest::assert_num_substring_ranges): New function.
	(ASSERT_NUM_SUBSTRING_RANGES): New macro.
	(selftest::test_lexer_string_locations_simple): New function.
	(selftest::test_lexer_string_locations_hex): New function.
	(selftest::test_lexer_string_locations_oct): New function.
	(selftest::test_lexer_string_locations_ucn4): New function.
	(selftest::test_lexer_string_locations_ucn8): New function.
	(selftest::test_lexer_string_locations_u8): New function.
	(selftest::test_lexer_string_locations_utf8_source): New function.
	(selftest::test_lexer_string_locations_concatenation_1): New
	function.
	(selftest::test_lexer_string_locations_concatenation_2): New
	function.
	(selftest::test_lexer_string_locations_concatenation_3): New
	function.
	(selftest::test_lexer_string_locations_macro): New function.
	(selftest::input_c_tests): Call the new test functions once per
	case within the line_table test matrix.
	* input.h (struct string_concat): New struct.
	(struct location_hash): New struct.
	(class string_concat_db): New class.
	(get_source_range_for_substring): New prototype.
	* selftest.h (ASSERT_TRUE): Reimplement in terms of...
	(ASSERT_TRUE_AT): New macro.
	(ASSERT_FALSE): Reimplement in terms of...
	(ASSERT_FALSE_AT): New macro.
	(ASSERT_STREQ_AT): Fix typo in comment.

gcc/testsuite/ChangeLog:
	* gcc.dg/plugin/diagnostic-test-string-literals-1.c: New file.
	* gcc.dg/plugin/diagnostic_plugin_test_string_literals.c: New file.
	* gcc.dg/plugin/plugin.exp (plugin_test_list): Add
	diagnostic_plugin_test_string_literals.c and
	diagnostic-test-string-literals-1.c.

libcpp/ChangeLog:
	* charset.c (cpp_substring_ranges::cpp_substring_ranges): New
	constructor.
	(cpp_substring_ranges::~cpp_substring_ranges): New destructor.
	(cpp_substring_ranges::add_range): New method.
	(cpp_substring_ranges::add_n_ranges): New method.
	(_cpp_valid_ucn): Add "char_range" and "loc_reader" params; if
	they are non-NULL, read position information from *loc_reader
	and update char_range->m_finish accordingly.
	(convert_ucn): Add "char_range", "loc_reader", and "ranges"
	params.  If loc_reader is non-NULL, read location information from
	it, and update *ranges accordingly, using char_range.
	Conditionalize the conversion into tbuf on tbuf being non-NULL.
	(convert_hex): Likewise, conditionalizing the call to
	emit_numeric_escape on tbuf.
	(convert_oct): Likewise.
	(convert_escape): Add params "loc_reader" and "ranges".  If
	loc_reader is non-NULL, read location information from it, and
	update *ranges accordingly.  Conditionalize the conversion into
	tbuf on tbuf being non-NULL.
	(cpp_interpret_string): Rename to...
	(cpp_interpret_string_1): ...this, adding params "loc_readers" and
	"out".  Use "to" to conditionalize the initialization and usage of
	"tbuf", such as running the converter.  If "loc_readers" is
	non-NULL, use the instances within it, reading location
	information from them, and passing them to convert_escape; likewise
	write to "out" if loc_readers is non-NULL.
	(cpp_interpret_string): Reimplement in terms of
	cpp_interpret_string_1.
	(cpp_interpret_string_ranges): New function.
	(cpp_string_location_reader::cpp_string_location_reader): New
	constructor.
	(cpp_string_location_reader::get_next): New method.
	* include/cpplib.h (class cpp_string_location_reader): New class.
	(class cpp_substring_ranges): New class.
	(cpp_interpret_string_ranges): New prototype.
	* internal.h (_cpp_valid_ucn): Add params "char_range" and
	"loc_reader".
	* lex.c (forms_identifier_p): Pass NULL for new params to
	_cpp_valid_ucn.
---
 gcc/c-family/c-common.c                            |   5 +
 gcc/c-family/c-common.h                            |   2 +
 gcc/c-family/c-lex.c                               |  24 +-
 gcc/c-family/c-opts.c                              |   3 +
 gcc/input.c                                        | 977 +++++++++++++++++++++
 gcc/input.h                                        |  35 +
 gcc/selftest.h                                     |  30 +-
 .../plugin/diagnostic-test-string-literals-1.c     | 172 ++++
 .../diagnostic_plugin_test_string_literals.c       | 210 +++++
 gcc/testsuite/gcc.dg/plugin/plugin.exp             |   2 +
 libcpp/charset.c                                   | 355 ++++++--
 libcpp/include/cpplib.h                            |  50 ++
 libcpp/internal.h                                  |   4 +-
 libcpp/lex.c                                       |   2 +-
 14 files changed, 1807 insertions(+), 64 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c
 create mode 100644 gcc/testsuite/gcc.dg/plugin/diagnostic_plugin_test_string_literals.c

diff --git a/gcc/c-family/c-common.c b/gcc/c-family/c-common.c
index 936ddfb..90fdc79 100644
--- a/gcc/c-family/c-common.c
+++ b/gcc/c-family/c-common.c
@@ -12901,4 +12901,9 @@ diagnose_mismatched_attributes (tree olddecl, tree newdecl)
   return warned;
 }
 
+/* The global record of string concatenations, for use in
+   extracting locations within string literals.  */
+
+GTY(()) string_concat_db *g_string_concat_db;
+
 #include "gt-c-family-c-common.h"
diff --git a/gcc/c-family/c-common.h b/gcc/c-family/c-common.h
index 3ad5400..7e2e432 100644
--- a/gcc/c-family/c-common.h
+++ b/gcc/c-family/c-common.h
@@ -1116,6 +1116,8 @@ extern tree c_build_bind_expr (location_t, tree, tree);
 extern enum cpp_ttype
 conflict_marker_get_final_tok_kind (enum cpp_ttype tok1_kind);
 
+extern GTY(()) string_concat_db *g_string_concat_db;
+
 /* In c-pch.c  */
 extern void pch_init (void);
 extern void pch_cpp_save_state (void);
diff --git a/gcc/c-family/c-lex.c b/gcc/c-family/c-lex.c
index 8f33d86..4c7e385 100644
--- a/gcc/c-family/c-lex.c
+++ b/gcc/c-family/c-lex.c
@@ -1097,13 +1097,16 @@ lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
   tree value;
   size_t concats = 0;
   struct obstack str_ob;
+  struct obstack loc_ob;
   cpp_string istr;
   enum cpp_ttype type = tok->type;
 
   /* Try to avoid the overhead of creating and destroying an obstack
      for the common case of just one string.  */
   cpp_string str = tok->val.str;
+  location_t init_loc = tok->src_loc;
   cpp_string *strs = &str;
+  location_t *locs = NULL;
 
   /* objc_at_sign_was_seen is only used when doing Objective-C string
      concatenation.  It is 'true' if we have seen an '@' before the
@@ -1142,16 +1145,21 @@ lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
 	  else
 	    error ("unsupported non-standard concatenation of string literals");
 	}
+      /* FALLTHROUGH */
 
     case CPP_STRING:
       if (!concats)
 	{
 	  gcc_obstack_init (&str_ob);
+	  gcc_obstack_init (&loc_ob);
 	  obstack_grow (&str_ob, &str, sizeof (cpp_string));
+	  obstack_grow (&loc_ob, &init_loc, sizeof (location_t));
 	}
 
       concats++;
       obstack_grow (&str_ob, &tok->val.str, sizeof (cpp_string));
+      obstack_grow (&loc_ob, &tok->src_loc, sizeof (location_t));
+
       if (objc_string)
 	objc_at_sign_was_seen = false;
       goto retry;
@@ -1164,7 +1172,10 @@ lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
   /* We have read one more token than we want.  */
   _cpp_backup_tokens (parse_in, 1);
   if (concats)
-    strs = XOBFINISH (&str_ob, cpp_string *);
+    {
+      strs = XOBFINISH (&str_ob, cpp_string *);
+      locs = XOBFINISH (&loc_ob, location_t *);
+    }
 
   if (concats && !objc_string && !in_system_header_at (input_location))
     warning (OPT_Wtraditional,
@@ -1176,6 +1187,12 @@ lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
     {
       value = build_string (istr.len, (const char *) istr.text);
       free (CONST_CAST (unsigned char *, istr.text));
+      if (concats)
+	{
+	  gcc_assert (locs);
+	  gcc_assert (g_string_concat_db);
+	  g_string_concat_db->record_string_concatenation (concats + 1, locs);
+	}
     }
   else
     {
@@ -1227,7 +1244,10 @@ lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
   *valp = fix_string_type (value);
 
   if (concats)
-    obstack_free (&str_ob, 0);
+    {
+      obstack_free (&str_ob, 0);
+      obstack_free (&loc_ob, 0);
+    }
 
   return objc_string ? CPP_OBJC_STRING : type;
 }
diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
index ff6339c..16f525b 100644
--- a/gcc/c-family/c-opts.c
+++ b/gcc/c-family/c-opts.c
@@ -216,6 +216,9 @@ c_common_init_options (unsigned int decoded_options_count,
   unsigned int i;
   struct cpp_callbacks *cb;
 
+  g_string_concat_db
+    = new (ggc_alloc <string_concat_db> ()) string_concat_db ();
+
   parse_in = cpp_create_reader (c_dialect_cxx () ? CLK_GNUCXX: CLK_GNUC89,
 				ident_hash, line_table);
   cb = cpp_get_callbacks (parse_in);
diff --git a/gcc/input.c b/gcc/input.c
index a916597..316b1b5 100644
--- a/gcc/input.c
+++ b/gcc/input.c
@@ -1139,6 +1139,249 @@ dump_location_info (FILE *stream)
 				MAX_SOURCE_LOCATION + 1, UINT_MAX);
 }
 
+/* string_concat's constructor.  */
+
+string_concat::string_concat (int num, location_t *locs)
+  : m_num (num)
+{
+  m_locs = ggc_vec_alloc <location_t> (num);
+  for (int i = 0; i < num; i++)
+    m_locs[i] = locs[i];
+}
+
+/* string_concat_db's constructor.  */
+
+string_concat_db::string_concat_db ()
+{
+  m_table = hash_map <location_hash, string_concat *>::create_ggc (64);
+}
+
+/* Record that a string concatenation occurred, covering NUM
+   string literal tokens.  LOCS is an array of size NUM, containing the
+   locations of the tokens.  A copy of LOCS is taken.  */
+
+void
+string_concat_db::record_string_concatenation (int num, location_t *locs)
+{
+  gcc_assert (num > 1);
+  gcc_assert (locs);
+
+  location_t key_loc = get_key_loc (locs[0]);
+
+  string_concat *concat
+    = new (ggc_alloc <string_concat> ()) string_concat (num, locs);
+  m_table->put (key_loc, concat);
+}
+
+/* Determine if LOC was the location of the initial token of a
+   concatenation of string literal tokens.
+   If so, *OUT_NUM is written to with the number of tokens, and
+   *OUT_LOCS with the location of an array of locations of the
+   tokens, and return true.  *OUT_LOCS is a borrowed pointer to
+   storage owned by the string_concat_db.
+   Otherwise, return false.  */
+
+bool
+string_concat_db::get_string_concatenation (location_t loc,
+					    int *out_num,
+					    location_t **out_locs)
+{
+  gcc_assert (out_num);
+  gcc_assert (out_locs);
+
+  location_t key_loc = get_key_loc (loc);
+
+  string_concat **concat = m_table->get (key_loc);
+  if (!concat)
+    return false;
+
+  *out_num = (*concat)->m_num;
+  *out_locs = (*concat)->m_locs;
+  return true;
+}
+
+/* Internal function.  Canonicalize LOC into a form suitable for
+   use as a key within the database, stripping away macro expansion,
+   ad-hoc information, and range information, using the location of
+   the start of LOC within an ordinary linemap.  */
+
+location_t
+string_concat_db::get_key_loc (location_t loc)
+{
+  loc = linemap_resolve_location (line_table, loc, LRK_SPELLING_LOCATION,
+				  NULL);
+
+  loc = get_range_from_loc (line_table, loc).m_start;
+
+  return loc;
+}
+
+/* Helper class for use within get_substring_ranges_for_loc.
+   A vec of cpp_string with responsibility for releasing all of the
+   str->text for each str in the vector.  */
+
+class auto_cpp_string_vec :  public auto_vec <cpp_string>
+{
+ public:
+  auto_cpp_string_vec (int alloc)
+    : auto_vec <cpp_string> (alloc) {}
+
+  ~auto_cpp_string_vec ()
+  {
+    /* Clean up the copies within this vec.  */
+    int i;
+    cpp_string *str;
+    FOR_EACH_VEC_ELT (*this, i, str)
+      free (const_cast <unsigned char *> (str->text));
+  }
+};
+
+/* Attempt to populate RANGES with source location information on the
+   individual characters within the string literal found at STRLOC.
+   If CONCATS is non-NULL, then any string literals that the token at
+   STRLOC  was concatenated with are also added to RANGES.
+
+   Return true if successful, or false if any errors occurred (in
+   which case RANGES may be only partially populated and should not
+   be used).
+
+   This is implemented by re-parsing the relevant source line(s).  */
+
+static bool
+get_substring_ranges_for_loc (cpp_reader *pfile,
+			      string_concat_db *concats,
+			      location_t strloc,
+			      cpp_substring_ranges &ranges)
+{
+  gcc_assert (pfile);
+
+  if (strloc == UNKNOWN_LOCATION)
+    return false;
+
+  /* If string concatenation has occurred at STRLOC, get the locations
+     of all of the literal tokens making up the compound string.
+     Otherwise, just use STRLOC.  */
+  int num_locs = 1;
+  location_t *strlocs = &strloc;
+  if (concats)
+    concats->get_string_concatenation (strloc, &num_locs, &strlocs);
+
+  auto_cpp_string_vec strs (num_locs);
+  auto_vec <cpp_string_location_reader> loc_readers (num_locs);
+  for (int i = 0; i < num_locs; i++)
+    {
+      /* Get range of strloc.  We will use it to locate the start and finish
+	 of the literal token within the line.  */
+      source_range src_range = get_range_from_loc (line_table, strlocs[i]);
+
+      if (src_range.m_start >= LINE_MAP_MAX_LOCATION_WITH_COLS)
+	/* If so, we can't reliably determine where the token started within
+	   its line.  */
+	return false;
+
+      if (src_range.m_finish >= LINE_MAP_MAX_LOCATION_WITH_COLS)
+	/* If so, we can't reliably determine where the token finished within
+	   its line.  */
+	return false;
+
+      expanded_location start
+	= expand_location_to_spelling_point (src_range.m_start);
+      expanded_location finish
+	= expand_location_to_spelling_point (src_range.m_finish);
+      if (start.file != finish.file)
+	return false;
+      if (start.line != finish.line)
+	return false;
+      if (start.column > finish.column)
+	return false;
+
+      int line_width;
+      const char *line = location_get_source_line (start.file, start.line,
+						   &line_width);
+      if (line == NULL)
+	return false;
+
+      /* Determine the location of the literal (including quotes
+	 and leading prefix chars, such as the 'u' in a u""
+	 token).  */
+      const char *literal = line + start.column - 1;
+      int literal_length = finish.column - start.column + 1;
+
+      gcc_assert (line_width >= (start.column - 1 + literal_length));
+      cpp_string from;
+      from.len = literal_length;
+      /* Make a copy of the literal, to avoid having to rely on
+	 the lifetime of the copy of the line within the cache.
+	 This will be released by the auto_cpp_string_vec dtor.  */
+      from.text = XDUPVEC (unsigned char, literal, literal_length);
+      strs.safe_push (from);
+      cpp_string_location_reader loc_reader (strlocs[i], line_table);
+      loc_readers.safe_push (loc_reader);
+    }
+
+  /* Rerun cpp_interpret_string, or rather, a modified version of it.  */
+  if (!cpp_interpret_string_ranges (pfile, strs.address (),
+				    loc_readers.address (),
+				    num_locs, &ranges))
+    return false;
+
+  /* Success: "ranges" should now contain information on the string.  */
+  return true;
+}
+
+/* Attempt to populate *OUT with source location information on the
+   range of given characters within the string literal found at STRLOC.
+   START_IDX and END_IDX refer to offsets within the execution character
+   set.
+   If CONCATS is non-NULL, then any string literals that the token at
+   STRLOC was concatenated with are also considered.
+
+   Return true if successful, or false if any errors occurred.
+
+   This is implemented by re-parsing the relevant source line(s).  */
+
+bool
+get_source_range_for_substring (cpp_reader *pfile,
+				string_concat_db *concats,
+				location_t strloc,
+				int start_idx, int end_idx, source_range *out)
+{
+  gcc_checking_assert (start_idx >= 0);
+  gcc_checking_assert (end_idx >= 0);
+  gcc_assert (out);
+
+  cpp_substring_ranges ranges;
+  if (!get_substring_ranges_for_loc (pfile, concats, strloc, ranges))
+    return false;
+
+  if (start_idx >= ranges.get_num_ranges ()
+      || end_idx >= ranges.get_num_ranges ())
+      return false;
+
+  out->m_start = ranges.get_range (start_idx).m_start;
+  out->m_finish = ranges.get_range (end_idx).m_finish;
+  return true;
+}
+
+/* As get_source_range_for_substring, but write to *OUT the number
+   of ranges that are available.  */
+
+bool
+get_num_source_ranges_for_substring (cpp_reader *pfile,
+				     string_concat_db *concats,
+				     location_t strloc,
+				     int *out)
+{
+  gcc_assert (out);
+
+  cpp_substring_ranges ranges;
+  if (!get_substring_ranges_for_loc (pfile, concats, strloc, ranges))
+    return false;
+
+  *out = ranges.get_num_ranges ();
+  return true;
+}
+
 #if CHECKING_P
 
 namespace selftest {
@@ -1481,6 +1724,729 @@ test_lexer (const line_table_case &case_)
   cpp_destroy (parser);
 }
 
+/* A struct for writing lexer tests.  */
+
+struct lexer_test
+{
+  lexer_test (const line_table_case &case_, const char *content);
+  ~lexer_test ();
+
+  const cpp_token *get_token ();
+
+  temp_source_file m_tempfile;
+  temp_line_table m_tmp_lt;
+  cpp_reader *m_parser;
+  string_concat_db m_concats;
+};
+
+/* Constructor.  Override line_table with a new instance based on CASE_,
+   and write CONTENT to a tempfile.  Create a cpp_reader, and use it to
+   start parsing the tempfile.  */
+
+lexer_test::lexer_test (const line_table_case &case_, const char *content) :
+  /* Create a tempfile and write the text to it.  */
+  m_tempfile (SELFTEST_LOCATION, ".c", content),
+  m_tmp_lt (case_),
+  m_parser (cpp_create_reader (CLK_GNUC99, NULL, line_table)),
+  m_concats ()
+{
+  cpp_init_iconv (m_parser);
+
+  /* Parse the file.  */
+  const char *fname = cpp_read_main_file (m_parser,
+					  m_tempfile.get_filename ());
+  ASSERT_NE (fname, NULL);
+}
+
+/* Destructor.  Verify that the next token in m_parser is EOF.  */
+
+lexer_test::~lexer_test ()
+{
+  location_t loc;
+  const cpp_token *tok;
+
+  tok = cpp_get_token_with_location (m_parser, &loc);
+  ASSERT_NE (tok, NULL);
+  ASSERT_EQ (tok->type, CPP_EOF);
+
+  cpp_finish (m_parser, NULL);
+  cpp_destroy (m_parser);
+}
+
+/* Get the next token from m_parser.  */
+
+const cpp_token *
+lexer_test::get_token ()
+{
+  location_t loc;
+  const cpp_token *tok;
+
+  tok = cpp_get_token_with_location (m_parser, &loc);
+  ASSERT_NE (tok, NULL);
+  return tok;
+}
+
+/* Verify that locations within string literals are correctly handled.  */
+
+/* Verify get_source_range_for_substring for token(s) at STRLOC,
+   using the string concatenation database for TEST.
+
+   Assert that the character at index IDX is on EXPECTED_LINE,
+   and that it begins at column EXPECTED_START_COL and ends at
+   EXPECTED_FINISH_COL (unless the locations are beyond
+   LINE_MAP_MAX_LOCATION_WITH_COLS, in which case don't check their
+   columns).  */
+
+static void
+assert_char_at_range (const location &loc,
+		      lexer_test& test,
+		      location_t strloc, int idx, int expected_line,
+		      int expected_start_col, int expected_finish_col)
+{
+  cpp_reader *pfile = test.m_parser;
+  string_concat_db *concats = &test.m_concats;
+
+  source_range actual_range;
+  bool result = get_source_range_for_substring (pfile, concats, strloc,
+						idx, idx, &actual_range);
+  if (should_have_column_data_p (strloc))
+    ASSERT_TRUE_AT (loc, result);
+  else
+    {
+      ASSERT_FALSE_AT (loc, result);
+      return;
+    }
+
+  int actual_start_line = LOCATION_LINE (actual_range.m_start);
+  ASSERT_EQ_AT (loc, expected_line, actual_start_line);
+  int actual_finish_line = LOCATION_LINE (actual_range.m_finish);
+  ASSERT_EQ_AT (loc, expected_line, actual_finish_line);
+
+  if (should_have_column_data_p (actual_range.m_start))
+    {
+      int actual_start_col = LOCATION_COLUMN (actual_range.m_start);
+      ASSERT_EQ_AT (loc, expected_start_col, actual_start_col);
+    }
+  if (should_have_column_data_p (actual_range.m_finish))
+    {
+      int actual_finish_col = LOCATION_COLUMN (actual_range.m_finish);
+      ASSERT_EQ_AT (loc, expected_finish_col, actual_finish_col);
+    }
+}
+
+/* Macro for calling assert_char_at_range, supplying SELFTEST_LOCATION for
+   the effective location of any errors.  */
+
+#define ASSERT_CHAR_AT_RANGE(LEXER_TEST, STRLOC, IDX, EXPECTED_LINE, \
+			     EXPECTED_START_COL,			\
+			     EXPECTED_FINISH_COL)			\
+  assert_char_at_range (SELFTEST_LOCATION, (LEXER_TEST), (STRLOC), (IDX), \
+			(EXPECTED_LINE), \
+			(EXPECTED_START_COL), (EXPECTED_FINISH_COL))
+
+/* Verify get_num_source_ranges_for_substring for token(s) at STRLOC,
+   using the string concatenation database for TEST.
+
+   Assert that the token(s) at STRLOC contain EXPECTED_NUM_RANGES.  */
+
+static void
+assert_num_substring_ranges (const location &loc,
+			     lexer_test& test,
+			     location_t strloc,
+			     int expected_num_ranges)
+{
+  cpp_reader *pfile = test.m_parser;
+  string_concat_db *concats = &test.m_concats;
+
+  int actual_num_ranges;
+  bool result
+    = get_num_source_ranges_for_substring (pfile, concats, strloc,
+					   &actual_num_ranges);
+  if (should_have_column_data_p (strloc))
+    ASSERT_TRUE_AT (loc, result);
+  else
+    {
+      ASSERT_FALSE_AT (loc, result);
+      return;
+    }
+  ASSERT_EQ_AT (loc, expected_num_ranges, actual_num_ranges);
+}
+
+/* Macro for calling assert_num_substring_ranges, supplying
+   SELFTEST_LOCATION for the effective location of any errors.  */
+
+#define ASSERT_NUM_SUBSTRING_RANGES(LEXER_TEST, STRLOC, EXPECTED_NUM_RANGES) \
+    assert_num_substring_ranges (SELFTEST_LOCATION, (LEXER_TEST), (STRLOC), \
+				 (EXPECTED_NUM_RANGES))
+
+/* Lex a simple string literal.  Verify the substring location data, before
+   and after running cpp_interpret_string on it.  */
+
+static void
+test_lexer_string_locations_simple (const line_table_case &case_)
+{
+  /* Digits 0-9 (with 0 at column 10), the simple way.
+     ....................000000000.11111111112.2222222223333333333
+     ....................123456789.01234567890.1234567890123456789
+     We add a trailing comment to ensure that we correctly locate
+     the end of the string literal token.  */
+  const char *content = "        \"0123456789\" /* not a string */\n";
+  lexer_test test (case_, content);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"0123456789\"");
+  ASSERT_TOKEN_LOC_EQ (tok, test.m_tempfile.get_filename (), 1, 9, 20);
+
+  /* At this point in lexing, the quote characters are treated as part of
+     the string (they are stripped off by cpp_interpret_string).  */
+
+  ASSERT_EQ (tok->val.str.len, 12);
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, CPP_STRING);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("0123456789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.  */
+  for (int i = 0; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, i, 1,
+			  10 + i, 10 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, 10);
+}
+
+/* Lex a string literal containing a hex-escaped character.
+   Verify the substring location data, before and after running
+   cpp_interpret_string on it.  */
+
+static void
+test_lexer_string_locations_hex (const line_table_case &case_)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as "\x35"
+     and with a space in place of digit 6, to terminate the escaped
+     hex code.
+     ....................000000000.111111.11112222.
+     ....................123456789.012345.67890123.  */
+  const char *content = "        \"01234\\x35 789\"\n";
+  lexer_test test (case_, content);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"01234\\x35 789\"");
+  ASSERT_TOKEN_LOC_EQ (tok, test.m_tempfile.get_filename (), 1, 9, 23);
+
+  /* At this point in lexing, the quote characters are treated as part of
+     the string (they are stripped off by cpp_interpret_string).  */
+  ASSERT_EQ (tok->val.str.len, 15);
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, CPP_STRING);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("012345 789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.  */
+  for (int i = 0; i < 5; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, i, 1, 10 + i, 10 + i);
+  ASSERT_CHAR_AT_RANGE (test, tok->src_loc, 5, 1, 15, 18);
+  for (int i = 6; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, i, 1, 13 + i, 13 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, 10);
+}
+
+/* Lex a string literal containing an octal-escaped character.
+   Verify the substring location data after running cpp_interpret_string
+   on it.  */
+
+static void
+test_lexer_string_locations_oct (const line_table_case &case_)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as "\065"
+     and with a space in place of digit 6, to terminate the escaped
+     octal code.
+     ....................000000000.111111.11112222.2222223333333333444
+     ....................123456789.012345.67890123.4567890123456789012  */
+  const char *content = "        \"01234\\065 789\" /* not a string */\n";
+  lexer_test test (case_, content);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"01234\\065 789\"");
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, CPP_STRING);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("012345 789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.  */
+  for (int i = 0; i < 5; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, i, 1, 10 + i, 10 + i);
+  ASSERT_CHAR_AT_RANGE (test, tok->src_loc, 5, 1, 15, 18);
+  for (int i = 6; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, i, 1, 13 + i, 13 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, 10);
+}
+
+/* Lex a string literal containing UCN 4 characters.
+   Verify the substring location data after running cpp_interpret_string
+   on it.  */
+
+static void
+test_lexer_string_locations_ucn4 (const line_table_case &case_)
+{
+  /* Digits 0-9, expressing digits 5 and 6 as Roman numerals expressed
+     as UCN 4.
+     ....................000000000.111111.111122.222222223.33333333344444
+     ....................123456789.012345.678901.234567890.12345678901234  */
+  const char *content = "        \"01234\\u2174\\u2175789\" /* non-str */\n";
+  lexer_test test (case_, content);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"01234\\u2174\\u2175789\"");
+
+  /* Verify that cpp_interpret_string works.
+     The string should be encoded in the execution character
+     set.  Assuming that that is UTF-8, we should have the following:
+     -----------  ----  -----  -------  ----------------
+     Byte offset  Byte  Octal  Unicode  Source Column(s)
+     -----------  ----  -----  -------  ----------------
+     0            0x30         '0'      10
+     1            0x31         '1'      11
+     2            0x32         '2'      12
+     3            0x33         '3'      13
+     4            0x34         '4'      14
+     5            0xE2  \342   U+2174   15-20
+     6            0x85  \205    (cont)  15-20
+     7            0xB4  \264    (cont)  15-20
+     8            0xE2  \342   U+2175   21-26
+     9            0x85  \205    (cont)  21-26
+     10           0xB5  \265    (cont)  21-26
+     11           0x37         '7'      27
+     12           0x38         '8'      28
+     13           0x39         '9'      29
+     -----------  ----  -----  -------  ---------------.  */
+
+  cpp_string dst_string;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, CPP_STRING);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("01234\342\205\264\342\205\265789",
+		(const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.
+     '01234'.  */
+  for (int i = 0; i <= 4; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, i, 1, 10 + i, 10 + i);
+  /* U+2174.  */
+  for (int i = 5; i <= 7; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, i, 1, 15, 20);
+  /* U+2175.  */
+  for (int i = 8; i <= 10; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, i, 1, 21, 26);
+  /* '789'.  */
+  for (int i = 11; i <= 13; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, i, 1, 16 + i, 16 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, 14);
+}
+
+/* Lex a string literal containing UCN 8 characters.
+   Verify the substring location data after running cpp_interpret_string
+   on it.  */
+
+static void
+test_lexer_string_locations_ucn8 (const line_table_case &case_)
+{
+  /* Digits 0-9, expressing digits 5 and 6 as Roman numerals as UCN 8.
+     ....................000000000.111111.1111222222.2222333333333.344444
+     ....................123456789.012345.6789012345.6789012345678.901234  */
+  const char *content = "        \"01234\\U00002174\\U00002175789\" /* */\n";
+  lexer_test test (case_, content);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok,
+			   "\"01234\\U00002174\\U00002175789\"");
+
+  /* Verify that cpp_interpret_string works.
+     The UTF-8 encoding of the string is identical to that from
+     the ucn4 testcase above; the only difference is the column
+     locations.  */
+  cpp_string dst_string;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, CPP_STRING);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("01234\342\205\264\342\205\265789",
+		(const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.
+     '01234'.  */
+  for (int i = 0; i <= 4; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, i, 1, 10 + i, 10 + i);
+  /* U+2174.  */
+  for (int i = 5; i <= 7; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, i, 1, 15, 24);
+  /* U+2175.  */
+  for (int i = 8; i <= 10; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, i, 1, 25, 34);
+  /* '789' at columns 35-37.  */
+  for (int i = 11; i <= 13; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, i, 1, 24 + i, 24 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, 14);
+}
+
+/* Lex a u8-string literal.
+   Verify the substring location data after running cpp_interpret_string
+   on it.  */
+
+static void
+test_lexer_string_locations_u8 (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     ....................000000000.11111111112.22222222233333
+     ....................123456789.01234567890.12345678901234  */
+  const char *content = "      u8\"0123456789\" /* non-str */\n";
+  lexer_test test (case_, content);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_UTF8STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "u8\"0123456789\"");
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, CPP_STRING);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("0123456789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.  */
+  for (int i = 0; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, i, 1, 10 + i, 10 + i);
+}
+
+/* Lex a string literal containing UTF-8 source characters.
+   Verify the substring location data after running cpp_interpret_string
+   on it.  */
+
+static void
+test_lexer_string_locations_utf8_source (const line_table_case &case_)
+{
+ /* This string literal is written out to the source file as UTF-8,
+    and is of the form "before mojibake after", where "mojibake"
+    is written as the following four unicode code points:
+       U+6587 CJK UNIFIED IDEOGRAPH-6587
+       U+5B57 CJK UNIFIED IDEOGRAPH-5B57
+       U+5316 CJK UNIFIED IDEOGRAPH-5316
+       U+3051 HIRAGANA LETTER KE.
+     Each of these is 3 bytes wide when encoded in UTF-8, whereas the
+     "before" and "after" are 1 byte per unicode character.
+
+     The numbers shown are "columns", which are *byte* numbers within
+     the line, rather than unicode character numbers.
+
+     .................... 000000000.1111111.
+     .................... 123456789.0123456.  */
+  const char *content = ("        \"before "
+			 /* U+6587 CJK UNIFIED IDEOGRAPH-6587
+			      UTF-8: 0xE6 0x96 0x87
+			      C octal escaped UTF-8: \346\226\207
+			    "column" numbers: 17-19.  */
+			 "\346\226\207"
+
+			 /* U+5B57 CJK UNIFIED IDEOGRAPH-5B57
+			      UTF-8: 0xE5 0xAD 0x97
+			      C octal escaped UTF-8: \345\255\227
+			    "column" numbers: 20-22.  */
+			 "\345\255\227"
+
+			 /* U+5316 CJK UNIFIED IDEOGRAPH-5316
+			      UTF-8: 0xE5 0x8C 0x96
+			      C octal escaped UTF-8: \345\214\226
+			    "column" numbers: 23-25.  */
+			 "\345\214\226"
+
+			 /* U+3051 HIRAGANA LETTER KE
+			      UTF-8: 0xE3 0x81 0x91
+			      C octal escaped UTF-8: \343\201\221
+			    "column" numbers: 26-28.  */
+			 "\343\201\221"
+
+			 /* column numbers 29 onwards
+			  2333333.33334444444444
+			  9012345.67890123456789. */
+			 " after\" /* non-str */\n");
+  lexer_test test (case_, content);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ
+    (test.m_parser, tok,
+     "\"before \346\226\207\345\255\227\345\214\226\343\201\221 after\"");
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, CPP_STRING);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ
+    ("before \346\226\207\345\255\227\345\214\226\343\201\221 after",
+     (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.
+     Assuming that both source and execution encodings are UTF-8, we have
+     a run of 25 octets in each.  */
+  for (int i = 0; i < 25; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, i, 1, 10 + i, 10 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, 25);
+}
+
+/* Test of string literal concatenation.  */
+
+static void
+test_lexer_string_locations_concatenation_1 (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     .....................000000000.111111.11112222222222
+     .....................123456789.012345.67890123456789.  */
+  const char *content = ("        \"01234\" /* non-str */\n"
+			 "        \"56789\" /* non-str */\n");
+  lexer_test test (case_, content);
+
+  location_t input_locs[2];
+
+  /* Verify that we get the expected tokens back.  */
+  auto_vec <cpp_string> input_strings;
+  const cpp_token *tok_a = test.get_token ();
+  ASSERT_EQ (tok_a->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ
+    (test.m_parser, tok_a,
+     "\"01234\"");
+  input_strings.safe_push (tok_a->val.str);
+  input_locs[0] = tok_a->src_loc;
+
+  const cpp_token *tok_b = test.get_token ();
+  ASSERT_EQ (tok_b->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ
+    (test.m_parser, tok_b,
+     "\"56789\"");
+  input_strings.safe_push (tok_b->val.str);
+  input_locs[1] = tok_b->src_loc;
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  bool result = cpp_interpret_string (test.m_parser,
+				      input_strings.address (), 2,
+				      &dst_string, CPP_STRING);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("0123456789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Simulate c-lex.c's lex_string in order to record concatenation.  */
+  test.m_concats.record_string_concatenation (2, input_locs);
+
+  location_t initial_loc = input_locs[0];
+
+  for (int i = 0; i < 5; i++)
+    ASSERT_CHAR_AT_RANGE (test, initial_loc, i, 1, 10 + i, 10 + i);
+  for (int i = 5; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, initial_loc, i, 2, 5 + i, 5 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, initial_loc, 10);
+}
+
+/* Another test of string literal concatenation.  */
+
+static void
+test_lexer_string_locations_concatenation_2 (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     .....................000000000.111.11111112222222
+     .....................123456789.012.34567890123456.  */
+  const char *content = ("        \"01\" /* non-str */\n"
+			 "        \"23\" /* non-str */\n"
+			 "        \"45\" /* non-str */\n"
+			 "        \"67\" /* non-str */\n"
+			 "        \"89\" /* non-str */\n");
+  lexer_test test (case_, content);
+
+  auto_vec <cpp_string> input_strings;
+  location_t input_locs[5];
+
+  /* Verify that we get the expected tokens back.  */
+  for (int i = 0; i < 5; i++)
+    {
+      const cpp_token *tok = test.get_token ();
+      ASSERT_EQ (tok->type, CPP_STRING);
+      input_strings.safe_push (tok->val.str);
+      input_locs[i] = tok->src_loc;
+    }
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  bool result = cpp_interpret_string (test.m_parser,
+				      input_strings.address (), 5,
+				      &dst_string, CPP_STRING);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("0123456789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Simulate c-lex.c's lex_string in order to record concatenation.  */
+  test.m_concats.record_string_concatenation (5, input_locs);
+
+  location_t initial_loc = input_locs[0];
+
+  /* Within ASSERT_CHAR_AT_RANGE (actually assert_char_at_range), we can
+     detect if the initial loc is after LINE_MAP_MAX_LOCATION_WITH_COLS
+     and expect get_source_range_for_substring to fail.
+     However, for a string concatenation test, we can have a case
+     where the initial string is fully before LINE_MAP_MAX_LOCATION_WITH_COLS,
+     but subsequent strings can be after it.
+     Attempting to detect this within assert_char_at_range
+     would overcomplicate the logic for the common test cases, so
+     we detect it here.  */
+  if (should_have_column_data_p (input_locs[0])
+      && !should_have_column_data_p (input_locs[4]))
+    {
+      /* Verify that get_source_range_for_substring gracefully rejects
+	 this case.  */
+      source_range actual_range;
+      bool result
+	= get_source_range_for_substring (test.m_parser, &test.m_concats,
+					  initial_loc, 0, 0, &actual_range);
+      ASSERT_FALSE (result);
+      return;
+    }
+
+  for (int i = 0; i < 5; i++)
+    for (int j = 0; j < 2; j++)
+      ASSERT_CHAR_AT_RANGE (test, initial_loc, (i * 2) + j,
+			    i + 1, 10 + j, 10 + j);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, initial_loc, 10);
+}
+
+/* Another test of string literal concatenation, this time combined with
+   various kinds of escaped characters.  */
+
+static void
+test_lexer_string_locations_concatenation_3 (const line_table_case &case_)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as hex "\x35" and
+     digit 6 in ASCII as octal "\066", concatenating multiple strings.  */
+  const char *content
+    /* .000000000.111111.111.1.2222.222.2.2233.333.3333.34444444444555
+       .123456789.012345.678.9.0123.456.7.8901.234.5678.90123456789012. */
+    = ("        \"01234\"  \"\\x35\"  \"\\066\"  \"789\" /* non-str */\n");
+  lexer_test test (case_, content);
+
+  auto_vec <cpp_string> input_strings;
+  location_t input_locs[4];
+
+  /* Verify that we get the expected tokens back.  */
+  for (int i = 0; i < 4; i++)
+    {
+      const cpp_token *tok = test.get_token ();
+      ASSERT_EQ (tok->type, CPP_STRING);
+      input_strings.safe_push (tok->val.str);
+      input_locs[i] = tok->src_loc;
+    }
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  bool result = cpp_interpret_string (test.m_parser,
+				      input_strings.address (), 4,
+				      &dst_string, CPP_STRING);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("0123456789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Simulate c-lex.c's lex_string in order to record concatenation.  */
+  test.m_concats.record_string_concatenation (4, input_locs);
+
+  location_t initial_loc = input_locs[0];
+
+  for (int i = 0; i <= 4; i++)
+    ASSERT_CHAR_AT_RANGE (test, initial_loc, i, 1, 10 + i, 10 + i);
+  ASSERT_CHAR_AT_RANGE (test, initial_loc, 5, 1, 19, 22);
+  ASSERT_CHAR_AT_RANGE (test, initial_loc, 6, 1, 27, 30);
+  for (int i = 7; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, initial_loc, i, 1, 28 + i, 28 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, initial_loc, 10);
+}
+
+/* Test of string literal in a macro.  */
+
+static void
+test_lexer_string_locations_macro (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     .....................0000000001111111111.22222222223.
+     .....................1234567890123456789.01234567890.  */
+  const char *content = ("#define MACRO     \"0123456789\" /* non-str */\n"
+			 "  MACRO");
+  lexer_test test (case_, content);
+
+  /* Verify that we get the expected tokens back.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_PADDING);
+
+  tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ
+    (test.m_parser, tok,
+     "\"0123456789\"");
+
+  /* Verify ranges of individual characters.  We ought to
+     see columns within the macro definition.  */
+  for (int i = 0; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, i, 1, 20 + i, 20 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, 10);
+
+  tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_PADDING);
+}
+
 /* A table of interesting location_t values, giving one axis of our test
    matrix.  */
 
@@ -1539,6 +2505,17 @@ input_c_tests ()
 	  /* Run all tests for the given case within the test matrix.  */
 	  test_accessing_ordinary_linemaps (c);
 	  test_lexer (c);
+	  test_lexer_string_locations_simple (c);
+	  test_lexer_string_locations_hex (c);
+	  test_lexer_string_locations_oct (c);
+	  test_lexer_string_locations_ucn4 (c);
+	  test_lexer_string_locations_ucn8 (c);
+	  test_lexer_string_locations_u8 (c);
+	  test_lexer_string_locations_utf8_source (c);
+	  test_lexer_string_locations_concatenation_1 (c);
+	  test_lexer_string_locations_concatenation_2 (c);
+	  test_lexer_string_locations_concatenation_3 (c);
+	  test_lexer_string_locations_macro (c);
 
 	  num_cases_tested++;
 	}
diff --git a/gcc/input.h b/gcc/input.h
index ae4fecf..2f77afe 100644
--- a/gcc/input.h
+++ b/gcc/input.h
@@ -82,4 +82,39 @@ void dump_location_info (FILE *stream);
 
 void diagnostics_file_cache_fini (void);
 
+struct GTY(()) string_concat
+{
+  string_concat (int num, location_t *locs);
+
+  int m_num;
+  location_t * GTY ((atomic)) m_locs;
+};
+
+struct location_hash : int_hash <location_t, UNKNOWN_LOCATION> { };
+
+class GTY(()) string_concat_db
+{
+ public:
+  string_concat_db ();
+  void record_string_concatenation (int num, location_t *locs);
+
+  bool get_string_concatenation (location_t loc,
+				 int *out_num,
+				 location_t **out_locs);
+
+ private:
+  static location_t get_key_loc (location_t loc);
+
+ public:
+  /* This would be private, but must be public for use by
+     gtype-desc.c.  */
+  hash_map <location_hash, string_concat *> *m_table;
+};
+
+extern bool get_source_range_for_substring (cpp_reader *pfile,
+					    string_concat_db *concats,
+					    location_t strloc,
+					    int start_idx, int end_idx,
+					    source_range *out);
+
 #endif
diff --git a/gcc/selftest.h b/gcc/selftest.h
index 967e76b..02e4694 100644
--- a/gcc/selftest.h
+++ b/gcc/selftest.h
@@ -102,13 +102,19 @@ extern int num_passes;
    ::selftest::fail if it false.  */
 
 #define ASSERT_TRUE(EXPR)				\
+  ASSERT_TRUE_AT (SELFTEST_LOCATION, (EXPR))
+
+/* Like ASSERT_TRUE, but treat LOC as the effective location of the
+   selftest.  */
+
+#define ASSERT_TRUE_AT(LOC, EXPR)			\
   SELFTEST_BEGIN_STMT					\
   const char *desc = "ASSERT_TRUE (" #EXPR ")";		\
   bool actual = (EXPR);					\
   if (actual)						\
-    ::selftest::pass (SELFTEST_LOCATION, desc);	\
+    ::selftest::pass ((LOC), desc);			\
   else							\
-    ::selftest::fail (SELFTEST_LOCATION, desc);		\
+    ::selftest::fail ((LOC), desc);			\
   SELFTEST_END_STMT
 
 /* Evaluate EXPR and coerce to bool, calling
@@ -116,13 +122,19 @@ extern int num_passes;
    ::selftest::fail if it true.  */
 
 #define ASSERT_FALSE(EXPR)					\
+  ASSERT_FALSE_AT (SELFTEST_LOCATION, (EXPR))
+
+/* Like ASSERT_FALSE, but treat LOC as the effective location of the
+   selftest.  */
+
+#define ASSERT_FALSE_AT(LOC, EXPR)				\
   SELFTEST_BEGIN_STMT						\
-  const char *desc = "ASSERT_FALSE (" #EXPR ")";		\
-  bool actual = (EXPR);					\
-  if (actual)							\
-    ::selftest::fail (SELFTEST_LOCATION, desc);				\
-  else								\
-    ::selftest::pass (SELFTEST_LOCATION, desc);				\
+  const char *desc = "ASSERT_FALSE (" #EXPR ")";			\
+  bool actual = (EXPR);							\
+  if (actual)								\
+    ::selftest::fail ((LOC), desc);			\
+  else									\
+    ::selftest::pass ((LOC), desc);					\
   SELFTEST_END_STMT
 
 /* Evaluate EXPECTED and ACTUAL and compare them with ==, calling
@@ -167,7 +179,7 @@ extern int num_passes;
 			    (EXPECTED), (ACTUAL));		    \
   SELFTEST_END_STMT
 
-/* Like ASSERT_STREQ_AT, but treat LOC as the effective location of the
+/* Like ASSERT_STREQ, but treat LOC as the effective location of the
    selftest.  */
 
 #define ASSERT_STREQ_AT(LOC, EXPECTED, ACTUAL)			    \
diff --git a/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c b/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c
new file mode 100644
index 0000000..73c5b2b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c
@@ -0,0 +1,172 @@
+/* { dg-do compile } */
+/* { dg-options "-O -fdiagnostics-show-caret" } */
+
+/* This is a collection of unittests for ranges within string literals,
+   using diagnostic_plugin_test_string_literals, which handles
+   "__emit_string_literal_range" by generating a warning at the given
+   subset of a string literal.
+
+   The indices are 0-based.  It's easiest to verify things using string
+   literals that are runs of 0-based digits (to avoid having to count
+   characters).  */
+
+extern void __emit_string_literal_range (const char *literal,
+					 int start_idx, int end_idx);
+
+void
+test_simple_string_literal (void)
+{
+  __emit_string_literal_range ("0123456789", /* { dg-warning "range" } */
+			       6, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("0123456789",
+                                       ^~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_concatenated_string_literal (void)
+{
+  __emit_string_literal_range ("01234" "56789", /* { dg-warning "range" } */
+			       3, 6);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234" "56789",
+                                    ^~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_multiline_string_literal (void)
+{
+  __emit_string_literal_range ("01234" /* { dg-warning "range" } */
+                               "56789",
+                               3, 6);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234"
+                                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+                                "56789",
+                                ~~~  
+   { dg-end-multiline-output "" } */
+  /* FIXME: why does the above need two trailing spaces?  */
+}
+
+/* Tests of various unicode encodings.
+
+   Digits 0 through 9 are unicode code points:
+      U+0030 DIGIT ZERO
+      ...
+      U+0039 DIGIT NINE
+   However, these are not always valid as UCN (see the comment in
+   libcpp/charset.c:_cpp_valid_ucn).
+
+   Hence we need to test UCN using an alternative unicode
+   representation of numbers; let's use Roman numerals,
+   (though these start at one, not zero):
+      U+2170 SMALL ROMAN NUMERAL ONE
+      ...
+      U+2174 SMALL ROMAN NUMERAL FIVE  ("v")
+      U+2175 SMALL ROMAN NUMERAL SIX   ("vi")
+      ...
+      U+2178 SMALL ROMAN NUMERAL NINE.  */
+
+void
+test_hex (void)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as "\x35"
+     and with a space in place of digit 6, to terminate the escaped
+     hex code.  */
+  __emit_string_literal_range ("01234\x35 789", /* { dg-warning "range" } */
+			       3, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234\x35 789"
+                                    ^~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_oct (void)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as "\065"
+     and with a space in place of digit 6, to terminate the escaped
+     octal code.  */
+  __emit_string_literal_range ("01234\065 789", /* { dg-warning "range" } */
+			       3, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234\065 789"
+                                    ^~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_multiple (void)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as hex "\x35" and
+     digit 6 in ASCII as octal "\066", concatenating multiple strings.  */
+  __emit_string_literal_range ("01234"  "\x35"  "\066"  "789", /* { dg-warning "range" } */
+			       3, 8);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234"  "\x35"  "\066"  "789",
+                                    ^~~~~~~~~~~~~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_ucn4 (void)
+{
+  /* Digits 0-9, expressing digits 5 and 6 as Roman numerals expressed
+     as UCN 4.
+     The resulting string is encoded as UTF-8.  Most of the digits are 1 byte
+     each, but digits 5 and 6 are encoded with 3 bytes each.
+     Hence to underline digits 4-7 we need to underline bytes 4-11 in
+     the UTF-8 encoding.  */
+  __emit_string_literal_range ("01234\u2174\u2175789", /* { dg-warning "range" } */
+			       4, 11);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234\u2174\u2175789",
+                                     ^~~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_ucn8 (void)
+{
+  /* Digits 0-9, expressing digits 5 and 6 as Roman numerals as UCN 8.
+     The resulting string is the same as in test_ucn4 above, and hence
+     has the same UTF-8 encoding, and so we again need to underline bytes
+     4-11 in the UTF-8 encoding in order to underline digits 4-7.  */
+  __emit_string_literal_range ("01234\U00002174\U00002175789", /* { dg-warning "range" } */
+			       4, 11);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234\U00002174\U00002175789",
+                                     ^~~~~~~~~~~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_u8 (void)
+{
+  /* Digits 0-9.  */
+  __emit_string_literal_range (u8"0123456789", /* { dg-warning "range" } */
+			       4, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range (u8"0123456789",
+                                       ^~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_macro (void)
+{
+#define START "01234"  /* { dg-warning "range" } */
+  __emit_string_literal_range (START
+                               "56789",
+                               3, 6);
+/* { dg-begin-multiline-output "" }
+ #define START "01234"
+                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+   __emit_string_literal_range (START
+   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+                                "56789",
+                                ~~~
+   { dg-end-multiline-output "" } */
+}
diff --git a/gcc/testsuite/gcc.dg/plugin/diagnostic_plugin_test_string_literals.c b/gcc/testsuite/gcc.dg/plugin/diagnostic_plugin_test_string_literals.c
new file mode 100644
index 0000000..d92c2b5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/plugin/diagnostic_plugin_test_string_literals.c
@@ -0,0 +1,210 @@
+/* This plugin uses the diagnostics code to verify tracking of source code
+   locations within string literals.  */
+/* { dg-options "-O" } */
+
+#include "gcc-plugin.h"
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "tree.h"
+#include "stringpool.h"
+#include "toplev.h"
+#include "basic-block.h"
+#include "hash-table.h"
+#include "vec.h"
+#include "ggc.h"
+#include "basic-block.h"
+#include "tree-ssa-alias.h"
+#include "internal-fn.h"
+#include "gimple-fold.h"
+#include "tree-eh.h"
+#include "gimple-expr.h"
+#include "is-a.h"
+#include "gimple.h"
+#include "gimple-iterator.h"
+#include "tree.h"
+#include "tree-pass.h"
+#include "intl.h"
+#include "plugin-version.h"
+#include "c-family/c-common.h"
+#include "diagnostic.h"
+#include "context.h"
+#include "print-tree.h"
+#include "cpplib.h"
+#include "c-family/c-pragma.h"
+
+int plugin_is_GPL_compatible;
+
+/* A custom pass for printing string literal location information.  */
+
+const pass_data pass_data_test_string_literals =
+{
+  GIMPLE_PASS, /* type */
+  "test_string_literals", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  PROP_ssa, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_test_string_literals : public gimple_opt_pass
+{
+public:
+  pass_test_string_literals(gcc::context *ctxt)
+    : gimple_opt_pass(pass_data_test_string_literals, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  bool gate (function *) { return true; }
+  virtual unsigned int execute (function *);
+
+}; // class pass_test_string_literals
+
+/* Determine if STMT is a call with NUM_ARGS arguments to a function
+   named FUNCNAME.
+   If so, return STMT as a gcall *.  Otherwise return NULL.  */
+
+static gcall *
+check_for_named_call (gimple *stmt,
+		      const char *funcname, unsigned int num_args)
+{
+  gcc_assert (funcname);
+
+  gcall *call = dyn_cast <gcall *> (stmt);
+  if (!call)
+    return NULL;
+
+  tree fndecl = gimple_call_fndecl (call);
+  if (!fndecl)
+    return NULL;
+
+  if (strcmp (IDENTIFIER_POINTER (DECL_NAME (fndecl)), funcname))
+    return NULL;
+
+  if (gimple_call_num_args (call) != num_args)
+    {
+      error_at (stmt->location, "expected number of args: %i (got %i)",
+		num_args, gimple_call_num_args (call));
+      return NULL;
+    }
+
+  return call;
+}
+
+/* Emit a warning covering SRC_RANGE, with the caret at the start of
+   SRC_RANGE.  */
+
+static void
+emit_warning (source_range src_range)
+{
+  location_t loc
+    = make_location (src_range.m_start, src_range.m_start, src_range.m_finish);
+  warning_at (loc, 0, "range %i:%i-%i:%i",
+	      LOCATION_LINE (src_range.m_start),
+	      LOCATION_COLUMN (src_range.m_start),
+	      LOCATION_LINE (src_range.m_finish),
+	      LOCATION_COLUMN (src_range.m_finish));
+}
+
+/* Support code for verifying that we are correctly tracking ranges
+   within string literals, for use by diagnostic-test-string-literals-*.c.
+   Emit a warning showing the range of a string literal, for each call to
+   a function named "__emit_string_literal_range".
+   The initial argument should be a string literal; arguments 2 and 3
+   should be integer constants, giving the range within the string
+   to be printed.  */
+
+static void
+test_string_literals (gimple *stmt)
+{
+  gcall *call = check_for_named_call (stmt, "__emit_string_literal_range", 3);
+  if (!call)
+    return;
+
+  /* We expect an ADDR_EXPR with a STRING_CST inside it for the
+     initial arg.  */
+  tree t_addr_string = gimple_call_arg (call, 0);
+  if (TREE_CODE (t_addr_string) != ADDR_EXPR)
+    {
+      error_at (call->location, "string literal required for arg 1");
+      return;
+    }
+
+  tree t_string = TREE_OPERAND (t_addr_string, 0);
+  if (TREE_CODE (t_string) != STRING_CST)
+    {
+      error_at (call->location, "string literal required for arg 1");
+      return;
+    }
+
+  tree t_start_idx = gimple_call_arg (call, 1);
+  if (TREE_CODE (t_start_idx) != INTEGER_CST)
+    {
+      error_at (call->location, "integer constant required for arg 2");
+      return;
+    }
+  int start_idx = TREE_INT_CST_LOW (t_start_idx);
+
+  tree t_end_idx = gimple_call_arg (call, 2);
+  if (TREE_CODE (t_end_idx) != INTEGER_CST)
+    {
+      error_at (call->location, "integer constant required for arg 3");
+      return;
+    }
+  int end_idx = TREE_INT_CST_LOW (t_end_idx);
+
+  /* A STRING_CST doesn't have a location, but the ADDR_EXPR does.  */
+  location_t strloc = EXPR_LOCATION (t_addr_string);
+  source_range src_range;
+  if (get_source_range_for_substring (parse_in, g_string_concat_db, strloc,
+				      start_idx, end_idx, &src_range))
+    emit_warning (src_range);
+  else
+    error_at (strloc, "unable to read substring range");
+}
+
+/* Call test_string_literals on every statement within FUN.  */
+
+unsigned int
+pass_test_string_literals::execute (function *fun)
+{
+  gimple_stmt_iterator gsi;
+  basic_block bb;
+
+  FOR_EACH_BB_FN (bb, fun)
+    for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+      {
+	gimple *stmt = gsi_stmt (gsi);
+	test_string_literals (stmt);
+      }
+
+  return 0;
+}
+
+/* Entrypoint for the plugin.  Create and register the custom pass.  */
+
+int
+plugin_init (struct plugin_name_args *plugin_info,
+	     struct plugin_gcc_version *version)
+{
+  struct register_pass_info pass_info;
+  const char *plugin_name = plugin_info->base_name;
+  int argc = plugin_info->argc;
+  struct plugin_argument *argv = plugin_info->argv;
+
+  if (!plugin_default_version_check (version, &gcc_version))
+    return 1;
+
+  pass_info.pass = new pass_test_string_literals (g);
+  pass_info.reference_pass_name = "ssa";
+  pass_info.ref_pass_instance_number = 1;
+  pass_info.pos_op = PASS_POS_INSERT_AFTER;
+  register_callback (plugin_name, PLUGIN_PASS_MANAGER_SETUP, NULL,
+		     &pass_info);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/plugin/plugin.exp b/gcc/testsuite/gcc.dg/plugin/plugin.exp
index f039c8d..3c2383a 100644
--- a/gcc/testsuite/gcc.dg/plugin/plugin.exp
+++ b/gcc/testsuite/gcc.dg/plugin/plugin.exp
@@ -71,6 +71,8 @@ set plugin_test_list [list \
 	  diagnostic-test-expressions-1.c } \
     { diagnostic_plugin_show_trees.c \
 	  diagnostic-test-show-trees-1.c } \
+    { diagnostic_plugin_test_string_literals.c \
+	  diagnostic-test-string-literals-1.c } \
     { location_overflow_plugin.c \
 	  location-overflow-test-1.c \
 	  location-overflow-test-2.c } \
diff --git a/libcpp/charset.c b/libcpp/charset.c
index 2d07942..a4bddc8 100644
--- a/libcpp/charset.c
+++ b/libcpp/charset.c
@@ -812,6 +812,51 @@ cpp_host_to_exec_charset (cpp_reader *pfile, cppchar_t c)
 
 \f
 
+/* cpp_substring_ranges's constructor. */
+
+cpp_substring_ranges::cpp_substring_ranges () :
+  m_ranges (NULL),
+  m_num_ranges (0),
+  m_alloc_ranges (8)
+{
+  m_ranges = XNEWVEC (source_range, m_alloc_ranges);
+}
+
+/* cpp_substring_ranges's destructor. */
+
+cpp_substring_ranges::~cpp_substring_ranges ()
+{
+  free (m_ranges);
+}
+
+/* Add RANGE to the vector of source_range information.  */
+
+void
+cpp_substring_ranges::add_range (source_range range)
+{
+  if (m_num_ranges >= m_alloc_ranges)
+    {
+      m_alloc_ranges *= 2;
+      m_ranges
+	= (source_range *)xrealloc (m_ranges,
+				    sizeof (source_range) * m_alloc_ranges);
+    }
+  m_ranges[m_num_ranges++] = range;
+}
+
+/* Read NUM ranges from LOC_READER, adding them to the vector of source_range
+   information.  */
+
+void
+cpp_substring_ranges::add_n_ranges (int num,
+				    cpp_string_location_reader &loc_reader)
+{
+  for (int i = 0; i < num; i++)
+    add_range (loc_reader.get_next ());
+}
+
+\f
+
 /* Utility routine that computes a mask of the form 0000...111... with
    WIDTH 1-bits.  */
 static inline size_t
@@ -980,18 +1025,27 @@ ucn_valid_in_identifier (cpp_reader *pfile, cppchar_t c,
    one beyond the UCN, or to the syntactically invalid character.
 
    IDENTIFIER_POS is 0 when not in an identifier, 1 for the start of
-   an identifier, or 2 otherwise.  */
+   an identifier, or 2 otherwise.
+
+   If CHAR_RANGE and LOC_READER are non-NULL, then position information is
+   read from *LOC_READER and CHAR_RANGE->m_finish is updated accordingly.  */
 
 bool
 _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
 		const uchar *limit, int identifier_pos,
-		struct normalize_state *nst, cppchar_t *cp)
+		struct normalize_state *nst, cppchar_t *cp,
+		source_range *char_range,
+		cpp_string_location_reader *loc_reader)
 {
   cppchar_t result, c;
   unsigned int length;
   const uchar *str = *pstr;
   const uchar *base = str - 2;
 
+  /* char_range and loc_reader must either be both NULL, or both be
+     non-NULL.  */
+  gcc_assert ((char_range != NULL) == (loc_reader != NULL));
+
   if (!CPP_OPTION (pfile, cplusplus) && !CPP_OPTION (pfile, c99))
     cpp_error (pfile, CPP_DL_WARNING,
 	       "universal character names are only valid in C++ and C99");
@@ -1021,6 +1075,8 @@ _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
       if (!ISXDIGIT (c))
 	break;
       str++;
+      if (loc_reader)
+	char_range->m_finish = loc_reader->get_next ().m_finish;
       result = (result << 4) + hex_value (c);
     }
   while (--length && str < limit);
@@ -1086,11 +1142,18 @@ _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
 }
 
 /* Convert an UCN, pointed to by FROM, to UTF-8 encoding, then translate
-   it to the execution character set and write the result into TBUF.
-   An advanced pointer is returned.  Issues all relevant diagnostics.  */
+   it to the execution character set and write the result into TBUF,
+   if TBUF is non-NULL.
+   An advanced pointer is returned.  Issues all relevant diagnostics.
+   If LOC_READER is non-NULL, then RANGES must be non-NULL and CHAR_RANGE
+   contains the location of the character so far: location information
+   is read from *LOC_READER, and *RANGES is updated accordingly.  */
 static const uchar *
 convert_ucn (cpp_reader *pfile, const uchar *from, const uchar *limit,
-	     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
+	     struct _cpp_strbuf *tbuf, struct cset_converter cvt,
+	     source_range char_range,
+	     cpp_string_location_reader *loc_reader,
+	     cpp_substring_ranges *ranges)
 {
   cppchar_t ucn;
   uchar buf[6];
@@ -1099,8 +1162,17 @@ convert_ucn (cpp_reader *pfile, const uchar *from, const uchar *limit,
   int rval;
   struct normalize_state nst = INITIAL_NORMALIZE_STATE;
 
+  /* loc_reader and ranges must either be both NULL, or both be non-NULL.  */
+  gcc_assert ((loc_reader != NULL) == (ranges != NULL));
+
   from++;  /* Skip u/U.  */
-  _cpp_valid_ucn (pfile, &from, limit, 0, &nst, &ucn);
+
+  if (loc_reader)
+    /* The u/U is part of the spelling of this character.  */
+    char_range.m_finish = loc_reader->get_next ().m_finish;
+
+  _cpp_valid_ucn (pfile, &from, limit, 0, &nst,
+		  &ucn, &char_range, loc_reader);
 
   rval = one_cppchar_to_utf8 (ucn, &bufp, &bytesleft);
   if (rval)
@@ -1109,9 +1181,20 @@ convert_ucn (cpp_reader *pfile, const uchar *from, const uchar *limit,
       cpp_errno (pfile, CPP_DL_ERROR,
 		 "converting UCN to source character set");
     }
-  else if (!APPLY_CONVERSION (cvt, buf, 6 - bytesleft, tbuf))
-    cpp_errno (pfile, CPP_DL_ERROR,
-	       "converting UCN to execution character set");
+  else
+    {
+      if (tbuf)
+	if (!APPLY_CONVERSION (cvt, buf, 6 - bytesleft, tbuf))
+	  cpp_errno (pfile, CPP_DL_ERROR,
+		     "converting UCN to execution character set");
+
+      if (loc_reader)
+	{
+	  int num_encoded_bytes = 6 - bytesleft;
+	  for (int i = 0; i < num_encoded_bytes; i++)
+	    ranges->add_range (char_range);
+	}
+    }
 
   return from;
 }
@@ -1167,31 +1250,48 @@ emit_numeric_escape (cpp_reader *pfile, cppchar_t n,
 }
 
 /* Convert a hexadecimal escape, pointed to by FROM, to the execution
-   character set and write it into the string buffer TBUF.  Returns an
-   advanced pointer, and issues diagnostics as necessary.
+   character set and write it into the string buffer TBUF (if non-NULL).
+   Returns an advanced pointer, and issues diagnostics as necessary.
    No character set translation occurs; this routine always produces the
    execution-set character with numeric value equal to the given hex
-   number.  You can, e.g. generate surrogate pairs this way.  */
+   number.  You can, e.g. generate surrogate pairs this way.
+   If LOC_READER is non-NULL, then RANGES must be non-NULL and CHAR_RANGE
+   contains the location of the character so far: location information
+   is read from *LOC_READER, and *RANGES is updated accordingly.  */
 static const uchar *
 convert_hex (cpp_reader *pfile, const uchar *from, const uchar *limit,
-	     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
+	     struct _cpp_strbuf *tbuf, struct cset_converter cvt,
+	     source_range char_range,
+	     cpp_string_location_reader *loc_reader,
+	     cpp_substring_ranges *ranges)
 {
   cppchar_t c, n = 0, overflow = 0;
   int digits_found = 0;
   size_t width = cvt.width;
   size_t mask = width_to_mask (width);
 
+  /* loc_reader and ranges must either be both NULL, or both be non-NULL.  */
+  gcc_assert ((loc_reader != NULL) == (ranges != NULL));
+
   if (CPP_WTRADITIONAL (pfile))
     cpp_warning (pfile, CPP_W_TRADITIONAL,
 	         "the meaning of '\\x' is different in traditional C");
 
-  from++;  /* Skip 'x'.  */
+  /* Skip 'x'.  */
+  from++;
+
+  /* The 'x' is part of the spelling of this character.  */
+  if (loc_reader)
+    char_range.m_finish = loc_reader->get_next ().m_finish;
+
   while (from < limit)
     {
       c = *from;
       if (! hex_p (c))
 	break;
       from++;
+      if (loc_reader)
+	char_range.m_finish = loc_reader->get_next ().m_finish;
       overflow |= n ^ (n << 4 >> 4);
       n = (n << 4) + hex_value (c);
       digits_found = 1;
@@ -1211,7 +1311,10 @@ convert_hex (cpp_reader *pfile, const uchar *from, const uchar *limit,
       n &= mask;
     }
 
-  emit_numeric_escape (pfile, n, tbuf, cvt);
+  if (tbuf)
+    emit_numeric_escape (pfile, n, tbuf, cvt);
+  if (ranges)
+    ranges->add_range (char_range);
 
   return from;
 }
@@ -1221,10 +1324,16 @@ convert_hex (cpp_reader *pfile, const uchar *from, const uchar *limit,
    advanced pointer, and issues diagnostics as necessary.
    No character set translation occurs; this routine always produces the
    execution-set character with numeric value equal to the given octal
-   number.  */
+   number.
+   If LOC_READER is non-NULL, then RANGES must be non-NULL and CHAR_RANGE
+   contains the location of the character so far: location information
+   is read from *LOC_READER, and *RANGES is updated accordingly.  */
 static const uchar *
 convert_oct (cpp_reader *pfile, const uchar *from, const uchar *limit,
-	     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
+	     struct _cpp_strbuf *tbuf, struct cset_converter cvt,
+	     source_range char_range,
+	     cpp_string_location_reader *loc_reader,
+	     cpp_substring_ranges *ranges)
 {
   size_t count = 0;
   cppchar_t c, n = 0;
@@ -1232,12 +1341,17 @@ convert_oct (cpp_reader *pfile, const uchar *from, const uchar *limit,
   size_t mask = width_to_mask (width);
   bool overflow = false;
 
+  /* loc_reader and ranges must either be both NULL, or both be non-NULL.  */
+  gcc_assert ((loc_reader != NULL) == (ranges != NULL));
+
   while (from < limit && count++ < 3)
     {
       c = *from;
       if (c < '0' || c > '7')
 	break;
       from++;
+      if (loc_reader)
+	char_range.m_finish = loc_reader->get_next ().m_finish;
       overflow |= n ^ (n << 3 >> 3);
       n = (n << 3) + c - '0';
     }
@@ -1249,18 +1363,26 @@ convert_oct (cpp_reader *pfile, const uchar *from, const uchar *limit,
       n &= mask;
     }
 
-  emit_numeric_escape (pfile, n, tbuf, cvt);
+  if (tbuf)
+    emit_numeric_escape (pfile, n, tbuf, cvt);
+  if (ranges)
+    ranges->add_range (char_range);
 
   return from;
 }
 
 /* Convert an escape sequence (pointed to by FROM) to its value on
    the target, and to the execution character set.  Do not scan past
-   LIMIT.  Write the converted value into TBUF.  Returns an advanced
-   pointer.  Handles all relevant diagnostics.  */
+   LIMIT.  Write the converted value into TBUF, if TBUF is non-NULL.
+   Returns an advanced pointer.  Handles all relevant diagnostics.
+   If LOC_READER is non-NULL, then RANGES must be non-NULL: location
+   information is read from *LOC_READER, and *RANGES is updated
+   accordingly.  */
 static const uchar *
 convert_escape (cpp_reader *pfile, const uchar *from, const uchar *limit,
-		struct _cpp_strbuf *tbuf, struct cset_converter cvt)
+		struct _cpp_strbuf *tbuf, struct cset_converter cvt,
+		cpp_string_location_reader *loc_reader,
+		cpp_substring_ranges *ranges)
 {
   /* Values of \a \b \e \f \n \r \t \v respectively.  */
 #if HOST_CHARSET == HOST_CHARSET_ASCII
@@ -1273,20 +1395,28 @@ convert_escape (cpp_reader *pfile, const uchar *from, const uchar *limit,
 
   uchar c;
 
+  /* Record the location of the backslash.  */
+  source_range char_range;
+  if (loc_reader)
+    char_range = loc_reader->get_next ();
+
   c = *from;
   switch (c)
     {
       /* UCNs, hex escapes, and octal escapes are processed separately.  */
     case 'u': case 'U':
-      return convert_ucn (pfile, from, limit, tbuf, cvt);
+      return convert_ucn (pfile, from, limit, tbuf, cvt,
+			  char_range, loc_reader, ranges);
 
     case 'x':
-      return convert_hex (pfile, from, limit, tbuf, cvt);
+      return convert_hex (pfile, from, limit, tbuf, cvt,
+			  char_range, loc_reader, ranges);
       break;
 
     case '0':  case '1':  case '2':  case '3':
     case '4':  case '5':  case '6':  case '7':
-      return convert_oct (pfile, from, limit, tbuf, cvt);
+      return convert_oct (pfile, from, limit, tbuf, cvt,
+			  char_range, loc_reader, ranges);
 
       /* Various letter escapes.  Get the appropriate host-charset
 	 value into C.  */
@@ -1338,10 +1468,11 @@ convert_escape (cpp_reader *pfile, const uchar *from, const uchar *limit,
 	}
     }
 
-  /* Now convert what we have to the execution character set.  */
-  if (!APPLY_CONVERSION (cvt, &c, 1, tbuf))
-    cpp_errno (pfile, CPP_DL_ERROR,
-	       "converting escape sequence to execution character set");
+  if (tbuf)
+    /* Now convert what we have to the execution character set.  */
+    if (!APPLY_CONVERSION (cvt, &c, 1, tbuf))
+      cpp_errno (pfile, CPP_DL_ERROR,
+		 "converting escape sequence to execution character set");
 
   return from + 1;
 }
@@ -1374,28 +1505,50 @@ converter_for_type (cpp_reader *pfile, enum cpp_ttype type)
    are to be converted from the source to the execution character set,
    escape sequences translated, and finally all are to be
    concatenated.  WIDE indicates whether or not to produce a wide
-   string.  The result is written into TO.  Returns true for success,
-   false for failure.  */
-bool
-cpp_interpret_string (cpp_reader *pfile, const cpp_string *from, size_t count,
-		      cpp_string *to,  enum cpp_ttype type)
+   string.  If TO is non-NULL, the result is written into TO.
+   If LOC_READERS and OUT are non-NULL, then location information
+   is read from LOC_READERS (which must be an array of length COUNT),
+   and location information is written to *OUT.
+   Returns true for success, false for failure.  */
+static bool
+cpp_interpret_string_1 (cpp_reader *pfile, const cpp_string *from, size_t count,
+			cpp_string *to,  enum cpp_ttype type,
+			cpp_string_location_reader *loc_readers,
+			cpp_substring_ranges *out)
 {
   struct _cpp_strbuf tbuf;
   const uchar *p, *base, *limit;
   size_t i;
   struct cset_converter cvt = converter_for_type (pfile, type);
 
-  tbuf.asize = MAX (OUTBUF_BLOCK_SIZE, from->len);
-  tbuf.text = XNEWVEC (uchar, tbuf.asize);
-  tbuf.len = 0;
+  /* loc_readers and out must either be both NULL, or both be non-NULL.  */
+  gcc_assert ((loc_readers != NULL) == (out != NULL));
+
+  if (to)
+    {
+      tbuf.asize = MAX (OUTBUF_BLOCK_SIZE, from->len);
+      tbuf.text = XNEWVEC (uchar, tbuf.asize);
+      tbuf.len = 0;
+    }
 
   for (i = 0; i < count; i++)
     {
+      cpp_string_location_reader *loc_reader = NULL;
+      if (loc_readers)
+	loc_reader = &loc_readers[i];
+
       p = from[i].text;
       if (*p == 'u')
 	{
-	  if (*++p == '8')
-	    p++;
+	  p++;
+	  if (loc_reader)
+	    loc_reader->get_next ();
+	  if (*p == '8')
+	    {
+	      p++;
+	      if (loc_reader)
+		loc_reader->get_next ();
+	    }
 	}
       else if (*p == 'L' || *p == 'U') p++;
       if (*p == 'R')
@@ -1414,13 +1567,26 @@ cpp_interpret_string (cpp_reader *pfile, const cpp_string *from, size_t count,
 
 	  /* Raw strings are all normal characters; these can be fed
 	     directly to convert_cset.  */
-	  if (!APPLY_CONVERSION (cvt, p, limit - p, &tbuf))
-	    goto fail;
+	  if (to)
+	    if (!APPLY_CONVERSION (cvt, p, limit - p, &tbuf))
+	      goto fail;
+
+	  if (loc_reader)
+	    /* FIXME: If generating source ranges, assume we have a 1:1
+	       correspondence between bytes in the source encoding and bytes
+	       in the execution encoding (e.g. if we have a UTF-8 to UTF-8
+	       conversion), so that the run of bytes in the source file
+	       corresponds to a run of bytes in the execution string.  */
+	    out->add_n_ranges (limit - p, *loc_reader);
 
 	  continue;
 	}
 
-      p++; /* Skip leading quote.  */
+      /* Skip leading quote.  */
+      p++;
+      if (loc_reader)
+	loc_reader->get_next ();
+
       limit = from[i].text + from[i].len - 1; /* Skip trailing quote.  */
 
       for (;;)
@@ -1432,29 +1598,80 @@ cpp_interpret_string (cpp_reader *pfile, const cpp_string *from, size_t count,
 	    {
 	      /* We have a run of normal characters; these can be fed
 		 directly to convert_cset.  */
-	      if (!APPLY_CONVERSION (cvt, base, p - base, &tbuf))
-		goto fail;
+	      if (to)
+		if (!APPLY_CONVERSION (cvt, base, p - base, &tbuf))
+		  goto fail;
+	      /* FIXME: similar to above: this assumes we have a 1:1
+		 correspondence between bytes in the source encoding and bytes
+		 in the execution encoding.  */
+	      if (loc_reader)
+		out->add_n_ranges (p - base, *loc_reader);
 	    }
 	  if (p == limit)
 	    break;
 
-	  p = convert_escape (pfile, p + 1, limit, &tbuf, cvt);
+	  struct _cpp_strbuf *tbuf_ptr = to ? &tbuf : NULL;
+	  p = convert_escape (pfile, p + 1, limit, tbuf_ptr, cvt,
+			      loc_reader, out);
 	}
     }
-  /* NUL-terminate the 'to' buffer and translate it to a cpp_string
-     structure.  */
-  emit_numeric_escape (pfile, 0, &tbuf, cvt);
-  tbuf.text = XRESIZEVEC (uchar, tbuf.text, tbuf.len);
-  to->text = tbuf.text;
-  to->len = tbuf.len;
+
+  if (to)
+    {
+      /* NUL-terminate the 'to' buffer and translate it to a cpp_string
+	 structure.  */
+      emit_numeric_escape (pfile, 0, &tbuf, cvt);
+      tbuf.text = XRESIZEVEC (uchar, tbuf.text, tbuf.len);
+      to->text = tbuf.text;
+      to->len = tbuf.len;
+    }
+
   return true;
 
  fail:
   cpp_errno (pfile, CPP_DL_ERROR, "converting to execution character set");
-  free (tbuf.text);
+  if (to)
+    free (tbuf.text);
   return false;
 }
 
+/* FROM is an array of cpp_string structures of length COUNT.  These
+   are to be converted from the source to the execution character set,
+   escape sequences translated, and finally all are to be
+   concatenated.  WIDE indicates whether or not to produce a wide
+   string.  The result is written into TO.  Returns true for success,
+   false for failure.  */
+bool
+cpp_interpret_string (cpp_reader *pfile, const cpp_string *from, size_t count,
+		      cpp_string *to,  enum cpp_ttype type)
+{
+  return cpp_interpret_string_1 (pfile, from, count, to, type, NULL, NULL);
+}
+
+/* This function mimics the behavior of cpp_interpret_string, but
+   rather than generating a string in the execution character set,
+   *OUT is written to with the source code ranges of the characters
+   in such a string.
+   FROM and LOC_READERS should both be arrays of length COUNT.
+   Returns true for success, false for failure.  */
+
+bool
+cpp_interpret_string_ranges (cpp_reader *pfile, const cpp_string *from,
+			     cpp_string_location_reader *loc_readers,
+			     size_t count,
+			     cpp_substring_ranges *out)
+{
+  /* SOURCE_CHARSET "UTF-8" */
+#if HOST_CHARSET == HOST_CHARSET_ASCII
+
+  /* We assume UTF-8 to UTF-8 conversion.  */
+  return cpp_interpret_string_1 (pfile, from, count, NULL, CPP_STRING,
+				 loc_readers, out);
+#else
+  return false;
+#endif
+}
+
 /* Subroutine of do_line and do_linemarker.  Convert escape sequences
    in a string, but do not perform character set conversion.  */
 bool
@@ -1818,3 +2035,39 @@ _cpp_default_encoding (void)
 
   return current_encoding;
 }
+
+/* Implementation of class cpp_string_location_reader.  */
+
+/* Constructor for cpp_string_location_reader.  */
+
+cpp_string_location_reader::
+cpp_string_location_reader (source_location src_loc,
+			    line_maps *line_table)
+: m_line_table (line_table)
+{
+  src_loc = get_range_from_loc (line_table, src_loc).m_start;
+
+  /* SRC_LOC might be a macro location.  It only makes sense to do
+     column-by-column calculations on ordinary maps, so get the
+     corresponding location in an ordinary map.  */
+  m_loc
+    = linemap_resolve_location (line_table, src_loc,
+				LRK_SPELLING_LOCATION, NULL);
+
+  const line_map_ordinary *map
+    = linemap_check_ordinary (linemap_lookup (line_table, m_loc));
+  m_offset_per_column = (1 << map->m_range_bits);
+}
+
+/* Get the range of the next source byte.  */
+
+source_range
+cpp_string_location_reader::get_next ()
+{
+  source_range result;
+  result.m_start = m_loc;
+  result.m_finish = m_loc;
+  if (m_loc <= LINE_MAP_MAX_LOCATION_WITH_COLS)
+    m_loc += m_offset_per_column;
+  return result;
+}
diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h
index 543f3b9..b6dd39c 100644
--- a/libcpp/include/cpplib.h
+++ b/libcpp/include/cpplib.h
@@ -743,6 +743,51 @@ struct GTY(()) cpp_hashnode {
   union _cpp_hashnode_value GTY ((desc ("CPP_HASHNODE_VALUE_IDX (%1)"))) value;
 };
 
+/* A class for iterating through the source locations within a
+   string token (before escapes are interpreted, and before
+   concatenation).  */
+
+class cpp_string_location_reader {
+ public:
+  cpp_string_location_reader (source_location src_loc,
+			      line_maps *line_table);
+
+  source_range get_next ();
+
+ private:
+  source_location m_loc;
+  int m_offset_per_column;
+  line_maps *m_line_table;
+};
+
+/* A class for storing the source ranges of all of the characters within
+   a string literal, after escapes are interpreted, and after
+   concatenation.
+
+   This is not GTY-marked, as instances are intended to be temporary.  */
+
+class cpp_substring_ranges
+{
+ public:
+  cpp_substring_ranges ();
+  ~cpp_substring_ranges ();
+
+  int get_num_ranges () const { return m_num_ranges; }
+  source_range get_range (int idx) const
+  {
+    linemap_assert (idx < m_num_ranges);
+    return m_ranges[idx];
+  }
+
+  void add_range (source_range range);
+  void add_n_ranges (int num, cpp_string_location_reader &loc_reader);
+
+ private:
+  source_range *m_ranges;
+  int m_num_ranges;
+  int m_alloc_ranges;
+};
+
 /* Call this first to get a handle to pass to other functions.
 
    If you want cpplib to manage its own hashtable, pass in a NULL
@@ -829,6 +874,11 @@ extern cppchar_t cpp_interpret_charconst (cpp_reader *, const cpp_token *,
 extern bool cpp_interpret_string (cpp_reader *,
 				  const cpp_string *, size_t,
 				  cpp_string *, enum cpp_ttype);
+extern bool cpp_interpret_string_ranges (cpp_reader *pfile,
+					 const cpp_string *from,
+					 cpp_string_location_reader *readers,
+					 size_t count,
+					 cpp_substring_ranges *out);
 extern bool cpp_interpret_string_notranslate (cpp_reader *,
 					      const cpp_string *, size_t,
 					      cpp_string *, enum cpp_ttype);
diff --git a/libcpp/internal.h b/libcpp/internal.h
index ca2b498..4a5cd3c 100644
--- a/libcpp/internal.h
+++ b/libcpp/internal.h
@@ -754,7 +754,9 @@ struct normalize_state
 extern bool _cpp_valid_ucn (cpp_reader *, const unsigned char **,
 			    const unsigned char *, int,
 			    struct normalize_state *state,
-			    cppchar_t *);
+			    cppchar_t *,
+			    source_range *char_range,
+			    cpp_string_location_reader *loc_reader);
 extern void _cpp_destroy_iconv (cpp_reader *);
 extern unsigned char *_cpp_convert_input (cpp_reader *, const char *,
 					  unsigned char *, size_t, size_t,
diff --git a/libcpp/lex.c b/libcpp/lex.c
index 236418d..4e71965 100644
--- a/libcpp/lex.c
+++ b/libcpp/lex.c
@@ -1247,7 +1247,7 @@ forms_identifier_p (cpp_reader *pfile, int first,
       cppchar_t s;
       buffer->cur += 2;
       if (_cpp_valid_ucn (pfile, &buffer->cur, buffer->rlimit, 1 + !first,
-			  state, &s))
+			  state, &s, NULL, NULL))
 	return true;
       buffer->cur -= 2;
     }
-- 
1.8.5.3


* Re: [PATCH] RFC: On-demand locations within string-literals
  2016-07-08 21:22 [PATCH] RFC: On-demand locations within string-literals David Malcolm
@ 2016-07-20 19:38 ` David Malcolm
  2016-07-21 16:38   ` Jeff Law
  2016-07-23 21:36 ` [PATCH] RFC: " Martin Sebor
  1 sibling, 1 reply; 61+ messages in thread
From: David Malcolm @ 2016-07-20 19:38 UTC (permalink / raw)
  To: gcc-patches

On Fri, 2016-07-08 at 17:49 -0400, David Malcolm wrote:
[...]

> Also, this patch currently makes the assumption (in charset.c)
> that there's a 1:1 correspondence between bytes in the source
> character set and bytes in the execution character set.  This can
> be the case if both are, say, UTF-8, but might not hold in
> general.
> 
> The source char set is UTF-8 or UTF-EBCDIC, and safe-ctype.c has:
> 
> # if HOST_CHARSET == HOST_CHARSET_EBCDIC
>   #error "FIXME: write tables for EBCDIC"
> 
> so presumably we don't actually have any hosts that support EBCDIC
> (do we?); as far as I can tell, we only currently support UTF-8
> as the source char set.
> 
> Similarly, do we support any targets for which the execution
> character set is *not* UTF-8?

I brought this up in this thread on the gcc mailing list:
"gcc/libcpp: non-UTF-8 source or execution encodings?"
  https://gcc.gnu.org/ml/gcc/2016-07/msg00091.html
and in particular:
  https://gcc.gnu.org/ml/gcc/2016-07/msg00106.html
it's possible to select the execution char set at the command line
for C-family frontends using:
  -fexec-charset=
  -fwide-exec-charset=
e.g. "-fexec-charset=IBM1047" will give one of the variants of EBCDIC.

Given that the internal interface already has a failure mode, I'm
thinking that a reasonable restriction is to only support locations
within string literals for the case where source character set ==
execution character set, and hence we have "convert_no_conversion" as
the converter.  Does that sound sane?  (I can write test coverage for
this).
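
To make the proposed restriction concrete, here is a rough standalone sketch (not part of the patch). It is a toy model, not libcpp code: col_range, substring_ranges and the charset-name parameters are all hypothetical names, and columns stand in for location_t values. It refuses to produce ranges unless the source and execution charsets match, and otherwise maps each interpreted char back to the source columns that spelled it, assuming one source byte per column.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for source_range: inclusive column numbers.
struct col_range { int start; int finish; };

// Toy model of the proposed restriction.  LITERAL is the token text
// including both quotes; FIRST_COL is the column of the opening quote.
// Returns false (the "failure mode") if the charsets differ, or if the
// literal uses something this sketch does not model (prefixes, UCNs).
static bool
substring_ranges (const std::string &literal, int first_col,
                  const std::string &source_charset,
                  const std::string &exec_charset,
                  std::vector<col_range> *out)
{
  assert (out != nullptr);

  // Only the "no conversion" case keeps a 1:1 byte correspondence.
  if (source_charset != exec_charset)
    return false;
  if (literal.size () < 2 || literal.front () != '"'
      || literal.back () != '"')
    return false;

  const std::size_t limit = literal.size () - 1;  // closing quote
  std::size_t i = 1;                              // skip opening quote
  while (i < limit)
    {
      const int col = first_col + static_cast<int> (i);
      if (literal[i] != '\\')
        {
          // Ordinary char: one source byte, one execution byte.
          out->push_back ({col, col});
          i++;
          continue;
        }

      // Escape sequence: the range starts at the backslash.
      std::size_t j = i + 1;
      if (j >= limit)
        return false;                             // dangling backslash
      if (literal[j] == 'u' || literal[j] == 'U')
        return false;                             // UCNs not modelled here
      if (literal[j] == 'x')
        {
          j++;
          while (j < limit
                 && std::isxdigit (static_cast<unsigned char> (literal[j])))
            j++;
        }
      else if (literal[j] >= '0' && literal[j] <= '7')
        {
          // Octal escapes take at most three digits.
          for (int n = 0;
               j < limit && n < 3 && literal[j] >= '0' && literal[j] <= '7';
               n++)
            j++;
        }
      else
        j++;                                      // simple escape, e.g. \n

      out->push_back ({col, first_col + static_cast<int> (j) - 1});
      i = j;
    }
  return true;
}
```

With the opening quote at column 10, the literal "a\x41z" yields three ranges in this model: 11-11 for 'a', 12-15 for the hex escape, and 16-16 for 'z'. Asking for the same thing with, say, "UTF-8" as the source charset and "IBM1047" as the execution charset simply fails, so a caller would fall back to the location of the whole string.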

[...]


* Re: [PATCH] RFC: On-demand locations within string-literals
  2016-07-20 19:38 ` David Malcolm
@ 2016-07-21 16:38   ` Jeff Law
  2016-07-26 16:43     ` [PATCH 1/3] (v2) " David Malcolm
  0 siblings, 1 reply; 61+ messages in thread
From: Jeff Law @ 2016-07-21 16:38 UTC (permalink / raw)
  To: David Malcolm, gcc-patches

On 07/20/2016 01:38 PM, David Malcolm wrote:
> On Fri, 2016-07-08 at 17:49 -0400, David Malcolm wrote:
> [...]
>
>> Also, this patch currently makes the assumption (in charset.c)
>> that there's a 1:1 correspondence between bytes in the source
>> character set and bytes in the execution character set.  This can
>> be the case if both are, say, UTF-8, but might not hold in
>> general.
>>
>> The source char set is UTF-8 or UTF-EBCDIC, and safe-ctype.c has:
>>
>> # if HOST_CHARSET == HOST_CHARSET_EBCDIC
>>   #error "FIXME: write tables for EBCDIC"
>>
>> so presumably we don't actually have any hosts that support EBCDIC
>> (do we?); as far as I can tell, we only currently support UTF-8
>> as the source char set.
>>
>> Similarly, do we support any targets for which the execution
>> character set is *not* UTF-8?
>
> I brought this up in this thread on the gcc mailing list:
> "gcc/libcpp: non-UTF-8 source or execution encodings?"
>   https://gcc.gnu.org/ml/gcc/2016-07/msg00091.html
> and in particular:
>   https://gcc.gnu.org/ml/gcc/2016-07/msg00106.html
> it's possible to select the execution char set at the command line
> for C-family frontends using:
>   -fexec-charset=
>   -fwide-exec-charset=
> e.g. "-fexec-charset=IBM1047" will give one of the variants of EBCDIC.
>
> Given that the internal interface already has a failure mode, I'm
> thinking that a reasonable restriction is to only support locations
> within string literals for the case where source character set ==
> execution character set, and hence we have "convert_no_conversion" as
> the converter.  Does that sound sane?  (I can write test coverage for
> this).
I think this is sane.  We can always revisit later if we change our 
minds, particularly if folks want to do something crazy like self-host 
on an EBCDIC system.

jeff


* Re: [PATCH] RFC: On-demand locations within string-literals
  2016-07-08 21:22 [PATCH] RFC: On-demand locations within string-literals David Malcolm
  2016-07-20 19:38 ` David Malcolm
@ 2016-07-23 21:36 ` Martin Sebor
  2016-07-24  0:37   ` David Malcolm
  1 sibling, 1 reply; 61+ messages in thread
From: Martin Sebor @ 2016-07-23 21:36 UTC (permalink / raw)
  To: David Malcolm, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1732 bytes --]

On 07/08/2016 03:49 PM, David Malcolm wrote:
> This patch implements precise tracking of source locations for the
> individual chars within string literals, so that we can e.g. underline
> specific ranges in -Wformat diagnostics.
...
> Successfully bootstrapped&regrtested on x86_64-pc-linux-gnu.
>
> Thoughts?

I applied the patch to my -Wformat-length branch and managed to
get it to work in the sprintf-length pass in the middle end, but
I'm not sure I did it right (I suspect not, or at least not the
way it should be done).  I couldn't find other places where the
bits it depends on are used (I copied bits and pieces of the code
I used from the C front end until it all fell into place).  It
would be great if there was a tutorial on how to plug it in where
it isn't used yet (as in the middle end).  For example, how do
I determine what CLK_XXX constant I should pass to the
cpp_create_reader() function in the middle end?

I ran my tests with the patch and while it handled a good number
of test cases it eventually crashed.  A test case for the ICE is
attached though it needs my -Wformat-length patch to trigger it.
You may be able to tell what's wrong from the debugger context
(included in the test case).  If not, you may be able to
reproduce it by applying my latest patch and hacking the three
argument overload of location_from_offset() the way I did (in
the xyz.i comment).  Or I can send you my latest patch with all
this in place.

Beyond that, the range normally works fine, except when macros
are involved like they are in my tests.  You can see the effect
in the range.out file.  (This works without your patch but it
could very well be because I didn't set it up right.)

That's all I have for now.

Martin


[-- Attachment #2: xyz.i --]
[-- Type: text/plain, Size: 2482 bytes --]

# 1 "/src/gcc-49905/gcc/testsuite/gcc.dg/format/c99-sprintf-length-1.c"

char buffer [8];
extern char *ptr;

void f (__builtin_va_list va)
{
  __builtin_sprintf (buffer + sizeof buffer, "%c", 0);
}

/*
(gdb) r
Starting program: /build/gcc-49905/gcc/cc1 -fpreprocessed xyz.i -quiet -dumpbase xyz.i -mtune=generic -march=x86-64 -auxbase xyz -Wformat=1 -Wformat-length=1 -o xyz.s

Breakpoint 1, location_from_offset (loc=2147483652, offset=1, length=2)
    at /src/gcc-49905/gcc/gimple-ssa-sprintf.c:369
369	    = cpp_create_reader (CLK_GNUC11, ident_hash, line_table);
(gdb) l
364	static location_t
365	location_from_offset (location_t loc, int offset, int length)
366	{
367	#if 1
368	  static struct cpp_reader* parse_in
369	    = cpp_create_reader (CLK_GNUC11, ident_hash, line_table);
370	
371	  static string_concat_db *g_string_concat_db
372	    = new (ggc_alloc <string_concat_db> ()) string_concat_db ();
373	
(gdb) l
374	  source_range range;
375	  if (get_source_range_for_substring (parse_in, g_string_concat_db, loc,
376					      offset, offset + length, &range))
377	    return make_location (range.m_start, range.m_start, range.m_finish);
378	  return loc;
379	#else
(gdb) c
/src/gcc-49905/gcc/testsuite/gcc.dg/format/c99-sprintf-length-1.c: In function ’:
/src/gcc-49905/gcc/testsuite/gcc.dg/format/c99-sprintf-length-1.c:5:6: internal compiler error: in get_substring_ranges_for_loc, at input.c:1310
    and avoid exercising any of the others.  The buffer and objsize macros
      ^
0x1921136 get_substring_ranges_for_loc
	/src/gcc-49905/gcc/input.c:1310
0x19212cc get_source_range_for_substring(cpp_reader*, string_concat_db*, unsigned int, int, int, source_range*)
	/src/gcc-49905/gcc/input.c:1354
0x17eb1ae location_from_offset
	/src/gcc-49905/gcc/gimple-ssa-sprintf.c:375
0x17ed220 compute_format_length
	/src/gcc-49905/gcc/gimple-ssa-sprintf.c:1343
0x17edf4c pass_sprintf_length::compute_format_length(pass_sprintf_length::call_info const&, format_result*)
	/src/gcc-49905/gcc/gimple-ssa-sprintf.c:1880
0x17ee780 pass_sprintf_length::handle_gimple_call(gimple_stmt_iterator)
	/src/gcc-49905/gcc/gimple-ssa-sprintf.c:2122
0x17ee8cf pass_sprintf_length::execute(function*)
	/src/gcc-49905/gcc/gimple-ssa-sprintf.c:2153
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <http://gcc.gnu.org/bugs.html> for instructions.
[Inferior 1 (process 23446) exited with code 04]


 */

[-- Attachment #3: range.out --]
[-- Type: text/plain, Size: 1739 bytes --]

$ cat xyz.c && /build/gcc-49905/gcc/xgcc -B/build/gcc-49905/gcc -Wformat -Wformat-length=1 -S xyz.c
char d[2];

#define P(n, f) __builtin_vsprintf (d + sizeof d - n, f, va)

void f (__builtin_va_list va)
{
   __builtin_vsprintf (d + sizeof d - 2, "%#3x", va);
  P (2, "%#3x");
}
xyz.c: In function ‘f’:
xyz.c:7:42: warning: ‘%#3x’ directive writing between 3 and 10 bytes into a region of size 2 [-Wformat-length=]
    __builtin_vsprintf (d + sizeof d - 2, "%#3x", va);
                                          ^~~~~~
xyz.c:7:42: note: using the range [‘1u’, ‘2147483648u’] for directive argument
xyz.c:7:4: note: format output between 4 and 11 bytes into a destination of size 2
    __builtin_vsprintf (d + sizeof d - 2, "%#3x", va);
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
xyz.c:8:9: warning: ‘%#3x’ directive writing between 3 and 10 bytes into a region of size 2 [-Wformat-length=]
   P (2, "%#3x");
         ^
xyz.c:3:55: note: in definition of macro ‘P’
 #define P(n, f) __builtin_vsprintf (d + sizeof d - n, f, va)
                                                       ^
xyz.c:8:9: note: using the range [‘1u’, ‘2147483648u’] for directive argument
   P (2, "%#3x");
         ^
xyz.c:3:55: note: in definition of macro ‘P’
 #define P(n, f) __builtin_vsprintf (d + sizeof d - n, f, va)
                                                       ^
xyz.c:3:17: note: format output between 4 and 11 bytes into a destination of size 2
 #define P(n, f) __builtin_vsprintf (d + sizeof d - n, f, va)
                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
xyz.c:8:3: note: in expansion of macro ‘P’
   P (2, "%#3x");
   ^

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] RFC: On-demand locations within string-literals
  2016-07-23 21:36 ` [PATCH] RFC: " Martin Sebor
@ 2016-07-24  0:37   ` David Malcolm
  2016-08-23  3:25     ` Martin Sebor
  0 siblings, 1 reply; 61+ messages in thread
From: David Malcolm @ 2016-07-24  0:37 UTC (permalink / raw)
  To: Martin Sebor, gcc-patches

On Sat, 2016-07-23 at 15:35 -0600, Martin Sebor wrote:
> On 07/08/2016 03:49 PM, David Malcolm wrote:
> > This patch implements precise tracking of source locations for the
> > individual chars within string literals, so that we can e.g.
> > underline
> > specific ranges in -Wformat diagnostics.
> ...
> > Successfully bootstrapped&regrtested on x86_64-pc-linux-gnu.
> > 
> > Thoughts?
> 
> I applied the patch to my -Wformat-length branch and managed to
> get it to work in the sprintf-length pass in the middle end, but
> I'm not sure I did it right (I suspect not, or at least not the
> way it should be done).  I couldn't find other places where the
> bits it depends on are used (I copied bits and pieces of the code
> I used from the C front end until it all fell into place).  It
> would be great if there was a tutorial on how to plug it in where
> it isn't used yet (as in the middle end).  For example, how do
> I determine what CLK_XXX constant I should pass to the
> cpp_create_reader() function in the middle end?

Thanks for trying it out.  I've been reworking the patch, and I have a
much more robust version that I think is close to being ready, with an
easier interface, some bug fixes, and, ahem, EBCDIC support...  and
more usefully to the average user, graceful handling of other
encodings.

I'd hoped to post the new version on Monday (along with code that wires
it up into c-common/c-format.c).

Though reading your post, I realize now that my interface is in
c-common.h, and that's not going to be usable for you from the middle
end.  Presumably I'm going to need to rework things to be usable from a
gimple-*.c file.  Is it acceptable to somehow wire this up to a
langhook, so that locations for sprintf-style strings are only
available for c-family, with a graceful fallback on other frontends?

Alternatively, I can post the c-common.h interface for review, and we
can get that good enough for trunk, and then we can work on
generalizing it so it's usable from the middle-end for your new pass.

> I ran my tests with the patch and while it handled a good number
> of test cases it eventually crashed.  A test case for the ICE is
> attached though it needs my -Wformat-length patch to trigger it.
> You may be able to tell what's wrong from the debugger context
> (included in the test case).  If not, you may be able to
> reproduce it by applying my latest patch and hacking the three
> argument overload of location_from_offset() the way I did (in
> the xyz.i comment).  Or I can send you my latest patch with all
> this in place.

My patch has changed a lot, so I don't know how useful debugging is
going to be.  I'd be interested in seeing your latest patch.

> Beyond that, the range normally works fine, except when macros
> are involved like they are in my tests.  You can see the effect
> in the range.out file.  (This works without your patch but it
> could very well be because I didn't set it up right.)

Sadly I can't figure out what's going wrong - but the code's changed a
lot at my end since then.  Sorry.

> That's all I have for now.
> 
> Martin

Thanks for posting this, it's very helpful.
Dave

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH 1/3] (v2) On-demand locations within string-literals
  2016-07-21 16:38   ` Jeff Law
@ 2016-07-26 16:43     ` David Malcolm
  2016-07-26 16:43       ` [PATCH 3/3] c-format.c: suggest the correct format string to use (PR c/64955) David Malcolm
                         ` (3 more replies)
  0 siblings, 4 replies; 61+ messages in thread
From: David Malcolm @ 2016-07-26 16:43 UTC (permalink / raw)
  To: gcc-patches; +Cc: David Malcolm

This is an updated version of:
  "[PATCH] RFC: On-demand locations within string-literals"
    https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00441.html

Changes in v2:
- Tweaks to substring location selftests
- Many more selftests (EBCDIC, the various wide string types, etc)
- Clean up conditions in charset.c; require source == execution charset
  to have substring locations
- Make string_concat_db field private
- Return error messages rather than bool
- Fix source_range for charset.c:convert_escape
- Introduce class substring_loc
- Handle bad input locations more gracefully
- Ensure that we can read substring information for a token which
  starts in one linemap and ends in another (seen in
  gcc.dg/cpp/pr69985.c)

This patch implements precise tracking of source locations for the
individual chars within string literals, so that we can e.g. underline
specific ranges in -Wformat diagnostics.  It handles macros,
concatenated tokens, escaped characters etc.

The idea is to replace the limited implementation of this we currently
have in c-format.c (see r223470 [1]).  Doing so happens in patch 2 of
the kit; this patch just provides the infrastructure to do so.

As before the patch implements a new mode within libcpp's string literal
lexer.  It's disabled during the regular lexer, but it's available
through a low-level interface in input.{c|h} which can rerun the libcpp
code and capture the per-char source_ranges for when we need to issue a
diagnostic.  It also now adds a higher-level interface in c-common.h:
class substring_loc.
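
To make the "rerun the lexer on demand" idea concrete, here is a toy
sketch of recovering per-character column ranges by re-scanning the
source line.  It is not the libcpp code in this patch: the names
col_range and toy_substring_ranges are made up for illustration, and it
assumes a plain "" literal on a single line, with no prefix, UCNs,
macros or concatenation.

```cpp
// Illustrative only: a toy re-scanner, not libcpp's implementation.
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// 1-based, inclusive column range covering the source text of one
// character of the interpreted string (hypothetical type).
struct col_range { int start; int finish; };

// Re-scan the literal whose opening quote is at 1-based column START_COL
// of LINE, returning one range per character of the interpreted string.
// Handles single-letter escapes, \x hex escapes and octal escapes only.
std::vector<col_range>
toy_substring_ranges (const std::string &line, int start_col)
{
  assert (start_col >= 1 && (std::size_t) start_col <= line.size ());
  assert (line[start_col - 1] == '"');

  std::vector<col_range> ranges;
  std::size_t i = start_col;  // 0-based index just after the opening quote
  while (i < line.size () && line[i] != '"')
    {
      std::size_t begin = i;
      if (line[i] == '\\' && i + 1 < line.size ())
        {
          i++;  // skip the backslash
          if (line[i] == 'x')
            {
              i++;  // \x takes as many hex digits as follow
              while (i < line.size ()
                     && std::isxdigit ((unsigned char) line[i]))
                i++;
            }
          else if (line[i] >= '0' && line[i] <= '7')
            {
              // Octal escapes take at most three digits.
              int n = 0;
              while (n < 3 && i < line.size ()
                     && line[i] >= '0' && line[i] <= '7')
                { i++; n++; }
            }
          else
            i++;  // single-letter escape such as \n or \"
        }
      else
        i++;  // ordinary character
      // Source bytes [begin, i) map to one character; convert to columns.
      ranges.push_back ({(int) begin + 1, (int) i});
    }
  return ranges;
}
```

The point of the sketch is that nothing is stored at parse time: given
only the literal's start location and the source line, the range of,
say, the "%d" in a format string can be recomputed when a diagnostic
needs it.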

As before, to handle concatenation the patch adds some extra data
storage: every time a string concatenation happens in c-lex.c, it stores
the locations of the component tokens in a hash_map, keyed by the
spelling location of the start of the first token (see class
string_concat_db in input.h).

Hence it's only storing extra data for string concatenations,
not for simple string literals.
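
As a rough illustration of that bookkeeping, the sketch below models
the database with standard containers.  The names toy_location and
toy_concat_db are hypothetical stand-ins; the real class in input.h
uses a GC-allocated hash_map and canonicalizes the key location first
(see get_key_loc), which this sketch skips.

```cpp
// Illustrative only: a standalone model of the concatenation database.
#include <cassert>
#include <unordered_map>
#include <vector>

typedef unsigned int toy_location;  // hypothetical stand-in for location_t

class toy_concat_db
{
 public:
  // Record a concatenation of two or more string tokens, keyed by the
  // location of the first token.  A copy of LOCS is taken.
  void record (const std::vector<toy_location> &locs)
  {
    assert (locs.size () > 1);
    m_table[locs[0]] = locs;
  }

  // If LOC is the location of the first token of a recorded
  // concatenation, return the locations of all its tokens; otherwise
  // return a null pointer (a simple, unconcatenated literal).
  const std::vector<toy_location> *lookup (toy_location loc) const
  {
    auto it = m_table.find (loc);
    return it == m_table.end () ? nullptr : &it->second;
  }

 private:
  std::unordered_map<toy_location, std::vector<toy_location> > m_table;
};
```

A lookup that finds nothing is the common case, and simply means the
literal at that location should be re-lexed on its own.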

As before, this doesn't support the C++ frontend yet, but it doesn't
regress the status quo for c-format.c from C++.  I have a patch for
the C++ FE that records string concatenation information to the lexer,
but given that it's not used yet, I didn't add that in this patch, as
the data would be redundant.

This version of the patch properly handles encodings (and adds a
lot of test coverage for this to input.c).  It makes the simplifying
restriction that precise source location information is only available
if source charset == execution charset, as discussed on this list,
failing gracefully when this isn't the case.

I believe I can self-approve the changes to input.c, input.h, libcpp,
and the testsuite; the remaining changes needing approval are those
to c-family, to gcc.c, and to selftest.h.

Successfully bootstrapped&regrtested in conjunction with the rest of the
patch kit on x86_64-pc-linux-gnu.

Successful selftest run for stage 1 on powerpc-ibm-aix7.1.3.0 (gcc111)
in conjunction with the rest of the patch kit.

config-list.mk test run is in progress.

OK for trunk if it passes testing? (by itself)

[1]  https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=d5a2ddc76a109258297ff345957c35cb50116c94#patch2

gcc/c-family/ChangeLog:
	* c-common.c (get_cpp_ttype_from_string_type): New function.
	(g_string_concat_db): New global.
	(substring_loc::get_range): New method.
	* c-common.h (g_string_concat_db): New declaration.
	(class substring_loc): New class.
	* c-lex.c (lex_string): When concatenating strings, capture the
	locations of all tokens using a new obstack, and record the
	concatenation locations within g_string_concat_db.
	* c-opts.c (c_common_init_options): Construct g_string_concat_db
	on the ggc-heap.

gcc/ChangeLog:
	* gcc.c (cpp_options): Rename string to...
	(cpp_options_): ...this, to avoid clashing with struct in
	cpplib.h.
	(static_specs): Update initializer for above renaming.
	* input.c (string_concat::string_concat): New constructor.
	(string_concat_db::string_concat_db): New constructor.
	(string_concat_db::record_string_concatenation): New method.
	(string_concat_db::get_string_concatenation): New method.
	(string_concat_db::get_key_loc): New method.
	(class auto_cpp_string_vec): New class.
	(get_substring_ranges_for_loc): New function.
	(get_source_range_for_substring): New function.
	(get_num_source_ranges_for_substring): New function.
	(class selftest::lexer_test_options): New class.
	(struct selftest::lexer_test): New struct.
	(class selftest::ebcdic_execution_charset): New class.
	(selftest::ebcdic_execution_charset::s_singleton): New variable.
	(selftest::lexer_test::lexer_test): New constructor.
	(selftest::lexer_test::~lexer_test): New destructor.
	(selftest::lexer_test::get_token): New method.
	(selftest::assert_char_at_range): New function.
	(ASSERT_CHAR_AT_RANGE): New macro.
	(selftest::assert_num_substring_ranges): New function.
	(ASSERT_NUM_SUBSTRING_RANGES): New macro.
	(selftest::assert_has_no_substring_ranges): New function.
	(ASSERT_HAS_NO_SUBSTRING_RANGES): New macro.
	(selftest::test_lexer_string_locations_simple): New function.
	(selftest::test_lexer_string_locations_ebcdic): New function.
	(selftest::test_lexer_string_locations_hex): New function.
	(selftest::test_lexer_string_locations_oct): New function.
	(selftest::test_lexer_string_locations_letter_escape_1): New function.
	(selftest::test_lexer_string_locations_letter_escape_2): New function.
	(selftest::test_lexer_string_locations_ucn4): New function.
	(selftest::test_lexer_string_locations_ucn8): New function.
	(selftest::uint32_from_big_endian): New function.
	(selftest::test_lexer_string_locations_wide_string): New function.
	(selftest::uint16_from_big_endian): New function.
	(selftest::test_lexer_string_locations_string16): New function.
	(selftest::test_lexer_string_locations_string32): New function.
	(selftest::test_lexer_string_locations_u8): New function.
	(selftest::test_lexer_string_locations_utf8_source): New function.
	(selftest::test_lexer_string_locations_concatenation_1): New
	function.
	(selftest::test_lexer_string_locations_concatenation_2): New
	function.
	(selftest::test_lexer_string_locations_concatenation_3): New
	function.
	(selftest::test_lexer_string_locations_macro): New function.
	(selftest::test_lexer_string_locations_non_string): New function.
	(selftest::test_lexer_string_locations_long_line): New function.
	(selftest::input_c_tests): Call the new test functions once per
	case within the line_table test matrix.
	* input.h (struct string_concat): New struct.
	(struct location_hash): New struct.
	(class string_concat_db): New class.
	(get_source_range_for_substring): New prototype.
	* selftest.h (ASSERT_TRUE): Reimplement in terms of...
	(ASSERT_TRUE_AT): New macro.
	(ASSERT_FALSE): Reimplement in terms of...
	(ASSERT_FALSE_AT): New macro.
	(ASSERT_STREQ_AT): Fix typo in comment.

gcc/testsuite/ChangeLog:
	* gcc.dg/plugin/diagnostic-test-string-literals-1.c: New file.
	* gcc.dg/plugin/diagnostic_plugin_test_string_literals.c: New file.
	* gcc.dg/plugin/plugin.exp (plugin_test_list): Add
	diagnostic_plugin_test_string_literals.c and
	diagnostic-test-string-literals-1.c.

libcpp/ChangeLog:
	* charset.c (cpp_substring_ranges::cpp_substring_ranges): New
	constructor.
	(cpp_substring_ranges::~cpp_substring_ranges): New destructor.
	(cpp_substring_ranges::add_range): New method.
	(cpp_substring_ranges::add_n_ranges): New method.
	(_cpp_valid_ucn): Add "char_range" and "loc_reader" params; if
	they are non-NULL, read position information from *loc_reader
	and update char_range->m_finish accordingly.
	(convert_ucn): Add "char_range", "loc_reader", and "ranges"
	params.  If loc_reader is non-NULL, read location information from
	it, and update *ranges accordingly, using char_range.
	Conditionalize the conversion into tbuf on tbuf being non-NULL.
	(convert_hex): Likewise, conditionalizing the call to
	emit_numeric_escape on tbuf.
	(convert_oct): Likewise.
	(convert_escape): Add params "loc_reader" and "ranges".  If
	loc_reader is non-NULL, read location information from it, and
	update *ranges accordingly.  Conditionalize the conversion into
	tbuf on tbuf being non-NULL.
	(cpp_interpret_string): Rename to...
	(cpp_interpret_string_1): ...this, adding params "loc_readers" and
	"out".  Use "to" to conditionalize the initialization and usage of
	"tbuf", such as running the converter.  If "loc_readers" is
	non-NULL, use the instances within it, reading location
	information from them, and passing them to convert_escape; likewise
	write to "out" if loc_readers is non-NULL.  Update boundary check
	from "== limit" to ">= limit" to protect against erroneous location
	values to calls that are not parsing string literals.
	(cpp_interpret_string): Reimplement in terms of
	cpp_interpret_string_1.
	(cpp_interpret_string_ranges): New function.
	(cpp_string_location_reader::cpp_string_location_reader): New
	constructor.
	(cpp_string_location_reader::get_next): New method.
	* include/cpplib.h (class cpp_string_location_reader): New class.
	(class cpp_substring_ranges): New class.
	(cpp_interpret_string_ranges): New prototype.
	* internal.h (_cpp_valid_ucn): Add params "char_range" and
	"loc_reader".
	* lex.c (forms_identifier_p): Pass NULL for new params to
	_cpp_valid_ucn.
---
 gcc/c-family/c-common.c                            |   61 +
 gcc/c-family/c-common.h                            |   29 +
 gcc/c-family/c-lex.c                               |   24 +-
 gcc/c-family/c-opts.c                              |    3 +
 gcc/gcc.c                                          |    4 +-
 gcc/input.c                                        | 1465 ++++++++++++++++++++
 gcc/input.h                                        |   43 +
 gcc/selftest.h                                     |   30 +-
 .../plugin/diagnostic-test-string-literals-1.c     |  211 +++
 .../diagnostic_plugin_test_string_literals.c       |  212 +++
 gcc/testsuite/gcc.dg/plugin/plugin.exp             |    2 +
 libcpp/charset.c                                   |  387 +++++-
 libcpp/include/cpplib.h                            |   51 +
 libcpp/internal.h                                  |    4 +-
 libcpp/lex.c                                       |    2 +-
 15 files changed, 2461 insertions(+), 67 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c
 create mode 100644 gcc/testsuite/gcc.dg/plugin/diagnostic_plugin_test_string_literals.c

diff --git a/gcc/c-family/c-common.c b/gcc/c-family/c-common.c
index 936ddfb..f4ffc0e 100644
--- a/gcc/c-family/c-common.c
+++ b/gcc/c-family/c-common.c
@@ -1093,6 +1093,67 @@ fix_string_type (tree value)
   TREE_STATIC (value) = 1;
   return value;
 }
+
+/* Given a string of type STRING_TYPE, determine what kind of string
+   token created it: CPP_STRING, CPP_STRING16, CPP_STRING32, or
+   CPP_WSTRING.  Return CPP_OTHER in case of error.
+
+   This effectively reverses part of the logic in
+   lex_string and fix_string_type.  */
+
+static enum cpp_ttype
+get_cpp_ttype_from_string_type (tree string_type)
+{
+  gcc_assert (string_type);
+  if (TREE_CODE (string_type) != ARRAY_TYPE)
+    return CPP_OTHER;
+
+  tree element_type = TREE_TYPE (string_type);
+  if (TREE_CODE (element_type) != INTEGER_TYPE)
+    return CPP_OTHER;
+
+  int bits_per_character = TYPE_PRECISION (element_type);
+  switch (bits_per_character)
+    {
+    case 8:
+      return CPP_STRING;  /* It could have also been CPP_UTF8STRING.  */
+    case 16:
+      return CPP_STRING16;
+    case 32:
+      return CPP_STRING32;
+    }
+
+  if (bits_per_character == TYPE_PRECISION (wchar_type_node))
+    return CPP_WSTRING;
+
+  return CPP_OTHER;
+}
+
+/* The global record of string concatenations, for use in
+   extracting locations within string literals.  */
+
+GTY(()) string_concat_db *g_string_concat_db;
+
+/* Attempt to determine the source range of the substring.
+   If successful, return NULL and write the source range to *OUT_RANGE.
+   Otherwise return an error message.  Error messages are intended
+   for GCC developers (to help debugging) rather than for end-users.  */
+
+const char *
+substring_loc::get_range (source_range *out_range) const
+{
+  gcc_assert (out_range);
+
+  enum cpp_ttype tok_type = get_cpp_ttype_from_string_type (m_string_type);
+  if (tok_type == CPP_OTHER)
+    return "unrecognized string type";
+
+  return get_source_range_for_substring (parse_in, g_string_concat_db,
+					 m_fmt_string_loc, tok_type,
+					 m_start_idx, m_end_idx,
+					 out_range);
+}
+
 \f
 /* Fold X for consideration by one of the warning functions when checking
    whether an expression has a constant value.  */
diff --git a/gcc/c-family/c-common.h b/gcc/c-family/c-common.h
index 8c80574..7b5da57 100644
--- a/gcc/c-family/c-common.h
+++ b/gcc/c-family/c-common.h
@@ -1110,6 +1110,35 @@ extern time_t cb_get_source_date_epoch (cpp_reader *pfile);
    __TIME__ can store.  */
 #define MAX_SOURCE_DATE_EPOCH HOST_WIDE_INT_C (253402300799)
 
+extern GTY(()) string_concat_db *g_string_concat_db;
+
+/* libcpp can calculate location information about a range of characters
+   within a string literal, but doing so is non-trivial.
+
+   This class encapsulates such a source location, so that it can be
+   passed around (e.g. within c-format.c).  It is effectively a deferred
+   call into libcpp.  If needed by a diagnostic, the actual source_range
+   can be calculated by calling the get_range method.  */
+
+class substring_loc
+{
+ public:
+  substring_loc (location_t fmt_string_loc, tree string_type,
+		 int start_idx, int end_idx)
+  : m_fmt_string_loc (fmt_string_loc), m_string_type (string_type),
+    m_start_idx (start_idx), m_end_idx (end_idx) {}
+
+  const char *get_range (source_range *out_range) const;
+
+  location_t get_fmt_string_loc () const { return m_fmt_string_loc; }
+
+ private:
+  location_t m_fmt_string_loc;
+  tree m_string_type;
+  int m_start_idx;
+  int m_end_idx;
+};
+
 /* In c-gimplify.c  */
 extern void c_genericize (tree);
 extern int c_gimplify_expr (tree *, gimple_seq *, gimple_seq *);
diff --git a/gcc/c-family/c-lex.c b/gcc/c-family/c-lex.c
index 8f33d86..4c7e385 100644
--- a/gcc/c-family/c-lex.c
+++ b/gcc/c-family/c-lex.c
@@ -1097,13 +1097,16 @@ lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
   tree value;
   size_t concats = 0;
   struct obstack str_ob;
+  struct obstack loc_ob;
   cpp_string istr;
   enum cpp_ttype type = tok->type;
 
   /* Try to avoid the overhead of creating and destroying an obstack
      for the common case of just one string.  */
   cpp_string str = tok->val.str;
+  location_t init_loc = tok->src_loc;
   cpp_string *strs = &str;
+  location_t *locs = NULL;
 
   /* objc_at_sign_was_seen is only used when doing Objective-C string
      concatenation.  It is 'true' if we have seen an '@' before the
@@ -1142,16 +1145,21 @@ lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
 	  else
 	    error ("unsupported non-standard concatenation of string literals");
 	}
+      /* FALLTHROUGH */
 
     case CPP_STRING:
       if (!concats)
 	{
 	  gcc_obstack_init (&str_ob);
+	  gcc_obstack_init (&loc_ob);
 	  obstack_grow (&str_ob, &str, sizeof (cpp_string));
+	  obstack_grow (&loc_ob, &init_loc, sizeof (location_t));
 	}
 
       concats++;
       obstack_grow (&str_ob, &tok->val.str, sizeof (cpp_string));
+      obstack_grow (&loc_ob, &tok->src_loc, sizeof (location_t));
+
       if (objc_string)
 	objc_at_sign_was_seen = false;
       goto retry;
@@ -1164,7 +1172,10 @@ lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
   /* We have read one more token than we want.  */
   _cpp_backup_tokens (parse_in, 1);
   if (concats)
-    strs = XOBFINISH (&str_ob, cpp_string *);
+    {
+      strs = XOBFINISH (&str_ob, cpp_string *);
+      locs = XOBFINISH (&loc_ob, location_t *);
+    }
 
   if (concats && !objc_string && !in_system_header_at (input_location))
     warning (OPT_Wtraditional,
@@ -1176,6 +1187,12 @@ lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
     {
       value = build_string (istr.len, (const char *) istr.text);
       free (CONST_CAST (unsigned char *, istr.text));
+      if (concats)
+	{
+	  gcc_assert (locs);
+	  gcc_assert (g_string_concat_db);
+	  g_string_concat_db->record_string_concatenation (concats + 1, locs);
+	}
     }
   else
     {
@@ -1227,7 +1244,10 @@ lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
   *valp = fix_string_type (value);
 
   if (concats)
-    obstack_free (&str_ob, 0);
+    {
+      obstack_free (&str_ob, 0);
+      obstack_free (&loc_ob, 0);
+    }
 
   return objc_string ? CPP_OBJC_STRING : type;
 }
diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
index c11e7e7..0715b2e 100644
--- a/gcc/c-family/c-opts.c
+++ b/gcc/c-family/c-opts.c
@@ -216,6 +216,9 @@ c_common_init_options (unsigned int decoded_options_count,
   unsigned int i;
   struct cpp_callbacks *cb;
 
+  g_string_concat_db
+    = new (ggc_alloc <string_concat_db> ()) string_concat_db ();
+
   parse_in = cpp_create_reader (c_dialect_cxx () ? CLK_GNUCXX: CLK_GNUC89,
 				ident_hash, line_table);
   cb = cpp_get_callbacks (parse_in);
diff --git a/gcc/gcc.c b/gcc/gcc.c
index 7460f6a..062fcce 100644
--- a/gcc/gcc.c
+++ b/gcc/gcc.c
@@ -1117,7 +1117,7 @@ static const char *cpp_unique_options =
    options to the preprocessor so that it the cc1 spec may manipulate
    options used to set target flags.  Those special target flags settings may
    in turn cause preprocessor symbols to be defined specially.  */
-static const char *cpp_options =
+static const char *cpp_options_ =
 "%(cpp_unique_options) %1 %{m*} %{std*&ansi&trigraphs} %{W*&pedantic*} %{w}\
  %{f*} %{g*:%{!g0:%{g*} %{!fno-working-directory:-fworking-directory}}} %{O*}\
  %{undef} %{save-temps*:-fpch-preprocess}";
@@ -1558,7 +1558,7 @@ static struct spec_list static_specs[] =
   INIT_STATIC_SPEC ("asm_options",		&asm_options),
   INIT_STATIC_SPEC ("invoke_as",		&invoke_as),
   INIT_STATIC_SPEC ("cpp",			&cpp_spec),
-  INIT_STATIC_SPEC ("cpp_options",		&cpp_options),
+  INIT_STATIC_SPEC ("cpp_options",		&cpp_options_),
   INIT_STATIC_SPEC ("cpp_debug_options",	&cpp_debug_options),
   INIT_STATIC_SPEC ("cpp_unique_options",	&cpp_unique_options),
   INIT_STATIC_SPEC ("trad_capable_cpp",		&trad_capable_cpp),
diff --git a/gcc/input.c b/gcc/input.c
index 47845d00..5033824 100644
--- a/gcc/input.c
+++ b/gcc/input.c
@@ -1139,6 +1139,274 @@ dump_location_info (FILE *stream)
 				MAX_SOURCE_LOCATION + 1, UINT_MAX);
 }
 
+/* string_concat's constructor.  */
+
+string_concat::string_concat (int num, location_t *locs)
+  : m_num (num)
+{
+  m_locs = ggc_vec_alloc <location_t> (num);
+  for (int i = 0; i < num; i++)
+    m_locs[i] = locs[i];
+}
+
+/* string_concat_db's constructor.  */
+
+string_concat_db::string_concat_db ()
+{
+  m_table = hash_map <location_hash, string_concat *>::create_ggc (64);
+}
+
+/* Record that a string concatenation occurred, covering NUM
+   string literal tokens.  LOCS is an array of size NUM, containing the
+   locations of the tokens.  A copy of LOCS is taken.  */
+
+void
+string_concat_db::record_string_concatenation (int num, location_t *locs)
+{
+  gcc_assert (num > 1);
+  gcc_assert (locs);
+
+  location_t key_loc = get_key_loc (locs[0]);
+
+  string_concat *concat
+    = new (ggc_alloc <string_concat> ()) string_concat (num, locs);
+  m_table->put (key_loc, concat);
+}
+
+/* Determine if LOC was the location of the initial token of a
+   concatenation of string literal tokens.
+   If so, *OUT_NUM is written to with the number of tokens, and
+   *OUT_LOCS with the location of an array of locations of the
+   tokens, and return true.  *OUT_LOCS is a borrowed pointer to
+   storage owned by the string_concat_db.
+   Otherwise, return false.  */
+
+bool
+string_concat_db::get_string_concatenation (location_t loc,
+					    int *out_num,
+					    location_t **out_locs)
+{
+  gcc_assert (out_num);
+  gcc_assert (out_locs);
+
+  location_t key_loc = get_key_loc (loc);
+
+  string_concat **concat = m_table->get (key_loc);
+  if (!concat)
+    return false;
+
+  *out_num = (*concat)->m_num;
+  *out_locs =(*concat)->m_locs;
+  return true;
+}
+
+/* Internal function.  Canonicalize LOC into a form suitable for
+   use as a key within the database, stripping away macro expansion,
+   ad-hoc information, and range information, using the location of
+   the start of LOC within an ordinary linemap.  */
+
+location_t
+string_concat_db::get_key_loc (location_t loc)
+{
+  loc = linemap_resolve_location (line_table, loc, LRK_SPELLING_LOCATION,
+				  NULL);
+
+  loc = get_range_from_loc (line_table, loc).m_start;
+
+  return loc;
+}
+
+/* Helper class for use within get_substring_ranges_for_loc.
+   A vec of cpp_string with responsibility for releasing all of the
+   str->text for each str in the vector.  */
+
+class auto_cpp_string_vec :  public auto_vec <cpp_string>
+{
+ public:
+  auto_cpp_string_vec (int alloc)
+    : auto_vec <cpp_string> (alloc) {}
+
+  ~auto_cpp_string_vec ()
+  {
+    /* Clean up the copies within this vec.  */
+    int i;
+    cpp_string *str;
+    FOR_EACH_VEC_ELT (*this, i, str)
+      free (const_cast <unsigned char *> (str->text));
+  }
+};
+
+/* Attempt to populate RANGES with source location information on the
+   individual characters within the string literal found at STRLOC.
+   If CONCATS is non-NULL, then any string literals that the token at
+   STRLOC  was concatenated with are also added to RANGES.
+
+   Return NULL if successful, or an error message if any errors occurred (in
+   which case RANGES may be only partially populated and should not
+   be used).
+
+   This is implemented by re-parsing the relevant source line(s).  */
+
+static const char *
+get_substring_ranges_for_loc (cpp_reader *pfile,
+			      string_concat_db *concats,
+			      location_t strloc,
+			      enum cpp_ttype type,
+			      cpp_substring_ranges &ranges)
+{
+  gcc_assert (pfile);
+
+  if (strloc == UNKNOWN_LOCATION)
+    return "unknown location";
+
+  /* If string concatenation has occurred at STRLOC, get the locations
+     of all of the literal tokens making up the compound string.
+     Otherwise, just use STRLOC.  */
+  int num_locs = 1;
+  location_t *strlocs = &strloc;
+  if (concats)
+    concats->get_string_concatenation (strloc, &num_locs, &strlocs);
+
+  auto_cpp_string_vec strs (num_locs);
+  auto_vec <cpp_string_location_reader> loc_readers (num_locs);
+  for (int i = 0; i < num_locs; i++)
+    {
+      /* Get range of strloc.  We will use it to locate the start and finish
+	 of the literal token within the line.  */
+      source_range src_range = get_range_from_loc (line_table, strlocs[i]);
+
+      if (src_range.m_start >= LINE_MAP_MAX_LOCATION_WITH_COLS)
+	/* If so, we can't reliably determine where the token started within
+	   its line.  */
+	return "range starts after LINE_MAP_MAX_LOCATION_WITH_COLS";
+
+      if (src_range.m_finish >= LINE_MAP_MAX_LOCATION_WITH_COLS)
+	/* If so, we can't reliably determine where the token finished within
+	   its line.  */
+	return "range ends after LINE_MAP_MAX_LOCATION_WITH_COLS";
+
+      expanded_location start
+	= expand_location_to_spelling_point (src_range.m_start);
+      expanded_location finish
+	= expand_location_to_spelling_point (src_range.m_finish);
+      if (start.file != finish.file)
+	return "range endpoints are in different files";
+      if (start.line != finish.line)
+	return "range endpoints are on different lines";
+      if (start.column > finish.column)
+	return "range endpoints are reversed";
+
+      int line_width;
+      const char *line = location_get_source_line (start.file, start.line,
+						   &line_width);
+      if (line == NULL)
+	return "unable to read source line";
+
+      /* Determine the location of the literal (including quotes
+	 and leading prefix chars, such as the 'u' in a u""
+	 token).  */
+      const char *literal = line + start.column - 1;
+      int literal_length = finish.column - start.column + 1;
+
+      gcc_assert (line_width >= (start.column - 1 + literal_length));
+      cpp_string from;
+      from.len = literal_length;
+      /* Make a copy of the literal, to avoid having to rely on
+	 the lifetime of the copy of the line within the cache.
+	 This will be released by the auto_cpp_string_vec dtor.  */
+      from.text = XDUPVEC (unsigned char, literal, literal_length);
+      strs.safe_push (from);
+
+      /* For very long lines, a new linemap could have started
+	 halfway through the token.
+	 Ensure that the loc_reader uses the linemap of the
+	 *end* of the token for its start location.  */
+      const line_map_ordinary *final_ord_map;
+      linemap_resolve_location (line_table, src_range.m_finish,
+				LRK_MACRO_EXPANSION_POINT, &final_ord_map);
+      location_t start_loc
+	= linemap_position_for_line_and_column (line_table, final_ord_map,
+						start.line, start.column);
+
+      cpp_string_location_reader loc_reader (start_loc, line_table);
+      loc_readers.safe_push (loc_reader);
+    }
+
+  /* Rerun cpp_interpret_string, or rather, a modified version of it.  */
+  const char *err = cpp_interpret_string_ranges (pfile, strs.address (),
+						 loc_readers.address (),
+						 num_locs, &ranges, type);
+  if (err)
+    return err;
+
+  /* Success: "ranges" should now contain information on the string.  */
+  return NULL;
+}
+
+/* Attempt to populate *OUT_RANGE with source location information on the
+   range of given characters within the string literal found at STRLOC.
+   START_IDX and END_IDX are byte offsets into the string as encoded in
+   the execution character set.
+   If CONCATS is non-NULL, then any string literals that the token at
+   STRLOC was concatenated with are also considered.
+
+   This is implemented by re-parsing the relevant source line(s).
+
+   Return NULL if successful, or an error message if any errors occurred.
+   Error messages are intended for GCC developers (to help debugging) rather
+   than for end-users.  */
+
+const char *
+get_source_range_for_substring (cpp_reader *pfile,
+				string_concat_db *concats,
+				location_t strloc,
+				enum cpp_ttype type,
+				int start_idx, int end_idx,
+				source_range *out_range)
+{
+  gcc_checking_assert (start_idx >= 0);
+  gcc_checking_assert (end_idx >= 0);
+  gcc_assert (out_range);
+
+  cpp_substring_ranges ranges;
+  const char *err
+    = get_substring_ranges_for_loc (pfile, concats, strloc, type, ranges);
+  if (err)
+    return err;
+
+  if (start_idx >= ranges.get_num_ranges ())
+    return "start_idx out of range";
+  if (end_idx >= ranges.get_num_ranges ())
+    return "end_idx out of range";
+
+  out_range->m_start = ranges.get_range (start_idx).m_start;
+  out_range->m_finish = ranges.get_range (end_idx).m_finish;
+  return NULL;
+}
+
+/* As get_source_range_for_substring, but write to *OUT the number
+   of ranges that are available.  */
+
+const char *
+get_num_source_ranges_for_substring (cpp_reader *pfile,
+				     string_concat_db *concats,
+				     location_t strloc,
+				     enum cpp_ttype type,
+				     int *out)
+{
+  gcc_assert (out);
+
+  cpp_substring_ranges ranges;
+  const char *err
+    = get_substring_ranges_for_loc (pfile, concats, strloc, type, ranges);
+
+  if (err)
+    return err;
+
+  *out = ranges.get_num_ranges ();
+  return NULL;
+}
+
 #if CHECKING_P
 
 namespace selftest {
@@ -1481,6 +1749,1184 @@ test_lexer (const line_table_case &case_)
   cpp_destroy (parser);
 }
 
+/* Forward decls.  */
+
+struct lexer_test;
+class lexer_test_options;
+
+/* A class for specifying options of a lexer_test.
+   The "apply" vfunc is called during the lexer_test constructor.  */
+
+class lexer_test_options
+{
+ public:
+  virtual void apply (lexer_test &) = 0;
+};
+
+/* A struct for writing lexer tests.  */
+
+struct lexer_test
+{
+  lexer_test (const line_table_case &case_, const char *content,
+	      lexer_test_options *options);
+  ~lexer_test ();
+
+  const cpp_token *get_token ();
+
+  temp_source_file m_tempfile;
+  temp_line_table m_tmp_lt;
+  cpp_reader *m_parser;
+  string_concat_db m_concats;
+};
+
+/* Use an EBCDIC encoding for the execution charset, specifically
+   IBM1047-encoded (aka "EBCDIC 1047", or "Code page 1047").
+
+   This exercises iconv integration within libcpp.
+   Not every build of iconv supports the given charset,
+   so we need to flag this error and handle it gracefully.  */
+
+class ebcdic_execution_charset : public lexer_test_options
+{
+ public:
+  ebcdic_execution_charset () : m_num_iconv_errors (0)
+    {
+      gcc_assert (s_singleton == NULL);
+      s_singleton = this;
+    }
+  ~ebcdic_execution_charset ()
+    {
+      gcc_assert (s_singleton == this);
+      s_singleton = NULL;
+    }
+
+  void apply (lexer_test &test) FINAL OVERRIDE
+  {
+    cpp_options *cpp_opts = cpp_get_options (test.m_parser);
+    cpp_opts->narrow_charset = "IBM1047";
+
+    cpp_callbacks *callbacks = cpp_get_callbacks (test.m_parser);
+    callbacks->error = on_error;
+  }
+
+  static bool on_error (cpp_reader *pfile ATTRIBUTE_UNUSED,
+			int level ATTRIBUTE_UNUSED,
+			int reason ATTRIBUTE_UNUSED,
+			rich_location *richloc ATTRIBUTE_UNUSED,
+			const char *msgid, va_list *ap ATTRIBUTE_UNUSED)
+    ATTRIBUTE_FPTR_PRINTF(5,0)
+  {
+    gcc_assert (s_singleton);
+    /* Detect and record errors emitted by libcpp/charset.c:init_iconv_desc
+       when the local iconv build doesn't support the conversion.  */
+    if (strstr (msgid, "not supported by iconv"))
+      {
+	s_singleton->m_num_iconv_errors++;
+	return true;
+      }
+
+    /* Otherwise, we have an unexpected error.  */
+    abort ();
+  }
+
+  bool iconv_errors_occurred_p () const { return m_num_iconv_errors > 0; }
+
+ private:
+  static ebcdic_execution_charset *s_singleton;
+  int m_num_iconv_errors;
+};
+
+ebcdic_execution_charset *ebcdic_execution_charset::s_singleton;
+
+/* Constructor.  Override line_table with a new instance based on CASE_,
+   and write CONTENT to a tempfile.  Create a cpp_reader, and use it to
+   start parsing the tempfile.  */
+
+lexer_test::lexer_test (const line_table_case &case_, const char *content,
+			lexer_test_options *options) :
+  /* Create a tempfile and write the text to it.  */
+  m_tempfile (SELFTEST_LOCATION, ".c", content),
+  m_tmp_lt (case_),
+  m_parser (cpp_create_reader (CLK_GNUC99, NULL, line_table)),
+  m_concats ()
+{
+  if (options)
+    options->apply (*this);
+
+  cpp_init_iconv (m_parser);
+
+  /* Parse the file.  */
+  const char *fname = cpp_read_main_file (m_parser,
+					  m_tempfile.get_filename ());
+  ASSERT_NE (fname, NULL);
+}
+
+/* Destructor.  Verify that the next token in m_parser is EOF.  */
+
+lexer_test::~lexer_test ()
+{
+  location_t loc;
+  const cpp_token *tok;
+
+  tok = cpp_get_token_with_location (m_parser, &loc);
+  ASSERT_NE (tok, NULL);
+  ASSERT_EQ (tok->type, CPP_EOF);
+
+  cpp_finish (m_parser, NULL);
+  cpp_destroy (m_parser);
+}
+
+/* Get the next token from m_parser.  */
+
+const cpp_token *
+lexer_test::get_token ()
+{
+  location_t loc;
+  const cpp_token *tok;
+
+  tok = cpp_get_token_with_location (m_parser, &loc);
+  ASSERT_NE (tok, NULL);
+  return tok;
+}
+
+/* Verify that locations within string literals are correctly handled.  */
+
+/* Verify get_source_range_for_substring for token(s) at STRLOC,
+   using the string concatenation database for TEST.
+
+   Assert that the character at index IDX is on EXPECTED_LINE,
+   and that it begins at column EXPECTED_START_COL and ends at
+   EXPECTED_FINISH_COL (unless the locations are beyond
+   LINE_MAP_MAX_LOCATION_WITH_COLS, in which case don't check their
+   columns).  */
+
+static void
+assert_char_at_range (const location &loc,
+		      lexer_test& test,
+		      location_t strloc, enum cpp_ttype type, int idx,
+		      int expected_line, int expected_start_col,
+		      int expected_finish_col)
+{
+  cpp_reader *pfile = test.m_parser;
+  string_concat_db *concats = &test.m_concats;
+
+  source_range actual_range;
+  const char *err
+    = get_source_range_for_substring (pfile, concats, strloc, type,
+				      idx, idx, &actual_range);
+  if (should_have_column_data_p (strloc))
+    ASSERT_EQ_AT (loc, NULL, err);
+  else
+    {
+      ASSERT_STREQ_AT (loc,
+		       "range starts after LINE_MAP_MAX_LOCATION_WITH_COLS",
+		       err);
+      return;
+    }
+
+  int actual_start_line = LOCATION_LINE (actual_range.m_start);
+  ASSERT_EQ_AT (loc, expected_line, actual_start_line);
+  int actual_finish_line = LOCATION_LINE (actual_range.m_finish);
+  ASSERT_EQ_AT (loc, expected_line, actual_finish_line);
+
+  if (should_have_column_data_p (actual_range.m_start))
+    {
+      int actual_start_col = LOCATION_COLUMN (actual_range.m_start);
+      ASSERT_EQ_AT (loc, expected_start_col, actual_start_col);
+    }
+  if (should_have_column_data_p (actual_range.m_finish))
+    {
+      int actual_finish_col = LOCATION_COLUMN (actual_range.m_finish);
+      ASSERT_EQ_AT (loc, expected_finish_col, actual_finish_col);
+    }
+}
+
+/* Macro for calling assert_char_at_range, supplying SELFTEST_LOCATION for
+   the effective location of any errors.  */
+
+#define ASSERT_CHAR_AT_RANGE(LEXER_TEST, STRLOC, TYPE, IDX, EXPECTED_LINE, \
+			     EXPECTED_START_COL, EXPECTED_FINISH_COL)	\
+  assert_char_at_range (SELFTEST_LOCATION, (LEXER_TEST), (STRLOC), (TYPE), \
+			(IDX), (EXPECTED_LINE), (EXPECTED_START_COL), \
+			(EXPECTED_FINISH_COL))
+
+/* Verify get_num_source_ranges_for_substring for token(s) at STRLOC,
+   using the string concatenation database for TEST.
+
+   Assert that the token(s) at STRLOC contain EXPECTED_NUM_RANGES ranges.  */
+
+static void
+assert_num_substring_ranges (const location &loc,
+			     lexer_test& test,
+			     location_t strloc,
+			     enum cpp_ttype type,
+			     int expected_num_ranges)
+{
+  cpp_reader *pfile = test.m_parser;
+  string_concat_db *concats = &test.m_concats;
+
+  int actual_num_ranges;
+  const char *err
+    = get_num_source_ranges_for_substring (pfile, concats, strloc, type,
+					   &actual_num_ranges);
+  if (should_have_column_data_p (strloc))
+    ASSERT_EQ_AT (loc, NULL, err);
+  else
+    {
+      ASSERT_STREQ_AT (loc,
+		       "range starts after LINE_MAP_MAX_LOCATION_WITH_COLS",
+		       err);
+      return;
+    }
+  ASSERT_EQ_AT (loc, expected_num_ranges, actual_num_ranges);
+}
+
+/* Macro for calling assert_num_substring_ranges, supplying
+   SELFTEST_LOCATION for the effective location of any errors.  */
+
+#define ASSERT_NUM_SUBSTRING_RANGES(LEXER_TEST, STRLOC, TYPE, \
+				    EXPECTED_NUM_RANGES)		\
+  assert_num_substring_ranges (SELFTEST_LOCATION, (LEXER_TEST), (STRLOC), \
+			       (TYPE), (EXPECTED_NUM_RANGES))
+
+
+/* Verify that get_substring_ranges_for_loc for token(s) at STRLOC returns
+   EXPECTED_ERR (using the string concatenation database for TEST).  */
+
+static void
+assert_has_no_substring_ranges (const location &loc,
+				lexer_test& test,
+				location_t strloc,
+				enum cpp_ttype type,
+				const char *expected_err)
+{
+  cpp_reader *pfile = test.m_parser;
+  string_concat_db *concats = &test.m_concats;
+  cpp_substring_ranges ranges;
+  const char *actual_err
+    = get_substring_ranges_for_loc (pfile, concats, strloc,
+				    type, ranges);
+  if (should_have_column_data_p (strloc))
+    ASSERT_STREQ_AT (loc, expected_err, actual_err);
+  else
+    ASSERT_STREQ_AT (loc,
+		     "range starts after LINE_MAP_MAX_LOCATION_WITH_COLS",
+		     actual_err);
+}
+
+#define ASSERT_HAS_NO_SUBSTRING_RANGES(LEXER_TEST, STRLOC, TYPE, ERR)    \
+    assert_has_no_substring_ranges (SELFTEST_LOCATION, (LEXER_TEST), \
+				    (STRLOC), (TYPE), (ERR))
+
+/* Lex a simple string literal.  Verify the substring location data, before
+   and after running cpp_interpret_string on it.  */
+
+static void
+test_lexer_string_locations_simple (const line_table_case &case_)
+{
+  /* Digits 0-9 (with 0 at column 10), the simple way.
+     ....................000000000.11111111112.2222222223333333333
+     ....................123456789.01234567890.1234567890123456789
+     We add a trailing comment to ensure that we correctly locate
+     the end of the string literal token.  */
+  const char *content = "        \"0123456789\" /* not a string */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"0123456789\"");
+  ASSERT_TOKEN_LOC_EQ (tok, test.m_tempfile.get_filename (), 1, 9, 20);
+
+  /* At this point in lexing, the quote characters are treated as part of
+     the string (they are stripped off by cpp_interpret_string).  */
+
+  ASSERT_EQ (tok->val.str.len, 12);
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("0123456789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.  */
+  for (int i = 0; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1,
+			  10 + i, 10 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, type, 10);
+}
+
+/* As test_lexer_string_locations_simple, but use an EBCDIC execution
+   encoding.  */
+
+static void
+test_lexer_string_locations_ebcdic (const line_table_case &case_)
+{
+  /* EBCDIC support requires iconv.  */
+  if (!HAVE_ICONV)
+    return;
+
+  /* Digits 0-9 (with 0 at column 10), the simple way.
+     ....................000000000.11111111112.2222222223333333333
+     ....................123456789.01234567890.1234567890123456789
+     We add a trailing comment to ensure that we correctly locate
+     the end of the string literal token.  */
+  const char *content = "        \"0123456789\" /* not a string */\n";
+  ebcdic_execution_charset use_ebcdic;
+  lexer_test test (case_, content, &use_ebcdic);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"0123456789\"");
+  ASSERT_TOKEN_LOC_EQ (tok, test.m_tempfile.get_filename (), 1, 9, 20);
+
+  /* At this point in lexing, the quote characters are treated as part of
+     the string (they are stripped off by cpp_interpret_string).  */
+
+  ASSERT_EQ (tok->val.str.len, 12);
+
+  /* The remainder of the test requires an iconv implementation that
+     can convert from UTF-8 to the EBCDIC encoding requested above.  */
+  if (use_ebcdic.iconv_errors_occurred_p ())
+    return;
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  /* We should now have EBCDIC-encoded text, specifically
+     IBM1047-encoded (aka "EBCDIC 1047", or "Code page 1047").
+     The digits 0-9 are encoded as 240-249 i.e. 0xf0-0xf9.  */
+  ASSERT_STREQ ("\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9",
+		(const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify that we don't attempt to record substring location information
+     for such cases.  */
+  ASSERT_HAS_NO_SUBSTRING_RANGES
+    (test, tok->src_loc, type,
+     "execution character set != source character set");
+}
+
+/* Lex a string literal containing a hex-escaped character.
+   Verify the substring location data, before and after running
+   cpp_interpret_string on it.  */
+
+static void
+test_lexer_string_locations_hex (const line_table_case &case_)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as "\x35"
+     and with a space in place of digit 6, to terminate the escaped
+     hex code.
+     ....................000000000.111111.11112222.
+     ....................123456789.012345.67890123.  */
+  const char *content = "        \"01234\\x35 789\"\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"01234\\x35 789\"");
+  ASSERT_TOKEN_LOC_EQ (tok, test.m_tempfile.get_filename (), 1, 9, 23);
+
+  /* At this point in lexing, the quote characters are treated as part of
+     the string (they are stripped off by cpp_interpret_string).  */
+  ASSERT_EQ (tok->val.str.len, 15);
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("012345 789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.  */
+  for (int i = 0; i <= 4; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 10 + i, 10 + i);
+  ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, 5, 1, 15, 18);
+  for (int i = 6; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 13 + i, 13 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, type, 10);
+}
+
+/* Lex a string literal containing an octal-escaped character.
+   Verify the substring location data after running cpp_interpret_string
+   on it.  */
+
+static void
+test_lexer_string_locations_oct (const line_table_case &case_)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as "\065"
+     and with a space in place of digit 6, to terminate the escaped
+     octal code.
+     ....................000000000.111111.11112222.2222223333333333444
+     ....................123456789.012345.67890123.4567890123456789012  */
+  const char *content = "        \"01234\\065 789\" /* not a string */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"01234\\065 789\"");
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("012345 789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.  */
+  for (int i = 0; i < 5; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 10 + i, 10 + i);
+  ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, 5, 1, 15, 18);
+  for (int i = 6; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 13 + i, 13 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, type, 10);
+}
+
+/* Test of string literal containing letter escapes.  */
+
+static void
+test_lexer_string_locations_letter_escape_1 (const line_table_case &case_)
+{
+  /* The string "\tfoo\\\nbar" i.e. tab, "foo", backslash, newline, bar.
+     .....................000000000.1.11111.1.1.11222.22222223333333
+     .....................123456789.0.12345.6.7.89012.34567890123456.  */
+  const char *content = ("        \"\\tfoo\\\\\\nbar\" /* non-str */\n");
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected tokens back.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"\\tfoo\\\\\\nbar\"");
+
+  /* Verify ranges of individual characters.  */
+  /* "\t".  */
+  ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			0, 1, 10, 11);
+  /* "foo".  */
+  for (int i = 1; i <= 3; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			  i, 1, 11 + i, 11 + i);
+  /* "\\" and "\n".  */
+  ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			4, 1, 15, 16);
+  ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			5, 1, 17, 18);
+
+  /* "bar".  */
+  for (int i = 6; i <= 8; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			  i, 1, 13 + i, 13 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, CPP_STRING, 9);
+}
+
+/* Another test of a string literal containing a letter escape.
+   Based on string seen in
+     printf ("%-%\n");
+   in gcc.dg/format/c90-printf-1.c.  */
+
+static void
+test_lexer_string_locations_letter_escape_2 (const line_table_case &case_)
+{
+  /* .....................000000000.1111.11.1111.22222222223.
+     .....................123456789.0123.45.6789.01234567890.  */
+  const char *content = ("        \"%-%\\n\" /* non-str */\n");
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected tokens back.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"%-%\\n\"");
+
+  /* Verify ranges of individual characters.  */
+  /* "%-%".  */
+  for (int i = 0; i < 3; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			  i, 1, 10 + i, 10 + i);
+  /* "\n".  */
+  ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			3, 1, 13, 14);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, CPP_STRING, 4);
+}
+
+/* Lex a string literal containing UCN 4 characters.
+   Verify the substring location data after running cpp_interpret_string
+   on it.  */
+
+static void
+test_lexer_string_locations_ucn4 (const line_table_case &case_)
+{
+  /* Digits 0-9, expressing digits 5 and 6 as Roman numerals expressed
+     as UCN 4.
+     ....................000000000.111111.111122.222222223.33333333344444
+     ....................123456789.012345.678901.234567890.12345678901234  */
+  const char *content = "        \"01234\\u2174\\u2175789\" /* non-str */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"01234\\u2174\\u2175789\"");
+
+  /* Verify that cpp_interpret_string works.
+     The string should be encoded in the execution character
+     set.  Assuming that that is UTF-8, we should have the following:
+     -----------  ----  -----  -------  ----------------
+     Byte offset  Byte  Octal  Unicode  Source Column(s)
+     -----------  ----  -----  -------  ----------------
+     0            0x30         '0'      10
+     1            0x31         '1'      11
+     2            0x32         '2'      12
+     3            0x33         '3'      13
+     4            0x34         '4'      14
+     5            0xE2  \342   U+2174   15-20
+     6            0x85  \205    (cont)  15-20
+     7            0xB4  \264    (cont)  15-20
+     8            0xE2  \342   U+2175   21-26
+     9            0x85  \205    (cont)  21-26
+     10           0xB5  \265    (cont)  21-26
+     11           0x37         '7'      27
+     12           0x38         '8'      28
+     13           0x39         '9'      29
+     -----------  ----  -----  -------  ---------------.  */
+
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("01234\342\205\264\342\205\265789",
+		(const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.
+     '01234'.  */
+  for (int i = 0; i <= 4; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 10 + i, 10 + i);
+  /* U+2174.  */
+  for (int i = 5; i <= 7; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 15, 20);
+  /* U+2175.  */
+  for (int i = 8; i <= 10; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 21, 26);
+  /* '789'.  */
+  for (int i = 11; i <= 13; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 16 + i, 16 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, type, 14);
+}
+
+/* Lex a string literal containing UCN 8 characters.
+   Verify the substring location data after running cpp_interpret_string
+   on it.  */
+
+static void
+test_lexer_string_locations_ucn8 (const line_table_case &case_)
+{
+  /* Digits 0-9, expressing digits 5 and 6 as Roman numerals as UCN 8.
+     ....................000000000.111111.1111222222.2222333333333.344444
+     ....................123456789.012345.6789012345.6789012345678.901234  */
+  const char *content = "        \"01234\\U00002174\\U00002175789\" /* */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok,
+			   "\"01234\\U00002174\\U00002175789\"");
+
+  /* Verify that cpp_interpret_string works.
+     The UTF-8 encoding of the string is identical to that from
+     the ucn4 testcase above; the only difference is the column
+     locations.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("01234\342\205\264\342\205\265789",
+		(const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.
+     '01234'.  */
+  for (int i = 0; i <= 4; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 10 + i, 10 + i);
+  /* U+2174.  */
+  for (int i = 5; i <= 7; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 15, 24);
+  /* U+2175.  */
+  for (int i = 8; i <= 10; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 25, 34);
+  /* '789' at columns 35-37.  */
+  for (int i = 11; i <= 13; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 24 + i, 24 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, type, 14);
+}
+
+/* Fetch a big-endian 32-bit value and convert to host endianness.  */
+
+static uint32_t
+uint32_from_big_endian (const uint32_t *ptr_be_value)
+{
+  const unsigned char *buf = (const unsigned char *)ptr_be_value;
+  return (((uint32_t) buf[0] << 24)
+	  | ((uint32_t) buf[1] << 16)
+	  | ((uint32_t) buf[2] << 8)
+	  | (uint32_t) buf[3]);
+}
+
+/* Lex a wide string literal and verify that attempts to read substring
+   location data from it fail gracefully.  */
+
+static void
+test_lexer_string_locations_wide_string (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     ....................000000000.11111111112.22222222233333
+     ....................123456789.01234567890.12345678901234  */
+  const char *content = "       L\"0123456789\" /* non-str */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_WSTRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "L\"0123456789\"");
+
+  /* Verify that cpp_interpret_string works, using CPP_WSTRING.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_WSTRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  /* The cpp_reader defaults to big-endian with
+     CHAR_BIT * sizeof (int) for the wchar_precision, so dst_string should
+     now be encoded as UTF-32BE.  */
+  const uint32_t *be32_chars = (const uint32_t *)dst_string.text;
+  ASSERT_EQ ('0', uint32_from_big_endian (&be32_chars[0]));
+  ASSERT_EQ ('5', uint32_from_big_endian (&be32_chars[5]));
+  ASSERT_EQ ('9', uint32_from_big_endian (&be32_chars[9]));
+  ASSERT_EQ (0, uint32_from_big_endian (&be32_chars[10]));
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* We don't yet support generating substring location information
+     for L"" strings.  */
+  ASSERT_HAS_NO_SUBSTRING_RANGES
+    (test, tok->src_loc, type,
+     "execution character set != source character set");
+}
+
+/* Fetch a big-endian 16-bit value and convert to host endianness.  */
+
+static uint16_t
+uint16_from_big_endian (const uint16_t *ptr_be_value)
+{
+  const unsigned char *buf = (const unsigned char *)ptr_be_value;
+  return ((uint16_t) buf[0] << 8) | (uint16_t) buf[1];
+}
+
+/* Lex a u"" string literal and verify that attempts to read substring
+   location data from it fail gracefully.  */
+
+static void
+test_lexer_string_locations_string16 (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     ....................000000000.11111111112.22222222233333
+     ....................123456789.01234567890.12345678901234  */
+  const char *content = "       u\"0123456789\" /* non-str */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING16);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "u\"0123456789\"");
+
+  /* Verify that cpp_interpret_string works, using CPP_STRING16.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING16;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+
+  /* The cpp_reader defaults to big-endian, so dst_string should
+     now be encoded as UTF-16BE.  */
+  const uint16_t *be16_chars = (const uint16_t *)dst_string.text;
+  ASSERT_EQ ('0', uint16_from_big_endian (&be16_chars[0]));
+  ASSERT_EQ ('5', uint16_from_big_endian (&be16_chars[5]));
+  ASSERT_EQ ('9', uint16_from_big_endian (&be16_chars[9]));
+  ASSERT_EQ (0, uint16_from_big_endian (&be16_chars[10]));
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* We don't yet support generating substring location information
+     for u"" strings.  */
+  ASSERT_HAS_NO_SUBSTRING_RANGES
+    (test, tok->src_loc, type,
+     "execution character set != source character set");
+}
+
+/* Lex a U"" string literal and verify that attempts to read substring
+   location data from it fail gracefully.  */
+
+static void
+test_lexer_string_locations_string32 (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     ....................000000000.11111111112.22222222233333
+     ....................123456789.01234567890.12345678901234  */
+  const char *content = "       U\"0123456789\" /* non-str */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING32);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "U\"0123456789\"");
+
+  /* Verify that cpp_interpret_string works, using CPP_STRING32.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING32;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+
+  /* The cpp_reader defaults to big-endian, so dst_string should
+     now be encoded as UTF-32BE.  */
+  const uint32_t *be32_chars = (const uint32_t *)dst_string.text;
+  ASSERT_EQ ('0', uint32_from_big_endian (&be32_chars[0]));
+  ASSERT_EQ ('5', uint32_from_big_endian (&be32_chars[5]));
+  ASSERT_EQ ('9', uint32_from_big_endian (&be32_chars[9]));
+  ASSERT_EQ (0, uint32_from_big_endian (&be32_chars[10]));
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* We don't yet support generating substring location information
+     for U"" strings.  */
+  ASSERT_HAS_NO_SUBSTRING_RANGES
+    (test, tok->src_loc, type,
+     "execution character set != source character set");
+}
+
+/* Lex a u8-string literal.
+   Verify the substring location data after running cpp_interpret_string
+   on it.  */
+
+static void
+test_lexer_string_locations_u8 (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     ....................000000000.11111111112.22222222233333
+     ....................123456789.01234567890.12345678901234  */
+  const char *content = "      u8\"0123456789\" /* non-str */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_UTF8STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "u8\"0123456789\"");
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("0123456789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.  */
+  for (int i = 0; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 10 + i, 10 + i);
+}
+
+/* Lex a string literal containing UTF-8 source characters.
+   Verify the substring location data after running cpp_interpret_string
+   on it.  */
+
+static void
+test_lexer_string_locations_utf8_source (const line_table_case &case_)
+{
+  /* This string literal is written out to the source file as UTF-8,
+     and is of the form "before mojibake after", where "mojibake"
+     is written as the following four Unicode code points:
+        U+6587 CJK UNIFIED IDEOGRAPH-6587
+        U+5B57 CJK UNIFIED IDEOGRAPH-5B57
+        U+5316 CJK UNIFIED IDEOGRAPH-5316
+        U+3051 HIRAGANA LETTER KE.
+     Each of these is 3 bytes wide when encoded in UTF-8, whereas the
+     "before" and "after" are 1 byte per Unicode character.
+
+     The numbers shown are "columns", which are *byte* numbers within
+     the line, rather than Unicode character numbers.
+
+     .................... 000000000.1111111.
+     .................... 123456789.0123456.  */
+  const char *content = ("        \"before "
+			 /* U+6587 CJK UNIFIED IDEOGRAPH-6587
+			      UTF-8: 0xE6 0x96 0x87
+			      C octal escaped UTF-8: \346\226\207
+			    "column" numbers: 17-19.  */
+			 "\346\226\207"
+
+			 /* U+5B57 CJK UNIFIED IDEOGRAPH-5B57
+			      UTF-8: 0xE5 0xAD 0x97
+			      C octal escaped UTF-8: \345\255\227
+			    "column" numbers: 20-22.  */
+			 "\345\255\227"
+
+			 /* U+5316 CJK UNIFIED IDEOGRAPH-5316
+			      UTF-8: 0xE5 0x8C 0x96
+			      C octal escaped UTF-8: \345\214\226
+			    "column" numbers: 23-25.  */
+			 "\345\214\226"
+
+			 /* U+3051 HIRAGANA LETTER KE
+			      UTF-8: 0xE3 0x81 0x91
+			      C octal escaped UTF-8: \343\201\221
+			    "column" numbers: 26-28.  */
+			 "\343\201\221"
+
+			 /* column numbers 29 onwards
+			  2333333.33334444444444
+			  9012345.67890123456789. */
+			 " after\" /* non-str */\n");
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ
+    (test.m_parser, tok,
+     "\"before \346\226\207\345\255\227\345\214\226\343\201\221 after\"");
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ
+    ("before \346\226\207\345\255\227\345\214\226\343\201\221 after",
+     (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.
+     Assuming that both source and execution encodings are UTF-8, we have
+     a run of 25 octets in each.  */
+  for (int i = 0; i < 25; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 10 + i, 10 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, type, 25);
+}
+
+/* Test of string literal concatenation.  */
+
+static void
+test_lexer_string_locations_concatenation_1 (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     .....................000000000.111111.11112222222222
+     .....................123456789.012345.67890123456789.  */
+  const char *content = ("        \"01234\" /* non-str */\n"
+			 "        \"56789\" /* non-str */\n");
+  lexer_test test (case_, content, NULL);
+
+  location_t input_locs[2];
+
+  /* Verify that we get the expected tokens back.  */
+  auto_vec <cpp_string> input_strings;
+  const cpp_token *tok_a = test.get_token ();
+  ASSERT_EQ (tok_a->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ
+    (test.m_parser, tok_a,
+     "\"01234\"");
+  input_strings.safe_push (tok_a->val.str);
+  input_locs[0] = tok_a->src_loc;
+
+  const cpp_token *tok_b = test.get_token ();
+  ASSERT_EQ (tok_b->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ
+    (test.m_parser, tok_b,
+     "\"56789\"");
+  input_strings.safe_push (tok_b->val.str);
+  input_locs[1] = tok_b->src_loc;
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser,
+				      input_strings.address (), 2,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("0123456789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Simulate c-lex.c's lex_string in order to record concatenation.  */
+  test.m_concats.record_string_concatenation (2, input_locs);
+
+  location_t initial_loc = input_locs[0];
+
+  for (int i = 0; i <= 4; i++)
+    ASSERT_CHAR_AT_RANGE (test, initial_loc, type, i, 1, 10 + i, 10 + i);
+  for (int i = 5; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, initial_loc, type, i, 2, 5 + i, 5 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, initial_loc, type, 10);
+}
+
+/* Another test of string literal concatenation.  */
+
+static void
+test_lexer_string_locations_concatenation_2 (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     .....................000000000.111.11111112222222
+     .....................123456789.012.34567890123456.  */
+  const char *content = ("        \"01\" /* non-str */\n"
+			 "        \"23\" /* non-str */\n"
+			 "        \"45\" /* non-str */\n"
+			 "        \"67\" /* non-str */\n"
+			 "        \"89\" /* non-str */\n");
+  lexer_test test (case_, content, NULL);
+
+  auto_vec <cpp_string> input_strings;
+  location_t input_locs[5];
+
+  /* Verify that we get the expected tokens back.  */
+  for (int i = 0; i < 5; i++)
+    {
+      const cpp_token *tok = test.get_token ();
+      ASSERT_EQ (tok->type, CPP_STRING);
+      input_strings.safe_push (tok->val.str);
+      input_locs[i] = tok->src_loc;
+    }
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser,
+				      input_strings.address (), 5,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("0123456789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Simulate c-lex.c's lex_string in order to record concatenation.  */
+  test.m_concats.record_string_concatenation (5, input_locs);
+
+  location_t initial_loc = input_locs[0];
+
+  /* Within ASSERT_CHAR_AT_RANGE (actually assert_char_at_range), we can
+     detect if the initial loc is after LINE_MAP_MAX_LOCATION_WITH_COLS
+     and expect get_source_range_for_substring to fail.
+     However, for a string concatenation test, we can have a case
+     where the initial string is fully before LINE_MAP_MAX_LOCATION_WITH_COLS,
+     but subsequent strings can be after it.
+     Attempting to detect this within assert_char_at_range
+     would overcomplicate the logic for the common test cases, so
+     we detect it here.  */
+  if (should_have_column_data_p (input_locs[0])
+      && !should_have_column_data_p (input_locs[4]))
+    {
+      /* Verify that get_source_range_for_substring gracefully rejects
+	 this case.  */
+      source_range actual_range;
+      const char *err
+	= get_source_range_for_substring (test.m_parser, &test.m_concats,
+					  initial_loc, type, 0, 0,
+					  &actual_range);
+      ASSERT_STREQ ("range starts after LINE_MAP_MAX_LOCATION_WITH_COLS", err);
+      return;
+    }
+
+  for (int i = 0; i < 5; i++)
+    for (int j = 0; j < 2; j++)
+      ASSERT_CHAR_AT_RANGE (test, initial_loc, type, (i * 2) + j,
+			    i + 1, 10 + j, 10 + j);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, initial_loc, type, 10);
+}
+
+/* Another test of string literal concatenation, this time combined with
+   various kinds of escaped characters.  */
+
+static void
+test_lexer_string_locations_concatenation_3 (const line_table_case &case_)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as hex "\x35" and
+     digit 6 in ASCII as octal "\066", concatenating multiple strings.  */
+  const char *content
+    /* .000000000.111111.111.1.2222.222.2.2233.333.3333.34444444444555
+       .123456789.012345.678.9.0123.456.7.8901.234.5678.90123456789012. */
+    = ("        \"01234\"  \"\\x35\"  \"\\066\"  \"789\" /* non-str */\n");
+  lexer_test test (case_, content, NULL);
+
+  auto_vec <cpp_string> input_strings;
+  location_t input_locs[4];
+
+  /* Verify that we get the expected tokens back.  */
+  for (int i = 0; i < 4; i++)
+    {
+      const cpp_token *tok = test.get_token ();
+      ASSERT_EQ (tok->type, CPP_STRING);
+      input_strings.safe_push (tok->val.str);
+      input_locs[i] = tok->src_loc;
+    }
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser,
+				      input_strings.address (), 4,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("0123456789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Simulate c-lex.c's lex_string in order to record concatenation.  */
+  test.m_concats.record_string_concatenation (4, input_locs);
+
+  location_t initial_loc = input_locs[0];
+
+  for (int i = 0; i <= 4; i++)
+    ASSERT_CHAR_AT_RANGE (test, initial_loc, type, i, 1, 10 + i, 10 + i);
+  ASSERT_CHAR_AT_RANGE (test, initial_loc, type, 5, 1, 19, 22);
+  ASSERT_CHAR_AT_RANGE (test, initial_loc, type, 6, 1, 27, 30);
+  for (int i = 7; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, initial_loc, type, i, 1, 28 + i, 28 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, initial_loc, type, 10);
+}
+
+/* Test of string literal in a macro.  */
+
+static void
+test_lexer_string_locations_macro (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     .....................0000000001111111111.22222222223.
+     .....................1234567890123456789.01234567890.  */
+  const char *content = ("#define MACRO     \"0123456789\" /* non-str */\n"
+			 "  MACRO");
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected tokens back.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_PADDING);
+
+  tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ
+    (test.m_parser, tok,
+     "\"0123456789\"");
+
+  /* Verify ranges of individual characters.  We ought to
+     see columns within the macro definition.  */
+  for (int i = 0; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			  i, 1, 20 + i, 20 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, CPP_STRING, 10);
+
+  tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_PADDING);
+}
+
+/* Ensure that we fail gracefully if something attempts to pass
+   in a location that isn't a string literal token.  Seen on this code:
+
+     const char a[] = " %d ";
+     __builtin_printf (a, 0.5);
+                       ^
+
+   when c-format.c erroneously used the indicated one-character
+   location as the format string location, leading to a read past the
+   end of a string buffer in cpp_interpret_string_1.  */
+
+static void
+test_lexer_string_locations_non_string (const line_table_case &case_)
+{
+  /* .....................000000000111111111122222222223.
+     .....................123456789012345678901234567890.  */
+  const char *content = ("         a\n");
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_NAME);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "a");
+
+  /* At this point, libcpp is attempting to interpret the name as a
+     string literal, despite it not starting with a quote.  We don't detect
+     that, but we should at least fail gracefully.  */
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, CPP_STRING, 0);
+}
+
+/* Ensure that we can read substring information for a token which
+   starts in one linemap and ends in another.  Adapted from
+   gcc.dg/cpp/pr69985.c.  */
+
+static void
+test_lexer_string_locations_long_line (const line_table_case &case_)
+{
+  /* .....................000000.000111111111
+     .....................123456.789012345678.  */
+  const char *content = ("/* A very long line, so that we start a new line map.  */\n"
+			 "     \"0123456789012345678901234567890123456789"
+			 "0123456789012345678901234567890123456789"
+			 "0123456789012345678901234567890123456789"
+			 "0123456789\"\n");
+
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+
+  if (!should_have_column_data_p (line_table->highest_location))
+    return;
+
+  /* Verify ranges of individual characters.  */
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, CPP_STRING, 130);
+  for (int i = 0; i < 130; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			  i, 2, 7 + i, 7 + i);
+}
+
 /* A table of interesting location_t values, giving one axis of our test
    matrix.  */
 
@@ -1539,6 +2985,25 @@ input_c_tests ()
 	  /* Run all tests for the given case within the test matrix.  */
 	  test_accessing_ordinary_linemaps (c);
 	  test_lexer (c);
+	  test_lexer_string_locations_simple (c);
+	  test_lexer_string_locations_ebcdic (c);
+	  test_lexer_string_locations_hex (c);
+	  test_lexer_string_locations_oct (c);
+	  test_lexer_string_locations_letter_escape_1 (c);
+	  test_lexer_string_locations_letter_escape_2 (c);
+	  test_lexer_string_locations_ucn4 (c);
+	  test_lexer_string_locations_ucn8 (c);
+	  test_lexer_string_locations_wide_string (c);
+	  test_lexer_string_locations_string16 (c);
+	  test_lexer_string_locations_string32 (c);
+	  test_lexer_string_locations_u8 (c);
+	  test_lexer_string_locations_utf8_source (c);
+	  test_lexer_string_locations_concatenation_1 (c);
+	  test_lexer_string_locations_concatenation_2 (c);
+	  test_lexer_string_locations_concatenation_3 (c);
+	  test_lexer_string_locations_macro (c);
+	  test_lexer_string_locations_non_string (c);
+	  test_lexer_string_locations_long_line (c);
 
 	  num_cases_tested++;
 	}
diff --git a/gcc/input.h b/gcc/input.h
index ae4fecf..b61cf19 100644
--- a/gcc/input.h
+++ b/gcc/input.h
@@ -22,6 +22,7 @@ along with GCC; see the file COPYING3.  If not see
 #define GCC_INPUT_H
 
 #include "line-map.h"
+#include <cpplib.h>
 
 extern GTY(()) struct line_maps *line_table;
 
@@ -82,4 +83,46 @@ void dump_location_info (FILE *stream);
 
 void diagnostics_file_cache_fini (void);
 
+struct GTY(()) string_concat
+{
+  string_concat (int num, location_t *locs);
+
+  int m_num;
+  location_t * GTY ((atomic)) m_locs;
+};
+
+struct location_hash : int_hash <location_t, UNKNOWN_LOCATION> { };
+
+class GTY(()) string_concat_db
+{
+ public:
+  string_concat_db ();
+  void record_string_concatenation (int num, location_t *locs);
+
+  bool get_string_concatenation (location_t loc,
+				 int *out_num,
+				 location_t **out_locs);
+
+ private:
+  static location_t get_key_loc (location_t loc);
+
+  /* For the fields to be private, we must grant access to the
+     generated code in gtype-desc.c.  */
+
+  friend void ::gt_ggc_mx_string_concat_db (void *x_p);
+  friend void ::gt_pch_nx_string_concat_db (void *x_p);
+  friend void ::gt_pch_p_16string_concat_db (void *this_obj, void *x_p,
+					     gt_pointer_operator op,
+					     void *cookie);
+
+  hash_map <location_hash, string_concat *> *m_table;
+};
+
+extern const char *get_source_range_for_substring (cpp_reader *pfile,
+						   string_concat_db *concats,
+						   location_t strloc,
+						   enum cpp_ttype type,
+						   int start_idx, int end_idx,
+						   source_range *out_range);
+
 #endif
diff --git a/gcc/selftest.h b/gcc/selftest.h
index 0bee476..397e998 100644
--- a/gcc/selftest.h
+++ b/gcc/selftest.h
@@ -104,13 +104,19 @@ extern int num_passes;
    ::selftest::fail if it false.  */
 
 #define ASSERT_TRUE(EXPR)				\
+  ASSERT_TRUE_AT (SELFTEST_LOCATION, (EXPR))
+
+/* Like ASSERT_TRUE, but treat LOC as the effective location of the
+   selftest.  */
+
+#define ASSERT_TRUE_AT(LOC, EXPR)			\
   SELFTEST_BEGIN_STMT					\
   const char *desc = "ASSERT_TRUE (" #EXPR ")";		\
   bool actual = (EXPR);					\
   if (actual)						\
-    ::selftest::pass (SELFTEST_LOCATION, desc);	\
+    ::selftest::pass ((LOC), desc);			\
   else							\
-    ::selftest::fail (SELFTEST_LOCATION, desc);		\
+    ::selftest::fail ((LOC), desc);			\
   SELFTEST_END_STMT
 
 /* Evaluate EXPR and coerce to bool, calling
@@ -118,13 +124,19 @@ extern int num_passes;
    ::selftest::fail if it true.  */
 
 #define ASSERT_FALSE(EXPR)					\
+  ASSERT_FALSE_AT (SELFTEST_LOCATION, (EXPR))
+
+/* Like ASSERT_FALSE, but treat LOC as the effective location of the
+   selftest.  */
+
+#define ASSERT_FALSE_AT(LOC, EXPR)				\
   SELFTEST_BEGIN_STMT						\
-  const char *desc = "ASSERT_FALSE (" #EXPR ")";		\
-  bool actual = (EXPR);					\
-  if (actual)							\
-    ::selftest::fail (SELFTEST_LOCATION, desc);				\
-  else								\
-    ::selftest::pass (SELFTEST_LOCATION, desc);				\
+  const char *desc = "ASSERT_FALSE (" #EXPR ")";			\
+  bool actual = (EXPR);							\
+  if (actual)								\
+    ::selftest::fail ((LOC), desc);			\
+  else									\
+    ::selftest::pass ((LOC), desc);					\
   SELFTEST_END_STMT
 
 /* Evaluate EXPECTED and ACTUAL and compare them with ==, calling
@@ -169,7 +181,7 @@ extern int num_passes;
 			    (EXPECTED), (ACTUAL));		    \
   SELFTEST_END_STMT
 
-/* Like ASSERT_STREQ_AT, but treat LOC as the effective location of the
+/* Like ASSERT_STREQ, but treat LOC as the effective location of the
    selftest.  */
 
 #define ASSERT_STREQ_AT(LOC, EXPECTED, ACTUAL)			    \
diff --git a/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c b/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c
new file mode 100644
index 0000000..82689b4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c
@@ -0,0 +1,211 @@
+/* { dg-do compile } */
+/* { dg-options "-O -fdiagnostics-show-caret" } */
+
+/* This is a collection of unittests for ranges within string literals,
+   using diagnostic_plugin_test_string_literals, which handles
+   "__emit_string_literal_range" by generating a warning at the given
+   subset of a string literal.
+
+   The indices are 0-based.  It's easiest to verify things using string
+   literals that are runs of 0-based digits (to avoid having to count
+   characters).
+
+   LITERAL is a const void * to allow testing the various kinds of wide
+   string literal, rather than just const char *.  */
+
+extern void __emit_string_literal_range (const void *literal,
+					 int start_idx, int end_idx);
+
+void
+test_simple_string_literal (void)
+{
+  __emit_string_literal_range ("0123456789", /* { dg-warning "range" } */
+			       6, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("0123456789",
+                                       ^~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_concatenated_string_literal (void)
+{
+  __emit_string_literal_range ("01234" "56789", /* { dg-warning "range" } */
+			       3, 6);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234" "56789",
+                                    ^~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_multiline_string_literal (void)
+{
+  __emit_string_literal_range ("01234" /* { dg-warning "range" } */
+                               "56789",
+                               3, 6);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234"
+                                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+                                "56789",
+                                ~~~  
+   { dg-end-multiline-output "" } */
+  /* FIXME: why does the above need two trailing spaces?  */
+}
+
+/* Tests of various unicode encodings.
+
+   Digits 0 through 9 are unicode code points:
+      U+0030 DIGIT ZERO
+      ...
+      U+0039 DIGIT NINE
+   However, these are not always valid as UCN (see the comment in
+   libcpp/charset.c:_cpp_valid_ucn).
+
+   Hence we need to test UCN using an alternative unicode
+   representation of numbers; let's use Roman numerals,
+   (though these start at one, not zero):
+      U+2170 SMALL ROMAN NUMERAL ONE
+      ...
+      U+2174 SMALL ROMAN NUMERAL FIVE  ("v")
+      U+2175 SMALL ROMAN NUMERAL SIX   ("vi")
+      ...
+      U+2178 SMALL ROMAN NUMERAL NINE.  */
+
+void
+test_hex (void)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as "\x35"
+     and with a space in place of digit 6, to terminate the escaped
+     hex code.  */
+  __emit_string_literal_range ("01234\x35 789", /* { dg-warning "range" } */
+			       3, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234\x35 789"
+                                    ^~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_oct (void)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as "\065"
+     and with a space in place of digit 6, to terminate the escaped
+     octal code.  */
+  __emit_string_literal_range ("01234\065 789", /* { dg-warning "range" } */
+			       3, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234\065 789"
+                                    ^~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_multiple (void)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as hex "\x35" and
+     digit 6 in ASCII as octal "\066", concatenating multiple strings.  */
+  __emit_string_literal_range ("01234"  "\x35"  "\066"  "789", /* { dg-warning "range" } */
+			       3, 8);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234"  "\x35"  "\066"  "789",
+                                    ^~~~~~~~~~~~~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_ucn4 (void)
+{
+  /* Digits 0-9, expressing digits 5 and 6 as Roman numerals expressed
+     as UCN 4.
+     The resulting string is encoded as UTF-8.  Most of the digits are 1 byte
+     each, but digits 5 and 6 are encoded with 3 bytes each.
+     Hence to underline digits 4-7 we need to underline bytes 4-11 in
+     the UTF-8 encoding.  */
+  __emit_string_literal_range ("01234\u2174\u2175789", /* { dg-warning "range" } */
+			       4, 11);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234\u2174\u2175789",
+                                     ^~~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_ucn8 (void)
+{
+  /* Digits 0-9, expressing digits 5 and 6 as Roman numerals as UCN 8.
+     The resulting string is the same as in test_ucn4 above, and hence
+     has the same UTF-8 encoding, and so we again need to underline bytes
+     4-11 in the UTF-8 encoding in order to underline digits 4-7.  */
+  __emit_string_literal_range ("01234\U00002174\U00002175789", /* { dg-warning "range" } */
+			       4, 11);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234\U00002174\U00002175789",
+                                     ^~~~~~~~~~~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_u8 (void)
+{
+  /* Digits 0-9.  */
+  __emit_string_literal_range (u8"0123456789", /* { dg-warning "range" } */
+			       4, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range (u8"0123456789",
+                                       ^~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_u (void)
+{
+  /* Digits 0-9.  */
+  __emit_string_literal_range (u"0123456789", /* { dg-error "unable to read substring range: execution character set != source character set" } */
+			       4, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range (u"0123456789",
+                                ^~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_U (void)
+{
+  /* Digits 0-9.  */
+  __emit_string_literal_range (U"0123456789", /* { dg-error "unable to read substring range: execution character set != source character set" } */
+			       4, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range (U"0123456789",
+                                ^~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_L (void)
+{
+  /* Digits 0-9.  */
+  __emit_string_literal_range (L"0123456789", /* { dg-error "unable to read substring range: execution character set != source character set" } */
+			       4, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range (L"0123456789",
+                                ^~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_macro (void)
+{
+#define START "01234"  /* { dg-warning "range" } */
+  __emit_string_literal_range (START
+                               "56789",
+                               3, 6);
+/* { dg-begin-multiline-output "" }
+ #define START "01234"
+                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+   __emit_string_literal_range (START
+   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+                                "56789",
+                                ~~~
+   { dg-end-multiline-output "" } */
+}
diff --git a/gcc/testsuite/gcc.dg/plugin/diagnostic_plugin_test_string_literals.c b/gcc/testsuite/gcc.dg/plugin/diagnostic_plugin_test_string_literals.c
new file mode 100644
index 0000000..d44612a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/plugin/diagnostic_plugin_test_string_literals.c
@@ -0,0 +1,212 @@
+/* This plugin uses the diagnostics code to verify tracking of source code
+   locations within string literals.  */
+/* { dg-options "-O" } */
+
+#include "gcc-plugin.h"
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "tree.h"
+#include "stringpool.h"
+#include "toplev.h"
+#include "basic-block.h"
+#include "hash-table.h"
+#include "vec.h"
+#include "ggc.h"
+#include "basic-block.h"
+#include "tree-ssa-alias.h"
+#include "internal-fn.h"
+#include "gimple-fold.h"
+#include "tree-eh.h"
+#include "gimple-expr.h"
+#include "is-a.h"
+#include "gimple.h"
+#include "gimple-iterator.h"
+#include "tree.h"
+#include "tree-pass.h"
+#include "intl.h"
+#include "plugin-version.h"
+#include "c-family/c-common.h"
+#include "diagnostic.h"
+#include "context.h"
+#include "print-tree.h"
+#include "cpplib.h"
+#include "c-family/c-pragma.h"
+
+int plugin_is_GPL_compatible;
+
+/* A custom pass for printing string literal location information.  */
+
+const pass_data pass_data_test_string_literals =
+{
+  GIMPLE_PASS, /* type */
+  "test_string_literals", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  PROP_ssa, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_test_string_literals : public gimple_opt_pass
+{
+public:
+  pass_test_string_literals(gcc::context *ctxt)
+    : gimple_opt_pass(pass_data_test_string_literals, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  bool gate (function *) { return true; }
+  virtual unsigned int execute (function *);
+
+}; // class pass_test_string_literals
+
+/* Determine if STMT is a call with NUM_ARGS arguments to a function
+   named FUNCNAME.
+   If so, return STMT as a gcall *.  Otherwise return NULL.  */
+
+static gcall *
+check_for_named_call (gimple *stmt,
+		      const char *funcname, unsigned int num_args)
+{
+  gcc_assert (funcname);
+
+  gcall *call = dyn_cast <gcall *> (stmt);
+  if (!call)
+    return NULL;
+
+  tree fndecl = gimple_call_fndecl (call);
+  if (!fndecl)
+    return NULL;
+
+  if (strcmp (IDENTIFIER_POINTER (DECL_NAME (fndecl)), funcname))
+    return NULL;
+
+  if (gimple_call_num_args (call) != num_args)
+    {
+      error_at (stmt->location, "expected number of args: %u (got %u)",
+		num_args, gimple_call_num_args (call));
+      return NULL;
+    }
+
+  return call;
+}
+
+/* Emit a warning covering SRC_RANGE, with the caret at the start of
+   SRC_RANGE.  */
+
+static void
+emit_warning (source_range src_range)
+{
+  location_t loc
+    = make_location (src_range.m_start, src_range.m_start, src_range.m_finish);
+  warning_at (loc, 0, "range %i:%i-%i:%i",
+	      LOCATION_LINE (src_range.m_start),
+	      LOCATION_COLUMN (src_range.m_start),
+	      LOCATION_LINE (src_range.m_finish),
+	      LOCATION_COLUMN (src_range.m_finish));
+}
+
+/* Support code for verifying that we are correctly tracking ranges
+   within string literals, for use by diagnostic-test-string-literals-*.c.
+   Emit a warning showing the range of a string literal, for each call to
+   a function named "__emit_string_literal_range".
+   The initial argument should be a string literal; arguments 2 and 3
+   should be integer constants, giving the range within the string
+   to be printed.  */
+
+static void
+test_string_literals (gimple *stmt)
+{
+  gcall *call = check_for_named_call (stmt, "__emit_string_literal_range", 3);
+  if (!call)
+    return;
+
+  /* We expect an ADDR_EXPR with a STRING_CST inside it for the
+     initial arg.  */
+  tree t_addr_string = gimple_call_arg (call, 0);
+  if (TREE_CODE (t_addr_string) != ADDR_EXPR)
+    {
+      error_at (call->location, "string literal required for arg 1");
+      return;
+    }
+
+  tree t_string = TREE_OPERAND (t_addr_string, 0);
+  if (TREE_CODE (t_string) != STRING_CST)
+    {
+      error_at (call->location, "string literal required for arg 1");
+      return;
+    }
+
+  tree t_start_idx = gimple_call_arg (call, 1);
+  if (TREE_CODE (t_start_idx) != INTEGER_CST)
+    {
+      error_at (call->location, "integer constant required for arg 2");
+      return;
+    }
+  int start_idx = TREE_INT_CST_LOW (t_start_idx);
+
+  tree t_end_idx = gimple_call_arg (call, 2);
+  if (TREE_CODE (t_end_idx) != INTEGER_CST)
+    {
+      error_at (call->location, "integer constant required for arg 3");
+      return;
+    }
+  int end_idx = TREE_INT_CST_LOW (t_end_idx);
+
+  /* A STRING_CST doesn't have a location, but the ADDR_EXPR does.  */
+  location_t strloc = EXPR_LOCATION (t_addr_string);
+  source_range src_range;
+  substring_loc substr_loc (strloc, TREE_TYPE (t_string),
+			    start_idx, end_idx);
+  const char *err = substr_loc.get_range (&src_range);
+  if (err)
+    error_at (strloc, "unable to read substring range: %s", err);
+  else
+    emit_warning (src_range);
+}
+
+/* Call test_string_literals on every statement within FUN.  */
+
+unsigned int
+pass_test_string_literals::execute (function *fun)
+{
+  gimple_stmt_iterator gsi;
+  basic_block bb;
+
+  FOR_EACH_BB_FN (bb, fun)
+    for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+      {
+	gimple *stmt = gsi_stmt (gsi);
+	test_string_literals (stmt);
+      }
+
+  return 0;
+}
+
+/* Entrypoint for the plugin.  Create and register the custom pass.  */
+
+int
+plugin_init (struct plugin_name_args *plugin_info,
+	     struct plugin_gcc_version *version)
+{
+  struct register_pass_info pass_info;
+  const char *plugin_name = plugin_info->base_name;
+  int argc = plugin_info->argc;
+  struct plugin_argument *argv = plugin_info->argv;
+
+  if (!plugin_default_version_check (version, &gcc_version))
+    return 1;
+
+  pass_info.pass = new pass_test_string_literals (g);
+  pass_info.reference_pass_name = "ssa";
+  pass_info.ref_pass_instance_number = 1;
+  pass_info.pos_op = PASS_POS_INSERT_AFTER;
+  register_callback (plugin_name, PLUGIN_PASS_MANAGER_SETUP, NULL,
+		     &pass_info);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/plugin/plugin.exp b/gcc/testsuite/gcc.dg/plugin/plugin.exp
index faebb75..f22d8a7 100644
--- a/gcc/testsuite/gcc.dg/plugin/plugin.exp
+++ b/gcc/testsuite/gcc.dg/plugin/plugin.exp
@@ -70,6 +70,8 @@ set plugin_test_list [list \
 	  diagnostic-test-expressions-1.c } \
     { diagnostic_plugin_show_trees.c \
 	  diagnostic-test-show-trees-1.c } \
+    { diagnostic_plugin_test_string_literals.c \
+	  diagnostic-test-string-literals-1.c } \
     { location_overflow_plugin.c \
 	  location-overflow-test-1.c \
 	  location-overflow-test-2.c } \
diff --git a/libcpp/charset.c b/libcpp/charset.c
index 2d07942..d0744e3 100644
--- a/libcpp/charset.c
+++ b/libcpp/charset.c
@@ -812,6 +812,51 @@ cpp_host_to_exec_charset (cpp_reader *pfile, cppchar_t c)
 
 \f
 
+/* cpp_substring_ranges's constructor.  */
+
+cpp_substring_ranges::cpp_substring_ranges () :
+  m_ranges (NULL),
+  m_num_ranges (0),
+  m_alloc_ranges (8)
+{
+  m_ranges = XNEWVEC (source_range, m_alloc_ranges);
+}
+
+/* cpp_substring_ranges's destructor.  */
+
+cpp_substring_ranges::~cpp_substring_ranges ()
+{
+  free (m_ranges);
+}
+
+/* Add RANGE to the vector of source_range information.  */
+
+void
+cpp_substring_ranges::add_range (source_range range)
+{
+  if (m_num_ranges >= m_alloc_ranges)
+    {
+      m_alloc_ranges *= 2;
+      m_ranges
+	= (source_range *)xrealloc (m_ranges,
+				    sizeof (source_range) * m_alloc_ranges);
+    }
+  m_ranges[m_num_ranges++] = range;
+}
+
+/* Read NUM ranges from LOC_READER, adding them to the vector of source_range
+   information.  */
+
+void
+cpp_substring_ranges::add_n_ranges (int num,
+				    cpp_string_location_reader &loc_reader)
+{
+  for (int i = 0; i < num; i++)
+    add_range (loc_reader.get_next ());
+}
+
+\f
+
 /* Utility routine that computes a mask of the form 0000...111... with
    WIDTH 1-bits.  */
 static inline size_t
@@ -980,18 +1025,27 @@ ucn_valid_in_identifier (cpp_reader *pfile, cppchar_t c,
    one beyond the UCN, or to the syntactically invalid character.
 
    IDENTIFIER_POS is 0 when not in an identifier, 1 for the start of
-   an identifier, or 2 otherwise.  */
+   an identifier, or 2 otherwise.
+
+   If CHAR_RANGE and LOC_READER are non-NULL, then position information is
+   read from *LOC_READER and CHAR_RANGE->m_finish is updated accordingly.  */
 
 bool
 _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
 		const uchar *limit, int identifier_pos,
-		struct normalize_state *nst, cppchar_t *cp)
+		struct normalize_state *nst, cppchar_t *cp,
+		source_range *char_range,
+		cpp_string_location_reader *loc_reader)
 {
   cppchar_t result, c;
   unsigned int length;
   const uchar *str = *pstr;
   const uchar *base = str - 2;
 
+  /* char_range and loc_reader must either be both NULL, or both be
+     non-NULL.  */
+  gcc_assert ((char_range != NULL) == (loc_reader != NULL));
+
   if (!CPP_OPTION (pfile, cplusplus) && !CPP_OPTION (pfile, c99))
     cpp_error (pfile, CPP_DL_WARNING,
 	       "universal character names are only valid in C++ and C99");
@@ -1021,6 +1075,8 @@ _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
       if (!ISXDIGIT (c))
 	break;
       str++;
+      if (loc_reader)
+	char_range->m_finish = loc_reader->get_next ().m_finish;
       result = (result << 4) + hex_value (c);
     }
   while (--length && str < limit);
@@ -1086,11 +1142,18 @@ _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
 }
 
 /* Convert an UCN, pointed to by FROM, to UTF-8 encoding, then translate
-   it to the execution character set and write the result into TBUF.
-   An advanced pointer is returned.  Issues all relevant diagnostics.  */
+   it to the execution character set and write the result into TBUF,
+   if TBUF is non-NULL.
+   An advanced pointer is returned.  Issues all relevant diagnostics.
+   If LOC_READER is non-NULL, then RANGES must be non-NULL and CHAR_RANGE
+   contains the location of the character so far: location information
+   is read from *LOC_READER, and *RANGES is updated accordingly.  */
 static const uchar *
 convert_ucn (cpp_reader *pfile, const uchar *from, const uchar *limit,
-	     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
+	     struct _cpp_strbuf *tbuf, struct cset_converter cvt,
+	     source_range char_range,
+	     cpp_string_location_reader *loc_reader,
+	     cpp_substring_ranges *ranges)
 {
   cppchar_t ucn;
   uchar buf[6];
@@ -1099,8 +1162,17 @@ convert_ucn (cpp_reader *pfile, const uchar *from, const uchar *limit,
   int rval;
   struct normalize_state nst = INITIAL_NORMALIZE_STATE;
 
+  /* loc_reader and ranges must either be both NULL, or both be non-NULL.  */
+  gcc_assert ((loc_reader != NULL) == (ranges != NULL));
+
   from++;  /* Skip u/U.  */
-  _cpp_valid_ucn (pfile, &from, limit, 0, &nst, &ucn);
+
+  if (loc_reader)
+    /* The u/U is part of the spelling of this character.  */
+    char_range.m_finish = loc_reader->get_next ().m_finish;
+
+  _cpp_valid_ucn (pfile, &from, limit, 0, &nst,
+		  &ucn, &char_range, loc_reader);
 
   rval = one_cppchar_to_utf8 (ucn, &bufp, &bytesleft);
   if (rval)
@@ -1109,9 +1181,20 @@ convert_ucn (cpp_reader *pfile, const uchar *from, const uchar *limit,
       cpp_errno (pfile, CPP_DL_ERROR,
 		 "converting UCN to source character set");
     }
-  else if (!APPLY_CONVERSION (cvt, buf, 6 - bytesleft, tbuf))
-    cpp_errno (pfile, CPP_DL_ERROR,
-	       "converting UCN to execution character set");
+  else
+    {
+      if (tbuf)
+	if (!APPLY_CONVERSION (cvt, buf, 6 - bytesleft, tbuf))
+	  cpp_errno (pfile, CPP_DL_ERROR,
+		     "converting UCN to execution character set");
+
+      if (loc_reader)
+	{
+	  int num_encoded_bytes = 6 - bytesleft;
+	  for (int i = 0; i < num_encoded_bytes; i++)
+	    ranges->add_range (char_range);
+	}
+    }
 
   return from;
 }
@@ -1167,31 +1250,48 @@ emit_numeric_escape (cpp_reader *pfile, cppchar_t n,
 }
 
 /* Convert a hexadecimal escape, pointed to by FROM, to the execution
-   character set and write it into the string buffer TBUF.  Returns an
-   advanced pointer, and issues diagnostics as necessary.
+   character set and write it into the string buffer TBUF (if non-NULL).
+   Returns an advanced pointer, and issues diagnostics as necessary.
    No character set translation occurs; this routine always produces the
    execution-set character with numeric value equal to the given hex
-   number.  You can, e.g. generate surrogate pairs this way.  */
+   number.  You can, e.g. generate surrogate pairs this way.
+   If LOC_READER is non-NULL, then RANGES must be non-NULL and CHAR_RANGE
+   contains the location of the character so far: location information
+   is read from *LOC_READER, and *RANGES is updated accordingly.  */
 static const uchar *
 convert_hex (cpp_reader *pfile, const uchar *from, const uchar *limit,
-	     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
+	     struct _cpp_strbuf *tbuf, struct cset_converter cvt,
+	     source_range char_range,
+	     cpp_string_location_reader *loc_reader,
+	     cpp_substring_ranges *ranges)
 {
   cppchar_t c, n = 0, overflow = 0;
   int digits_found = 0;
   size_t width = cvt.width;
   size_t mask = width_to_mask (width);
 
+  /* loc_reader and ranges must either be both NULL, or both be non-NULL.  */
+  gcc_assert ((loc_reader != NULL) == (ranges != NULL));
+
   if (CPP_WTRADITIONAL (pfile))
     cpp_warning (pfile, CPP_W_TRADITIONAL,
 	         "the meaning of '\\x' is different in traditional C");
 
-  from++;  /* Skip 'x'.  */
+  /* Skip 'x'.  */
+  from++;
+
+  /* The 'x' is part of the spelling of this character.  */
+  if (loc_reader)
+    char_range.m_finish = loc_reader->get_next ().m_finish;
+
   while (from < limit)
     {
       c = *from;
       if (! hex_p (c))
 	break;
       from++;
+      if (loc_reader)
+	char_range.m_finish = loc_reader->get_next ().m_finish;
       overflow |= n ^ (n << 4 >> 4);
       n = (n << 4) + hex_value (c);
       digits_found = 1;
@@ -1211,7 +1311,10 @@ convert_hex (cpp_reader *pfile, const uchar *from, const uchar *limit,
       n &= mask;
     }
 
-  emit_numeric_escape (pfile, n, tbuf, cvt);
+  if (tbuf)
+    emit_numeric_escape (pfile, n, tbuf, cvt);
+  if (ranges)
+    ranges->add_range (char_range);
 
   return from;
 }
@@ -1221,10 +1324,16 @@ convert_hex (cpp_reader *pfile, const uchar *from, const uchar *limit,
    advanced pointer, and issues diagnostics as necessary.
    No character set translation occurs; this routine always produces the
    execution-set character with numeric value equal to the given octal
-   number.  */
+   number.
+   If LOC_READER is non-NULL, then RANGES must be non-NULL and CHAR_RANGE
+   contains the location of the character so far: location information
+   is read from *LOC_READER, and *RANGES is updated accordingly.  */
 static const uchar *
 convert_oct (cpp_reader *pfile, const uchar *from, const uchar *limit,
-	     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
+	     struct _cpp_strbuf *tbuf, struct cset_converter cvt,
+	     source_range char_range,
+	     cpp_string_location_reader *loc_reader,
+	     cpp_substring_ranges *ranges)
 {
   size_t count = 0;
   cppchar_t c, n = 0;
@@ -1232,12 +1341,17 @@ convert_oct (cpp_reader *pfile, const uchar *from, const uchar *limit,
   size_t mask = width_to_mask (width);
   bool overflow = false;
 
+  /* loc_reader and ranges must either be both NULL, or both be non-NULL.  */
+  gcc_assert ((loc_reader != NULL) == (ranges != NULL));
+
   while (from < limit && count++ < 3)
     {
       c = *from;
       if (c < '0' || c > '7')
 	break;
       from++;
+      if (loc_reader)
+	char_range.m_finish = loc_reader->get_next ().m_finish;
       overflow |= n ^ (n << 3 >> 3);
       n = (n << 3) + c - '0';
     }
@@ -1249,18 +1363,26 @@ convert_oct (cpp_reader *pfile, const uchar *from, const uchar *limit,
       n &= mask;
     }
 
-  emit_numeric_escape (pfile, n, tbuf, cvt);
+  if (tbuf)
+    emit_numeric_escape (pfile, n, tbuf, cvt);
+  if (ranges)
+    ranges->add_range (char_range);
 
   return from;
 }
 
 /* Convert an escape sequence (pointed to by FROM) to its value on
    the target, and to the execution character set.  Do not scan past
-   LIMIT.  Write the converted value into TBUF.  Returns an advanced
-   pointer.  Handles all relevant diagnostics.  */
+   LIMIT.  Write the converted value into TBUF, if TBUF is non-NULL.
+   Returns an advanced pointer.  Handles all relevant diagnostics.
+   If LOC_READER is non-NULL, then RANGES must be non-NULL: location
+   information is read from *LOC_READER, and *RANGES is updated
+   accordingly.  */
 static const uchar *
 convert_escape (cpp_reader *pfile, const uchar *from, const uchar *limit,
-		struct _cpp_strbuf *tbuf, struct cset_converter cvt)
+		struct _cpp_strbuf *tbuf, struct cset_converter cvt,
+		cpp_string_location_reader *loc_reader,
+		cpp_substring_ranges *ranges)
 {
   /* Values of \a \b \e \f \n \r \t \v respectively.  */
 #if HOST_CHARSET == HOST_CHARSET_ASCII
@@ -1273,20 +1395,28 @@ convert_escape (cpp_reader *pfile, const uchar *from, const uchar *limit,
 
   uchar c;
 
+  /* Record the location of the backslash.  */
+  source_range char_range;
+  if (loc_reader)
+    char_range = loc_reader->get_next ();
+
   c = *from;
   switch (c)
     {
       /* UCNs, hex escapes, and octal escapes are processed separately.  */
     case 'u': case 'U':
-      return convert_ucn (pfile, from, limit, tbuf, cvt);
+      return convert_ucn (pfile, from, limit, tbuf, cvt,
+			  char_range, loc_reader, ranges);
 
     case 'x':
-      return convert_hex (pfile, from, limit, tbuf, cvt);
+      return convert_hex (pfile, from, limit, tbuf, cvt,
+			  char_range, loc_reader, ranges);
       break;
 
     case '0':  case '1':  case '2':  case '3':
     case '4':  case '5':  case '6':  case '7':
-      return convert_oct (pfile, from, limit, tbuf, cvt);
+      return convert_oct (pfile, from, limit, tbuf, cvt,
+			  char_range, loc_reader, ranges);
 
       /* Various letter escapes.  Get the appropriate host-charset
 	 value into C.  */
@@ -1338,10 +1468,17 @@ convert_escape (cpp_reader *pfile, const uchar *from, const uchar *limit,
 	}
     }
 
-  /* Now convert what we have to the execution character set.  */
-  if (!APPLY_CONVERSION (cvt, &c, 1, tbuf))
-    cpp_errno (pfile, CPP_DL_ERROR,
-	       "converting escape sequence to execution character set");
+  if (tbuf)
+    /* Now convert what we have to the execution character set.  */
+    if (!APPLY_CONVERSION (cvt, &c, 1, tbuf))
+      cpp_errno (pfile, CPP_DL_ERROR,
+		 "converting escape sequence to execution character set");
+
+  if (loc_reader)
+    {
+      char_range.m_finish = loc_reader->get_next ().m_finish;
+      ranges->add_range (char_range);
+    }
 
   return from + 1;
 }
@@ -1374,28 +1511,52 @@ converter_for_type (cpp_reader *pfile, enum cpp_ttype type)
    are to be converted from the source to the execution character set,
    escape sequences translated, and finally all are to be
    concatenated.  WIDE indicates whether or not to produce a wide
-   string.  The result is written into TO.  Returns true for success,
-   false for failure.  */
-bool
-cpp_interpret_string (cpp_reader *pfile, const cpp_string *from, size_t count,
-		      cpp_string *to,  enum cpp_ttype type)
+   string.  If TO is non-NULL, the result is written into TO.
+   If LOC_READERS and OUT are non-NULL, then location information
+   is read from LOC_READERS (which must be an array of length COUNT),
+   and location information is written to *OUT.
+
+   Returns true for success, false for failure.  */
+
+static bool
+cpp_interpret_string_1 (cpp_reader *pfile, const cpp_string *from, size_t count,
+			cpp_string *to,  enum cpp_ttype type,
+			cpp_string_location_reader *loc_readers,
+			cpp_substring_ranges *out)
 {
   struct _cpp_strbuf tbuf;
   const uchar *p, *base, *limit;
   size_t i;
   struct cset_converter cvt = converter_for_type (pfile, type);
 
-  tbuf.asize = MAX (OUTBUF_BLOCK_SIZE, from->len);
-  tbuf.text = XNEWVEC (uchar, tbuf.asize);
-  tbuf.len = 0;
+  /* loc_readers and out must either be both NULL, or both be non-NULL.  */
+  gcc_assert ((loc_readers != NULL) == (out != NULL));
+
+  if (to)
+    {
+      tbuf.asize = MAX (OUTBUF_BLOCK_SIZE, from->len);
+      tbuf.text = XNEWVEC (uchar, tbuf.asize);
+      tbuf.len = 0;
+    }
 
   for (i = 0; i < count; i++)
     {
+      cpp_string_location_reader *loc_reader = NULL;
+      if (loc_readers)
+	loc_reader = &loc_readers[i];
+
       p = from[i].text;
       if (*p == 'u')
 	{
-	  if (*++p == '8')
-	    p++;
+	  p++;
+	  if (loc_reader)
+	    loc_reader->get_next ();
+	  if (*p == '8')
+	    {
+	      p++;
+	      if (loc_reader)
+		loc_reader->get_next ();
+	    }
 	}
       else if (*p == 'L' || *p == 'U') p++;
       if (*p == 'R')
@@ -1414,13 +1575,31 @@ cpp_interpret_string (cpp_reader *pfile, const cpp_string *from, size_t count,
 
 	  /* Raw strings are all normal characters; these can be fed
 	     directly to convert_cset.  */
-	  if (!APPLY_CONVERSION (cvt, p, limit - p, &tbuf))
-	    goto fail;
+	  if (to)
+	    if (!APPLY_CONVERSION (cvt, p, limit - p, &tbuf))
+	      goto fail;
+
+	  if (loc_reader)
+	    {
+	      /* If generating source ranges, assume we have a 1:1
+		 correspondence between bytes in the source encoding and bytes
+		 in the execution encoding (e.g. if we have a UTF-8 to UTF-8
+		 conversion), so that this run of bytes in the source file
+		 corresponds to a run of bytes in the execution string.
+		 This requirement is guaranteed by an early-reject in
+		 cpp_interpret_string_ranges.  */
+	      gcc_assert (cvt.func == convert_no_conversion);
+	      out->add_n_ranges (limit - p, *loc_reader);
+	    }
 
 	  continue;
 	}
 
-      p++; /* Skip leading quote.  */
+      /* Skip leading quote.  */
+      p++;
+      if (loc_reader)
+	loc_reader->get_next ();
+
       limit = from[i].text + from[i].len - 1; /* Skip trailing quote.  */
 
       for (;;)
@@ -1432,29 +1611,97 @@ cpp_interpret_string (cpp_reader *pfile, const cpp_string *from, size_t count,
 	    {
 	      /* We have a run of normal characters; these can be fed
 		 directly to convert_cset.  */
-	      if (!APPLY_CONVERSION (cvt, base, p - base, &tbuf))
-		goto fail;
+	      if (to)
+		if (!APPLY_CONVERSION (cvt, base, p - base, &tbuf))
+		  goto fail;
+	      /* Similar to above: assumes we have a 1:1 correspondence
+		 between bytes in the source encoding and bytes in the
+		 execution encoding.  */
+	      if (loc_reader)
+		{
+		  gcc_assert (cvt.func == convert_no_conversion);
+		  out->add_n_ranges (p - base, *loc_reader);
+		}
 	    }
-	  if (p == limit)
+	  if (p >= limit)
 	    break;
 
-	  p = convert_escape (pfile, p + 1, limit, &tbuf, cvt);
+	  struct _cpp_strbuf *tbuf_ptr = to ? &tbuf : NULL;
+	  p = convert_escape (pfile, p + 1, limit, tbuf_ptr, cvt,
+			      loc_reader, out);
 	}
     }
-  /* NUL-terminate the 'to' buffer and translate it to a cpp_string
-     structure.  */
-  emit_numeric_escape (pfile, 0, &tbuf, cvt);
-  tbuf.text = XRESIZEVEC (uchar, tbuf.text, tbuf.len);
-  to->text = tbuf.text;
-  to->len = tbuf.len;
+
+  if (to)
+    {
+      /* NUL-terminate the 'to' buffer and translate it to a cpp_string
+	 structure.  */
+      emit_numeric_escape (pfile, 0, &tbuf, cvt);
+      tbuf.text = XRESIZEVEC (uchar, tbuf.text, tbuf.len);
+      to->text = tbuf.text;
+      to->len = tbuf.len;
+    }
+
   return true;
 
  fail:
   cpp_errno (pfile, CPP_DL_ERROR, "converting to execution character set");
-  free (tbuf.text);
+  if (to)
+    free (tbuf.text);
   return false;
 }
 
+/* FROM is an array of cpp_string structures of length COUNT.  These
+   are to be converted from the source to the execution character set,
+   escape sequences translated, and finally all are to be
+   concatenated.  WIDE indicates whether or not to produce a wide
+   string.  The result is written into TO.  Returns true for success,
+   false for failure.  */
+bool
+cpp_interpret_string (cpp_reader *pfile, const cpp_string *from, size_t count,
+		      cpp_string *to,  enum cpp_ttype type)
+{
+  return cpp_interpret_string_1 (pfile, from, count, to, type, NULL, NULL);
+}
+
+/* This function mimics the behavior of cpp_interpret_string, but
+   rather than generating a string in the execution character set,
+   *OUT is written to with the source code ranges of the characters
+   in such a string.
+   FROM and LOC_READERS should both be arrays of length COUNT.
+   Returns NULL for success, or an error message for failure.  */
+
+const char *
+cpp_interpret_string_ranges (cpp_reader *pfile, const cpp_string *from,
+			     cpp_string_location_reader *loc_readers,
+			     size_t count,
+			     cpp_substring_ranges *out,
+			     enum cpp_ttype type)
+{
+  /* There are a couple of cases in the range-handling in
+     cpp_interpret_string_1 that rely on there being a 1:1 correspondence
+     between bytes in the source encoding and bytes in the execution
+     encoding, so that each byte in the execution string can correspond
+     to the location of a byte in the source string.
+
+     This holds for the typical case of a UTF-8 to UTF-8 conversion.
+     Enforce this requirement by only attempting to track substring
+     locations if we have source encoding == execution encoding.
+
+     This is a stronger condition than we need, since we could e.g.
+     have ASCII to EBCDIC (with 1 byte per character before and after),
+     but it seems to be a reasonable restriction.  */
+  struct cset_converter cvt = converter_for_type (pfile, type);
+  if (cvt.func != convert_no_conversion)
+    return "execution character set != source character set";
+
+  if (cpp_interpret_string_1 (pfile, from, count, NULL, type,
+			      loc_readers, out))
+    return NULL;
+  else
+    return "cpp_interpret_string_1 failed";
+}
+
 /* Subroutine of do_line and do_linemarker.  Convert escape sequences
    in a string, but do not perform character set conversion.  */
 bool
@@ -1818,3 +2065,39 @@ _cpp_default_encoding (void)
 
   return current_encoding;
 }
+
+/* Implementation of class cpp_string_location_reader.  */
+
+/* Constructor for cpp_string_location_reader.  */
+
+cpp_string_location_reader::
+cpp_string_location_reader (source_location src_loc,
+			    line_maps *line_table)
+: m_line_table (line_table)
+{
+  src_loc = get_range_from_loc (line_table, src_loc).m_start;
+
+  /* SRC_LOC might be a macro location.  It only makes sense to do
+     column-by-column calculations on ordinary maps, so get the
+     corresponding location in an ordinary map.  */
+  m_loc
+    = linemap_resolve_location (line_table, src_loc,
+				LRK_SPELLING_LOCATION, NULL);
+
+  const line_map_ordinary *map
+    = linemap_check_ordinary (linemap_lookup (line_table, m_loc));
+  m_offset_per_column = (1 << map->m_range_bits);
+}
+
+/* Get the range of the next source byte.  */
+
+source_range
+cpp_string_location_reader::get_next ()
+{
+  source_range result;
+  result.m_start = m_loc;
+  result.m_finish = m_loc;
+  if (m_loc <= LINE_MAP_MAX_LOCATION_WITH_COLS)
+    m_loc += m_offset_per_column;
+  return result;
+}
diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h
index 543f3b9..c24e62e 100644
--- a/libcpp/include/cpplib.h
+++ b/libcpp/include/cpplib.h
@@ -743,6 +743,51 @@ struct GTY(()) cpp_hashnode {
   union _cpp_hashnode_value GTY ((desc ("CPP_HASHNODE_VALUE_IDX (%1)"))) value;
 };
 
+/* A class for iterating through the source locations within a
+   string token (before escapes are interpreted, and before
+   concatenation).  */
+
+class cpp_string_location_reader {
+ public:
+  cpp_string_location_reader (source_location src_loc,
+			      line_maps *line_table);
+
+  source_range get_next ();
+
+ private:
+  source_location m_loc;
+  int m_offset_per_column;
+  line_maps *m_line_table;
+};
+
+/* A class for storing the source ranges of all of the characters within
+   a string literal, after escapes are interpreted, and after
+   concatenation.
+
+   This is not GTY-marked, as instances are intended to be temporary.  */
+
+class cpp_substring_ranges
+{
+ public:
+  cpp_substring_ranges ();
+  ~cpp_substring_ranges ();
+
+  int get_num_ranges () const { return m_num_ranges; }
+  source_range get_range (int idx) const
+  {
+    linemap_assert (idx < m_num_ranges);
+    return m_ranges[idx];
+  }
+
+  void add_range (source_range range);
+  void add_n_ranges (int num, cpp_string_location_reader &loc_reader);
+
+ private:
+  source_range *m_ranges;
+  int m_num_ranges;
+  int m_alloc_ranges;
+};
+
 /* Call this first to get a handle to pass to other functions.
 
    If you want cpplib to manage its own hashtable, pass in a NULL
@@ -829,6 +874,12 @@ extern cppchar_t cpp_interpret_charconst (cpp_reader *, const cpp_token *,
 extern bool cpp_interpret_string (cpp_reader *,
 				  const cpp_string *, size_t,
 				  cpp_string *, enum cpp_ttype);
+extern const char *cpp_interpret_string_ranges (cpp_reader *pfile,
+						const cpp_string *from,
+						cpp_string_location_reader *,
+						size_t count,
+						cpp_substring_ranges *out,
+						enum cpp_ttype type);
 extern bool cpp_interpret_string_notranslate (cpp_reader *,
 					      const cpp_string *, size_t,
 					      cpp_string *, enum cpp_ttype);
diff --git a/libcpp/internal.h b/libcpp/internal.h
index ca2b498..4a5cd3c 100644
--- a/libcpp/internal.h
+++ b/libcpp/internal.h
@@ -754,7 +754,9 @@ struct normalize_state
 extern bool _cpp_valid_ucn (cpp_reader *, const unsigned char **,
 			    const unsigned char *, int,
 			    struct normalize_state *state,
-			    cppchar_t *);
+			    cppchar_t *,
+			    source_range *char_range,
+			    cpp_string_location_reader *loc_reader);
 extern void _cpp_destroy_iconv (cpp_reader *);
 extern unsigned char *_cpp_convert_input (cpp_reader *, const char *,
 					  unsigned char *, size_t, size_t,
diff --git a/libcpp/lex.c b/libcpp/lex.c
index 236418d..4e71965 100644
--- a/libcpp/lex.c
+++ b/libcpp/lex.c
@@ -1247,7 +1247,7 @@ forms_identifier_p (cpp_reader *pfile, int first,
       cppchar_t s;
       buffer->cur += 2;
       if (_cpp_valid_ucn (pfile, &buffer->cur, buffer->rlimit, 1 + !first,
-			  state, &s))
+			  state, &s, NULL, NULL))
 	return true;
       buffer->cur -= 2;
     }
-- 
1.8.5.3

* [PATCH 2/3] Use class substring_loc in c-format.c (PR c/52952)
  2016-07-26 16:43     ` [PATCH 1/3] (v2) " David Malcolm
  2016-07-26 16:43       ` [PATCH 3/3] c-format.c: suggest the correct format string to use (PR c/64955) David Malcolm
@ 2016-07-26 16:43       ` David Malcolm
  2016-07-26 18:06       ` [PATCH 1/3] (v2) On-demand locations within string-literals Manuel López-Ibáñez
  2016-07-29 21:42       ` Joseph Myers
  3 siblings, 0 replies; 61+ messages in thread
From: David Malcolm @ 2016-07-26 16:43 UTC (permalink / raw)
  To: gcc-patches; +Cc: David Malcolm

This patch updates c-format.c to use the new class substring_loc, added
in the previous patch, replacing location_column_from_byte_offset.
Hence with this patch, -Wformat can underline the precise erroneous
part of the format string in many more cases.
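
As a rough illustration of the on-demand idea behind substring_loc, the
sketch below re-lexes the spelling of a literal and records, for each byte
of the interpreted string, the source columns that produced it.  It is not
the libcpp implementation: col_range and ranges_for_literal are
hypothetical names, columns are plain integers rather than location_t
values, and only simple, hex and octal escapes are handled (no UCNs,
prefixes, raw strings or concatenation).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a source range: the first and last column
// (inclusive) of the source bytes that spell one interpreted byte.
struct col_range
{
  int start;
  int finish;
};

// Re-lex SPELLING, the source text of a narrow, non-raw string literal
// including both quotes, whose opening quote is at column FIRST_COL.
// Return one col_range per byte of the interpreted string.
std::vector<col_range>
ranges_for_literal (const std::string &spelling, int first_col)
{
  std::vector<col_range> result;
  if (spelling.size () < 2
      || spelling.front () != '"' || spelling.back () != '"')
    return result;

  const std::size_t limit = spelling.size () - 1; // index of closing quote
  std::size_t i = 1;                              // skip opening quote
  while (i < limit)
    {
      col_range r = { first_col + (int) i, first_col + (int) i };
      if (spelling[i] != '\\')
        {
          // Ordinary character: one source byte, one interpreted byte.
          result.push_back (r);
          i++;
          continue;
        }

      i++; // skip the backslash; it is part of this character's spelling
      if (i >= limit)
        break; // malformed literal: give up quietly in this sketch

      if (spelling[i] == 'x')
        {
          // Hex escape: 'x' followed by any number of hex digits.
          i++;
          while (i < limit && std::isxdigit ((unsigned char) spelling[i]))
            i++;
        }
      else if (spelling[i] >= '0' && spelling[i] <= '7')
        {
          // Octal escape: at most three octal digits.
          int count = 0;
          while (i < limit && count < 3
                 && spelling[i] >= '0' && spelling[i] <= '7')
            {
              i++;
              count++;
            }
        }
      else
        i++; // simple escape such as \n or \t

      // The range covers everything from the backslash to the last
      // byte consumed by the escape.
      r.finish = first_col + (int) i - 1;
      result.push_back (r);
    }
  return result;
}
```

For the literal "hi\x41!\n" with its opening quote at column 1, the third
interpreted byte (the 'A' produced by \x41) maps to columns 4-7, which is
the span a diagnostic would underline for that byte.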

The patch also introduces two new functions for emitting -Wformat
warnings: format_warning_at_substring and format_warning_at_char.
When macros are involved, the pertinent part of the format string may
be separate from the function call; in that case these functions emit
a note (via inform) showing where that part is defined.
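
The doc comment on format_warning_va in the patch distinguishes three
cases when choosing where to place the warning.  The following is a
minimal sketch of that decision only, under simplifying assumptions:
locations are plain integers, and span, primary_kind, placement and
choose_placement are hypothetical names that do not appear in the patch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a source range with comparable endpoints.
struct span
{
  int start;
  int finish;
};

// Which location becomes the primary location of the warning.
enum class primary_kind
{
  SUBSTRING,    // the precise range within the format string
  WHOLE_STRING  // the format string argument as a whole
};

struct placement
{
  primary_kind primary;
  bool emit_note; // whether a follow-up note points at the substring
};

// FMT_STRING is the range of the whole format string argument.
// SUBSTRING is the range of the offending part, or NULL if it could
// not be determined.
placement
choose_placement (span fmt_string, const span *substring)
{
  // Case 3: no precise information; fall back to the whole string.
  if (substring == NULL)
    return { primary_kind::WHOLE_STRING, false };

  // Case 1: the substring lies within the format string as written at
  // the call site, so it can be underlined directly.
  if (substring->start >= fmt_string.start
      && substring->finish <= fmt_string.finish)
    return { primary_kind::SUBSTRING, false };

  // Case 2: the substring lives elsewhere (e.g. in a macro definition);
  // warn at the whole string and add a note at the substring.
  return { primary_kind::WHOLE_STRING, true };
}
```

Keeping the primary location at the call site in case 2 means the warning
still points at the line the user wrote, while the note shows where the
offending directive was actually spelled.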

Successfully bootstrapped&regrtested in conjunction with the rest of the
patch kit on x86_64-pc-linux-gnu.

Successful selftest run for stage 1 on powerpc-ibm-aix7.1.3.0 (gcc111)
in conjunction with the rest of the patch kit.

config-list.mk test run is in progress.

OK for trunk if it passes testing? (on top of patch 1)

gcc/c-family/ChangeLog:
	PR c/52952
	* c-format.c: Include "diagnostic.h".
	(location_column_from_byte_offset): Delete.
	(location_from_offset): Delete.
	(format_warning_va): New function.
	(format_warning_at_substring): New function.
	(format_warning_at_char): New function.
	(check_format_arg): Capture location of format_tree and pass to
	check_format_info_main.
	(check_format_info_main): Add params FMT_PARAM_LOC and
	FORMAT_STRING_CST.  Convert calls to warning_at to calls to
	format_warning_at_char.  Pass a substring_loc instance to
	check_format_types.
	(check_format_types): Convert first param from a location_t
	to a const substring_loc & and rename to "fmt_loc".  Attempt
	to extract the range of the relevant parameter and pass it
	to format_type_warning.
	(format_type_warning): Convert first param from a location_t
	to a const substring_loc & and rename to "fmt_loc".  Add
	params "param_range" and "type".  Replace calls to warning_at
	with calls to format_warning_at_substring.

gcc/testsuite/ChangeLog:
	PR c/52952
	* gcc.dg/cpp/pr66415-1.c: Update column numbers.
	* gcc.dg/format/asm_fprintf-1.c: Update column numbers.
	* gcc.dg/format/c90-printf-1.c: Likewise.
	* gcc.dg/format/diagnostic-ranges.c: New test case.
---
 gcc/c-family/c-format.c                         | 476 +++++++++++++++---------
 gcc/testsuite/gcc.dg/cpp/pr66415-1.c            |   8 +-
 gcc/testsuite/gcc.dg/format/asm_fprintf-1.c     |   6 +-
 gcc/testsuite/gcc.dg/format/c90-printf-1.c      |  14 +-
 gcc/testsuite/gcc.dg/format/diagnostic-ranges.c | 222 +++++++++++
 5 files changed, 544 insertions(+), 182 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/format/diagnostic-ranges.c

diff --git a/gcc/c-family/c-format.c b/gcc/c-family/c-format.c
index c19c411..5b79588 100644
--- a/gcc/c-family/c-format.c
+++ b/gcc/c-family/c-format.c
@@ -29,6 +29,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "intl.h"
 #include "langhooks.h"
 #include "c-format.h"
+#include "diagnostic.h"
 
 /* Handle attributes associated with format checking.  */
 
@@ -65,78 +66,169 @@ static int first_target_format_type;
 static const char *format_name (int format_num);
 static int format_flags (int format_num);
 
-/* Given a string S of length LINE_WIDTH, find the visual column
-   corresponding to OFFSET bytes.   */
+/* Emit a warning governed by option OPT, using GMSGID as the format
+   string and AP as its arguments.
 
-static unsigned int
-location_column_from_byte_offset (const char *s, int line_width,
-				  unsigned int offset)
-{
-  const char * c = s;
-  if (*c != '"')
-    return 0;
+   Attempt to obtain precise location information within a string
+   literal from FMT_LOC.
+
+   Case 1: if substring location is available, and is within the range of
+   the format string itself, the primary location of the
+   diagnostic is the substring range obtained from FMT_LOC, with the
+   caret at the *end* of the substring range.
+
+   For example:
+
+     test.c:90:10: warning: problem with '%i' here [-Wformat=]
+     printf ("hello %i", msg);
+                    ~^
+
+   Case 2: if the substring location is available, but is not within
+   the range of the format string, the primary location is that of the
+   format string, and a note is emitted showing the substring location.
+
+   For example:
+     test.c:90:10: warning: problem with '%i' here [-Wformat=]
+     printf("hello " INT_FMT " world", msg);
+            ^~~~~~~~~~~~~~~~~~~~~~~~~
+     test.c:19: note: format string is defined here
+     #define INT_FMT "%i"
+                      ~^
+
+   Case 3: if precise substring information is unavailable, the primary
+   location is that of the whole string passed to FMT_LOC's constructor.
+   For example:
+
+     test.c:90:10: warning: problem with '%i' here [-Wformat=]
+     printf(fmt, msg);
+            ^~~
+
+   For each of cases 1-3, if param_range is non-NULL, then it is used
+   as a secondary range within the warning.  For example, here it
+   is used with case 1:
+
+     test.c:90:16: warning: '%s' here but arg 2 has 'long' type [-Wformat=]
+     printf ("foo %s bar", long_i + long_j);
+                  ~^       ~~~~~~~~~~~~~~~
+
+   and here with case 2:
+
+     test.c:90:16: warning: '%s' here but arg 2 has 'long' type [-Wformat=]
+     printf ("foo " STR_FMT " bar", long_i + long_j);
+             ^~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~~~~
+     test.c:89:16: note: format string is defined here
+     #define STR_FMT "%s"
+                      ~^
 
-  c++, offset--;
-  while (offset > 0)
+   and with case 3:
+
+     test.c:90:10: warning: '%i' here, but arg 2 is 'const char *' [-Wformat=]
+     printf(fmt, msg);
+            ^~~  ~~~
+
+   Return true if a warning was emitted, false otherwise.  */
+
+ATTRIBUTE_GCC_DIAG (4,0)
+static bool
+format_warning_va (const substring_loc &fmt_loc, source_range *param_range,
+		   int opt, const char *gmsgid, va_list *ap)
+{
+  bool substring_within_range = false;
+  location_t primary_loc;
+  location_t substring_loc = UNKNOWN_LOCATION;
+  source_range fmt_loc_range
+    = get_range_from_loc (line_table, fmt_loc.get_fmt_string_loc ());
+  source_range fmt_substring_range;
+  const char *err = fmt_loc.get_range (&fmt_substring_range);
+  if (err)
+    /* Case 3: unable to get substring location.  */
+    primary_loc = fmt_loc.get_fmt_string_loc ();
+  else
     {
-      if (c - s >= line_width)
-	return 0;
+      substring_loc = make_location (fmt_substring_range.m_finish,
+				     fmt_substring_range.m_start,
+				     fmt_substring_range.m_finish);
 
-      switch (*c)
+      if (fmt_substring_range.m_start >= fmt_loc_range.m_start
+	  && fmt_substring_range.m_finish <= fmt_loc_range.m_finish)
+	/* Case 1.  */
 	{
-	case '\\':
-	  c++;
-	  if (c - s >= line_width)
-	    return 0;
-	  switch (*c)
-	    {
-	    case '\\': case '\'': case '"': case '?':
-	    case '(': case '{': case '[': case '%':
-	    case 'a': case 'b': case 'f': case 'n':
-	    case 'r': case 't': case 'v': 
-	    case 'e': case 'E':
-	      c++, offset--;
-	      break;
-
-	    default:
-	      return 0;
-	    }
-	  break;
-
-	case '"':
-	  /* We found the end of the string too early.  */
-	  return 0;
-	  
-	default:
-	  c++, offset--;
-	  break;
+	  substring_within_range = true;
+	  primary_loc = substring_loc;
 	}
+      else
+	/* Case 2.  */
+	{
+	  substring_within_range = false;
+	  primary_loc = fmt_loc.get_fmt_string_loc ();
+	}
+    }
+
+  rich_location richloc (line_table, primary_loc);
+
+  if (param_range)
+    {
+      location_t param_loc = make_location (param_range->m_start,
+					    param_range->m_start,
+					    param_range->m_finish);
+      richloc.add_range (param_loc, false);
     }
-  return c - s;
+
+  diagnostic_info diagnostic;
+  diagnostic_set_info (&diagnostic, gmsgid, ap, &richloc, DK_WARNING);
+  diagnostic.option_index = opt;
+  bool warned = report_diagnostic (&diagnostic);
+
+  if (!err && substring_loc && !substring_within_range)
+    /* Case 2.  */
+    if (warned)
+      inform (substring_loc, "format string is defined here");
+
+  return warned;
 }
 
-/* Return a location that encodes the same location as LOC but shifted
-   by OFFSET bytes.  */
+/* Variadic call to format_warning_va.  */
 
-static location_t
-location_from_offset (location_t loc, int offset)
+ATTRIBUTE_GCC_DIAG (4,0)
+static bool
+format_warning_at_substring (const substring_loc &fmt_loc,
+			     source_range *param_range,
+			     int opt, const char *gmsgid, ...)
 {
-  gcc_checking_assert (offset >= 0);
-  if (linemap_location_from_macro_expansion_p (line_table, loc)
-      || offset < 0)
-    return loc;
+  va_list ap;
+  va_start (ap, gmsgid);
+  bool warned = format_warning_va (fmt_loc, param_range, opt, gmsgid, &ap);
+  va_end (ap);
+
+  return warned;
+}
 
-  expanded_location s = expand_location_to_spelling_point (loc);
-  int line_width;
-  const char *line = location_get_source_line (s.file, s.line, &line_width);
-  if (line == NULL)
-    return loc;
-  line += s.column - 1 ;
-  line_width -= s.column - 1;
-  unsigned int column =
-    location_column_from_byte_offset (line, line_width, (unsigned) offset);
+/* Emit a warning as per format_warning_va, but construct the substring_loc
+   for the character at offset (CHAR_IDX - 1) within a string constant
+   FORMAT_STRING_CST at FMT_STRING_LOC.  */
 
-  return linemap_position_for_loc_and_offset (line_table, loc, column);
+ATTRIBUTE_GCC_DIAG (5,6)
+static bool
+format_warning_at_char (location_t fmt_string_loc, tree format_string_cst,
+			int char_idx, int opt, const char *gmsgid, ...)
+{
+  va_list ap;
+  va_start (ap, gmsgid);
+  tree string_type = TREE_TYPE (format_string_cst);
+
+  /* The callers are of the form:
+       format_warning_at_char (format_string_loc, format_string_cst,
+			       format_chars - orig_format_chars, ...)
+      where format_chars has already been incremented, so that
+      CHAR_IDX is one character beyond where the warning should
+      be emitted.  Fix it.  */
+  char_idx -= 1;
+
+  substring_loc fmt_loc (fmt_string_loc, string_type, char_idx, char_idx);
+  bool warned = format_warning_va (fmt_loc, NULL, opt, gmsgid, &ap);
+  va_end (ap);
+
+  return warned;
 }
 
 /* Check that we have a pointer to a string suitable for use as a format.
@@ -1018,8 +1110,9 @@ format_flags (int format_num)
 static void check_format_info (function_format_info *, tree);
 static void check_format_arg (void *, tree, unsigned HOST_WIDE_INT);
 static void check_format_info_main (format_check_results *,
-				    function_format_info *,
-				    const char *, int, tree,
+				    function_format_info *, const char *,
+				    location_t, tree,
+				    int, tree,
 				    unsigned HOST_WIDE_INT,
 				    object_allocator<format_wanted_type> &);
 
@@ -1032,8 +1125,12 @@ static void finish_dollar_format_checking (format_check_results *, int);
 static const format_flag_spec *get_flag_spec (const format_flag_spec *,
 					      int, const char *);
 
-static void check_format_types (location_t, format_wanted_type *);
-static void format_type_warning (location_t, format_wanted_type *, tree, tree);
+static void check_format_types (const substring_loc &fmt_loc,
+				format_wanted_type *);
+static void format_type_warning (const substring_loc &fmt_loc,
+				 source_range *param_range,
+				 format_wanted_type *, tree,
+				 tree);
 
 /* Decode a format type from a string, returning the type, or
    format_type_error if not valid, in which case the caller should print an
@@ -1509,6 +1606,8 @@ check_format_arg (void *ctx, tree format_tree,
   tree array_size = 0;
   tree array_init;
 
+  location_t fmt_param_loc = EXPR_LOC_OR_LOC (format_tree, input_location);
+
   if (VAR_P (format_tree))
     {
       /* Pull out a constant value if the front end didn't.  */
@@ -1684,12 +1783,13 @@ check_format_arg (void *ctx, tree format_tree,
      need not adjust it for every return.  */
   res->number_other++;
   object_allocator <format_wanted_type> fwt_pool ("format_wanted_type pool");
-  check_format_info_main (res, info, format_chars, format_length,
-			  params, arg_num, fwt_pool);
+  check_format_info_main (res, info, format_chars, fmt_param_loc, format_tree,
+			  format_length, params, arg_num, fwt_pool);
 }
 
 
-/* Do the main part of checking a call to a format function.  FORMAT_CHARS
+/* Do the main part of checking a call to a format function.
+   FORMAT_STRING_CST is the STRING_CST format string.  FORMAT_CHARS
    is the NUL-terminated format string (which at this point may contain
    internal NUL characters); FORMAT_LENGTH is its length (excluding the
    terminating NUL character).  ARG_NUM is one less than the number of
@@ -1699,6 +1799,7 @@ check_format_arg (void *ctx, tree format_tree,
 static void
 check_format_info_main (format_check_results *res,
 			function_format_info *info, const char *format_chars,
+			location_t fmt_param_loc, tree format_string_cst,
 			int format_length, tree params,
 			unsigned HOST_WIDE_INT arg_num,
 			object_allocator <format_wanted_type> &fwt_pool)
@@ -1747,10 +1848,10 @@ check_format_info_main (format_check_results *res,
 	continue;
       if (*format_chars == 0)
 	{
-          warning_at (location_from_offset (format_string_loc,
-					    format_chars - orig_format_chars),
-		      OPT_Wformat_,
-		      "spurious trailing %<%%%> in format");
+	  format_warning_at_char (format_string_loc, format_string_cst,
+				  format_chars - orig_format_chars,
+				  OPT_Wformat_,
+				  "spurious trailing %<%%%> in format");
 	  continue;
 	}
       if (*format_chars == '%')
@@ -1758,6 +1859,7 @@ check_format_info_main (format_check_results *res,
 	  ++format_chars;
 	  continue;
 	}
+      const char *start_of_this_format = format_chars;
       flag_chars[0] = 0;
 
       if ((fki->flags & (int) FMT_FLAG_USE_DOLLAR) && has_operand_number != 0)
@@ -1794,11 +1896,10 @@ check_format_info_main (format_check_results *res,
 						     *format_chars, NULL);
 	  if (strchr (flag_chars, *format_chars) != 0)
 	    {
-	      warning_at (location_from_offset (format_string_loc,
-						format_chars + 1
-						- orig_format_chars),
-			  OPT_Wformat_,
-			  "repeated %s in format", _(s->name));
+	      format_warning_at_char (format_string_loc, format_string_cst,
+				      format_chars + 1 - orig_format_chars,
+				      OPT_Wformat_,
+				      "repeated %s in format", _(s->name));
 	    }
 	  else
 	    {
@@ -1921,10 +2022,11 @@ check_format_info_main (format_check_results *res,
 	  flag_chars[i++] = fki->left_precision_char;
 	  flag_chars[i] = 0;
 	  if (!ISDIGIT (*format_chars))
-	    warning_at (location_from_offset (format_string_loc,
-					      format_chars - orig_format_chars),
-			OPT_Wformat_,
-			"empty left precision in %s format", fki->name);
+	    format_warning_at_char (format_string_loc, format_string_cst,
+				    format_chars - orig_format_chars,
+				    OPT_Wformat_,
+				    "empty left precision in %s format",
+				    fki->name);
 	  while (ISDIGIT (*format_chars))
 	    ++format_chars;
 	}
@@ -2002,10 +2104,11 @@ check_format_info_main (format_check_results *res,
 	    {
 	      if (!(fki->flags & (int) FMT_FLAG_EMPTY_PREC_OK)
 		  && !ISDIGIT (*format_chars))
-		warning_at (location_from_offset (format_string_loc,
-						  format_chars - orig_format_chars),
-			    OPT_Wformat_,
-			    "empty precision in %s format", fki->name);
+		format_warning_at_char (format_string_loc, format_string_cst,
+					format_chars - orig_format_chars,
+					OPT_Wformat_,
+					"empty precision in %s format",
+					fki->name);
 	      while (ISDIGIT (*format_chars))
 		++format_chars;
 	    }
@@ -2090,11 +2193,10 @@ check_format_info_main (format_check_results *res,
 		{
 		  const format_flag_spec *s = get_flag_spec (flag_specs,
 							     *format_chars, NULL);
-		  warning_at (location_from_offset (format_string_loc,
-						    format_chars 
-						    - orig_format_chars),
-			      OPT_Wformat_,
-			      "repeated %s in format", _(s->name));
+		  format_warning_at_char (format_string_loc, format_string_cst,
+					  format_chars - orig_format_chars,
+					  OPT_Wformat_,
+					  "repeated %s in format", _(s->name));
 		}
 	      else
 		{
@@ -2111,10 +2213,10 @@ check_format_info_main (format_check_results *res,
 	  || (!(fki->flags & (int) FMT_FLAG_FANCY_PERCENT_OK)
 	      && format_char == '%'))
 	{
-	  warning_at (location_from_offset (format_string_loc,
-					    format_chars - orig_format_chars),
-		      OPT_Wformat_,
-		      "conversion lacks type at end of format");
+	  format_warning_at_char (format_string_loc, format_string_cst,
+				  format_chars - orig_format_chars,
+				  OPT_Wformat_,
+				  "conversion lacks type at end of format");
 	  continue;
 	}
       format_chars++;
@@ -2125,27 +2227,30 @@ check_format_info_main (format_check_results *res,
       if (fci->format_chars == 0)
 	{
 	  if (ISGRAPH (format_char))
-	    warning_at (location_from_offset (format_string_loc,
-					      format_chars - orig_format_chars),
-			OPT_Wformat_,
-			"unknown conversion type character %qc in format",
-			format_char);
+	    format_warning_at_char
+	      (format_string_loc, format_string_cst,
+	       format_chars - orig_format_chars,
+	       OPT_Wformat_,
+	       "unknown conversion type character %qc in format",
+	       format_char);
 	  else
-	    warning_at (location_from_offset (format_string_loc,
-					      format_chars - orig_format_chars),
-			OPT_Wformat_,
-			"unknown conversion type character 0x%x in format",
-			format_char);
+	    format_warning_at_char
+	      (format_string_loc, format_string_cst,
+	       format_chars - orig_format_chars,
+	       OPT_Wformat_,
+	       "unknown conversion type character 0x%x in format",
+	       format_char);
 	  continue;
 	}
       if (pedantic)
 	{
 	  if (ADJ_STD (fci->std) > C_STD_VER)
-	    warning_at (location_from_offset (format_string_loc,
-					      format_chars - orig_format_chars),
-			OPT_Wformat_,
-			"%s does not support the %<%%%c%> %s format",
-			C_STD_NAME (fci->std), format_char, fki->name);
+	    format_warning_at_char
+	      (format_string_loc, format_string_cst,
+	       format_chars - orig_format_chars,
+	       OPT_Wformat_,
+	       "%s does not support the %<%%%c%> %s format",
+	       C_STD_NAME (fci->std), format_char, fki->name);
 	}
 
       /* Validate the individual flags used, removing any that are invalid.  */
@@ -2160,11 +2265,11 @@ check_format_info_main (format_check_results *res,
 	      continue;
 	    if (strchr (fci->flag_chars, flag_chars[i]) == 0)
 	      {
-		warning_at (location_from_offset (format_string_loc,
-						  format_chars 
-						  - orig_format_chars),
-			    OPT_Wformat_, "%s used with %<%%%c%> %s format",
-			    _(s->name), format_char, fki->name);
+		format_warning_at_char (format_string_loc, format_string_cst,
+					format_chars - orig_format_chars,
+					OPT_Wformat_,
+					"%s used with %<%%%c%> %s format",
+					_(s->name), format_char, fki->name);
 		d++;
 		continue;
 	      }
@@ -2277,10 +2382,10 @@ check_format_info_main (format_check_results *res,
 	    ++format_chars;
 	  if (*format_chars != ']')
 	    /* The end of the format string was reached.  */
-	    warning_at (location_from_offset (format_string_loc,
-					      format_chars - orig_format_chars),
-			OPT_Wformat_,
-			"no closing %<]%> for %<%%[%> format");
+	    format_warning_at_char (format_string_loc, format_string_cst,
+				    format_chars - orig_format_chars,
+				    OPT_Wformat_,
+				    "no closing %<]%> for %<%%[%> format");
 	}
 
       wanted_type = 0;
@@ -2293,12 +2398,14 @@ check_format_info_main (format_check_results *res,
 	  wanted_type_std = fci->types[length_chars_val].std;
 	  if (wanted_type == 0)
 	    {
-	      warning_at (location_from_offset (format_string_loc,
-						format_chars - orig_format_chars),
-			  OPT_Wformat_,
-			  "use of %qs length modifier with %qc type character"
-			  " has either no effect or undefined behavior",
-			  length_chars, format_char);
+	      format_warning_at_char
+		(format_string_loc, format_string_cst,
+		 format_chars - orig_format_chars,
+		 OPT_Wformat_,
+		 "use of %qs length modifier with %qc type"
+		 " character"
+		 " has either no effect or undefined behavior",
+		 length_chars, format_char);
 	      /* Heuristic: skip one argument when an invalid length/type
 		 combination is encountered.  */
 	      arg_num++;
@@ -2314,12 +2421,13 @@ check_format_info_main (format_check_results *res,
 		   && ADJ_STD (wanted_type_std) > ADJ_STD (fci->std))
 	    {
 	      if (ADJ_STD (wanted_type_std) > C_STD_VER)
-		warning_at (location_from_offset (format_string_loc,
-						  format_chars - orig_format_chars),
-			    OPT_Wformat_,
-			    "%s does not support the %<%%%s%c%> %s format",
-			    C_STD_NAME (wanted_type_std), length_chars,
-			    format_char, fki->name);
+		format_warning_at_char
+		  (format_string_loc, format_string_cst,
+		   format_chars - orig_format_chars,
+		   OPT_Wformat_,
+		   "%s does not support the %<%%%s%c%> %s format",
+		   C_STD_NAME (wanted_type_std), length_chars,
+		   format_char, fki->name);
 	    }
 	}
 
@@ -2421,14 +2529,20 @@ check_format_info_main (format_check_results *res,
 	}
 
       if (first_wanted_type != 0)
-        check_format_types (format_string_loc, first_wanted_type);
+	{
+	  ptrdiff_t offset_to_format_start = (start_of_this_format - 1) - orig_format_chars;
+	  ptrdiff_t offset_to_format_end = (format_chars - 1) - orig_format_chars;
+	  substring_loc fmt_loc (fmt_param_loc, TREE_TYPE (format_string_cst),
+				 offset_to_format_start, offset_to_format_end);
+	  check_format_types (fmt_loc, first_wanted_type);
+	}
     }
 
   if (format_chars - orig_format_chars != format_length)
-    warning_at (location_from_offset (format_string_loc,
-				      format_chars + 1 - orig_format_chars),
-		OPT_Wformat_contains_nul,
-		"embedded %<\\0%> in format");
+    format_warning_at_char (format_string_loc, format_string_cst,
+			    format_chars + 1 - orig_format_chars,
+			    OPT_Wformat_contains_nul,
+			    "embedded %<\\0%> in format");
   if (info->first_arg_num != 0 && params != 0
       && has_operand_number <= 0)
     {
@@ -2439,12 +2553,12 @@ check_format_info_main (format_check_results *res,
     finish_dollar_format_checking (res, fki->flags & (int) FMT_FLAG_DOLLAR_GAP_POINTER_OK);
 }
 
-
 /* Check the argument types from a single format conversion (possibly
-   including width and precision arguments).  LOC is the location of
-   the format string.  */
+   including width and precision arguments).  FMT_LOC is the
+   location of the format conversion.  */
 static void
-check_format_types (location_t loc, format_wanted_type *types)
+check_format_types (const substring_loc &fmt_loc,
+		    format_wanted_type *types)
 {
   for (; types != 0; types = types->next)
     {
@@ -2471,7 +2585,7 @@ check_format_types (location_t loc, format_wanted_type *types)
       cur_param = types->param;
       if (!cur_param)
         {
-          format_type_warning (loc, types, wanted_type, NULL);
+          format_type_warning (fmt_loc, NULL, types, wanted_type, NULL);
           continue;
         }
 
@@ -2481,6 +2595,16 @@ check_format_types (location_t loc, format_wanted_type *types)
       orig_cur_type = cur_type;
       char_type_flag = 0;
 
+      source_range param_range;
+      source_range *param_range_ptr;
+      if (CAN_HAVE_LOCATION_P (cur_param))
+	{
+	  param_range = EXPR_LOCATION_RANGE (cur_param);
+	  param_range_ptr = &param_range;
+	}
+      else
+	param_range_ptr = NULL;
+
       STRIP_NOPS (cur_param);
 
       /* Check the types of any additional pointer arguments
@@ -2545,7 +2669,8 @@ check_format_types (location_t loc, format_wanted_type *types)
 	    }
 	  else
 	    {
-              format_type_warning (loc, types, wanted_type, orig_cur_type);
+	      format_type_warning (fmt_loc, param_range_ptr,
+				   types, wanted_type, orig_cur_type);
 	      break;
 	    }
 	}
@@ -2613,20 +2738,24 @@ check_format_types (location_t loc, format_wanted_type *types)
 	  && TYPE_PRECISION (cur_type) == TYPE_PRECISION (wanted_type))
 	continue;
       /* Now we have a type mismatch.  */
-      format_type_warning (loc, types, wanted_type, orig_cur_type);
+      format_type_warning (fmt_loc, param_range_ptr, types,
+			   wanted_type, orig_cur_type);
     }
 }
 
 
-/* Give a warning at LOC about a format argument of different type from that
-   expected.  WANTED_TYPE is the type the argument should have, possibly
-   stripped of pointer dereferences.  The description (such as "field
+/* Give a warning at FMT_LOC about a format argument of different type
+   from that expected.  If non-NULL, PARAM_RANGE is the source range of the
+   relevant argument.  WANTED_TYPE is the type the argument should have,
+   possibly stripped of pointer dereferences.  The description (such as "field
    precision"), the placement in the format string, a possibly more
    friendly name of WANTED_TYPE, and the number of pointer dereferences
    are taken from TYPE.  ARG_TYPE is the type of the actual argument,
    or NULL if it is missing.  */
 static void
-format_type_warning (location_t loc, format_wanted_type *type,
+format_type_warning (const substring_loc &fmt_loc,
+		     source_range *param_range,
+		     format_wanted_type *type,
 		     tree wanted_type, tree arg_type)
 {
   int kind = type->kind;
@@ -2635,7 +2764,6 @@ format_type_warning (location_t loc, format_wanted_type *type,
   int format_length = type->format_length;
   int pointer_count = type->pointer_count;
   int arg_num = type->arg_num;
-  unsigned int offset_loc = type->offset_loc;
 
   char *p;
   /* If ARG_TYPE is a typedef with a misleading name (for example,
@@ -2669,41 +2797,47 @@ format_type_warning (location_t loc, format_wanted_type *type,
       p[pointer_count + 1] = 0;
     }
 
-  loc = location_from_offset (loc, offset_loc);
-		      
   if (wanted_type_name)
     {
       if (arg_type)
-        warning_at (loc, OPT_Wformat_,
-		    "%s %<%s%.*s%> expects argument of type %<%s%s%>, "
-		    "but argument %d has type %qT",
-		    gettext (kind_descriptions[kind]),
-		    (kind == CF_KIND_FORMAT ? "%" : ""),
-		    format_length, format_start, 
-		    wanted_type_name, p, arg_num, arg_type);
+	format_warning_at_substring
+	  (fmt_loc, param_range,
+	   OPT_Wformat_,
+	   "%s %<%s%.*s%> expects argument of type %<%s%s%>, "
+	   "but argument %d has type %qT",
+	   gettext (kind_descriptions[kind]),
+	   (kind == CF_KIND_FORMAT ? "%" : ""),
+	   format_length, format_start,
+	   wanted_type_name, p, arg_num, arg_type);
       else
-        warning_at (loc, OPT_Wformat_,
-		    "%s %<%s%.*s%> expects a matching %<%s%s%> argument",
-		    gettext (kind_descriptions[kind]),
-		    (kind == CF_KIND_FORMAT ? "%" : ""),
-		    format_length, format_start, wanted_type_name, p);
+	format_warning_at_substring
+	  (fmt_loc, param_range,
+	   OPT_Wformat_,
+	   "%s %<%s%.*s%> expects a matching %<%s%s%> argument",
+	   gettext (kind_descriptions[kind]),
+	   (kind == CF_KIND_FORMAT ? "%" : ""),
+	   format_length, format_start, wanted_type_name, p);
     }
   else
     {
       if (arg_type)
-        warning_at (loc, OPT_Wformat_,
-		    "%s %<%s%.*s%> expects argument of type %<%T%s%>, "
-		    "but argument %d has type %qT",
-		    gettext (kind_descriptions[kind]),
-		    (kind == CF_KIND_FORMAT ? "%" : ""),
-		    format_length, format_start, 
-		    wanted_type, p, arg_num, arg_type);
+	format_warning_at_substring
+	  (fmt_loc, param_range,
+	   OPT_Wformat_,
+	   "%s %<%s%.*s%> expects argument of type %<%T%s%>, "
+	   "but argument %d has type %qT",
+	   gettext (kind_descriptions[kind]),
+	   (kind == CF_KIND_FORMAT ? "%" : ""),
+	   format_length, format_start,
+	   wanted_type, p, arg_num, arg_type);
       else
-        warning_at (loc, OPT_Wformat_,
-		    "%s %<%s%.*s%> expects a matching %<%T%s%> argument",
-		    gettext (kind_descriptions[kind]),
-		    (kind == CF_KIND_FORMAT ? "%" : ""),
-		    format_length, format_start, wanted_type, p);
+	format_warning_at_substring
+	  (fmt_loc, param_range,
+	   OPT_Wformat_,
+	   "%s %<%s%.*s%> expects a matching %<%T%s%> argument",
+	   gettext (kind_descriptions[kind]),
+	   (kind == CF_KIND_FORMAT ? "%" : ""),
+	   format_length, format_start, wanted_type, p);
     }
 }
 
diff --git a/gcc/testsuite/gcc.dg/cpp/pr66415-1.c b/gcc/testsuite/gcc.dg/cpp/pr66415-1.c
index 349ec48..1f67cb4 100644
--- a/gcc/testsuite/gcc.dg/cpp/pr66415-1.c
+++ b/gcc/testsuite/gcc.dg/cpp/pr66415-1.c
@@ -1,9 +1,15 @@
 /* PR c/66415 */
 /* { dg-do compile } */
-/* { dg-options "-Wformat" } */
+/* { dg-options "-Wformat -fdiagnostics-show-caret" } */
 
 void
 fn1 (void)
 {
   __builtin_printf                                ("xxxxxxxxxxxxxxxxx%dxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"); /* { dg-warning "71:format" } */
+
+/* { dg-begin-multiline-output "" }
+   __builtin_printf                                ("xxxxxxxxxxxxxxxxx%dxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx");
+                                                                      ~^
+   { dg-end-multiline-output "" } */
+
 }
diff --git a/gcc/testsuite/gcc.dg/format/asm_fprintf-1.c b/gcc/testsuite/gcc.dg/format/asm_fprintf-1.c
index 2eabbf9..50ca572 100644
--- a/gcc/testsuite/gcc.dg/format/asm_fprintf-1.c
+++ b/gcc/testsuite/gcc.dg/format/asm_fprintf-1.c
@@ -66,9 +66,9 @@ foo (int i, int i1, int i2, unsigned int u, double d, char *s, void *p,
   asm_fprintf ("%d", i, i); /* { dg-warning "16:arguments" "wrong number of args" } */
   /* Miscellaneous bogus constructions.  */
   asm_fprintf (""); /* { dg-warning "16:zero-length" "warning for empty format" } */
-  asm_fprintf ("\0"); /* { dg-warning "17:embedded" "warning for embedded NUL" } */
-  asm_fprintf ("%d\0", i); /* { dg-warning "19:embedded" "warning for embedded NUL" } */
-  asm_fprintf ("%d\0%d", i, i); /* { dg-warning "19:embedded|too many" "warning for embedded NUL" } */
+  asm_fprintf ("\0"); /* { dg-warning "18:embedded" "warning for embedded NUL" } */
+  asm_fprintf ("%d\0", i); /* { dg-warning "20:embedded" "warning for embedded NUL" } */
+  asm_fprintf ("%d\0%d", i, i); /* { dg-warning "20:embedded|too many" "warning for embedded NUL" } */
   asm_fprintf (NULL); /* { dg-warning "null" "null format string warning" } */
   asm_fprintf ("%"); /* { dg-warning "17:trailing" "trailing % warning" } */
   asm_fprintf ("%++d", i); /* { dg-warning "19:repeated" "repeated flag warning" } */
diff --git a/gcc/testsuite/gcc.dg/format/c90-printf-1.c b/gcc/testsuite/gcc.dg/format/c90-printf-1.c
index 5329dad..338b971 100644
--- a/gcc/testsuite/gcc.dg/format/c90-printf-1.c
+++ b/gcc/testsuite/gcc.dg/format/c90-printf-1.c
@@ -58,11 +58,11 @@ foo (int i, int i1, int i2, unsigned int u, double d, char *s, void *p,
   printf ("%-%"); /* { dg-warning "13:type" "missing type" } */
   /* { dg-warning "14:trailing" "bogus %%" { target *-*-* } 58 } */
   printf ("%-%\n"); /* { dg-warning "13:format" "bogus %%" } */
-  /* { dg-warning "15:format" "bogus %%" { target *-*-* } 60 } */
+  /* { dg-warning "16:format" "bogus %%" { target *-*-* } 60 } */
   printf ("%5%\n"); /* { dg-warning "13:format" "bogus %%" } */
-  /* { dg-warning "15:format" "bogus %%" { target *-*-* } 62 } */
+  /* { dg-warning "16:format" "bogus %%" { target *-*-* } 62 } */
   printf ("%h%\n"); /* { dg-warning "13:format" "bogus %%" } */
-  /* { dg-warning "15:format" "bogus %%" { target *-*-* } 64 } */
+  /* { dg-warning "16:format" "bogus %%" { target *-*-* } 64 } */
   /* Valid and invalid %h, %l, %L constructions.  */
   printf ("%hd", i);
   printf ("%hi", i);
@@ -184,8 +184,8 @@ foo (int i, int i1, int i2, unsigned int u, double d, char *s, void *p,
   printf ("%-08G", d); /* { dg-warning "11:flags|ignored" "0 flag ignored with - flag" } */
   /* Various tests of bad argument types.  */
   printf ("%d", l); /* { dg-warning "13:format" "bad argument types" } */
-  printf ("%*.*d", l, i2, i); /* { dg-warning "13:field" "bad * argument types" } */
-  printf ("%*.*d", i1, l, i); /* { dg-warning "15:field" "bad * argument types" } */
+  printf ("%*.*d", l, i2, i); /* { dg-warning "16:field" "bad * argument types" } */
+  printf ("%*.*d", i1, l, i); /* { dg-warning "16:field" "bad * argument types" } */
   printf ("%ld", i); /* { dg-warning "14:format" "bad argument types" } */
   printf ("%s", n); /* { dg-warning "13:format" "bad argument types" } */
   printf ("%p", i); /* { dg-warning "13:format" "bad argument types" } */
@@ -231,8 +231,8 @@ foo (int i, int i1, int i2, unsigned int u, double d, char *s, void *p,
   printf ("%d", i, i); /* { dg-warning "11:arguments" "wrong number of args" } */
   /* Miscellaneous bogus constructions.  */
   printf (""); /* { dg-warning "11:zero-length" "warning for empty format" } */
-  printf ("\0"); /* { dg-warning "12:embedded" "warning for embedded NUL" } */
-  printf ("%d\0", i); /* { dg-warning "14:embedded" "warning for embedded NUL" } */
+  printf ("\0"); /* { dg-warning "13:embedded" "warning for embedded NUL" } */
+  printf ("%d\0", i); /* { dg-warning "15:embedded" "warning for embedded NUL" } */
   printf ("%d\0%d", i, i); /* { dg-warning "embedded|too many" "warning for embedded NUL" } */
   printf (NULL); /* { dg-warning "3:null" "null format string warning" } */
   printf ("%"); /* { dg-warning "12:trailing" "trailing % warning" } */
diff --git a/gcc/testsuite/gcc.dg/format/diagnostic-ranges.c b/gcc/testsuite/gcc.dg/format/diagnostic-ranges.c
new file mode 100644
index 0000000..9e86b52
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/format/diagnostic-ranges.c
@@ -0,0 +1,222 @@
+/* { dg-options "-Wformat -fdiagnostics-show-caret" } */
+
+/* See PR 52952. */
+
+#include "format.h"
+
+void test_mismatching_types (const char *msg)
+{
+  printf("hello %i", msg);  /* { dg-warning "format '%i' expects argument of type 'int', but argument 2 has type 'const char \\*' " } */
+
+/* TODO: ideally would also underline "msg".  */
+/* { dg-begin-multiline-output "" }
+   printf("hello %i", msg);
+                 ~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_multiple_arguments (void)
+{
+  printf ("arg0: %i  arg1: %s arg 2: %i", /* { dg-warning "29: format '%s'" } */
+          100, 101, 102);
+/* TODO: ideally would also underline "101".  */
+/* { dg-begin-multiline-output "" }
+   printf ("arg0: %i  arg1: %s arg 2: %i",
+                            ~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_multiple_arguments_2 (int i, int j)
+{
+  printf ("arg0: %i  arg1: %s arg 2: %i", /* { dg-warning "29: format '%s'" } */
+          100, i + j, 102);
+/* { dg-begin-multiline-output "" }
+   printf ("arg0: %i  arg1: %s arg 2: %i",
+                            ~^
+           100, i + j, 102);
+                ~~~~~         
+   { dg-end-multiline-output "" } */
+}
+
+void multiline_format_string (void) {
+  printf ("before the fmt specifier" /* { dg-warning "11: format '%d' expects a matching 'int' argument" } */
+/* { dg-begin-multiline-output "" }
+   printf ("before the fmt specifier"
+           ^~~~~~~~~~~~~~~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+
+          "%"
+          "d" /* { dg-message "12: format string is defined here" } */
+          "after the fmt specifier");
+
+/* { dg-begin-multiline-output "" }
+           "%"
+            ~~
+           "d"
+           ~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_hex (const char *msg)
+{
+  /* "%" is \x25
+     "i" is \x69 */
+  printf("hello \x25\x69", msg);  /* { dg-warning "format '%i' expects argument of type 'int', but argument 2 has type 'const char \\*' " } */
+
+/* TODO: ideally would also underline "msg".  */
+/* { dg-begin-multiline-output "" }
+   printf("hello \x25\x69", msg);
+                 ~~~~~~~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_oct (const char *msg)
+{
+  /* "%" is octal 045
+     "i" is octal 151.  */
+  printf("hello \045\151", msg);  /* { dg-warning "format '%i' expects argument of type 'int', but argument 2 has type 'const char \\*' " } */
+
+/* TODO: ideally would also underline "msg".  */
+/* { dg-begin-multiline-output "" }
+   printf("hello \045\151", msg);
+                 ~~~~~~~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_multiple (const char *msg)
+{
+  /* "%" is \x25 in hex
+     "i" is \151 in octal.  */
+  printf("prefix"  "\x25"  "\151"  "suffix",  /* { dg-warning "format '%i'" } */
+         msg);
+/* { dg-begin-multiline-output "" }
+   printf("prefix"  "\x25"  "\151"  "suffix",
+          ^~~~~~~~
+  { dg-end-multiline-output "" } */
+
+/* TODO: ideally would also underline "msg".  */
+/* { dg-begin-multiline-output "" }
+   printf("prefix"  "\x25"  "\151"  "suffix",
+                     ~~~~~~~~~~~^
+  { dg-end-multiline-output "" } */
+}
+
+void test_u8 (const char *msg)
+{
+  printf(u8"hello %i", msg);/* { dg-warning "format '%i' expects argument of type 'int', but argument 2 has type 'const char \\*' " } */
+/* TODO: ideally would also underline "msg".  */
+/* { dg-begin-multiline-output "" }
+   printf(u8"hello %i", msg);
+                   ~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_param (long long_i, long long_j)
+{
+  printf ("foo %s bar", long_i + long_j); /* { dg-warning "17: format '%s' expects argument of type 'char \\*', but argument 2 has type 'long int'" } */
+/* { dg-begin-multiline-output "" }
+   printf ("foo %s bar", long_i + long_j);
+                ~^       ~~~~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void test_field_width_specifier (long l, int i1, int i2)
+{
+  printf (" %*.*d ", l, i1, i2); /* { dg-warning "17: field width specifier '\\*' expects argument of type 'int', but argument 2 has type 'long int'" } */
+/* { dg-begin-multiline-output "" }
+   printf (" %*.*d ", l, i1, i2);
+             ~~~~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_spurious_percent (void)
+{
+  printf("hello world %"); /* { dg-warning "23: spurious trailing" } */
+
+/* { dg-begin-multiline-output "" }
+   printf("hello world %");
+                       ^
+   { dg-end-multiline-output "" } */
+}
+
+void test_empty_precision (char *s, size_t m, double d)
+{
+  strfmon (s, m, "%#.5n", d); /* { dg-warning "20: empty left precision in gnu_strfmon format" } */
+/* { dg-begin-multiline-output "" }
+   strfmon (s, m, "%#.5n", d);
+                    ^
+   { dg-end-multiline-output "" } */
+
+  strfmon (s, m, "%#5.n", d); /* { dg-warning "22: empty precision in gnu_strfmon format" } */
+/* { dg-begin-multiline-output "" }
+   strfmon (s, m, "%#5.n", d);
+                      ^
+   { dg-end-multiline-output "" } */
+}
+
+void test_repeated (int i)
+{
+  printf ("%++d", i); /* { dg-warning "14: repeated '\\+' flag in format" } */
+/* { dg-begin-multiline-output "" }
+   printf ("%++d", i);
+              ^
+   { dg-end-multiline-output "" } */
+}
+
+void test_conversion_lacks_type (void)
+{
+  printf (" %h"); /* { dg-warning "14:conversion lacks type at end of format" } */
+/* { dg-begin-multiline-output "" }
+   printf (" %h");
+              ^
+   { dg-end-multiline-output "" } */
+}
+
+void test_embedded_nul (void)
+{
+  printf (" \0 "); /* { dg-warning "14:embedded" "warning for embedded NUL" } */
+/* { dg-begin-multiline-output "" }
+   printf (" \0 ");
+             ~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_macro (const char *msg)
+{
+#define INT_FMT "%i" /* { dg-message "19: format string is defined here" } */
+  printf("hello " INT_FMT " world", msg);  /* { dg-warning "10: format '%i' expects argument of type 'int', but argument 2 has type 'const char \\*' " } */
+/* { dg-begin-multiline-output "" }
+   printf("hello " INT_FMT " world", msg);
+          ^~~~~~~~
+   { dg-end-multiline-output "" } */
+/* { dg-begin-multiline-output "" }
+ #define INT_FMT "%i"
+                  ~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_non_contiguous_strings (void)
+{
+  __builtin_printf(" %" "d ", 0.5); /* { dg-warning "20: format .%d. expects argument of type .int., but argument 2 has type .double." } */
+                                    /* { dg-message "26: format string is defined here" "" { target *-*-* } 200 } */
+  /* { dg-begin-multiline-output "" }
+   __builtin_printf(" %" "d ", 0.5);
+                    ^~~~
+   { dg-end-multiline-output "" } */
+  /* { dg-begin-multiline-output "" }
+   __builtin_printf(" %" "d ", 0.5);
+                      ~~~~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_const_arrays (void)
+{
+  /* TODO: ideally we'd highlight both the format string *and* the use of
+     it here.  For now, just verify that we gracefully handle this case.  */
+  const char a[] = " %d ";
+  __builtin_printf(a, 0.5); /* { dg-warning "20: format .%d. expects argument of type .int., but argument 2 has type .double." } */
+  /* { dg-begin-multiline-output "" }
+   __builtin_printf(a, 0.5);
+                    ^
+   { dg-end-multiline-output "" } */
+}
-- 
1.8.5.3

* [PATCH 3/3] c-format.c: suggest the correct format string to use (PR c/64955)
  2016-07-26 16:43     ` [PATCH 1/3] (v2) " David Malcolm
@ 2016-07-26 16:43       ` David Malcolm
  2016-07-26 16:43       ` [PATCH 2/3] Use class substring_loc in c-format.c (PR c/52952) David Malcolm
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 61+ messages in thread
From: David Malcolm @ 2016-07-26 16:43 UTC (permalink / raw)
  To: gcc-patches; +Cc: David Malcolm

This adds fix-it hints to c-format.c so that it can (sometimes) suggest
the format string the user should have used.
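To illustrate the idea of running the checker's tables "in reverse",
here is a minimal standalone sketch.  It is not the code from the
patch: the toy_* names, the enum and the flat table are hypothetical
stand-ins for GCC's format_kind_info data, and the real lookup walks
conversion specs and length modifiers rather than a single array.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the argument types that the
   real checker represents as trees.  */
enum toy_type
{
  TOY_INT, TOY_LONG, TOY_LONG_LONG, TOY_DOUBLE, TOY_CHAR_PTR, TOY_VOID
};

/* One row per (length modifier, conversion char) pair and the argument
   type it accepts.  Illustrative only; not GCC's table layout.  */
struct toy_conv
{
  char conv_char;
  const char *len_modifier;
  enum toy_type type;
};

static const struct toy_conv toy_printf_convs[] = {
  { 'd', "",   TOY_INT },
  { 'd', "l",  TOY_LONG },
  { 'd', "ll", TOY_LONG_LONG },
  { 'f', "",   TOY_DOUBLE },
  { 's', "",   TOY_CHAR_PTR },
};

/* Write into BUF the conversion that accepts type T, e.g. "%ld".
   Return 1 on success, or 0 if no row matches (BUF is left alone).  */
static int
toy_format_for_type (enum toy_type t, char *buf, size_t len)
{
  size_t n = sizeof toy_printf_convs / sizeof toy_printf_convs[0];
  for (size_t i = 0; i < n; i++)
    if (toy_printf_convs[i].type == t)
      {
	snprintf (buf, len, "%%%s%c",
		  toy_printf_convs[i].len_modifier,
		  toy_printf_convs[i].conv_char);
	return 1;
      }
  return 0;
}
```

With this table, a lookup for TOY_LONG yields "%ld", and a type with no
row (TOY_VOID here) yields no suggestion, which corresponds to the
"(sometimes)" above.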

The patch adds selftests for the new code in c-format.c.  These
selftests are thus lang-specific.  This is the first time we've had
lang-specific selftests, and hence the patch also adds a langhook for
running them.  (Note that currently the Makefile only invokes the
selftests for cc1).
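The langhook mechanism can be sketched in isolation roughly as follows.
This is a toy model under stated assumptions, not the GCC headers: all
toy_* and TOY_* names are hypothetical, and the two tables live in one
file only so that the default and the override can be compared.

```c
#include <assert.h>

/* Hypothetical miniature of a hooks table with a single entry.  */
struct toy_lang_hooks
{
  void (*run_lang_selftests) (void);
};

/* Counts how often the "front end" selftests were invoked.  */
static int toy_tests_run = 0;

/* Default hook: does nothing.  */
static void
toy_do_nothing (void)
{
}

/* Hook supplied by a front end that has its own selftests.  */
static void
toy_run_c_tests (void)
{
  toy_tests_run++;
}

/* Defaults, in the style of a "-def.h" header.  The initializer macro
   is expanded at its point of use, so it picks up whichever definition
   of the per-hook macro is current there.  */
#define TOY_HOOKS_RUN_LANG_SELFTESTS toy_do_nothing
#define TOY_HOOKS_INITIALIZER { TOY_HOOKS_RUN_LANG_SELFTESTS }

/* A language that keeps the default.  */
static struct toy_lang_hooks toy_default_hooks = TOY_HOOKS_INITIALIZER;

/* A language that overrides the hook before using the initializer.  */
#undef TOY_HOOKS_RUN_LANG_SELFTESTS
#define TOY_HOOKS_RUN_LANG_SELFTESTS toy_run_c_tests
static struct toy_lang_hooks toy_c_hooks = TOY_HOOKS_INITIALIZER;

/* Language-independent driver: calls through the table without knowing
   which front end filled it in.  */
static void
toy_run_tests (const struct toy_lang_hooks *hooks)
{
  hooks->run_lang_selftests ();
}
```

The point of the indirection is that the shared selftest driver never
names a front-end function directly; a language without selftests of
its own simply inherits the do-nothing default.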

Successfully bootstrapped&regrtested in conjunction with the rest of the
patch kit on x86_64-pc-linux-gnu.

Successful selftest run for stage 1 on powerpc-ibm-aix7.1.3.0 (gcc111)
in conjunction with the rest of the patch kit.

config-list.mk test run is in progress.

OK for trunk if it passes testing?

gcc/c-family/ChangeLog:
	PR c/64955
	* c-common.h (selftest::c_format_c_tests): New declaration.
	(selftest::run_c_tests): New declaration.
	* c-format.c: Include "selftest.h".
	(format_warning_va): Add param "corrected_substring" and use
	it to add a replacement fix-it hint.
	(format_warning_at_substring): Likewise.
	(format_warning_at_char): Update for new param of
	format_warning_va.
	(check_format_info_main): Pass "fki" to check_format_types.
	(check_format_types): Add param "fki" and pass it to
	format_type_warning.
	(deref_n_times): New function.
	(get_modifier_for_format_len): New function.
	(selftest::test_get_modifier_for_format_len): New function.
	(get_format_for_type): New function.
	(format_type_warning): Add param "fki" and use it to attempt
	to provide hints for argument types when calling
	format_warning_at_substring.
	(selftest::get_info): New function.
	(selftest::assert_format_for_type_streq): New function.
	(ASSERT_FORMAT_FOR_TYPE_STREQ): New macro.
	(selftest::test_get_format_for_type_printf): New function.
	(selftest::test_get_format_for_type_scanf): New function.
	(selftest::c_format_c_tests): New function.

gcc/c/ChangeLog:
	PR c/64955
	* c-lang.c (LANG_HOOKS_RUN_LANG_SELFTESTS): If CHECKING_P, wire
	this up to selftest::run_c_tests.
	(selftest::run_c_tests): New function.

gcc/ChangeLog:
	PR c/64955
	* langhooks-def.h (LANG_HOOKS_RUN_LANG_SELFTESTS): New default
	do-nothing langhook.
	(LANG_HOOKS_INITIALIZER): Add LANG_HOOKS_RUN_LANG_SELFTESTS.
	* langhooks.h (struct lang_hooks): Add run_lang_selftests.
	* selftest-run-tests.c: Include "tree.h" and "langhooks.h".
	(selftest::run_tests): Call lang_hooks.run_lang_selftests.

gcc/testsuite/ChangeLog:
	PR c/64955
	* gcc.dg/format/diagnostic-ranges.c: Add fix-it hints to expected
	output.
---
 gcc/c-family/c-common.h                         |   7 +
 gcc/c-family/c-format.c                         | 268 ++++++++++++++++++++++--
 gcc/c/c-lang.c                                  |  22 ++
 gcc/langhooks-def.h                             |   4 +-
 gcc/langhooks.h                                 |   3 +
 gcc/selftest-run-tests.c                        |   5 +
 gcc/testsuite/gcc.dg/format/diagnostic-ranges.c |  30 ++-
 7 files changed, 319 insertions(+), 20 deletions(-)

diff --git a/gcc/c-family/c-common.h b/gcc/c-family/c-common.h
index 7b5da57..61f9ced 100644
--- a/gcc/c-family/c-common.h
+++ b/gcc/c-family/c-common.h
@@ -1533,4 +1533,11 @@ extern bool valid_array_size_p (location_t, tree, tree);
 extern bool cilk_ignorable_spawn_rhs_op (tree);
 extern bool cilk_recognize_spawn (tree, tree *);
 
+#if CHECKING_P
+namespace selftest {
+  extern void c_format_c_tests (void);
+  extern void run_c_tests (void);
+} // namespace selftest
+#endif /* #if CHECKING_P */
+
 #endif /* ! GCC_C_COMMON_H */
diff --git a/gcc/c-family/c-format.c b/gcc/c-family/c-format.c
index 5b79588..f5a4011 100644
--- a/gcc/c-family/c-format.c
+++ b/gcc/c-family/c-format.c
@@ -30,6 +30,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "langhooks.h"
 #include "c-format.h"
 #include "diagnostic.h"
+#include "selftest.h"
 
 /* Handle attributes associated with format checking.  */
 
@@ -126,11 +127,21 @@ static int format_flags (int format_num);
      printf(fmt, msg);
             ^~~  ~~~
 
+   If CORRECTED_SUBSTRING is non-NULL, use it for cases 1 and 2 to provide
+   a fix-it hint, suggesting that it should replace the text within the
+   substring range.  For example:
+
+     test.c:90:10: warning: problem with '%i' here [-Wformat=]
+     printf ("hello %i", msg);
+                    ~^
+                    %s
+
    Return true if a warning was emitted, false otherwise.  */
 
-ATTRIBUTE_GCC_DIAG (4,0)
+ATTRIBUTE_GCC_DIAG (5,0)
 static bool
 format_warning_va (const substring_loc &fmt_loc, source_range *param_range,
+		   const char *corrected_substring,
 		   int opt, const char *gmsgid, va_list *ap)
 {
   bool substring_within_range = false;
@@ -174,6 +185,9 @@ format_warning_va (const substring_loc &fmt_loc, source_range *param_range,
       richloc.add_range (param_loc, false);
     }
 
+  if (!err && corrected_substring && substring_within_range)
+    richloc.add_fixit_replace (fmt_substring_range, corrected_substring);
+
   diagnostic_info diagnostic;
   diagnostic_set_info (&diagnostic, gmsgid, ap, &richloc, DK_WARNING);
   diagnostic.option_index = opt;
@@ -182,22 +196,31 @@ format_warning_va (const substring_loc &fmt_loc, source_range *param_range,
   if (!err && substring_loc && !substring_within_range)
     /* Case 2.  */
     if (warned)
-      inform (substring_loc, "format string is defined here");
+      {
+	rich_location substring_richloc (line_table, substring_loc);
+	if (corrected_substring)
+	  substring_richloc.add_fixit_replace (fmt_substring_range,
+					       corrected_substring);
+	inform_at_rich_loc (&substring_richloc,
+			    "format string is defined here");
+      }
 
   return warned;
 }
 
 /* Variadic call to format_warning_va.  */
 
-ATTRIBUTE_GCC_DIAG (4,0)
+ATTRIBUTE_GCC_DIAG (5,0)
 static bool
 format_warning_at_substring (const substring_loc &fmt_loc,
 			     source_range *param_range,
+			     const char *corrected_substring,
 			     int opt, const char *gmsgid, ...)
 {
   va_list ap;
   va_start (ap, gmsgid);
-  bool warned = format_warning_va (fmt_loc, param_range, opt, gmsgid, &ap);
+  bool warned = format_warning_va (fmt_loc, param_range, corrected_substring,
+				   opt, gmsgid, &ap);
   va_end (ap);
 
   return warned;
@@ -225,7 +248,7 @@ format_warning_at_char (location_t fmt_string_loc, tree format_string_cst,
   char_idx -= 1;
 
   substring_loc fmt_loc (fmt_string_loc, string_type, char_idx, char_idx);
-  bool warned = format_warning_va (fmt_loc, NULL, opt, gmsgid, &ap);
+  bool warned = format_warning_va (fmt_loc, NULL, NULL, opt, gmsgid, &ap);
   va_end (ap);
 
   return warned;
@@ -1126,11 +1149,13 @@ static const format_flag_spec *get_flag_spec (const format_flag_spec *,
 					      int, const char *);
 
 static void check_format_types (const substring_loc &fmt_loc,
-				format_wanted_type *);
+				format_wanted_type *,
+				const format_kind_info *fki);
 static void format_type_warning (const substring_loc &fmt_loc,
 				 source_range *param_range,
 				 format_wanted_type *, tree,
-				 tree);
+				 tree,
+				 const format_kind_info *fki);
 
 /* Decode a format type from a string, returning the type, or
    format_type_error if not valid, in which case the caller should print an
@@ -2534,7 +2559,7 @@ check_format_info_main (format_check_results *res,
 	  ptrdiff_t offset_to_format_end = (format_chars - 1) - orig_format_chars;
 	  substring_loc fmt_loc (fmt_param_loc, TREE_TYPE (format_string_cst),
 				 offset_to_format_start, offset_to_format_end);
-	  check_format_types (fmt_loc, first_wanted_type);
+	  check_format_types (fmt_loc, first_wanted_type, fki);
 	}
     }
 
@@ -2558,7 +2583,7 @@ check_format_info_main (format_check_results *res,
    location of the format conversion.  */
 static void
 check_format_types (const substring_loc &fmt_loc,
-		    format_wanted_type *types)
+		    format_wanted_type *types, const format_kind_info *fki)
 {
   for (; types != 0; types = types->next)
     {
@@ -2585,7 +2610,7 @@ check_format_types (const substring_loc &fmt_loc,
       cur_param = types->param;
       if (!cur_param)
         {
-          format_type_warning (fmt_loc, NULL, types, wanted_type, NULL);
+	  format_type_warning (fmt_loc, NULL, types, wanted_type, NULL, fki);
           continue;
         }
 
@@ -2670,7 +2695,7 @@ check_format_types (const substring_loc &fmt_loc,
 	  else
 	    {
 	      format_type_warning (fmt_loc, param_range_ptr,
-				   types, wanted_type, orig_cur_type);
+				   types, wanted_type, orig_cur_type, fki);
 	      break;
 	    }
 	}
@@ -2739,10 +2764,115 @@ check_format_types (const substring_loc &fmt_loc,
 	continue;
       /* Now we have a type mismatch.  */
       format_type_warning (fmt_loc, param_range_ptr, types,
-			   wanted_type, orig_cur_type);
+			   wanted_type, orig_cur_type, fki);
+    }
+}
+
+/* Given type TYPE, attempt to dereference the type N times
+   (e.g. from ("int ***", 2) to "int *")
+
+   Return the dereferenced type, with any qualifiers
+   such as "const" stripped from the result, or
+   NULL if unsuccessful (e.g. TYPE is not a pointer type).  */
+
+static tree
+deref_n_times (tree type, int n)
+{
+  gcc_assert (type);
+
+  for (int i = n; i > 0; i--)
+    {
+      if (TREE_CODE (type) != POINTER_TYPE)
+	return NULL_TREE;
+      type = TREE_TYPE (type);
     }
+  /* Strip off any "const" etc.  */
+  return build_qualified_type (type, 0);
 }
 
+/* Lookup the format code for FORMAT_LEN within FLI,
+   returning the string code for expressing it, or NULL
+   if it is not found.  */
+
+static const char *
+get_modifier_for_format_len (const format_length_info *fli,
+			     enum format_lengths format_len)
+{
+  for (; fli->name; fli++)
+    {
+      if (fli->index == format_len)
+	return fli->name;
+      if (fli->double_index == format_len)
+	return fli->double_name;
+    }
+  return NULL;
+}
+
+#if CHECKING_P
+
+namespace selftest {
+
+static void
+test_get_modifier_for_format_len ()
+{
+  ASSERT_STREQ ("h",
+		get_modifier_for_format_len (printf_length_specs, FMT_LEN_h));
+  ASSERT_STREQ ("hh",
+		get_modifier_for_format_len (printf_length_specs, FMT_LEN_hh));
+  ASSERT_STREQ ("L",
+		get_modifier_for_format_len (printf_length_specs, FMT_LEN_L));
+  ASSERT_EQ (NULL,
+	     get_modifier_for_format_len (printf_length_specs, FMT_LEN_none));
+}
+
+} // namespace selftest
+
+#endif /* CHECKING_P */
+
+/* Generate a string containing the format string that should be
+   used to format arguments of type ARG_TYPE within FKI (effectively
+   the inverse of the checking code).
+
+   If successful, returns a non-NULL string which should be freed
+   by the caller.
+   Otherwise, returns NULL.  */
+
+static char *
+get_format_for_type (const format_kind_info *fki, tree arg_type)
+{
+  gcc_assert (arg_type);
+
+  const format_char_info *spec;
+  for (spec = &fki->conversion_specs[0];
+       spec->format_chars;
+       spec++)
+    {
+      tree effective_arg_type = deref_n_times (arg_type,
+					       spec->pointer_count);
+      if (!effective_arg_type)
+	continue;
+      for (int i = 0; i < FMT_LEN_MAX; i++)
+	{
+	  const format_type_detail *ftd = &spec->types[i];
+	  if (!ftd->type)
+	    continue;
+	  if (TYPE_CANONICAL (*ftd->type)
+	      == TYPE_CANONICAL (effective_arg_type))
+	    {
+	      const char *len_modifier
+		= get_modifier_for_format_len (fki->length_char_specs,
+					       (enum format_lengths)i);
+	      if (!len_modifier)
+		len_modifier = "";
+
+	      return xasprintf ("%%%s%c",
+				len_modifier,
+				spec->format_chars[0]);
+	    }
+	}
+    }
+  return NULL;
+}
 
 /* Give a warning at FMT_LOC about a format argument of different type
    from that expected.  If non-NULL, PARAM_RANGE is the source range of the
@@ -2756,9 +2886,10 @@ static void
 format_type_warning (const substring_loc &fmt_loc,
 		     source_range *param_range,
 		     format_wanted_type *type,
-		     tree wanted_type, tree arg_type)
+		     tree wanted_type, tree arg_type,
+		     const format_kind_info *fki)
 {
-  int kind = type->kind;
+  enum format_specifier_kind kind = type->kind;
   const char *wanted_type_name = type->wanted_type_name;
   const char *format_start = type->format_start;
   int format_length = type->format_length;
@@ -2797,12 +2928,18 @@ format_type_warning (const substring_loc &fmt_loc,
       p[pointer_count + 1] = 0;
     }
 
+  /* Attempt to provide hints for argument types, but not for field widths
+     and precisions.  */
+  char *format_for_type = NULL;
+  if (arg_type && kind == CF_KIND_FORMAT)
+    format_for_type = get_format_for_type (fki, arg_type);
+
   if (wanted_type_name)
     {
       if (arg_type)
 	format_warning_at_substring
 	  (fmt_loc, param_range,
-	   OPT_Wformat_,
+	   format_for_type, OPT_Wformat_,
 	   "%s %<%s%.*s%> expects argument of type %<%s%s%>, "
 	   "but argument %d has type %qT",
 	   gettext (kind_descriptions[kind]),
@@ -2812,7 +2949,7 @@ format_type_warning (const substring_loc &fmt_loc,
       else
 	format_warning_at_substring
 	  (fmt_loc, param_range,
-	   OPT_Wformat_,
+	   format_for_type, OPT_Wformat_,
 	   "%s %<%s%.*s%> expects a matching %<%s%s%> argument",
 	   gettext (kind_descriptions[kind]),
 	   (kind == CF_KIND_FORMAT ? "%" : ""),
@@ -2823,7 +2960,7 @@ format_type_warning (const substring_loc &fmt_loc,
       if (arg_type)
 	format_warning_at_substring
 	  (fmt_loc, param_range,
-	   OPT_Wformat_,
+	   format_for_type, OPT_Wformat_,
 	   "%s %<%s%.*s%> expects argument of type %<%T%s%>, "
 	   "but argument %d has type %qT",
 	   gettext (kind_descriptions[kind]),
@@ -2833,12 +2970,14 @@ format_type_warning (const substring_loc &fmt_loc,
       else
 	format_warning_at_substring
 	  (fmt_loc, param_range,
-	   OPT_Wformat_,
+	   format_for_type, OPT_Wformat_,
 	   "%s %<%s%.*s%> expects a matching %<%T%s%> argument",
 	   gettext (kind_descriptions[kind]),
 	   (kind == CF_KIND_FORMAT ? "%" : ""),
 	   format_length, format_start, wanted_type, p);
     }
+
+  free (format_for_type);
 }
 
 
@@ -3359,3 +3498,96 @@ handle_format_attribute (tree *node, tree ARG_UNUSED (name), tree args,
 
   return NULL_TREE;
 }
+
+#if CHECKING_P
+
+namespace selftest {
+
+/* Selftests of location handling.  */
+
+/* Get the format_kind_info with the given name.  */
+
+static const format_kind_info *
+get_info (const char *name)
+{
+  int idx = decode_format_type (name);
+  const format_kind_info *fki = &format_types[idx];
+  ASSERT_STREQ (fki->name, name);
+  return fki;
+}
+
+/* Verify that get_format_for_type (FKI, TYPE) is EXPECTED_FORMAT.  */
+
+static void
+assert_format_for_type_streq (const location &loc, const format_kind_info *fki,
+			      const char *expected_format, tree type)
+{
+  gcc_assert (fki);
+  gcc_assert (expected_format);
+  gcc_assert (type);
+
+  char *actual_format = get_format_for_type (fki, type);
+  ASSERT_STREQ_AT (loc, expected_format, actual_format);
+  free (actual_format);
+}
+
+/* Selftests for get_format_for_type.  */
+
+#define ASSERT_FORMAT_FOR_TYPE_STREQ(EXPECTED_FORMAT, TYPE) \
+  assert_format_for_type_streq (SELFTEST_LOCATION, (fki), (EXPECTED_FORMAT), (TYPE))
+
+/* Selftest for get_format_for_type for "printf"-style functions.  */
+
+static void
+test_get_format_for_type_printf ()
+{
+  const format_kind_info *fki = get_info ("gnu_printf");
+  ASSERT_NE (fki, NULL);
+
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%f", double_type_node);
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%Lf", long_double_type_node);
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%d", integer_type_node);
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%o", unsigned_type_node);
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%ld", long_integer_type_node);
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%lo", long_unsigned_type_node);
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%lld", long_long_integer_type_node);
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%llo", long_long_unsigned_type_node);
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%s", build_pointer_type (char_type_node));
+}
+
+/* Selftest for get_format_for_type for "scanf"-style functions.  */
+
+static void
+test_get_format_for_type_scanf ()
+{
+  const format_kind_info *fki = get_info ("gnu_scanf");
+  ASSERT_NE (fki, NULL);
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%d", build_pointer_type (integer_type_node));
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%u", build_pointer_type (unsigned_type_node));
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%ld",
+				build_pointer_type (long_integer_type_node));
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%lu",
+				build_pointer_type (long_unsigned_type_node));
+  ASSERT_FORMAT_FOR_TYPE_STREQ
+    ("%lld", build_pointer_type (long_long_integer_type_node));
+  ASSERT_FORMAT_FOR_TYPE_STREQ
+    ("%llu", build_pointer_type (long_long_unsigned_type_node));
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%e", build_pointer_type (float_type_node));
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%le", build_pointer_type (double_type_node));
+}
+
+#undef ASSERT_FORMAT_FOR_TYPE_STREQ
+
+/* Run all of the selftests within this file.  */
+
+void
+c_format_c_tests ()
+{
+  test_get_modifier_for_format_len ();
+  test_get_format_for_type_printf ();
+  test_get_format_for_type_scanf ();
+}
+
+} // namespace selftest
+
+#endif /* CHECKING_P */
diff --git a/gcc/c/c-lang.c b/gcc/c/c-lang.c
index 89954b7..b26be6a 100644
--- a/gcc/c/c-lang.c
+++ b/gcc/c/c-lang.c
@@ -38,7 +38,29 @@ enum c_language_kind c_language = clk_c;
 #undef LANG_HOOKS_INIT_TS
 #define LANG_HOOKS_INIT_TS c_common_init_ts
 
+#if CHECKING_P
+#undef LANG_HOOKS_RUN_LANG_SELFTESTS
+#define LANG_HOOKS_RUN_LANG_SELFTESTS selftest::run_c_tests
+#endif /* #if CHECKING_P */
+
 /* Each front end provides its own lang hook initializer.  */
 struct lang_hooks lang_hooks = LANG_HOOKS_INITIALIZER;
 
+#if CHECKING_P
+
+namespace selftest {
+
+/* Implementation of LANG_HOOKS_RUN_LANG_SELFTESTS for the C frontend.  */
+
+void
+run_c_tests (void)
+{
+  c_format_c_tests ();
+}
+
+} // namespace selftest
+
+#endif /* #if CHECKING_P */
+
+
 #include "gtype-c.h"
diff --git a/gcc/langhooks-def.h b/gcc/langhooks-def.h
index 034b3b7..c17f998 100644
--- a/gcc/langhooks-def.h
+++ b/gcc/langhooks-def.h
@@ -120,6 +120,7 @@ extern bool lhd_omp_mappable_type (tree);
 #define LANG_HOOKS_BLOCK_MAY_FALLTHRU	hook_bool_const_tree_true
 #define LANG_HOOKS_EH_USE_CXA_END_CLEANUP	false
 #define LANG_HOOKS_DEEP_UNSHARING	false
+#define LANG_HOOKS_RUN_LANG_SELFTESTS   lhd_do_nothing
 
 /* Attribute hooks.  */
 #define LANG_HOOKS_ATTRIBUTE_TABLE		NULL
@@ -319,7 +320,8 @@ extern void lhd_end_section (void);
   LANG_HOOKS_EH_PROTECT_CLEANUP_ACTIONS, \
   LANG_HOOKS_BLOCK_MAY_FALLTHRU, \
   LANG_HOOKS_EH_USE_CXA_END_CLEANUP, \
-  LANG_HOOKS_DEEP_UNSHARING \
+  LANG_HOOKS_DEEP_UNSHARING, \
+  LANG_HOOKS_RUN_LANG_SELFTESTS \
 }
 
 #endif /* GCC_LANG_HOOKS_DEF_H */
diff --git a/gcc/langhooks.h b/gcc/langhooks.h
index 0593424..169a678 100644
--- a/gcc/langhooks.h
+++ b/gcc/langhooks.h
@@ -505,6 +505,9 @@ struct lang_hooks
      gimplification.  */
   bool deep_unsharing;
 
+  /* Run all lang-specific selftests.  */
+  void (*run_lang_selftests) (void);
+
   /* Whenever you add entries here, make sure you adjust langhooks-def.h
      and langhooks.c accordingly.  */
 };
diff --git a/gcc/selftest-run-tests.c b/gcc/selftest-run-tests.c
index 85e101d..9d75a8e 100644
--- a/gcc/selftest-run-tests.c
+++ b/gcc/selftest-run-tests.c
@@ -21,6 +21,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "system.h"
 #include "coretypes.h"
 #include "selftest.h"
+#include "tree.h"
+#include "langhooks.h"
 
 /* This function needed to be split out from selftest.c as it references
    tests from the whole source tree, and so is within
@@ -70,6 +72,9 @@ selftest::run_tests ()
   /* This one relies on most of the above.  */
   function_tests_c_tests ();
 
+  /* Run any lang-specific selftests.  */
+  lang_hooks.run_lang_selftests ();
+
   /* Finished running tests.  */
   long finish_time = get_run_time ();
   long elapsed_time = finish_time - start_time;
diff --git a/gcc/testsuite/gcc.dg/format/diagnostic-ranges.c b/gcc/testsuite/gcc.dg/format/diagnostic-ranges.c
index 9e86b52..ff51833 100644
--- a/gcc/testsuite/gcc.dg/format/diagnostic-ranges.c
+++ b/gcc/testsuite/gcc.dg/format/diagnostic-ranges.c
@@ -12,6 +12,25 @@ void test_mismatching_types (const char *msg)
 /* { dg-begin-multiline-output "" }
    printf("hello %i", msg);
                  ~^
+                 %s
+   { dg-end-multiline-output "" } */
+
+
+  printf("hello %s", 42);  /* { dg-warning "format '%s' expects argument of type 'char \\*', but argument 2 has type 'int'" } */
+/* TODO: ideally would also underline "42".  */
+/* { dg-begin-multiline-output "" }
+   printf("hello %s", 42);
+                 ~^
+                 %d
+   { dg-end-multiline-output "" } */
+
+
+  printf("hello %i", (long)0);  /* { dg-warning "format '%i' expects argument of type 'int', but argument 2 has type 'long int' " } */
+/* TODO: ideally would also underline the argument.  */
+/* { dg-begin-multiline-output "" }
+   printf("hello %i", (long)0);
+                 ~^
+                 %ld
    { dg-end-multiline-output "" } */
 }
 
@@ -23,6 +42,7 @@ void test_multiple_arguments (void)
 /* { dg-begin-multiline-output "" }
    printf ("arg0: %i  arg1: %s arg 2: %i",
                             ~^
+                            %d
    { dg-end-multiline-output "" } */
 }
 
@@ -33,6 +53,7 @@ void test_multiple_arguments_2 (int i, int j)
 /* { dg-begin-multiline-output "" }
    printf ("arg0: %i  arg1: %s arg 2: %i",
                             ~^
+                            %d
            100, i + j, 102);
                 ~~~~~         
    { dg-end-multiline-output "" } */
@@ -67,6 +88,7 @@ void test_hex (const char *msg)
 /* { dg-begin-multiline-output "" }
    printf("hello \x25\x69", msg);
                  ~~~~~~~^
+                 %s
    { dg-end-multiline-output "" } */
 }
 
@@ -80,6 +102,7 @@ void test_oct (const char *msg)
 /* { dg-begin-multiline-output "" }
    printf("hello \045\151", msg);
                  ~~~~~~~^
+                 %s
    { dg-end-multiline-output "" } */
 }
 
@@ -98,6 +121,7 @@ void test_multiple (const char *msg)
 /* { dg-begin-multiline-output "" }
    printf("prefix"  "\x25"  "\151"  "suffix",
                      ~~~~~~~~~~~^
+                     %s
   { dg-end-multiline-output "" } */
 }
 
@@ -108,6 +132,7 @@ void test_u8 (const char *msg)
 /* { dg-begin-multiline-output "" }
    printf(u8"hello %i", msg);
                    ~^
+                   %s
    { dg-end-multiline-output "" } */
 }
 
@@ -117,6 +142,7 @@ void test_param (long long_i, long long_j)
 /* { dg-begin-multiline-output "" }
    printf ("foo %s bar", long_i + long_j);
                 ~^       ~~~~~~~~~~~~~~~
+                %ld
    { dg-end-multiline-output "" } */
 }
 
@@ -192,13 +218,14 @@ void test_macro (const char *msg)
 /* { dg-begin-multiline-output "" }
  #define INT_FMT "%i"
                   ~^
+                  %s
    { dg-end-multiline-output "" } */
 }
 
 void test_non_contiguous_strings (void)
 {
   __builtin_printf(" %" "d ", 0.5); /* { dg-warning "20: format .%d. expects argument of type .int., but argument 2 has type .double." } */
-                                    /* { dg-message "26: format string is defined here" "" { target *-*-* } 200 } */
+                                    /* { dg-message "26: format string is defined here" "" { target *-*-* } 227 } */
   /* { dg-begin-multiline-output "" }
    __builtin_printf(" %" "d ", 0.5);
                     ^~~~
@@ -206,6 +233,7 @@ void test_non_contiguous_strings (void)
   /* { dg-begin-multiline-output "" }
    __builtin_printf(" %" "d ", 0.5);
                       ~~~~^
+                      %f
    { dg-end-multiline-output "" } */
 }
 
-- 
1.8.5.3

* Re: [PATCH 1/3] (v2) On-demand locations within string-literals
  2016-07-26 16:43     ` [PATCH 1/3] (v2) " David Malcolm
  2016-07-26 16:43       ` [PATCH 3/3] c-format.c: suggest the correct format string to use (PR c/64955) David Malcolm
  2016-07-26 16:43       ` [PATCH 2/3] Use class substring_loc in c-format.c (PR c/52952) David Malcolm
@ 2016-07-26 18:06       ` Manuel López-Ibáñez
  2016-07-27 14:30         ` David Malcolm
  2016-07-29 21:42       ` Joseph Myers
  3 siblings, 1 reply; 61+ messages in thread
From: Manuel López-Ibáñez @ 2016-07-26 18:06 UTC (permalink / raw)
  To: David Malcolm, GCC Patches

On 26/07/16 18:11, David Malcolm wrote:

> gcc/ChangeLog:
> 	* gcc.c (cpp_options): Rename string to...
> 	(cpp_options_): ...this, to avoid clashing with struct in
> 	cpplib.h.

It seems to me that you need this because  now gcc.c includes cpplib.h via 
input.h, which seems wrong.

input.h was FE-independent (it depends on line-map.h but it is an accident of 
history that line-map.h is in libcpp since it doesn't depend on anything from 
libcpp [*]). Note that input.h is included in coretypes.h, so this means that 
now cpplib.h is included almost everywhere! [**]

There is the following in coretypes.h:

/* Provide forward struct declaration so that we don't have to include
    all of cpplib.h whenever a random prototype includes a pointer.
    Note that the cpp_reader and cpp_token typedefs remain part of
    cpplib.h.  */

struct cpp_reader;
struct cpp_token;

precisely to avoid including cpplib.h.
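As a minimal illustration of why the forward declaration is enough,
consider the sketch below.  The toy_* names are hypothetical and the
"header" and "implementation" parts are shown in one file for brevity;
the point is only that code which passes a pointer around does not
need the full definition of the struct.

```c
#include <assert.h>
#include <stddef.h>

/* "Header" part: an incomplete type plus a prototype that only
   mentions a pointer to it.  No definition of the struct is needed
   to compile anything up to the marker below.  */
struct toy_reader;
size_t toy_reader_line (const struct toy_reader *r);

/* A client that only hands the pointer through; it compiles against
   the forward declaration alone.  */
static size_t
toy_next_line (const struct toy_reader *r)
{
  return toy_reader_line (r) + 1;
}

/* "Implementation" part: only code that looks inside the struct needs
   the complete type (in the real tree, that would mean including the
   full header).  */
struct toy_reader
{
  size_t line;
};

size_t
toy_reader_line (const struct toy_reader *r)
{
  return r->line;
}
```

A declaration that takes the type by value, or an enum declared in the
other header, cannot be handled this way in C, which is presumably
where the extra include comes from.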

If I understand correctly, cpplib.h is needed in input.h because of this 
declaration:

+extern const char *get_source_range_for_substring (cpp_reader *pfile,
+						   string_concat_db *concats,
+						   location_t strloc,
+						   enum cpp_ttype type,
+						   int start_idx, int end_idx,
+						   source_range *out_range);

Does this really need to be in input.h ?  It seems something that only C-family 
languages will be able to use. Note that you need a reader to use this 
function, and for that, you need to already include cpplib.h.

Perhaps it could live for now in c-format.c, since it is the only place using it?

Cheers,

	Manuel.

[*] In an ideal world, we would have a language-agnostic diagnostics library 
that would include line-map and that would be used by libcpp and the rest of 
GCC, so that we can remove all the error-routines in libcpp and the awkward 
glue code that ties it into diagnostics.c.

[**] And it seems that we are slowly undoing all the work that was done by 
Andrew MacLeod to clean up the .h web and remove dependencies 
(https://gcc.gnu.org/wiki/rearch).

* Re: [PATCH 1/3] (v2) On-demand locations within string-literals
  2016-07-26 18:06       ` [PATCH 1/3] (v2) On-demand locations within string-literals Manuel López-Ibáñez
@ 2016-07-27 14:30         ` David Malcolm
  2016-07-27 22:42           ` Manuel López-Ibáñez
  0 siblings, 1 reply; 61+ messages in thread
From: David Malcolm @ 2016-07-27 14:30 UTC (permalink / raw)
  To: Manuel López-Ibáñez, GCC Patches; +Cc: Martin Sebor

[-- Attachment #1: Type: text/plain, Size: 3434 bytes --]

On Tue, 2016-07-26 at 19:05 +0100, Manuel López-Ibáñez wrote:
> On 26/07/16 18:11, David Malcolm wrote:
> 
> > gcc/ChangeLog:
> > 	* gcc.c (cpp_options): Rename string to...
> > 	(cpp_options_): ...this, to avoid clashing with struct in
> > 	cpplib.h.
> 
> It seems to me that you need this because  now gcc.c includes
> cpplib.h via 
> input.h, which seems wrong.
> 
> input.h was FE-independent (it depends on line-map.h but it is an
> accident of 
> history that line-map.h is in libcpp since it doesn't depend on
> anything from 
> libcpp [*]). Note that input.h is included in coretypes.h, so this
> means that 
> now cpplib.h is included almost everywhere! [**]
> 
> There is the following in coretypes.h:
> 
> /* Provide forward struct declaration so that we don't have to
> include
>     all of cpplib.h whenever a random prototype includes a pointer.
>     Note that the cpp_reader and cpp_token typedefs remain part of
>     cpplib.h.  */
> 
> struct cpp_reader;
> struct cpp_token;
> 
> precisely to avoid including cpplib.h.
> 
> 
> If I understand correctly, cpplib.h is needed in input.h because of
> this 
> declaration:
> 
> +extern const char *get_source_range_for_substring (cpp_reader
> *pfile,
> +						   string_concat_db
> *concats,
> +						   location_t
> strloc,
> +						   enum cpp_ttype
> type,
> +						   int start_idx,
> int end_idx,
> +						   source_range
> *out_range);
> 
> 
> Does this really need to be in input.h ?  It seems something that
> only C-family 
> languages will be able to use. Note that you need a reader to use
> this 
> function, and for that, you need to already include cpplib.h.

Fair point; the attached modification to patch 1 compiles cleanly, and
moves it to a new header.

> Perhaps it could live for now in c-format.c, since it is the only
> place using it?

Martin Sebor [CC-ed] wants to use it from the middle-end:
  https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01088.html
so it's unclear to me that c-format.c would be a better location.

There are various places it could live; but getting it working took a
lot of effort to achieve - the currently proposed mixture of libcpp,
input.c and c-format.c for the locations of the various pieces works
(for example, auto_vec isn't available in libcpp).

Given that both Martin and I have candidate patches that are touching
the same area, I'd prefer to focus on getting this code in to trunk,
rather than rewrite it out-of-tree, so that we can at least have the
improvement to location-handling for Wformat.  Once the code is in the
tree, it should be easier to figure out how to access it from the
middle-end.

> Cheers,
> 
> 	Manuel.
> 
> [*] In an ideal world, we would have a language-agnostic diagnostics
> library 
> that would include line-map and that would be used by libcpp and the
> rest of 
> GCC, so that we can remove all the error-routines in libcpp and the
> awkward 
> glue code that ties it into diagnostics.c.

Agreed, though that may have to wait until gcc 8 at this point.
(Given that the proposed diagnostics library would use line maps, and
would be used by libcpp, would it make sense to move the diagnostics
into libcpp itself?  Diagnostics would seem to be intimately related to
location-tracking)

> [**] And it seems that we are slowly undoing all the work that was
> done by 
> Andrew MacLeod to clean up the .h web and remove dependencies 
> (https://gcc.gnu.org/wiki/rearch).
> 
> 

[-- Attachment #2: 0001-Avoid-including-cpplib.h-from-input.h.patch --]
[-- Type: text/x-patch, Size: 2982 bytes --]

From 09824cb27c0e817b29de1c7eb9b53c603116f13e Mon Sep 17 00:00:00 2001
From: David Malcolm <dmalcolm@redhat.com>
Date: Wed, 27 Jul 2016 10:33:52 -0400
Subject: [PATCH] Avoid including cpplib.h from input.h

gcc/c-family/ChangeLog:
	* c-common.c: Include "substring-locations.h".

gcc/ChangeLog:
	* input.h: Don't include cpplib.h.
	(get_source_range_for_substring): Move to...
	* substring-locations.h: New header.
---
 gcc/c-family/c-common.c   |  1 +
 gcc/input.h               |  8 --------
 gcc/substring-locations.h | 30 ++++++++++++++++++++++++++++++
 3 files changed, 31 insertions(+), 8 deletions(-)
 create mode 100644 gcc/substring-locations.h

diff --git a/gcc/c-family/c-common.c b/gcc/c-family/c-common.c
index f4ffc0e..c4843db 100644
--- a/gcc/c-family/c-common.c
+++ b/gcc/c-family/c-common.c
@@ -45,6 +45,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-iterator.h"
 #include "opts.h"
 #include "gimplify.h"
+#include "substring-locations.h"
 
 cpp_reader *parse_in;		/* Declared in c-pragma.h.  */
 
diff --git a/gcc/input.h b/gcc/input.h
index 24d9115..c17e440 100644
--- a/gcc/input.h
+++ b/gcc/input.h
@@ -22,7 +22,6 @@ along with GCC; see the file COPYING3.  If not see
 #define GCC_INPUT_H
 
 #include "line-map.h"
-#include <cpplib.h>
 
 extern GTY(()) struct line_maps *line_table;
 
@@ -131,11 +130,4 @@ class GTY(()) string_concat_db
   hash_map <location_hash, string_concat *> *m_table;
 };
 
-extern const char *get_source_range_for_substring (cpp_reader *pfile,
-						   string_concat_db *concats,
-						   location_t strloc,
-						   enum cpp_ttype type,
-						   int start_idx, int end_idx,
-						   source_range *out_range);
-
 #endif
diff --git a/gcc/substring-locations.h b/gcc/substring-locations.h
new file mode 100644
index 0000000..274ebbe
--- /dev/null
+++ b/gcc/substring-locations.h
@@ -0,0 +1,30 @@
+/* Source locations within string literals.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#ifndef GCC_SUBSTRING_LOCATIONS_H
+#define GCC_SUBSTRING_LOCATIONS_H
+
+extern const char *get_source_range_for_substring (cpp_reader *pfile,
+						   string_concat_db *concats,
+						   location_t strloc,
+						   enum cpp_ttype type,
+						   int start_idx, int end_idx,
+						   source_range *out_range);
+
+#endif /* ! GCC_SUBSTRING_LOCATIONS_H */
-- 
1.8.5.3


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/3] (v2) On-demand locations within string-literals
  2016-07-27 14:30         ` David Malcolm
@ 2016-07-27 22:42           ` Manuel López-Ibáñez
  2016-07-28 20:12             ` David Malcolm
  0 siblings, 1 reply; 61+ messages in thread
From: Manuel López-Ibáñez @ 2016-07-27 22:42 UTC (permalink / raw)
  To: David Malcolm; +Cc: GCC Patches, Martin Sebor, Richard Biener

On 27 July 2016 at 15:30, David Malcolm <dmalcolm@redhat.com> wrote:
>> Perhaps it could live for now in c-format.c, since it is the only
>> place using it?
>
> Martin Sebor [CC-ed] wants to use it from the middle-end:
>   https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01088.html
> so it's unclear to me that c-format.c would be a better location.

Fine. He will have to figure out how to get a cpp_reader from the
middle-end, though.

> There are various places it could live; but getting it working took a
> lot of effort to achieve - the currently proposed mixture of libcpp,
> input.c and c-format.c for the locations of the various pieces works
> (for example, auto_vec isn't available in libcpp).

I don't doubt it. I tried to do something similar in the past and I
failed, this is why I ended up with the poor approximation that was in
place until now. This is a significant step forward.

Is libcpp still C? When would be the time to move it to C++ already
and start using common utilities?

Also, moving vec.h, sbitmap, etc to their own directory/library so
that they can be used by other parts of the compiler (hey! maybe even
by other parts of the toolchain?) is desirable. Richard has said in
the past that he supports such moves. Did I understand correctly,
Richard?

> Given that both Martin and I have candidate patches that are touching
> the same area, I'd prefer to focus on getting this code in to trunk,
> rather than rewrite it out-of-tree, so that we can at least have the
> improvement to location-handling for Wformat.  Once the code is in the
> tree, it should be easier to figure out how to access it from the
> middle-end.

Sure, I think this version is fine. I'm a big proponent of
step-by-step, even if the steps are only approximations to the optimal
solution :)
It may be enough to motivate someone else more capable to improve over
my poor approximations ;-)

>> [*] In an ideal world, we would have a language-agnostic diagnostics
>> library
>> that would include line-map and that would be used by libcpp and the
>> rest of
>> GCC, so that we can remove all the error-routines in libcpp and the
>> awkward
>> glue code that ties it into diagnostics.c.
>
> Agreed, though that may have to wait until gcc 8 at this point.
> (Given that the proposed diagnostics library would use line maps, and
> would be used by libcpp, would it make sense to move the diagnostics
> into libcpp itself?  Diagnostics would seem to be intimately related to
> location-tracking)

I don't think so. There is nothing in diagnostic.* pretty-print.*
input.* line-map.* that requires libcpp (and only two mentions of tree
that could be easily abstracted out). This was a deliberate design
goal of Gabriel and followed by most of us later working on
diagnostics. Of course, cpp may make use of the new library, but also
other parts of the toolchain (GAS?). The main obstacle I faced when
trying to do this move was the build machinery to make both libcpp and
gcc build and statically link with this new library.

Once that move is done, the main abstraction challenge to remove the
glue is that libcpp has its own flags for options and diagnostics that
are independent from those of gcc (see c_cpp_error in c-common.c). It
would be great if libcpp used the common flags, but then one would
have to figure out a way to reorder things so that the diagnostic
library, libcpp and gcc can use (or avoid being dependent on) the same
flags.

Cheers,

Manuel.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/3] (v2) On-demand locations within string-literals
  2016-07-27 22:42           ` Manuel López-Ibáñez
@ 2016-07-28 20:12             ` David Malcolm
  2016-07-28 20:38               ` Martin Sebor
  0 siblings, 1 reply; 61+ messages in thread
From: David Malcolm @ 2016-07-28 20:12 UTC (permalink / raw)
  To: Manuel López-Ibáñez
  Cc: GCC Patches, Martin Sebor, Richard Biener

On Wed, 2016-07-27 at 23:41 +0100, Manuel López-Ibáñez wrote:
> On 27 July 2016 at 15:30, David Malcolm <dmalcolm@redhat.com> wrote:
> > > Perhaps it could live for now in c-format.c, since it is the only
> > > place using it?
> > 
> > Martin Sebor [CC-ed] wants to use it from the middle-end:
> >   https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01088.html
> > so it's unclear to me that c-format.c would be a better location.
> 
> Fine. He will have to figure out how to get a cpp_reader from the
> middle-end, though.

It seems to me that on-demand reconstruction of source locations for
STRING_CST nodes is inherently frontend-specific: unless we have the
frontend record the information in some fe-independent way (which I
assume we *don't* want to do, for space-efficiency), we need to be able
to effectively re-run part of the frontend.

So maybe this needs to be a langhook; the c-family can use the global
cpp_reader * there, and everything else can return a "not supported"
code if a diagnostic requests substring location information (and the
diagnostic needs to be able to cope with that).
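
As a rough illustration of the hook idea (a sketch only: the typedefs
below are simplified stand-ins, and substring_range_hook,
default_substring_range, toy_c_substring_range and range_for_diagnostic
are hypothetical names, not anything in the patch), the hook could
return NULL on success or a message on failure, and the caller would
fall back to the location of the whole string:

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-ins for GCC types; not the real definitions.
typedef unsigned int location_t;
struct source_range { location_t m_start; location_t m_finish; };

// Hypothetical hook signature: returns NULL on success, or a message
// describing why the substring range is unavailable.
typedef const char *(*substring_range_hook) (location_t strloc,
                                             int start_idx, int end_idx,
                                             source_range *out_range);

// Default for frontends that cannot re-lex the literal.
static const char *
default_substring_range (location_t, int, int, source_range *)
{
  return "substring locations not supported";
}

// Toy stand-in for a C-family hook: pretends each byte of the literal
// occupies one location_t after the opening quote.
static const char *
toy_c_substring_range (location_t strloc, int start_idx, int end_idx,
                       source_range *out_range)
{
  if (start_idx < 0 || end_idx < start_idx)
    return "invalid index range";
  out_range->m_start = strloc + 1 + start_idx;
  out_range->m_finish = strloc + 1 + end_idx;
  return NULL;
}

// A diagnostic asks the hook for a precise range and falls back to the
// location of the whole string if the hook reports an error.
static source_range
range_for_diagnostic (substring_range_hook hook, location_t strloc,
                      int start_idx, int end_idx)
{
  assert (hook != NULL);
  source_range r;
  if (hook (strloc, start_idx, end_idx, &r) != NULL)
    r.m_start = r.m_finish = strloc;
  return r;
}
```

With the toy hook, range_for_diagnostic (toy_c_substring_range, 100, 3, 4)
yields the range 104-105; with the default hook the same call yields
100-100, i.e. the location of the whole string.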

> > There are various places it could live; but getting it working took
> > a
> > lot of effort to achieve - the currently proposed mixture of
> > libcpp,
> > input.c and c-format.c for the locations of the various pieces
> > works
> > (for example, auto_vec isn't available in libcpp).
> 
> I don't doubt it. I tried to do something similar in the past and I
> failed, this is why I ended up with the poor approximation that was
> in
> place until now. This is a significant step forward.

Thanks (for the current implementation, and for the kind words).

> Is libcpp still C? When would be the time to move it to C++ already
> and start using common utilities?

libcpp is very much C++: I converted the linemap types to use
inheritance as part of gcc 6 (and it helped a lot when implementing the
range-tracking stuff).

> Also, moving vec.h, sbitmap, etc to their own directory/library so
> that they can be used by other parts of the compiler (hey! maybe even
> by other parts of the toolchain?) is desirable. Richard has said in
> the past that he supports such moves. Did I understand correctly
> Richard?

FWIW, I'd want the selftest framework there too; part of the reason
things are in input.c rather than libcpp in the current patch is that
selftests aren't yet available from libcpp (and reworking that seems
orthogonal).

> > Given that both Martin and I have candidate patches that are
> > touching
> > the same area, I'd prefer to focus on getting this code in to
> > trunk,
> > rather than rewrite it out-of-tree, so that we can at least have
> > the
> > improvement to location-handling for Wformat.  Once the code is in
> > the
> > tree, it should be easier to figure out how to access it from the
> > middle-end.
> 
> Sure, I think this version is fine. I'm a big proponent of
> step-by-step, even if the steps are only approximations to the
> optimal
> solution :)
> It may be enough to motivate someone else more capable to improve
> over
> my poor approximations ;-)

:)

> > > [*] In an ideal world, we would have a language-agnostic
> > > diagnostics
> > > library
> > > that would include line-map and that would be used by libcpp and
> > > the
> > > rest of
> > > GCC, so that we can remove all the error-routines in libcpp and
> > > the
> > > awkward
> > > glue code that ties it into diagnostics.c.
> > 
> > Agreed, though that may have to wait until gcc 8 at this point.
> > (Given that the proposed diagnostics library would use line maps,
> > and
> > would be used by libcpp, would it make sense to move the
> > diagnostics
> > into libcpp itself?  Diagnostics would seem to be intimately
> > related to
> > location-tracking)
> 
> I don't think so. There is nothing in diagnostic.* pretty-print.*
> input.* line-map.* that requires libcpp (and only two mentions of
> tree
> that could be easily abstracted out). This was a deliberate design
> goal of Gabriel and followed by most of us later working on
> diagnostics. Of course, cpp may make use of the new library, but also
> other parts of the toolchain (GAS?). The main obstacle I faced when
> trying to do this move was the build machinery to make both libcpp
> and
> gcc build and statically link with this new library.
> 
> Once that move is done, the main abstraction challenge to remove the
> glue is that libcpp has its own flags for options and diagnostics
> that
> are independent from those of gcc (see c_cpp_error in c-common.c). It
> would be great if libcpp used the common flags, but then one would
> have to figure out a way to reorder things so that the diagnostic
> library, libcpp and gcc can use (or avoid being dependent on) the
> same
> flags.

Thanks.

Dave

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/3] (v2) On-demand locations within string-literals
  2016-07-28 20:12             ` David Malcolm
@ 2016-07-28 20:38               ` Martin Sebor
  2016-07-28 21:17                 ` Martin Sebor
  0 siblings, 1 reply; 61+ messages in thread
From: Martin Sebor @ 2016-07-28 20:38 UTC (permalink / raw)
  To: David Malcolm, Manuel López-Ibáñez
  Cc: GCC Patches, Richard Biener

On 07/28/2016 02:12 PM, David Malcolm wrote:
> On Wed, 2016-07-27 at 23:41 +0100, Manuel López-Ibáñez wrote:
>> On 27 July 2016 at 15:30, David Malcolm <dmalcolm@redhat.com> wrote:
>>>> Perhaps it could live for now in c-format.c, since it is the only
>>>> place using it?
>>>
>>> Martin Sebor [CC-ed] wants to use it from the middle-end:
>>>    https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01088.html
>>> so it's unclear to me that c-format.c would be a better location.
>>
>> Fine. He will have to figure out how to get a cpp_reader from the
>> middle-end, though.
>
> It seems to me that on-demand reconstruction of source locations for
> STRING_CST nodes is inherently frontend-specific: unless we have the
> frontend record the information in some fe-independent way (which I
> assume we *don't* want to do, for space-efficiency), we need to be able
> to effectively re-run part of the frontend.
>
> So maybe this needs to be a langhook; the c-family can use the global
> cpp_reader * there, and everything else can return a "not supported"
> code if a diagnostic requests substring location information (and the
> diagnostic needs to be able to cope with that).

The problem with the langhook approach, as I learned from my first
-Wformat-length attempt, is that it doesn't make the front end
implementation available to LTO.  So passes that run late enough
with LTO (like the latest version of the -Wformat-length pass
does) would not be able to make use of it.

Martin

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/3] (v2) On-demand locations within string-literals
  2016-07-28 20:38               ` Martin Sebor
@ 2016-07-28 21:17                 ` Martin Sebor
  2016-07-29 12:37                   ` David Malcolm
  2016-08-01 21:13                   ` Joseph Myers
  0 siblings, 2 replies; 61+ messages in thread
From: Martin Sebor @ 2016-07-28 21:17 UTC (permalink / raw)
  To: David Malcolm, Manuel López-Ibáñez
  Cc: GCC Patches, Richard Biener

On 07/28/2016 02:38 PM, Martin Sebor wrote:
> On 07/28/2016 02:12 PM, David Malcolm wrote:
>> On Wed, 2016-07-27 at 23:41 +0100, Manuel López-Ibáñez wrote:
>>> On 27 July 2016 at 15:30, David Malcolm <dmalcolm@redhat.com> wrote:
>>>>> Perhaps it could live for now in c-format.c, since it is the only
>>>>> place using it?
>>>>
>>>> Martin Sebor [CC-ed] wants to use it from the middle-end:
>>>>    https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01088.html
>>>> so it's unclear to me that c-format.c would be a better location.
>>>
>>> Fine. He will have to figure out how to get a cpp_reader from the
>>> middle-end, though.
>>
>> It seems to me that on-demand reconstruction of source locations for
>> STRING_CST nodes is inherently frontend-specific: unless we have the
>> frontend record the information in some fe-independent way (which I
>> assume we *don't* want to do, for space-efficiency), we need to be able
>> to effectively re-run part of the frontend.
>>
>> So maybe this needs to be a langhook; the c-family can use the global
>> cpp_reader * there, and everything else can return a "not supported"
>> code if a diagnostic requests substring location information (and the
>> diagnostic needs to be able to cope with that).
>
> The problem with the langhook approach, as I learned from my first
> -Wformat-length attempt, is that it doesn't make the front end
> implementation available to LTO.  So passes that run late enough
> with LTO (like the latest version of the -Wformat-length pass
> does) would not be able to make use of it.

I'm sorry, I didn't mean to sound like I was dismissing the idea.
I agree that string processing is language and front-end specific.
Having the middle end call back into the front-end also seems like
the right thing to do, not just to make this case work, but others
like it as well.  So perhaps the problem to solve is how to teach
LTO to talk to the front end.  One way to do it would be to build
the front ends as shared libraries.

Martin

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/3] (v2) On-demand locations within string-literals
  2016-07-28 21:17                 ` Martin Sebor
@ 2016-07-29 12:37                   ` David Malcolm
  2016-07-29 14:22                     ` Martin Sebor
  2016-08-01 21:13                   ` Joseph Myers
  1 sibling, 1 reply; 61+ messages in thread
From: David Malcolm @ 2016-07-29 12:37 UTC (permalink / raw)
  To: Martin Sebor, Manuel López-Ibáñez
  Cc: GCC Patches, Richard Biener

On Thu, 2016-07-28 at 15:16 -0600, Martin Sebor wrote:
> On 07/28/2016 02:38 PM, Martin Sebor wrote:
> > On 07/28/2016 02:12 PM, David Malcolm wrote:
> > > On Wed, 2016-07-27 at 23:41 +0100, Manuel López-Ibáñez wrote:
> > > > On 27 July 2016 at 15:30, David Malcolm <dmalcolm@redhat.com>
> > > > wrote:
> > > > > > Perhaps it could live for now in c-format.c, since it is
> > > > > > the only
> > > > > > place using it?
> > > > > 
> > > > > Martin Sebor [CC-ed] wants to use it from the middle-end:
> > > > >    https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01088.html
> > > > > so it's unclear to me that c-format.c would be a better
> > > > > location.
> > > > 
> > > > Fine. He will have to figure out how to get a cpp_reader from
> > > > the
> > > > middle-end, though.
> > > 
> > > It seems to me that on-demand reconstruction of source locations
> > > for
> > > STRING_CST nodes is inherently frontend-specific: unless we have
> > > the
> > > frontend record the information in some fe-independent way (which
> > > I
> > > assume we *don't* want to do, for space-efficiency), we need to
> > > be able
> > > to effectively re-run part of the frontend.
> > > 
> > > So maybe this needs to be a langhook; the c-family can use the
> > > global
> > > cpp_reader * there, and everything else can return a "not
> > > supported"
> > > code if a diagnostic requests substring location information (and
> > > the
> > > diagnostic needs to be able to cope with that).
> > 
> > The problem with the langhook approach, as I learned from my first
> > -Wformat-length attempt, is that it doesn't make the front end
> > implementation available to LTO.  So passes that run late enough
> > with LTO (like the latest version of the -Wformat-length pass
> > does) would not be able to make use of it.
> 
> I'm sorry, I didn't mean to sound like I was dismissing the idea.
> I agree that string processing is language and front-end specific.
> Having the middle end call back into the front-end also seems like
> the right thing to do, not just to make this case work, but others
> like it as well.  So perhaps the problem to solve is how to teach
> LTO to talk to the front end.  One way to do it would be to build
> the front ends as shared libraries.

Turning frontends into shared libraries as a prerequisite would seem to
be imposing a significant burden on the patch.

Currently all that we need from the C family of frontends is the
cpp_reader and the string concatenation records.  I think we can
reconstruct the cpp_reader if we have the options, though presumably
that's per TU, so to support all this we'd need to capture e.g. the per
-TU encoding information in the LTO records, for the case where one TU
is UTF-8 encoded source to UTF-8 execution, and another TU is EBCDIC
-encoded source to UCS-4 execution (or whatever).  And there's an issue
if different TUs compiled the same header with different encoding
options.

Or... we could not bother.  This is a Quality of Implementation thing,
for improving diagnostics, and in each case, the diagnostic is required
to cope with substring location information not being available (and
the code I posted in patch 2 of the kit makes it trivial to handle that
case from a diagnostic).  So we could simply have LTO use the
fallback mode.

There are two high-level approaches I've tried:

(a) capture the substring location information in the lexer/parser in
the frontend as it runs, and store it somehow.

(b) regenerate it "on-demand" when a diagnostic needs it.

Approach (b) is inherently going to be prone to the LTO issues you
describe, but it avoids adding to the CPU cycles/memory consumption for
the common case of not needing the information. [1]
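
To make approach (b) concrete, here is a toy sketch of the on-demand
idea: re-scan the source line when a diagnostic asks, and report "not
supported" for anything the scanner doesn't understand.  This is not
the code in the patch (which goes through the cpp_reader and handles
concatenation, macros and encodings); columns_for_byte is a
hypothetical helper, columns are 0-based byte offsets within the line,
and only plain characters and four simple escapes are handled:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Given one line of source text and the 0-based column of a literal's
// opening quote, find the columns of the source characters that
// produced byte BYTE_IDX of the string.  Returns false ("not
// supported") for anything this toy scanner does not understand.
static bool
columns_for_byte (const std::string &line, size_t quote_col,
                  size_t byte_idx, size_t *first_col, size_t *last_col)
{
  assert (first_col != NULL && last_col != NULL);
  if (quote_col >= line.size () || line[quote_col] != '"')
    return false;
  size_t col = quote_col + 1;
  for (size_t byte = 0; col < line.size (); byte++)
    {
      if (line[col] == '"')
        return false;   // BYTE_IDX is past the end of the literal.
      size_t len = 1;
      if (line[col] == '\\')
        {
          // Only two-character escapes (\n, \t, \\, \") are handled.
          if (col + 1 >= line.size ())
            return false;
          char c = line[col + 1];
          if (c != 'n' && c != 't' && c != '\\' && c != '"')
            return false;
          len = 2;
        }
      if (byte == byte_idx)
        {
          *first_col = col;
          *last_col = col + len - 1;
          return true;
        }
      col += len;
    }
  return false;         // Unterminated literal on this line.
}
```

Nothing is stored per literal; the cost is only paid when a diagnostic
actually needs the columns, and every caller has to be prepared for a
false return.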

Is approach (b) acceptable?

Thanks
Dave

[1] with the exception of the string concatenation records, but I
believe those are tiny

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/3] (v2) On-demand locations within string-literals
  2016-07-29 12:37                   ` David Malcolm
@ 2016-07-29 14:22                     ` Martin Sebor
  2016-07-29 14:46                       ` David Malcolm
  0 siblings, 1 reply; 61+ messages in thread
From: Martin Sebor @ 2016-07-29 14:22 UTC (permalink / raw)
  To: David Malcolm, Manuel López-Ibáñez
  Cc: GCC Patches, Richard Biener

> Currently all that we need from the C family of frontends is the
> cpp_reader and the string concatenation records.  I think we can
> reconstruct the cpp_reader if we have the options, though presumably
> that's per TU, so to support all this we'd need to capture e.g. the per
> -TU encoding information in the LTO records, for the case where one TU
> is UTF-8 encoded source to UTF-8 execution, and another TU is EBCDIC
> -encoded source to UCS-4 execution (or whatever).  And there's an issue
> if different TUs compiled the same header with different encoding
> options.
>
> Or... we could not bother.  This is a Quality of Implementation thing,
> for improving diagnostics, and in each case, the diagnostic is required
> to cope with substring location information not being available (and
> the code I posted in patch 2 of the kit makes it trivial to handle that
> case from a diagnostic).  So we could simply have LTO use the
> fallback mode.
>
> There are two high-level approaches I've tried:
>
> (a) capture the substring location information in the lexer/parser in
> the frontend as it runs, and store it somehow.
>
> (b) regenerate it "on-demand" when a diagnostic needs it.
>
> Approach (b) is inherently going to be prone to the LTO issues you
> describe, but it avoids adding to the CPU cycles/memory consumption for
> the common case of not needing the information. [1]
>
> Is approach (b) acceptable?

If (b) means potentially reduced quality of the location ranges
in the -Wformat-length pass (e.g., with funky C++ format strings)
then I don't think that's enough of a problem to worry about, at
least not for this warning.

If it means not being able to use the solution you're working
on in the middle end at all (unless I misunderstood, that doesn't
seem to be what you're implying, but just to be sure) then that
would seem like a serious shortcoming.  I would continue to use
the code I copied from c-format.c (assuming that will still work),
but as more warnings are implemented in later passes it would
lead to duplicating code or reinventing the wheel just to get
around the limitation (or simply worse quality diagnostics).

Martin

>
> Thanks
> Dave
>
> [1] with the exception of the string concatenation records, but I
> believe those are tiny
>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/3] (v2) On-demand locations within string-literals
  2016-07-29 14:22                     ` Martin Sebor
@ 2016-07-29 14:46                       ` David Malcolm
  2016-07-29 15:26                         ` David Malcolm
  0 siblings, 1 reply; 61+ messages in thread
From: David Malcolm @ 2016-07-29 14:46 UTC (permalink / raw)
  To: Martin Sebor, Manuel López-Ibáñez
  Cc: GCC Patches, Richard Biener

On Fri, 2016-07-29 at 08:22 -0600, Martin Sebor wrote:
> > Currently all that we need from the C family of frontends is the
> > cpp_reader and the string concatenation records.  I think we can
> > reconstruct the cpp_reader if we have the options, though
> > presumably
> > that's per TU, so to support all this we'd need to capture e.g. the
> > per
> > -TU encoding information in the LTO records, for the case where one
> > TU
> > is UTF-8 encoded source to UTF-8 execution, and another TU is
> > EBCDIC
> > -encoded source to UCS-4 execution (or whatever).  And there's an
> > issue
> > if different TUs compiled the same header with different encoding
> > options.
> > 
> > Or... we could not bother.  This is a Quality of Implementation
> > thing,
> > for improving diagnostics, and in each case, the diagnostic is
> > required
> > to cope with substring location information not being available
> > (and
> > the code I posted in patch 2 of the kit makes it trivial to handle
> > that
> > case from a diagnostic).  So we could simply have LTO use the
> > fallback mode.
> > 
> > There are two high-level approaches I've tried:
> > 
> > (a) capture the substring location information in the lexer/parser
> > in
> > the frontend as it runs, and store it somehow.
> > 
> > (b) regenerate it "on-demand" when a diagnostic needs it.
> > 
> > Approach (b) is inherently going to be prone to the LTO issues you
> > describe, but it avoids adding to the CPU cycles/memory consumption
> > for
> > the common case of not needing the information. [1]
> > 
> > Is approach (b) acceptable?
> 
> If (b) means potentially reduced quality of the location ranges
> in the -Wformat-length pass (e.g., with funky C++ format strings)
> then I don't think that's enough of a problem to worry about, at
> least not for this warning.
> 
> If it means not being able to use the solution you're working
> on in the middle end at all (unless I misunderstood, that doesn't
> seem to be what you're implying, but just to be sure) then that
> would seem like a serious shortcoming.  I would continue to use
> the code I copied from c-format.c (assuming that will still work),
> but as more warnings are implemented in later passes it would
> lead to duplicating code or reinventing the wheel just to get
> around the limitation (or simply worse quality diagnostics).

It'll work fine for the middle-end within cc1 and cc1plus.

I'm specifically referring to LTO here, and it would be fixable from
LTO if we can encode information about the TU encoding options into the
LTO data stream, and capture the string concatenation records there too
(but that would be followup work).

> Martin
> 
> > 
> > Thanks
> > Dave
> > 
> > [1] with the exception of the string concatenation records, but I
> > believe those are tiny
> > 
> 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/3] (v2) On-demand locations within string-literals
  2016-07-29 14:46                       ` David Malcolm
@ 2016-07-29 15:26                         ` David Malcolm
  2016-07-29 16:54                           ` Manuel López-Ibáñez
  0 siblings, 1 reply; 61+ messages in thread
From: David Malcolm @ 2016-07-29 15:26 UTC (permalink / raw)
  To: Martin Sebor, Manuel López-Ibáñez
  Cc: GCC Patches, Richard Biener

On Fri, 2016-07-29 at 10:46 -0400, David Malcolm wrote:
> On Fri, 2016-07-29 at 08:22 -0600, Martin Sebor wrote:
> > > Currently all that we need from the C family of frontends is the
> > > cpp_reader and the string concatenation records.  I think we can
> > > reconstruct the cpp_reader if we have the options, though
> > > presumably
> > > that's per TU, so to support all this we'd need to capture e.g.
> > > the
> > > per
> > > -TU encoding information in the LTO records, for the case where
> > > one
> > > TU
> > > is UTF-8 encoded source to UTF-8 execution, and another TU is
> > > EBCDIC
> > > -encoded source to UCS-4 execution (or whatever).  And there's an
> > > issue
> > > if different TUs compiled the same header with different encoding
> > > options.
> > > 
> > > Or... we could not bother.  This is a Quality of Implementation
> > > thing,
> > > for improving diagnostics, and in each case, the diagnostic is
> > > required
> > > to cope with substring location information not being available
> > > (and
> > > the code I posted in patch 2 of the kit makes it trivial to
> > > handle
> > > that
> > > case from a diagnostic).  So we could simply have LTO use the
> > > fallback mode.
> > > 
> > > There are two high-level approaches I've tried:
> > > 
> > > (a) capture the substring location information in the
> > > lexer/parser
> > > in
> > > the frontend as it runs, and store it somehow.
> > > 
> > > (b) regenerate it "on-demand" when a diagnostic needs it.
> > > 
> > > Approach (b) is inherently going to be prone to the LTO issues
> > > you
> > > describe, but it avoids adding to the CPU cycles/memory
> > > consumption
> > > for
> > > the common case of not needing the information. [1]
> > > 
> > > Is approach (b) acceptable?
> > 
> > If (b) means potentially reduced quality of the location ranges
> > in the -Wformat-length pass (e.g., with funky C++ format strings)
> > then I don't think that's enough of a problem to worry about, at
> > least not for this warning.
> > 
> > If it means not being able to use the solution you're working
> > on in the middle end  at all (unless I misunderstood that doesn't
> > seem to be what you're implying, but just to be sure) then that
> > would seem like a serious shortcoming.  I would continue to use
> > the code I copied from c-format.c (assuming that will still work),
> > but as more warnings are implemented in later passes it would
> > lead to duplicating code or reinventing the wheel just to get
> > around the limitation (or simply worse quality diagnostics).
> 
> It'll work fine for the middle-end within cc1 and cc1plus.
> 
> I'm specifically referring to LTO here, and it would be fixable from
> LTO if we can encode information about the TU encoding options into
> the LTO data stream, and capture the string concatenation records
> there too (but that would be followup work).

FWIW, it appears that clang uses the on-demand approach; the relevant
code appears to be StringLiteral::getLocationOfByte:
http://clang.llvm.org/doxygen/Expr_8cpp_source.html#l01008


> 
> > Martin
> > 
> > > 
> > > Thanks
> > > Dave
> > > 
> > > [1] with the exception of the string concatenation records, but I
> > > believe those are tiny
> > > 
> > 

* Re: [PATCH 1/3] (v2) On-demand locations within string-literals
  2016-07-29 15:26                         ` David Malcolm
@ 2016-07-29 16:54                           ` Manuel López-Ibáñez
  2016-07-29 17:27                             ` David Malcolm
  0 siblings, 1 reply; 61+ messages in thread
From: Manuel López-Ibáñez @ 2016-07-29 16:54 UTC (permalink / raw)
  To: David Malcolm; +Cc: Martin Sebor, GCC Patches, Richard Biener

On 29 July 2016 at 16:25, David Malcolm <dmalcolm@redhat.com> wrote:
>
> FWIW, it appears that clang uses the on-demand approach; the relevant
> code appears to be StringLiteral::getLocationOfByte:
> http://clang.llvm.org/doxygen/Expr_8cpp_source.html#l01008

As far as I know, llvm doesn't do language diagnostics from the
middle-end/LTO. Thus, they do not have those problems.

Cheers,

Manuel.

* Re: [PATCH 1/3] (v2) On-demand locations within string-literals
  2016-07-29 16:54                           ` Manuel López-Ibáñez
@ 2016-07-29 17:27                             ` David Malcolm
  2016-07-30  1:18                               ` Manuel López-Ibáñez
  2016-08-03 15:56                               ` Jeff Law
  0 siblings, 2 replies; 61+ messages in thread
From: David Malcolm @ 2016-07-29 17:27 UTC (permalink / raw)
  To: Manuel López-Ibáñez
  Cc: Martin Sebor, GCC Patches, Richard Biener

On Fri, 2016-07-29 at 17:53 +0100, Manuel López-Ibáñez wrote:
> On 29 July 2016 at 16:25, David Malcolm <dmalcolm@redhat.com> wrote:
> > 
> > FWIW, it appears that clang uses the on-demand approach; the relevant
> > code appears to be StringLiteral::getLocationOfByte:
> > http://clang.llvm.org/doxygen/Expr_8cpp_source.html#l01008
> 
> As far as I know, llvm doesn't do language diagnostics from the
> middle-end/LTO. Thus, they do not have those problems.

If you really want to have middle-end diagnostics from LTO, I can make
the on-demand approach work.

I can also do the stored-location approach, but it would mean rewriting
all the patches again and, I think, would be less efficient.

I would prefer the on-demand approach.

Who is empowered to make a decision here?


* Re: [PATCH 1/3] (v2) On-demand locations within string-literals
  2016-07-26 16:43     ` [PATCH 1/3] (v2) " David Malcolm
                         ` (2 preceding siblings ...)
  2016-07-26 18:06       ` [PATCH 1/3] (v2) On-demand locations within string-literals Manuel López-Ibáñez
@ 2016-07-29 21:42       ` Joseph Myers
  2016-07-30  1:16         ` David Malcolm
  2016-08-03 15:59         ` [PATCH 1/3] (v2) On-demand locations within string-literals Jeff Law
  3 siblings, 2 replies; 61+ messages in thread
From: Joseph Myers @ 2016-07-29 21:42 UTC (permalink / raw)
  To: David Malcolm; +Cc: gcc-patches

On Tue, 26 Jul 2016, David Malcolm wrote:

> This patch implements precise tracking of source locations for the
> individual chars within string literals, so that we can e.g. underline
> specific ranges in -Wformat diagnostics.  It handles macros,
> concatenated tokens, escaped characters etc.

What if the string literal results from stringizing other tokens (which 
might have arisen in turn from macro expansion, including expansion of 
built-in macros not just those defined in source files, etc.)?  "You don't 
get precise locations" would be a fine answer for such cases - provided 
there is good testsuite coverage of them to show they don't crash the 
compiler or underline nonsensical characters.

> +	return "range starts after LINE_MAP_MAX_LOCATION_WITH_COLS";

Where do these strings get used?  Hopefully not in diagnostics for users, 
as they aren't written in user terms, and any diagnostic string like that 
would need to be marked up to be extracted for translation.

-- 
Joseph S. Myers
joseph@codesourcery.com

* Re: [PATCH 1/3] (v2) On-demand locations within string-literals
  2016-07-29 21:42       ` Joseph Myers
@ 2016-07-30  1:16         ` David Malcolm
  2016-08-03 15:17           ` [PATCH 1/4] selftest.h: Add ASSERT_TRUE_AT and ASSERT_FALSE_AT David Malcolm
  2016-08-03 15:59         ` [PATCH 1/3] (v2) On-demand locations within string-literals Jeff Law
  1 sibling, 1 reply; 61+ messages in thread
From: David Malcolm @ 2016-07-30  1:16 UTC (permalink / raw)
  To: Joseph Myers; +Cc: gcc-patches

On Fri, 2016-07-29 at 21:42 +0000, Joseph Myers wrote:
> On Tue, 26 Jul 2016, David Malcolm wrote:
> 
> > This patch implements precise tracking of source locations for the
> > individual chars within string literals, so that we can e.g.
> > underline specific ranges in -Wformat diagnostics.  It handles
> > macros, concatenated tokens, escaped characters etc.
> 
> What if the string literal results from stringizing other tokens
> (which might have arisen in turn from macro expansion, including
> expansion of built-in macros not just those defined in source files,
> etc.)?  "You don't get precise locations" would be a fine answer for
> such cases - provided there is good testsuite coverage of them to show
> they don't crash the compiler or underline nonsensical characters.

Good question.  I briefly tested it just now, and it happens to fail
gracefully.  I'll add proper test coverage for this.


> > +	return "range starts after LINE_MAP_MAX_LOCATION_WITH_COLS";
> 
> Where do these strings get used?  Hopefully not in diagnostics for
> users, as they aren't written in user terms, and any diagnostic string
> like that would need to be marked up to be extracted for translation.

Quoting from the comment for get_source_range_for_substring:

   Return NULL if successful, or an error message if any errors occurred.
   Error messages are intended for GCC developers (to help debugging) rather
   than for end-users.

and various functions in the patch follow this pattern (maybe I need to
add this to more comments?)

I initially had these functions return bool, but found that a const
char * was much more useful when debugging failures.
(In the testsuite I do happen to use it in a diagnostic, but that's in
a plugin, and is purely intended for verifying that various cases are
hitting various error paths - analogous to looking for messages in a
dumpfile).
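
To illustrate the convention, here is a minimal standalone sketch (not
code from the patch; demo_range and get_demo_range are hypothetical
names, and the checks are simplified stand-ins for the real error paths):

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for a source range; not GCC's source_range.
struct demo_range
{
  int start;
  int finish;
};

// Return NULL on success, writing the result to *OUT.
// Otherwise return a developer-facing message naming the error path
// that was hit; *OUT is left untouched in that case.
const char *
get_demo_range (const char *literal, int start_idx, int end_idx,
		demo_range *out)
{
  assert (out);
  if (!literal)
    return "no string literal";
  int len = static_cast<int> (std::strlen (literal));
  if (start_idx < 0 || start_idx > end_idx)
    return "bad start index";
  if (end_idx >= len)
    return "end index out of range";
  out->start = start_idx;
  out->finish = end_idx;
  return nullptr;
}
```

A caller that only cares about success can test the result against NULL
exactly as it would have tested a bool, and fall back to the location of
the whole string; the message only matters when debugging why a
particular case took the fallback path.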


Thanks
Dave

* Re: [PATCH 1/3] (v2) On-demand locations within string-literals
  2016-07-29 17:27                             ` David Malcolm
@ 2016-07-30  1:18                               ` Manuel López-Ibáñez
  2016-08-03 15:56                               ` Jeff Law
  1 sibling, 0 replies; 61+ messages in thread
From: Manuel López-Ibáñez @ 2016-07-30  1:18 UTC (permalink / raw)
  To: David Malcolm; +Cc: Martin Sebor, GCC Patches, Richard Biener

On 29 July 2016 at 18:27, David Malcolm <dmalcolm@redhat.com> wrote:
> On Fri, 2016-07-29 at 17:53 +0100, Manuel López-Ibáñez wrote:
>> On 29 July 2016 at 16:25, David Malcolm <dmalcolm@redhat.com> wrote:
>> >
>> > FWIW, it appears that clang uses the on-demand approach; the relevant
>> > code appears to be StringLiteral::getLocationOfByte:
>> > http://clang.llvm.org/doxygen/Expr_8cpp_source.html#l01008
>>
>> As far as I know, llvm doesn't do language diagnostics from the
>> middle-end/LTO. Thus, they do not have those problems.
>
> If you really want to have middle-end diagnostics from LTO, I can make
> the on-demand approach work.

Personally, I'm happy with having this work only on the FEs. I haven't
had time to look at what Martin is doing, so he may prefer otherwise.

In any case, making it work from LTO could be done as a follow-up, no?

> I can also do the stored-location approach, but it would mean rewriting
> all the patches again and, I think, would be less efficient.

Agreed, FWIW.

> I would prefer the on-demand approach.
>
> Who is empowered to make a decision here?

I thought you were the diagnostics maintainer ;-)

Cheers,

Manuel.

* Re: [PATCH 1/3] (v2) On-demand locations within string-literals
  2016-07-28 21:17                 ` Martin Sebor
  2016-07-29 12:37                   ` David Malcolm
@ 2016-08-01 21:13                   ` Joseph Myers
  1 sibling, 0 replies; 61+ messages in thread
From: Joseph Myers @ 2016-08-01 21:13 UTC (permalink / raw)
  To: Martin Sebor
  Cc: David Malcolm, Manuel López-Ibáñez, GCC Patches,
	Richard Biener

On Thu, 28 Jul 2016, Martin Sebor wrote:

> like it as well.  So perhaps the problem to solve is how to teach
> LTO to talk to the front end.  One way to do it would be to build
> the front ends as shared libraries.

I think building front ends as shared libraries would run into different 
platforms (e.g. Windows) having very different conceptual models for 
shared libraries, especially when you get into shared libraries depending 
on symbols from the main executable (you might need to make all the 
language-independent parts of the compiler into a shared library as well).  
But a useful starting point could be to eliminate all cases where 
different front ends define external functions / variables with the same 
name (which would also enable statically linking multiple front ends 
together, to do such things without depending on shared libraries at all).

-- 
Joseph S. Myers
joseph@codesourcery.com

* [PATCH 2/4] (v3) On-demand locations within string-literals
  2016-08-03 15:17           ` [PATCH 1/4] selftest.h: Add ASSERT_TRUE_AT and ASSERT_FALSE_AT David Malcolm
  2016-08-03 15:17             ` [PATCH 3/4] Use class substring_loc in c-format.c (PR c/52952) David Malcolm
@ 2016-08-03 15:17             ` David Malcolm
  2016-08-04 17:38               ` Jeff Law
  2016-08-03 15:17             ` [PATCH 4/4] c-format.c: suggest the correct format string to use (PR c/64955) David Malcolm
  2016-08-03 16:06             ` [PATCH 1/4] selftest.h: Add ASSERT_TRUE_AT and ASSERT_FALSE_AT Jeff Law
  3 siblings, 1 reply; 61+ messages in thread
From: David Malcolm @ 2016-08-03 15:17 UTC (permalink / raw)
  To: gcc-patches; +Cc: Joseph Myers, David Malcolm

Changes in v3:
- Avoid including cpplib.h from input.h
- Properly handle stringified macro arguments (with tests for this)
- Minor whitespace fixes
- Move selftest.h changes to a separate patch

Changes in v2:
- Tweaks to substring location selftests
- Many more selftests (EBCDIC, the various wide string types, etc)
- Clean up conditions in charset.c; require source == execution charset
  to have substring locations
- Make string_concat_db field private
- Return error messages rather than bool
- Fix source_range for charset.c:convert_escape
- Introduce class substring_loc
- Handle bad input locations more gracefully
- Ensure that we can read substring information for a token which
  starts in one linemap and ends in another (seen in
  gcc.dg/cpp/pr69985.c)

This version addresses Joseph's question about stringification of macro
arguments (by failing gracefully on them), and the modularity
concerns noted by Manu.

Successfully bootstrapped&regrtested in conjunction with the rest of the
patch kit on x86_64-pc-linux-gnu.

v2 of the kit successfully passed a full config-list.mk run and a selftest
run for stage 1 on powerpc-ibm-aix7.1.3.0 (gcc111), both in conjunction with
the rest of the patch kit; I plan to repeat those tests.

I believe I can self-approve the changes to input.c, input.h, libcpp,
and the testsuite; the remaining changes needing approval are those
to c-family and to gcc.c.

OK for trunk if it passes testing? (by itself)

Blurb from v2 follows, for context:

This patch implements precise tracking of source locations for the
individual chars within string literals, so that we can e.g. underline
specific ranges in -Wformat diagnostics.  It handles macros,
concatenated tokens, escaped characters etc.

The idea is to replace the limited implementation of this we currently
have in c-format.c (see r223470 [1]).  Doing so happens in patch 2 of
the kit; this patch just provides the infrastructure to do so.

As before the patch implements a new mode within libcpp's string literal
lexer.  It's disabled during regular lexing, but it's available
through a low-level interface in input.{c|h} which can rerun the libcpp
code and capture the per-char source_ranges for when we need to issue a
diagnostic.  It also now adds a higher-level interface in c-common.h:
class substring_loc.
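
As a rough illustration of the "rerun the lexer on demand" idea, here is
a minimal standalone sketch.  It is not the libcpp code: demo_range and
demo_relex are hypothetical names, it only understands narrow literals
with single-letter, hex and octal escapes, and it reports plain 0-based
offsets into the literal's spelling rather than real source locations.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Inclusive 0-based offsets into the spelling of the literal.
// Hypothetical stand-in; not GCC's source_range.
struct demo_range
{
  int start;
  int finish;
};

// Re-lex SRC, the spelling of a narrow string literal including its
// quotes, appending one range to *OUT per execution character.
// Return NULL on success, or a developer-facing message on failure
// (in which case *OUT may be partially populated).
const char *
demo_relex (const std::string &src, std::vector<demo_range> *out)
{
  assert (out);
  if (src.size () < 2 || src[0] != '"' || src[src.size () - 1] != '"')
    return "not a simple string literal";

  std::size_t limit = src.size () - 1;  // Offset of the closing quote.
  std::size_t i = 1;
  while (i < limit)
    {
      std::size_t start = i;
      if (src[i] == '\\')
	{
	  i++;
	  if (i >= limit)
	    return "truncated escape";
	  if (src[i] == 'u' || src[i] == 'U')
	    return "UCNs are not handled by this sketch";
	  if (src[i] == 'x')
	    {
	      i++;
	      if (i >= limit
		  || !std::isxdigit (static_cast<unsigned char> (src[i])))
		return "bad hex escape";
	      while (i < limit
		     && std::isxdigit (static_cast<unsigned char> (src[i])))
		i++;
	    }
	  else if (src[i] >= '0' && src[i] <= '7')
	    {
	      /* Octal escapes have at most three digits.  */
	      int n = 0;
	      while (i < limit && n < 3 && src[i] >= '0' && src[i] <= '7')
		{
		  i++;
		  n++;
		}
	    }
	  else
	    i++;  /* Single-letter escape such as \n or \".  */
	}
      else
	i++;
      demo_range r = { static_cast<int> (start), static_cast<int> (i - 1) };
      out->push_back (r);
    }
  return nullptr;
}
```

For the spelling "a\tb\x41" (with its quotes) this yields four ranges,
with the two escapes each covering several source characters.  Nothing
is stored until a diagnostic asks for the ranges, which is the point of
the on-demand approach.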

As before, to handle concatenation the patch adds some extra data
storage: every time a string concatenation happens in c-lex.c, it stores
the locations of the component tokens in a hash_map, keyed by the
spelling location of the start of the first token (see class
string_concat_db in input.h).

Hence it's only storing extra data for string concatenations,
not for simple string literals.
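
The shape of that storage can be sketched as follows.  This is a
simplified standalone model, not the class from the patch: it uses
std::unordered_map rather than GCC's hash_map, demo_location and
demo_concat_db are hypothetical names, and it assumes the key has
already been canonicalized to a spelling location.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for location_t.
typedef std::uint32_t demo_location;

class demo_concat_db
{
 public:
  // Record a concatenation of two or more string tokens, keyed by the
  // location of the first token.  A copy of LOCS is taken.
  void record (const std::vector<demo_location> &locs)
  {
    assert (locs.size () > 1);
    m_table[locs[0]] = locs;
  }

  // If LOC was the location of the first token of a recorded
  // concatenation, return the locations of all of its tokens (a
  // borrowed pointer owned by the db).  Otherwise return NULL.
  const std::vector<demo_location> *lookup (demo_location loc) const
  {
    auto iter = m_table.find (loc);
    if (iter == m_table.end ())
      return nullptr;
    return &iter->second;
  }

 private:
  std::unordered_map<demo_location, std::vector<demo_location> > m_table;
};
```

A literal that was never concatenated has no entry, so a failed lookup
simply means "treat this as a single token".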

As before, this doesn't support the C++ frontend yet, but it doesn't
regress the status quo for c-format.c from C++.  I have a patch for
the C++ FE that records string concatenation information to the lexer,
but given that it's not used yet, I didn't add that in this patch, as
the data would be redundant.

This version of the patch properly handles encodings (and adds a
lot of test coverage for this to input.c).  It makes the simplifying
restriction that precise source location information is only available
if source charset == execution charset, as discussed on this list,
failing gracefully when this isn't the case.

[1]  https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=d5a2ddc76a109258297ff345957c35cb50116c94#patch2

gcc/c-family/ChangeLog:
	* c-common.c: Include "substring-locations.h".
	(get_cpp_ttype_from_string_type): New function.
	(g_string_concat_db): New global.
	(substring_loc::get_range): New method.
	* c-common.h (g_string_concat_db): New declaration.
	(class substring_loc): New class.
	* c-lex.c (lex_string): When concatenating strings, capture the
	locations of all tokens using a new obstack, and record the
	concatenation locations within g_string_concat_db.
	* c-opts.c (c_common_init_options): Construct g_string_concat_db
	on the ggc-heap.

gcc/ChangeLog:
	* gcc.c (cpp_options): Rename string to...
	(cpp_options_): ...this, to avoid clashing with struct in
	cpplib.h.
	(static_specs): Update initializer for above renaming.
	* input.c (string_concat::string_concat): New constructor.
	(string_concat_db::string_concat_db): New constructor.
	(string_concat_db::record_string_concatenation): New method.
	(string_concat_db::get_string_concatenation): New method.
	(string_concat_db::get_key_loc): New method.
	(class auto_cpp_string_vec): New class.
	(get_substring_ranges_for_loc): New function.
	(get_source_range_for_substring): New function.
	(get_num_source_ranges_for_substring): New function.
	(class selftest::lexer_test_options): New class.
	(struct selftest::lexer_test): New struct.
	(class selftest::ebcdic_execution_charset): New class.
	(selftest::ebcdic_execution_charset::s_singleton): New variable.
	(selftest::lexer_test::lexer_test): New constructor.
	(selftest::lexer_test::~lexer_test): New destructor.
	(selftest::lexer_test::get_token): New method.
	(selftest::assert_char_at_range): New function.
	(ASSERT_CHAR_AT_RANGE): New macro.
	(selftest::assert_num_substring_ranges): New function.
	(ASSERT_NUM_SUBSTRING_RANGES): New macro.
	(selftest::assert_has_no_substring_ranges): New function.
	(ASSERT_HAS_NO_SUBSTRING_RANGES): New macro.
	(selftest::test_lexer_string_locations_simple): New function.
	(selftest::test_lexer_string_locations_ebcdic): New function.
	(selftest::test_lexer_string_locations_hex): New function.
	(selftest::test_lexer_string_locations_oct): New function.
	(selftest::test_lexer_string_locations_letter_escape_1): New function.
	(selftest::test_lexer_string_locations_letter_escape_2): New function.
	(selftest::test_lexer_string_locations_ucn4): New function.
	(selftest::test_lexer_string_locations_ucn8): New function.
	(selftest::uint32_from_big_endian): New function.
	(selftest::test_lexer_string_locations_wide_string): New function.
	(selftest::uint16_from_big_endian): New function.
	(selftest::test_lexer_string_locations_string16): New function.
	(selftest::test_lexer_string_locations_string32): New function.
	(selftest::test_lexer_string_locations_u8): New function.
	(selftest::test_lexer_string_locations_utf8_source): New function.
	(selftest::test_lexer_string_locations_concatenation_1): New
	function.
	(selftest::test_lexer_string_locations_concatenation_2): New
	function.
	(selftest::test_lexer_string_locations_concatenation_3): New
	function.
	(selftest::test_lexer_string_locations_macro): New function.
	(selftest::test_lexer_string_locations_stringified_macro_argument):
	New function.
	(selftest::test_lexer_string_locations_non_string): New function.
	(selftest::test_lexer_string_locations_long_line): New function.
	(selftest::test_lexer_char_constants): New function.
	(selftest::input_c_tests): Call the new test functions once per
	case within the line_table test matrix.
	* input.h (struct string_concat): New struct.
	(struct location_hash): New struct.
	(class string_concat_db): New class.
	* substring-locations.h: New header.

gcc/testsuite/ChangeLog:
	* gcc.dg/plugin/diagnostic-test-string-literals-1.c: New file.
	* gcc.dg/plugin/diagnostic-test-string-literals-2.c: New file.
	* gcc.dg/plugin/diagnostic_plugin_test_string_literals.c: New file.
	* gcc.dg/plugin/plugin.exp (plugin_test_list): Add the above new files.

libcpp/ChangeLog:
	* charset.c (cpp_substring_ranges::cpp_substring_ranges): New
	constructor.
	(cpp_substring_ranges::~cpp_substring_ranges): New destructor.
	(cpp_substring_ranges::add_range): New method.
	(cpp_substring_ranges::add_n_ranges): New method.
	(_cpp_valid_ucn): Add "char_range" and "loc_reader" params; if
	they are non-NULL, read position information from *loc_reader
	and update char_range->m_finish accordingly.
	(convert_ucn): Add "char_range", "loc_reader", and "ranges"
	params.  If loc_reader is non-NULL, read location information from
	it, and update *ranges accordingly, using char_range.
	Conditionalize the conversion into tbuf on tbuf being non-NULL.
	(convert_hex): Likewise, conditionalizing the call to
	emit_numeric_escape on tbuf.
	(convert_oct): Likewise.
	(convert_escape): Add params "loc_reader" and "ranges".  If
	loc_reader is non-NULL, read location information from it, and
	update *ranges accordingly.  Conditionalize the conversion into
	tbuf on tbuf being non-NULL.
	(cpp_interpret_string): Rename to...
	(cpp_interpret_string_1): ...this, adding params "loc_readers" and
	"out".  Use "to" to conditionalize the initialization and usage of
	"tbuf", such as running the converter.  If "loc_readers" is
	non-NULL, use the instances within it, reading location
	information from them, and passing them to convert_escape; likewise
	write to "out" if loc_readers is non-NULL.  Check for leading
	quote and issue an error if it is not present.  Update boundary
	check from "== limit" to ">= limit" to protect against erroneous
	location values to calls that are not parsing string literals.
	(cpp_interpret_string): Reimplement in terms of
	cpp_interpret_string_1.
	(noop_error_cb): New function.
	(cpp_interpret_string_ranges): New function.
	(cpp_string_location_reader::cpp_string_location_reader): New
	constructor.
	(cpp_string_location_reader::get_next): New method.
	* include/cpplib.h (class cpp_string_location_reader): New class.
	(class cpp_substring_ranges): New class.
	(cpp_interpret_string_ranges): New prototype.
	* internal.h (_cpp_valid_ucn): Add params "char_range" and
	"loc_reader".
	* lex.c (forms_identifier_p): Pass NULL for new params to
	_cpp_valid_ucn.
---
 gcc/c-family/c-common.c                            |   62 +
 gcc/c-family/c-common.h                            |   29 +
 gcc/c-family/c-lex.c                               |   24 +-
 gcc/c-family/c-opts.c                              |    3 +
 gcc/gcc.c                                          |    4 +-
 gcc/input.c                                        | 1547 ++++++++++++++++++++
 gcc/input.h                                        |   35 +
 gcc/substring-locations.h                          |   30 +
 .../plugin/diagnostic-test-string-literals-1.c     |  211 +++
 .../plugin/diagnostic-test-string-literals-2.c     |   53 +
 .../diagnostic_plugin_test_string_literals.c       |  212 +++
 gcc/testsuite/gcc.dg/plugin/plugin.exp             |    3 +
 libcpp/charset.c                                   |  432 +++++-
 libcpp/include/cpplib.h                            |   51 +
 libcpp/internal.h                                  |    4 +-
 libcpp/lex.c                                       |    2 +-
 16 files changed, 2644 insertions(+), 58 deletions(-)
 create mode 100644 gcc/substring-locations.h
 create mode 100644 gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c
 create mode 100644 gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-2.c
 create mode 100644 gcc/testsuite/gcc.dg/plugin/diagnostic_plugin_test_string_literals.c

diff --git a/gcc/c-family/c-common.c b/gcc/c-family/c-common.c
index 27031b5..7a8b6ea 100644
--- a/gcc/c-family/c-common.c
+++ b/gcc/c-family/c-common.c
@@ -45,6 +45,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-iterator.h"
 #include "opts.h"
 #include "gimplify.h"
+#include "substring-locations.h"
 
 cpp_reader *parse_in;		/* Declared in c-pragma.h.  */
 
@@ -1098,6 +1099,67 @@ fix_string_type (tree value)
   TREE_STATIC (value) = 1;
   return value;
 }
+
+/* Given a string of type STRING_TYPE, determine what kind of string
+   token created it: CPP_STRING, CPP_STRING16, CPP_STRING32, or
+   CPP_WSTRING.  Return CPP_OTHER in case of error.
+
+   This effectively reverses part of the logic in
+   lex_string and fix_string_type.  */
+
+static enum cpp_ttype
+get_cpp_ttype_from_string_type (tree string_type)
+{
+  gcc_assert (string_type);
+  if (TREE_CODE (string_type) != ARRAY_TYPE)
+    return CPP_OTHER;
+
+  tree element_type = TREE_TYPE (string_type);
+  if (TREE_CODE (element_type) != INTEGER_TYPE)
+    return CPP_OTHER;
+
+  int bits_per_character = TYPE_PRECISION (element_type);
+  switch (bits_per_character)
+    {
+    case 8:
+      return CPP_STRING;  /* It could have also been CPP_UTF8STRING.  */
+    case 16:
+      return CPP_STRING16;
+    case 32:
+      return CPP_STRING32;
+    }
+
+  if (bits_per_character == TYPE_PRECISION (wchar_type_node))
+    return CPP_WSTRING;
+
+  return CPP_OTHER;
+}
+
+/* The global record of string concatenations, for use in
+   extracting locations within string literals.  */
+
+GTY(()) string_concat_db *g_string_concat_db;
+
+/* Attempt to determine the source range of the substring.
+   If successful, return NULL and write the source range to *OUT_RANGE.
+   Otherwise return an error message.  Error messages are intended
+   for GCC developers (to help debugging) rather than for end-users.  */
+
+const char *
+substring_loc::get_range (source_range *out_range) const
+{
+  gcc_assert (out_range);
+
+  enum cpp_ttype tok_type = get_cpp_ttype_from_string_type (m_string_type);
+  if (tok_type == CPP_OTHER)
+    return "unrecognized string type";
+
+  return get_source_range_for_substring (parse_in, g_string_concat_db,
+					 m_fmt_string_loc, tok_type,
+					 m_start_idx, m_end_idx,
+					 out_range);
+}
+
 \f
 /* Fold X for consideration by one of the warning functions when checking
    whether an expression has a constant value.  */
diff --git a/gcc/c-family/c-common.h b/gcc/c-family/c-common.h
index 8c80574..7b5da57 100644
--- a/gcc/c-family/c-common.h
+++ b/gcc/c-family/c-common.h
@@ -1110,6 +1110,35 @@ extern time_t cb_get_source_date_epoch (cpp_reader *pfile);
    __TIME__ can store.  */
 #define MAX_SOURCE_DATE_EPOCH HOST_WIDE_INT_C (253402300799)
 
+extern GTY(()) string_concat_db *g_string_concat_db;
+
+/* libcpp can calculate location information about a range of characters
+   within a string literal, but doing so is non-trivial.
+
+   This class encapsulates such a source location, so that it can be
+   passed around (e.g. within c-format.c).  It is effectively a deferred
+   call into libcpp.  If needed by a diagnostic, the actual source_range
+   can be calculated by calling the get_range method.  */
+
+class substring_loc
+{
+ public:
+  substring_loc (location_t fmt_string_loc, tree string_type,
+		 int start_idx, int end_idx)
+  : m_fmt_string_loc (fmt_string_loc), m_string_type (string_type),
+    m_start_idx (start_idx), m_end_idx (end_idx) {}
+
+  const char *get_range (source_range *out_range) const;
+
+  location_t get_fmt_string_loc () const { return m_fmt_string_loc; }
+
+ private:
+  location_t m_fmt_string_loc;
+  tree m_string_type;
+  int m_start_idx;
+  int m_end_idx;
+};
+
 /* In c-gimplify.c  */
 extern void c_genericize (tree);
 extern int c_gimplify_expr (tree *, gimple_seq *, gimple_seq *);
diff --git a/gcc/c-family/c-lex.c b/gcc/c-family/c-lex.c
index 8f33d86..4c7e385 100644
--- a/gcc/c-family/c-lex.c
+++ b/gcc/c-family/c-lex.c
@@ -1097,13 +1097,16 @@ lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
   tree value;
   size_t concats = 0;
   struct obstack str_ob;
+  struct obstack loc_ob;
   cpp_string istr;
   enum cpp_ttype type = tok->type;
 
   /* Try to avoid the overhead of creating and destroying an obstack
      for the common case of just one string.  */
   cpp_string str = tok->val.str;
+  location_t init_loc = tok->src_loc;
   cpp_string *strs = &str;
+  location_t *locs = NULL;
 
   /* objc_at_sign_was_seen is only used when doing Objective-C string
      concatenation.  It is 'true' if we have seen an '@' before the
@@ -1142,16 +1145,21 @@ lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
 	  else
 	    error ("unsupported non-standard concatenation of string literals");
 	}
+      /* FALLTHROUGH */
 
     case CPP_STRING:
       if (!concats)
 	{
 	  gcc_obstack_init (&str_ob);
+	  gcc_obstack_init (&loc_ob);
 	  obstack_grow (&str_ob, &str, sizeof (cpp_string));
+	  obstack_grow (&loc_ob, &init_loc, sizeof (location_t));
 	}
 
       concats++;
       obstack_grow (&str_ob, &tok->val.str, sizeof (cpp_string));
+      obstack_grow (&loc_ob, &tok->src_loc, sizeof (location_t));
+
       if (objc_string)
 	objc_at_sign_was_seen = false;
       goto retry;
@@ -1164,7 +1172,10 @@ lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
   /* We have read one more token than we want.  */
   _cpp_backup_tokens (parse_in, 1);
   if (concats)
-    strs = XOBFINISH (&str_ob, cpp_string *);
+    {
+      strs = XOBFINISH (&str_ob, cpp_string *);
+      locs = XOBFINISH (&loc_ob, location_t *);
+    }
 
   if (concats && !objc_string && !in_system_header_at (input_location))
     warning (OPT_Wtraditional,
@@ -1176,6 +1187,12 @@ lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
     {
       value = build_string (istr.len, (const char *) istr.text);
       free (CONST_CAST (unsigned char *, istr.text));
+      if (concats)
+	{
+	  gcc_assert (locs);
+	  gcc_assert (g_string_concat_db);
+	  g_string_concat_db->record_string_concatenation (concats + 1, locs);
+	}
     }
   else
     {
@@ -1227,7 +1244,10 @@ lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
   *valp = fix_string_type (value);
 
   if (concats)
-    obstack_free (&str_ob, 0);
+    {
+      obstack_free (&str_ob, 0);
+      obstack_free (&loc_ob, 0);
+    }
 
   return objc_string ? CPP_OBJC_STRING : type;
 }
diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
index c11e7e7..0715b2e 100644
--- a/gcc/c-family/c-opts.c
+++ b/gcc/c-family/c-opts.c
@@ -216,6 +216,9 @@ c_common_init_options (unsigned int decoded_options_count,
   unsigned int i;
   struct cpp_callbacks *cb;
 
+  g_string_concat_db
+    = new (ggc_alloc <string_concat_db> ()) string_concat_db ();
+
   parse_in = cpp_create_reader (c_dialect_cxx () ? CLK_GNUCXX: CLK_GNUC89,
 				ident_hash, line_table);
   cb = cpp_get_callbacks (parse_in);
diff --git a/gcc/gcc.c b/gcc/gcc.c
index 7460f6a..062fcce 100644
--- a/gcc/gcc.c
+++ b/gcc/gcc.c
@@ -1117,7 +1117,7 @@ static const char *cpp_unique_options =
    options to the preprocessor so that it the cc1 spec may manipulate
    options used to set target flags.  Those special target flags settings may
    in turn cause preprocessor symbols to be defined specially.  */
-static const char *cpp_options =
+static const char *cpp_options_ =
 "%(cpp_unique_options) %1 %{m*} %{std*&ansi&trigraphs} %{W*&pedantic*} %{w}\
  %{f*} %{g*:%{!g0:%{g*} %{!fno-working-directory:-fworking-directory}}} %{O*}\
  %{undef} %{save-temps*:-fpch-preprocess}";
@@ -1558,7 +1558,7 @@ static struct spec_list static_specs[] =
   INIT_STATIC_SPEC ("asm_options",		&asm_options),
   INIT_STATIC_SPEC ("invoke_as",		&invoke_as),
   INIT_STATIC_SPEC ("cpp",			&cpp_spec),
-  INIT_STATIC_SPEC ("cpp_options",		&cpp_options),
+  INIT_STATIC_SPEC ("cpp_options",		&cpp_options_),
   INIT_STATIC_SPEC ("cpp_debug_options",	&cpp_debug_options),
   INIT_STATIC_SPEC ("cpp_unique_options",	&cpp_unique_options),
   INIT_STATIC_SPEC ("trad_capable_cpp",		&trad_capable_cpp),
diff --git a/gcc/input.c b/gcc/input.c
index f91a702..d058b8a 100644
--- a/gcc/input.c
+++ b/gcc/input.c
@@ -1189,6 +1189,279 @@ dump_location_info (FILE *stream)
 				MAX_SOURCE_LOCATION + 1, UINT_MAX);
 }
 
+/* string_concat's constructor.  */
+
+string_concat::string_concat (int num, location_t *locs)
+  : m_num (num)
+{
+  m_locs = ggc_vec_alloc <location_t> (num);
+  for (int i = 0; i < num; i++)
+    m_locs[i] = locs[i];
+}
+
+/* string_concat_db's constructor.  */
+
+string_concat_db::string_concat_db ()
+{
+  m_table = hash_map <location_hash, string_concat *>::create_ggc (64);
+}
+
+/* Record that a string concatenation occurred, covering NUM
+   string literal tokens.  LOCS is an array of size NUM, containing the
+   locations of the tokens.  A copy of LOCS is taken.  */
+
+void
+string_concat_db::record_string_concatenation (int num, location_t *locs)
+{
+  gcc_assert (num > 1);
+  gcc_assert (locs);
+
+  location_t key_loc = get_key_loc (locs[0]);
+
+  string_concat *concat
+    = new (ggc_alloc <string_concat> ()) string_concat (num, locs);
+  m_table->put (key_loc, concat);
+}
+
+/* Determine if LOC was the location of the initial token of a
+   concatenation of string literal tokens.
+   If so, *OUT_NUM is written to with the number of tokens, and
+   *OUT_LOCS with the location of an array of locations of the
+   tokens, and return true.  *OUT_LOCS is a borrowed pointer to
+   storage owned by the string_concat_db.
+   Otherwise, return false.  */
+
+bool
+string_concat_db::get_string_concatenation (location_t loc,
+					    int *out_num,
+					    location_t **out_locs)
+{
+  gcc_assert (out_num);
+  gcc_assert (out_locs);
+
+  location_t key_loc = get_key_loc (loc);
+
+  string_concat **concat = m_table->get (key_loc);
+  if (!concat)
+    return false;
+
+  *out_num = (*concat)->m_num;
+  *out_locs = (*concat)->m_locs;
+  return true;
+}
+
+/* Internal function.  Canonicalize LOC into a form suitable for
+   use as a key within the database, stripping away macro expansion,
+   ad-hoc information, and range information, using the location of
+   the start of LOC within an ordinary linemap.  */
+
+location_t
+string_concat_db::get_key_loc (location_t loc)
+{
+  loc = linemap_resolve_location (line_table, loc, LRK_SPELLING_LOCATION,
+				  NULL);
+
+  loc = get_range_from_loc (line_table, loc).m_start;
+
+  return loc;
+}
+
+/* Helper class for use within get_substring_ranges_for_loc.
+   A vec of cpp_string with responsibility for releasing all of the
+   str->text for each str in the vector.  */
+
+class auto_cpp_string_vec : public auto_vec <cpp_string>
+{
+ public:
+  auto_cpp_string_vec (int alloc)
+    : auto_vec <cpp_string> (alloc) {}
+
+  ~auto_cpp_string_vec ()
+  {
+    /* Clean up the copies within this vec.  */
+    int i;
+    cpp_string *str;
+    FOR_EACH_VEC_ELT (*this, i, str)
+      free (const_cast <unsigned char *> (str->text));
+  }
+};
+
+/* Attempt to populate RANGES with source location information on the
+   individual characters within the string literal found at STRLOC.
+   If CONCATS is non-NULL, then any string literals that the token at
+   STRLOC was concatenated with are also added to RANGES.
+
+   Return NULL if successful, or an error message if any errors occurred (in
+   which case RANGES may be only partially populated and should not
+   be used).
+
+   This is implemented by re-parsing the relevant source line(s).  */
+
+static const char *
+get_substring_ranges_for_loc (cpp_reader *pfile,
+			      string_concat_db *concats,
+			      location_t strloc,
+			      enum cpp_ttype type,
+			      cpp_substring_ranges &ranges)
+{
+  gcc_assert (pfile);
+
+  if (strloc == UNKNOWN_LOCATION)
+    return "unknown location";
+
+  /* If string concatenation has occurred at STRLOC, get the locations
+     of all of the literal tokens making up the compound string.
+     Otherwise, just use STRLOC.  */
+  int num_locs = 1;
+  location_t *strlocs = &strloc;
+  if (concats)
+    concats->get_string_concatenation (strloc, &num_locs, &strlocs);
+
+  auto_cpp_string_vec strs (num_locs);
+  auto_vec <cpp_string_location_reader> loc_readers (num_locs);
+  for (int i = 0; i < num_locs; i++)
+    {
+      /* Get range of strloc.  We will use it to locate the start and finish
+	 of the literal token within the line.  */
+      source_range src_range = get_range_from_loc (line_table, strlocs[i]);
+
+      if (src_range.m_start >= LINEMAPS_MACRO_LOWEST_LOCATION (line_table))
+	/* If the string is within a macro expansion, we can't get at the
+	   end location.  */
+	return "macro expansion";
+
+      if (src_range.m_start >= LINE_MAP_MAX_LOCATION_WITH_COLS)
+	/* If so, we can't reliably determine where the token started within
+	   its line.  */
+	return "range starts after LINE_MAP_MAX_LOCATION_WITH_COLS";
+
+      if (src_range.m_finish >= LINE_MAP_MAX_LOCATION_WITH_COLS)
+	/* If so, we can't reliably determine where the token finished within
+	   its line.  */
+	return "range ends after LINE_MAP_MAX_LOCATION_WITH_COLS";
+
+      expanded_location start
+	= expand_location_to_spelling_point (src_range.m_start);
+      expanded_location finish
+	= expand_location_to_spelling_point (src_range.m_finish);
+      if (start.file != finish.file)
+	return "range endpoints are in different files";
+      if (start.line != finish.line)
+	return "range endpoints are on different lines";
+      if (start.column > finish.column)
+	return "range endpoints are reversed";
+
+      int line_width;
+      const char *line = location_get_source_line (start.file, start.line,
+						   &line_width);
+      if (line == NULL)
+	return "unable to read source line";
+
+      /* Determine the location of the literal (including quotes
+	 and leading prefix chars, such as the 'u' in a u""
+	 token).  */
+      const char *literal = line + start.column - 1;
+      int literal_length = finish.column - start.column + 1;
+
+      gcc_assert (line_width >= (start.column - 1 + literal_length));
+      cpp_string from;
+      from.len = literal_length;
+      /* Make a copy of the literal, to avoid having to rely on
+	 the lifetime of the copy of the line within the cache.
+	 This will be released by the auto_cpp_string_vec dtor.  */
+      from.text = XDUPVEC (unsigned char, literal, literal_length);
+      strs.safe_push (from);
+
+      /* For very long lines, a new linemap could have started
+	 halfway through the token.
+	 Ensure that the loc_reader uses the linemap of the
+	 *end* of the token for its start location.  */
+      const line_map_ordinary *final_ord_map;
+      linemap_resolve_location (line_table, src_range.m_finish,
+				LRK_MACRO_EXPANSION_POINT, &final_ord_map);
+      location_t start_loc
+	= linemap_position_for_line_and_column (line_table, final_ord_map,
+						start.line, start.column);
+
+      cpp_string_location_reader loc_reader (start_loc, line_table);
+      loc_readers.safe_push (loc_reader);
+    }
+
+  /* Rerun cpp_interpret_string, or rather, a modified version of it.  */
+  const char *err = cpp_interpret_string_ranges (pfile, strs.address (),
+						 loc_readers.address (),
+						 num_locs, &ranges, type);
+  if (err)
+    return err;
+
+  /* Success: "ranges" should now contain information on the string.  */
+  return NULL;
+}
+
+/* Attempt to populate *OUT_RANGE with source location information on the
+   range of given characters within the string literal found at STRLOC.
+   START_IDX and END_IDX refer to offsets within the execution character
+   set.
+   If CONCATS is non-NULL, then any string literals that the token at
+   STRLOC was concatenated with are also considered.
+
+   This is implemented by re-parsing the relevant source line(s).
+
+   Return NULL if successful, or an error message if any errors occurred.
+   Error messages are intended for GCC developers (to help debugging) rather
+   than for end-users.  */
+
+const char *
+get_source_range_for_substring (cpp_reader *pfile,
+				string_concat_db *concats,
+				location_t strloc,
+				enum cpp_ttype type,
+				int start_idx, int end_idx,
+				source_range *out_range)
+{
+  gcc_checking_assert (start_idx >= 0);
+  gcc_checking_assert (end_idx >= 0);
+  gcc_assert (out_range);
+
+  cpp_substring_ranges ranges;
+  const char *err
+    = get_substring_ranges_for_loc (pfile, concats, strloc, type, ranges);
+  if (err)
+    return err;
+
+  if (start_idx >= ranges.get_num_ranges ())
+    return "start_idx out of range";
+  if (end_idx >= ranges.get_num_ranges ())
+    return "end_idx out of range";
+
+  out_range->m_start = ranges.get_range (start_idx).m_start;
+  out_range->m_finish = ranges.get_range (end_idx).m_finish;
+  return NULL;
+}
+
+/* As get_source_range_for_substring, but write to *OUT the number
+   of ranges that are available.  */
+
+const char *
+get_num_source_ranges_for_substring (cpp_reader *pfile,
+				     string_concat_db *concats,
+				     location_t strloc,
+				     enum cpp_ttype type,
+				     int *out)
+{
+  gcc_assert (out);
+
+  cpp_substring_ranges ranges;
+  const char *err
+    = get_substring_ranges_for_loc (pfile, concats, strloc, type, ranges);
+
+  if (err)
+    return err;
+
+  *out = ranges.get_num_ranges ();
+  return NULL;
+}
+
 #if CHECKING_P
 
 namespace selftest {
@@ -1541,6 +1814,1259 @@ test_lexer (const line_table_case &case_)
   cpp_destroy (parser);
 }
 
+/* Forward decls.  */
+
+struct lexer_test;
+class lexer_test_options;
+
+/* A class for specifying options of a lexer_test.
+   The "apply" vfunc is called during the lexer_test constructor.  */
+
+class lexer_test_options
+{
+ public:
+  virtual void apply (lexer_test &) = 0;
+};
+
+/* A struct for writing lexer tests.  */
+
+struct lexer_test
+{
+  lexer_test (const line_table_case &case_, const char *content,
+	      lexer_test_options *options);
+  ~lexer_test ();
+
+  const cpp_token *get_token ();
+
+  temp_source_file m_tempfile;
+  temp_line_table m_tmp_lt;
+  cpp_reader *m_parser;
+  string_concat_db m_concats;
+};
+
+/* Use an EBCDIC encoding for the execution charset, specifically
+   IBM1047-encoded (aka "EBCDIC 1047", or "Code page 1047").
+
+   This exercises iconv integration within libcpp.
+   Not every build of iconv supports the given charset,
+   so we need to flag this error and handle it gracefully.  */
+
+class ebcdic_execution_charset : public lexer_test_options
+{
+ public:
+  ebcdic_execution_charset () : m_num_iconv_errors (0)
+    {
+      gcc_assert (s_singleton == NULL);
+      s_singleton = this;
+    }
+  ~ebcdic_execution_charset ()
+    {
+      gcc_assert (s_singleton == this);
+      s_singleton = NULL;
+    }
+
+  void apply (lexer_test &test) FINAL OVERRIDE
+  {
+    cpp_options *cpp_opts = cpp_get_options (test.m_parser);
+    cpp_opts->narrow_charset = "IBM1047";
+
+    cpp_callbacks *callbacks = cpp_get_callbacks (test.m_parser);
+    callbacks->error = on_error;
+  }
+
+  static bool on_error (cpp_reader *pfile ATTRIBUTE_UNUSED,
+			int level ATTRIBUTE_UNUSED,
+			int reason ATTRIBUTE_UNUSED,
+			rich_location *richloc ATTRIBUTE_UNUSED,
+			const char *msgid, va_list *ap ATTRIBUTE_UNUSED)
+    ATTRIBUTE_FPTR_PRINTF(5,0)
+  {
+    gcc_assert (s_singleton);
+    /* Detect and record errors emitted by libcpp/charset.c:init_iconv_desc
+       when the local iconv build doesn't support the conversion.  */
+    if (strstr (msgid, "not supported by iconv"))
+      {
+	s_singleton->m_num_iconv_errors++;
+	return true;
+      }
+
+    /* Otherwise, we have an unexpected error.  */
+    abort ();
+  }
+
+  bool iconv_errors_occurred_p () const { return m_num_iconv_errors > 0; }
+
+ private:
+  static ebcdic_execution_charset *s_singleton;
+  int m_num_iconv_errors;
+};
+
+ebcdic_execution_charset *ebcdic_execution_charset::s_singleton;
+
+/* Constructor.  Override line_table with a new instance based on CASE_,
+   and write CONTENT to a tempfile.  Create a cpp_reader, and use it to
+   start parsing the tempfile.  */
+
+lexer_test::lexer_test (const line_table_case &case_, const char *content,
+			lexer_test_options *options) :
+  /* Create a tempfile and write the text to it.  */
+  m_tempfile (SELFTEST_LOCATION, ".c", content),
+  m_tmp_lt (case_),
+  m_parser (cpp_create_reader (CLK_GNUC99, NULL, line_table)),
+  m_concats ()
+{
+  if (options)
+    options->apply (*this);
+
+  cpp_init_iconv (m_parser);
+
+  /* Parse the file.  */
+  const char *fname = cpp_read_main_file (m_parser,
+					  m_tempfile.get_filename ());
+  ASSERT_NE (fname, NULL);
+}
+
+/* Destructor.  Verify that the next token in m_parser is EOF.  */
+
+lexer_test::~lexer_test ()
+{
+  location_t loc;
+  const cpp_token *tok;
+
+  tok = cpp_get_token_with_location (m_parser, &loc);
+  ASSERT_NE (tok, NULL);
+  ASSERT_EQ (tok->type, CPP_EOF);
+
+  cpp_finish (m_parser, NULL);
+  cpp_destroy (m_parser);
+}
+
+/* Get the next token from m_parser.  */
+
+const cpp_token *
+lexer_test::get_token ()
+{
+  location_t loc;
+  const cpp_token *tok;
+
+  tok = cpp_get_token_with_location (m_parser, &loc);
+  ASSERT_NE (tok, NULL);
+  return tok;
+}
+
+/* Verify that locations within string literals are correctly handled.  */
+
+/* Verify get_source_range_for_substring for token(s) at STRLOC,
+   using the string concatenation database for TEST.
+
+   Assert that the character at index IDX is on EXPECTED_LINE,
+   and that it begins at column EXPECTED_START_COL and ends at
+   EXPECTED_FINISH_COL (unless the locations are beyond
+   LINE_MAP_MAX_LOCATION_WITH_COLS, in which case don't check their
+   columns).  */
+
+static void
+assert_char_at_range (const location &loc,
+		      lexer_test& test,
+		      location_t strloc, enum cpp_ttype type, int idx,
+		      int expected_line, int expected_start_col,
+		      int expected_finish_col)
+{
+  cpp_reader *pfile = test.m_parser;
+  string_concat_db *concats = &test.m_concats;
+
+  source_range actual_range;
+  const char *err
+    = get_source_range_for_substring (pfile, concats, strloc, type,
+				      idx, idx, &actual_range);
+  if (should_have_column_data_p (strloc))
+    ASSERT_EQ_AT (loc, NULL, err);
+  else
+    {
+      ASSERT_STREQ_AT (loc,
+		       "range starts after LINE_MAP_MAX_LOCATION_WITH_COLS",
+		       err);
+      return;
+    }
+
+  int actual_start_line = LOCATION_LINE (actual_range.m_start);
+  ASSERT_EQ_AT (loc, expected_line, actual_start_line);
+  int actual_finish_line = LOCATION_LINE (actual_range.m_finish);
+  ASSERT_EQ_AT (loc, expected_line, actual_finish_line);
+
+  if (should_have_column_data_p (actual_range.m_start))
+    {
+      int actual_start_col = LOCATION_COLUMN (actual_range.m_start);
+      ASSERT_EQ_AT (loc, expected_start_col, actual_start_col);
+    }
+  if (should_have_column_data_p (actual_range.m_finish))
+    {
+      int actual_finish_col = LOCATION_COLUMN (actual_range.m_finish);
+      ASSERT_EQ_AT (loc, expected_finish_col, actual_finish_col);
+    }
+}
+
+/* Macro for calling assert_char_at_range, supplying SELFTEST_LOCATION for
+   the effective location of any errors.  */
+
+#define ASSERT_CHAR_AT_RANGE(LEXER_TEST, STRLOC, TYPE, IDX, EXPECTED_LINE, \
+			     EXPECTED_START_COL, EXPECTED_FINISH_COL)	\
+  assert_char_at_range (SELFTEST_LOCATION, (LEXER_TEST), (STRLOC), (TYPE), \
+			(IDX), (EXPECTED_LINE), (EXPECTED_START_COL), \
+			(EXPECTED_FINISH_COL))
+
+/* Verify get_num_source_ranges_for_substring for token(s) at STRLOC,
+   using the string concatenation database for TEST.
+
+   Assert that the token(s) at STRLOC contain EXPECTED_NUM_RANGES.  */
+
+static void
+assert_num_substring_ranges (const location &loc,
+			     lexer_test& test,
+			     location_t strloc,
+			     enum cpp_ttype type,
+			     int expected_num_ranges)
+{
+  cpp_reader *pfile = test.m_parser;
+  string_concat_db *concats = &test.m_concats;
+
+  int actual_num_ranges;
+  const char *err
+    = get_num_source_ranges_for_substring (pfile, concats, strloc, type,
+					   &actual_num_ranges);
+  if (should_have_column_data_p (strloc))
+    ASSERT_EQ_AT (loc, NULL, err);
+  else
+    {
+      ASSERT_STREQ_AT (loc,
+		       "range starts after LINE_MAP_MAX_LOCATION_WITH_COLS",
+		       err);
+      return;
+    }
+  ASSERT_EQ_AT (loc, expected_num_ranges, actual_num_ranges);
+}
+
+/* Macro for calling assert_num_substring_ranges, supplying
+   SELFTEST_LOCATION for the effective location of any errors.  */
+
+#define ASSERT_NUM_SUBSTRING_RANGES(LEXER_TEST, STRLOC, TYPE, \
+				    EXPECTED_NUM_RANGES)		\
+  assert_num_substring_ranges (SELFTEST_LOCATION, (LEXER_TEST), (STRLOC), \
+			       (TYPE), (EXPECTED_NUM_RANGES))
+
+
+/* Verify that get_substring_ranges_for_loc for token(s) at STRLOC fails
+   with EXPECTED_ERR (using the string concatenation database for TEST).  */
+
+static void
+assert_has_no_substring_ranges (const location &loc,
+				lexer_test& test,
+				location_t strloc,
+				enum cpp_ttype type,
+				const char *expected_err)
+{
+  cpp_reader *pfile = test.m_parser;
+  string_concat_db *concats = &test.m_concats;
+  cpp_substring_ranges ranges;
+  const char *actual_err
+    = get_substring_ranges_for_loc (pfile, concats, strloc,
+				    type, ranges);
+  if (should_have_column_data_p (strloc))
+    ASSERT_STREQ_AT (loc, expected_err, actual_err);
+  else
+    ASSERT_STREQ_AT (loc,
+		     "range starts after LINE_MAP_MAX_LOCATION_WITH_COLS",
+		     actual_err);
+}
+
+#define ASSERT_HAS_NO_SUBSTRING_RANGES(LEXER_TEST, STRLOC, TYPE, ERR)    \
+    assert_has_no_substring_ranges (SELFTEST_LOCATION, (LEXER_TEST), \
+				    (STRLOC), (TYPE), (ERR))
+
+/* Lex a simple string literal.  Verify the substring location data, before
+   and after running cpp_interpret_string on it.  */
+
+static void
+test_lexer_string_locations_simple (const line_table_case &case_)
+{
+  /* Digits 0-9 (with 0 at column 10), the simple way.
+     ....................000000000.11111111112.2222222223333333333
+     ....................123456789.01234567890.1234567890123456789
+     We add a trailing comment to ensure that we correctly locate
+     the end of the string literal token.  */
+  const char *content = "        \"0123456789\" /* not a string */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"0123456789\"");
+  ASSERT_TOKEN_LOC_EQ (tok, test.m_tempfile.get_filename (), 1, 9, 20);
+
+  /* At this point in lexing, the quote characters are treated as part of
+     the string (they are stripped off by cpp_interpret_string).  */
+
+  ASSERT_EQ (tok->val.str.len, 12);
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("0123456789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.  */
+  for (int i = 0; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1,
+			  10 + i, 10 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, type, 10);
+}
+
+/* As test_lexer_string_locations_simple, but use an EBCDIC execution
+   encoding.  */
+
+static void
+test_lexer_string_locations_ebcdic (const line_table_case &case_)
+{
+  /* EBCDIC support requires iconv.  */
+  if (!HAVE_ICONV)
+    return;
+
+  /* Digits 0-9 (with 0 at column 10), the simple way.
+     ....................000000000.11111111112.2222222223333333333
+     ....................123456789.01234567890.1234567890123456789
+     We add a trailing comment to ensure that we correctly locate
+     the end of the string literal token.  */
+  const char *content = "        \"0123456789\" /* not a string */\n";
+  ebcdic_execution_charset use_ebcdic;
+  lexer_test test (case_, content, &use_ebcdic);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"0123456789\"");
+  ASSERT_TOKEN_LOC_EQ (tok, test.m_tempfile.get_filename (), 1, 9, 20);
+
+  /* At this point in lexing, the quote characters are treated as part of
+     the string (they are stripped off by cpp_interpret_string).  */
+
+  ASSERT_EQ (tok->val.str.len, 12);
+
+  /* The remainder of the test requires an iconv implementation that
+     can convert from UTF-8 to the EBCDIC encoding requested above.  */
+  if (use_ebcdic.iconv_errors_occurred_p ())
+    return;
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  /* We should now have EBCDIC-encoded text, specifically
+     IBM1047-encoded (aka "EBCDIC 1047", or "Code page 1047").
+     The digits 0-9 are encoded as 240-249 i.e. 0xf0-0xf9.  */
+  ASSERT_STREQ ("\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9",
+		(const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify that we don't attempt to record substring location information
+     for such cases.  */
+  ASSERT_HAS_NO_SUBSTRING_RANGES
+    (test, tok->src_loc, type,
+     "execution character set != source character set");
+}
+
+/* Lex a string literal containing a hex-escaped character.
+   Verify the substring location data, before and after running
+   cpp_interpret_string on it.  */
+
+static void
+test_lexer_string_locations_hex (const line_table_case &case_)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as "\x35"
+     and with a space in place of digit 6, to terminate the escaped
+     hex code.
+     ....................000000000.111111.11112222.
+     ....................123456789.012345.67890123.  */
+  const char *content = "        \"01234\\x35 789\"\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"01234\\x35 789\"");
+  ASSERT_TOKEN_LOC_EQ (tok, test.m_tempfile.get_filename (), 1, 9, 23);
+
+  /* At this point in lexing, the quote characters are treated as part of
+     the string (they are stripped off by cpp_interpret_string).  */
+  ASSERT_EQ (tok->val.str.len, 15);
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("012345 789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.  */
+  for (int i = 0; i <= 4; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 10 + i, 10 + i);
+  ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, 5, 1, 15, 18);
+  for (int i = 6; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 13 + i, 13 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, type, 10);
+}
+
+/* Lex a string literal containing an octal-escaped character.
+   Verify the substring location data after running cpp_interpret_string
+   on it.  */
+
+static void
+test_lexer_string_locations_oct (const line_table_case &case_)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as "\065"
+     and with a space in place of digit 6, to terminate the escaped
+     octal code.
+     ....................000000000.111111.11112222.2222223333333333444
+     ....................123456789.012345.67890123.4567890123456789012  */
+  const char *content = "        \"01234\\065 789\" /* not a string */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"01234\\065 789\"");
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("012345 789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.  */
+  for (int i = 0; i < 5; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 10 + i, 10 + i);
+  ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, 5, 1, 15, 18);
+  for (int i = 6; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 13 + i, 13 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, type, 10);
+}
+
+/* Test of string literal containing letter escapes.  */
+
+static void
+test_lexer_string_locations_letter_escape_1 (const line_table_case &case_)
+{
+  /* The string "\tfoo\\\nbar" i.e. tab, "foo", backslash, newline, bar.
+     .....................000000000.1.11111.1.1.11222.22222223333333
+     .....................123456789.0.12345.6.7.89012.34567890123456.  */
+  const char *content = ("        \"\\tfoo\\\\\\nbar\" /* non-str */\n");
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected tokens back.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"\\tfoo\\\\\\nbar\"");
+
+  /* Verify ranges of individual characters.  */
+  /* "\t".  */
+  ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			0, 1, 10, 11);
+  /* "foo".  */
+  for (int i = 1; i <= 3; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			  i, 1, 11 + i, 11 + i);
+  /* "\\" and "\n".  */
+  ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			4, 1, 15, 16);
+  ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			5, 1, 17, 18);
+
+  /* "bar".  */
+  for (int i = 6; i <= 8; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			  i, 1, 13 + i, 13 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, CPP_STRING, 9);
+}
+
+/* Another test of a string literal containing a letter escape.
+   Based on string seen in
+     printf ("%-%\n");
+   in gcc.dg/format/c90-printf-1.c.  */
+
+static void
+test_lexer_string_locations_letter_escape_2 (const line_table_case &case_)
+{
+  /* .....................000000000.1111.11.1111.22222222223.
+     .....................123456789.0123.45.6789.01234567890.  */
+  const char *content = ("        \"%-%\\n\" /* non-str */\n");
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected tokens back.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"%-%\\n\"");
+
+  /* Verify ranges of individual characters.  */
+  /* "%-%".  */
+  for (int i = 0; i < 3; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			  i, 1, 10 + i, 10 + i);
+  /* "\n".  */
+  ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			3, 1, 13, 14);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, CPP_STRING, 4);
+}
+
+/* Lex a string literal containing UCN 4 characters.
+   Verify the substring location data after running cpp_interpret_string
+   on it.  */
+
+static void
+test_lexer_string_locations_ucn4 (const line_table_case &case_)
+{
+  /* Digits 0-9, expressing digits 5 and 6 as Roman numerals expressed
+     as UCN 4.
+     ....................000000000.111111.111122.222222223.33333333344444
+     ....................123456789.012345.678901.234567890.12345678901234  */
+  const char *content = "        \"01234\\u2174\\u2175789\" /* non-str */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"01234\\u2174\\u2175789\"");
+
+  /* Verify that cpp_interpret_string works.
+     The string should be encoded in the execution character
+     set.  Assuming that that is UTF-8, we should have the following:
+     -----------  ----  -----  -------  ----------------
+     Byte offset  Byte  Octal  Unicode  Source Column(s)
+     -----------  ----  -----  -------  ----------------
+     0            0x30         '0'      10
+     1            0x31         '1'      11
+     2            0x32         '2'      12
+     3            0x33         '3'      13
+     4            0x34         '4'      14
+     5            0xE2  \342   U+2174   15-20
+     6            0x85  \205    (cont)  15-20
+     7            0xB4  \264    (cont)  15-20
+     8            0xE2  \342   U+2175   21-26
+     9            0x85  \205    (cont)  21-26
+     10           0xB5  \265    (cont)  21-26
+     11           0x37         '7'      27
+     12           0x38         '8'      28
+     13           0x39         '9'      29
+     -----------  ----  -----  -------  ---------------.  */
+
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("01234\342\205\264\342\205\265789",
+		(const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.
+     '01234'.  */
+  for (int i = 0; i <= 4; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 10 + i, 10 + i);
+  /* U+2174.  */
+  for (int i = 5; i <= 7; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 15, 20);
+  /* U+2175.  */
+  for (int i = 8; i <= 10; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 21, 26);
+  /* '789'.  */
+  for (int i = 11; i <= 13; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 16 + i, 16 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, type, 14);
+}
+
+/* Lex a string literal containing UCN 8 characters.
+   Verify the substring location data after running cpp_interpret_string
+   on it.  */
+
+static void
+test_lexer_string_locations_ucn8 (const line_table_case &case_)
+{
+  /* Digits 0-9, expressing digits 5 and 6 as Roman numerals as UCN 8.
+     ....................000000000.111111.1111222222.2222333333333.344444
+     ....................123456789.012345.6789012345.6789012345678.901234  */
+  const char *content = "        \"01234\\U00002174\\U00002175789\" /* */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok,
+			   "\"01234\\U00002174\\U00002175789\"");
+
+  /* Verify that cpp_interpret_string works.
+     The UTF-8 encoding of the string is identical to that from
+     the ucn4 testcase above; the only difference is the column
+     locations.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("01234\342\205\264\342\205\265789",
+		(const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.
+     '01234'.  */
+  for (int i = 0; i <= 4; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 10 + i, 10 + i);
+  /* U+2174.  */
+  for (int i = 5; i <= 7; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 15, 24);
+  /* U+2175.  */
+  for (int i = 8; i <= 10; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 25, 34);
+  /* '789' at columns 35-37.  */
+  for (int i = 11; i <= 13; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 24 + i, 24 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, type, 14);
+}
+
+/* Fetch a big-endian 32-bit value and convert to host endianness.  */
+
+static uint32_t
+uint32_from_big_endian (const uint32_t *ptr_be_value)
+{
+  const unsigned char *buf = (const unsigned char *)ptr_be_value;
+  return (((uint32_t) buf[0] << 24)
+	  | ((uint32_t) buf[1] << 16)
+	  | ((uint32_t) buf[2] << 8)
+	  | (uint32_t) buf[3]);
+}
+
+/* Lex a wide string literal and verify that attempts to read substring
+   location data from it fail gracefully.  */
+
+static void
+test_lexer_string_locations_wide_string (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     ....................000000000.11111111112.22222222233333
+     ....................123456789.01234567890.12345678901234  */
+  const char *content = "       L\"0123456789\" /* non-str */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_WSTRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "L\"0123456789\"");
+
+  /* Verify that cpp_interpret_string works, using CPP_WSTRING.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_WSTRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  /* The cpp_reader defaults to big-endian with
+     CHAR_BIT * sizeof (int) for the wchar_precision, so dst_string should
+     now be encoded as UTF-32BE.  */
+  const uint32_t *be32_chars = (const uint32_t *)dst_string.text;
+  ASSERT_EQ ('0', uint32_from_big_endian (&be32_chars[0]));
+  ASSERT_EQ ('5', uint32_from_big_endian (&be32_chars[5]));
+  ASSERT_EQ ('9', uint32_from_big_endian (&be32_chars[9]));
+  ASSERT_EQ (0, uint32_from_big_endian (&be32_chars[10]));
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* We don't yet support generating substring location information
+     for L"" strings.  */
+  ASSERT_HAS_NO_SUBSTRING_RANGES
+    (test, tok->src_loc, type,
+     "execution character set != source character set");
+}
+
+/* Fetch a big-endian 16-bit value and convert to host endianness.  */
+
+static uint16_t
+uint16_from_big_endian (const uint16_t *ptr_be_value)
+{
+  const unsigned char *buf = (const unsigned char *)ptr_be_value;
+  return ((uint16_t) buf[0] << 8) | (uint16_t) buf[1];
+}
+
+/* Lex a u"" string literal and verify that attempts to read substring
+   location data from it fail gracefully.  */
+
+static void
+test_lexer_string_locations_string16 (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     ....................000000000.11111111112.22222222233333
+     ....................123456789.01234567890.12345678901234  */
+  const char *content = "       u\"0123456789\" /* non-str */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING16);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "u\"0123456789\"");
+
+  /* Verify that cpp_interpret_string works, using CPP_STRING16.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING16;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+
+  /* The cpp_reader defaults to big-endian, so dst_string should
+     now be encoded as UTF-16BE.  */
+  const uint16_t *be16_chars = (const uint16_t *)dst_string.text;
+  ASSERT_EQ ('0', uint16_from_big_endian (&be16_chars[0]));
+  ASSERT_EQ ('5', uint16_from_big_endian (&be16_chars[5]));
+  ASSERT_EQ ('9', uint16_from_big_endian (&be16_chars[9]));
+  ASSERT_EQ (0, uint16_from_big_endian (&be16_chars[10]));
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* We don't yet support generating substring location information
+     for u"" strings.  */
+  ASSERT_HAS_NO_SUBSTRING_RANGES
+    (test, tok->src_loc, type,
+     "execution character set != source character set");
+}
+
+/* Lex a U"" string literal and verify that attempts to read substring
+   location data from it fail gracefully.  */
+
+static void
+test_lexer_string_locations_string32 (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     ....................000000000.11111111112.22222222233333
+     ....................123456789.01234567890.12345678901234  */
+  const char *content = "       U\"0123456789\" /* non-str */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING32);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "U\"0123456789\"");
+
+  /* Verify that cpp_interpret_string works, using CPP_STRING32.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING32;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+
+  /* The cpp_reader defaults to big-endian, so dst_string should
+     now be encoded as UTF-32BE.  */
+  const uint32_t *be32_chars = (const uint32_t *)dst_string.text;
+  ASSERT_EQ ('0', uint32_from_big_endian (&be32_chars[0]));
+  ASSERT_EQ ('5', uint32_from_big_endian (&be32_chars[5]));
+  ASSERT_EQ ('9', uint32_from_big_endian (&be32_chars[9]));
+  ASSERT_EQ (0, uint32_from_big_endian (&be32_chars[10]));
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* We don't yet support generating substring location information
+     for U"" strings.  */
+  ASSERT_HAS_NO_SUBSTRING_RANGES
+    (test, tok->src_loc, type,
+     "execution character set != source character set");
+}
+
+/* Lex a u8-string literal.
+   Verify the substring location data after running cpp_interpret_string
+   on it.  */
+
+static void
+test_lexer_string_locations_u8 (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     ....................000000000.11111111112.22222222233333
+     ....................123456789.01234567890.12345678901234  */
+  const char *content = "      u8\"0123456789\" /* non-str */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_UTF8STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "u8\"0123456789\"");
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("0123456789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.  */
+  for (int i = 0; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 10 + i, 10 + i);
+}
+
+/* Lex a string literal containing UTF-8 source characters.
+   Verify the substring location data after running cpp_interpret_string
+   on it.  */
+
+static void
+test_lexer_string_locations_utf8_source (const line_table_case &case_)
+{
+  /* This string literal is written out to the source file as UTF-8,
+     and is of the form "before mojibake after", where "mojibake"
+     is written as the following four unicode code points:
+        U+6587 CJK UNIFIED IDEOGRAPH-6587
+        U+5B57 CJK UNIFIED IDEOGRAPH-5B57
+        U+5316 CJK UNIFIED IDEOGRAPH-5316
+        U+3051 HIRAGANA LETTER KE.
+     Each of these is 3 bytes wide when encoded in UTF-8, whereas the
+     "before" and "after" are 1 byte per unicode character.
+
+     The numbers shown are "columns", which are *byte* numbers within
+     the line, rather than unicode character numbers.
+
+     .................... 000000000.1111111.
+     .................... 123456789.0123456.  */
+  const char *content = ("        \"before "
+			 /* U+6587 CJK UNIFIED IDEOGRAPH-6587
+			      UTF-8: 0xE6 0x96 0x87
+			      C octal escaped UTF-8: \346\226\207
+			    "column" numbers: 17-19.  */
+			 "\346\226\207"
+
+			 /* U+5B57 CJK UNIFIED IDEOGRAPH-5B57
+			      UTF-8: 0xE5 0xAD 0x97
+			      C octal escaped UTF-8: \345\255\227
+			    "column" numbers: 20-22.  */
+			 "\345\255\227"
+
+			 /* U+5316 CJK UNIFIED IDEOGRAPH-5316
+			      UTF-8: 0xE5 0x8C 0x96
+			      C octal escaped UTF-8: \345\214\226
+			    "column" numbers: 23-25.  */
+			 "\345\214\226"
+
+			 /* U+3051 HIRAGANA LETTER KE
+			      UTF-8: 0xE3 0x81 0x91
+			      C octal escaped UTF-8: \343\201\221
+			    "column" numbers: 26-28.  */
+			 "\343\201\221"
+
+			 /* column numbers 29 onwards
+			  2333333.33334444444444
+			  9012345.67890123456789. */
+			 " after\" /* non-str */\n");
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ
+    (test.m_parser, tok,
+     "\"before \346\226\207\345\255\227\345\214\226\343\201\221 after\"");
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ
+    ("before \346\226\207\345\255\227\345\214\226\343\201\221 after",
+     (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.
+     Assuming that both source and execution encodings are UTF-8, we have
+     a run of 25 octets in each.  */
+  for (int i = 0; i < 25; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 10 + i, 10 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, type, 25);
+}
+
+/* Test of string literal concatenation.  */
+
+static void
+test_lexer_string_locations_concatenation_1 (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     .....................000000000.111111.11112222222222
+     .....................123456789.012345.67890123456789.  */
+  const char *content = ("        \"01234\" /* non-str */\n"
+			 "        \"56789\" /* non-str */\n");
+  lexer_test test (case_, content, NULL);
+
+  location_t input_locs[2];
+
+  /* Verify that we get the expected tokens back.  */
+  auto_vec <cpp_string> input_strings;
+  const cpp_token *tok_a = test.get_token ();
+  ASSERT_EQ (tok_a->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok_a, "\"01234\"");
+  input_strings.safe_push (tok_a->val.str);
+  input_locs[0] = tok_a->src_loc;
+
+  const cpp_token *tok_b = test.get_token ();
+  ASSERT_EQ (tok_b->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok_b, "\"56789\"");
+  input_strings.safe_push (tok_b->val.str);
+  input_locs[1] = tok_b->src_loc;
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser,
+				      input_strings.address (), 2,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("0123456789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Simulate c-lex.c's lex_string in order to record concatenation.  */
+  test.m_concats.record_string_concatenation (2, input_locs);
+
+  location_t initial_loc = input_locs[0];
+
+  for (int i = 0; i <= 4; i++)
+    ASSERT_CHAR_AT_RANGE (test, initial_loc, type, i, 1, 10 + i, 10 + i);
+  for (int i = 5; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, initial_loc, type, i, 2, 5 + i, 5 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, initial_loc, type, 10);
+}
+
+/* Another test of string literal concatenation.  */
+
+static void
+test_lexer_string_locations_concatenation_2 (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     .....................000000000.111.11111112222222
+     .....................123456789.012.34567890123456.  */
+  const char *content = ("        \"01\" /* non-str */\n"
+			 "        \"23\" /* non-str */\n"
+			 "        \"45\" /* non-str */\n"
+			 "        \"67\" /* non-str */\n"
+			 "        \"89\" /* non-str */\n");
+  lexer_test test (case_, content, NULL);
+
+  auto_vec <cpp_string> input_strings;
+  location_t input_locs[5];
+
+  /* Verify that we get the expected tokens back.  */
+  for (int i = 0; i < 5; i++)
+    {
+      const cpp_token *tok = test.get_token ();
+      ASSERT_EQ (tok->type, CPP_STRING);
+      input_strings.safe_push (tok->val.str);
+      input_locs[i] = tok->src_loc;
+    }
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser,
+				      input_strings.address (), 5,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("0123456789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Simulate c-lex.c's lex_string in order to record concatenation.  */
+  test.m_concats.record_string_concatenation (5, input_locs);
+
+  location_t initial_loc = input_locs[0];
+
+  /* Within ASSERT_CHAR_AT_RANGE (actually assert_char_at_range), we can
+     detect if the initial loc is after LINE_MAP_MAX_LOCATION_WITH_COLS
+     and expect get_source_range_for_substring to fail.
+     However, for a string concatenation test, we can have a case
+     where the initial string is fully before LINE_MAP_MAX_LOCATION_WITH_COLS,
+     but subsequent strings can be after it.
+     Attempting to detect this within assert_char_at_range
+     would overcomplicate the logic for the common test cases, so
+     we detect it here.  */
+  if (should_have_column_data_p (input_locs[0])
+      && !should_have_column_data_p (input_locs[4]))
+    {
+      /* Verify that get_source_range_for_substring gracefully rejects
+	 this case.  */
+      source_range actual_range;
+      const char *err
+	= get_source_range_for_substring (test.m_parser, &test.m_concats,
+					  initial_loc, type, 0, 0,
+					  &actual_range);
+      ASSERT_STREQ ("range starts after LINE_MAP_MAX_LOCATION_WITH_COLS", err);
+      return;
+    }
+
+  for (int i = 0; i < 5; i++)
+    for (int j = 0; j < 2; j++)
+      ASSERT_CHAR_AT_RANGE (test, initial_loc, type, (i * 2) + j,
+			    i + 1, 10 + j, 10 + j);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, initial_loc, type, 10);
+}
+
+/* Another test of string literal concatenation, this time combined with
+   various kinds of escaped characters.  */
+
+static void
+test_lexer_string_locations_concatenation_3 (const line_table_case &case_)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as hex "\x35" and
+     digit 6 in ASCII as octal "\066", concatenating multiple strings.  */
+  const char *content
+    /* .000000000.111111.111.1.2222.222.2.2233.333.3333.34444444444555
+       .123456789.012345.678.9.0123.456.7.8901.234.5678.90123456789012. */
+    = ("        \"01234\"  \"\\x35\"  \"\\066\"  \"789\" /* non-str */\n");
+  lexer_test test (case_, content, NULL);
+
+  auto_vec <cpp_string> input_strings;
+  location_t input_locs[4];
+
+  /* Verify that we get the expected tokens back.  */
+  for (int i = 0; i < 4; i++)
+    {
+      const cpp_token *tok = test.get_token ();
+      ASSERT_EQ (tok->type, CPP_STRING);
+      input_strings.safe_push (tok->val.str);
+      input_locs[i] = tok->src_loc;
+    }
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser,
+				      input_strings.address (), 4,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("0123456789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Simulate c-lex.c's lex_string in order to record concatenation.  */
+  test.m_concats.record_string_concatenation (4, input_locs);
+
+  location_t initial_loc = input_locs[0];
+
+  for (int i = 0; i <= 4; i++)
+    ASSERT_CHAR_AT_RANGE (test, initial_loc, type, i, 1, 10 + i, 10 + i);
+  ASSERT_CHAR_AT_RANGE (test, initial_loc, type, 5, 1, 19, 22);
+  ASSERT_CHAR_AT_RANGE (test, initial_loc, type, 6, 1, 27, 30);
+  for (int i = 7; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, initial_loc, type, i, 1, 28 + i, 28 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, initial_loc, type, 10);
+}
+
+/* Test of string literal in a macro.  */
+
+static void
+test_lexer_string_locations_macro (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     .....................0000000001111111111.22222222223.
+     .....................1234567890123456789.01234567890.  */
+  const char *content = ("#define MACRO     \"0123456789\" /* non-str */\n"
+			 "  MACRO");
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected tokens back.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_PADDING);
+
+  tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"0123456789\"");
+
+  /* Verify ranges of individual characters.  We ought to
+     see columns within the macro definition.  */
+  for (int i = 0; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			  i, 1, 20 + i, 20 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, CPP_STRING, 10);
+
+  tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_PADDING);
+}
+
+/* Test of stringification of a macro argument.  */
+
+static void
+test_lexer_string_locations_stringified_macro_argument
+  (const line_table_case &case_)
+{
+  /* .....................000000000111111111122222222223.
+     .....................123456789012345678901234567890.  */
+  const char *content = ("#define MACRO(X) #X /* non-str */\n"
+			 "MACRO(foo)\n");
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_PADDING);
+
+  tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"foo\"");
+
+  /* We don't support getting the location of a stringified macro
+     argument.  Verify that it fails gracefully.  */
+  ASSERT_HAS_NO_SUBSTRING_RANGES (test, tok->src_loc, CPP_STRING,
+				  "cpp_interpret_string_1 failed");
+
+  tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_PADDING);
+
+  tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_PADDING);
+}
+
+/* Ensure that we fail gracefully if something attempts to pass
+   in a location that isn't a string literal token.  Seen on this code:
+
+     const char a[] = " %d ";
+     __builtin_printf (a, 0.5);
+                       ^
+
+   when c-format.c erroneously used the indicated one-character
+   location as the format string location, leading to a read past the
+   end of a string buffer in cpp_interpret_string_1.  */
+
+static void
+test_lexer_string_locations_non_string (const line_table_case &case_)
+{
+  /* .....................000000000111111111122222222223.
+     .....................123456789012345678901234567890.  */
+  const char *content = ("         a\n");
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_NAME);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "a");
+
+  /* At this point, libcpp is attempting to interpret the name as a
+     string literal, despite it not starting with a quote.  We don't detect
+     that, but we should at least fail gracefully.  */
+  ASSERT_HAS_NO_SUBSTRING_RANGES (test, tok->src_loc, CPP_STRING,
+				  "cpp_interpret_string_1 failed");
+}
+
+/* Ensure that we can read substring information for a token which
+   starts in one linemap and ends in another.  Adapted from
+   gcc.dg/cpp/pr69985.c.  */
+
+static void
+test_lexer_string_locations_long_line (const line_table_case &case_)
+{
+  /* .....................000000.000111111111
+     .....................123456.789012345678.  */
+  const char *content = ("/* A very long line, so that we start a new line map.  */\n"
+			 "     \"0123456789012345678901234567890123456789"
+			 "0123456789012345678901234567890123456789"
+			 "0123456789012345678901234567890123456789"
+			 "0123456789\"\n");
+
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+
+  if (!should_have_column_data_p (line_table->highest_location))
+    return;
+
+  /* Verify ranges of individual characters.  */
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, CPP_STRING, 130);
+  for (int i = 0; i < 130; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			  i, 2, 7 + i, 7 + i);
+}
+
+/* Test of lexing char constants.  */
+
+static void
+test_lexer_char_constants (const line_table_case &case_)
+{
+  /* Various char constants.
+     .....................0000000001111111111.22222222223.
+     .....................1234567890123456789.01234567890.  */
+  const char *content = ("         'a'\n"
+			 "        u'a'\n"
+			 "        U'a'\n"
+			 "        L'a'\n"
+			 "         'abc'\n");
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected tokens back.  */
+  /* 'a'.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_CHAR);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "'a'");
+
+  unsigned int chars_seen;
+  int unsignedp;
+  cppchar_t cc = cpp_interpret_charconst (test.m_parser, tok,
+					  &chars_seen, &unsignedp);
+  ASSERT_EQ (cc, 'a');
+  ASSERT_EQ (chars_seen, 1);
+
+  /* u'a'.  */
+  tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_CHAR16);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "u'a'");
+
+  /* U'a'.  */
+  tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_CHAR32);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "U'a'");
+
+  /* L'a'.  */
+  tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_WCHAR);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "L'a'");
+
+  /* 'abc' (c-char-sequence).  */
+  tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_CHAR);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "'abc'");
+}
 /* A table of interesting location_t values, giving one axis of our test
    matrix.  */
 
@@ -1599,6 +3125,27 @@ input_c_tests ()
 	  /* Run all tests for the given case within the test matrix.  */
 	  test_accessing_ordinary_linemaps (c);
 	  test_lexer (c);
+	  test_lexer_string_locations_simple (c);
+	  test_lexer_string_locations_ebcdic (c);
+	  test_lexer_string_locations_hex (c);
+	  test_lexer_string_locations_oct (c);
+	  test_lexer_string_locations_letter_escape_1 (c);
+	  test_lexer_string_locations_letter_escape_2 (c);
+	  test_lexer_string_locations_ucn4 (c);
+	  test_lexer_string_locations_ucn8 (c);
+	  test_lexer_string_locations_wide_string (c);
+	  test_lexer_string_locations_string16 (c);
+	  test_lexer_string_locations_string32 (c);
+	  test_lexer_string_locations_u8 (c);
+	  test_lexer_string_locations_utf8_source (c);
+	  test_lexer_string_locations_concatenation_1 (c);
+	  test_lexer_string_locations_concatenation_2 (c);
+	  test_lexer_string_locations_concatenation_3 (c);
+	  test_lexer_string_locations_macro (c);
+	  test_lexer_string_locations_stringified_macro_argument (c);
+	  test_lexer_string_locations_non_string (c);
+	  test_lexer_string_locations_long_line (c);
+	  test_lexer_char_constants (c);
 
 	  num_cases_tested++;
 	}
diff --git a/gcc/input.h b/gcc/input.h
index d51f950..c17e440 100644
--- a/gcc/input.h
+++ b/gcc/input.h
@@ -95,4 +95,39 @@ void dump_location_info (FILE *stream);
 
 void diagnostics_file_cache_fini (void);
 
+struct GTY(()) string_concat
+{
+  string_concat (int num, location_t *locs);
+
+  int m_num;
+  location_t * GTY ((atomic)) m_locs;
+};
+
+struct location_hash : int_hash <location_t, UNKNOWN_LOCATION> { };
+
+class GTY(()) string_concat_db
+{
+ public:
+  string_concat_db ();
+  void record_string_concatenation (int num, location_t *locs);
+
+  bool get_string_concatenation (location_t loc,
+				 int *out_num,
+				 location_t **out_locs);
+
+ private:
+  static location_t get_key_loc (location_t loc);
+
+  /* For the fields to be private, we must grant access to the
+     generated code in gtype-desc.c.  */
+
+  friend void ::gt_ggc_mx_string_concat_db (void *x_p);
+  friend void ::gt_pch_nx_string_concat_db (void *x_p);
+  friend void ::gt_pch_p_16string_concat_db (void *this_obj, void *x_p,
+					     gt_pointer_operator op,
+					     void *cookie);
+
+  hash_map <location_hash, string_concat *> *m_table;
+};
+
 #endif
diff --git a/gcc/substring-locations.h b/gcc/substring-locations.h
new file mode 100644
index 0000000..274ebbe
--- /dev/null
+++ b/gcc/substring-locations.h
@@ -0,0 +1,30 @@
+/* Source locations within string literals.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#ifndef GCC_SUBSTRING_LOCATIONS_H
+#define GCC_SUBSTRING_LOCATIONS_H
+
+extern const char *get_source_range_for_substring (cpp_reader *pfile,
+						   string_concat_db *concats,
+						   location_t strloc,
+						   enum cpp_ttype type,
+						   int start_idx, int end_idx,
+						   source_range *out_range);
+
+#endif /* ! GCC_SUBSTRING_LOCATIONS_H */
diff --git a/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c b/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c
new file mode 100644
index 0000000..82689b4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c
@@ -0,0 +1,211 @@
+/* { dg-do compile } */
+/* { dg-options "-O -fdiagnostics-show-caret" } */
+
+/* This is a collection of unittests for ranges within string literals,
+   using diagnostic_plugin_test_string_literals, which handles
+   "__emit_string_literal_range" by generating a warning at the given
+   subset of a string literal.
+
+   The indices are 0-based.  It's easiest to verify things using string
+   literals that are runs of 0-based digits (to avoid having to count
+   characters).
+
+   LITERAL is a const void * to allow testing the various kinds of wide
+   string literal, rather than just const char *.  */
+
+extern void __emit_string_literal_range (const void *literal,
+					 int start_idx, int end_idx);
+
+void
+test_simple_string_literal (void)
+{
+  __emit_string_literal_range ("0123456789", /* { dg-warning "range" } */
+			       6, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("0123456789",
+                                       ^~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_concatenated_string_literal (void)
+{
+  __emit_string_literal_range ("01234" "56789", /* { dg-warning "range" } */
+			       3, 6);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234" "56789",
+                                    ^~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_multiline_string_literal (void)
+{
+  __emit_string_literal_range ("01234" /* { dg-warning "range" } */
+                               "56789",
+                               3, 6);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234"
+                                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+                                "56789",
+                                ~~~  
+   { dg-end-multiline-output "" } */
+  /* FIXME: why does the above need two trailing spaces?  */
+}
+
+/* Tests of various unicode encodings.
+
+   Digits 0 through 9 are unicode code points:
+      U+0030 DIGIT ZERO
+      ...
+      U+0039 DIGIT NINE
+   However, these are not always valid as UCN (see the comment in
+   libcpp/charset.c:_cpp_valid_ucn).
+
+   Hence we need to test UCN using an alternative unicode
+   representation of numbers; let's use Roman numerals,
+   (though these start at one, not zero):
+      U+2170 SMALL ROMAN NUMERAL ONE
+      ...
+      U+2174 SMALL ROMAN NUMERAL FIVE  ("v")
+      U+2175 SMALL ROMAN NUMERAL SIX   ("vi")
+      ...
+      U+2178 SMALL ROMAN NUMERAL NINE.  */
+
+void
+test_hex (void)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as "\x35"
+     and with a space in place of digit 6, to terminate the escaped
+     hex code.  */
+  __emit_string_literal_range ("01234\x35 789", /* { dg-warning "range" } */
+			       3, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234\x35 789"
+                                    ^~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_oct (void)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as "\065"
+     and with a space in place of digit 6, to terminate the escaped
+     octal code.  */
+  __emit_string_literal_range ("01234\065 789", /* { dg-warning "range" } */
+			       3, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234\065 789"
+                                    ^~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_multiple (void)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as hex "\x35" and
+     digit 6 in ASCII as octal "\066", concatenating multiple strings.  */
+  __emit_string_literal_range ("01234"  "\x35"  "\066"  "789", /* { dg-warning "range" } */
+			       3, 8);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234"  "\x35"  "\066"  "789",
+                                    ^~~~~~~~~~~~~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_ucn4 (void)
+{
+  /* Digits 0-9, expressing digits 5 and 6 as Roman numerals expressed
+     as UCN 4.
+     The resulting string is encoded as UTF-8.  Most of the digits are 1 byte
+     each, but digits 5 and 6 are encoded with 3 bytes each.
+     Hence to underline digits 4-7 we need to underline using bytes 4-11 in
+     the UTF-8 encoding.  */
+  __emit_string_literal_range ("01234\u2174\u2175789", /* { dg-warning "range" } */
+			       4, 11);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234\u2174\u2175789",
+                                     ^~~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_ucn8 (void)
+{
+  /* Digits 0-9, expressing digits 5 and 6 as Roman numerals as UCN 8.
+     The resulting string is the same as in test_ucn4 above, and hence
+     has the same UTF-8 encoding, and so we again need to underline bytes
+     4-11 in the UTF-8 encoding in order to underline digits 4-7.  */
+  __emit_string_literal_range ("01234\U00002174\U00002175789", /* { dg-warning "range" } */
+			       4, 11);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234\U00002174\U00002175789",
+                                     ^~~~~~~~~~~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_u8 (void)
+{
+  /* Digits 0-9.  */
+  __emit_string_literal_range (u8"0123456789", /* { dg-warning "range" } */
+			       4, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range (u8"0123456789",
+                                       ^~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_u (void)
+{
+  /* Digits 0-9.  */
+  __emit_string_literal_range (u"0123456789", /* { dg-error "unable to read substring range: execution character set != source character set" } */
+			       4, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range (u"0123456789",
+                                ^~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_U (void)
+{
+  /* Digits 0-9.  */
+  __emit_string_literal_range (U"0123456789", /* { dg-error "unable to read substring range: execution character set != source character set" } */
+			       4, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range (U"0123456789",
+                                ^~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_L (void)
+{
+  /* Digits 0-9.  */
+  __emit_string_literal_range (L"0123456789", /* { dg-error "unable to read substring range: execution character set != source character set" } */
+			       4, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range (L"0123456789",
+                                ^~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_macro (void)
+{
+#define START "01234"  /* { dg-warning "range" } */
+  __emit_string_literal_range (START
+                               "56789",
+                               3, 6);
+/* { dg-begin-multiline-output "" }
+ #define START "01234"
+                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+   __emit_string_literal_range (START
+   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+                                "56789",
+                                ~~~
+   { dg-end-multiline-output "" } */
+}
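
The byte arithmetic used by the UCN tests above (digits 4-7 mapping to
bytes 4-11) can be checked with a small standalone sketch.  This is not
part of the patch: utf8_len and utf8_byte_range are hypothetical helper
names, and the sketch assumes a plain UTF-8 execution character set.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Number of bytes needed to encode code point CP in UTF-8.
static std::size_t
utf8_len (char32_t cp)
{
  if (cp < 0x80)
    return 1;
  if (cp < 0x800)
    return 2;
  if (cp < 0x10000)
    return 3;
  return 4;
}

// S is a string given as a sequence of code points.  Return the
// inclusive byte range in its UTF-8 encoding that covers code points
// FIRST_CP through LAST_CP (both inclusive).
static std::pair<std::size_t, std::size_t>
utf8_byte_range (const std::u32string &s,
		 std::size_t first_cp, std::size_t last_cp)
{
  assert (first_cp <= last_cp && last_cp < s.size ());

  std::size_t offset = 0, start = 0, finish = 0;
  for (std::size_t i = 0; i <= last_cp; i++)
    {
      if (i == first_cp)
	start = offset;
      offset += utf8_len (s[i]);
      if (i == last_cp)
	finish = offset - 1;
    }
  return std::make_pair (start, finish);
}
```

Calling utf8_byte_range (U"01234\u2174\u2175789", 4, 7) gives the pair
(4, 11): U+2174 and U+2175 each take 3 bytes, so they occupy bytes 5-7
and 8-10, and the digit 7 lands on byte 11.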
diff --git a/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-2.c b/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-2.c
new file mode 100644
index 0000000..7851c02
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-2.c
@@ -0,0 +1,53 @@
+/* { dg-do compile } */
+
+/* See the notes in diagnostic-test-string-literals-1.c.
+   This test case has caret-printing disabled.  */
+
+extern void __emit_string_literal_range (const void *literal,
+					 int start_idx, int end_idx);
+/* Test of a stringified macro argument, by itself.  */
+
+void
+test_stringified_token_1 (int x)
+{
+#define STRINGIFY(EXPR) #EXPR
+
+  __emit_string_literal_range (STRINGIFY(x > 0), /* { dg-error "unable to read substring range: macro expansion" } */
+                               0, 4);
+
+#undef STRINGIFY
+}
+
+/* Test of a stringified token within a concatenation.  */
+
+void
+test_stringized_token_2 (int x)
+{
+#define EXAMPLE(EXPR, START_IDX, END_IDX)			\
+  do {								\
+    __emit_string_literal_range ("  before " #EXPR " after \n",	\
+				 START_IDX, END_IDX);		\
+  } while (0)
+
+  EXAMPLE(x > 0, 1, 6);
+  /* { dg-error "unable to read substring range: cpp_interpret_string_1 failed" "" { target *-*-* } 28 } */
+
+#undef EXAMPLE
+}
+
+/* Test of a doubly-stringified macro argument (by itself).  */
+
+void
+test_stringified_token_3 (int x)
+{
+#define XSTR(s) STR(s)
+#define STR(s) #s
+#define FOO 123456789
+  __emit_string_literal_range (XSTR (FOO), /* { dg-error "unable to read substring range: macro expansion" } */
+                               2, 3);
+
+#undef XSTR
+#undef STR
+#undef FOO
+}
+
diff --git a/gcc/testsuite/gcc.dg/plugin/diagnostic_plugin_test_string_literals.c b/gcc/testsuite/gcc.dg/plugin/diagnostic_plugin_test_string_literals.c
new file mode 100644
index 0000000..d44612a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/plugin/diagnostic_plugin_test_string_literals.c
@@ -0,0 +1,212 @@
+/* This plugin uses the diagnostics code to verify tracking of source code
+   locations within string literals.  */
+/* { dg-options "-O" } */
+
+#include "gcc-plugin.h"
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "tree.h"
+#include "stringpool.h"
+#include "toplev.h"
+#include "basic-block.h"
+#include "hash-table.h"
+#include "vec.h"
+#include "ggc.h"
+#include "basic-block.h"
+#include "tree-ssa-alias.h"
+#include "internal-fn.h"
+#include "gimple-fold.h"
+#include "tree-eh.h"
+#include "gimple-expr.h"
+#include "is-a.h"
+#include "gimple.h"
+#include "gimple-iterator.h"
+#include "tree.h"
+#include "tree-pass.h"
+#include "intl.h"
+#include "plugin-version.h"
+#include "c-family/c-common.h"
+#include "diagnostic.h"
+#include "context.h"
+#include "print-tree.h"
+#include "cpplib.h"
+#include "c-family/c-pragma.h"
+
+int plugin_is_GPL_compatible;
+
+/* A custom pass for printing string literal location information.  */
+
+const pass_data pass_data_test_string_literals =
+{
+  GIMPLE_PASS, /* type */
+  "test_string_literals", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  PROP_ssa, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_test_string_literals : public gimple_opt_pass
+{
+public:
+  pass_test_string_literals(gcc::context *ctxt)
+    : gimple_opt_pass(pass_data_test_string_literals, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  bool gate (function *) { return true; }
+  virtual unsigned int execute (function *);
+
+}; // class pass_test_string_literals
+
+/* Determine if STMT is a call with NUM_ARGS arguments to a function
+   named FUNCNAME.
+   If so, return STMT as a gcall *.  Otherwise return NULL.  */
+
+static gcall *
+check_for_named_call (gimple *stmt,
+		      const char *funcname, unsigned int num_args)
+{
+  gcc_assert (funcname);
+
+  gcall *call = dyn_cast <gcall *> (stmt);
+  if (!call)
+    return NULL;
+
+  tree fndecl = gimple_call_fndecl (call);
+  if (!fndecl)
+    return NULL;
+
+  if (strcmp (IDENTIFIER_POINTER (DECL_NAME (fndecl)), funcname))
+    return NULL;
+
+  if (gimple_call_num_args (call) != num_args)
+    {
+      error_at (stmt->location, "expected number of args: %i (got %i)",
+		num_args, gimple_call_num_args (call));
+      return NULL;
+    }
+
+  return call;
+}
+
+/* Emit a warning covering SRC_RANGE, with the caret at the start of
+   SRC_RANGE.  */
+
+static void
+emit_warning (source_range src_range)
+{
+  location_t loc
+    = make_location (src_range.m_start, src_range.m_start, src_range.m_finish);
+  warning_at (loc, 0, "range %i:%i-%i:%i",
+	      LOCATION_LINE (src_range.m_start),
+	      LOCATION_COLUMN (src_range.m_start),
+	      LOCATION_LINE (src_range.m_finish),
+	      LOCATION_COLUMN (src_range.m_finish));
+}
+
+/* Support code for verifying that we are correctly tracking ranges
+   within string literals, for use by diagnostic-test-string-literals-*.c.
+   Emit a warning showing the range of a string literal, for each call to
+   a function named "__emit_string_literal_range".
+   The initial argument should be a string literal; arguments 2 and 3
+   should be integer constants, giving the range within the string
+   to be printed.  */
+
+static void
+test_string_literals (gimple *stmt)
+{
+  gcall *call = check_for_named_call (stmt, "__emit_string_literal_range", 3);
+  if (!call)
+    return;
+
+  /* We expect an ADDR_EXPR with a STRING_CST inside it for the
+     initial arg.  */
+  tree t_addr_string = gimple_call_arg (call, 0);
+  if (TREE_CODE (t_addr_string) != ADDR_EXPR)
+    {
+      error_at (call->location, "string literal required for arg 1");
+      return;
+    }
+
+  tree t_string = TREE_OPERAND (t_addr_string, 0);
+  if (TREE_CODE (t_string) != STRING_CST)
+    {
+      error_at (call->location, "string literal required for arg 1");
+      return;
+    }
+
+  tree t_start_idx = gimple_call_arg (call, 1);
+  if (TREE_CODE (t_start_idx) != INTEGER_CST)
+    {
+      error_at (call->location, "integer constant required for arg 2");
+      return;
+    }
+  int start_idx = TREE_INT_CST_LOW (t_start_idx);
+
+  tree t_end_idx = gimple_call_arg (call, 2);
+  if (TREE_CODE (t_end_idx) != INTEGER_CST)
+    {
+      error_at (call->location, "integer constant required for arg 3");
+      return;
+    }
+  int end_idx = TREE_INT_CST_LOW (t_end_idx);
+
+  /* A STRING_CST doesn't have a location, but the ADDR_EXPR does.  */
+  location_t strloc = EXPR_LOCATION (t_addr_string);
+  source_range src_range;
+  substring_loc substr_loc (strloc, TREE_TYPE (t_string),
+			    start_idx, end_idx);
+  const char *err = substr_loc.get_range (&src_range);
+  if (err)
+    error_at (strloc, "unable to read substring range: %s", err);
+  else
+    emit_warning (src_range);
+}
+
+/* Call test_string_literals on every statement within FUN.  */
+
+unsigned int
+pass_test_string_literals::execute (function *fun)
+{
+  gimple_stmt_iterator gsi;
+  basic_block bb;
+
+  FOR_EACH_BB_FN (bb, fun)
+    for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+      {
+	gimple *stmt = gsi_stmt (gsi);
+	test_string_literals (stmt);
+      }
+
+  return 0;
+}
+
+/* Entrypoint for the plugin.  Create and register the custom pass.  */
+
+int
+plugin_init (struct plugin_name_args *plugin_info,
+	     struct plugin_gcc_version *version)
+{
+  struct register_pass_info pass_info;
+  const char *plugin_name = plugin_info->base_name;
+  int argc = plugin_info->argc;
+  struct plugin_argument *argv = plugin_info->argv;
+
+  if (!plugin_default_version_check (version, &gcc_version))
+    return 1;
+
+  pass_info.pass = new pass_test_string_literals (g);
+  pass_info.reference_pass_name = "ssa";
+  pass_info.ref_pass_instance_number = 1;
+  pass_info.pos_op = PASS_POS_INSERT_AFTER;
+  register_callback (plugin_name, PLUGIN_PASS_MANAGER_SETUP, NULL,
+		     &pass_info);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/plugin/plugin.exp b/gcc/testsuite/gcc.dg/plugin/plugin.exp
index faebb75..715038a 100644
--- a/gcc/testsuite/gcc.dg/plugin/plugin.exp
+++ b/gcc/testsuite/gcc.dg/plugin/plugin.exp
@@ -70,6 +70,9 @@ set plugin_test_list [list \
 	  diagnostic-test-expressions-1.c } \
     { diagnostic_plugin_show_trees.c \
 	  diagnostic-test-show-trees-1.c } \
+    { diagnostic_plugin_test_string_literals.c \
+	  diagnostic-test-string-literals-1.c \
+	  diagnostic-test-string-literals-2.c } \
     { location_overflow_plugin.c \
 	  location-overflow-test-1.c \
 	  location-overflow-test-2.c } \
diff --git a/libcpp/charset.c b/libcpp/charset.c
index 2d07942..3739d6c 100644
--- a/libcpp/charset.c
+++ b/libcpp/charset.c
@@ -812,6 +812,51 @@ cpp_host_to_exec_charset (cpp_reader *pfile, cppchar_t c)
 
 \f
 
+/* cpp_substring_ranges's constructor.  */
+
+cpp_substring_ranges::cpp_substring_ranges () :
+  m_ranges (NULL),
+  m_num_ranges (0),
+  m_alloc_ranges (8)
+{
+  m_ranges = XNEWVEC (source_range, m_alloc_ranges);
+}
+
+/* cpp_substring_ranges's destructor.  */
+
+cpp_substring_ranges::~cpp_substring_ranges ()
+{
+  free (m_ranges);
+}
+
+/* Add RANGE to the vector of source_range information.  */
+
+void
+cpp_substring_ranges::add_range (source_range range)
+{
+  if (m_num_ranges >= m_alloc_ranges)
+    {
+      m_alloc_ranges *= 2;
+      m_ranges
+	= (source_range *)xrealloc (m_ranges,
+				    sizeof (source_range) * m_alloc_ranges);
+    }
+  m_ranges[m_num_ranges++] = range;
+}
+
+/* Read NUM ranges from LOC_READER, adding them to the vector of source_range
+   information.  */
+
+void
+cpp_substring_ranges::add_n_ranges (int num,
+				    cpp_string_location_reader &loc_reader)
+{
+  for (int i = 0; i < num; i++)
+    add_range (loc_reader.get_next ());
+}
+
+\f
+
 /* Utility routine that computes a mask of the form 0000...111... with
    WIDTH 1-bits.  */
 static inline size_t
@@ -980,18 +1025,27 @@ ucn_valid_in_identifier (cpp_reader *pfile, cppchar_t c,
    one beyond the UCN, or to the syntactically invalid character.
 
    IDENTIFIER_POS is 0 when not in an identifier, 1 for the start of
-   an identifier, or 2 otherwise.  */
+   an identifier, or 2 otherwise.
+
+   If CHAR_RANGE and LOC_READER are non-NULL, then position information is
+   read from *LOC_READER and CHAR_RANGE->m_finish is updated accordingly.  */
 
 bool
 _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
 		const uchar *limit, int identifier_pos,
-		struct normalize_state *nst, cppchar_t *cp)
+		struct normalize_state *nst, cppchar_t *cp,
+		source_range *char_range,
+		cpp_string_location_reader *loc_reader)
 {
   cppchar_t result, c;
   unsigned int length;
   const uchar *str = *pstr;
   const uchar *base = str - 2;
 
+  /* char_range and loc_reader must either be both NULL, or both be
+     non-NULL.  */
+  gcc_assert ((char_range != NULL) == (loc_reader != NULL));
+
   if (!CPP_OPTION (pfile, cplusplus) && !CPP_OPTION (pfile, c99))
     cpp_error (pfile, CPP_DL_WARNING,
 	       "universal character names are only valid in C++ and C99");
@@ -1021,6 +1075,8 @@ _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
       if (!ISXDIGIT (c))
 	break;
       str++;
+      if (loc_reader)
+	char_range->m_finish = loc_reader->get_next ().m_finish;
       result = (result << 4) + hex_value (c);
     }
   while (--length && str < limit);
@@ -1086,11 +1142,18 @@ _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
 }
 
 /* Convert an UCN, pointed to by FROM, to UTF-8 encoding, then translate
-   it to the execution character set and write the result into TBUF.
-   An advanced pointer is returned.  Issues all relevant diagnostics.  */
+   it to the execution character set and write the result into TBUF,
+   if TBUF is non-NULL.
+   An advanced pointer is returned.  Issues all relevant diagnostics.
+   If LOC_READER is non-NULL, then RANGES must be non-NULL and CHAR_RANGE
+   contains the location of the character so far: location information
+   is read from *LOC_READER, and *RANGES is updated accordingly.  */
 static const uchar *
 convert_ucn (cpp_reader *pfile, const uchar *from, const uchar *limit,
-	     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
+	     struct _cpp_strbuf *tbuf, struct cset_converter cvt,
+	     source_range char_range,
+	     cpp_string_location_reader *loc_reader,
+	     cpp_substring_ranges *ranges)
 {
   cppchar_t ucn;
   uchar buf[6];
@@ -1099,8 +1162,17 @@ convert_ucn (cpp_reader *pfile, const uchar *from, const uchar *limit,
   int rval;
   struct normalize_state nst = INITIAL_NORMALIZE_STATE;
 
+  /* loc_reader and ranges must either be both NULL, or both be non-NULL.  */
+  gcc_assert ((loc_reader != NULL) == (ranges != NULL));
+
   from++;  /* Skip u/U.  */
-  _cpp_valid_ucn (pfile, &from, limit, 0, &nst, &ucn);
+
+  if (loc_reader)
+    /* The u/U is part of the spelling of this character.  */
+    char_range.m_finish = loc_reader->get_next ().m_finish;
+
+  _cpp_valid_ucn (pfile, &from, limit, 0, &nst,
+		  &ucn, &char_range, loc_reader);
 
   rval = one_cppchar_to_utf8 (ucn, &bufp, &bytesleft);
   if (rval)
@@ -1109,9 +1181,20 @@ convert_ucn (cpp_reader *pfile, const uchar *from, const uchar *limit,
       cpp_errno (pfile, CPP_DL_ERROR,
 		 "converting UCN to source character set");
     }
-  else if (!APPLY_CONVERSION (cvt, buf, 6 - bytesleft, tbuf))
-    cpp_errno (pfile, CPP_DL_ERROR,
-	       "converting UCN to execution character set");
+  else
+    {
+      if (tbuf)
+	if (!APPLY_CONVERSION (cvt, buf, 6 - bytesleft, tbuf))
+	  cpp_errno (pfile, CPP_DL_ERROR,
+		     "converting UCN to execution character set");
+
+      if (loc_reader)
+	{
+	  int num_encoded_bytes = 6 - bytesleft;
+	  for (int i = 0; i < num_encoded_bytes; i++)
+	    ranges->add_range (char_range);
+	}
+    }
 
   return from;
 }
@@ -1167,31 +1250,48 @@ emit_numeric_escape (cpp_reader *pfile, cppchar_t n,
 }
 
 /* Convert a hexadecimal escape, pointed to by FROM, to the execution
-   character set and write it into the string buffer TBUF.  Returns an
-   advanced pointer, and issues diagnostics as necessary.
+   character set and write it into the string buffer TBUF (if non-NULL).
+   Returns an advanced pointer, and issues diagnostics as necessary.
    No character set translation occurs; this routine always produces the
    execution-set character with numeric value equal to the given hex
-   number.  You can, e.g. generate surrogate pairs this way.  */
+   number.  You can, e.g. generate surrogate pairs this way.
+   If LOC_READER is non-NULL, then RANGES must be non-NULL and CHAR_RANGE
+   contains the location of the character so far: location information
+   is read from *LOC_READER, and *RANGES is updated accordingly.  */
 static const uchar *
 convert_hex (cpp_reader *pfile, const uchar *from, const uchar *limit,
-	     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
+	     struct _cpp_strbuf *tbuf, struct cset_converter cvt,
+	     source_range char_range,
+	     cpp_string_location_reader *loc_reader,
+	     cpp_substring_ranges *ranges)
 {
   cppchar_t c, n = 0, overflow = 0;
   int digits_found = 0;
   size_t width = cvt.width;
   size_t mask = width_to_mask (width);
 
+  /* loc_reader and ranges must either be both NULL, or both be non-NULL.  */
+  gcc_assert ((loc_reader != NULL) == (ranges != NULL));
+
   if (CPP_WTRADITIONAL (pfile))
     cpp_warning (pfile, CPP_W_TRADITIONAL,
 	         "the meaning of '\\x' is different in traditional C");
 
-  from++;  /* Skip 'x'.  */
+  /* Skip 'x'.  */
+  from++;
+
+  /* The 'x' is part of the spelling of this character.  */
+  if (loc_reader)
+    char_range.m_finish = loc_reader->get_next ().m_finish;
+
   while (from < limit)
     {
       c = *from;
       if (! hex_p (c))
 	break;
       from++;
+      if (loc_reader)
+	char_range.m_finish = loc_reader->get_next ().m_finish;
       overflow |= n ^ (n << 4 >> 4);
       n = (n << 4) + hex_value (c);
       digits_found = 1;
@@ -1211,7 +1311,10 @@ convert_hex (cpp_reader *pfile, const uchar *from, const uchar *limit,
       n &= mask;
     }
 
-  emit_numeric_escape (pfile, n, tbuf, cvt);
+  if (tbuf)
+    emit_numeric_escape (pfile, n, tbuf, cvt);
+  if (ranges)
+    ranges->add_range (char_range);
 
   return from;
 }
@@ -1221,10 +1324,16 @@ convert_hex (cpp_reader *pfile, const uchar *from, const uchar *limit,
    advanced pointer, and issues diagnostics as necessary.
    No character set translation occurs; this routine always produces the
    execution-set character with numeric value equal to the given octal
-   number.  */
+   number.
+   If LOC_READER is non-NULL, then RANGES must be non-NULL and CHAR_RANGE
+   contains the location of the character so far: location information
+   is read from *LOC_READER, and *RANGES is updated accordingly.  */
 static const uchar *
 convert_oct (cpp_reader *pfile, const uchar *from, const uchar *limit,
-	     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
+	     struct _cpp_strbuf *tbuf, struct cset_converter cvt,
+	     source_range char_range,
+	     cpp_string_location_reader *loc_reader,
+	     cpp_substring_ranges *ranges)
 {
   size_t count = 0;
   cppchar_t c, n = 0;
@@ -1232,12 +1341,17 @@ convert_oct (cpp_reader *pfile, const uchar *from, const uchar *limit,
   size_t mask = width_to_mask (width);
   bool overflow = false;
 
+  /* loc_reader and ranges must either be both NULL, or both be non-NULL.  */
+  gcc_assert ((loc_reader != NULL) == (ranges != NULL));
+
   while (from < limit && count++ < 3)
     {
       c = *from;
       if (c < '0' || c > '7')
 	break;
       from++;
+      if (loc_reader)
+	char_range.m_finish = loc_reader->get_next ().m_finish;
       overflow |= n ^ (n << 3 >> 3);
       n = (n << 3) + c - '0';
     }
@@ -1249,18 +1363,26 @@ convert_oct (cpp_reader *pfile, const uchar *from, const uchar *limit,
       n &= mask;
     }
 
-  emit_numeric_escape (pfile, n, tbuf, cvt);
+  if (tbuf)
+    emit_numeric_escape (pfile, n, tbuf, cvt);
+  if (ranges)
+    ranges->add_range (char_range);
 
   return from;
 }
 
 /* Convert an escape sequence (pointed to by FROM) to its value on
    the target, and to the execution character set.  Do not scan past
-   LIMIT.  Write the converted value into TBUF.  Returns an advanced
-   pointer.  Handles all relevant diagnostics.  */
+   LIMIT.  Write the converted value into TBUF, if TBUF is non-NULL.
+   Returns an advanced pointer.  Handles all relevant diagnostics.
+   If LOC_READER is non-NULL, then RANGES must be non-NULL: location
+   information is read from *LOC_READER, and *RANGES is updated
+   accordingly.  */
 static const uchar *
 convert_escape (cpp_reader *pfile, const uchar *from, const uchar *limit,
-		struct _cpp_strbuf *tbuf, struct cset_converter cvt)
+		struct _cpp_strbuf *tbuf, struct cset_converter cvt,
+		cpp_string_location_reader *loc_reader,
+		cpp_substring_ranges *ranges)
 {
   /* Values of \a \b \e \f \n \r \t \v respectively.  */
 #if HOST_CHARSET == HOST_CHARSET_ASCII
@@ -1273,20 +1395,28 @@ convert_escape (cpp_reader *pfile, const uchar *from, const uchar *limit,
 
   uchar c;
 
+  /* Record the location of the backslash.  */
+  source_range char_range;
+  if (loc_reader)
+    char_range = loc_reader->get_next ();
+
   c = *from;
   switch (c)
     {
       /* UCNs, hex escapes, and octal escapes are processed separately.  */
     case 'u': case 'U':
-      return convert_ucn (pfile, from, limit, tbuf, cvt);
+      return convert_ucn (pfile, from, limit, tbuf, cvt,
+			  char_range, loc_reader, ranges);
 
     case 'x':
-      return convert_hex (pfile, from, limit, tbuf, cvt);
+      return convert_hex (pfile, from, limit, tbuf, cvt,
+			  char_range, loc_reader, ranges);
       break;
 
     case '0':  case '1':  case '2':  case '3':
     case '4':  case '5':  case '6':  case '7':
-      return convert_oct (pfile, from, limit, tbuf, cvt);
+      return convert_oct (pfile, from, limit, tbuf, cvt,
+			  char_range, loc_reader, ranges);
 
       /* Various letter escapes.  Get the appropriate host-charset
 	 value into C.  */
@@ -1338,10 +1468,17 @@ convert_escape (cpp_reader *pfile, const uchar *from, const uchar *limit,
 	}
     }
 
-  /* Now convert what we have to the execution character set.  */
-  if (!APPLY_CONVERSION (cvt, &c, 1, tbuf))
-    cpp_errno (pfile, CPP_DL_ERROR,
-	       "converting escape sequence to execution character set");
+  if (tbuf)
+    /* Now convert what we have to the execution character set.  */
+    if (!APPLY_CONVERSION (cvt, &c, 1, tbuf))
+      cpp_errno (pfile, CPP_DL_ERROR,
+		 "converting escape sequence to execution character set");
+
+  if (loc_reader)
+    {
+      char_range.m_finish = loc_reader->get_next ().m_finish;
+      ranges->add_range (char_range);
+    }
 
   return from + 1;
 }
@@ -1374,28 +1511,52 @@ converter_for_type (cpp_reader *pfile, enum cpp_ttype type)
    are to be converted from the source to the execution character set,
    escape sequences translated, and finally all are to be
    concatenated.  WIDE indicates whether or not to produce a wide
-   string.  The result is written into TO.  Returns true for success,
-   false for failure.  */
-bool
-cpp_interpret_string (cpp_reader *pfile, const cpp_string *from, size_t count,
-		      cpp_string *to,  enum cpp_ttype type)
+   string.  If TO is non-NULL, the result is written into TO.
+   If LOC_READERS and OUT are non-NULL, then location information
+   is read from LOC_READERS (which must be an array of length COUNT),
+   and location information is written to *OUT.
+
+   Returns true for success, false for failure.  */
+
+static bool
+cpp_interpret_string_1 (cpp_reader *pfile, const cpp_string *from, size_t count,
+			cpp_string *to,  enum cpp_ttype type,
+			cpp_string_location_reader *loc_readers,
+			cpp_substring_ranges *out)
 {
   struct _cpp_strbuf tbuf;
   const uchar *p, *base, *limit;
   size_t i;
   struct cset_converter cvt = converter_for_type (pfile, type);
 
-  tbuf.asize = MAX (OUTBUF_BLOCK_SIZE, from->len);
-  tbuf.text = XNEWVEC (uchar, tbuf.asize);
-  tbuf.len = 0;
+  /* loc_readers and out must either be both NULL, or both be non-NULL.  */
+  gcc_assert ((loc_readers != NULL) == (out != NULL));
+
+  if (to)
+    {
+      tbuf.asize = MAX (OUTBUF_BLOCK_SIZE, from->len);
+      tbuf.text = XNEWVEC (uchar, tbuf.asize);
+      tbuf.len = 0;
+    }
 
   for (i = 0; i < count; i++)
     {
+      cpp_string_location_reader *loc_reader = NULL;
+      if (loc_readers)
+	loc_reader = &loc_readers[i];
+
       p = from[i].text;
       if (*p == 'u')
 	{
-	  if (*++p == '8')
-	    p++;
+	  p++;
+	  if (loc_reader)
+	    loc_reader->get_next ();
+	  if (*p == '8')
+	    {
+	      p++;
+	      if (loc_reader)
+		loc_reader->get_next ();
+	    }
 	}
       else if (*p == 'L' || *p == 'U') p++;
       if (*p == 'R')
@@ -1414,13 +1575,43 @@ cpp_interpret_string (cpp_reader *pfile, const cpp_string *from, size_t count,
 
 	  /* Raw strings are all normal characters; these can be fed
 	     directly to convert_cset.  */
-	  if (!APPLY_CONVERSION (cvt, p, limit - p, &tbuf))
-	    goto fail;
+	  if (to)
+	    if (!APPLY_CONVERSION (cvt, p, limit - p, &tbuf))
+	      goto fail;
+
+	  if (loc_reader)
+	    {
+	      /* If generating source ranges, assume we have a 1:1
+		 correspondence between bytes in the source encoding and bytes
+		 in the execution encoding (e.g. if we have a UTF-8 to UTF-8
+		 conversion), so that this run of bytes in the source file
+		 corresponds to a run of bytes in the execution string.
+		 This requirement is guaranteed by an early-reject in
+		 cpp_interpret_string_ranges.  */
+	      gcc_assert (cvt.func == convert_no_conversion);
+	      out->add_n_ranges (limit - p, *loc_reader);
+	    }
 
 	  continue;
 	}
 
-      p++; /* Skip leading quote.  */
+      /* If we don't now have a leading quote, something has gone wrong.
+	 This can occur if cpp_interpret_string_ranges is handling a
+	 stringified macro argument, but should not be possible otherwise.  */
+      if (*p != '"' && *p != '\'')
+	{
+	  gcc_assert (out != NULL);
+	  cpp_error (pfile, CPP_DL_ERROR, "missing open quote");
+	  if (to)
+	    free (tbuf.text);
+	  return false;
+	}
+
+      /* Skip leading quote.  */
+      p++;
+      if (loc_reader)
+	loc_reader->get_next ();
+
       limit = from[i].text + from[i].len - 1; /* Skip trailing quote.  */
 
       for (;;)
@@ -1432,29 +1623,130 @@ cpp_interpret_string (cpp_reader *pfile, const cpp_string *from, size_t count,
 	    {
 	      /* We have a run of normal characters; these can be fed
 		 directly to convert_cset.  */
-	      if (!APPLY_CONVERSION (cvt, base, p - base, &tbuf))
-		goto fail;
+	      if (to)
+		if (!APPLY_CONVERSION (cvt, base, p - base, &tbuf))
+		  goto fail;
+	      /* Similar to above: assumes we have a 1:1 correspondence
+		 between bytes in the source encoding and bytes in the
+		 execution encoding.  */
+	      if (loc_reader)
+		{
+		  gcc_assert (cvt.func == convert_no_conversion);
+		  out->add_n_ranges (p - base, *loc_reader);
+		}
 	    }
-	  if (p == limit)
+	  if (p >= limit)
 	    break;
 
-	  p = convert_escape (pfile, p + 1, limit, &tbuf, cvt);
+	  struct _cpp_strbuf *tbuf_ptr = to ? &tbuf : NULL;
+	  p = convert_escape (pfile, p + 1, limit, tbuf_ptr, cvt,
+			      loc_reader, out);
 	}
     }
-  /* NUL-terminate the 'to' buffer and translate it to a cpp_string
-     structure.  */
-  emit_numeric_escape (pfile, 0, &tbuf, cvt);
-  tbuf.text = XRESIZEVEC (uchar, tbuf.text, tbuf.len);
-  to->text = tbuf.text;
-  to->len = tbuf.len;
+
+  if (to)
+    {
+      /* NUL-terminate the 'to' buffer and translate it to a cpp_string
+	 structure.  */
+      emit_numeric_escape (pfile, 0, &tbuf, cvt);
+      tbuf.text = XRESIZEVEC (uchar, tbuf.text, tbuf.len);
+      to->text = tbuf.text;
+      to->len = tbuf.len;
+    }
+
   return true;
 
  fail:
   cpp_errno (pfile, CPP_DL_ERROR, "converting to execution character set");
-  free (tbuf.text);
+  if (to)
+    free (tbuf.text);
   return false;
 }
 
+/* FROM is an array of cpp_string structures of length COUNT.  These
+   are to be converted from the source to the execution character set,
+   escape sequences translated, and finally all are to be
+   concatenated.  TYPE indicates the kind of string literal being
+   interpreted.  The result is written into TO.  Returns true for success,
+   false for failure.  */
+bool
+cpp_interpret_string (cpp_reader *pfile, const cpp_string *from, size_t count,
+		      cpp_string *to,  enum cpp_ttype type)
+{
+  return cpp_interpret_string_1 (pfile, from, count, to, type, NULL, NULL);
+}
+
+/* A "do nothing" error-handling callback for use by
+   cpp_interpret_string_ranges, so that it can temporarily suppress
+   error-handling.  */
+
+static bool
+noop_error_cb (cpp_reader *, int, int, rich_location *,
+	       const char *, va_list *)
+{
+  /* no-op.  */
+  return true;
+}
+
+/* This function mimics the behavior of cpp_interpret_string, but
+   rather than generating a string in the execution character set,
+   *OUT is written to with the source code ranges of the characters
+   in such a string.
+   FROM and LOC_READERS should both be arrays of length COUNT.
+   Returns NULL for success, or an error message for failure.  */
+
+const char *
+cpp_interpret_string_ranges (cpp_reader *pfile, const cpp_string *from,
+			     cpp_string_location_reader *loc_readers,
+			     size_t count,
+			     cpp_substring_ranges *out,
+			     enum cpp_ttype type)
+{
+  /* There are a couple of cases in the range-handling in
+     cpp_interpret_string_1 that rely on there being a 1:1 correspondence
+     between bytes in the source encoding and bytes in the execution
+     encoding, so that each byte in the execution string can correspond
+     to the location of a byte in the source string.
+
+     This holds for the typical case of a UTF-8 to UTF-8 conversion.
+     Enforce this requirement by only attempting to track substring
+     locations if we have source encoding == execution encoding.
+
+     This is a stronger condition than we need, since we could e.g.
+     have ASCII to EBCDIC (with 1 byte per character before and after),
+     but it seems to be a reasonable restriction.  */
+  struct cset_converter cvt = converter_for_type (pfile, type);
+  if (cvt.func != convert_no_conversion)
+    return "execution character set != source character set";
+
+  /* For on-demand strings we have already lexed the strings, so there
+     should be no errors.  However, if we have bogus source location
+     data (or stringified macro arguments), the attempt to lex the
+     strings could fail with an error.  Temporarily install an
+     error-handler to catch the error, so that it can lead to this call
+     failing, rather than being emitted as a user-visible diagnostic.
+     If an error does occur, we should see it via the return value of
+     cpp_interpret_string_1.  */
+  bool (*saved_error_handler) (cpp_reader *, int, int, rich_location *,
+			       const char *, va_list *)
+    ATTRIBUTE_FPTR_PRINTF(5,0);
+
+  saved_error_handler = pfile->cb.error;
+  pfile->cb.error = noop_error_cb;
+
+  bool result = cpp_interpret_string_1 (pfile, from, count, NULL, type,
+					loc_readers, out);
+
+  /* Restore the saved error-handler.  */
+  pfile->cb.error = saved_error_handler;
+
+  if (!result)
+    return "cpp_interpret_string_1 failed";
+
+  /* Success.  */
+  return NULL;
+}
+
 /* Subroutine of do_line and do_linemarker.  Convert escape sequences
    in a string, but do not perform character set conversion.  */
 bool
@@ -1818,3 +2110,39 @@ _cpp_default_encoding (void)
 
   return current_encoding;
 }
+
+/* Implementation of class cpp_string_location_reader.  */
+
+/* Constructor for cpp_string_location_reader.  */
+
+cpp_string_location_reader::
+cpp_string_location_reader (source_location src_loc,
+			    line_maps *line_table)
+: m_line_table (line_table)
+{
+  src_loc = get_range_from_loc (line_table, src_loc).m_start;
+
+  /* SRC_LOC might be a macro location.  It only makes sense to do
+     column-by-column calculations on ordinary maps, so get the
+     corresponding location in an ordinary map.  */
+  m_loc
+    = linemap_resolve_location (line_table, src_loc,
+				LRK_SPELLING_LOCATION, NULL);
+
+  const line_map_ordinary *map
+    = linemap_check_ordinary (linemap_lookup (line_table, m_loc));
+  m_offset_per_column = (1 << map->m_range_bits);
+}
+
+/* Get the range of the next source byte.  */
+
+source_range
+cpp_string_location_reader::get_next ()
+{
+  source_range result;
+  result.m_start = m_loc;
+  result.m_finish = m_loc;
+  if (m_loc <= LINE_MAP_MAX_LOCATION_WITH_COLS)
+    m_loc += m_offset_per_column;
+  return result;
+}
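
To illustrate the bookkeeping in the charset.c changes above (one
source range recorded per byte of the interpreted string, with every
byte of an escape or UCN sharing the range of its whole spelling), here
is a rough standalone model.  It is only a sketch, not the libcpp code:
col_range and byte_ranges are made-up names, columns stand in for
source_location values, only unprefixed narrow strings are handled, and
source encoding == execution encoding == UTF-8 is assumed.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for source_range: inclusive columns.
struct col_range { int start; int finish; };

// Number of UTF-8 bytes needed for code point CP.
static int
utf8_len (unsigned long cp)
{
  return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// TOKEN is the spelling of a narrow string token, quotes included,
// whose opening quote sits at column QUOTE_COL.  Return one col_range
// per byte of the interpreted string (without the terminating NUL).
static std::vector<col_range>
byte_ranges (const std::string &token, int quote_col)
{
  assert (token.size () >= 2
	  && token.front () == '"' && token.back () == '"');

  std::vector<col_range> out;
  const std::size_t limit = token.size () - 1;  // closing quote
  std::size_t i = 1;                            // skip opening quote
  while (i < limit)
    {
      int col = quote_col + static_cast<int> (i);
      if (token[i] != '\\')
	{
	  // Ordinary byte: 1:1 with the source.
	  out.push_back ({col, col});
	  i++;
	  continue;
	}

      // An escape: the range starts at the backslash.
      col_range r = {col, col};
      std::size_t j = i + 1;  // character after the backslash
      int copies = 1;         // bytes emitted for this escape
      char c = token[j];
      if (c == 'u' || c == 'U')
	{
	  std::size_t ndigits = (c == 'u') ? 4 : 8;
	  unsigned long cp = 0;
	  j++;
	  for (std::size_t k = 0;
	       k < ndigits && j < limit
	       && std::isxdigit ((unsigned char) token[j]);
	       k++, j++)
	    cp = cp * 16 + std::stoul (token.substr (j, 1), nullptr, 16);
	  copies = utf8_len (cp);
	}
      else if (c == 'x')
	{
	  j++;
	  while (j < limit && std::isxdigit ((unsigned char) token[j]))
	    j++;
	}
      else if (c >= '0' && c <= '7')
	{
	  for (std::size_t k = 0;
	       k < 3 && j < limit && token[j] >= '0' && token[j] <= '7';
	       k++)
	    j++;
	}
      else
	j++;  // simple escape such as \n

      // The range finishes at the last character of the spelling.
      r.finish = quote_col + static_cast<int> (j) - 1;
      for (int n = 0; n < copies; n++)
	out.push_back (r);
      i = j;
    }
  return out;
}
```

For the token "01234\u2174\u2175789" with its opening quote at column 1,
this yields 14 ranges; bytes 5-7 all map to columns 7-12 (the first
UCN), so a request for bytes 4-11 covers the digit 4, both UCNs and the
digit 7.  The model does no error checking and ignores concatenation,
raw strings and prefixes, all of which the real code has to deal with.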
diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h
index 4e0084c..659686b 100644
--- a/libcpp/include/cpplib.h
+++ b/libcpp/include/cpplib.h
@@ -743,6 +743,51 @@ struct GTY(()) cpp_hashnode {
   union _cpp_hashnode_value GTY ((desc ("CPP_HASHNODE_VALUE_IDX (%1)"))) value;
 };
 
+/* A class for iterating through the source locations within a
+   string token (before escapes are interpreted, and before
+   concatenation).  */
+
+class cpp_string_location_reader {
+ public:
+  cpp_string_location_reader (source_location src_loc,
+			      line_maps *line_table);
+
+  source_range get_next ();
+
+ private:
+  source_location m_loc;
+  int m_offset_per_column;
+  line_maps *m_line_table;
+};
+
+/* A class for storing the source ranges of all of the characters within
+   a string literal, after escapes are interpreted, and after
+   concatenation.
+
+   This is not GTY-marked, as instances are intended to be temporary.  */
+
+class cpp_substring_ranges
+{
+ public:
+  cpp_substring_ranges ();
+  ~cpp_substring_ranges ();
+
+  int get_num_ranges () const { return m_num_ranges; }
+  source_range get_range (int idx) const
+  {
+    linemap_assert (idx < m_num_ranges);
+    return m_ranges[idx];
+  }
+
+  void add_range (source_range range);
+  void add_n_ranges (int num, cpp_string_location_reader &loc_reader);
+
+ private:
+  source_range *m_ranges;
+  int m_num_ranges;
+  int m_alloc_ranges;
+};
+
 /* Call this first to get a handle to pass to other functions.
 
    If you want cpplib to manage its own hashtable, pass in a NULL
@@ -829,6 +874,12 @@ extern cppchar_t cpp_interpret_charconst (cpp_reader *, const cpp_token *,
 extern bool cpp_interpret_string (cpp_reader *,
 				  const cpp_string *, size_t,
 				  cpp_string *, enum cpp_ttype);
+extern const char *cpp_interpret_string_ranges (cpp_reader *pfile,
+						const cpp_string *from,
+						cpp_string_location_reader *,
+						size_t count,
+						cpp_substring_ranges *out,
+						enum cpp_ttype type);
 extern bool cpp_interpret_string_notranslate (cpp_reader *,
 					      const cpp_string *, size_t,
 					      cpp_string *, enum cpp_ttype);
diff --git a/libcpp/internal.h b/libcpp/internal.h
index ca2b498..4a5cd3c 100644
--- a/libcpp/internal.h
+++ b/libcpp/internal.h
@@ -754,7 +754,9 @@ struct normalize_state
 extern bool _cpp_valid_ucn (cpp_reader *, const unsigned char **,
 			    const unsigned char *, int,
 			    struct normalize_state *state,
-			    cppchar_t *);
+			    cppchar_t *,
+			    source_range *char_range,
+			    cpp_string_location_reader *loc_reader);
 extern void _cpp_destroy_iconv (cpp_reader *);
 extern unsigned char *_cpp_convert_input (cpp_reader *, const char *,
 					  unsigned char *, size_t, size_t,
diff --git a/libcpp/lex.c b/libcpp/lex.c
index 236418d..4e71965 100644
--- a/libcpp/lex.c
+++ b/libcpp/lex.c
@@ -1247,7 +1247,7 @@ forms_identifier_p (cpp_reader *pfile, int first,
       cppchar_t s;
       buffer->cur += 2;
       if (_cpp_valid_ucn (pfile, &buffer->cur, buffer->rlimit, 1 + !first,
-			  state, &s))
+			  state, &s, NULL, NULL))
 	return true;
       buffer->cur -= 2;
     }
-- 
1.8.5.3


* [PATCH 4/4] c-format.c: suggest the correct format string to use (PR c/64955)
  2016-08-03 15:17           ` [PATCH 1/4] selftest.h: Add ASSERT_TRUE_AT and ASSERT_FALSE_AT David Malcolm
  2016-08-03 15:17             ` [PATCH 3/4] Use class substring_loc in c-format.c (PR c/52952) David Malcolm
  2016-08-03 15:17             ` [PATCH 2/4] (v3) On-demand locations within string-literals David Malcolm
@ 2016-08-03 15:17             ` David Malcolm
  2016-08-04 19:55               ` Jeff Law
  2016-08-03 16:06             ` [PATCH 1/4] selftest.h: Add ASSERT_TRUE_AT and ASSERT_FALSE_AT Jeff Law
  3 siblings, 1 reply; 61+ messages in thread
From: David Malcolm @ 2016-08-03 15:17 UTC (permalink / raw)
  To: gcc-patches; +Cc: Joseph Myers, David Malcolm

This adds fix-it hints to c-format.c so that it can (sometimes) suggest
the format string the user should have used.
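
As a rough illustration of the idea (this is not the code in the patch):
the hint is produced by running the checking tables "in reverse", looking
up the argument's type and emitting "%", the length modifier and the
conversion character of the first matching entry.  The enum, the table and
the name suggest_format below are hypothetical stand-ins for GCC's tree
types and format_kind_info data.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for GCC's type nodes; not the real tree API.  */
enum arg_kind { ARG_INT, ARG_LONG, ARG_DOUBLE, ARG_CHAR_PTR, ARG_OTHER };

struct format_entry
{
  enum arg_kind kind;   /* Argument type this entry accepts.  */
  const char *len_mod;  /* Length modifier, "" if none.  */
  char conv;            /* Conversion character.  */
};

/* A tiny printf-like table, searched in order; the first match wins.  */
static const struct format_entry printf_table[] = {
  { ARG_INT, "", 'd' },
  { ARG_LONG, "l", 'd' },
  { ARG_DOUBLE, "", 'f' },
  { ARG_CHAR_PTR, "", 's' },
};

/* Write the suggested directive for KIND into BUF (of size LEN).
   Return 1 on success, 0 if no table entry matches.  */
static int
suggest_format (enum arg_kind kind, char *buf, size_t len)
{
  for (size_t i = 0; i < sizeof printf_table / sizeof printf_table[0]; i++)
    if (printf_table[i].kind == kind)
      {
        snprintf (buf, len, "%%%s%c", printf_table[i].len_mod,
                  printf_table[i].conv);
        return 1;
      }
  return 0;
}
```

When nothing matches, no hint is offered, which is why the suggestion only
appears "sometimes".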

The patch adds selftests for the new code in c-format.c.  These
selftests are thus lang-specific.  This is the first time we've had
lang-specific selftests, and hence the patch also adds a langhook for
running them.  (Note that currently the Makefile only invokes the
selftests for cc1).
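
A minimal sketch of the hook arrangement, using made-up names (hooks_sketch,
run_tests_sketch and so on are assumptions for illustration, not the real
lang_hooks machinery): the generic runner always calls through a function
pointer, which defaults to a do-nothing function and which a frontend may
replace with its own test runner.

```c
#include <assert.h>

/* Hypothetical, cut-down hooks table with a single entry.  */
struct hooks_sketch
{
  void (*run_lang_selftests) (void);
};

/* Counts how many times the frontend's tests were run.  */
static int tests_run;

/* Default hook: do nothing.  */
static void
default_do_nothing (void)
{
}

/* What a frontend might install instead of the default.  */
static void
run_c_tests_sketch (void)
{
  tests_run++;
}

/* The generic runner calls the hook unconditionally, so it needs no
   knowledge of which frontend (if any) supplied tests.  */
static void
run_tests_sketch (const struct hooks_sketch *hooks)
{
  /* Language-independent tests would run here.  */
  hooks->run_lang_selftests ();
}
```

With the default in place, the call is harmless for frontends that have no
selftests of their own.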

Successfully bootstrapped&regrtested in conjunction with the rest of the
patch kit on x86_64-pc-linux-gnu.

(The v2 version of the patch had a successful selftest run for stage 1 on
powerpc-ibm-aix7.1.3.0 (gcc111) in conjunction with the rest of the patch
kit, and a successful build of stage1 for all targets via config-list.mk;
the patch has only been rebased since)

OK for trunk if it passes testing?

gcc/c-family/ChangeLog:
	PR c/64955
	* c-common.h (selftest::c_format_c_tests): New declaration.
	(selftest::run_c_tests): New declaration.
	* c-format.c: Include "selftest.h".
	(format_warning_va): Add param "corrected_substring" and use
	it to add a replacement fix-it hint.
	(format_warning_at_substring): Likewise.
	(format_warning_at_char): Update for new param of
	format_warning_va.
	(check_format_info_main): Pass "fki" to check_format_types.
	(check_format_types): Add param "fki" and pass it to
	format_type_warning.
	(deref_n_times): New function.
	(get_modifier_for_format_len): New function.
	(selftest::test_get_modifier_for_format_len): New function.
	(get_format_for_type): New function.
	(format_type_warning): Add param "fki" and use it to attempt
	to provide hints for argument types when calling
	format_warning_at_substring.
	(selftest::get_info): New function.
	(selftest::assert_format_for_type_streq): New function.
	(ASSERT_FORMAT_FOR_TYPE_STREQ): New macro.
	(selftest::test_get_format_for_type_printf): New function.
	(selftest::test_get_format_for_type_scanf): New function.
	(selftest::c_format_c_tests): New function.

gcc/c/ChangeLog:
	PR c/64955
	* c-lang.c (LANG_HOOKS_RUN_LANG_SELFTESTS): If CHECKING_P, wire
	this up to selftest::run_c_tests.
	(selftest::run_c_tests): New function.

gcc/ChangeLog:
	PR c/64955
	* langhooks-def.h (LANG_HOOKS_RUN_LANG_SELFTESTS): New default
	do-nothing langhook.
	(LANG_HOOKS_INITIALIZER): Add LANG_HOOKS_RUN_LANG_SELFTESTS.
	* langhooks.h (struct lang_hooks): Add run_lang_selftests.
	* selftest-run-tests.c: Include "tree.h" and "langhooks.h".
	(selftest::run_tests): Call lang_hooks.run_lang_selftests.

gcc/testsuite/ChangeLog:
	PR c/64955
	* gcc.dg/format/diagnostic-ranges.c: Add fix-it hints to expected
	output.
---
 gcc/c-family/c-common.h                         |   7 +
 gcc/c-family/c-format.c                         | 268 ++++++++++++++++++++++--
 gcc/c/c-lang.c                                  |  22 ++
 gcc/langhooks-def.h                             |   4 +-
 gcc/langhooks.h                                 |   3 +
 gcc/selftest-run-tests.c                        |   5 +
 gcc/testsuite/gcc.dg/format/diagnostic-ranges.c |  30 ++-
 7 files changed, 319 insertions(+), 20 deletions(-)

diff --git a/gcc/c-family/c-common.h b/gcc/c-family/c-common.h
index 7b5da57..61f9ced 100644
--- a/gcc/c-family/c-common.h
+++ b/gcc/c-family/c-common.h
@@ -1533,4 +1533,11 @@ extern bool valid_array_size_p (location_t, tree, tree);
 extern bool cilk_ignorable_spawn_rhs_op (tree);
 extern bool cilk_recognize_spawn (tree, tree *);
 
+#if CHECKING_P
+namespace selftest {
+  extern void c_format_c_tests (void);
+  extern void run_c_tests (void);
+} // namespace selftest
+#endif /* #if CHECKING_P */
+
 #endif /* ! GCC_C_COMMON_H */
diff --git a/gcc/c-family/c-format.c b/gcc/c-family/c-format.c
index 5b79588..f5a4011 100644
--- a/gcc/c-family/c-format.c
+++ b/gcc/c-family/c-format.c
@@ -30,6 +30,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "langhooks.h"
 #include "c-format.h"
 #include "diagnostic.h"
+#include "selftest.h"
 
 /* Handle attributes associated with format checking.  */
 
@@ -126,11 +127,21 @@ static int format_flags (int format_num);
      printf(fmt, msg);
             ^~~  ~~~
 
+   If CORRECTED_SUBSTRING is non-NULL, use it for cases 1 and 2 to provide
+   a fix-it hint, suggesting that it should replace the text within the
+   substring range.  For example:
+
+     test.c:90:10: warning: problem with '%i' here [-Wformat=]
+     printf ("hello %i", msg);
+                    ~^
+                    %s
+
    Return true if a warning was emitted, false otherwise.  */
 
-ATTRIBUTE_GCC_DIAG (4,0)
+ATTRIBUTE_GCC_DIAG (5,0)
 static bool
 format_warning_va (const substring_loc &fmt_loc, source_range *param_range,
+		   const char *corrected_substring,
 		   int opt, const char *gmsgid, va_list *ap)
 {
   bool substring_within_range = false;
@@ -174,6 +185,9 @@ format_warning_va (const substring_loc &fmt_loc, source_range *param_range,
       richloc.add_range (param_loc, false);
     }
 
+  if (!err && corrected_substring && substring_within_range)
+    richloc.add_fixit_replace (fmt_substring_range, corrected_substring);
+
   diagnostic_info diagnostic;
   diagnostic_set_info (&diagnostic, gmsgid, ap, &richloc, DK_WARNING);
   diagnostic.option_index = opt;
@@ -182,22 +196,31 @@ format_warning_va (const substring_loc &fmt_loc, source_range *param_range,
   if (!err && substring_loc && !substring_within_range)
     /* Case 2.  */
     if (warned)
-      inform (substring_loc, "format string is defined here");
+      {
+	rich_location substring_richloc (line_table, substring_loc);
+	if (corrected_substring)
+	  substring_richloc.add_fixit_replace (fmt_substring_range,
+					       corrected_substring);
+	inform_at_rich_loc (&substring_richloc,
+			    "format string is defined here");
+      }
 
   return warned;
 }
 
 /* Variadic call to format_warning_va.  */
 
-ATTRIBUTE_GCC_DIAG (4,0)
+ATTRIBUTE_GCC_DIAG (5,0)
 static bool
 format_warning_at_substring (const substring_loc &fmt_loc,
 			     source_range *param_range,
+			     const char *corrected_substring,
 			     int opt, const char *gmsgid, ...)
 {
   va_list ap;
   va_start (ap, gmsgid);
-  bool warned = format_warning_va (fmt_loc, param_range, opt, gmsgid, &ap);
+  bool warned = format_warning_va (fmt_loc, param_range, corrected_substring,
+				   opt, gmsgid, &ap);
   va_end (ap);
 
   return warned;
@@ -225,7 +248,7 @@ format_warning_at_char (location_t fmt_string_loc, tree format_string_cst,
   char_idx -= 1;
 
   substring_loc fmt_loc (fmt_string_loc, string_type, char_idx, char_idx);
-  bool warned = format_warning_va (fmt_loc, NULL, opt, gmsgid, &ap);
+  bool warned = format_warning_va (fmt_loc, NULL, NULL, opt, gmsgid, &ap);
   va_end (ap);
 
   return warned;
@@ -1126,11 +1149,13 @@ static const format_flag_spec *get_flag_spec (const format_flag_spec *,
 					      int, const char *);
 
 static void check_format_types (const substring_loc &fmt_loc,
-				format_wanted_type *);
+				format_wanted_type *,
+				const format_kind_info *fki);
 static void format_type_warning (const substring_loc &fmt_loc,
 				 source_range *param_range,
 				 format_wanted_type *, tree,
-				 tree);
+				 tree,
+				 const format_kind_info *fki);
 
 /* Decode a format type from a string, returning the type, or
    format_type_error if not valid, in which case the caller should print an
@@ -2534,7 +2559,7 @@ check_format_info_main (format_check_results *res,
 	  ptrdiff_t offset_to_format_end = (format_chars - 1) - orig_format_chars;
 	  substring_loc fmt_loc (fmt_param_loc, TREE_TYPE (format_string_cst),
 				 offset_to_format_start, offset_to_format_end);
-	  check_format_types (fmt_loc, first_wanted_type);
+	  check_format_types (fmt_loc, first_wanted_type, fki);
 	}
     }
 
@@ -2558,7 +2583,7 @@ check_format_info_main (format_check_results *res,
    location of the format conversion.  */
 static void
 check_format_types (const substring_loc &fmt_loc,
-		    format_wanted_type *types)
+		    format_wanted_type *types, const format_kind_info *fki)
 {
   for (; types != 0; types = types->next)
     {
@@ -2585,7 +2610,7 @@ check_format_types (const substring_loc &fmt_loc,
       cur_param = types->param;
       if (!cur_param)
         {
-          format_type_warning (fmt_loc, NULL, types, wanted_type, NULL);
+	  format_type_warning (fmt_loc, NULL, types, wanted_type, NULL, fki);
           continue;
         }
 
@@ -2670,7 +2695,7 @@ check_format_types (const substring_loc &fmt_loc,
 	  else
 	    {
 	      format_type_warning (fmt_loc, param_range_ptr,
-				   types, wanted_type, orig_cur_type);
+				   types, wanted_type, orig_cur_type, fki);
 	      break;
 	    }
 	}
@@ -2739,10 +2764,115 @@ check_format_types (const substring_loc &fmt_loc,
 	continue;
       /* Now we have a type mismatch.  */
       format_type_warning (fmt_loc, param_range_ptr, types,
-			   wanted_type, orig_cur_type);
+			   wanted_type, orig_cur_type, fki);
+    }
+}
+
+/* Given type TYPE, attempt to dereference the type N times
+   (e.g. from ("int ***", 2) to "int *").
+
+   Return the dereferenced type, with any qualifiers
+   such as "const" stripped from the result, or
+   NULL if unsuccessful (e.g. TYPE is not a pointer type).  */
+
+static tree
+deref_n_times (tree type, int n)
+{
+  gcc_assert (type);
+
+  for (int i = n; i > 0; i--)
+    {
+      if (TREE_CODE (type) != POINTER_TYPE)
+	return NULL_TREE;
+      type = TREE_TYPE (type);
     }
+  /* Strip off any "const" etc.  */
+  return build_qualified_type (type, 0);
 }
 
+/* Look up the format code for FORMAT_LEN within FLI,
+   returning the string code for expressing it, or NULL
+   if it is not found.  */
+
+static const char *
+get_modifier_for_format_len (const format_length_info *fli,
+			     enum format_lengths format_len)
+{
+  for (; fli->name; fli++)
+    {
+      if (fli->index == format_len)
+	return fli->name;
+      if (fli->double_index == format_len)
+	return fli->double_name;
+    }
+  return NULL;
+}
+
+#if CHECKING_P
+
+namespace selftest {
+
+static void
+test_get_modifier_for_format_len ()
+{
+  ASSERT_STREQ ("h",
+		get_modifier_for_format_len (printf_length_specs, FMT_LEN_h));
+  ASSERT_STREQ ("hh",
+		get_modifier_for_format_len (printf_length_specs, FMT_LEN_hh));
+  ASSERT_STREQ ("L",
+		get_modifier_for_format_len (printf_length_specs, FMT_LEN_L));
+  ASSERT_EQ (NULL,
+	     get_modifier_for_format_len (printf_length_specs, FMT_LEN_none));
+}
+
+} // namespace selftest
+
+#endif /* CHECKING_P */
+
+/* Generate a string containing the format string that should be
+   used to format arguments of type ARG_TYPE within FKI (effectively
+   the inverse of the checking code).
+
+   If successful, returns a non-NULL string which should be freed
+   by the caller.
+   Otherwise, returns NULL.  */
+
+static char *
+get_format_for_type (const format_kind_info *fki, tree arg_type)
+{
+  gcc_assert (arg_type);
+
+  const format_char_info *spec;
+  for (spec = &fki->conversion_specs[0];
+       spec->format_chars;
+       spec++)
+    {
+      tree effective_arg_type = deref_n_times (arg_type,
+					       spec->pointer_count);
+      if (!effective_arg_type)
+	continue;
+      for (int i = 0; i < FMT_LEN_MAX; i++)
+	{
+	  const format_type_detail *ftd = &spec->types[i];
+	  if (!ftd->type)
+	    continue;
+	  if (TYPE_CANONICAL (*ftd->type)
+	      == TYPE_CANONICAL (effective_arg_type))
+	    {
+	      const char *len_modifier
+		= get_modifier_for_format_len (fki->length_char_specs,
+					       (enum format_lengths)i);
+	      if (!len_modifier)
+		len_modifier = "";
+
+	      return xasprintf ("%%%s%c",
+				len_modifier,
+				spec->format_chars[0]);
+	    }
+	}
+   }
+  return NULL;
+}
 
 /* Give a warning at FMT_LOC about a format argument of different type
    from that expected.  If non-NULL, PARAM_RANGE is the source range of the
@@ -2756,9 +2886,10 @@ static void
 format_type_warning (const substring_loc &fmt_loc,
 		     source_range *param_range,
 		     format_wanted_type *type,
-		     tree wanted_type, tree arg_type)
+		     tree wanted_type, tree arg_type,
+		     const format_kind_info *fki)
 {
-  int kind = type->kind;
+  enum format_specifier_kind kind = type->kind;
   const char *wanted_type_name = type->wanted_type_name;
   const char *format_start = type->format_start;
   int format_length = type->format_length;
@@ -2797,12 +2928,18 @@ format_type_warning (const substring_loc &fmt_loc,
       p[pointer_count + 1] = 0;
     }
 
+  /* Attempt to provide hints for argument types, but not for field widths
+     and precisions.  */
+  char *format_for_type = NULL;
+  if (arg_type && kind == CF_KIND_FORMAT)
+    format_for_type = get_format_for_type (fki, arg_type);
+
   if (wanted_type_name)
     {
       if (arg_type)
 	format_warning_at_substring
 	  (fmt_loc, param_range,
-	   OPT_Wformat_,
+	   format_for_type, OPT_Wformat_,
 	   "%s %<%s%.*s%> expects argument of type %<%s%s%>, "
 	   "but argument %d has type %qT",
 	   gettext (kind_descriptions[kind]),
@@ -2812,7 +2949,7 @@ format_type_warning (const substring_loc &fmt_loc,
       else
 	format_warning_at_substring
 	  (fmt_loc, param_range,
-	   OPT_Wformat_,
+	   format_for_type, OPT_Wformat_,
 	   "%s %<%s%.*s%> expects a matching %<%s%s%> argument",
 	   gettext (kind_descriptions[kind]),
 	   (kind == CF_KIND_FORMAT ? "%" : ""),
@@ -2823,7 +2960,7 @@ format_type_warning (const substring_loc &fmt_loc,
       if (arg_type)
 	format_warning_at_substring
 	  (fmt_loc, param_range,
-	   OPT_Wformat_,
+	   format_for_type, OPT_Wformat_,
 	   "%s %<%s%.*s%> expects argument of type %<%T%s%>, "
 	   "but argument %d has type %qT",
 	   gettext (kind_descriptions[kind]),
@@ -2833,12 +2970,14 @@ format_type_warning (const substring_loc &fmt_loc,
       else
 	format_warning_at_substring
 	  (fmt_loc, param_range,
-	   OPT_Wformat_,
+	   format_for_type, OPT_Wformat_,
 	   "%s %<%s%.*s%> expects a matching %<%T%s%> argument",
 	   gettext (kind_descriptions[kind]),
 	   (kind == CF_KIND_FORMAT ? "%" : ""),
 	   format_length, format_start, wanted_type, p);
     }
+
+  free (format_for_type);
 }
 
 
@@ -3359,3 +3498,96 @@ handle_format_attribute (tree *node, tree ARG_UNUSED (name), tree args,
 
   return NULL_TREE;
 }
+
+#if CHECKING_P
+
+namespace selftest {
+
+/* Selftests of location handling.  */
+
+/* Get the format_kind_info with the given name.  */
+
+static const format_kind_info *
+get_info (const char *name)
+{
+  int idx = decode_format_type (name);
+  const format_kind_info *fki = &format_types[idx];
+  ASSERT_STREQ (fki->name, name);
+  return fki;
+}
+
+/* Verify that get_format_for_type (FKI, TYPE) is EXPECTED_FORMAT.  */
+
+static void
+assert_format_for_type_streq (const location &loc, const format_kind_info *fki,
+			      const char *expected_format, tree type)
+{
+  gcc_assert (fki);
+  gcc_assert (expected_format);
+  gcc_assert (type);
+
+  char *actual_format = get_format_for_type (fki, type);
+  ASSERT_STREQ_AT (loc, expected_format, actual_format);
+  free (actual_format);
+}
+
+/* Selftests for get_format_for_type.  */
+
+#define ASSERT_FORMAT_FOR_TYPE_STREQ(EXPECTED_FORMAT, TYPE) \
+  assert_format_for_type_streq (SELFTEST_LOCATION, (fki), (EXPECTED_FORMAT), (TYPE))
+
+/* Selftest for get_format_for_type for "printf"-style functions.  */
+
+static void
+test_get_format_for_type_printf ()
+{
+  const format_kind_info *fki = get_info ("gnu_printf");
+  ASSERT_NE (fki, NULL);
+
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%f", double_type_node);
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%Lf", long_double_type_node);
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%d", integer_type_node);
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%o", unsigned_type_node);
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%ld", long_integer_type_node);
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%lo", long_unsigned_type_node);
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%lld", long_long_integer_type_node);
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%llo", long_long_unsigned_type_node);
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%s", build_pointer_type (char_type_node));
+}
+
+/* Selftest for get_format_for_type for "scanf"-style functions.  */
+
+static void
+test_get_format_for_type_scanf ()
+{
+  const format_kind_info *fki = get_info ("gnu_scanf");
+  ASSERT_NE (fki, NULL);
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%d", build_pointer_type (integer_type_node));
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%u", build_pointer_type (unsigned_type_node));
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%ld",
+				build_pointer_type (long_integer_type_node));
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%lu",
+				build_pointer_type (long_unsigned_type_node));
+  ASSERT_FORMAT_FOR_TYPE_STREQ
+    ("%lld", build_pointer_type (long_long_integer_type_node));
+  ASSERT_FORMAT_FOR_TYPE_STREQ
+    ("%llu", build_pointer_type (long_long_unsigned_type_node));
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%e", build_pointer_type (float_type_node));
+  ASSERT_FORMAT_FOR_TYPE_STREQ ("%le", build_pointer_type (double_type_node));
+}
+
+#undef ASSERT_FORMAT_FOR_TYPE_STREQ
+
+/* Run all of the selftests within this file.  */
+
+void
+c_format_c_tests ()
+{
+  test_get_modifier_for_format_len ();
+  test_get_format_for_type_printf ();
+  test_get_format_for_type_scanf ();
+}
+
+} // namespace selftest
+
+#endif /* CHECKING_P */
diff --git a/gcc/c/c-lang.c b/gcc/c/c-lang.c
index 89954b7..b26be6a 100644
--- a/gcc/c/c-lang.c
+++ b/gcc/c/c-lang.c
@@ -38,7 +38,29 @@ enum c_language_kind c_language = clk_c;
 #undef LANG_HOOKS_INIT_TS
 #define LANG_HOOKS_INIT_TS c_common_init_ts
 
+#if CHECKING_P
+#undef LANG_HOOKS_RUN_LANG_SELFTESTS
+#define LANG_HOOKS_RUN_LANG_SELFTESTS selftest::run_c_tests
+#endif /* #if CHECKING_P */
+
 /* Each front end provides its own lang hook initializer.  */
 struct lang_hooks lang_hooks = LANG_HOOKS_INITIALIZER;
 
+#if CHECKING_P
+
+namespace selftest {
+
+/* Implementation of LANG_HOOKS_RUN_LANG_SELFTESTS for the C frontend.  */
+
+void
+run_c_tests (void)
+{
+  c_format_c_tests ();
+}
+
+} // namespace selftest
+
+#endif /* #if CHECKING_P */
+
+
 #include "gtype-c.h"
diff --git a/gcc/langhooks-def.h b/gcc/langhooks-def.h
index 034b3b7..c17f998 100644
--- a/gcc/langhooks-def.h
+++ b/gcc/langhooks-def.h
@@ -120,6 +120,7 @@ extern bool lhd_omp_mappable_type (tree);
 #define LANG_HOOKS_BLOCK_MAY_FALLTHRU	hook_bool_const_tree_true
 #define LANG_HOOKS_EH_USE_CXA_END_CLEANUP	false
 #define LANG_HOOKS_DEEP_UNSHARING	false
+#define LANG_HOOKS_RUN_LANG_SELFTESTS   lhd_do_nothing
 
 /* Attribute hooks.  */
 #define LANG_HOOKS_ATTRIBUTE_TABLE		NULL
@@ -319,7 +320,8 @@ extern void lhd_end_section (void);
   LANG_HOOKS_EH_PROTECT_CLEANUP_ACTIONS, \
   LANG_HOOKS_BLOCK_MAY_FALLTHRU, \
   LANG_HOOKS_EH_USE_CXA_END_CLEANUP, \
-  LANG_HOOKS_DEEP_UNSHARING \
+  LANG_HOOKS_DEEP_UNSHARING, \
+  LANG_HOOKS_RUN_LANG_SELFTESTS \
 }
 
 #endif /* GCC_LANG_HOOKS_DEF_H */
diff --git a/gcc/langhooks.h b/gcc/langhooks.h
index 0593424..169a678 100644
--- a/gcc/langhooks.h
+++ b/gcc/langhooks.h
@@ -505,6 +505,9 @@ struct lang_hooks
      gimplification.  */
   bool deep_unsharing;
 
+  /* Run all lang-specific selftests.  */
+  void (*run_lang_selftests) (void);
+
   /* Whenever you add entries here, make sure you adjust langhooks-def.h
      and langhooks.c accordingly.  */
 };
diff --git a/gcc/selftest-run-tests.c b/gcc/selftest-run-tests.c
index 85e101d..9d75a8e 100644
--- a/gcc/selftest-run-tests.c
+++ b/gcc/selftest-run-tests.c
@@ -21,6 +21,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "system.h"
 #include "coretypes.h"
 #include "selftest.h"
+#include "tree.h"
+#include "langhooks.h"
 
 /* This function needed to be split out from selftest.c as it references
    tests from the whole source tree, and so is within
@@ -70,6 +72,9 @@ selftest::run_tests ()
   /* This one relies on most of the above.  */
   function_tests_c_tests ();
 
+  /* Run any lang-specific selftests.  */
+  lang_hooks.run_lang_selftests ();
+
   /* Finished running tests.  */
   long finish_time = get_run_time ();
   long elapsed_time = finish_time - start_time;
diff --git a/gcc/testsuite/gcc.dg/format/diagnostic-ranges.c b/gcc/testsuite/gcc.dg/format/diagnostic-ranges.c
index 9e86b52..ff51833 100644
--- a/gcc/testsuite/gcc.dg/format/diagnostic-ranges.c
+++ b/gcc/testsuite/gcc.dg/format/diagnostic-ranges.c
@@ -12,6 +12,25 @@ void test_mismatching_types (const char *msg)
 /* { dg-begin-multiline-output "" }
    printf("hello %i", msg);
                  ~^
+                 %s
+   { dg-end-multiline-output "" } */
+
+
+  printf("hello %s", 42);  /* { dg-warning "format '%s' expects argument of type 'char \\*', but argument 2 has type 'int'" } */
+/* TODO: ideally would also underline "42".  */
+/* { dg-begin-multiline-output "" }
+   printf("hello %s", 42);
+                 ~^
+                 %d
+   { dg-end-multiline-output "" } */
+
+
+  printf("hello %i", (long)0);  /* { dg-warning "format '%i' expects argument of type 'int', but argument 2 has type 'long int' " } */
+/* TODO: ideally would also underline the argument.  */
+/* { dg-begin-multiline-output "" }
+   printf("hello %i", (long)0);
+                 ~^
+                 %ld
    { dg-end-multiline-output "" } */
 }
 
@@ -23,6 +42,7 @@ void test_multiple_arguments (void)
 /* { dg-begin-multiline-output "" }
    printf ("arg0: %i  arg1: %s arg 2: %i",
                             ~^
+                            %d
    { dg-end-multiline-output "" } */
 }
 
@@ -33,6 +53,7 @@ void test_multiple_arguments_2 (int i, int j)
 /* { dg-begin-multiline-output "" }
    printf ("arg0: %i  arg1: %s arg 2: %i",
                             ~^
+                            %d
            100, i + j, 102);
                 ~~~~~         
    { dg-end-multiline-output "" } */
@@ -67,6 +88,7 @@ void test_hex (const char *msg)
 /* { dg-begin-multiline-output "" }
    printf("hello \x25\x69", msg);
                  ~~~~~~~^
+                 %s
    { dg-end-multiline-output "" } */
 }
 
@@ -80,6 +102,7 @@ void test_oct (const char *msg)
 /* { dg-begin-multiline-output "" }
    printf("hello \045\151", msg);
                  ~~~~~~~^
+                 %s
    { dg-end-multiline-output "" } */
 }
 
@@ -98,6 +121,7 @@ void test_multiple (const char *msg)
 /* { dg-begin-multiline-output "" }
    printf("prefix"  "\x25"  "\151"  "suffix",
                      ~~~~~~~~~~~^
+                     %s
   { dg-end-multiline-output "" } */
 }
 
@@ -108,6 +132,7 @@ void test_u8 (const char *msg)
 /* { dg-begin-multiline-output "" }
    printf(u8"hello %i", msg);
                    ~^
+                   %s
    { dg-end-multiline-output "" } */
 }
 
@@ -117,6 +142,7 @@ void test_param (long long_i, long long_j)
 /* { dg-begin-multiline-output "" }
    printf ("foo %s bar", long_i + long_j);
                 ~^       ~~~~~~~~~~~~~~~
+                %ld
    { dg-end-multiline-output "" } */
 }
 
@@ -192,13 +218,14 @@ void test_macro (const char *msg)
 /* { dg-begin-multiline-output "" }
  #define INT_FMT "%i"
                   ~^
+                  %s
    { dg-end-multiline-output "" } */
 }
 
 void test_non_contiguous_strings (void)
 {
   __builtin_printf(" %" "d ", 0.5); /* { dg-warning "20: format .%d. expects argument of type .int., but argument 2 has type .double." } */
-                                    /* { dg-message "26: format string is defined here" "" { target *-*-* } 200 } */
+                                    /* { dg-message "26: format string is defined here" "" { target *-*-* } 227 } */
   /* { dg-begin-multiline-output "" }
    __builtin_printf(" %" "d ", 0.5);
                     ^~~~
@@ -206,6 +233,7 @@ void test_non_contiguous_strings (void)
   /* { dg-begin-multiline-output "" }
    __builtin_printf(" %" "d ", 0.5);
                       ~~~~^
+                      %f
    { dg-end-multiline-output "" } */
 }
 
-- 
1.8.5.3


* [PATCH 1/4] selftest.h: Add ASSERT_TRUE_AT and ASSERT_FALSE_AT
  2016-07-30  1:16         ` David Malcolm
@ 2016-08-03 15:17           ` David Malcolm
  2016-08-03 15:17             ` [PATCH 3/4] Use class substring_loc in c-format.c (PR c/52952) David Malcolm
                               ` (3 more replies)
  0 siblings, 4 replies; 61+ messages in thread
From: David Malcolm @ 2016-08-03 15:17 UTC (permalink / raw)
  To: gcc-patches; +Cc: Joseph Myers, David Malcolm

I split out the selftest.h changes from v2 of the kit for ease of review;
here they are.
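
To illustrate why the _AT variants are useful, here is a small sketch with
hypothetical names (loc_sketch, CHECK_TRUE_AT_SKETCH etc. are not the
selftest.h API): when an assertion lives in a helper shared by many tests,
passing the location in lets a failure be reported at the helper's call
site rather than inside the helper.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical location record, standing in for selftest::location.  */
struct loc_sketch { const char *file; int line; };

#define LOC_HERE_SKETCH ((struct loc_sketch){ __FILE__, __LINE__ })

/* Line of the most recent failure, or 0 if there was none.  */
static int last_fail_line;

static void
report_sketch (struct loc_sketch loc, int ok, const char *desc)
{
  if (!ok)
    {
      last_fail_line = loc.line;
      fprintf (stderr, "%s:%d: FAIL: %s\n", loc.file, loc.line, desc);
    }
}

/* Check EXPR, treating LOC as the effective location of the check.  */
#define CHECK_TRUE_AT_SKETCH(LOC, EXPR) \
  report_sketch ((LOC), (EXPR) ? 1 : 0, "CHECK_TRUE (" #EXPR ")")

/* The plain form is implemented in terms of the _AT form.  */
#define CHECK_TRUE_SKETCH(EXPR) \
  CHECK_TRUE_AT_SKETCH (LOC_HERE_SKETCH, EXPR)

/* A helper shared by many tests; failures are attributed to LOC,
   i.e. to whoever called the helper.  */
static void
check_positive_sketch (struct loc_sketch loc, int value)
{
  CHECK_TRUE_AT_SKETCH (loc, value > 0);
}
```

A caller would typically pass LOC_HERE_SKETCH, so that the failure message
names the line of the individual test case.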

Successfully bootstrapped&regrtested in conjunction with the rest of the
patch kit on x86_64-pc-linux-gnu.

OK for trunk?

gcc/ChangeLog:
	* selftest.h (ASSERT_TRUE): Reimplement in terms of...
	(ASSERT_TRUE_AT): New macro.
	(ASSERT_FALSE): Reimplement in terms of...
	(ASSERT_FALSE_AT): New macro.
	(ASSERT_STREQ_AT): Fix typo in comment.
---
 gcc/selftest.h | 30 +++++++++++++++++++++---------
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/gcc/selftest.h b/gcc/selftest.h
index 0bee476..397e998 100644
--- a/gcc/selftest.h
+++ b/gcc/selftest.h
@@ -104,13 +104,19 @@ extern int num_passes;
    ::selftest::fail if it false.  */
 
 #define ASSERT_TRUE(EXPR)				\
+  ASSERT_TRUE_AT (SELFTEST_LOCATION, (EXPR))
+
+/* Like ASSERT_TRUE, but treat LOC as the effective location of the
+   selftest.  */
+
+#define ASSERT_TRUE_AT(LOC, EXPR)			\
   SELFTEST_BEGIN_STMT					\
   const char *desc = "ASSERT_TRUE (" #EXPR ")";		\
   bool actual = (EXPR);					\
   if (actual)						\
-    ::selftest::pass (SELFTEST_LOCATION, desc);	\
+    ::selftest::pass ((LOC), desc);			\
   else							\
-    ::selftest::fail (SELFTEST_LOCATION, desc);		\
+    ::selftest::fail ((LOC), desc);			\
   SELFTEST_END_STMT
 
 /* Evaluate EXPR and coerce to bool, calling
@@ -118,13 +124,19 @@ extern int num_passes;
    ::selftest::fail if it true.  */
 
 #define ASSERT_FALSE(EXPR)					\
+  ASSERT_FALSE_AT (SELFTEST_LOCATION, (EXPR))
+
+/* Like ASSERT_FALSE, but treat LOC as the effective location of the
+   selftest.  */
+
+#define ASSERT_FALSE_AT(LOC, EXPR)				\
   SELFTEST_BEGIN_STMT						\
-  const char *desc = "ASSERT_FALSE (" #EXPR ")";		\
-  bool actual = (EXPR);					\
-  if (actual)							\
-    ::selftest::fail (SELFTEST_LOCATION, desc);				\
-  else								\
-    ::selftest::pass (SELFTEST_LOCATION, desc);				\
+  const char *desc = "ASSERT_FALSE (" #EXPR ")";			\
+  bool actual = (EXPR);							\
+  if (actual)								\
+    ::selftest::fail ((LOC), desc);			\
+  else									\
+    ::selftest::pass ((LOC), desc);					\
   SELFTEST_END_STMT
 
 /* Evaluate EXPECTED and ACTUAL and compare them with ==, calling
@@ -169,7 +181,7 @@ extern int num_passes;
 			    (EXPECTED), (ACTUAL));		    \
   SELFTEST_END_STMT
 
-/* Like ASSERT_STREQ_AT, but treat LOC as the effective location of the
+/* Like ASSERT_STREQ, but treat LOC as the effective location of the
    selftest.  */
 
 #define ASSERT_STREQ_AT(LOC, EXPECTED, ACTUAL)			    \
-- 
1.8.5.3


* [PATCH 3/4] Use class substring_loc in c-format.c (PR c/52952)
  2016-08-03 15:17           ` [PATCH 1/4] selftest.h: Add ASSERT_TRUE_AT and ASSERT_FALSE_AT David Malcolm
@ 2016-08-03 15:17             ` David Malcolm
  2016-08-04 18:09               ` Jeff Law
  2016-08-03 15:17             ` [PATCH 2/4] (v3) On-demand locations within string-literals David Malcolm
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 61+ messages in thread
From: David Malcolm @ 2016-08-03 15:17 UTC (permalink / raw)
  To: gcc-patches; +Cc: Joseph Myers, David Malcolm

This patch updates c-format.c to use the new class substring_loc, added
in the previous patch, replacing location_column_from_byte_offset.
Hence with this patch, -Wformat can underline the precise erroneous
part of the format string in many more cases.

The patch also introduces two new functions for emitting -Wformat
warnings: format_warning_at_substring and format_warning_at_char.
In the face of macros, where the pertinent part of the format string
may be separate from the function call, they emit an additional
"format string is defined here" note via inform.

Successfully bootstrapped&regrtested in conjunction with the rest of the
patch kit on x86_64-pc-linux-gnu.

(The v2 version of the patch had a successful selftest run for stage 1 on
powerpc-ibm-aix7.1.3.0 (gcc111) in conjunction with the rest of the patch
kit, and a successful build of stage1 for all targets via config-list.mk;
the patch has only been rebased since)

OK for trunk if it passes individual testing? (on top of patches 1-2)

gcc/c-family/ChangeLog:
	PR c/52952
	* c-format.c: Include "diagnostic.h".
	(location_column_from_byte_offset): Delete.
	(location_from_offset): Delete.
	(format_warning_va): New function.
	(format_warning_at_substring): New function.
	(format_warning_at_char): New function.
	(check_format_arg): Capture location of format_tree and pass to
	check_format_info_main.
	(check_format_info_main): Add params FMT_PARAM_LOC and
	FORMAT_STRING_CST.  Convert calls to warning_at to calls to
	format_warning_at_char.  Pass a substring_loc instance to
	check_format_types.
	(check_format_types): Convert first param from a location_t
	to a const substring_loc & and rename to "fmt_loc".  Attempt
	to extract the range of the relevant parameter and pass it
	to format_type_warning.
	(format_type_warning): Convert first param from a location_t
	to a const substring_loc & and rename to "fmt_loc".  Add
	params "param_range" and "type".  Replace calls to warning_at
	with calls to format_warning_at_substring.

gcc/testsuite/ChangeLog:
	PR c/52952
	* gcc.dg/cpp/pr66415-1.c: Add -fdiagnostics-show-caret and expected
	caret output.
	* gcc.dg/format/asm_fprintf-1.c: Update column numbers.
	* gcc.dg/format/c90-printf-1.c: Likewise.
	* gcc.dg/format/diagnostic-ranges.c: New test case.
---
 gcc/c-family/c-format.c                         | 476 +++++++++++++++---------
 gcc/testsuite/gcc.dg/cpp/pr66415-1.c            |   8 +-
 gcc/testsuite/gcc.dg/format/asm_fprintf-1.c     |   6 +-
 gcc/testsuite/gcc.dg/format/c90-printf-1.c      |  14 +-
 gcc/testsuite/gcc.dg/format/diagnostic-ranges.c | 222 +++++++++++
 5 files changed, 544 insertions(+), 182 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/format/diagnostic-ranges.c

diff --git a/gcc/c-family/c-format.c b/gcc/c-family/c-format.c
index c19c411..5b79588 100644
--- a/gcc/c-family/c-format.c
+++ b/gcc/c-family/c-format.c
@@ -29,6 +29,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "intl.h"
 #include "langhooks.h"
 #include "c-format.h"
+#include "diagnostic.h"
 
 /* Handle attributes associated with format checking.  */
 
@@ -65,78 +66,169 @@ static int first_target_format_type;
 static const char *format_name (int format_num);
 static int format_flags (int format_num);
 
-/* Given a string S of length LINE_WIDTH, find the visual column
-   corresponding to OFFSET bytes.   */
+/* Emit a warning governed by option OPT, using GMSGID as the format
+   string and AP as its arguments.
 
-static unsigned int
-location_column_from_byte_offset (const char *s, int line_width,
-				  unsigned int offset)
-{
-  const char * c = s;
-  if (*c != '"')
-    return 0;
+   Attempt to obtain precise location information within a string
+   literal from FMT_LOC.
+
+   Case 1: if substring location is available, and is within the range of
+   the format string itself, the primary location of the
+   diagnostic is the substring range obtained from FMT_LOC, with the
+   caret at the *end* of the substring range.
+
+   For example:
+
+     test.c:90:10: warning: problem with '%i' here [-Wformat=]
+     printf ("hello %i", msg);
+                    ~^
+
+   Case 2: if the substring location is available, but is not within
+   the range of the format string, the primary location is that of the
+   format string, and a note is emitted showing the substring location.
+
+   For example:
+     test.c:90:10: warning: problem with '%i' here [-Wformat=]
+     printf("hello " INT_FMT " world", msg);
+            ^~~~~~~~~~~~~~~~~~~~~~~~~
+     test.c:19: note: format string is defined here
+     #define INT_FMT "%i"
+                      ~^
+
+   Case 3: if precise substring information is unavailable, the primary
+   location is that of the whole string passed to FMT_LOC's constructor.
+   For example:
+
+     test.c:90:10: warning: problem with '%i' here [-Wformat=]
+     printf(fmt, msg);
+            ^~~
+
+   For each of cases 1-3, if param_range is non-NULL, then it is used
+   as a secondary range within the warning.  For example, here it
+   is used with case 1:
+
+     test.c:90:16: warning: '%s' here but arg 2 has 'long' type [-Wformat=]
+     printf ("foo %s bar", long_i + long_j);
+                  ~^       ~~~~~~~~~~~~~~~
+
+   and here with case 2:
+
+     test.c:90:16: warning: '%s' here but arg 2 has 'long' type [-Wformat=]
+     printf ("foo " STR_FMT " bar", long_i + long_j);
+             ^~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~~~~
+     test.c:89:16: note: format string is defined here
+     #define STR_FMT "%s"
+                      ~^
 
-  c++, offset--;
-  while (offset > 0)
+   and with case 3:
+
+     test.c:90:10: warning: '%i' here, but arg 2 is 'const char *' [-Wformat=]
+     printf(fmt, msg);
+            ^~~  ~~~
+
+   Return true if a warning was emitted, false otherwise.  */
+
+ATTRIBUTE_GCC_DIAG (4,0)
+static bool
+format_warning_va (const substring_loc &fmt_loc, source_range *param_range,
+		   int opt, const char *gmsgid, va_list *ap)
+{
+  bool substring_within_range = false;
+  location_t primary_loc;
+  location_t substring_loc = UNKNOWN_LOCATION;
+  source_range fmt_loc_range
+    = get_range_from_loc (line_table, fmt_loc.get_fmt_string_loc ());
+  source_range fmt_substring_range;
+  const char *err = fmt_loc.get_range (&fmt_substring_range);
+  if (err)
+    /* Case 3: unable to get substring location.  */
+    primary_loc = fmt_loc.get_fmt_string_loc ();
+  else
     {
-      if (c - s >= line_width)
-	return 0;
+      substring_loc = make_location (fmt_substring_range.m_finish,
+				     fmt_substring_range.m_start,
+				     fmt_substring_range.m_finish);
 
-      switch (*c)
+      if (fmt_substring_range.m_start >= fmt_loc_range.m_start
+	  && fmt_substring_range.m_finish <= fmt_loc_range.m_finish)
+	/* Case 1.  */
 	{
-	case '\\':
-	  c++;
-	  if (c - s >= line_width)
-	    return 0;
-	  switch (*c)
-	    {
-	    case '\\': case '\'': case '"': case '?':
-	    case '(': case '{': case '[': case '%':
-	    case 'a': case 'b': case 'f': case 'n':
-	    case 'r': case 't': case 'v': 
-	    case 'e': case 'E':
-	      c++, offset--;
-	      break;
-
-	    default:
-	      return 0;
-	    }
-	  break;
-
-	case '"':
-	  /* We found the end of the string too early.  */
-	  return 0;
-	  
-	default:
-	  c++, offset--;
-	  break;
+	  substring_within_range = true;
+	  primary_loc = substring_loc;
 	}
+      else
+	/* Case 2.  */
+	{
+	  substring_within_range = false;
+	  primary_loc = fmt_loc.get_fmt_string_loc ();
+	}
+    }
+
+  rich_location richloc (line_table, primary_loc);
+
+  if (param_range)
+    {
+      location_t param_loc = make_location (param_range->m_start,
+					    param_range->m_start,
+					    param_range->m_finish);
+      richloc.add_range (param_loc, false);
     }
-  return c - s;
+
+  diagnostic_info diagnostic;
+  diagnostic_set_info (&diagnostic, gmsgid, ap, &richloc, DK_WARNING);
+  diagnostic.option_index = opt;
+  bool warned = report_diagnostic (&diagnostic);
+
+  if (!err && substring_loc && !substring_within_range)
+    /* Case 2.  */
+    if (warned)
+      inform (substring_loc, "format string is defined here");
+
+  return warned;
 }
 
-/* Return a location that encodes the same location as LOC but shifted
-   by OFFSET bytes.  */
+/* Variadic call to format_warning_va.  */
 
-static location_t
-location_from_offset (location_t loc, int offset)
+ATTRIBUTE_GCC_DIAG (4,0)
+static bool
+format_warning_at_substring (const substring_loc &fmt_loc,
+			     source_range *param_range,
+			     int opt, const char *gmsgid, ...)
 {
-  gcc_checking_assert (offset >= 0);
-  if (linemap_location_from_macro_expansion_p (line_table, loc)
-      || offset < 0)
-    return loc;
+  va_list ap;
+  va_start (ap, gmsgid);
+  bool warned = format_warning_va (fmt_loc, param_range, opt, gmsgid, &ap);
+  va_end (ap);
+
+  return warned;
+}
 
-  expanded_location s = expand_location_to_spelling_point (loc);
-  int line_width;
-  const char *line = location_get_source_line (s.file, s.line, &line_width);
-  if (line == NULL)
-    return loc;
-  line += s.column - 1 ;
-  line_width -= s.column - 1;
-  unsigned int column =
-    location_column_from_byte_offset (line, line_width, (unsigned) offset);
+/* Emit a warning as per format_warning_va, but construct the substring_loc
+   for the character at offset (CHAR_IDX - 1) within a string constant
+   FORMAT_STRING_CST at FMT_STRING_LOC.  */
 
-  return linemap_position_for_loc_and_offset (line_table, loc, column);
+ATTRIBUTE_GCC_DIAG (5,6)
+static bool
+format_warning_at_char (location_t fmt_string_loc, tree format_string_cst,
+			int char_idx, int opt, const char *gmsgid, ...)
+{
+  va_list ap;
+  va_start (ap, gmsgid);
+  tree string_type = TREE_TYPE (format_string_cst);
+
+  /* The callers are of the form:
+       format_warning_at_char (format_string_loc, format_string_cst,
+			       format_chars - orig_format_chars, ...)
+     where format_chars has already been incremented, so that
+     CHAR_IDX is one character beyond where the warning should
+     be emitted.  Fix it.  */
+  char_idx -= 1;
+
+  substring_loc fmt_loc (fmt_string_loc, string_type, char_idx, char_idx);
+  bool warned = format_warning_va (fmt_loc, NULL, opt, gmsgid, &ap);
+  va_end (ap);
+
+  return warned;
 }
 
 /* Check that we have a pointer to a string suitable for use as a format.
@@ -1018,8 +1110,9 @@ format_flags (int format_num)
 static void check_format_info (function_format_info *, tree);
 static void check_format_arg (void *, tree, unsigned HOST_WIDE_INT);
 static void check_format_info_main (format_check_results *,
-				    function_format_info *,
-				    const char *, int, tree,
+				    function_format_info *, const char *,
+				    location_t, tree,
+				    int, tree,
 				    unsigned HOST_WIDE_INT,
 				    object_allocator<format_wanted_type> &);
 
@@ -1032,8 +1125,12 @@ static void finish_dollar_format_checking (format_check_results *, int);
 static const format_flag_spec *get_flag_spec (const format_flag_spec *,
 					      int, const char *);
 
-static void check_format_types (location_t, format_wanted_type *);
-static void format_type_warning (location_t, format_wanted_type *, tree, tree);
+static void check_format_types (const substring_loc &fmt_loc,
+				format_wanted_type *);
+static void format_type_warning (const substring_loc &fmt_loc,
+				 source_range *param_range,
+				 format_wanted_type *, tree,
+				 tree);
 
 /* Decode a format type from a string, returning the type, or
    format_type_error if not valid, in which case the caller should print an
@@ -1509,6 +1606,8 @@ check_format_arg (void *ctx, tree format_tree,
   tree array_size = 0;
   tree array_init;
 
+  location_t fmt_param_loc = EXPR_LOC_OR_LOC (format_tree, input_location);
+
   if (VAR_P (format_tree))
     {
       /* Pull out a constant value if the front end didn't.  */
@@ -1684,12 +1783,13 @@ check_format_arg (void *ctx, tree format_tree,
      need not adjust it for every return.  */
   res->number_other++;
   object_allocator <format_wanted_type> fwt_pool ("format_wanted_type pool");
-  check_format_info_main (res, info, format_chars, format_length,
-			  params, arg_num, fwt_pool);
+  check_format_info_main (res, info, format_chars, fmt_param_loc, format_tree,
+			  format_length, params, arg_num, fwt_pool);
 }
 
 
-/* Do the main part of checking a call to a format function.  FORMAT_CHARS
+/* Do the main part of checking a call to a format function.
+   FORMAT_STRING_CST is the STRING_CST format string.  FORMAT_CHARS
    is the NUL-terminated format string (which at this point may contain
    internal NUL characters); FORMAT_LENGTH is its length (excluding the
    terminating NUL character).  ARG_NUM is one less than the number of
@@ -1699,6 +1799,7 @@ check_format_arg (void *ctx, tree format_tree,
 static void
 check_format_info_main (format_check_results *res,
 			function_format_info *info, const char *format_chars,
+			location_t fmt_param_loc, tree format_string_cst,
 			int format_length, tree params,
 			unsigned HOST_WIDE_INT arg_num,
 			object_allocator <format_wanted_type> &fwt_pool)
@@ -1747,10 +1848,10 @@ check_format_info_main (format_check_results *res,
 	continue;
       if (*format_chars == 0)
 	{
-          warning_at (location_from_offset (format_string_loc,
-					    format_chars - orig_format_chars),
-		      OPT_Wformat_,
-		      "spurious trailing %<%%%> in format");
+	  format_warning_at_char (format_string_loc, format_string_cst,
+				  format_chars - orig_format_chars,
+				  OPT_Wformat_,
+				  "spurious trailing %<%%%> in format");
 	  continue;
 	}
       if (*format_chars == '%')
@@ -1758,6 +1859,7 @@ check_format_info_main (format_check_results *res,
 	  ++format_chars;
 	  continue;
 	}
+      const char *start_of_this_format = format_chars;
       flag_chars[0] = 0;
 
       if ((fki->flags & (int) FMT_FLAG_USE_DOLLAR) && has_operand_number != 0)
@@ -1794,11 +1896,10 @@ check_format_info_main (format_check_results *res,
 						     *format_chars, NULL);
 	  if (strchr (flag_chars, *format_chars) != 0)
 	    {
-	      warning_at (location_from_offset (format_string_loc,
-						format_chars + 1
-						- orig_format_chars),
-			  OPT_Wformat_,
-			  "repeated %s in format", _(s->name));
+	      format_warning_at_char (format_string_loc, format_string_cst,
+				      format_chars + 1 - orig_format_chars,
+				      OPT_Wformat_,
+				      "repeated %s in format", _(s->name));
 	    }
 	  else
 	    {
@@ -1921,10 +2022,11 @@ check_format_info_main (format_check_results *res,
 	  flag_chars[i++] = fki->left_precision_char;
 	  flag_chars[i] = 0;
 	  if (!ISDIGIT (*format_chars))
-	    warning_at (location_from_offset (format_string_loc,
-					      format_chars - orig_format_chars),
-			OPT_Wformat_,
-			"empty left precision in %s format", fki->name);
+	    format_warning_at_char (format_string_loc, format_string_cst,
+				    format_chars - orig_format_chars,
+				    OPT_Wformat_,
+				    "empty left precision in %s format",
+				    fki->name);
 	  while (ISDIGIT (*format_chars))
 	    ++format_chars;
 	}
@@ -2002,10 +2104,11 @@ check_format_info_main (format_check_results *res,
 	    {
 	      if (!(fki->flags & (int) FMT_FLAG_EMPTY_PREC_OK)
 		  && !ISDIGIT (*format_chars))
-		warning_at (location_from_offset (format_string_loc,
-						  format_chars - orig_format_chars),
-			    OPT_Wformat_,
-			    "empty precision in %s format", fki->name);
+		format_warning_at_char (format_string_loc, format_string_cst,
+					format_chars - orig_format_chars,
+					OPT_Wformat_,
+					"empty precision in %s format",
+					fki->name);
 	      while (ISDIGIT (*format_chars))
 		++format_chars;
 	    }
@@ -2090,11 +2193,10 @@ check_format_info_main (format_check_results *res,
 		{
 		  const format_flag_spec *s = get_flag_spec (flag_specs,
 							     *format_chars, NULL);
-		  warning_at (location_from_offset (format_string_loc,
-						    format_chars 
-						    - orig_format_chars),
-			      OPT_Wformat_,
-			      "repeated %s in format", _(s->name));
+		  format_warning_at_char (format_string_loc, format_string_cst,
+					  format_chars - orig_format_chars,
+					  OPT_Wformat_,
+					  "repeated %s in format", _(s->name));
 		}
 	      else
 		{
@@ -2111,10 +2213,10 @@ check_format_info_main (format_check_results *res,
 	  || (!(fki->flags & (int) FMT_FLAG_FANCY_PERCENT_OK)
 	      && format_char == '%'))
 	{
-	  warning_at (location_from_offset (format_string_loc,
-					    format_chars - orig_format_chars),
-		      OPT_Wformat_,
-		      "conversion lacks type at end of format");
+	  format_warning_at_char (format_string_loc, format_string_cst,
+				  format_chars - orig_format_chars,
+				  OPT_Wformat_,
+				  "conversion lacks type at end of format");
 	  continue;
 	}
       format_chars++;
@@ -2125,27 +2227,30 @@ check_format_info_main (format_check_results *res,
       if (fci->format_chars == 0)
 	{
 	  if (ISGRAPH (format_char))
-	    warning_at (location_from_offset (format_string_loc,
-					      format_chars - orig_format_chars),
-			OPT_Wformat_,
-			"unknown conversion type character %qc in format",
-			format_char);
+	    format_warning_at_char
+	      (format_string_loc, format_string_cst,
+	       format_chars - orig_format_chars,
+	       OPT_Wformat_,
+	       "unknown conversion type character %qc in format",
+	       format_char);
 	  else
-	    warning_at (location_from_offset (format_string_loc,
-					      format_chars - orig_format_chars),
-			OPT_Wformat_,
-			"unknown conversion type character 0x%x in format",
-			format_char);
+	    format_warning_at_char
+	      (format_string_loc, format_string_cst,
+	       format_chars - orig_format_chars,
+	       OPT_Wformat_,
+	       "unknown conversion type character 0x%x in format",
+	       format_char);
 	  continue;
 	}
       if (pedantic)
 	{
 	  if (ADJ_STD (fci->std) > C_STD_VER)
-	    warning_at (location_from_offset (format_string_loc,
-					      format_chars - orig_format_chars),
-			OPT_Wformat_,
-			"%s does not support the %<%%%c%> %s format",
-			C_STD_NAME (fci->std), format_char, fki->name);
+	    format_warning_at_char
+	      (format_string_loc, format_string_cst,
+	       format_chars - orig_format_chars,
+	       OPT_Wformat_,
+	       "%s does not support the %<%%%c%> %s format",
+	       C_STD_NAME (fci->std), format_char, fki->name);
 	}
 
       /* Validate the individual flags used, removing any that are invalid.  */
@@ -2160,11 +2265,11 @@ check_format_info_main (format_check_results *res,
 	      continue;
 	    if (strchr (fci->flag_chars, flag_chars[i]) == 0)
 	      {
-		warning_at (location_from_offset (format_string_loc,
-						  format_chars 
-						  - orig_format_chars),
-			    OPT_Wformat_, "%s used with %<%%%c%> %s format",
-			    _(s->name), format_char, fki->name);
+		format_warning_at_char (format_string_loc, format_string_cst,
+					format_chars - orig_format_chars,
+					OPT_Wformat_,
+					"%s used with %<%%%c%> %s format",
+					_(s->name), format_char, fki->name);
 		d++;
 		continue;
 	      }
@@ -2277,10 +2382,10 @@ check_format_info_main (format_check_results *res,
 	    ++format_chars;
 	  if (*format_chars != ']')
 	    /* The end of the format string was reached.  */
-	    warning_at (location_from_offset (format_string_loc,
-					      format_chars - orig_format_chars),
-			OPT_Wformat_,
-			"no closing %<]%> for %<%%[%> format");
+	    format_warning_at_char (format_string_loc, format_string_cst,
+				    format_chars - orig_format_chars,
+				    OPT_Wformat_,
+				    "no closing %<]%> for %<%%[%> format");
 	}
 
       wanted_type = 0;
@@ -2293,12 +2398,14 @@ check_format_info_main (format_check_results *res,
 	  wanted_type_std = fci->types[length_chars_val].std;
 	  if (wanted_type == 0)
 	    {
-	      warning_at (location_from_offset (format_string_loc,
-						format_chars - orig_format_chars),
-			  OPT_Wformat_,
-			  "use of %qs length modifier with %qc type character"
-			  " has either no effect or undefined behavior",
-			  length_chars, format_char);
+	      format_warning_at_char
+		(format_string_loc, format_string_cst,
+		 format_chars - orig_format_chars,
+		 OPT_Wformat_,
+		 "use of %qs length modifier with %qc type"
+		 " character"
+		 " has either no effect or undefined behavior",
+		 length_chars, format_char);
 	      /* Heuristic: skip one argument when an invalid length/type
 		 combination is encountered.  */
 	      arg_num++;
@@ -2314,12 +2421,13 @@ check_format_info_main (format_check_results *res,
 		   && ADJ_STD (wanted_type_std) > ADJ_STD (fci->std))
 	    {
 	      if (ADJ_STD (wanted_type_std) > C_STD_VER)
-		warning_at (location_from_offset (format_string_loc,
-						  format_chars - orig_format_chars),
-			    OPT_Wformat_,
-			    "%s does not support the %<%%%s%c%> %s format",
-			    C_STD_NAME (wanted_type_std), length_chars,
-			    format_char, fki->name);
+		format_warning_at_char
+		  (format_string_loc, format_string_cst,
+		   format_chars - orig_format_chars,
+		   OPT_Wformat_,
+		   "%s does not support the %<%%%s%c%> %s format",
+		   C_STD_NAME (wanted_type_std), length_chars,
+		   format_char, fki->name);
 	    }
 	}
 
@@ -2421,14 +2529,20 @@ check_format_info_main (format_check_results *res,
 	}
 
       if (first_wanted_type != 0)
-        check_format_types (format_string_loc, first_wanted_type);
+	{
+	  ptrdiff_t offset_to_format_start = (start_of_this_format - 1) - orig_format_chars;
+	  ptrdiff_t offset_to_format_end = (format_chars - 1) - orig_format_chars;
+	  substring_loc fmt_loc (fmt_param_loc, TREE_TYPE (format_string_cst),
+				 offset_to_format_start, offset_to_format_end);
+	  check_format_types (fmt_loc, first_wanted_type);
+	}
     }
 
   if (format_chars - orig_format_chars != format_length)
-    warning_at (location_from_offset (format_string_loc,
-				      format_chars + 1 - orig_format_chars),
-		OPT_Wformat_contains_nul,
-		"embedded %<\\0%> in format");
+    format_warning_at_char (format_string_loc, format_string_cst,
+			    format_chars + 1 - orig_format_chars,
+			    OPT_Wformat_contains_nul,
+			    "embedded %<\\0%> in format");
   if (info->first_arg_num != 0 && params != 0
       && has_operand_number <= 0)
     {
@@ -2439,12 +2553,12 @@ check_format_info_main (format_check_results *res,
     finish_dollar_format_checking (res, fki->flags & (int) FMT_FLAG_DOLLAR_GAP_POINTER_OK);
 }
 
-
 /* Check the argument types from a single format conversion (possibly
-   including width and precision arguments).  LOC is the location of
-   the format string.  */
+   including width and precision arguments).  FMT_LOC is the
+   location of the format conversion.  */
 static void
-check_format_types (location_t loc, format_wanted_type *types)
+check_format_types (const substring_loc &fmt_loc,
+		    format_wanted_type *types)
 {
   for (; types != 0; types = types->next)
     {
@@ -2471,7 +2585,7 @@ check_format_types (location_t loc, format_wanted_type *types)
       cur_param = types->param;
       if (!cur_param)
         {
-          format_type_warning (loc, types, wanted_type, NULL);
+          format_type_warning (fmt_loc, NULL, types, wanted_type, NULL);
           continue;
         }
 
@@ -2481,6 +2595,16 @@ check_format_types (location_t loc, format_wanted_type *types)
       orig_cur_type = cur_type;
       char_type_flag = 0;
 
+      source_range param_range;
+      source_range *param_range_ptr;
+      if (CAN_HAVE_LOCATION_P (cur_param))
+	{
+	  param_range = EXPR_LOCATION_RANGE (cur_param);
+	  param_range_ptr = &param_range;
+	}
+      else
+	param_range_ptr = NULL;
+
       STRIP_NOPS (cur_param);
 
       /* Check the types of any additional pointer arguments
@@ -2545,7 +2669,8 @@ check_format_types (location_t loc, format_wanted_type *types)
 	    }
 	  else
 	    {
-              format_type_warning (loc, types, wanted_type, orig_cur_type);
+	      format_type_warning (fmt_loc, param_range_ptr,
+				   types, wanted_type, orig_cur_type);
 	      break;
 	    }
 	}
@@ -2613,20 +2738,24 @@ check_format_types (location_t loc, format_wanted_type *types)
 	  && TYPE_PRECISION (cur_type) == TYPE_PRECISION (wanted_type))
 	continue;
       /* Now we have a type mismatch.  */
-      format_type_warning (loc, types, wanted_type, orig_cur_type);
+      format_type_warning (fmt_loc, param_range_ptr, types,
+			   wanted_type, orig_cur_type);
     }
 }
 
 
-/* Give a warning at LOC about a format argument of different type from that
-   expected.  WANTED_TYPE is the type the argument should have, possibly
-   stripped of pointer dereferences.  The description (such as "field
+/* Give a warning at FMT_LOC about a format argument of different type
+   from that expected.  If non-NULL, PARAM_RANGE is the source range of the
+   relevant argument.  WANTED_TYPE is the type the argument should have,
+   possibly stripped of pointer dereferences.  The description (such as "field
    precision"), the placement in the format string, a possibly more
    friendly name of WANTED_TYPE, and the number of pointer dereferences
    are taken from TYPE.  ARG_TYPE is the type of the actual argument,
    or NULL if it is missing.  */
 static void
-format_type_warning (location_t loc, format_wanted_type *type,
+format_type_warning (const substring_loc &fmt_loc,
+		     source_range *param_range,
+		     format_wanted_type *type,
 		     tree wanted_type, tree arg_type)
 {
   int kind = type->kind;
@@ -2635,7 +2764,6 @@ format_type_warning (location_t loc, format_wanted_type *type,
   int format_length = type->format_length;
   int pointer_count = type->pointer_count;
   int arg_num = type->arg_num;
-  unsigned int offset_loc = type->offset_loc;
 
   char *p;
   /* If ARG_TYPE is a typedef with a misleading name (for example,
@@ -2669,41 +2797,47 @@ format_type_warning (location_t loc, format_wanted_type *type,
       p[pointer_count + 1] = 0;
     }
 
-  loc = location_from_offset (loc, offset_loc);
-		      
   if (wanted_type_name)
     {
       if (arg_type)
-        warning_at (loc, OPT_Wformat_,
-		    "%s %<%s%.*s%> expects argument of type %<%s%s%>, "
-		    "but argument %d has type %qT",
-		    gettext (kind_descriptions[kind]),
-		    (kind == CF_KIND_FORMAT ? "%" : ""),
-		    format_length, format_start, 
-		    wanted_type_name, p, arg_num, arg_type);
+	format_warning_at_substring
+	  (fmt_loc, param_range,
+	   OPT_Wformat_,
+	   "%s %<%s%.*s%> expects argument of type %<%s%s%>, "
+	   "but argument %d has type %qT",
+	   gettext (kind_descriptions[kind]),
+	   (kind == CF_KIND_FORMAT ? "%" : ""),
+	   format_length, format_start,
+	   wanted_type_name, p, arg_num, arg_type);
       else
-        warning_at (loc, OPT_Wformat_,
-		    "%s %<%s%.*s%> expects a matching %<%s%s%> argument",
-		    gettext (kind_descriptions[kind]),
-		    (kind == CF_KIND_FORMAT ? "%" : ""),
-		    format_length, format_start, wanted_type_name, p);
+	format_warning_at_substring
+	  (fmt_loc, param_range,
+	   OPT_Wformat_,
+	   "%s %<%s%.*s%> expects a matching %<%s%s%> argument",
+	   gettext (kind_descriptions[kind]),
+	   (kind == CF_KIND_FORMAT ? "%" : ""),
+	   format_length, format_start, wanted_type_name, p);
     }
   else
     {
       if (arg_type)
-        warning_at (loc, OPT_Wformat_,
-		    "%s %<%s%.*s%> expects argument of type %<%T%s%>, "
-		    "but argument %d has type %qT",
-		    gettext (kind_descriptions[kind]),
-		    (kind == CF_KIND_FORMAT ? "%" : ""),
-		    format_length, format_start, 
-		    wanted_type, p, arg_num, arg_type);
+	format_warning_at_substring
+	  (fmt_loc, param_range,
+	   OPT_Wformat_,
+	   "%s %<%s%.*s%> expects argument of type %<%T%s%>, "
+	   "but argument %d has type %qT",
+	   gettext (kind_descriptions[kind]),
+	   (kind == CF_KIND_FORMAT ? "%" : ""),
+	   format_length, format_start,
+	   wanted_type, p, arg_num, arg_type);
       else
-        warning_at (loc, OPT_Wformat_,
-		    "%s %<%s%.*s%> expects a matching %<%T%s%> argument",
-		    gettext (kind_descriptions[kind]),
-		    (kind == CF_KIND_FORMAT ? "%" : ""),
-		    format_length, format_start, wanted_type, p);
+	format_warning_at_substring
+	  (fmt_loc, param_range,
+	   OPT_Wformat_,
+	   "%s %<%s%.*s%> expects a matching %<%T%s%> argument",
+	   gettext (kind_descriptions[kind]),
+	   (kind == CF_KIND_FORMAT ? "%" : ""),
+	   format_length, format_start, wanted_type, p);
     }
 }
 
diff --git a/gcc/testsuite/gcc.dg/cpp/pr66415-1.c b/gcc/testsuite/gcc.dg/cpp/pr66415-1.c
index 349ec48..1f67cb4 100644
--- a/gcc/testsuite/gcc.dg/cpp/pr66415-1.c
+++ b/gcc/testsuite/gcc.dg/cpp/pr66415-1.c
@@ -1,9 +1,15 @@
 /* PR c/66415 */
 /* { dg-do compile } */
-/* { dg-options "-Wformat" } */
+/* { dg-options "-Wformat -fdiagnostics-show-caret" } */
 
 void
 fn1 (void)
 {
   __builtin_printf                                ("xxxxxxxxxxxxxxxxx%dxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"); /* { dg-warning "71:format" } */
+
+/* { dg-begin-multiline-output "" }
+   __builtin_printf                                ("xxxxxxxxxxxxxxxxx%dxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx");
+                                                                      ~^
+   { dg-end-multiline-output "" } */
+
 }
diff --git a/gcc/testsuite/gcc.dg/format/asm_fprintf-1.c b/gcc/testsuite/gcc.dg/format/asm_fprintf-1.c
index 2eabbf9..50ca572 100644
--- a/gcc/testsuite/gcc.dg/format/asm_fprintf-1.c
+++ b/gcc/testsuite/gcc.dg/format/asm_fprintf-1.c
@@ -66,9 +66,9 @@ foo (int i, int i1, int i2, unsigned int u, double d, char *s, void *p,
   asm_fprintf ("%d", i, i); /* { dg-warning "16:arguments" "wrong number of args" } */
   /* Miscellaneous bogus constructions.  */
   asm_fprintf (""); /* { dg-warning "16:zero-length" "warning for empty format" } */
-  asm_fprintf ("\0"); /* { dg-warning "17:embedded" "warning for embedded NUL" } */
-  asm_fprintf ("%d\0", i); /* { dg-warning "19:embedded" "warning for embedded NUL" } */
-  asm_fprintf ("%d\0%d", i, i); /* { dg-warning "19:embedded|too many" "warning for embedded NUL" } */
+  asm_fprintf ("\0"); /* { dg-warning "18:embedded" "warning for embedded NUL" } */
+  asm_fprintf ("%d\0", i); /* { dg-warning "20:embedded" "warning for embedded NUL" } */
+  asm_fprintf ("%d\0%d", i, i); /* { dg-warning "20:embedded|too many" "warning for embedded NUL" } */
   asm_fprintf (NULL); /* { dg-warning "null" "null format string warning" } */
   asm_fprintf ("%"); /* { dg-warning "17:trailing" "trailing % warning" } */
   asm_fprintf ("%++d", i); /* { dg-warning "19:repeated" "repeated flag warning" } */
diff --git a/gcc/testsuite/gcc.dg/format/c90-printf-1.c b/gcc/testsuite/gcc.dg/format/c90-printf-1.c
index 5329dad..338b971 100644
--- a/gcc/testsuite/gcc.dg/format/c90-printf-1.c
+++ b/gcc/testsuite/gcc.dg/format/c90-printf-1.c
@@ -58,11 +58,11 @@ foo (int i, int i1, int i2, unsigned int u, double d, char *s, void *p,
   printf ("%-%"); /* { dg-warning "13:type" "missing type" } */
   /* { dg-warning "14:trailing" "bogus %%" { target *-*-* } 58 } */
   printf ("%-%\n"); /* { dg-warning "13:format" "bogus %%" } */
-  /* { dg-warning "15:format" "bogus %%" { target *-*-* } 60 } */
+  /* { dg-warning "16:format" "bogus %%" { target *-*-* } 60 } */
   printf ("%5%\n"); /* { dg-warning "13:format" "bogus %%" } */
-  /* { dg-warning "15:format" "bogus %%" { target *-*-* } 62 } */
+  /* { dg-warning "16:format" "bogus %%" { target *-*-* } 62 } */
   printf ("%h%\n"); /* { dg-warning "13:format" "bogus %%" } */
-  /* { dg-warning "15:format" "bogus %%" { target *-*-* } 64 } */
+  /* { dg-warning "16:format" "bogus %%" { target *-*-* } 64 } */
   /* Valid and invalid %h, %l, %L constructions.  */
   printf ("%hd", i);
   printf ("%hi", i);
@@ -184,8 +184,8 @@ foo (int i, int i1, int i2, unsigned int u, double d, char *s, void *p,
   printf ("%-08G", d); /* { dg-warning "11:flags|ignored" "0 flag ignored with - flag" } */
   /* Various tests of bad argument types.  */
   printf ("%d", l); /* { dg-warning "13:format" "bad argument types" } */
-  printf ("%*.*d", l, i2, i); /* { dg-warning "13:field" "bad * argument types" } */
-  printf ("%*.*d", i1, l, i); /* { dg-warning "15:field" "bad * argument types" } */
+  printf ("%*.*d", l, i2, i); /* { dg-warning "16:field" "bad * argument types" } */
+  printf ("%*.*d", i1, l, i); /* { dg-warning "16:field" "bad * argument types" } */
   printf ("%ld", i); /* { dg-warning "14:format" "bad argument types" } */
   printf ("%s", n); /* { dg-warning "13:format" "bad argument types" } */
   printf ("%p", i); /* { dg-warning "13:format" "bad argument types" } */
@@ -231,8 +231,8 @@ foo (int i, int i1, int i2, unsigned int u, double d, char *s, void *p,
   printf ("%d", i, i); /* { dg-warning "11:arguments" "wrong number of args" } */
   /* Miscellaneous bogus constructions.  */
   printf (""); /* { dg-warning "11:zero-length" "warning for empty format" } */
-  printf ("\0"); /* { dg-warning "12:embedded" "warning for embedded NUL" } */
-  printf ("%d\0", i); /* { dg-warning "14:embedded" "warning for embedded NUL" } */
+  printf ("\0"); /* { dg-warning "13:embedded" "warning for embedded NUL" } */
+  printf ("%d\0", i); /* { dg-warning "15:embedded" "warning for embedded NUL" } */
   printf ("%d\0%d", i, i); /* { dg-warning "embedded|too many" "warning for embedded NUL" } */
   printf (NULL); /* { dg-warning "3:null" "null format string warning" } */
   printf ("%"); /* { dg-warning "12:trailing" "trailing % warning" } */
diff --git a/gcc/testsuite/gcc.dg/format/diagnostic-ranges.c b/gcc/testsuite/gcc.dg/format/diagnostic-ranges.c
new file mode 100644
index 0000000..9e86b52
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/format/diagnostic-ranges.c
@@ -0,0 +1,222 @@
+/* { dg-options "-Wformat -fdiagnostics-show-caret" } */
+
+/* See PR 52952. */
+
+#include "format.h"
+
+void test_mismatching_types (const char *msg)
+{
+  printf("hello %i", msg);  /* { dg-warning "format '%i' expects argument of type 'int', but argument 2 has type 'const char \\*' " } */
+
+/* TODO: ideally would also underline "msg".  */
+/* { dg-begin-multiline-output "" }
+   printf("hello %i", msg);
+                 ~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_multiple_arguments (void)
+{
+  printf ("arg0: %i  arg1: %s arg 2: %i", /* { dg-warning "29: format '%s'" } */
+          100, 101, 102);
+/* TODO: ideally would also underline "101".  */
+/* { dg-begin-multiline-output "" }
+   printf ("arg0: %i  arg1: %s arg 2: %i",
+                            ~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_multiple_arguments_2 (int i, int j)
+{
+  printf ("arg0: %i  arg1: %s arg 2: %i", /* { dg-warning "29: format '%s'" } */
+          100, i + j, 102);
+/* { dg-begin-multiline-output "" }
+   printf ("arg0: %i  arg1: %s arg 2: %i",
+                            ~^
+           100, i + j, 102);
+                ~~~~~         
+   { dg-end-multiline-output "" } */
+}
+
+void multiline_format_string (void) {
+  printf ("before the fmt specifier" /* { dg-warning "11: format '%d' expects a matching 'int' argument" } */
+/* { dg-begin-multiline-output "" }
+   printf ("before the fmt specifier"
+           ^~~~~~~~~~~~~~~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+
+          "%"
+          "d" /* { dg-message "12: format string is defined here" } */
+          "after the fmt specifier");
+
+/* { dg-begin-multiline-output "" }
+           "%"
+            ~~
+           "d"
+           ~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_hex (const char *msg)
+{
+  /* "%" is \x25
+     "i" is \x69 */
+  printf("hello \x25\x69", msg);  /* { dg-warning "format '%i' expects argument of type 'int', but argument 2 has type 'const char \\*' " } */
+
+/* TODO: ideally would also underline "msg".  */
+/* { dg-begin-multiline-output "" }
+   printf("hello \x25\x69", msg);
+                 ~~~~~~~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_oct (const char *msg)
+{
+  /* "%" is octal 045
+     "i" is octal 151.  */
+  printf("hello \045\151", msg);  /* { dg-warning "format '%i' expects argument of type 'int', but argument 2 has type 'const char \\*' " } */
+
+/* TODO: ideally would also underline "msg".  */
+/* { dg-begin-multiline-output "" }
+   printf("hello \045\151", msg);
+                 ~~~~~~~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_multiple (const char *msg)
+{
+  /* "%" is \x25 in hex
+     "i" is \151 in octal.  */
+  printf("prefix"  "\x25"  "\151"  "suffix",  /* { dg-warning "format '%i'" } */
+         msg);
+/* { dg-begin-multiline-output "" }
+   printf("prefix"  "\x25"  "\151"  "suffix",
+          ^~~~~~~~
+  { dg-end-multiline-output "" } */
+
+/* TODO: ideally would also underline "msg".  */
+/* { dg-begin-multiline-output "" }
+   printf("prefix"  "\x25"  "\151"  "suffix",
+                     ~~~~~~~~~~~^
+  { dg-end-multiline-output "" } */
+}
+
+void test_u8 (const char *msg)
+{
+  printf(u8"hello %i", msg);/* { dg-warning "format '%i' expects argument of type 'int', but argument 2 has type 'const char \\*' " } */
+/* TODO: ideally would also underline "msg".  */
+/* { dg-begin-multiline-output "" }
+   printf(u8"hello %i", msg);
+                   ~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_param (long long_i, long long_j)
+{
+  printf ("foo %s bar", long_i + long_j); /* { dg-warning "17: format '%s' expects argument of type 'char \\*', but argument 2 has type 'long int'" } */
+/* { dg-begin-multiline-output "" }
+   printf ("foo %s bar", long_i + long_j);
+                ~^       ~~~~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void test_field_width_specifier (long l, int i1, int i2)
+{
+  printf (" %*.*d ", l, i1, i2); /* { dg-warning "17: field width specifier '\\*' expects argument of type 'int', but argument 2 has type 'long int'" } */
+/* { dg-begin-multiline-output "" }
+   printf (" %*.*d ", l, i1, i2);
+             ~~~~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_spurious_percent (void)
+{
+  printf("hello world %"); /* { dg-warning "23: spurious trailing" } */
+
+/* { dg-begin-multiline-output "" }
+   printf("hello world %");
+                       ^
+   { dg-end-multiline-output "" } */
+}
+
+void test_empty_precision (char *s, size_t m, double d)
+{
+  strfmon (s, m, "%#.5n", d); /* { dg-warning "20: empty left precision in gnu_strfmon format" } */
+/* { dg-begin-multiline-output "" }
+   strfmon (s, m, "%#.5n", d);
+                    ^
+   { dg-end-multiline-output "" } */
+
+  strfmon (s, m, "%#5.n", d); /* { dg-warning "22: empty precision in gnu_strfmon format" } */
+/* { dg-begin-multiline-output "" }
+   strfmon (s, m, "%#5.n", d);
+                      ^
+   { dg-end-multiline-output "" } */
+}
+
+void test_repeated (int i)
+{
+  printf ("%++d", i); /* { dg-warning "14: repeated '\\+' flag in format" } */
+/* { dg-begin-multiline-output "" }
+   printf ("%++d", i);
+              ^
+   { dg-end-multiline-output "" } */
+}
+
+void test_conversion_lacks_type (void)
+{
+  printf (" %h"); /* { dg-warning "14:conversion lacks type at end of format" } */
+/* { dg-begin-multiline-output "" }
+   printf (" %h");
+              ^
+   { dg-end-multiline-output "" } */
+}
+
+void test_embedded_nul (void)
+{
+  printf (" \0 "); /* { dg-warning "14:embedded" "warning for embedded NUL" } */
+/* { dg-begin-multiline-output "" }
+   printf (" \0 ");
+             ~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_macro (const char *msg)
+{
+#define INT_FMT "%i" /* { dg-message "19: format string is defined here" } */
+  printf("hello " INT_FMT " world", msg);  /* { dg-warning "10: format '%i' expects argument of type 'int', but argument 2 has type 'const char \\*' " } */
+/* { dg-begin-multiline-output "" }
+   printf("hello " INT_FMT " world", msg);
+          ^~~~~~~~
+   { dg-end-multiline-output "" } */
+/* { dg-begin-multiline-output "" }
+ #define INT_FMT "%i"
+                  ~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_non_contiguous_strings (void)
+{
+  __builtin_printf(" %" "d ", 0.5); /* { dg-warning "20: format .%d. expects argument of type .int., but argument 2 has type .double." } */
+                                    /* { dg-message "26: format string is defined here" "" { target *-*-* } 200 } */
+  /* { dg-begin-multiline-output "" }
+   __builtin_printf(" %" "d ", 0.5);
+                    ^~~~
+   { dg-end-multiline-output "" } */
+  /* { dg-begin-multiline-output "" }
+   __builtin_printf(" %" "d ", 0.5);
+                      ~~~~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_const_arrays (void)
+{
+  /* TODO: ideally we'd highlight both the format string *and* the use of
+     it here.  For now, just verify that we gracefully handle this case.  */
+  const char a[] = " %d ";
+  __builtin_printf(a, 0.5); /* { dg-warning "20: format .%d. expects argument of type .int., but argument 2 has type .double." } */
+  /* { dg-begin-multiline-output "" }
+   __builtin_printf(a, 0.5);
+                    ^
+   { dg-end-multiline-output "" } */
+}
-- 
1.8.5.3


* Re: [PATCH 1/3] (v2) On-demand locations within string-literals
  2016-07-29 17:27                             ` David Malcolm
  2016-07-30  1:18                               ` Manuel López-Ibáñez
@ 2016-08-03 15:56                               ` Jeff Law
  1 sibling, 0 replies; 61+ messages in thread
From: Jeff Law @ 2016-08-03 15:56 UTC (permalink / raw)
  To: David Malcolm, Manuel López-Ibáñez
  Cc: Martin Sebor, GCC Patches, Richard Biener

On 07/29/2016 11:27 AM, David Malcolm wrote:
> On Fri, 2016-07-29 at 17:53 +0100, Manuel López-Ibáñez wrote:
>> On 29 July 2016 at 16:25, David Malcolm <dmalcolm@redhat.com> wrote:
>>>
>>> FWIW, it appears that clang uses the on-demand approach; the
>>> relevant
>>> code appears to be StringLiteral::getLocationOfByte:
>>> http://clang.llvm.org/doxygen/Expr_8cpp_source.html#l01008
>>
>> As far as I know, llvm doesn't do language diagnostics from the
>> middle-end/LTO. Thus, they do not have those problems.
>
> If you really want to have middle-end diagnostics from LTO, I can make
> the on-demand approach work.
>
> I can also do the stored-location approach, but it would mean rewriting
> all the patches again and, I think, would be less efficient.
>
> I would prefer the on-demand approach.
>
> Who is empowered to make a decision here?
ISTM we've got a bit of a deadlock here with the two intertwined 
patches.  I'm wondering if we can move both forward, perhaps without the 
higher quality diagnostics for Martin's work initially.  Then iterate on 
what's in-tree to add the higher quality diagnostics, then figure out 
how to deal with some of the issues we have in the LTO space.

Martin's model of running early or late depending on flags is, IMHO, the 
right approach.  And more generally it's a good solution for other 
problems in this space.  With that in mind, finding a way to get at the 
diagnostics framework from within the middle end and eventually LTO is, 
IMHO, important.

Given that the diagnostics are the uncommon case, I would strongly 
prefer an on-demand approach rather than recording a ton of stuff in the 
front-end for the unlikely case that we're going to want a diagnostic in 
the middle-end or LTO.

Jeff


* Re: [PATCH 1/3] (v2) On-demand locations within string-literals
  2016-07-29 21:42       ` Joseph Myers
  2016-07-30  1:16         ` David Malcolm
@ 2016-08-03 15:59         ` Jeff Law
  2016-08-04 14:27           ` David Malcolm
  1 sibling, 1 reply; 61+ messages in thread
From: Jeff Law @ 2016-08-03 15:59 UTC (permalink / raw)
  To: Joseph Myers, David Malcolm; +Cc: gcc-patches

On 07/29/2016 03:42 PM, Joseph Myers wrote:
> On Tue, 26 Jul 2016, David Malcolm wrote:
>
>> This patch implements precise tracking of source locations for the
>> individual chars within string literals, so that we can e.g. underline
>> specific ranges in -Wformat diagnostics.  It handles macros,
>> concatenated tokens, escaped characters etc.
>
> What if the string literal results from stringizing other tokens (which
> might have arisen in turn from macro expansion, including expansion of
> built-in macros not just those defined in source files, etc.)?  "You don't
> get precise locations" would be a fine answer for such cases - provided
> there is good testsuite coverage of them to show they don't crash the
> compiler or underline nonsensical characters.
I think losing precise locations in some circumstances would be fine as 
well -- as long as we understand the limitations.

And, yes, crashing or underlining nonsensical characters would be bad, 
so it'd be obviously good to test some of that to ensure the fallbacks 
work as expected.

jeff


* Re: [PATCH 1/4] selftest.h: Add ASSERT_TRUE_AT and ASSERT_FALSE_AT
  2016-08-03 15:17           ` [PATCH 1/4] selftest.h: Add ASSERT_TRUE_AT and ASSERT_FALSE_AT David Malcolm
                               ` (2 preceding siblings ...)
  2016-08-03 15:17             ` [PATCH 4/4] c-format.c: suggest the correct format string to use (PR c/64955) David Malcolm
@ 2016-08-03 16:06             ` Jeff Law
  2016-08-04 19:02               ` David Malcolm
  3 siblings, 1 reply; 61+ messages in thread
From: Jeff Law @ 2016-08-03 16:06 UTC (permalink / raw)
  To: David Malcolm, gcc-patches; +Cc: Joseph Myers

On 08/03/2016 09:45 AM, David Malcolm wrote:
> I split out the selftest.h changes from v2 of the kit for ease of review;
> here they are.
>
> Successfully bootstrapped&regrtested in conjunction with the rest of the
> patch kit on x86_64-pc-linux-gnu.
>
> OK for trunk?
>
> gcc/ChangeLog:
> 	* selftest.h (ASSERT_TRUE): Reimplement in terms of...
> 	(ASSERT_TRUE_AT): New macro.
> 	(ASSERT_FALSE): Reimplement in terms of...
> 	(ASSERT_FALSE_AT): New macro.
> 	(ASSERT_STREQ_AT): Fix typo in comment.
OK.  Though I do wonder if these should just be normal functions...  I 
assume there's a good reason for the macro pain :)


jeff


* Re: [PATCH 1/3] (v2) On-demand locations within string-literals
  2016-08-03 15:59         ` [PATCH 1/3] (v2) On-demand locations within string-literals Jeff Law
@ 2016-08-04 14:27           ` David Malcolm
  2016-08-04 17:37             ` Jeff Law
  0 siblings, 1 reply; 61+ messages in thread
From: David Malcolm @ 2016-08-04 14:27 UTC (permalink / raw)
  To: Jeff Law, Joseph Myers; +Cc: gcc-patches

On Wed, 2016-08-03 at 09:59 -0600, Jeff Law wrote:
> On 07/29/2016 03:42 PM, Joseph Myers wrote:
> > On Tue, 26 Jul 2016, David Malcolm wrote:
> > 
> > > This patch implements precise tracking of source locations for
> > > the
> > > individual chars within string literals, so that we can e.g.
> > > underline
> > > specific ranges in -Wformat diagnostics.  It handles macros,
> > > concatenated tokens, escaped characters etc.
> > 
> > What if the string literal results from stringizing other tokens
> > (which
> > might have arisen in turn from macro expansion, including expansion
> > of
> > built-in macros not just those defined in source files, etc.)? 
> >  "You don't
> > get precise locations" would be a fine answer for such cases -
> > provided
> > there is good testsuite coverage of them to show they don't crash
> > the
> > compiler or underline nonsensical characters.
> I think losing precise locations in some circumstances would be fine
> as 
> well -- as long as we understand the limitations.

In v3 of the patch, this fails gracefully.
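
For readers following along, here is a minimal sketch of the
stringification case Joseph describes.  The macro name STRINGIFY and the
helper function are hypothetical and are not taken from the patch kit.
The point is only that the format string below never appears as a quoted
literal in the source, so re-lexing has no per-character locations to
recover and a fallback is needed:

```cpp
#include <cstring>

// Hypothetical macro for illustration: '#' turns the tokens of its
// argument into a string literal during preprocessing.
#define STRINGIFY(x) #x

// The returned string is synthesized by the preprocessor from the
// tokens '%' and 'i'; no quoted literal "%i" exists in the source text.
const char *
stringified_format ()
{
  return STRINGIFY (%i);
}
```

Passing stringified_format () to printf would still be checked by
-Wformat, but any diagnostic could only point at the macro use as a
whole, not at the individual characters.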

> And, yes, crashing or underlining nonsensical characters would be
> bad, 

The API in input.c is get_source_range_for_substring, which returns an
error message (intended for us, rather than end-users); it is wrapped
by this method in c-common.c:

/* Attempt to determine the source range of the substring.
   If successful, return NULL and write the source range to *OUT_RANGE.
   Otherwise return an error message.  Error messages are intended
   for GCC developers (to help debugging) rather than for end-users.  */

const char *
substring_loc::get_range (source_range *out_range) const
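
The calling convention described in that comment (NULL on success, an
error message otherwise) can be sketched as follows.  This is a toy model
under stated assumptions: toy_range, toy_get_range and
toy_range_or_fallback are hypothetical names, the error strings are made
up, and none of this is the real GCC implementation:

```cpp
#include <cstddef>

// Hypothetical stand-in for source_range; columns only, for brevity.
struct toy_range
{
  int start_col;
  int finish_col;
};

// Mirrors the documented convention: on success return NULL and write
// to *OUT_RANGE; otherwise return a developer-facing error message.
const char *
toy_get_range (bool have_substring_info, int start_col, int finish_col,
               toy_range *out_range)
{
  if (!have_substring_info)
    return "substring information unavailable"; // made-up message
  if (finish_col < start_col)
    return "finish before start"; // made-up message
  out_range->start_col = start_col;
  out_range->finish_col = finish_col;
  return NULL;
}

// A caller that falls back to the whole-string range on any error.
toy_range
toy_range_or_fallback (bool have_substring_info, int start_col,
                       int finish_col, toy_range whole_string)
{
  toy_range result;
  if (toy_get_range (have_substring_info, start_col, finish_col,
                     &result) == NULL)
    return result;
  return whole_string;
}
```

Returning a message rather than a bool lets a test harness report why a
lookup failed, while ordinary callers only need the NULL check.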

> so it'd be obviously good to test some of that to ensure the
> fallbacks 
> work as expected.

As for test coverage, v2 and v3 of the kit add over a thousand lines of
selftest code that heavily exercise string lexing, using the
 line_table_case machinery to run the tests with various interesting
boundary conditions with line_table (e.g. near
 LINE_MAP_MAX_LOCATION_WITH_PACKED_RANGES).

In terms of test coverage of the fallbacks, patch 2 of v3 of the kit
directly exercises the substr_loc.get_range in 
gcc.dg/plugin/diagnostic_plugin_test_string_literals.c via
gcc.dg/plugin/diagnostic-test-string-literals-1.c, and some of the
tests there cover the failures, via:

  error_at (strloc, "unable to read substring range: %s", err);

which we wouldn't do in a normal diagnostic (but which is appropriate
for testing the machinery itself).

Patch 3 of the v3 kit adds a format_warning_va function to c-format.c
which is responsible for dealing with failures:
https://gcc.gnu.org/ml/gcc-patches/2016-08/msg00204.html

Sadly the comment got a bit mangled by git in that patch due to the
proximity to the deleted function location_column_from_byte_offset;
here's an inline copy (after patch 4, which adds param
CORRECTED_SUBSTRING for doing fix-it hints for bad format strings):

/* Emit a warning governed by option OPT, using GMSGID as the format
   string and AP as its arguments.

   Attempt to obtain precise location information within a string
   literal from FMT_LOC.

   Case 1: if substring location is available, and is within the range of
   the format string itself, the primary location of the
   diagnostic is the substring range obtained from FMT_LOC, with the
   caret at the *end* of the substring range.

   For example:

     test.c:90:10: warning: problem with '%i' here [-Wformat=]
     printf ("hello %i", msg);
                    ~^

   Case 2: if the substring location is available, but is not within
   the range of the format string, the primary location is that of the
   format string, and a note is emitted showing the substring location.

   For example:
     test.c:90:10: warning: problem with '%i' here [-Wformat=]
     printf("hello " INT_FMT " world", msg);
            ^~~~~~~~~~~~~~~~~~~~~~~~~
     test.c:19: note: format string is defined here
     #define INT_FMT "%i"
                      ~^

   Case 3: if precise substring information is unavailable, the primary
   location is that of the whole string passed to FMT_LOC's constructor.
   For example:

     test.c:90:10: warning: problem with '%i' here [-Wformat=]
     printf(fmt, msg);
            ^~~

   For each of cases 1-3, if param_range is non-NULL, then it is used
   as a secondary range within the warning.  For example, here it
   is used with case 1:

     test.c:90:16: warning: '%s' here but arg 2 has 'long' type [-Wformat=]
     printf ("foo %s bar", long_i + long_j);
                  ~^       ~~~~~~~~~~~~~~~

   and here with case 2:

     test.c:90:16: warning: '%s' here but arg 2 has 'long' type [-Wformat=]
     printf ("foo " STR_FMT " bar", long_i + long_j);
             ^~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~~~~
     test.c:89:16: note: format string is defined here
     #define STR_FMT "%s"
                      ~^

   and with case 3:

     test.c:90:10: warning: '%i' here, but arg 2 is 'const char *' [-Wformat=]
     printf(fmt, msg);
            ^~~  ~~~

   If CORRECTED_SUBSTRING is non-NULL, use it for cases 1 and 2 to provide
   a fix-it hint, suggesting that it should replace the text within the
   substring range.  For example:

     test.c:90:10: warning: problem with '%i' here [-Wformat=]
     printf ("hello %i", msg);
                    ~^
                    %s

   Return true if a warning was emitted, false otherwise.  */

ATTRIBUTE_GCC_DIAG (5,0)
static bool
format_warning_va (const substring_loc &fmt_loc, source_range *param_range,
		   const char *corrected_substring,
		   int opt, const char *gmsgid, va_list *ap)

etc
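
To make the three cases in the comment concrete, here is a rough sketch
of the selection logic and of an underline with the caret at the end of
the range.  Everything here is hypothetical (toy_fmt_loc, toy_classify,
toy_underline, and the simplification that a location is a line plus a
column range); the real code works on location_t values and may decide
"within the format string" differently:

```cpp
#include <cstddef>
#include <string>

enum toy_case
{
  TOY_CASE_1, // substring is within the format string: use it as primary
  TOY_CASE_2, // substring is elsewhere (e.g. in a macro): emit a note
  TOY_CASE_3  // no substring information: use the whole string
};

// Hypothetical simplified location: one line, 1-based inclusive columns.
struct toy_fmt_loc
{
  int line;
  int start_col;
  int finish_col;
};

// SUBSTRING is NULL when precise information is unavailable.
toy_case
toy_classify (const toy_fmt_loc *substring, const toy_fmt_loc &fmt_string)
{
  if (!substring)
    return TOY_CASE_3;
  bool within = (substring->line == fmt_string.line
                 && substring->start_col >= fmt_string.start_col
                 && substring->finish_col <= fmt_string.finish_col);
  return within ? TOY_CASE_1 : TOY_CASE_2;
}

// Render tildes over the range with the caret on its last column.
// Assumes 1 <= start_col <= finish_col.
std::string
toy_underline (int start_col, int finish_col)
{
  std::string line (static_cast<std::size_t> (start_col - 1), ' ');
  line.append (static_cast<std::size_t> (finish_col - start_col), '~');
  line += '^';
  return line;
}
```

With a format string on line 90 and a "%i" that comes from a #define on
line 19, toy_classify picks case 2, matching the INT_FMT example in the
comment.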

Looking at patch 3, there's a fair amount of end-to-end testing in
 gcc.dg/format/diagnostic-ranges.c but it looks like I forgot to add an
end-to-end test there of failure due to stringification; I can add one.
 Is the rest of the v3 patch kit reviewable?

Thanks
Dave


* Re: [PATCH 1/3] (v2) On-demand locations within string-literals
  2016-08-04 14:27           ` David Malcolm
@ 2016-08-04 17:37             ` Jeff Law
  0 siblings, 0 replies; 61+ messages in thread
From: Jeff Law @ 2016-08-04 17:37 UTC (permalink / raw)
  To: David Malcolm, Joseph Myers; +Cc: gcc-patches

On 08/04/2016 08:27 AM, David Malcolm wrote:
>
> As for test coverage, v2 and v3 of the kit add over a thousand lines of
> selftest code that heavily exercise string lexing, using the
>  line_table_case machinery to run the tests with various interesting
> boundary conditions with line_table (e.g. near
>  LINE_MAP_MAX_LOCATION_WITH_PACKED_RANGES).
>
> In terms of test coverage of the fallbacks, patch 2 of v3 of the kit
> directly exercises the substr_loc.get_range in
> gcc.dg/plugin/diagnostic_plugin_test_string_literals.c via
> gcc.dg/plugin/diagnostic-test-string-literals-1.c, and some of the
> tests there cover the failures, via:
>
>   error_at (strloc, "unable to read substring range: %s", err);
>
> which we wouldn't do in a normal diagnostic (but which is appropriate
> for testing the machinery itself).
>
> Patch 3 of the v3 kit adds a format_warning_va function to c-format.c
> which is responsible for dealing with failures:
> https://gcc.gnu.org/ml/gcc-patches/2016-08/msg00204.html
Thanks for pointing this out.  I hadn't started looking at the meat of 
the on-demand locations until this morning.

>
>
> Looking at patch 3, there's a fair amount of end-to-end testing in
>  gcc.dg/format/diagnostic-ranges.c but it looks like I forgot to add an
> end-to-end test there of failure due to stringification; I can add one.
>  Is the rest of the v3 patch kit reviewable?
Absolutely.  I wasn't trying to imply that it wasn't  -- in fact most of 
it is self-approvable stuff and I've only got a couple questions about 
the rest.

Jeff


* Re: [PATCH 2/4] (v3) On-demand locations within string-literals
  2016-08-03 15:17             ` [PATCH 2/4] (v3) On-demand locations within string-literals David Malcolm
@ 2016-08-04 17:38               ` Jeff Law
  2016-08-04 19:21                 ` David Malcolm
  0 siblings, 1 reply; 61+ messages in thread
From: Jeff Law @ 2016-08-04 17:38 UTC (permalink / raw)
  To: David Malcolm, gcc-patches; +Cc: Joseph Myers

On 08/03/2016 09:45 AM, David Malcolm wrote:
> Changes in v3:
> - Avoid including cpplib.h from input.h
> - Properly handle stringified macro arguments (with tests for this)
> - Minor whitespace fixes
> - Move selftest.h changes to a separate patch
>
> Changes in v2:
> - Tweaks to substring location selftests
> - Many more selftests (EBCDIC, the various wide string types, etc)
> - Clean up conditions in charset.c; require source == execution charset
>   to have substring locations
> - Make string_concat_db field private
> - Return error messages rather than bool
> - Fix source_range for charset.c:convert_escape
> - Introduce class substring_loc
> - Handle bad input locations more gracefully
> - Ensure that we can read substring information for a token which
>   starts in one linemap and ends in another (seen in
>   gcc.dg/cpp/pr69985.c)
>
> This version addresses Joseph's question about stringification of macro
> arguments (by failing gracefully on them), and the modularity
> concerns noted by Manu.
>
> Successfully bootstrapped&regrtested in conjunction with the rest of the
> patch kit on x86_64-pc-linux-gnu.
>
> v2 of the kit successfully passes a full config-list.mk and a successful selftest
> run for stage 1 on powerpc-ibm-aix7.1.3.0 (gcc111), both in conjunction with the
> rest of the patch kit; I plan to repeat those tests.
>
> I believe I can self-approve the changes to input.c, input.h, libcpp,
> and the testsuite; the remaining changes needing approval are those
> to c-family and to gcc.c.
I think that's a fair assessment.  You might consider pulling those out 
as a distinct hunk in the future -- if you haven't noticed, I often try 
to knock out the smaller patches first (without even looking to see how 
much might be bits the author can self-approve).


>
> OK for trunk if it passes testing? (by itself)
>
>
> gcc/c-family/ChangeLog:
> 	* c-common.c: Include "substring-locations.h".
> 	(get_cpp_ttype_from_string_type): New function.
> 	(g_string_concat_db): New global.
> 	(substring_loc::get_range): New method.
> 	* c-common.h (g_string_concat_db): New declaration.
> 	(class substring_loc): New class.
> 	* c-lex.c (lex_string): When concatenating strings, capture the
> 	locations of all tokens using a new obstack, and record the
> 	concatenation locations within g_string_concat_db.
> 	* c-opts.c (c_common_init_options): Construct g_string_concat_db
> 	on the ggc-heap.
>
> gcc/ChangeLog:
> 	* gcc.c (cpp_options): Rename string to...
> 	(cpp_options_): ...this, to avoid clashing with struct in
> 	cpplib.h.
> 	(static_specs): Update initializer for above renaming.
> 	* input.c (string_concat::string_concat): New constructor.
> 	(string_concat_db::string_concat_db): New constructor.
> 	(string_concat_db::record_string_concatenation): New method.
> 	(string_concat_db::get_string_concatenation): New method.
> 	(string_concat_db::get_key_loc): New method.
> 	(class auto_cpp_string_vec): New class.
> 	(get_substring_ranges_for_loc): New function.
> 	(get_source_range_for_substring): New function.
> 	(get_num_source_ranges_for_substring): New function.
> 	(class selftest::lexer_test_options): New class.
> 	(struct selftest::lexer_test): New struct.
> 	(class selftest::ebcdic_execution_charset): New class.
> 	(selftest::ebcdic_execution_charset::s_singleton): New variable.
> 	(selftest::lexer_test::lexer_test): New constructor.
> 	(selftest::lexer_test::~lexer_test): New destructor.
> 	(selftest::lexer_test::get_token): New method.
> 	(selftest::assert_char_at_range): New function.
> 	(ASSERT_CHAR_AT_RANGE): New macro.
> 	(selftest::assert_num_substring_ranges): New function.
> 	(ASSERT_NUM_SUBSTRING_RANGES): New macro.
> 	(selftest::assert_has_no_substring_ranges): New function.
> 	(ASSERT_HAS_NO_SUBSTRING_RANGES): New macro.
> 	(selftest::test_lexer_string_locations_simple): New function.
> 	(selftest::test_lexer_string_locations_ebcdic): New function.
> 	(selftest::test_lexer_string_locations_hex): New function.
> 	(selftest::test_lexer_string_locations_oct): New function.
> 	(selftest::test_lexer_string_locations_letter_escape_1): New function.
> 	(selftest::test_lexer_string_locations_letter_escape_2): New function.
> 	(selftest::test_lexer_string_locations_ucn4): New function.
> 	(selftest::test_lexer_string_locations_ucn8): New function.
> 	(selftest::uint32_from_big_endian): New function.
> 	(selftest::test_lexer_string_locations_wide_string): New function.
> 	(selftest::uint16_from_big_endian): New function.
> 	(selftest::test_lexer_string_locations_string16): New function.
> 	(selftest::test_lexer_string_locations_string32): New function.
> 	(selftest::test_lexer_string_locations_u8): New function.
> 	(selftest::test_lexer_string_locations_utf8_source): New function.
> 	(selftest::test_lexer_string_locations_concatenation_1): New
> 	function.
> 	(selftest::test_lexer_string_locations_concatenation_2): New
> 	function.
> 	(selftest::test_lexer_string_locations_concatenation_3): New
> 	function.
> 	(selftest::test_lexer_string_locations_macro): New function.
> 	(selftest::test_lexer_string_locations_stringified_macro_argument):
> 	New function.
> 	(selftest::test_lexer_string_locations_non_string): New function.
> 	(selftest::test_lexer_string_locations_long_line): New function.
> 	(selftest::test_lexer_char_constants): New function.
> 	(selftest::input_c_tests): Call the new test functions once per
> 	case within the line_table test matrix.
> 	* input.h (struct string_concat): New struct.
> 	(struct location_hash): New struct.
> 	(class string_concat_db): New class.
> 	* substring-locations.h: New header.
>
> gcc/testsuite/ChangeLog:
> 	* gcc.dg/plugin/diagnostic-test-string-literals-1.c: New file.
> 	* gcc.dg/plugin/diagnostic-test-string-literals-2.c: New file.
> 	* gcc.dg/plugin/diagnostic_plugin_test_string_literals.c: New file.
> 	* gcc.dg/plugin/plugin.exp (plugin_test_list): Add the above new files.
>
> libcpp/ChangeLog:
> 	* charset.c (cpp_substring_ranges::cpp_substring_ranges): New
> 	constructor.
> 	(cpp_substring_ranges::~cpp_substring_ranges): New destructor.
> 	(cpp_substring_ranges::add_range): New method.
> 	(cpp_substring_ranges::add_n_ranges): New method.
> 	(_cpp_valid_ucn): Add "char_range" and "loc_reader" params; if
> 	they are non-NULL, read position information from *loc_reader
> 	and update char_range->m_finish accordingly.
> 	(convert_ucn): Add "char_range", "loc_reader", and "ranges"
> 	params.  If loc_reader is non-NULL, read location information from
> 	it, and update *ranges accordingly, using char_range.
> 	Conditionalize the conversion into tbuf on tbuf being non-NULL.
> 	(convert_hex): Likewise, conditionalizing the call to
> 	emit_numeric_escape on tbuf.
> 	(convert_oct): Likewise.
> 	(convert_escape): Add params "loc_reader" and "ranges".  If
> 	loc_reader is non-NULL, read location information from it, and
> 	update *ranges accordingly.  Conditionalize the conversion into
> 	tbuf on tbuf being non-NULL.
> 	(cpp_interpret_string): Rename to...
> 	(cpp_interpret_string_1): ...this, adding params "loc_readers" and
> 	"out".  Use "to" to conditionalize the initialization and usage of
> 	"tbuf", such as running the converter.  If "loc_readers" is
> 	non-NULL, use the instances within it, reading location
> 	information from them, and passing them to convert_escape; likewise
> 	write to "out" if loc_readers is non-NULL.  Check for leading
> 	quote and issue an error if it is not present.  Update boundary
> 	check from "== limit" to ">= limit" to protect against erroneous
> 	location values to calls that are not parsing string literals.
> 	(cpp_interpret_string): Reimplement in terms of
> 	cpp_interpret_string_1.
> 	(noop_error_cb): New function.
> 	(cpp_interpret_string_ranges): New function.
> 	(cpp_string_location_reader::cpp_string_location_reader): New
> 	constructor.
> 	(cpp_string_location_reader::get_next): New method.
> 	* include/cpplib.h (class cpp_string_location_reader): New class.
> 	(class cpp_substring_ranges): New class.
> 	(cpp_interpret_string_ranges): New prototype.
> 	* internal.h (_cpp_valid_ucn): Add params "char_range" and
> 	"loc_reader".
> 	* lex.c (forms_identifier_p): Pass NULL for new params to
> 	_cpp_valid_ucn.
> ---
>
> diff --git a/gcc/c-family/c-common.c b/gcc/c-family/c-common.c
> index 27031b5..7a8b6ea 100644
> --- a/gcc/c-family/c-common.c
> +++ b/gcc/c-family/c-common.c
> @@ -1098,6 +1099,67 @@ fix_string_type (tree value)
>    TREE_STATIC (value) = 1;
>    return value;
>  }
> +
> +/* Given a string of type STRING_TYPE, determine what kind of string
> +   token created it: CPP_STRING, CPP_STRING16, CPP_STRING32, or
> +   CPP_WSTRING.  Return CPP_OTHER in case of error.
> +
> +   This effectively reverses part of the logic in
> +   lex_string and fix_string_type.  */
> +
> +static enum cpp_ttype
> +get_cpp_ttype_from_string_type (tree string_type)
> +{
> +  gcc_assert (string_type);
> +  if (TREE_CODE (string_type) != ARRAY_TYPE)
> +    return CPP_OTHER;
> +
> +  tree element_type = TREE_TYPE (string_type);
> +  if (TREE_CODE (element_type) != INTEGER_TYPE)
> +    return CPP_OTHER;
> +
> +  int bits_per_character = TYPE_PRECISION (element_type);
> +  switch (bits_per_character)
> +    {
> +    case 8:
> +      return CPP_STRING;  /* It could have also been CPP_UTF8STRING.  */
> +    case 16:
> +      return CPP_STRING16;
> +    case 32:
> +      return CPP_STRING32;
> +    }
> +
> +  if (bits_per_character == TYPE_PRECISION (wchar_type_node))
> +    return CPP_WSTRING;
Doesn't the switch above effectively mean we don't use CPP_WSTRING?  In 
what cases do you expect it to be used?
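
The concern can be illustrated with a toy model of the quoted control
flow.  The names below (toy_ttype, toy_ttype_from_bits) are hypothetical,
the widths are passed in as plain integers rather than read from tree
nodes, and the final TOY_OTHER return is an assumption based on the
function comment, since the quoted hunk stops before the end of the
function:

```cpp
enum toy_ttype
{
  TOY_STRING,
  TOY_STRING16,
  TOY_STRING32,
  TOY_WSTRING,
  TOY_OTHER
};

// Same shape as the quoted code: the switch returns first for 8, 16
// and 32 bits, so the wchar comparison only runs for other widths.
toy_ttype
toy_ttype_from_bits (int bits_per_character, int wchar_bits)
{
  switch (bits_per_character)
    {
    case 8:
      return TOY_STRING;
    case 16:
      return TOY_STRING16;
    case 32:
      return TOY_STRING32;
    }

  if (bits_per_character == wchar_bits)
    return TOY_WSTRING;

  return TOY_OTHER; // assumed error result, per the function comment
}
```

With a 16-bit or 32-bit wchar_t, a wide string is reported as
TOY_STRING16 or TOY_STRING32; TOY_WSTRING is only reachable for a wchar_t
of some other width.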

> diff --git a/gcc/c-family/c-common.h b/gcc/c-family/c-common.h
> index 8c80574..7b5da57 100644
> --- a/gcc/c-family/c-common.h
> +++ b/gcc/c-family/c-common.h
> @@ -1110,6 +1110,35 @@ extern time_t cb_get_source_date_epoch (cpp_reader *pfile);
>     __TIME__ can store.  */
>  #define MAX_SOURCE_DATE_EPOCH HOST_WIDE_INT_C (253402300799)
>
> +extern GTY(()) string_concat_db *g_string_concat_db;
Presumably this DB needs to persist through the entire compilation unit 
and the nodes inside reference GC'd objects, right?  Just want to make 
100% sure that we need to expose this to the GC system before ack-ing.

The rest looks reasonable.  So we just need to reach closure on those 
two issues IMHO.

jeff

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 3/4] Use class substring_loc in c-format.c (PR c/52952)
  2016-08-03 15:17             ` [PATCH 3/4] Use class substring_loc in c-format.c (PR c/52952) David Malcolm
@ 2016-08-04 18:09               ` Jeff Law
  2016-08-04 19:25                 ` David Malcolm
  2016-08-08 20:16                 ` [PATCH 3/4] Use class substring_loc in c-format.c (PR c/52952) David Malcolm
  0 siblings, 2 replies; 61+ messages in thread
From: Jeff Law @ 2016-08-04 18:09 UTC (permalink / raw)
  To: David Malcolm, gcc-patches; +Cc: Joseph Myers

On 08/03/2016 09:45 AM, David Malcolm wrote:
> This patch updates c-format.c to use the new class substring_loc, added
> in the previous patch, replacing location_column_from_byte_offset.
> Hence with this patch, Wformat can underline the precise erroneous
> format string in many more cases.
>
> The patch also introduces two new functions for emitting Wformat
> warnings: format_warning_at_substring and format_warning_at_char,
> providing an inform in the face of macros where the pertinent part of
> the format string may be separate from the function call.
>
> Successfully bootstrapped&regrtested in conjunction with the rest of the
> patch kit on x86_64-pc-linux-gnu.
>
> (The v2 version of the patch had a successful selftest run for stage 1 on
> powerpc-ibm-aix7.1.3.0 (gcc111) in conjunction with the rest of the patch
> kit, and a successful build of stage1 for all targets via config-list.mk;
> the patch has only been rebased since)
>
> OK for trunk if it passes individual testing? (on top of patches 1-2)
>
> gcc/c-family/ChangeLog:
> 	PR c/52952
> 	* c-format.c: Include "diagnostic.h".
> 	(location_column_from_byte_offset): Delete.
> 	(location_from_offset): Delete.
> 	(format_warning_va): New function.
> 	(format_warning_at_substring): New function.
> 	(format_warning_at_char): New function.
> 	(check_format_arg): Capture location of format_tree and pass to
> 	check_format_info_main.
> 	(check_format_info_main): Add params FMT_PARAM_LOC and
> 	FORMAT_STRING_CST.  Convert calls to warning_at to calls to
> 	format_warning_at_char.  Pass a substring_loc instance to
> 	check_format_types.
> 	(check_format_types): Convert first param from a location_t
> 	to a const substring_loc & and rename to "fmt_loc".  Attempt
> 	to extract the range of the relevant parameter and pass it
> 	to format_type_warning.
> 	(format_type_warning): Convert first param from a location_t
> 	to a const substring_loc & and rename to "fmt_loc".  Add
> 	params "param_range" and "type".  Replace calls to warning_at
> 	with calls to format_warning_at_substring.
>
> gcc/testsuite/ChangeLog:
> 	PR c/52952
> 	* gcc.dg/cpp/pr66415-1.c: Likewise.
> 	* gcc.dg/format/asm_fprintf-1.c: Update column numbers.
> 	* gcc.dg/format/c90-printf-1.c: Likewise.
> 	* gcc.dg/format/diagnostic-ranges.c: New test case.
> ---
>

> @@ -1758,6 +1859,7 @@ check_format_info_main (format_check_results *res,
>  	  ++format_chars;
>  	  continue;
>  	}
> +      const char *start_of_this_format = format_chars;
Do you realize that this isn't used for ~700 lines after this point?  Is
there any sensible way to factor some code here to avoid the coding
disconnect?  I realize the function was huge before you got in here, but
if at all possible, I'd like to see a bit of cleanup.

I think this is OK after that cleanup.

jeff

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/4] selftest.h: Add ASSERT_TRUE_AT and ASSERT_FALSE_AT
  2016-08-03 16:06             ` [PATCH 1/4] selftest.h: Add ASSERT_TRUE_AT and ASSERT_FALSE_AT Jeff Law
@ 2016-08-04 19:02               ` David Malcolm
  0 siblings, 0 replies; 61+ messages in thread
From: David Malcolm @ 2016-08-04 19:02 UTC (permalink / raw)
  To: Jeff Law, gcc-patches; +Cc: Joseph Myers

On Wed, 2016-08-03 at 10:06 -0600, Jeff Law wrote:
> On 08/03/2016 09:45 AM, David Malcolm wrote:
> > I split out the selftest.h changes from v2 of the kit for ease of
> > review;
> > here they are.
> > 
> > Successfully bootstrapped&regrtested in conjunction with the rest
> > of the
> > patch kit on x86_64-pc-linux-gnu.
> > 
> > OK for trunk?
> > 
> > gcc/ChangeLog:
> > 	* selftest.h (ASSERT_TRUE): Reimplement in terms of...
> > 	(ASSERT_TRUE_AT): New macro.
> > 	(ASSERT_FALSE): Reimplement in terms of...
> > 	(ASSERT_FALSE_AT): New macro.
> > 	(ASSERT_STREQ_AT): Fix typo in comment.
> OK.  Though I do wonder if these should just be normal functions... 
>  I 
> assume there's a good reason for the macro pain :)

I tried to do it with an inline function, but a macro seems to be
better: as a macro, we can capture the stringification of the input
expression, so that we can print it (and its evaluated value) if it
fails.
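
To illustrate the point about stringification, here is a minimal
sketch.  It is not the code in selftest.h; check_result,
check_via_function and CHECK_VIA_MACRO are hypothetical names made up
for this example.  It only shows that a macro can capture the text of
its argument via the preprocessor's # operator, whereas a function
only ever sees the evaluated value.

```cpp
#include <cassert>
#include <string>

// Hypothetical result record, used only for this illustration.
struct check_result
{
  bool passed;            // The evaluated value of the expression.
  std::string expr_text;  // The source text of the expression, if known.
};

// An inline function receives only the evaluated value; the text of
// the caller's expression is not available to it.
inline check_result
check_via_function (bool value)
{
  check_result result;
  result.passed = value;
  result.expr_text = "<unknown>";
  return result;
}

// Helper used by the macro below.
inline check_result
make_check_result (bool value, const char *expr_text)
{
  check_result result;
  result.passed = value;
  result.expr_text = expr_text;
  return result;
}

// A macro can stringify its argument with #EXPR before it is
// evaluated, so a failure message can show the expression itself.
#define CHECK_VIA_MACRO(EXPR) \
  make_check_result (static_cast<bool> (EXPR), #EXPR)
```

With "int x = 3;", CHECK_VIA_MACRO (x + 1 == 5) yields a failing
result whose expr_text is "x + 1 == 5", which is what lets a failure
report name the expression that went wrong.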

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/4] (v3) On-demand locations within string-literals
  2016-08-04 17:38               ` Jeff Law
@ 2016-08-04 19:21                 ` David Malcolm
  2016-08-04 20:18                   ` Jeff Law
  0 siblings, 1 reply; 61+ messages in thread
From: David Malcolm @ 2016-08-04 19:21 UTC (permalink / raw)
  To: Jeff Law, gcc-patches; +Cc: Joseph Myers

On Thu, 2016-08-04 at 11:37 -0600, Jeff Law wrote:
> On 08/03/2016 09:45 AM, David Malcolm wrote:
> > Changes in v3:
> > - Avoid including cpplib.h from input.h
> > - Properly handle stringified macro arguments (with tests for this)
> > - Minor whitespace fixes
> > - Move selftest.h changes to a separate patch
> > 
> > Changes in v2:
> > - Tweaks to substring location selftests
> > - Many more selftests (EBCDIC, the various wide string types, etc)
> > - Clean up conditions in charset.c; require source == execution
> > charset
> >   to have substring locations
> > - Make string_concat_db field private
> > - Return error messages rather than bool
> > - Fix source_range for charset.c:convert_escape
> > - Introduce class substring_loc
> > - Handle bad input locations more gracefully
> > - Ensure that we can read substring information for a token which
> >   starts in one linemap and ends in another (seen in
> >   gcc.dg/cpp/pr69985.c)
> > 
> > This version addresses Joseph's qn about stringification of macro
> > arguments (by failing gracefully on them), and the modularity
> > concerns noted by Manu.
> > 
> > Successfully bootstrapped&regrtested in conjunction with the rest
> > of the
> > patch kit on x86_64-pc-linux-gnu.
> > 
> > v2 of the kit successfully passes a full config-list.mk and a
> > successful selftest
> > run for stage 1 on powerpc-ibm-aix7.1.3.0 (gcc111), both in
> > conjunction with the
> > rest of the patch kit; I plan to repeat those tests.
> > 
> > I believe I can self-approve the changes to input.c, input.h,
> > libcpp,
> > and the testsuite; the remaining changes needing approval are those
> > to c-family and to gcc.c.
> I think that's a fair assessment.  You might consider pulling those
> out 
> as a distinct hunk in the future -- if you haven't noticed, I often
> try 
> to knock out the smaller patches first (without even looking to see
> how 
> much might be bits the author can self-approve).
> 
> 
> > 
> > OK for trunk if it passes testing? (by itself)
> > 
> > 
> > gcc/c-family/ChangeLog:
> > 	* c-common.c: Include "substring-locations.h".
> > 	(get_cpp_ttype_from_string_type): New function.
> > 	(g_string_concat_db): New global.
> > 	(substring_loc::get_range): New method.
> > 	* c-common.h (g_string_concat_db): New declaration.
> > 	(class substring_loc): New class.
> > 	* c-lex.c (lex_string): When concatenating strings, capture the
> > 	locations of all tokens using a new obstack, and record the
> > 	concatenation locations within g_string_concat_db.
> > 	* c-opts.c (c_common_init_options): Construct
> > g_string_concat_db
> > 	on the ggc-heap.
> > 
> > gcc/ChangeLog:
> > 	* gcc.c (cpp_options): Rename string to...
> > 	(cpp_options_): ...this, to avoid clashing with struct in
> > 	cpplib.h.
> > 	(static_specs): Update initialize for above renaming
> > 	* input.c (string_concat::string_concat): New constructor.
> > 	(string_concat_db::string_concat_db): New constructor.
> > 	(string_concat_db::record_string_concatenation): New method.
> > 	(string_concat_db::get_string_concatenation): New method.
> > 	(string_concat_db::get_key_loc): New method.
> > 	(class auto_cpp_string_vec): New class.
> > 	(get_substring_ranges_for_loc): New function.
> > 	(get_source_range_for_substring): New function.
> > 	(get_num_source_ranges_for_substring): New function.
> > 	(class selftest::lexer_test_options): New class.
> > 	(struct selftest::lexer_test): New struct.
> > 	(class selftest::ebcdic_execution_charset): New class.
> > 	(selftest::ebcdic_execution_charset::s_singleton): New
> > variable.
> > 	(selftest::lexer_test::lexer_test): New constructor.
> > 	(selftest::lexer_test::~lexer_test): New destructor.
> > 	(selftest::lexer_test::get_token): New method.
> > 	(selftest::assert_char_at_range): New function.
> > 	(ASSERT_CHAR_AT_RANGE): New macro.
> > 	(selftest::assert_num_substring_ranges): New function.
> > 	(ASSERT_NUM_SUBSTRING_RANGES): New macro.
> > 	(selftest::assert_has_no_substring_ranges): New function.
> > 	(ASSERT_HAS_NO_SUBSTRING_RANGES): New macro.
> > 	(selftest::test_lexer_string_locations_simple): New function.
> > 	(selftest::test_lexer_string_locations_ebcdic): New function.
> > 	(selftest::test_lexer_string_locations_hex): New function.
> > 	(selftest::test_lexer_string_locations_oct): New function.
> > 	(selftest::test_lexer_string_locations_letter_escape_1): New
> > function.
> > 	(selftest::test_lexer_string_locations_letter_escape_2): New
> > function.
> > 	(selftest::test_lexer_string_locations_ucn4): New function.
> > 	(selftest::test_lexer_string_locations_ucn8): New function.
> > 	(selftest::uint32_from_big_endian): New function.
> > 	(selftest::test_lexer_string_locations_wide_string): New
> > function.
> > 	(selftest::uint16_from_big_endian): New function.
> > 	(selftest::test_lexer_string_locations_string16): New function.
> > 	(selftest::test_lexer_string_locations_string32): New function.
> > 	(selftest::test_lexer_string_locations_u8): New function.
> > 	(selftest::test_lexer_string_locations_utf8_source): New
> > function.
> > 	(selftest::test_lexer_string_locations_concatenation_1): New
> > 	function.
> > 	(selftest::test_lexer_string_locations_concatenation_2): New
> > 	function.
> > 	(selftest::test_lexer_string_locations_concatenation_3): New
> > 	function.
> > 	(selftest::test_lexer_string_locations_macro): New function.
> > 	(selftest::test_lexer_string_locations_stringified_macro_argume
> > nt):
> > 	New function.
> > 	(selftest::test_lexer_string_locations_non_string): New
> > function.
> > 	(selftest::test_lexer_string_locations_long_line): New
> > function.
> > 	(selftest::test_lexer_char_constants): New function.
> > 	(selftest::input_c_tests): Call the new test functions once per
> > 	case within the line_table test matrix.
> > 	* input.h (struct string_concat): New struct.
> > 	(struct location_hash): New struct.
> > 	(class string_concat_db): New class.
> > 	* substring-locations.h: New header.
> > 
> > gcc/testsuite/ChangeLog:
> > 	* gcc.dg/plugin/diagnostic-test-string-literals-1.c: New file.
> > 	* gcc.dg/plugin/diagnostic-test-string-literals-2.c: New file.
> > 	* gcc.dg/plugin/diagnostic_plugin_test_string_literals.c: New
> > file.
> > 	* gcc.dg/plugin/plugin.exp (plugin_test_list): Add the above
> > new files.
> > 
> > libcpp/ChangeLog:
> > 	* charset.c (cpp_substring_ranges::cpp_substring_ranges): New
> > 	constructor.
> > 	(cpp_substring_ranges::~cpp_substring_ranges): New destructor.
> > 	(cpp_substring_ranges::add_range): New method.
> > 	(cpp_substring_ranges::add_n_ranges): New method.
> > 	(_cpp_valid_ucn): Add "char_range" and "loc_reader" params; if
> > 	they are non-NULL, read position information from *loc_reader
> > 	and update char_range->m_finish accordingly.
> > 	(convert_ucn): Add "char_range", "loc_reader", and "ranges"
> > 	params.  If loc_reader is non-NULL, read location information
> > from
> > 	it, and update *ranges accordingly, using char_range.
> > 	Conditionalize the conversion into tbuf on tbuf being non-NULL.
> > 	(convert_hex): Likewise, conditionalizing the call to
> > 	emit_numeric_escape on tbuf.
> > 	(convert_oct): Likewise.
> > 	(convert_escape): Add params "loc_reader" and "ranges".  If
> > 	loc_reader is non-NULL, read location information from it, and
> > 	update *ranges accordingly.  Conditionalize the conversion into
> > 	tbuf on tbuf being non-NULL.
> > 	(cpp_interpret_string): Rename to...
> > 	(cpp_interpret_string_1): ...this, adding params "loc_readers"
> > and
> > 	"out".  Use "to" to conditionalize the initialization and usage
> > of
> > 	"tbuf", such as running the converter.  If "loc_readers" is
> > 	non-NULL, use the instances within it, reading location
> > 	information from them, and passing them to convert_escape;
> > likewise
> > 	write to "out" if loc_readers is non-NULL.  Check for leading
> > 	quote and issue an error if it is not present.  Update boundary
> > 	check from "== limit" to ">= limit" to protect against
> > erroneous
> > 	location values to calls that are not parsing string literals.
> > 	(cpp_interpret_string): Reimplement in terms to
> > 	cpp_interpret_string_1.
> > 	(noop_error_cb): New function.
> > 	(cpp_interpret_string_ranges): New function.
> > 	(cpp_string_location_reader::cpp_string_location_reader): New
> > 	constructor.
> > 	(cpp_string_location_reader::get_next): New method.
> > 	* include/cpplib.h (class cpp_string_location_reader): New
> > class.
> > 	(class cpp_substring_ranges): New class.
> > 	(cpp_interpret_string_ranges): New prototype.
> > 	* internal.h (_cpp_valid_ucn): Add params "char_range" and
> > 	"loc_reader".
> > 	* lex.c (forms_identifier_p): Pass NULL for new params to
> > 	_cpp_valid_ucn.
> > ---
> > 
> > diff --git a/gcc/c-family/c-common.c b/gcc/c-family/c-common.c
> > index 27031b5..7a8b6ea 100644
> > --- a/gcc/c-family/c-common.c
> > +++ b/gcc/c-family/c-common.c
> > @@ -1098,6 +1099,67 @@ fix_string_type (tree value)
> >    TREE_STATIC (value) = 1;
> >    return value;
> >  }
> > +
> > +/* Given a string of type STRING_TYPE, determine what kind of
> > string
> > +   token created it: CPP_STRING, CPP_STRING16, CPP_STRING32, or
> > +   CPP_WSTRING.  Return CPP_OTHER in case of error.
> > +
> > +   This effectively reverses part of the logic in
> > +   lex_string and fix_string_type.  */
> > +
> > +static enum cpp_ttype
> > +get_cpp_ttype_from_string_type (tree string_type)
> > +{
> > +  gcc_assert (string_type);
> > +  if (TREE_CODE (string_type) != ARRAY_TYPE)
> > +    return CPP_OTHER;
> > +
> > +  tree element_type = TREE_TYPE (string_type);
> > +  if (TREE_CODE (element_type) != INTEGER_TYPE)
> > +    return CPP_OTHER;
> > +
> > +  int bits_per_character = TYPE_PRECISION (element_type);
> > +  switch (bits_per_character)
> > +    {
> > +    case 8:
> > +      return CPP_STRING;  /* It could have also been
> > CPP_UTF8STRING.  */
> > +    case 16:
> > +      return CPP_STRING16;
> > +    case 32:
> > +      return CPP_STRING32;
> > +    }
> > +
> > +  if (bits_per_character == TYPE_PRECISION (wchar_type_node))
> > +    return CPP_WSTRING;
> Doesn't the switch above effectively mean we don't use CPP_WSTRING? 
>  In 
> what cases do you expect it to be used?

I was attempting to provide an inverse of lex_string and
fix_string_type, going back from a tree type to a cpp_ttype.  The
purpose of the ttype is to determine the execution charset of a
STRING_CST to enforce the requirement in cpp_interpret_string_ranges
that there's a 1:1 correspondence between bytes in the source encoding
and bytes in the execution encoding.

c-lex.c: lex_string has:
	case CPP_WSTRING:
	  value = build_string (TYPE_PRECISION (wchar_type_node)
				/ TYPE_PRECISION (char_type_node),
				"\0\0\0");  /* widest supported wchar_t
					       is 32 bits */

Given that, it looks like it's not possible for that conditional to
fire (unless we somehow have a 24-bit wchar_t???)

Should I just drop the CPP_WSTRING conditional?  (and update the
function comment, to capture the fact that the cpp_ttype is one with
the same execution encoding as the STRING_CST, not necessarily equal to
the exact cpp_ttype that was in use).

> > diff --git a/gcc/c-family/c-common.h b/gcc/c-family/c-common.h
> > index 8c80574..7b5da57 100644
> > --- a/gcc/c-family/c-common.h
> > +++ b/gcc/c-family/c-common.h
> > @@ -1110,6 +1110,35 @@ extern time_t cb_get_source_date_epoch
> > (cpp_reader *pfile);
> >     __TIME__ can store.  */
> >  #define MAX_SOURCE_DATE_EPOCH HOST_WIDE_INT_C (253402300799)
> > 
> > +extern GTY(()) string_concat_db *g_string_concat_db;
> Presumably this DB needs to persist through the entire compilation
> unit 
> and the nodes inside reference GC'd objects, right?  Just want to
> make 
> 100% sure that we need to expose this to the GC system before ack
> -ing.

It needs to persist for as long as we might make queries about
substring locations.  It doesn't reference GC'd objects.  However it
might be desirable to locate locations within string constants loaded
from PCH files, hence I put it in GTY memory.  Is this overkill?

> The rest looks reasonable.  So we just need to reach closure on those
> two issues IMHO.
> 
> jeff

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 3/4] Use class substring_loc in c-format.c (PR c/52952)
  2016-08-04 18:09               ` Jeff Law
@ 2016-08-04 19:25                 ` David Malcolm
  2016-08-04 20:22                   ` Jeff Law
  2016-08-08 20:16                 ` [PATCH 3/4] Use class substring_loc in c-format.c (PR c/52952) David Malcolm
  1 sibling, 1 reply; 61+ messages in thread
From: David Malcolm @ 2016-08-04 19:25 UTC (permalink / raw)
  To: Jeff Law, gcc-patches; +Cc: Joseph Myers, Martin Sebor

On Thu, 2016-08-04 at 12:08 -0600, Jeff Law wrote:
> On 08/03/2016 09:45 AM, David Malcolm wrote:
> > This patch updates c-format.c to use the new class substring_loc,
> > added
> > in the previous patch, replacing location_column_from_byte_offset.
> > Hence with this patch, Wformat can underline the precise erroneous
> > format string in many more cases.
> > 
> > The patch also introduces two new functions for emitting Wformat
> > warnings: format_warning_at_substring and format_warning_at_char,
> > providing an inform in the face of macros where the pertinent part
> > of
> > the format string may be separate from the function call.
> > 
> > Successfully bootstrapped&regrtested in conjunction with the rest
> > of the
> > patch kit on x86_64-pc-linux-gnu.
> > 
> > (The v2 version of the patch had a successful selftest run for
> > stage 1 on
> > powerpc-ibm-aix7.1.3.0 (gcc111) in conjunction with the rest of the
> > patch
> > kit, and a successful build of stage1 for all targets via config
> > -list.mk;
> > the patch has only been rebased since)
> > 
> > OK for trunk if it passes individual testing? (on top of patches 1
> > -2)
> > 
> > gcc/c-family/ChangeLog:
> > 	PR c/52952
> > 	* c-format.c: Include "diagnostic.h".
> > 	(location_column_from_byte_offset): Delete.
> > 	(location_from_offset): Delete.
> > 	(format_warning_va): New function.
> > 	(format_warning_at_substring): New function.
> > 	(format_warning_at_char): New function.
> > 	(check_format_arg): Capture location of format_tree and pass to
> > 	check_format_info_main.
> > 	(check_format_info_main): Add params FMT_PARAM_LOC and
> > 	FORMAT_STRING_CST.  Convert calls to warning_at to calls to
> > 	format_warning_at_char.  Pass a substring_loc instance to
> > 	check_format_types.
> > 	(check_format_types): Convert first param from a location_t
> > 	to a const substring_loc & and rename to "fmt_loc".  Attempt
> > 	to extract the range of the relevant parameter and pass it
> > 	to format_type_warning.
> > 	(format_type_warning): Convert first param from a location_t
> > 	to a const substring_loc & and rename to "fmt_loc".  Add
> > 	params "param_range" and "type".  Replace calls to warning_at
> > 	with calls to format_warning_at_substring.
> > 
> > gcc/testsuite/ChangeLog:
> > 	PR c/52952
> > 	* gcc.dg/cpp/pr66415-1.c: Likewise.
> > 	* gcc.dg/format/asm_fprintf-1.c: Update column numbers.
> > 	* gcc.dg/format/c90-printf-1.c: Likewise.
> > 	* gcc.dg/format/diagnostic-ranges.c: New test case.
> > ---
> > 
> 
> > @@ -1758,6 +1859,7 @@ check_format_info_main (format_check_results
> > *res,
> >  	  ++format_chars;
> >  	  continue;
> >  	}
> > +      const char *start_of_this_format = format_chars;
> Do you realize that this isn't used for ~700 lines after this point? 
>  Is 
> there any sensible way to factor some code here to avoid the coding 
> disconnect.  I realize the function was huge before you got in here,
> but 
> if at all possible, I'd like to see a bit of cleanup.
> 
> I think this is OK after that cleanup.

format_chars can get modified in numerous places in the intervening
lines, which is why I stash the value there.
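
As a rough illustration of why the value is stashed (this is not the
code in check_format_info_main; directive_span and scan_directive are
hypothetical names, and the directive grammar is heavily simplified:
flags, width, h/l length modifiers and one conversion character only):
the cursor is advanced in several places while parsing, so the start
has to be saved up front if we want to describe the whole directive
afterwards.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>

// Hypothetical record of where one directive sits in the format string.
struct directive_span
{
  std::size_t start;   // Byte offset of the '%' within the format string.
  std::size_t length;  // Number of bytes in the directive.
};

// Scan the directive that *CURSOR points at, advancing *CURSOR past it.
static directive_span
scan_directive (const char *fmt, const char **cursor)
{
  assert (**cursor == '%');
  const char *start_of_this_format = *cursor;  // Stash before advancing.

  const char *p = *cursor + 1;                 // Skip the '%'.
  while (*p && std::strchr ("-+ #0", *p))      // Flags.
    ++p;
  while (std::isdigit (static_cast<unsigned char> (*p)))  // Width.
    ++p;
  while (*p == 'h' || *p == 'l')               // Length modifiers.
    ++p;
  if (*p)                                      // Conversion character.
    ++p;
  *cursor = p;

  // Only the stashed pointer still knows where the directive began.
  directive_span span;
  span.start = static_cast<std::size_t> (start_of_this_format - fmt);
  span.length = static_cast<std::size_t> (p - start_of_this_format);
  return span;
}
```

For the format string "x=%-5ld!" with the cursor at the '%', this
gives start 2 and length 5, i.e. the extent of "%-5ld", which is the
kind of range one would want to underline.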

I can do some kind of cleanup of check_format_info_main, maybe
splitting out the things in the body of loop, moving them to support
functions.

That said, I note that Martin's sprintf patch:
  https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00056.html
also touches those ~700 lines in check_format_info_main in over a dozen
places.  Given that, would you prefer I do the cleanup before or after
the substring_loc patch?

[CCing Martin]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 4/4] c-format.c: suggest the correct format string to use (PR c/64955)
  2016-08-03 15:17             ` [PATCH 4/4] c-format.c: suggest the correct format string to use (PR c/64955) David Malcolm
@ 2016-08-04 19:55               ` Jeff Law
  2016-08-04 21:06                 ` David Malcolm
  0 siblings, 1 reply; 61+ messages in thread
From: Jeff Law @ 2016-08-04 19:55 UTC (permalink / raw)
  To: David Malcolm, gcc-patches; +Cc: Joseph Myers

On 08/03/2016 09:45 AM, David Malcolm wrote:
> This adds fix-it hints to c-format.c so that it can (sometimes) suggest
> the format string the user should have used.
>
> The patch adds selftests for the new code in c-format.c.  These
> selftests are thus lang-specific.  This is the first time we've had
> lang-specific selftests, and hence the patch also adds a langhook for
> running them.  (Note that currently the Makefile only invokes the
> selftests for cc1).
>
> Successfully bootstrapped&regrtested in conjunction with the rest of the
> patch kit on x86_64-pc-linux-gnu.
>
> (The v2 version of the patch had a successful selftest run for stage 1 on
> powerpc-ibm-aix7.1.3.0 (gcc111) in conjunction with the rest of the patch
> kit, and a successful build of stage1 for all targets via config-list.mk;
> the patch has only been rebased since)
>
> OK for trunk if it passes testing?
>
> gcc/c-family/ChangeLog:
> 	PR c/64955
> 	* c-common.h (selftest::c_format_c_tests): New declaration.
> 	(selftest::run_c_tests): New declaration.
> 	* c-format.c: Include "selftest.h.
> 	(format_warning_va): Add param "corrected_substring" and use
> 	it to add a replacement fix-it hint.
> 	(format_warning_at_substring): Likewise.
> 	(format_warning_at_char): Update for new param of
> 	format_warning_va.
> 	(check_format_info_main): Pass "fki" to check_format_types.
> 	(check_format_types): Add param "fki" and pass it to
> 	format_type_warning.
> 	(deref_n_times): New function.
> 	(get_modifier_for_format_len): New function.
> 	(selftest::test_get_modifier_for_format_len): New function.
> 	(get_format_for_type): New function.
> 	(format_type_warning): Add param "fki" and use it to attempt
> 	to provide hints for argument types when calling
> 	format_warning_at_substring.
> 	(selftest::get_info): New function.
> 	(selftest::assert_format_for_type_streq): New function.
> 	(ASSERT_FORMAT_FOR_TYPE_STREQ): New macro.
> 	(selftest::test_get_format_for_type_printf): New function.
> 	(selftest::test_get_format_for_type_scanf): New function.
> 	(selftest::c_format_c_tests): New function.
>
> gcc/c/ChangeLog:
> 	PR c/64955
> 	* c-lang.c (LANG_HOOKS_RUN_LANG_SELFTESTS): If CHECKING_P, wire
> 	this up to selftest::run_c_tests.
> 	(selftest::run_c_tests): New function.
>
> gcc/ChangeLog:
> 	PR c/64955
> 	* langhooks-def.h (LANG_HOOKS_RUN_LANG_SELFTESTS): New default
> 	do-nothing langhook.
> 	(LANG_HOOKS_INITIALIZER): Add LANG_HOOKS_RUN_LANG_SELFTESTS.
> 	* langhooks.h (struct lang_hooks): Add run_lang_selftests.
> 	* selftest-run-tests.c: Include "tree.h" and "langhooks.h".
> 	(selftest::run_tests): Call lang_hooks.run_lang_selftests.
>
> gcc/testsuite/ChangeLog:
> 	PR c/64955
> 	* gcc.dg/format/diagnostic-ranges.c: Add fix-it hints to expected
> 	output.
So presumably we always use the type of the argument as the "correct"
type and assume the format string is what needs to be fixed (with the
exception of getting the right number of *s handled).  That seems
intuitively the right thing to do, but do we have a hit rate better than
50% in practice?

This is OK.  I'm really just curious about your thoughts/experiences on 
the heuristics.

jeff


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/4] (v3) On-demand locations within string-literals
  2016-08-04 19:21                 ` David Malcolm
@ 2016-08-04 20:18                   ` Jeff Law
  2016-08-05 18:17                     ` [Committed] [PATCH 2/4] (v4) " David Malcolm
  0 siblings, 1 reply; 61+ messages in thread
From: Jeff Law @ 2016-08-04 20:18 UTC (permalink / raw)
  To: David Malcolm, gcc-patches; +Cc: Joseph Myers

On 08/04/2016 01:21 PM, David Malcolm wrote:

>>> +
>>> +static enum cpp_ttype
>>> +get_cpp_ttype_from_string_type (tree string_type)
>>> +{
>>> +  gcc_assert (string_type);
>>> +  if (TREE_CODE (string_type) != ARRAY_TYPE)
>>> +    return CPP_OTHER;
>>> +
>>> +  tree element_type = TREE_TYPE (string_type);
>>> +  if (TREE_CODE (element_type) != INTEGER_TYPE)
>>> +    return CPP_OTHER;
>>> +
>>> +  int bits_per_character = TYPE_PRECISION (element_type);
>>> +  switch (bits_per_character)
>>> +    {
>>> +    case 8:
>>> +      return CPP_STRING;  /* It could have also been
>>> CPP_UTF8STRING.  */
>>> +    case 16:
>>> +      return CPP_STRING16;
>>> +    case 32:
>>> +      return CPP_STRING32;
>>> +    }
>>> +
>>> +  if (bits_per_character == TYPE_PRECISION (wchar_type_node))
>>> +    return CPP_WSTRING;
>> Doesn't the switch above effectively mean we don't use CPP_WSTRING?
>>  In
>> what cases do you expect it to be used?
>
> I was attempting to provide an inverse of lex_string and
> fix_string_type, going back from a tree type to a cpp_ttype.
And I'm guessing (without looking closely at those routines) that you 
may not be able to reliably map backwards because CPP_WSTRING and one of 
CPP_STRING{16,32} are indistinguishable at this point.  At least I think 
they are indistinguishable.



> The purpose of the ttype is to determine the execution charset of a
> STRING_CST to enforce the requirement in cpp_interpret_string_ranges
> that there's a 1:1 correspondence between bytes in the source encoding
> and bytes in the execution encoding.
>
> c-lex.c: lex_string has:
> 	case CPP_WSTRING:
> 	  value = build_string (TYPE_PRECISION (wchar_type_node)
> 				/ TYPE_PRECISION (char_type_node),
> 				"\0\0\0");  /* widest supported wchar_t
> 					       is 32 bits */
>
> Given that, it looks like it's not possible for that conditional to
> fire (unless we somehow have a 24-bit wchar_t???)
I think wchar_t has to be an integral type, so this could only happen if 
one of the standard integral types was 24 bits.  I guess that is 
possible.  I think the code as written would catch that case -- but then 
again, if we had such a target in GCC, we'd probably end up defining 
CPP_STRING24 and the WCHAR code wouldn't fire.


>
> Should I just drop the CPP_WSTRING conditional?  (and update the
> function comment, to capture the fact that the cpp_ttype is one with
> the same execution encoding as the STRING_CST, not necessarily equal to
> the exact cpp_ttype that was in use).
I'd probably put a default case with some kind of assert/checking 
failure so that if something of this nature ever happens we'll get a nice 
loud message that our assumptions were incorrect ;-)

And yes, I think your suggestion on the function comment is spot-on.

OK with those changes.
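
Roughly what I have in mind, as a standalone sketch rather than the
real function: sketch_ttype and ttype_from_bits are made-up stand-ins
for enum cpp_ttype and get_cpp_ttype_from_string_type, and the plain
assert stands in for whatever checking macro is appropriate in GCC.

```cpp
#include <cassert>

// Hypothetical stand-in for the subset of enum cpp_ttype used here.
enum sketch_ttype { SK_STRING, SK_STRING16, SK_STRING32, SK_OTHER };

// Map an element width in bits to a string token kind with the same
// execution encoding.  This is not necessarily the kind that was
// lexed: an 8-bit string could have been a u8 string, and a wide
// string shares a width with one of the other kinds.
static sketch_ttype
ttype_from_bits (int bits_per_character)
{
  switch (bits_per_character)
    {
    case 8:
      return SK_STRING;
    case 16:
      return SK_STRING16;
    case 32:
      return SK_STRING32;
    default:
      // Fail loudly in checking builds if the assumption about the
      // possible widths ever turns out to be wrong.
      assert (!"unexpected bits_per_character");
      return SK_OTHER;
    }
}
```

The separate wide-string conditional goes away, and an unexpected
width trips the assertion instead of silently falling through.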

>
>>> diff --git a/gcc/c-family/c-common.h b/gcc/c-family/c-common.h
>>> index 8c80574..7b5da57 100644
>>> --- a/gcc/c-family/c-common.h
>>> +++ b/gcc/c-family/c-common.h
>>> @@ -1110,6 +1110,35 @@ extern time_t cb_get_source_date_epoch
>>> (cpp_reader *pfile);
>>>     __TIME__ can store.  */
>>>  #define MAX_SOURCE_DATE_EPOCH HOST_WIDE_INT_C (253402300799)
>>>
>>> +extern GTY(()) string_concat_db *g_string_concat_db;
>> Presumably this DB needs to persist through the entire compilation
>> unit
>> and the nodes inside reference GC'd objects, right?  Just want to
>> make
>> 100% sure that we need to expose this to the GC system before ack
>> -ing.
>
> It needs to persist for as long as we might make queries about
> substring locations.  It doesn't reference GC'd objects.  However it
> might be desirable to locate locations within string constants loaded
> from PCH files, hence I put it in GTY memory.  Is this overkill?
Ugh.  OK.  I'd rather it not be in GC, but I see the motivation.

And presumably with Martin's code wanting to issue warnings from the 
middle-end, we have to keep the table around through the middle end.

jeff

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 3/4] Use class substring_loc in c-format.c (PR c/52952)
  2016-08-04 19:25                 ` David Malcolm
@ 2016-08-04 20:22                   ` Jeff Law
  2016-08-06  0:56                     ` [PATCH] c-format.c: cleanup of check_format_info_main David Malcolm
  0 siblings, 1 reply; 61+ messages in thread
From: Jeff Law @ 2016-08-04 20:22 UTC (permalink / raw)
  To: David Malcolm, gcc-patches; +Cc: Joseph Myers, Martin Sebor

On 08/04/2016 01:24 PM, David Malcolm wrote:

>> Do you realize that this isn't used for ~700 lines after this point?
>>  Is
>> there any sensible way to factor some code here to avoid the coding
>> disconnect.  I realize the function was huge before you got in here,
>> but
>> if at all possible, I'd like to see a bit of cleanup.
>>
>> I think this is OK after that cleanup.
>
> format_chars can get modified in numerous places in the intervening
> lines, which is why I stash the value there.
Yea, I figured that was the case.  I first noticed the stashed value, 
but didn't see where it was used for far longer than I expected.

>
> I can do some kind of cleanup of check_format_info_main, maybe
> splitting out the things in the body of loop, moving them to support
> functions.
That's essentially what I was thinking.

>
> That said, I note that Martin's sprintf patch:
>   https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00056.html
> also touches those ~700 lines in check_format_info_main in over a dozen
> places.  Given that, would you prefer I do the cleanup before or after
> the substring_loc patch?
I think you should go first with the cleanup.  It'll cause Martin some 
heartburn, but that happens sometimes.

And FWIW, if you hadn't needed to stash away that value I probably 
wouldn't have noticed how badly that function (and the loop in 
particular) needed some refactoring.

jeff

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 4/4] c-format.c: suggest the correct format string to use (PR c/64955)
  2016-08-04 19:55               ` Jeff Law
@ 2016-08-04 21:06                 ` David Malcolm
  0 siblings, 0 replies; 61+ messages in thread
From: David Malcolm @ 2016-08-04 21:06 UTC (permalink / raw)
  To: Jeff Law, gcc-patches; +Cc: Joseph Myers

On Thu, 2016-08-04 at 13:55 -0600, Jeff Law wrote:
> On 08/03/2016 09:45 AM, David Malcolm wrote:
> > This adds fix-it hints to c-format.c so that it can (sometimes)
> > suggest
> > the format string the user should have used.
> > 
> > The patch adds selftests for the new code in c-format.c.  These
> > selftests are thus lang-specific.  This is the first time we've had
> > lang-specific selftests, and hence the patch also adds a langhook
> > for
> > running them.  (Note that currently the Makefile only invokes the
> > selftests for cc1).
> > 
> > Successfully bootstrapped&regrtested in conjunction with the rest
> > of the
> > patch kit on x86_64-pc-linux-gnu.
> > 
> > (The v2 version of the patch had a successful selftest run for
> > stage 1 on
> > powerpc-ibm-aix7.1.3.0 (gcc111) in conjunction with the rest of the
> > patch
> > kit, and a successful build of stage1 for all targets via
> > config-list.mk;
> > the patch has only been rebased since)
> > 
> > OK for trunk if it passes testing?
> > 
> > gcc/c-family/ChangeLog:
> > 	PR c/64955
> > 	* c-common.h (selftest::c_format_c_tests): New declaration.
> > 	(selftest::run_c_tests): New declaration.
> > 	* c-format.c: Include "selftest.h".
> > 	(format_warning_va): Add param "corrected_substring" and use
> > 	it to add a replacement fix-it hint.
> > 	(format_warning_at_substring): Likewise.
> > 	(format_warning_at_char): Update for new param of
> > 	format_warning_va.
> > 	(check_format_info_main): Pass "fki" to check_format_types.
> > 	(check_format_types): Add param "fki" and pass it to
> > 	format_type_warning.
> > 	(deref_n_times): New function.
> > 	(get_modifier_for_format_len): New function.
> > 	(selftest::test_get_modifier_for_format_len): New function.
> > 	(get_format_for_type): New function.
> > 	(format_type_warning): Add param "fki" and use it to attempt
> > 	to provide hints for argument types when calling
> > 	format_warning_at_substring.
> > 	(selftest::get_info): New function.
> > 	(selftest::assert_format_for_type_streq): New function.
> > 	(ASSERT_FORMAT_FOR_TYPE_STREQ): New macro.
> > 	(selftest::test_get_format_for_type_printf): New function.
> > 	(selftest::test_get_format_for_type_scanf): New function.
> > 	(selftest::c_format_c_tests): New function.
> > 
> > gcc/c/ChangeLog:
> > 	PR c/64955
> > 	* c-lang.c (LANG_HOOKS_RUN_LANG_SELFTESTS): If CHECKING_P, wire
> > 	this up to selftest::run_c_tests.
> > 	(selftest::run_c_tests): New function.
> > 
> > gcc/ChangeLog:
> > 	PR c/64955
> > 	* langhooks-def.h (LANG_HOOKS_RUN_LANG_SELFTESTS): New default
> > 	do-nothing langhook.
> > 	(LANG_HOOKS_INITIALIZER): Add LANG_HOOKS_RUN_LANG_SELFTESTS.
> > 	* langhooks.h (struct lang_hooks): Add run_lang_selftests.
> > 	* selftest-run-tests.c: Include "tree.h" and "langhooks.h".
> > 	(selftest::run_tests): Call lang_hooks.run_lang_selftests.
> > 
> > gcc/testsuite/ChangeLog:
> > 	PR c/64955
> > 	* gcc.dg/format/diagnostic-ranges.c: Add fix-it hints to
> > expected
> > 	output.
> So presumably we always use the type of the argument as the "correct"
> type and assume the format string is what needs to be fixed (with the
> exception of getting the right amount of *s handled).  That seems 
> intuitively the right thing to do, but do we have a hit rate better
> than 
> 50% in practice?
> 
> This is OK.  I'm really just curious about your thoughts/experiences
> on 
> the heuristics.

Our current behavior is to emit a warning of the form:

test.c:9:18: warning: format ‘%i’ expects argument of type ‘int’, but
argument 2 has type ‘const char *’ [-Wformat=]
   printf("hello %i", msg);
                  ^

Note that the text of the diagnostic tells the user the two types
involved within the message.

My experience with printf-style issues of this form is that I know what
expression I want to print, but I don't necessarily know exactly
whether it's, say, an int vs a long: it's not that I want to print "an
int", it's that I wanted to print some specific expression.  Hence
(assuming this experience is typical) for mismatches of printf-style
calls, I think it's most helpful to the user to tell the user what
format code they need for the expression they want to print, which the
patch does, via the fix-it:

test.c:9:18: warning: format ‘%i’ expects argument of type ‘int’, but
argument 2 has type ‘const char *’ [-Wformat=]
   printf("hello %i", msg);
                 ~^
                 %s

Note how the text of the diagnostic is unchanged; it's just the fixit
that's new (and the underline, which is patch 3 of the kit).

For scanf-style calls this argument may not be so strong, but I think it
still holds for the "do I have an int or a long?" cases (giving int * vs
long *).  It would be nice for scanf to detect the "you forgot to put an
& in front of the destination lvalue" case and offer a fix-it hint for
it, but I think that's a followup.

So I think it's likely to be much better than 50%, but that's a gut
feeling, based on the above arguments.
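
As a rough illustration of the heuristic (trust the argument's type and
propose the conversion that matches it), here is a minimal standalone
sketch.  It is not the code from the patch: GCC works on tree types and
on the format_kind_info tables, whereas arg_kind,
suggest_printf_conversion and make_fixit below are hypothetical names
invented for this example, covering only a handful of types.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-in for the type of a printf argument.
enum class arg_kind { int_, long_, double_, c_string, pointer, unknown };

// Suggest a printf conversion for an argument of kind K, or return an
// empty string when there is no obvious suggestion.
std::string
suggest_printf_conversion (arg_kind k)
{
  switch (k)
    {
    case arg_kind::int_:     return "%d";
    case arg_kind::long_:    return "%ld";
    case arg_kind::double_:  return "%f";
    case arg_kind::c_string: return "%s";
    case arg_kind::pointer:  return "%p";
    case arg_kind::unknown:  break;
    }
  return "";
}

// Build the replacement text for a fix-it hint.  This is meant to be
// called only once a mismatch between WRITTEN and the argument has already
// been diagnosed.  The argument is taken as correct; the conversion is
// what gets replaced.  Returns an empty string if no hint should be
// offered.
std::string
make_fixit (const std::string &written, arg_kind actual)
{
  assert (!written.empty () && written[0] == '%');
  std::string suggestion = suggest_printf_conversion (actual);
  if (suggestion == written)
    return "";  // Nothing would change, so offer no hint.
  return suggestion;
}
```

For the example above, make_fixit ("%i", arg_kind::c_string) gives "%s",
the text shown beneath the underlined conversion.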

I'm not aware of anyone who's done formal usability testing of a
compiler's diagnostics.  I think there are two distinct types of
activity:
(a) the edit->try to compile->edit->try to compile -> "it compiles!"
    cycle (nested within a debug cycle)
(b) compiling pre-existing code, perhaps with a different configuration
    than it was written on, often written by someone else

I'd be very interested in seeing usability studies of both aspects of a
compiler, but in particular of (a): is there a published corpus
somewhere of the kind of half-written non-compiling code that happens
during (a) out there?  (I know this invites various sarcastic responses,
but I'm serious :) )

(For myself, I attempt to run with a relatively recent build of gcc
trunk as my day-to-day compiler, and if I see a diagnostic that could be
improved whilst I'm hacking on something I make a note of it, and try to
come back to it later as an RFE; bug reports of this form are most
welcome).

Dave

* [Committed] [PATCH 2/4] (v4) On-demand locations within string-literals
  2016-08-04 20:18                   ` Jeff Law
@ 2016-08-05 18:17                     ` David Malcolm
  2016-08-06  5:48                       ` Markus Trippelsdorf
  2021-09-02 13:59                       ` [Committed] [PATCH 2/4] (v4) On-demand locations within string-literals Thomas Schwinge
  0 siblings, 2 replies; 61+ messages in thread
From: David Malcolm @ 2016-08-05 18:17 UTC (permalink / raw)
  To: Jeff Law, gcc-patches; +Cc: Joseph Myers

[-- Attachment #1: Type: text/plain, Size: 3578 bytes --]

On Thu, 2016-08-04 at 14:18 -0600, Jeff Law wrote:
> On 08/04/2016 01:21 PM, David Malcolm wrote:
> 
> > > > +
> > > > +static enum cpp_ttype
> > > > +get_cpp_ttype_from_string_type (tree string_type)
> > > > +{
> > > > +  gcc_assert (string_type);
> > > > +  if (TREE_CODE (string_type) != ARRAY_TYPE)
> > > > +    return CPP_OTHER;
> > > > +
> > > > +  tree element_type = TREE_TYPE (string_type);
> > > > +  if (TREE_CODE (element_type) != INTEGER_TYPE)
> > > > +    return CPP_OTHER;
> > > > +
> > > > +  int bits_per_character = TYPE_PRECISION (element_type);
> > > > +  switch (bits_per_character)
> > > > +    {
> > > > +    case 8:
> > > > +      return CPP_STRING;  /* It could have also been
> > > > CPP_UTF8STRING.  */
> > > > +    case 16:
> > > > +      return CPP_STRING16;
> > > > +    case 32:
> > > > +      return CPP_STRING32;
> > > > +    }
> > > > +
> > > > +  if (bits_per_character == TYPE_PRECISION (wchar_type_node))
> > > > +    return CPP_WSTRING;
> > > Doesn't the switch above effectively mean we don't use
> > > CPP_WSTRING?
> > >  In
> > > what cases do you expect it to be used?
> > 
> > I was attempting to provide an inverse of lex_string and
> > fix_string_type, going back from a tree type to a cpp_ttype.
> And I'm guessing (without looking closely at those routines) that you
> may not be able to reliably map backwards because CPP_WSTRING and one
> of 
> CPP_STRING{16,32} are indistinguishable at this point.  At least I
> think 
> they are indistinguishable.
> 
> 
> 
> The
> > purpose of the ttype is to determine the execution charset of a
> > STRING_CST to enforce the requirement in
> > cpp_interpret_string_ranges
> > that there's a 1:1 correspondence between bytes in the source
> > encoding
> > and bytes in the execution encoding.
> > 
> > c-lex.c: lex_string has:
> > 	case CPP_WSTRING:
> > 	  value = build_string (TYPE_PRECISION (wchar_type_node)
> > 				/ TYPE_PRECISION (char_type_node),
> > 				"\0\0\0");  /* widest supported wchar_t
> > 					       is 32 bits */
> > 
> > Given that, it looks like it's not possible for that conditional to
> > fire (unless we somehow have a 24-bit wchar_t???)
> I think wchar_t has to be an integral type, so this could only happen
> if 
> one of the standard integral types was 24 bits.  I guess that is 
> possible.  I think the code as written would catch that case -- but
> then 
> again, if we had such a target in GCC, we'd probably end up defining 
> CPP_STRING24 and the WCHAR code wouldn't fire.
> 
> 
> > 
> > Should I just drop the CPP_WSTRING conditional?  (and update the
> > function comment, to capture the fact that the cpp_ttype is one
> > with
> > the same execution encoding as the STRING_CST, not necessarily
> > equal to
> > the exact cpp_ttype that was in use).
> I'd probably put a default case with some kind of assert/checking 
> failure so that if something of this nature ever happens we'll get a nice 
> loud message that our assumptions were incorrect ;-)
> 
> And yes, I think your suggestion on the function comment is spot-on.
> 
> OK with those changes.

Thanks.  I noticed that the changes to gcc.c were also now redundant
after removing the #include of cpplib.h, so I removed them.

Successfully bootstrapped&regrtested the updated patch on
x86_64-pc-linux-gnu, and successfully ran the stage 1 selftests on
powerpc-ibm-aix7.1.3.0 (gcc111).

Committed to trunk as r239175; I'm attaching the final version of the
patch for reference.

(I'm working on the cleanup of c-format.c's check_format_info_main you
requested as a prerequisite for patch 3 of the kit)
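
For readers skimming the thread, the shape of the mapping discussed
above can be sketched in isolation.  This is a simplified standalone
illustration rather than the committed function: string_token_kind and
token_kind_for_element_bits are hypothetical stand-ins for cpp_ttype and
get_cpp_ttype_from_string_type, and the input is a plain bit width
instead of a tree type.

```cpp
#include <cassert>

// Hypothetical stand-in for the subset of libcpp's cpp_ttype that matters
// here.  "other" plays the role of CPP_OTHER, i.e. failure.
enum class string_token_kind { narrow, char16, char32, other };

// Map the width in bits of a string's element type to a token kind with
// the same execution encoding.  A wide string is not reported separately:
// at this point it cannot be told apart from the 16-bit or 32-bit kind of
// the same width.
string_token_kind
token_kind_for_element_bits (int bits)
{
  assert (bits > 0);
  switch (bits)
    {
    case 8:
      return string_token_kind::narrow;  // Could also have been u8"".
    case 16:
      return string_token_kind::char16;
    case 32:
      return string_token_kind::char32;
    default:
      // An unexpected width (e.g. 24 bits) is reported as a failure that
      // the caller must check for, rather than being guessed at.
      return string_token_kind::other;
    }
}
```

A caller would treat string_token_kind::other the way
substring_loc::get_range treats CPP_OTHER: give up and return an error
message instead of a range.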

[-- Attachment #2: 0002-v4-On-demand-locations-within-string-literals.patch --]
[-- Type: text/x-patch, Size: 115911 bytes --]

From db0cb275de47edf77f63ddf20466f521bae3edfd Mon Sep 17 00:00:00 2001
From: David Malcolm <dmalcolm@redhat.com>
Date: Mon, 31 Aug 2015 21:04:46 -0400
Subject: (v4) On-demand locations within string-literals

Changes in v4:
- get_cpp_ttype_from_string_type: removal of CPP_WSTRING clause;
updating of comment.
- removal of changes to gcc.c made redundant now we don't include
cpplib.h from input.h

Changes in v3:
- Avoid including cpplib.h from input.h
- Properly handle stringified macro arguments (with tests for this)
- Minor whitespace fixes
- Move selftest.h changes to a separate patch

Changes in v2:
- Tweaks to substring location selftests
- Many more selftests (EBCDIC, the various wide string types, etc)
- Clean up conditions in charset.c; require source == execution charset
  to have substring locations
- Make string_concat_db field private
- Return error messages rather than bool
- Fix source_range for charset.c:convert_escape
- Introduce class substring_loc
- Handle bad input locations more gracefully
- Ensure that we can read substring information for a token which
  starts in one linemap and ends in another (seen in
  gcc.dg/cpp/pr69985.c)

gcc/c-family/ChangeLog:
	* c-common.c: Include "substring-locations.h".
	(get_cpp_ttype_from_string_type): New function.
	(g_string_concat_db): New global.
	(substring_loc::get_range): New method.
	* c-common.h (g_string_concat_db): New declaration.
	(class substring_loc): New class.
	* c-lex.c (lex_string): When concatenating strings, capture the
	locations of all tokens using a new obstack, and record the
	concatenation locations within g_string_concat_db.
	* c-opts.c (c_common_init_options): Construct g_string_concat_db
	on the ggc-heap.

gcc/ChangeLog:
	* input.c (string_concat::string_concat): New constructor.
	(string_concat_db::string_concat_db): New constructor.
	(string_concat_db::record_string_concatenation): New method.
	(string_concat_db::get_string_concatenation): New method.
	(string_concat_db::get_key_loc): New method.
	(class auto_cpp_string_vec): New class.
	(get_substring_ranges_for_loc): New function.
	(get_source_range_for_substring): New function.
	(get_num_source_ranges_for_substring): New function.
	(class selftest::lexer_test_options): New class.
	(struct selftest::lexer_test): New struct.
	(class selftest::ebcdic_execution_charset): New class.
	(selftest::ebcdic_execution_charset::s_singleton): New variable.
	(selftest::lexer_test::lexer_test): New constructor.
	(selftest::lexer_test::~lexer_test): New destructor.
	(selftest::lexer_test::get_token): New method.
	(selftest::assert_char_at_range): New function.
	(ASSERT_CHAR_AT_RANGE): New macro.
	(selftest::assert_num_substring_ranges): New function.
	(ASSERT_NUM_SUBSTRING_RANGES): New macro.
	(selftest::assert_has_no_substring_ranges): New function.
	(ASSERT_HAS_NO_SUBSTRING_RANGES): New macro.
	(selftest::test_lexer_string_locations_simple): New function.
	(selftest::test_lexer_string_locations_ebcdic): New function.
	(selftest::test_lexer_string_locations_hex): New function.
	(selftest::test_lexer_string_locations_oct): New function.
	(selftest::test_lexer_string_locations_letter_escape_1): New function.
	(selftest::test_lexer_string_locations_letter_escape_2): New function.
	(selftest::test_lexer_string_locations_ucn4): New function.
	(selftest::test_lexer_string_locations_ucn8): New function.
	(selftest::uint32_from_big_endian): New function.
	(selftest::test_lexer_string_locations_wide_string): New function.
	(selftest::uint16_from_big_endian): New function.
	(selftest::test_lexer_string_locations_string16): New function.
	(selftest::test_lexer_string_locations_string32): New function.
	(selftest::test_lexer_string_locations_u8): New function.
	(selftest::test_lexer_string_locations_utf8_source): New function.
	(selftest::test_lexer_string_locations_concatenation_1): New
	function.
	(selftest::test_lexer_string_locations_concatenation_2): New
	function.
	(selftest::test_lexer_string_locations_concatenation_3): New
	function.
	(selftest::test_lexer_string_locations_macro): New function.
	(selftest::test_lexer_string_locations_stringified_macro_argument):
	New function.
	(selftest::test_lexer_string_locations_non_string): New function.
	(selftest::test_lexer_string_locations_long_line): New function.
	(selftest::test_lexer_char_constants): New function.
	(selftest::input_c_tests): Call the new test functions once per
	case within the line_table test matrix.
	* input.h (struct string_concat): New struct.
	(struct location_hash): New struct.
	(class string_concat_db): New class.
	* substring-locations.h: New header.

gcc/testsuite/ChangeLog:
	* gcc.dg/plugin/diagnostic-test-string-literals-1.c: New file.
	* gcc.dg/plugin/diagnostic-test-string-literals-2.c: New file.
	* gcc.dg/plugin/diagnostic_plugin_test_string_literals.c: New file.
	* gcc.dg/plugin/plugin.exp (plugin_test_list): Add the above new files.

libcpp/ChangeLog:
	* charset.c (cpp_substring_ranges::cpp_substring_ranges): New
	constructor.
	(cpp_substring_ranges::~cpp_substring_ranges): New destructor.
	(cpp_substring_ranges::add_range): New method.
	(cpp_substring_ranges::add_n_ranges): New method.
	(_cpp_valid_ucn): Add "char_range" and "loc_reader" params; if
	they are non-NULL, read position information from *loc_reader
	and update char_range->m_finish accordingly.
	(convert_ucn): Add "char_range", "loc_reader", and "ranges"
	params.  If loc_reader is non-NULL, read location information from
	it, and update *ranges accordingly, using char_range.
	Conditionalize the conversion into tbuf on tbuf being non-NULL.
	(convert_hex): Likewise, conditionalizing the call to
	emit_numeric_escape on tbuf.
	(convert_oct): Likewise.
	(convert_escape): Add params "loc_reader" and "ranges".  If
	loc_reader is non-NULL, read location information from it, and
	update *ranges accordingly.  Conditionalize the conversion into
	tbuf on tbuf being non-NULL.
	(cpp_interpret_string): Rename to...
	(cpp_interpret_string_1): ...this, adding params "loc_readers" and
	"out".  Use "to" to conditionalize the initialization and usage of
	"tbuf", such as running the converter.  If "loc_readers" is
	non-NULL, use the instances within it, reading location
	information from them, and passing them to convert_escape; likewise
	write to "out" if loc_readers is non-NULL.  Check for leading
	quote and issue an error if it is not present.  Update boundary
	check from "== limit" to ">= limit" to protect against erroneous
	location values to calls that are not parsing string literals.
	(cpp_interpret_string): Reimplement in terms of
	cpp_interpret_string_1.
	(noop_error_cb): New function.
	(cpp_interpret_string_ranges): New function.
	(cpp_string_location_reader::cpp_string_location_reader): New
	constructor.
	(cpp_string_location_reader::get_next): New method.
	* include/cpplib.h (class cpp_string_location_reader): New class.
	(class cpp_substring_ranges): New class.
	(cpp_interpret_string_ranges): New prototype.
	* internal.h (_cpp_valid_ucn): Add params "char_range" and
	"loc_reader".
	* lex.c (forms_identifier_p): Pass NULL for new params to
	_cpp_valid_ucn.
---
 gcc/c-family/c-common.c                            |   62 +
 gcc/c-family/c-common.h                            |   29 +
 gcc/c-family/c-lex.c                               |   24 +-
 gcc/c-family/c-opts.c                              |    3 +
 gcc/input.c                                        | 1547 ++++++++++++++++++++
 gcc/input.h                                        |   35 +
 gcc/substring-locations.h                          |   30 +
 .../plugin/diagnostic-test-string-literals-1.c     |  211 +++
 .../plugin/diagnostic-test-string-literals-2.c     |   53 +
 .../diagnostic_plugin_test_string_literals.c       |  212 +++
 gcc/testsuite/gcc.dg/plugin/plugin.exp             |    3 +
 libcpp/charset.c                                   |  432 +++++-
 libcpp/include/cpplib.h                            |   51 +
 libcpp/internal.h                                  |    4 +-
 libcpp/lex.c                                       |    2 +-
 15 files changed, 2642 insertions(+), 56 deletions(-)
 create mode 100644 gcc/substring-locations.h
 create mode 100644 gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c
 create mode 100644 gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-2.c
 create mode 100644 gcc/testsuite/gcc.dg/plugin/diagnostic_plugin_test_string_literals.c

diff --git a/gcc/c-family/c-common.c b/gcc/c-family/c-common.c
index 27031b5..569f000 100644
--- a/gcc/c-family/c-common.c
+++ b/gcc/c-family/c-common.c
@@ -45,6 +45,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-iterator.h"
 #include "opts.h"
 #include "gimplify.h"
+#include "substring-locations.h"
 
 cpp_reader *parse_in;		/* Declared in c-pragma.h.  */
 
@@ -1098,6 +1099,67 @@ fix_string_type (tree value)
   TREE_STATIC (value) = 1;
   return value;
 }
+
+/* Given a string of type STRING_TYPE, determine what kind of string
+   token would give an equivalent execution encoding: CPP_STRING,
+   CPP_STRING16, or CPP_STRING32.  Return CPP_OTHER in case of error.
+   This may not be exactly the string token type that initially created
+   the string, since CPP_WSTRING is indistinguishable from the 16/32 bit
+   string type at this point.
+
+   This effectively reverses part of the logic in lex_string and
+   fix_string_type.  */
+
+static enum cpp_ttype
+get_cpp_ttype_from_string_type (tree string_type)
+{
+  gcc_assert (string_type);
+  if (TREE_CODE (string_type) != ARRAY_TYPE)
+    return CPP_OTHER;
+
+  tree element_type = TREE_TYPE (string_type);
+  if (TREE_CODE (element_type) != INTEGER_TYPE)
+    return CPP_OTHER;
+
+  int bits_per_character = TYPE_PRECISION (element_type);
+  switch (bits_per_character)
+    {
+    case 8:
+      return CPP_STRING;  /* It could have also been CPP_UTF8STRING.  */
+    case 16:
+      return CPP_STRING16;
+    case 32:
+      return CPP_STRING32;
+    }
+
+  return CPP_OTHER;
+}
+
+/* The global record of string concatenations, for use in
+   extracting locations within string literals.  */
+
+GTY(()) string_concat_db *g_string_concat_db;
+
+/* Attempt to determine the source range of the substring.
+   If successful, return NULL and write the source range to *OUT_RANGE.
+   Otherwise return an error message.  Error messages are intended
+   for GCC developers (to help debugging) rather than for end-users.  */
+
+const char *
+substring_loc::get_range (source_range *out_range) const
+{
+  gcc_assert (out_range);
+
+  enum cpp_ttype tok_type = get_cpp_ttype_from_string_type (m_string_type);
+  if (tok_type == CPP_OTHER)
+    return "unrecognized string type";
+
+  return get_source_range_for_substring (parse_in, g_string_concat_db,
+					 m_fmt_string_loc, tok_type,
+					 m_start_idx, m_end_idx,
+					 out_range);
+}
+
 \f
 /* Fold X for consideration by one of the warning functions when checking
    whether an expression has a constant value.  */
diff --git a/gcc/c-family/c-common.h b/gcc/c-family/c-common.h
index 8c80574..7b5da57 100644
--- a/gcc/c-family/c-common.h
+++ b/gcc/c-family/c-common.h
@@ -1110,6 +1110,35 @@ extern time_t cb_get_source_date_epoch (cpp_reader *pfile);
    __TIME__ can store.  */
 #define MAX_SOURCE_DATE_EPOCH HOST_WIDE_INT_C (253402300799)
 
+extern GTY(()) string_concat_db *g_string_concat_db;
+
+/* libcpp can calculate location information about a range of characters
+   within a string literal, but doing so is non-trivial.
+
+   This class encapsulates such a source location, so that it can be
+   passed around (e.g. within c-format.c).  It is effectively a deferred
+   call into libcpp.  If needed by a diagnostic, the actual source_range
+   can be calculated by calling the get_range method.  */
+
+class substring_loc
+{
+ public:
+  substring_loc (location_t fmt_string_loc, tree string_type,
+		 int start_idx, int end_idx)
+  : m_fmt_string_loc (fmt_string_loc), m_string_type (string_type),
+    m_start_idx (start_idx), m_end_idx (end_idx) {}
+
+  const char *get_range (source_range *out_range) const;
+
+  location_t get_fmt_string_loc () const { return m_fmt_string_loc; }
+
+ private:
+  location_t m_fmt_string_loc;
+  tree m_string_type;
+  int m_start_idx;
+  int m_end_idx;
+};
+
 /* In c-gimplify.c  */
 extern void c_genericize (tree);
 extern int c_gimplify_expr (tree *, gimple_seq *, gimple_seq *);
diff --git a/gcc/c-family/c-lex.c b/gcc/c-family/c-lex.c
index 8f33d86..4c7e385 100644
--- a/gcc/c-family/c-lex.c
+++ b/gcc/c-family/c-lex.c
@@ -1097,13 +1097,16 @@ lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
   tree value;
   size_t concats = 0;
   struct obstack str_ob;
+  struct obstack loc_ob;
   cpp_string istr;
   enum cpp_ttype type = tok->type;
 
   /* Try to avoid the overhead of creating and destroying an obstack
      for the common case of just one string.  */
   cpp_string str = tok->val.str;
+  location_t init_loc = tok->src_loc;
   cpp_string *strs = &str;
+  location_t *locs = NULL;
 
   /* objc_at_sign_was_seen is only used when doing Objective-C string
      concatenation.  It is 'true' if we have seen an '@' before the
@@ -1142,16 +1145,21 @@ lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
 	  else
 	    error ("unsupported non-standard concatenation of string literals");
 	}
+      /* FALLTHROUGH */
 
     case CPP_STRING:
       if (!concats)
 	{
 	  gcc_obstack_init (&str_ob);
+	  gcc_obstack_init (&loc_ob);
 	  obstack_grow (&str_ob, &str, sizeof (cpp_string));
+	  obstack_grow (&loc_ob, &init_loc, sizeof (location_t));
 	}
 
       concats++;
       obstack_grow (&str_ob, &tok->val.str, sizeof (cpp_string));
+      obstack_grow (&loc_ob, &tok->src_loc, sizeof (location_t));
+
       if (objc_string)
 	objc_at_sign_was_seen = false;
       goto retry;
@@ -1164,7 +1172,10 @@ lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
   /* We have read one more token than we want.  */
   _cpp_backup_tokens (parse_in, 1);
   if (concats)
-    strs = XOBFINISH (&str_ob, cpp_string *);
+    {
+      strs = XOBFINISH (&str_ob, cpp_string *);
+      locs = XOBFINISH (&loc_ob, location_t *);
+    }
 
   if (concats && !objc_string && !in_system_header_at (input_location))
     warning (OPT_Wtraditional,
@@ -1176,6 +1187,12 @@ lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
     {
       value = build_string (istr.len, (const char *) istr.text);
       free (CONST_CAST (unsigned char *, istr.text));
+      if (concats)
+	{
+	  gcc_assert (locs);
+	  gcc_assert (g_string_concat_db);
+	  g_string_concat_db->record_string_concatenation (concats + 1, locs);
+	}
     }
   else
     {
@@ -1227,7 +1244,10 @@ lex_string (const cpp_token *tok, tree *valp, bool objc_string, bool translate)
   *valp = fix_string_type (value);
 
   if (concats)
-    obstack_free (&str_ob, 0);
+    {
+      obstack_free (&str_ob, 0);
+      obstack_free (&loc_ob, 0);
+    }
 
   return objc_string ? CPP_OBJC_STRING : type;
 }
diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
index c11e7e7..0715b2e 100644
--- a/gcc/c-family/c-opts.c
+++ b/gcc/c-family/c-opts.c
@@ -216,6 +216,9 @@ c_common_init_options (unsigned int decoded_options_count,
   unsigned int i;
   struct cpp_callbacks *cb;
 
+  g_string_concat_db
+    = new (ggc_alloc <string_concat_db> ()) string_concat_db ();
+
   parse_in = cpp_create_reader (c_dialect_cxx () ? CLK_GNUCXX: CLK_GNUC89,
 				ident_hash, line_table);
   cb = cpp_get_callbacks (parse_in);
diff --git a/gcc/input.c b/gcc/input.c
index f91a702..d058b8a 100644
--- a/gcc/input.c
+++ b/gcc/input.c
@@ -1189,6 +1189,279 @@ dump_location_info (FILE *stream)
 				MAX_SOURCE_LOCATION + 1, UINT_MAX);
 }
 
+/* string_concat's constructor.  */
+
+string_concat::string_concat (int num, location_t *locs)
+  : m_num (num)
+{
+  m_locs = ggc_vec_alloc <location_t> (num);
+  for (int i = 0; i < num; i++)
+    m_locs[i] = locs[i];
+}
+
+/* string_concat_db's constructor.  */
+
+string_concat_db::string_concat_db ()
+{
+  m_table = hash_map <location_hash, string_concat *>::create_ggc (64);
+}
+
+/* Record that a string concatenation occurred, covering NUM
+   string literal tokens.  LOCS is an array of size NUM, containing the
+   locations of the tokens.  A copy of LOCS is taken.  */
+
+void
+string_concat_db::record_string_concatenation (int num, location_t *locs)
+{
+  gcc_assert (num > 1);
+  gcc_assert (locs);
+
+  location_t key_loc = get_key_loc (locs[0]);
+
+  string_concat *concat
+    = new (ggc_alloc <string_concat> ()) string_concat (num, locs);
+  m_table->put (key_loc, concat);
+}
+
+/* Determine if LOC was the location of the initial token of a
+   concatenation of string literal tokens.
+   If so, *OUT_NUM is written to with the number of tokens, and
+   *OUT_LOCS with the location of an array of locations of the
+   tokens, and return true.  *OUT_LOCS is a borrowed pointer to
+   storage owned by the string_concat_db.
+   Otherwise, return false.  */
+
+bool
+string_concat_db::get_string_concatenation (location_t loc,
+					    int *out_num,
+					    location_t **out_locs)
+{
+  gcc_assert (out_num);
+  gcc_assert (out_locs);
+
+  location_t key_loc = get_key_loc (loc);
+
+  string_concat **concat = m_table->get (key_loc);
+  if (!concat)
+    return false;
+
+  *out_num = (*concat)->m_num;
+  *out_locs = (*concat)->m_locs;
+  return true;
+}
+
+/* Internal function.  Canonicalize LOC into a form suitable for
+   use as a key within the database, stripping away macro expansion,
+   ad-hoc information, and range information, using the location of
+   the start of LOC within an ordinary linemap.  */
+
+location_t
+string_concat_db::get_key_loc (location_t loc)
+{
+  loc = linemap_resolve_location (line_table, loc, LRK_SPELLING_LOCATION,
+				  NULL);
+
+  loc = get_range_from_loc (line_table, loc).m_start;
+
+  return loc;
+}
+
+/* Helper class for use within get_substring_ranges_for_loc.
+   A vec of cpp_string with responsibility for releasing all of the
+   str->text for each str in the vector.  */
+
+class auto_cpp_string_vec :  public auto_vec <cpp_string>
+{
+ public:
+  auto_cpp_string_vec (int alloc)
+    : auto_vec <cpp_string> (alloc) {}
+
+  ~auto_cpp_string_vec ()
+  {
+    /* Clean up the copies within this vec.  */
+    int i;
+    cpp_string *str;
+    FOR_EACH_VEC_ELT (*this, i, str)
+      free (const_cast <unsigned char *> (str->text));
+  }
+};
+
+/* Attempt to populate RANGES with source location information on the
+   individual characters within the string literal found at STRLOC.
+   If CONCATS is non-NULL, then any string literals that the token at
+   STRLOC was concatenated with are also added to RANGES.
+
+   Return NULL if successful, or an error message if any errors occurred (in
+   which case RANGES may be only partially populated and should not
+   be used).
+
+   This is implemented by re-parsing the relevant source line(s).  */
+
+static const char *
+get_substring_ranges_for_loc (cpp_reader *pfile,
+			      string_concat_db *concats,
+			      location_t strloc,
+			      enum cpp_ttype type,
+			      cpp_substring_ranges &ranges)
+{
+  gcc_assert (pfile);
+
+  if (strloc == UNKNOWN_LOCATION)
+    return "unknown location";
+
+  /* If string concatenation has occurred at STRLOC, get the locations
+     of all of the literal tokens making up the compound string.
+     Otherwise, just use STRLOC.  */
+  int num_locs = 1;
+  location_t *strlocs = &strloc;
+  if (concats)
+    concats->get_string_concatenation (strloc, &num_locs, &strlocs);
+
+  auto_cpp_string_vec strs (num_locs);
+  auto_vec <cpp_string_location_reader> loc_readers (num_locs);
+  for (int i = 0; i < num_locs; i++)
+    {
+      /* Get range of strloc.  We will use it to locate the start and finish
+	 of the literal token within the line.  */
+      source_range src_range = get_range_from_loc (line_table, strlocs[i]);
+
+      if (src_range.m_start >= LINEMAPS_MACRO_LOWEST_LOCATION (line_table))
+	/* If the string is within a macro expansion, we can't get at the
+	   end location.  */
+	return "macro expansion";
+
+      if (src_range.m_start >= LINE_MAP_MAX_LOCATION_WITH_COLS)
+	/* If so, we can't reliably determine where the token started within
+	   its line.  */
+	return "range starts after LINE_MAP_MAX_LOCATION_WITH_COLS";
+
+      if (src_range.m_finish >= LINE_MAP_MAX_LOCATION_WITH_COLS)
+	/* If so, we can't reliably determine where the token finished within
+	   its line.  */
+	return "range ends after LINE_MAP_MAX_LOCATION_WITH_COLS";
+
+      expanded_location start
+	= expand_location_to_spelling_point (src_range.m_start);
+      expanded_location finish
+	= expand_location_to_spelling_point (src_range.m_finish);
+      if (start.file != finish.file)
+	return "range endpoints are in different files";
+      if (start.line != finish.line)
+	return "range endpoints are on different lines";
+      if (start.column > finish.column)
+	return "range endpoints are reversed";
+
+      int line_width;
+      const char *line = location_get_source_line (start.file, start.line,
+						   &line_width);
+      if (line == NULL)
+	return "unable to read source line";
+
+      /* Determine the location of the literal (including quotes
+	 and leading prefix chars, such as the 'u' in a u""
+	 token).  */
+      const char *literal = line + start.column - 1;
+      int literal_length = finish.column - start.column + 1;
+
+      gcc_assert (line_width >= (start.column - 1 + literal_length));
+      cpp_string from;
+      from.len = literal_length;
+      /* Make a copy of the literal, to avoid having to rely on
+	 the lifetime of the copy of the line within the cache.
+	 This will be released by the auto_cpp_string_vec dtor.  */
+      from.text = XDUPVEC (unsigned char, literal, literal_length);
+      strs.safe_push (from);
+
+      /* For very long lines, a new linemap could have started
+	 halfway through the token.
+	 Ensure that the loc_reader uses the linemap of the
+	 *end* of the token for its start location.  */
+      const line_map_ordinary *final_ord_map;
+      linemap_resolve_location (line_table, src_range.m_finish,
+				LRK_MACRO_EXPANSION_POINT, &final_ord_map);
+      location_t start_loc
+	= linemap_position_for_line_and_column (line_table, final_ord_map,
+						start.line, start.column);
+
+      cpp_string_location_reader loc_reader (start_loc, line_table);
+      loc_readers.safe_push (loc_reader);
+    }
+
+  /* Rerun cpp_interpret_string, or rather, a modified version of it.  */
+  const char *err = cpp_interpret_string_ranges (pfile, strs.address (),
+						 loc_readers.address (),
+						 num_locs, &ranges, type);
+  if (err)
+    return err;
+
+  /* Success: "ranges" should now contain information on the string.  */
+  return NULL;
+}
+
+/* Attempt to populate *OUT_RANGE with source location information on the
+   range of given characters within the string literal found at STRLOC.
+   START_IDX and END_IDX are byte offsets within the string as encoded in
+   the execution character set; both are inclusive.
+   If CONCATS is non-NULL, then any string literals that the token at
+   STRLOC was concatenated with are also considered.
+
+   This is implemented by re-parsing the relevant source line(s).
+
+   Return NULL if successful, or an error message if any errors occurred.
+   Error messages are intended for GCC developers (to help debugging) rather
+   than for end-users.  */
+
+const char *
+get_source_range_for_substring (cpp_reader *pfile,
+				string_concat_db *concats,
+				location_t strloc,
+				enum cpp_ttype type,
+				int start_idx, int end_idx,
+				source_range *out_range)
+{
+  gcc_checking_assert (start_idx >= 0);
+  gcc_checking_assert (end_idx >= 0);
+  gcc_assert (out_range);
+
+  cpp_substring_ranges ranges;
+  const char *err
+    = get_substring_ranges_for_loc (pfile, concats, strloc, type, ranges);
+  if (err)
+    return err;
+
+  if (start_idx >= ranges.get_num_ranges ())
+    return "start_idx out of range";
+  if (end_idx >= ranges.get_num_ranges ())
+    return "end_idx out of range";
+
+  out_range->m_start = ranges.get_range (start_idx).m_start;
+  out_range->m_finish = ranges.get_range (end_idx).m_finish;
+  return NULL;
+}
+
+/* As get_source_range_for_substring, but write to *OUT the number
+   of ranges that are available.  */
+
+const char *
+get_num_source_ranges_for_substring (cpp_reader *pfile,
+				     string_concat_db *concats,
+				     location_t strloc,
+				     enum cpp_ttype type,
+				     int *out)
+{
+  gcc_assert (out);
+
+  cpp_substring_ranges ranges;
+  const char *err
+    = get_substring_ranges_for_loc (pfile, concats, strloc, type, ranges);
+
+  if (err)
+    return err;
+
+  *out = ranges.get_num_ranges ();
+  return NULL;
+}
+
 #if CHECKING_P
 
 namespace selftest {
@@ -1541,6 +1814,1259 @@ test_lexer (const line_table_case &case_)
   cpp_destroy (parser);
 }
 
+/* Forward decls.  */
+
+struct lexer_test;
+class lexer_test_options;
+
+/* A class for specifying options of a lexer_test.
+   The "apply" vfunc is called during the lexer_test constructor.  */
+
+class lexer_test_options
+{
+ public:
+  virtual void apply (lexer_test &) = 0;
+};
+
+/* A struct for writing lexer tests.  */
+
+struct lexer_test
+{
+  lexer_test (const line_table_case &case_, const char *content,
+	      lexer_test_options *options);
+  ~lexer_test ();
+
+  const cpp_token *get_token ();
+
+  temp_source_file m_tempfile;
+  temp_line_table m_tmp_lt;
+  cpp_reader *m_parser;
+  string_concat_db m_concats;
+};
+
+/* Use an EBCDIC encoding for the execution charset, specifically
+   IBM1047-encoded (aka "EBCDIC 1047", or "Code page 1047").
+
+   This exercises iconv integration within libcpp.
+   Not every build of iconv supports the given charset,
+   so we need to flag this error and handle it gracefully.  */
+
+class ebcdic_execution_charset : public lexer_test_options
+{
+ public:
+  ebcdic_execution_charset () : m_num_iconv_errors (0)
+    {
+      gcc_assert (s_singleton == NULL);
+      s_singleton = this;
+    }
+  ~ebcdic_execution_charset ()
+    {
+      gcc_assert (s_singleton == this);
+      s_singleton = NULL;
+    }
+
+  void apply (lexer_test &test) FINAL OVERRIDE
+  {
+    cpp_options *cpp_opts = cpp_get_options (test.m_parser);
+    cpp_opts->narrow_charset = "IBM1047";
+
+    cpp_callbacks *callbacks = cpp_get_callbacks (test.m_parser);
+    callbacks->error = on_error;
+  }
+
+  static bool on_error (cpp_reader *pfile ATTRIBUTE_UNUSED,
+			int level ATTRIBUTE_UNUSED,
+			int reason ATTRIBUTE_UNUSED,
+			rich_location *richloc ATTRIBUTE_UNUSED,
+			const char *msgid, va_list *ap ATTRIBUTE_UNUSED)
+    ATTRIBUTE_FPTR_PRINTF(5,0)
+  {
+    gcc_assert (s_singleton);
+    /* Detect and record errors emitted by libcpp/charset.c:init_iconv_desc
+       when the local iconv build doesn't support the conversion.  */
+    if (strstr (msgid, "not supported by iconv"))
+      {
+	s_singleton->m_num_iconv_errors++;
+	return true;
+      }
+
+    /* Otherwise, we have an unexpected error.  */
+    abort ();
+  }
+
+  bool iconv_errors_occurred_p () const { return m_num_iconv_errors > 0; }
+
+ private:
+  static ebcdic_execution_charset *s_singleton;
+  int m_num_iconv_errors;
+};
+
+ebcdic_execution_charset *ebcdic_execution_charset::s_singleton;
+
+/* Constructor.  Override line_table with a new instance based on CASE_,
+   and write CONTENT to a tempfile.  Create a cpp_reader, and use it to
+   start parsing the tempfile.  */
+
+lexer_test::lexer_test (const line_table_case &case_, const char *content,
+			lexer_test_options *options) :
+  /* Create a tempfile and write the text to it.  */
+  m_tempfile (SELFTEST_LOCATION, ".c", content),
+  m_tmp_lt (case_),
+  m_parser (cpp_create_reader (CLK_GNUC99, NULL, line_table)),
+  m_concats ()
+{
+  if (options)
+    options->apply (*this);
+
+  cpp_init_iconv (m_parser);
+
+  /* Parse the file.  */
+  const char *fname = cpp_read_main_file (m_parser,
+					  m_tempfile.get_filename ());
+  ASSERT_NE (fname, NULL);
+}
+
+/* Destructor.  Verify that the next token in m_parser is EOF.  */
+
+lexer_test::~lexer_test ()
+{
+  location_t loc;
+  const cpp_token *tok;
+
+  tok = cpp_get_token_with_location (m_parser, &loc);
+  ASSERT_NE (tok, NULL);
+  ASSERT_EQ (tok->type, CPP_EOF);
+
+  cpp_finish (m_parser, NULL);
+  cpp_destroy (m_parser);
+}
+
+/* Get the next token from m_parser.  */
+
+const cpp_token *
+lexer_test::get_token ()
+{
+  location_t loc;
+  const cpp_token *tok;
+
+  tok = cpp_get_token_with_location (m_parser, &loc);
+  ASSERT_NE (tok, NULL);
+  return tok;
+}
+
+/* Verify that locations within string literals are correctly handled.  */
+
+/* Verify get_source_range_for_substring for token(s) at STRLOC,
+   using the string concatenation database for TEST.
+
+   Assert that the character at index IDX is on EXPECTED_LINE,
+   and that it begins at column EXPECTED_START_COL and ends at
+   EXPECTED_FINISH_COL (unless the locations are beyond
+   LINE_MAP_MAX_LOCATION_WITH_COLS, in which case don't check their
+   columns).  */
+
+static void
+assert_char_at_range (const location &loc,
+		      lexer_test& test,
+		      location_t strloc, enum cpp_ttype type, int idx,
+		      int expected_line, int expected_start_col,
+		      int expected_finish_col)
+{
+  cpp_reader *pfile = test.m_parser;
+  string_concat_db *concats = &test.m_concats;
+
+  source_range actual_range;
+  const char *err
+    = get_source_range_for_substring (pfile, concats, strloc, type,
+				      idx, idx, &actual_range);
+  if (should_have_column_data_p (strloc))
+    ASSERT_EQ_AT (loc, NULL, err);
+  else
+    {
+      ASSERT_STREQ_AT (loc,
+		       "range starts after LINE_MAP_MAX_LOCATION_WITH_COLS",
+		       err);
+      return;
+    }
+
+  int actual_start_line = LOCATION_LINE (actual_range.m_start);
+  ASSERT_EQ_AT (loc, expected_line, actual_start_line);
+  int actual_finish_line = LOCATION_LINE (actual_range.m_finish);
+  ASSERT_EQ_AT (loc, expected_line, actual_finish_line);
+
+  if (should_have_column_data_p (actual_range.m_start))
+    {
+      int actual_start_col = LOCATION_COLUMN (actual_range.m_start);
+      ASSERT_EQ_AT (loc, expected_start_col, actual_start_col);
+    }
+  if (should_have_column_data_p (actual_range.m_finish))
+    {
+      int actual_finish_col = LOCATION_COLUMN (actual_range.m_finish);
+      ASSERT_EQ_AT (loc, expected_finish_col, actual_finish_col);
+    }
+}
+
+/* Macro for calling assert_char_at_range, supplying SELFTEST_LOCATION for
+   the effective location of any errors.  */
+
+#define ASSERT_CHAR_AT_RANGE(LEXER_TEST, STRLOC, TYPE, IDX, EXPECTED_LINE, \
+			     EXPECTED_START_COL, EXPECTED_FINISH_COL)	\
+  assert_char_at_range (SELFTEST_LOCATION, (LEXER_TEST), (STRLOC), (TYPE), \
+			(IDX), (EXPECTED_LINE), (EXPECTED_START_COL), \
+			(EXPECTED_FINISH_COL))
+
+/* Verify get_num_source_ranges_for_substring for token(s) at STRLOC,
+   using the string concatenation database for TEST.
+
+   Assert that the token(s) at STRLOC contain EXPECTED_NUM_RANGES.  */
+
+static void
+assert_num_substring_ranges (const location &loc,
+			     lexer_test& test,
+			     location_t strloc,
+			     enum cpp_ttype type,
+			     int expected_num_ranges)
+{
+  cpp_reader *pfile = test.m_parser;
+  string_concat_db *concats = &test.m_concats;
+
+  int actual_num_ranges;
+  const char *err
+    = get_num_source_ranges_for_substring (pfile, concats, strloc, type,
+					   &actual_num_ranges);
+  if (should_have_column_data_p (strloc))
+    ASSERT_EQ_AT (loc, NULL, err);
+  else
+    {
+      ASSERT_STREQ_AT (loc,
+		       "range starts after LINE_MAP_MAX_LOCATION_WITH_COLS",
+		       err);
+      return;
+    }
+  ASSERT_EQ_AT (loc, expected_num_ranges, actual_num_ranges);
+}
+
+/* Macro for calling assert_num_substring_ranges, supplying
+   SELFTEST_LOCATION for the effective location of any errors.  */
+
+#define ASSERT_NUM_SUBSTRING_RANGES(LEXER_TEST, STRLOC, TYPE, \
+				    EXPECTED_NUM_RANGES)		\
+  assert_num_substring_ranges (SELFTEST_LOCATION, (LEXER_TEST), (STRLOC), \
+			       (TYPE), (EXPECTED_NUM_RANGES))
+
+
+/* Verify that get_substring_ranges_for_loc for token(s) at STRLOC
+   returns EXPECTED_ERR (using the string concatenation database for TEST).  */
+
+static void
+assert_has_no_substring_ranges (const location &loc,
+				lexer_test& test,
+				location_t strloc,
+				enum cpp_ttype type,
+				const char *expected_err)
+{
+  cpp_reader *pfile = test.m_parser;
+  string_concat_db *concats = &test.m_concats;
+  cpp_substring_ranges ranges;
+  const char *actual_err
+    = get_substring_ranges_for_loc (pfile, concats, strloc,
+				    type, ranges);
+  if (should_have_column_data_p (strloc))
+    ASSERT_STREQ_AT (loc, expected_err, actual_err);
+  else
+    ASSERT_STREQ_AT (loc,
+		     "range starts after LINE_MAP_MAX_LOCATION_WITH_COLS",
+		     actual_err);
+}
+
+#define ASSERT_HAS_NO_SUBSTRING_RANGES(LEXER_TEST, STRLOC, TYPE, ERR)    \
+    assert_has_no_substring_ranges (SELFTEST_LOCATION, (LEXER_TEST), \
+				    (STRLOC), (TYPE), (ERR))
+
+/* Lex a simple string literal.  Verify the substring location data, before
+   and after running cpp_interpret_string on it.  */
+
+static void
+test_lexer_string_locations_simple (const line_table_case &case_)
+{
+  /* Digits 0-9 (with 0 at column 10), the simple way.
+     ....................000000000.11111111112.2222222223333333333
+     ....................123456789.01234567890.1234567890123456789
+     We add a trailing comment to ensure that we correctly locate
+     the end of the string literal token.  */
+  const char *content = "        \"0123456789\" /* not a string */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"0123456789\"");
+  ASSERT_TOKEN_LOC_EQ (tok, test.m_tempfile.get_filename (), 1, 9, 20);
+
+  /* At this point in lexing, the quote characters are treated as part of
+     the string (they are stripped off by cpp_interpret_string).  */
+
+  ASSERT_EQ (tok->val.str.len, 12);
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("0123456789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.  */
+  for (int i = 0; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1,
+			  10 + i, 10 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, type, 10);
+}
+
+/* As test_lexer_string_locations_simple, but use an EBCDIC execution
+   encoding.  */
+
+static void
+test_lexer_string_locations_ebcdic (const line_table_case &case_)
+{
+  /* EBCDIC support requires iconv.  */
+  if (!HAVE_ICONV)
+    return;
+
+  /* Digits 0-9 (with 0 at column 10), the simple way.
+     ....................000000000.11111111112.2222222223333333333
+     ....................123456789.01234567890.1234567890123456789
+     We add a trailing comment to ensure that we correctly locate
+     the end of the string literal token.  */
+  const char *content = "        \"0123456789\" /* not a string */\n";
+  ebcdic_execution_charset use_ebcdic;
+  lexer_test test (case_, content, &use_ebcdic);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"0123456789\"");
+  ASSERT_TOKEN_LOC_EQ (tok, test.m_tempfile.get_filename (), 1, 9, 20);
+
+  /* At this point in lexing, the quote characters are treated as part of
+     the string (they are stripped off by cpp_interpret_string).  */
+
+  ASSERT_EQ (tok->val.str.len, 12);
+
+  /* The remainder of the test requires an iconv implementation that
+     can convert from UTF-8 to the EBCDIC encoding requested above.  */
+  if (use_ebcdic.iconv_errors_occurred_p ())
+    return;
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  /* We should now have EBCDIC-encoded text, specifically
+     IBM1047-encoded (aka "EBCDIC 1047", or "Code page 1047").
+     The digits 0-9 are encoded as 240-249 i.e. 0xf0-0xf9.  */
+  ASSERT_STREQ ("\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9",
+		(const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify that we don't attempt to record substring location information
+     for such cases.  */
+  ASSERT_HAS_NO_SUBSTRING_RANGES
+    (test, tok->src_loc, type,
+     "execution character set != source character set");
+}
+
+/* Lex a string literal containing a hex-escaped character.
+   Verify the substring location data, before and after running
+   cpp_interpret_string on it.  */
+
+static void
+test_lexer_string_locations_hex (const line_table_case &case_)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as "\x35"
+     and with a space in place of digit 6, to terminate the escaped
+     hex code.
+     ....................000000000.111111.11112222.
+     ....................123456789.012345.67890123.  */
+  const char *content = "        \"01234\\x35 789\"\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"01234\\x35 789\"");
+  ASSERT_TOKEN_LOC_EQ (tok, test.m_tempfile.get_filename (), 1, 9, 23);
+
+  /* At this point in lexing, the quote characters are treated as part of
+     the string (they are stripped off by cpp_interpret_string).  */
+  ASSERT_EQ (tok->val.str.len, 15);
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("012345 789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.  */
+  for (int i = 0; i <= 4; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 10 + i, 10 + i);
+  ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, 5, 1, 15, 18);
+  for (int i = 6; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 13 + i, 13 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, type, 10);
+}
+
+/* Lex a string literal containing an octal-escaped character.
+   Verify the substring location data after running cpp_interpret_string
+   on it.  */
+
+static void
+test_lexer_string_locations_oct (const line_table_case &case_)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as "\065"
+     and with a space in place of digit 6, to terminate the escaped
+     octal code.
+     ....................000000000.111111.11112222.2222223333333333444
+     ....................123456789.012345.67890123.4567890123456789012  */
+  const char *content = "        \"01234\\065 789\" /* not a string */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"01234\\065 789\"");
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("012345 789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.  */
+  for (int i = 0; i < 5; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 10 + i, 10 + i);
+  ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, 5, 1, 15, 18);
+  for (int i = 6; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 13 + i, 13 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, type, 10);
+}
+
+/* Test of string literal containing letter escapes.  */
+
+static void
+test_lexer_string_locations_letter_escape_1 (const line_table_case &case_)
+{
+  /* The string "\tfoo\\\nbar" i.e. tab, "foo", backslash, newline, bar.
+     .....................000000000.1.11111.1.1.11222.22222223333333
+     .....................123456789.0.12345.6.7.89012.34567890123456.  */
+  const char *content = ("        \"\\tfoo\\\\\\nbar\" /* non-str */\n");
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected tokens back.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"\\tfoo\\\\\\nbar\"");
+
+  /* Verify ranges of individual characters.  */
+  /* "\t".  */
+  ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			0, 1, 10, 11);
+  /* "foo".  */
+  for (int i = 1; i <= 3; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			  i, 1, 11 + i, 11 + i);
+  /* "\\" and "\n".  */
+  ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			4, 1, 15, 16);
+  ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			5, 1, 17, 18);
+
+  /* "bar".  */
+  for (int i = 6; i <= 8; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			  i, 1, 13 + i, 13 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, CPP_STRING, 9);
+}
+
+/* Another test of a string literal containing a letter escape.
+   Based on string seen in
+     printf ("%-%\n");
+   in gcc.dg/format/c90-printf-1.c.  */
+
+static void
+test_lexer_string_locations_letter_escape_2 (const line_table_case &case_)
+{
+  /* .....................000000000.1111.11.1111.22222222223.
+     .....................123456789.0123.45.6789.01234567890.  */
+  const char *content = ("        \"%-%\\n\" /* non-str */\n");
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected tokens back.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"%-%\\n\"");
+
+  /* Verify ranges of individual characters.  */
+  /* "%-%".  */
+  for (int i = 0; i < 3; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			  i, 1, 10 + i, 10 + i);
+  /* "\n".  */
+  ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			3, 1, 13, 14);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, CPP_STRING, 4);
+}
+
+/* Lex a string literal containing UCN 4 characters.
+   Verify the substring location data after running cpp_interpret_string
+   on it.  */
+
+static void
+test_lexer_string_locations_ucn4 (const line_table_case &case_)
+{
+  /* Digits 0-9, expressing digits 5 and 6 as Roman numerals expressed
+     as UCN 4.
+     ....................000000000.111111.111122.222222223.33333333344444
+     ....................123456789.012345.678901.234567890.12345678901234  */
+  const char *content = "        \"01234\\u2174\\u2175789\" /* non-str */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"01234\\u2174\\u2175789\"");
+
+  /* Verify that cpp_interpret_string works.
+     The string should be encoded in the execution character
+     set.  Assuming that that is UTF-8, we should have the following:
+     -----------  ----  -----  -------  ----------------
+     Byte offset  Byte  Octal  Unicode  Source Column(s)
+     -----------  ----  -----  -------  ----------------
+     0            0x30         '0'      10
+     1            0x31         '1'      11
+     2            0x32         '2'      12
+     3            0x33         '3'      13
+     4            0x34         '4'      14
+     5            0xE2  \342   U+2174   15-20
+     6            0x85  \205    (cont)  15-20
+     7            0xB4  \264    (cont)  15-20
+     8            0xE2  \342   U+2175   21-26
+     9            0x85  \205    (cont)  21-26
+     10           0xB5  \265    (cont)  21-26
+     11           0x37         '7'      27
+     12           0x38         '8'      28
+     13           0x39         '9'      29
+     -----------  ----  -----  -------  ---------------.  */
+
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("01234\342\205\264\342\205\265789",
+		(const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.
+     '01234'.  */
+  for (int i = 0; i <= 4; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 10 + i, 10 + i);
+  /* U+2174.  */
+  for (int i = 5; i <= 7; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 15, 20);
+  /* U+2175.  */
+  for (int i = 8; i <= 10; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 21, 26);
+  /* '789'.  */
+  for (int i = 11; i <= 13; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 16 + i, 16 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, type, 14);
+}
+
+/* Lex a string literal containing UCN 8 characters.
+   Verify the substring location data after running cpp_interpret_string
+   on it.  */
+
+static void
+test_lexer_string_locations_ucn8 (const line_table_case &case_)
+{
+  /* Digits 0-9, expressing digits 5 and 6 as Roman numerals as UCN 8.
+     ....................000000000.111111.1111222222.2222333333333.344444
+     ....................123456789.012345.6789012345.6789012345678.901234  */
+  const char *content = "        \"01234\\U00002174\\U00002175789\" /* */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok,
+			   "\"01234\\U00002174\\U00002175789\"");
+
+  /* Verify that cpp_interpret_string works.
+     The UTF-8 encoding of the string is identical to that from
+     the ucn4 testcase above; the only difference is the column
+     locations.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("01234\342\205\264\342\205\265789",
+		(const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.
+     '01234'.  */
+  for (int i = 0; i <= 4; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 10 + i, 10 + i);
+  /* U+2174.  */
+  for (int i = 5; i <= 7; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 15, 24);
+  /* U+2175.  */
+  for (int i = 8; i <= 10; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 25, 34);
+  /* '789' at columns 35-37.  */
+  for (int i = 11; i <= 13; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 24 + i, 24 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, type, 14);
+}
+
+/* Fetch a big-endian 32-bit value and convert to host endianness.  */
+
+static uint32_t
+uint32_from_big_endian (const uint32_t *ptr_be_value)
+{
+  const unsigned char *buf = (const unsigned char *)ptr_be_value;
+  return (((uint32_t) buf[0] << 24)
+	  | ((uint32_t) buf[1] << 16)
+	  | ((uint32_t) buf[2] << 8)
+	  | (uint32_t) buf[3]);
+}
+
+/* Lex a wide string literal and verify that attempts to read substring
+   location data from it fail gracefully.  */
+
+static void
+test_lexer_string_locations_wide_string (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     ....................000000000.11111111112.22222222233333
+     ....................123456789.01234567890.12345678901234  */
+  const char *content = "       L\"0123456789\" /* non-str */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_WSTRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "L\"0123456789\"");
+
+  /* Verify that cpp_interpret_string works, using CPP_WSTRING.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_WSTRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  /* The cpp_reader defaults to big-endian with
+     CHAR_BIT * sizeof (int) for the wchar_precision, so dst_string should
+     now be encoded as UTF-32BE.  */
+  const uint32_t *be32_chars = (const uint32_t *)dst_string.text;
+  ASSERT_EQ ('0', uint32_from_big_endian (&be32_chars[0]));
+  ASSERT_EQ ('5', uint32_from_big_endian (&be32_chars[5]));
+  ASSERT_EQ ('9', uint32_from_big_endian (&be32_chars[9]));
+  ASSERT_EQ (0, uint32_from_big_endian (&be32_chars[10]));
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* We don't yet support generating substring location information
+     for L"" strings.  */
+  ASSERT_HAS_NO_SUBSTRING_RANGES
+    (test, tok->src_loc, type,
+     "execution character set != source character set");
+}
+
+/* Fetch a big-endian 16-bit value and convert to host endianness.  */
+
+static uint16_t
+uint16_from_big_endian (const uint16_t *ptr_be_value)
+{
+  const unsigned char *buf = (const unsigned char *)ptr_be_value;
+  return ((uint16_t) buf[0] << 8) | (uint16_t) buf[1];
+}
+
+/* Lex a u"" string literal and verify that attempts to read substring
+   location data from it fail gracefully.  */
+
+static void
+test_lexer_string_locations_string16 (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     ....................000000000.11111111112.22222222233333
+     ....................123456789.01234567890.12345678901234  */
+  const char *content = "       u\"0123456789\" /* non-str */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING16);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "u\"0123456789\"");
+
+  /* Verify that cpp_interpret_string works, using CPP_STRING16.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING16;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+
+  /* The cpp_reader defaults to big-endian, so dst_string should
+     now be encoded as UTF-16BE.  */
+  const uint16_t *be16_chars = (const uint16_t *)dst_string.text;
+  ASSERT_EQ ('0', uint16_from_big_endian (&be16_chars[0]));
+  ASSERT_EQ ('5', uint16_from_big_endian (&be16_chars[5]));
+  ASSERT_EQ ('9', uint16_from_big_endian (&be16_chars[9]));
+  ASSERT_EQ (0, uint16_from_big_endian (&be16_chars[10]));
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* We don't yet support generating substring location information
+     for u"" strings.  */
+  ASSERT_HAS_NO_SUBSTRING_RANGES
+    (test, tok->src_loc, type,
+     "execution character set != source character set");
+}
+
+/* Lex a U"" string literal and verify that attempts to read substring
+   location data from it fail gracefully.  */
+
+static void
+test_lexer_string_locations_string32 (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     ....................000000000.11111111112.22222222233333
+     ....................123456789.01234567890.12345678901234  */
+  const char *content = "       U\"0123456789\" /* non-str */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING32);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "U\"0123456789\"");
+
+  /* Verify that cpp_interpret_string works, using CPP_STRING32.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING32;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+
+  /* The cpp_reader defaults to big-endian, so dst_string should
+     now be encoded as UTF-32BE.  */
+  const uint32_t *be32_chars = (const uint32_t *)dst_string.text;
+  ASSERT_EQ ('0', uint32_from_big_endian (&be32_chars[0]));
+  ASSERT_EQ ('5', uint32_from_big_endian (&be32_chars[5]));
+  ASSERT_EQ ('9', uint32_from_big_endian (&be32_chars[9]));
+  ASSERT_EQ (0, uint32_from_big_endian (&be32_chars[10]));
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* We don't yet support generating substring location information
+     for U"" strings.  */
+  ASSERT_HAS_NO_SUBSTRING_RANGES
+    (test, tok->src_loc, type,
+     "execution character set != source character set");
+}
+
+/* Lex a u8-string literal.
+   Verify the substring location data after running cpp_interpret_string
+   on it.  */
+
+static void
+test_lexer_string_locations_u8 (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     ....................000000000.11111111112.22222222233333
+     ....................123456789.01234567890.12345678901234  */
+  const char *content = "      u8\"0123456789\" /* non-str */\n";
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_UTF8STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "u8\"0123456789\"");
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("0123456789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.  */
+  for (int i = 0; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 10 + i, 10 + i);
+}
+
+/* Lex a string literal containing UTF-8 source characters.
+   Verify the substring location data after running cpp_interpret_string
+   on it.  */
+
+static void
+test_lexer_string_locations_utf8_source (const line_table_case &case_)
+{
+ /* This string literal is written out to the source file as UTF-8,
+    and is of the form "before mojibake after", where "mojibake"
+    is written as the following four unicode code points:
+       U+6587 CJK UNIFIED IDEOGRAPH-6587
+       U+5B57 CJK UNIFIED IDEOGRAPH-5B57
+       U+5316 CJK UNIFIED IDEOGRAPH-5316
+       U+3051 HIRAGANA LETTER KE.
+     Each of these is 3 bytes wide when encoded in UTF-8, whereas the
+     "before" and "after" are 1 byte per unicode character.
+
+     The numbers shown are "columns", which are *byte* numbers within
+     the line, rather than unicode character numbers.
+
+     .................... 000000000.1111111.
+     .................... 123456789.0123456.  */
+  const char *content = ("        \"before "
+			 /* U+6587 CJK UNIFIED IDEOGRAPH-6587
+			      UTF-8: 0xE6 0x96 0x87
+			      C octal escaped UTF-8: \346\226\207
+			    "column" numbers: 17-19.  */
+			 "\346\226\207"
+
+			 /* U+5B57 CJK UNIFIED IDEOGRAPH-5B57
+			      UTF-8: 0xE5 0xAD 0x97
+			      C octal escaped UTF-8: \345\255\227
+			    "column" numbers: 20-22.  */
+			 "\345\255\227"
+
+			 /* U+5316 CJK UNIFIED IDEOGRAPH-5316
+			      UTF-8: 0xE5 0x8C 0x96
+			      C octal escaped UTF-8: \345\214\226
+			    "column" numbers: 23-25.  */
+			 "\345\214\226"
+
+			 /* U+3051 HIRAGANA LETTER KE
+			      UTF-8: 0xE3 0x81 0x91
+			      C octal escaped UTF-8: \343\201\221
+			    "column" numbers: 26-28.  */
+			 "\343\201\221"
+
+			 /* column numbers 29 onwards
+			  2333333.33334444444444
+			  9012345.67890123456789. */
+			 " after\" /* non-str */\n");
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back, with the correct
+     location information.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ
+    (test.m_parser, tok,
+     "\"before \346\226\207\345\255\227\345\214\226\343\201\221 after\"");
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser, &tok->val.str, 1,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ
+    ("before \346\226\207\345\255\227\345\214\226\343\201\221 after",
+     (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Verify ranges of individual characters.  This no longer includes the
+     quotes.
+     Assuming that both source and execution encodings are UTF-8, we have
+     a run of 25 octets in each.  */
+  for (int i = 0; i < 25; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, type, i, 1, 10 + i, 10 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, type, 25);
+}
+
+/* Test of string literal concatenation.  */
+
+static void
+test_lexer_string_locations_concatenation_1 (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     .....................000000000.111111.11112222222222
+     .....................123456789.012345.67890123456789.  */
+  const char *content = ("        \"01234\" /* non-str */\n"
+			 "        \"56789\" /* non-str */\n");
+  lexer_test test (case_, content, NULL);
+
+  location_t input_locs[2];
+
+  /* Verify that we get the expected tokens back.  */
+  auto_vec <cpp_string> input_strings;
+  const cpp_token *tok_a = test.get_token ();
+  ASSERT_EQ (tok_a->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok_a, "\"01234\"");
+  input_strings.safe_push (tok_a->val.str);
+  input_locs[0] = tok_a->src_loc;
+
+  const cpp_token *tok_b = test.get_token ();
+  ASSERT_EQ (tok_b->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok_b, "\"56789\"");
+  input_strings.safe_push (tok_b->val.str);
+  input_locs[1] = tok_b->src_loc;
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser,
+				      input_strings.address (), 2,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("0123456789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Simulate c-lex.c's lex_string in order to record concatenation.  */
+  test.m_concats.record_string_concatenation (2, input_locs);
+
+  location_t initial_loc = input_locs[0];
+
+  for (int i = 0; i <= 4; i++)
+    ASSERT_CHAR_AT_RANGE (test, initial_loc, type, i, 1, 10 + i, 10 + i);
+  for (int i = 5; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, initial_loc, type, i, 2, 5 + i, 5 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, initial_loc, type, 10);
+}
+
+/* Another test of string literal concatenation.  */
+
+static void
+test_lexer_string_locations_concatenation_2 (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     .....................000000000.111.11111112222222
+     .....................123456789.012.34567890123456.  */
+  const char *content = ("        \"01\" /* non-str */\n"
+			 "        \"23\" /* non-str */\n"
+			 "        \"45\" /* non-str */\n"
+			 "        \"67\" /* non-str */\n"
+			 "        \"89\" /* non-str */\n");
+  lexer_test test (case_, content, NULL);
+
+  auto_vec <cpp_string> input_strings;
+  location_t input_locs[5];
+
+  /* Verify that we get the expected tokens back.  */
+  for (int i = 0; i < 5; i++)
+    {
+      const cpp_token *tok = test.get_token ();
+      ASSERT_EQ (tok->type, CPP_STRING);
+      input_strings.safe_push (tok->val.str);
+      input_locs[i] = tok->src_loc;
+    }
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser,
+				      input_strings.address (), 5,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("0123456789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Simulate c-lex.c's lex_string in order to record concatenation.  */
+  test.m_concats.record_string_concatenation (5, input_locs);
+
+  location_t initial_loc = input_locs[0];
+
+  /* Within ASSERT_CHAR_AT_RANGE (actually assert_char_at_range), we can
+     detect if the initial loc is after LINE_MAP_MAX_LOCATION_WITH_COLS
+     and expect get_source_range_for_substring to fail.
+     However, for a string concatenation test, we can have a case
+     where the initial string is fully before LINE_MAP_MAX_LOCATION_WITH_COLS,
+     but subsequent strings can be after it.
+     Attempting to detect this within assert_char_at_range
+     would overcomplicate the logic for the common test cases, so
+     we detect it here.  */
+  if (should_have_column_data_p (input_locs[0])
+      && !should_have_column_data_p (input_locs[4]))
+    {
+      /* Verify that get_source_range_for_substring gracefully rejects
+	 this case.  */
+      source_range actual_range;
+      const char *err
+	= get_source_range_for_substring (test.m_parser, &test.m_concats,
+					  initial_loc, type, 0, 0,
+					  &actual_range);
+      ASSERT_STREQ ("range starts after LINE_MAP_MAX_LOCATION_WITH_COLS", err);
+      return;
+    }
+
+  for (int i = 0; i < 5; i++)
+    for (int j = 0; j < 2; j++)
+      ASSERT_CHAR_AT_RANGE (test, initial_loc, type, (i * 2) + j,
+			    i + 1, 10 + j, 10 + j);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, initial_loc, type, 10);
+}
+
+/* Another test of string literal concatenation, this time combined with
+   various kinds of escaped characters.  */
+
+static void
+test_lexer_string_locations_concatenation_3 (const line_table_case &case_)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as hex "\x35"
+     digit 6 in ASCII as octal "\066", concatenating multiple strings.  */
+  const char *content
+    /* .000000000.111111.111.1.2222.222.2.2233.333.3333.34444444444555
+       .123456789.012345.678.9.0123.456.7.8901.234.5678.90123456789012. */
+    = ("        \"01234\"  \"\\x35\"  \"\\066\"  \"789\" /* non-str */\n");
+  lexer_test test (case_, content, NULL);
+
+  auto_vec <cpp_string> input_strings;
+  location_t input_locs[4];
+
+  /* Verify that we get the expected tokens back.  */
+  for (int i = 0; i < 4; i++)
+    {
+      const cpp_token *tok = test.get_token ();
+      ASSERT_EQ (tok->type, CPP_STRING);
+      input_strings.safe_push (tok->val.str);
+      input_locs[i] = tok->src_loc;
+    }
+
+  /* Verify that cpp_interpret_string works.  */
+  cpp_string dst_string;
+  const enum cpp_ttype type = CPP_STRING;
+  bool result = cpp_interpret_string (test.m_parser,
+				      input_strings.address (), 4,
+				      &dst_string, type);
+  ASSERT_TRUE (result);
+  ASSERT_STREQ ("0123456789", (const char *)dst_string.text);
+  free (const_cast <unsigned char *> (dst_string.text));
+
+  /* Simulate c-lex.c's lex_string in order to record concatenation.  */
+  test.m_concats.record_string_concatenation (4, input_locs);
+
+  location_t initial_loc = input_locs[0];
+
+  for (int i = 0; i <= 4; i++)
+    ASSERT_CHAR_AT_RANGE (test, initial_loc, type, i, 1, 10 + i, 10 + i);
+  ASSERT_CHAR_AT_RANGE (test, initial_loc, type, 5, 1, 19, 22);
+  ASSERT_CHAR_AT_RANGE (test, initial_loc, type, 6, 1, 27, 30);
+  for (int i = 7; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, initial_loc, type, i, 1, 28 + i, 28 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, initial_loc, type, 10);
+}
+
+/* Test of string literal in a macro.  */
+
+static void
+test_lexer_string_locations_macro (const line_table_case &case_)
+{
+  /* Digits 0-9.
+     .....................0000000001111111111.22222222223.
+     .....................1234567890123456789.01234567890.  */
+  const char *content = ("#define MACRO     \"0123456789\" /* non-str */\n"
+			 "  MACRO");
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected tokens back.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_PADDING);
+
+  tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"0123456789\"");
+
+  /* Verify ranges of individual characters.  We ought to
+     see columns within the macro definition.  */
+  for (int i = 0; i <= 9; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			  i, 1, 20 + i, 20 + i);
+
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, CPP_STRING, 10);
+
+  tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_PADDING);
+}
+
+/* Test of stringification of a macro argument.  */
+
+static void
+test_lexer_string_locations_stringified_macro_argument
+  (const line_table_case &case_)
+{
+  /* .....................000000000111111111122222222223.
+     .....................123456789012345678901234567890.  */
+  const char *content = ("#define MACRO(X) #X /* non-str */\n"
+			 "MACRO(foo)\n");
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_PADDING);
+
+  tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "\"foo\"");
+
+  /* We don't support getting the location of a stringified macro
+     argument.  Verify that it fails gracefully.  */
+  ASSERT_HAS_NO_SUBSTRING_RANGES (test, tok->src_loc, CPP_STRING,
+				  "cpp_interpret_string_1 failed");
+
+  tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_PADDING);
+
+  tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_PADDING);
+}
+
+/* Ensure that we fail gracefully if something attempts to pass
+   in a location that isn't a string literal token.  Seen on this code:
+
+     const char a[] = " %d ";
+     __builtin_printf (a, 0.5);
+                       ^
+
+   when c-format.c erroneously used the indicated one-character
+   location as the format string location, leading to a read past the
+   end of a string buffer in cpp_interpret_string_1.  */
+
+static void
+test_lexer_string_locations_non_string (const line_table_case &case_)
+{
+  /* .....................000000000111111111122222222223.
+     .....................123456789012345678901234567890.  */
+  const char *content = ("         a\n");
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_NAME);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "a");
+
+  /* At this point, libcpp is attempting to interpret the name as a
+     string literal, despite it not starting with a quote.  We don't detect
+     that, but we should at least fail gracefully.  */
+  ASSERT_HAS_NO_SUBSTRING_RANGES (test, tok->src_loc, CPP_STRING,
+				  "cpp_interpret_string_1 failed");
+}
+
+/* Ensure that we can read substring information for a token which
+   starts in one linemap and ends in another.  Adapted from
+   gcc.dg/cpp/pr69985.c.  */
+
+static void
+test_lexer_string_locations_long_line (const line_table_case &case_)
+{
+  /* .....................000000.000111111111
+     .....................123456.789012345678.  */
+  const char *content = ("/* A very long line, so that we start a new line map.  */\n"
+			 "     \"0123456789012345678901234567890123456789"
+			 "0123456789012345678901234567890123456789"
+			 "0123456789012345678901234567890123456789"
+			 "0123456789\"\n");
+
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected token back.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_STRING);
+
+  if (!should_have_column_data_p (line_table->highest_location))
+    return;
+
+  /* Verify ranges of individual characters.  */
+  ASSERT_NUM_SUBSTRING_RANGES (test, tok->src_loc, CPP_STRING, 130);
+  for (int i = 0; i < 130; i++)
+    ASSERT_CHAR_AT_RANGE (test, tok->src_loc, CPP_STRING,
+			  i, 2, 7 + i, 7 + i);
+}
+
+/* Test of lexing char constants.  */
+
+static void
+test_lexer_char_constants (const line_table_case &case_)
+{
+  /* Various char constants.
+     .....................0000000001111111111.22222222223.
+     .....................1234567890123456789.01234567890.  */
+  const char *content = ("         'a'\n"
+			 "        u'a'\n"
+			 "        U'a'\n"
+			 "        L'a'\n"
+			 "         'abc'\n");
+  lexer_test test (case_, content, NULL);
+
+  /* Verify that we get the expected tokens back.  */
+  /* 'a'.  */
+  const cpp_token *tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_CHAR);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "'a'");
+
+  unsigned int chars_seen;
+  int unsignedp;
+  cppchar_t cc = cpp_interpret_charconst (test.m_parser, tok,
+					  &chars_seen, &unsignedp);
+  ASSERT_EQ (cc, 'a');
+  ASSERT_EQ (chars_seen, 1);
+
+  /* u'a'.  */
+  tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_CHAR16);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "u'a'");
+
+  /* U'a'.  */
+  tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_CHAR32);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "U'a'");
+
+  /* L'a'.  */
+  tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_WCHAR);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "L'a'");
+
+  /* 'abc' (c-char-sequence).  */
+  tok = test.get_token ();
+  ASSERT_EQ (tok->type, CPP_CHAR);
+  ASSERT_TOKEN_AS_TEXT_EQ (test.m_parser, tok, "'abc'");
+}
 /* A table of interesting location_t values, giving one axis of our test
    matrix.  */
 
@@ -1599,6 +3125,27 @@ input_c_tests ()
 	  /* Run all tests for the given case within the test matrix.  */
 	  test_accessing_ordinary_linemaps (c);
 	  test_lexer (c);
+	  test_lexer_string_locations_simple (c);
+	  test_lexer_string_locations_ebcdic (c);
+	  test_lexer_string_locations_hex (c);
+	  test_lexer_string_locations_oct (c);
+	  test_lexer_string_locations_letter_escape_1 (c);
+	  test_lexer_string_locations_letter_escape_2 (c);
+	  test_lexer_string_locations_ucn4 (c);
+	  test_lexer_string_locations_ucn8 (c);
+	  test_lexer_string_locations_wide_string (c);
+	  test_lexer_string_locations_string16 (c);
+	  test_lexer_string_locations_string32 (c);
+	  test_lexer_string_locations_u8 (c);
+	  test_lexer_string_locations_utf8_source (c);
+	  test_lexer_string_locations_concatenation_1 (c);
+	  test_lexer_string_locations_concatenation_2 (c);
+	  test_lexer_string_locations_concatenation_3 (c);
+	  test_lexer_string_locations_macro (c);
+	  test_lexer_string_locations_stringified_macro_argument (c);
+	  test_lexer_string_locations_non_string (c);
+	  test_lexer_string_locations_long_line (c);
+	  test_lexer_char_constants (c);
 
 	  num_cases_tested++;
 	}
diff --git a/gcc/input.h b/gcc/input.h
index d51f950..c17e440 100644
--- a/gcc/input.h
+++ b/gcc/input.h
@@ -95,4 +95,39 @@ void dump_location_info (FILE *stream);
 
 void diagnostics_file_cache_fini (void);
 
+struct GTY(()) string_concat
+{
+  string_concat (int num, location_t *locs);
+
+  int m_num;
+  location_t * GTY ((atomic)) m_locs;
+};
+
+struct location_hash : int_hash <location_t, UNKNOWN_LOCATION> { };
+
+class GTY(()) string_concat_db
+{
+ public:
+  string_concat_db ();
+  void record_string_concatenation (int num, location_t *locs);
+
+  bool get_string_concatenation (location_t loc,
+				 int *out_num,
+				 location_t **out_locs);
+
+ private:
+  static location_t get_key_loc (location_t loc);
+
+  /* For the fields to be private, we must grant access to the
+     generated code in gtype-desc.c.  */
+
+  friend void ::gt_ggc_mx_string_concat_db (void *x_p);
+  friend void ::gt_pch_nx_string_concat_db (void *x_p);
+  friend void ::gt_pch_p_16string_concat_db (void *this_obj, void *x_p,
+					     gt_pointer_operator op,
+					     void *cookie);
+
+  hash_map <location_hash, string_concat *> *m_table;
+};
+
 #endif
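To illustrate the idea behind string_concat_db, here is a minimal
standalone sketch. It is not the implementation in this patch: loc_t and
concat_db_sketch are hypothetical names, std::unordered_map stands in for
GCC's hash_map, and keying on the first token's location is an assumption
based on how the selftests pass input_locs[0] as the initial location.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for GCC's location_t.
typedef std::uint32_t loc_t;

// Simplified model of a concatenation database: it maps the location of
// the first token of a concatenation to the locations of all its tokens.
class concat_db_sketch
{
 public:
  // Record that the NUM string tokens at LOCS were concatenated.
  void record (int num, const loc_t *locs)
  {
    assert (locs != nullptr);
    if (num <= 1)
      return;  // Sketch choice: a lone literal needs no entry.
    m_table[locs[0]] = std::vector<loc_t> (locs, locs + num);
  }

  // Return the token locations of the concatenation starting at LOC,
  // or nullptr if nothing was recorded for LOC.
  const std::vector<loc_t> *get (loc_t loc) const
  {
    std::unordered_map<loc_t, std::vector<loc_t> >::const_iterator it
      = m_table.find (loc);
    return it == m_table.end () ? nullptr : &it->second;
  }

 private:
  std::unordered_map<loc_t, std::vector<loc_t> > m_table;
};
```

With such a table, a lookup for a substring only needs the location of
the first literal; the remaining token locations are recovered on demand
rather than being stored on every STRING_CST.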
diff --git a/gcc/substring-locations.h b/gcc/substring-locations.h
new file mode 100644
index 0000000..274ebbe
--- /dev/null
+++ b/gcc/substring-locations.h
@@ -0,0 +1,30 @@
+/* Source locations within string literals.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#ifndef GCC_SUBSTRING_LOCATIONS_H
+#define GCC_SUBSTRING_LOCATIONS_H
+
+extern const char *get_source_range_for_substring (cpp_reader *pfile,
+						   string_concat_db *concats,
+						   location_t strloc,
+						   enum cpp_ttype type,
+						   int start_idx, int end_idx,
+						   source_range *out_range);
+
+#endif /* ! GCC_SUBSTRING_LOCATIONS_H */
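As a rough model of what a caller of get_source_range_for_substring gets
back, the following standalone sketch walks the spelling of a narrow
string literal and yields one column range per execution byte. It is a
toy, not libcpp's lexer: ranges_for_literal and col_range are
hypothetical names, and only plain characters, \x escapes, octal escapes
and single-letter escapes are handled (no UCNs, prefixes or
concatenation).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Inclusive, 1-based column range within a single source line.
struct col_range { int first; int last; };

// Hypothetical helper: SPELLING is the literal as written in the source,
// including both quotes; QUOTE_COL is the column of the opening quote.
// Returns one range per execution byte.
static std::vector<col_range>
ranges_for_literal (const std::string &spelling, int quote_col)
{
  assert (spelling.size () >= 2 && spelling[0] == '"'
	  && spelling[spelling.size () - 1] == '"');
  std::vector<col_range> out;
  const std::size_t end = spelling.size () - 1;  // Index of closing quote.
  std::size_t i = 1;
  while (i < end)
    {
      const std::size_t start = i;
      if (spelling[i] == '\\' && i + 1 < end)
	{
	  i++;
	  if (spelling[i] == 'x')
	    {
	      // Hex escape: consume every following hex digit.
	      i++;
	      while (i < end && std::isxdigit ((unsigned char) spelling[i]))
		i++;
	    }
	  else if (spelling[i] >= '0' && spelling[i] <= '7')
	    {
	      // Octal escape: at most three octal digits.
	      int digits = 0;
	      while (i < end && digits < 3
		     && spelling[i] >= '0' && spelling[i] <= '7')
		{
		  i++;
		  digits++;
		}
	    }
	  else
	    i++;  // Single-letter escape such as \n.
	}
      else
	i++;  // Plain source character.
      col_range r = { quote_col + (int) start, quote_col + (int) i - 1 };
      out.push_back (r);
    }
  return out;
}
```

For "01234\x35 789" with the opening quote at column 9, execution byte 5
covers the four source columns of \x35, which matches the shape of the
ranges asserted in the selftests above.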
diff --git a/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c b/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c
new file mode 100644
index 0000000..82689b4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c
@@ -0,0 +1,211 @@
+/* { dg-do compile } */
+/* { dg-options "-O -fdiagnostics-show-caret" } */
+
+/* This is a collection of unittests for ranges within string literals,
+   using diagnostic_plugin_test_string_literals, which handles
+   "__emit_string_literal_range" by generating a warning at the given
+   subset of a string literal.
+
+   The indices are 0-based.  It's easiest to verify things using string
+   literals that are runs of 0-based digits (to avoid having to count
+   characters).
+
+   LITERAL is a const void * to allow testing the various kinds of wide
+   string literal, rather than just const char *.  */
+
+extern void __emit_string_literal_range (const void *literal,
+					 int start_idx, int end_idx);
+
+void
+test_simple_string_literal (void)
+{
+  __emit_string_literal_range ("0123456789", /* { dg-warning "range" } */
+			       6, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("0123456789",
+                                       ^~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_concatenated_string_literal (void)
+{
+  __emit_string_literal_range ("01234" "56789", /* { dg-warning "range" } */
+			       3, 6);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234" "56789",
+                                    ^~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_multiline_string_literal (void)
+{
+  __emit_string_literal_range ("01234" /* { dg-warning "range" } */
+                               "56789",
+                               3, 6);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234"
+                                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+                                "56789",
+                                ~~~  
+   { dg-end-multiline-output "" } */
+  /* FIXME: why does the above need two trailing spaces?  */
+}
+
+/* Tests of various unicode encodings.
+
+   Digits 0 through 9 are unicode code points:
+      U+0030 DIGIT ZERO
+      ...
+      U+0039 DIGIT NINE
+   However, these are not always valid as UCN (see the comment in
+   libcpp/charset.c:_cpp_valid_ucn).
+
+   Hence we need to test UCN using an alternative unicode
+   representation of numbers; let's use Roman numerals,
+   (though these start at one, not zero):
+      U+2170 SMALL ROMAN NUMERAL ONE
+      ...
+      U+2174 SMALL ROMAN NUMERAL FIVE  ("v")
+      U+2175 SMALL ROMAN NUMERAL SIX   ("vi")
+      ...
+      U+2178 SMALL ROMAN NUMERAL NINE.  */
+
+void
+test_hex (void)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as "\x35"
+     and with a space in place of digit 6, to terminate the escaped
+     hex code.  */
+  __emit_string_literal_range ("01234\x35 789", /* { dg-warning "range" } */
+			       3, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234\x35 789"
+                                    ^~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_oct (void)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as "\065"
+     and with a space in place of digit 6, to terminate the escaped
+     octal code.  */
+  __emit_string_literal_range ("01234\065 789", /* { dg-warning "range" } */
+			       3, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234\065 789"
+                                    ^~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_multiple (void)
+{
+  /* Digits 0-9, expressing digit 5 in ASCII as hex "\x35"
+     digit 6 in ASCII as octal "\066", concatenating multiple strings.  */
+  __emit_string_literal_range ("01234"  "\x35"  "\066"  "789", /* { dg-warning "range" } */
+			       3, 8);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234"  "\x35"  "\066"  "789",
+                                    ^~~~~~~~~~~~~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_ucn4 (void)
+{
+  /* Digits 0-9, expressing digits 5 and 6 as Roman numerals expressed
+     as UCN 4.
+     The resulting string is encoded as UTF-8.  Most of the digits are 1 byte
+     each, but digits 5 and 6 are encoded with 3 bytes each.
+     Hence to underline digits 4-7 we need to underline using bytes 4-11 in
+     the UTF-8 encoding.  */
+  __emit_string_literal_range ("01234\u2174\u2175789", /* { dg-warning "range" } */
+			       4, 11);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234\u2174\u2175789",
+                                     ^~~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_ucn8 (void)
+{
+  /* Digits 0-9, expressing digits 5 and 6 as Roman numerals as UCN 8.
+     The resulting string is the same as in test_ucn4 above, and hence
+     has the same UTF-8 encoding, and so we again need to underline bytes
+     4-11 in the UTF-8 encoding in order to underline digits 4-7.  */
+  __emit_string_literal_range ("01234\U00002174\U00002175789", /* { dg-warning "range" } */
+			       4, 11);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range ("01234\U00002174\U00002175789",
+                                     ^~~~~~~~~~~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_u8 (void)
+{
+  /* Digits 0-9.  */
+  __emit_string_literal_range (u8"0123456789", /* { dg-warning "range" } */
+			       4, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range (u8"0123456789",
+                                       ^~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_u (void)
+{
+  /* Digits 0-9.  */
+  __emit_string_literal_range (u"0123456789", /* { dg-error "unable to read substring range: execution character set != source character set" } */
+			       4, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range (u"0123456789",
+                                ^~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_U (void)
+{
+  /* Digits 0-9.  */
+  __emit_string_literal_range (U"0123456789", /* { dg-error "unable to read substring range: execution character set != source character set" } */
+			       4, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range (U"0123456789",
+                                ^~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_L (void)
+{
+  /* Digits 0-9.  */
+  __emit_string_literal_range (L"0123456789", /* { dg-error "unable to read substring range: execution character set != source character set" } */
+			       4, 7);
+/* { dg-begin-multiline-output "" }
+   __emit_string_literal_range (L"0123456789",
+                                ^~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void
+test_macro (void)
+{
+#define START "01234"  /* { dg-warning "range" } */
+  __emit_string_literal_range (START
+                               "56789",
+                               3, 6);
+/* { dg-begin-multiline-output "" }
+ #define START "01234"
+                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+   __emit_string_literal_range (START
+   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+                                "56789",
+                                ~~~
+   { dg-end-multiline-output "" } */
+}
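The byte arithmetic in the test_ucn4 and test_ucn8 comments (digits 4-7
correspond to bytes 4-11) can be checked with a small standalone sketch.
The helper names utf8_length and byte_range_for_chars are hypothetical
and are not part of the patch; the sketch only assumes that the execution
character set is UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of bytes needed to encode code point CP in UTF-8.
static int
utf8_length (unsigned long cp)
{
  if (cp < 0x80)
    return 1;
  if (cp < 0x800)
    return 2;
  if (cp < 0x10000)
    return 3;
  return 4;
}

// Inclusive range of 0-based byte offsets.
struct byte_range { int first; int last; };

// Convert the inclusive range of code point indices
// [FIRST_CHAR, LAST_CHAR] within CPS into the inclusive range of byte
// offsets those code points occupy once CPS is encoded as UTF-8.
static byte_range
byte_range_for_chars (const std::vector<unsigned long> &cps,
		      int first_char, int last_char)
{
  assert (0 <= first_char && first_char <= last_char
	  && (std::size_t) last_char < cps.size ());
  byte_range r = { 0, 0 };
  int offset = 0;
  for (int i = 0; i <= last_char; i++)
    {
      if (i == first_char)
	r.first = offset;
      offset += utf8_length (cps[i]);
    }
  r.last = offset - 1;
  return r;
}
```

Feeding it the ten code points of "01234\u2174\u2175789" and asking for
characters 4 through 7 gives the byte range 4 through 11, since U+2174
and U+2175 each take three bytes.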
diff --git a/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-2.c b/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-2.c
new file mode 100644
index 0000000..7851c02
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-2.c
@@ -0,0 +1,53 @@
+/* { dg-do compile } */
+
+/* See the notes in diagnostic-test-string-literals-1.c.
+   This test case has caret-printing disabled.  */
+
+extern void __emit_string_literal_range (const void *literal,
+					 int start_idx, int end_idx);
+/* Test of a stringified macro argument, by itself.  */
+
+void
+test_stringified_token_1 (int x)
+{
+#define STRINGIFY(EXPR) #EXPR
+
+  __emit_string_literal_range (STRINGIFY(x > 0), /* { dg-error "unable to read substring range: macro expansion" } */
+                               0, 4);
+
+#undef STRINGIFY
+}
+
+/* Test of a stringified token within a concatenation.  */
+
+void
+test_stringified_token_2 (int x)
+{
+#define EXAMPLE(EXPR, START_IDX, END_IDX)			\
+  do {								\
+    __emit_string_literal_range ("  before " #EXPR " after \n",	\
+				 START_IDX, END_IDX);		\
+  } while (0)
+
+  EXAMPLE(x > 0, 1, 6);
+  /* { dg-error "unable to read substring range: cpp_interpret_string_1 failed" "" { target *-*-* } 28 } */
+
+#undef EXAMPLE
+}
+
+/* Test of a doubly-stringified macro argument (by itself).  */
+
+void
+test_stringified_token_3 (int x)
+{
+#define XSTR(s) STR(s)
+#define STR(s) #s
+#define FOO 123456789
+  __emit_string_literal_range (XSTR (FOO), /* { dg-error "unable to read substring range: macro expansion" } */
+                               2, 3);
+
+#undef XSTR
+#undef STR
+#undef FOO
+}
+
diff --git a/gcc/testsuite/gcc.dg/plugin/diagnostic_plugin_test_string_literals.c b/gcc/testsuite/gcc.dg/plugin/diagnostic_plugin_test_string_literals.c
new file mode 100644
index 0000000..d44612a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/plugin/diagnostic_plugin_test_string_literals.c
@@ -0,0 +1,212 @@
+/* This plugin uses the diagnostics code to verify tracking of source code
+   locations within string literals.  */
+/* { dg-options "-O" } */
+
+#include "gcc-plugin.h"
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "tree.h"
+#include "stringpool.h"
+#include "toplev.h"
+#include "basic-block.h"
+#include "hash-table.h"
+#include "vec.h"
+#include "ggc.h"
+#include "basic-block.h"
+#include "tree-ssa-alias.h"
+#include "internal-fn.h"
+#include "gimple-fold.h"
+#include "tree-eh.h"
+#include "gimple-expr.h"
+#include "is-a.h"
+#include "gimple.h"
+#include "gimple-iterator.h"
+#include "tree.h"
+#include "tree-pass.h"
+#include "intl.h"
+#include "plugin-version.h"
+#include "c-family/c-common.h"
+#include "diagnostic.h"
+#include "context.h"
+#include "print-tree.h"
+#include "cpplib.h"
+#include "c-family/c-pragma.h"
+
+int plugin_is_GPL_compatible;
+
+/* A custom pass for printing string literal location information.  */
+
+const pass_data pass_data_test_string_literals =
+{
+  GIMPLE_PASS, /* type */
+  "test_string_literals", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  PROP_ssa, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_test_string_literals : public gimple_opt_pass
+{
+public:
+  pass_test_string_literals(gcc::context *ctxt)
+    : gimple_opt_pass(pass_data_test_string_literals, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  bool gate (function *) { return true; }
+  virtual unsigned int execute (function *);
+
+}; // class pass_test_string_literals
+
+/* Determine if STMT is a call with NUM_ARGS arguments to a function
+   named FUNCNAME.
+   If so, return STMT as a gcall *.  Otherwise return NULL.  */
+
+static gcall *
+check_for_named_call (gimple *stmt,
+		      const char *funcname, unsigned int num_args)
+{
+  gcc_assert (funcname);
+
+  gcall *call = dyn_cast <gcall *> (stmt);
+  if (!call)
+    return NULL;
+
+  tree fndecl = gimple_call_fndecl (call);
+  if (!fndecl)
+    return NULL;
+
+  if (strcmp (IDENTIFIER_POINTER (DECL_NAME (fndecl)), funcname))
+    return NULL;
+
+  if (gimple_call_num_args (call) != num_args)
+    {
+      error_at (stmt->location, "expected number of args: %i (got %i)",
+		num_args, gimple_call_num_args (call));
+      return NULL;
+    }
+
+  return call;
+}
+
+/* Emit a warning covering SRC_RANGE, with the caret at the start of
+   SRC_RANGE.  */
+
+static void
+emit_warning (source_range src_range)
+{
+  location_t loc
+    = make_location (src_range.m_start, src_range.m_start, src_range.m_finish);
+  warning_at (loc, 0, "range %i:%i-%i:%i",
+	      LOCATION_LINE (src_range.m_start),
+	      LOCATION_COLUMN (src_range.m_start),
+	      LOCATION_LINE (src_range.m_finish),
+	      LOCATION_COLUMN (src_range.m_finish));
+}
+
+/* Support code for verifying that we are correctly tracking ranges
+   within string literals, for use by diagnostic-test-string-literals-*.c.
+   Emit a warning showing the range of a string literal, for each call to
+   a function named "__emit_string_literal_range".
+   The initial argument should be a string literal; arguments 2 and 3
+   should be integer constants, giving the range within the string
+   to be printed.  */
+
+static void
+test_string_literals (gimple *stmt)
+{
+  gcall *call = check_for_named_call (stmt, "__emit_string_literal_range", 3);
+  if (!call)
+    return;
+
+  /* We expect an ADDR_EXPR with a STRING_CST inside it for the
+     initial arg.  */
+  tree t_addr_string = gimple_call_arg (call, 0);
+  if (TREE_CODE (t_addr_string) != ADDR_EXPR)
+    {
+      error_at (call->location, "string literal required for arg 1");
+      return;
+    }
+
+  tree t_string = TREE_OPERAND (t_addr_string, 0);
+  if (TREE_CODE (t_string) != STRING_CST)
+    {
+      error_at (call->location, "string literal required for arg 1");
+      return;
+    }
+
+  tree t_start_idx = gimple_call_arg (call, 1);
+  if (TREE_CODE (t_start_idx) != INTEGER_CST)
+    {
+      error_at (call->location, "integer constant required for arg 2");
+      return;
+    }
+  int start_idx = TREE_INT_CST_LOW (t_start_idx);
+
+  tree t_end_idx = gimple_call_arg (call, 2);
+  if (TREE_CODE (t_end_idx) != INTEGER_CST)
+    {
+      error_at (call->location, "integer constant required for arg 3");
+      return;
+    }
+  int end_idx = TREE_INT_CST_LOW (t_end_idx);
+
+  /* A STRING_CST doesn't have a location, but the ADDR_EXPR does.  */
+  location_t strloc = EXPR_LOCATION (t_addr_string);
+  source_range src_range;
+  substring_loc substr_loc (strloc, TREE_TYPE (t_string),
+			    start_idx, end_idx);
+  const char *err = substr_loc.get_range (&src_range);
+  if (err)
+    error_at (strloc, "unable to read substring range: %s", err);
+  else
+    emit_warning (src_range);
+}
+
+/* Call test_string_literals on every statement within FUN.  */
+
+unsigned int
+pass_test_string_literals::execute (function *fun)
+{
+  gimple_stmt_iterator gsi;
+  basic_block bb;
+
+  FOR_EACH_BB_FN (bb, fun)
+    for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+      {
+	gimple *stmt = gsi_stmt (gsi);
+	test_string_literals (stmt);
+      }
+
+  return 0;
+}
+
+/* Entrypoint for the plugin.  Create and register the custom pass.  */
+
+int
+plugin_init (struct plugin_name_args *plugin_info,
+	     struct plugin_gcc_version *version)
+{
+  struct register_pass_info pass_info;
+  const char *plugin_name = plugin_info->base_name;
+  int argc = plugin_info->argc;
+  struct plugin_argument *argv = plugin_info->argv;
+
+  if (!plugin_default_version_check (version, &gcc_version))
+    return 1;
+
+  pass_info.pass = new pass_test_string_literals (g);
+  pass_info.reference_pass_name = "ssa";
+  pass_info.ref_pass_instance_number = 1;
+  pass_info.pos_op = PASS_POS_INSERT_AFTER;
+  register_callback (plugin_name, PLUGIN_PASS_MANAGER_SETUP, NULL,
+		     &pass_info);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/plugin/plugin.exp b/gcc/testsuite/gcc.dg/plugin/plugin.exp
index faebb75..715038a 100644
--- a/gcc/testsuite/gcc.dg/plugin/plugin.exp
+++ b/gcc/testsuite/gcc.dg/plugin/plugin.exp
@@ -70,6 +70,9 @@ set plugin_test_list [list \
 	  diagnostic-test-expressions-1.c } \
     { diagnostic_plugin_show_trees.c \
 	  diagnostic-test-show-trees-1.c } \
+    { diagnostic_plugin_test_string_literals.c \
+	  diagnostic-test-string-literals-1.c \
+	  diagnostic-test-string-literals-2.c } \
     { location_overflow_plugin.c \
 	  location-overflow-test-1.c \
 	  location-overflow-test-2.c } \
diff --git a/libcpp/charset.c b/libcpp/charset.c
index 2d07942..3739d6c 100644
--- a/libcpp/charset.c
+++ b/libcpp/charset.c
@@ -812,6 +812,51 @@ cpp_host_to_exec_charset (cpp_reader *pfile, cppchar_t c)
 
 \f
 
+/* cpp_substring_ranges's constructor.  */
+
+cpp_substring_ranges::cpp_substring_ranges () :
+  m_ranges (NULL),
+  m_num_ranges (0),
+  m_alloc_ranges (8)
+{
+  m_ranges = XNEWVEC (source_range, m_alloc_ranges);
+}
+
+/* cpp_substring_ranges's destructor.  */
+
+cpp_substring_ranges::~cpp_substring_ranges ()
+{
+  free (m_ranges);
+}
+
+/* Add RANGE to the vector of source_range information.  */
+
+void
+cpp_substring_ranges::add_range (source_range range)
+{
+  if (m_num_ranges >= m_alloc_ranges)
+    {
+      m_alloc_ranges *= 2;
+      m_ranges
+	= (source_range *)xrealloc (m_ranges,
+				    sizeof (source_range) * m_alloc_ranges);
+    }
+  m_ranges[m_num_ranges++] = range;
+}
+
+/* Read NUM ranges from LOC_READER, adding them to the vector of source_range
+   information.  */
+
+void
+cpp_substring_ranges::add_n_ranges (int num,
+				    cpp_string_location_reader &loc_reader)
+{
+  for (int i = 0; i < num; i++)
+    add_range (loc_reader.get_next ());
+}
+
+\f
+
 /* Utility routine that computes a mask of the form 0000...111... with
    WIDTH 1-bits.  */
 static inline size_t
@@ -980,18 +1025,27 @@ ucn_valid_in_identifier (cpp_reader *pfile, cppchar_t c,
    one beyond the UCN, or to the syntactically invalid character.
 
    IDENTIFIER_POS is 0 when not in an identifier, 1 for the start of
-   an identifier, or 2 otherwise.  */
+   an identifier, or 2 otherwise.
+
+   If CHAR_RANGE and LOC_READER are non-NULL, then position information is
+   read from *LOC_READER and CHAR_RANGE->m_finish is updated accordingly.  */
 
 bool
 _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
 		const uchar *limit, int identifier_pos,
-		struct normalize_state *nst, cppchar_t *cp)
+		struct normalize_state *nst, cppchar_t *cp,
+		source_range *char_range,
+		cpp_string_location_reader *loc_reader)
 {
   cppchar_t result, c;
   unsigned int length;
   const uchar *str = *pstr;
   const uchar *base = str - 2;
 
+  /* char_range and loc_reader must either be both NULL, or both be
+     non-NULL.  */
+  gcc_assert ((char_range != NULL) == (loc_reader != NULL));
+
   if (!CPP_OPTION (pfile, cplusplus) && !CPP_OPTION (pfile, c99))
     cpp_error (pfile, CPP_DL_WARNING,
 	       "universal character names are only valid in C++ and C99");
@@ -1021,6 +1075,8 @@ _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
       if (!ISXDIGIT (c))
 	break;
       str++;
+      if (loc_reader)
+	char_range->m_finish = loc_reader->get_next ().m_finish;
       result = (result << 4) + hex_value (c);
     }
   while (--length && str < limit);
@@ -1086,11 +1142,18 @@ _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
 }
 
 /* Convert an UCN, pointed to by FROM, to UTF-8 encoding, then translate
-   it to the execution character set and write the result into TBUF.
-   An advanced pointer is returned.  Issues all relevant diagnostics.  */
+   it to the execution character set and write the result into TBUF,
+   if TBUF is non-NULL.
+   An advanced pointer is returned.  Issues all relevant diagnostics.
+   If LOC_READER is non-NULL, then RANGES must be non-NULL and CHAR_RANGE
+   contains the location of the character so far: location information
+   is read from *LOC_READER, and *RANGES is updated accordingly.  */
 static const uchar *
 convert_ucn (cpp_reader *pfile, const uchar *from, const uchar *limit,
-	     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
+	     struct _cpp_strbuf *tbuf, struct cset_converter cvt,
+	     source_range char_range,
+	     cpp_string_location_reader *loc_reader,
+	     cpp_substring_ranges *ranges)
 {
   cppchar_t ucn;
   uchar buf[6];
@@ -1099,8 +1162,17 @@ convert_ucn (cpp_reader *pfile, const uchar *from, const uchar *limit,
   int rval;
   struct normalize_state nst = INITIAL_NORMALIZE_STATE;
 
+  /* loc_reader and ranges must either be both NULL, or both be non-NULL.  */
+  gcc_assert ((loc_reader != NULL) == (ranges != NULL));
+
   from++;  /* Skip u/U.  */
-  _cpp_valid_ucn (pfile, &from, limit, 0, &nst, &ucn);
+
+  if (loc_reader)
+    /* The u/U is part of the spelling of this character.  */
+    char_range.m_finish = loc_reader->get_next ().m_finish;
+
+  _cpp_valid_ucn (pfile, &from, limit, 0, &nst,
+		  &ucn, &char_range, loc_reader);
 
   rval = one_cppchar_to_utf8 (ucn, &bufp, &bytesleft);
   if (rval)
@@ -1109,9 +1181,20 @@ convert_ucn (cpp_reader *pfile, const uchar *from, const uchar *limit,
       cpp_errno (pfile, CPP_DL_ERROR,
 		 "converting UCN to source character set");
     }
-  else if (!APPLY_CONVERSION (cvt, buf, 6 - bytesleft, tbuf))
-    cpp_errno (pfile, CPP_DL_ERROR,
-	       "converting UCN to execution character set");
+  else
+    {
+      if (tbuf)
+	if (!APPLY_CONVERSION (cvt, buf, 6 - bytesleft, tbuf))
+	  cpp_errno (pfile, CPP_DL_ERROR,
+		     "converting UCN to execution character set");
+
+      if (loc_reader)
+	{
+	  int num_encoded_bytes = 6 - bytesleft;
+	  for (int i = 0; i < num_encoded_bytes; i++)
+	    ranges->add_range (char_range);
+	}
+    }
 
   return from;
 }
@@ -1167,31 +1250,48 @@ emit_numeric_escape (cpp_reader *pfile, cppchar_t n,
 }
 
 /* Convert a hexadecimal escape, pointed to by FROM, to the execution
-   character set and write it into the string buffer TBUF.  Returns an
-   advanced pointer, and issues diagnostics as necessary.
+   character set and write it into the string buffer TBUF (if non-NULL).
+   Returns an advanced pointer, and issues diagnostics as necessary.
    No character set translation occurs; this routine always produces the
    execution-set character with numeric value equal to the given hex
-   number.  You can, e.g. generate surrogate pairs this way.  */
+   number.  You can, e.g. generate surrogate pairs this way.
+   If LOC_READER is non-NULL, then RANGES must be non-NULL and CHAR_RANGE
+   contains the location of the character so far: location information
+   is read from *LOC_READER, and *RANGES is updated accordingly.  */
 static const uchar *
 convert_hex (cpp_reader *pfile, const uchar *from, const uchar *limit,
-	     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
+	     struct _cpp_strbuf *tbuf, struct cset_converter cvt,
+	     source_range char_range,
+	     cpp_string_location_reader *loc_reader,
+	     cpp_substring_ranges *ranges)
 {
   cppchar_t c, n = 0, overflow = 0;
   int digits_found = 0;
   size_t width = cvt.width;
   size_t mask = width_to_mask (width);
 
+  /* loc_reader and ranges must either be both NULL, or both be non-NULL.  */
+  gcc_assert ((loc_reader != NULL) == (ranges != NULL));
+
   if (CPP_WTRADITIONAL (pfile))
     cpp_warning (pfile, CPP_W_TRADITIONAL,
 	         "the meaning of '\\x' is different in traditional C");
 
-  from++;  /* Skip 'x'.  */
+  /* Skip 'x'.  */
+  from++;
+
+  /* The 'x' is part of the spelling of this character.  */
+  if (loc_reader)
+    char_range.m_finish = loc_reader->get_next ().m_finish;
+
   while (from < limit)
     {
       c = *from;
       if (! hex_p (c))
 	break;
       from++;
+      if (loc_reader)
+	char_range.m_finish = loc_reader->get_next ().m_finish;
       overflow |= n ^ (n << 4 >> 4);
       n = (n << 4) + hex_value (c);
       digits_found = 1;
@@ -1211,7 +1311,10 @@ convert_hex (cpp_reader *pfile, const uchar *from, const uchar *limit,
       n &= mask;
     }
 
-  emit_numeric_escape (pfile, n, tbuf, cvt);
+  if (tbuf)
+    emit_numeric_escape (pfile, n, tbuf, cvt);
+  if (ranges)
+    ranges->add_range (char_range);
 
   return from;
 }
@@ -1221,10 +1324,16 @@ convert_hex (cpp_reader *pfile, const uchar *from, const uchar *limit,
    advanced pointer, and issues diagnostics as necessary.
    No character set translation occurs; this routine always produces the
    execution-set character with numeric value equal to the given octal
-   number.  */
+   number.
+   If LOC_READER is non-NULL, then RANGES must be non-NULL and CHAR_RANGE
+   contains the location of the character so far: location information
+   is read from *LOC_READER, and *RANGES is updated accordingly.  */
 static const uchar *
 convert_oct (cpp_reader *pfile, const uchar *from, const uchar *limit,
-	     struct _cpp_strbuf *tbuf, struct cset_converter cvt)
+	     struct _cpp_strbuf *tbuf, struct cset_converter cvt,
+	     source_range char_range,
+	     cpp_string_location_reader *loc_reader,
+	     cpp_substring_ranges *ranges)
 {
   size_t count = 0;
   cppchar_t c, n = 0;
@@ -1232,12 +1341,17 @@ convert_oct (cpp_reader *pfile, const uchar *from, const uchar *limit,
   size_t mask = width_to_mask (width);
   bool overflow = false;
 
+  /* loc_reader and ranges must either be both NULL, or both be non-NULL.  */
+  gcc_assert ((loc_reader != NULL) == (ranges != NULL));
+
   while (from < limit && count++ < 3)
     {
       c = *from;
       if (c < '0' || c > '7')
 	break;
       from++;
+      if (loc_reader)
+	char_range.m_finish = loc_reader->get_next ().m_finish;
       overflow |= n ^ (n << 3 >> 3);
       n = (n << 3) + c - '0';
     }
@@ -1249,18 +1363,26 @@ convert_oct (cpp_reader *pfile, const uchar *from, const uchar *limit,
       n &= mask;
     }
 
-  emit_numeric_escape (pfile, n, tbuf, cvt);
+  if (tbuf)
+    emit_numeric_escape (pfile, n, tbuf, cvt);
+  if (ranges)
+    ranges->add_range (char_range);
 
   return from;
 }
 
 /* Convert an escape sequence (pointed to by FROM) to its value on
    the target, and to the execution character set.  Do not scan past
-   LIMIT.  Write the converted value into TBUF.  Returns an advanced
-   pointer.  Handles all relevant diagnostics.  */
+   LIMIT.  Write the converted value into TBUF, if TBUF is non-NULL.
+   Returns an advanced pointer.  Handles all relevant diagnostics.
+   If LOC_READER is non-NULL, then RANGES must be non-NULL: location
+   information is read from *LOC_READER, and *RANGES is updated
+   accordingly.  */
 static const uchar *
 convert_escape (cpp_reader *pfile, const uchar *from, const uchar *limit,
-		struct _cpp_strbuf *tbuf, struct cset_converter cvt)
+		struct _cpp_strbuf *tbuf, struct cset_converter cvt,
+		cpp_string_location_reader *loc_reader,
+		cpp_substring_ranges *ranges)
 {
   /* Values of \a \b \e \f \n \r \t \v respectively.  */
 #if HOST_CHARSET == HOST_CHARSET_ASCII
@@ -1273,20 +1395,28 @@ convert_escape (cpp_reader *pfile, const uchar *from, const uchar *limit,
 
   uchar c;
 
+  /* Record the location of the backslash.  */
+  source_range char_range;
+  if (loc_reader)
+    char_range = loc_reader->get_next ();
+
   c = *from;
   switch (c)
     {
       /* UCNs, hex escapes, and octal escapes are processed separately.  */
     case 'u': case 'U':
-      return convert_ucn (pfile, from, limit, tbuf, cvt);
+      return convert_ucn (pfile, from, limit, tbuf, cvt,
+			  char_range, loc_reader, ranges);
 
     case 'x':
-      return convert_hex (pfile, from, limit, tbuf, cvt);
+      return convert_hex (pfile, from, limit, tbuf, cvt,
+			  char_range, loc_reader, ranges);
       break;
 
     case '0':  case '1':  case '2':  case '3':
     case '4':  case '5':  case '6':  case '7':
-      return convert_oct (pfile, from, limit, tbuf, cvt);
+      return convert_oct (pfile, from, limit, tbuf, cvt,
+			  char_range, loc_reader, ranges);
 
       /* Various letter escapes.  Get the appropriate host-charset
 	 value into C.  */
@@ -1338,10 +1468,17 @@ convert_escape (cpp_reader *pfile, const uchar *from, const uchar *limit,
 	}
     }
 
-  /* Now convert what we have to the execution character set.  */
-  if (!APPLY_CONVERSION (cvt, &c, 1, tbuf))
-    cpp_errno (pfile, CPP_DL_ERROR,
-	       "converting escape sequence to execution character set");
+  if (tbuf)
+    /* Now convert what we have to the execution character set.  */
+    if (!APPLY_CONVERSION (cvt, &c, 1, tbuf))
+      cpp_errno (pfile, CPP_DL_ERROR,
+		 "converting escape sequence to execution character set");
+
+  if (loc_reader)
+    {
+      char_range.m_finish = loc_reader->get_next ().m_finish;
+      ranges->add_range (char_range);
+    }
 
   return from + 1;
 }
@@ -1374,28 +1511,52 @@ converter_for_type (cpp_reader *pfile, enum cpp_ttype type)
    are to be converted from the source to the execution character set,
    escape sequences translated, and finally all are to be
    concatenated.  WIDE indicates whether or not to produce a wide
-   string.  The result is written into TO.  Returns true for success,
-   false for failure.  */
-bool
-cpp_interpret_string (cpp_reader *pfile, const cpp_string *from, size_t count,
-		      cpp_string *to,  enum cpp_ttype type)
+   string.  If TO is non-NULL, the result is written into TO.
+   If LOC_READERS and OUT are non-NULL, then location information
+   is read from LOC_READERS (which must be an array of length COUNT),
+   and location information is written to *OUT.
+
+   Returns true for success, false for failure.  */
+
+static bool
+cpp_interpret_string_1 (cpp_reader *pfile, const cpp_string *from, size_t count,
+			cpp_string *to,  enum cpp_ttype type,
+			cpp_string_location_reader *loc_readers,
+			cpp_substring_ranges *out)
 {
   struct _cpp_strbuf tbuf;
   const uchar *p, *base, *limit;
   size_t i;
   struct cset_converter cvt = converter_for_type (pfile, type);
 
-  tbuf.asize = MAX (OUTBUF_BLOCK_SIZE, from->len);
-  tbuf.text = XNEWVEC (uchar, tbuf.asize);
-  tbuf.len = 0;
+  /* loc_readers and out must either be both NULL, or both be non-NULL.  */
+  gcc_assert ((loc_readers != NULL) == (out != NULL));
+
+  if (to)
+    {
+      tbuf.asize = MAX (OUTBUF_BLOCK_SIZE, from->len);
+      tbuf.text = XNEWVEC (uchar, tbuf.asize);
+      tbuf.len = 0;
+    }
 
   for (i = 0; i < count; i++)
     {
+      cpp_string_location_reader *loc_reader = NULL;
+      if (loc_readers)
+	loc_reader = &loc_readers[i];
+
       p = from[i].text;
       if (*p == 'u')
 	{
-	  if (*++p == '8')
-	    p++;
+	  p++;
+	  if (loc_reader)
+	    loc_reader->get_next ();
+	  if (*p == '8')
+	    {
+	      p++;
+	      if (loc_reader)
+		loc_reader->get_next ();
+	    }
 	}
       else if (*p == 'L' || *p == 'U') p++;
       if (*p == 'R')
@@ -1414,13 +1575,43 @@ cpp_interpret_string (cpp_reader *pfile, const cpp_string *from, size_t count,
 
 	  /* Raw strings are all normal characters; these can be fed
 	     directly to convert_cset.  */
-	  if (!APPLY_CONVERSION (cvt, p, limit - p, &tbuf))
-	    goto fail;
+	  if (to)
+	    if (!APPLY_CONVERSION (cvt, p, limit - p, &tbuf))
+	      goto fail;
+
+	  if (loc_reader)
+	    {
+	      /* If generating source ranges, assume we have a 1:1
+		 correspondence between bytes in the source encoding and bytes
+		 in the execution encoding (e.g. if we have a UTF-8 to UTF-8
+		 conversion), so that this run of bytes in the source file
+		 corresponds to a run of bytes in the execution string.
+		 This requirement is guaranteed by an early-reject in
+		 cpp_interpret_string_ranges.  */
+	      gcc_assert (cvt.func == convert_no_conversion);
+	      out->add_n_ranges (limit - p, *loc_reader);
+	    }
 
 	  continue;
 	}
 
-      p++; /* Skip leading quote.  */
+      /* If we don't now have a leading quote, something has gone wrong.
+	 This can occur if cpp_interpret_string_ranges is handling a
+	 stringified macro argument, but should not be possible otherwise.  */
+      if (*p != '"' && *p != '\'')
+	{
+	  gcc_assert (out != NULL);
+	  cpp_error (pfile, CPP_DL_ERROR, "missing open quote");
+	  if (to)
+	    free (tbuf.text);
+	  return false;
+	}
+
+      /* Skip leading quote.  */
+      p++;
+      if (loc_reader)
+	loc_reader->get_next ();
+
       limit = from[i].text + from[i].len - 1; /* Skip trailing quote.  */
 
       for (;;)
@@ -1432,29 +1623,130 @@ cpp_interpret_string (cpp_reader *pfile, const cpp_string *from, size_t count,
 	    {
 	      /* We have a run of normal characters; these can be fed
 		 directly to convert_cset.  */
-	      if (!APPLY_CONVERSION (cvt, base, p - base, &tbuf))
-		goto fail;
+	      if (to)
+		if (!APPLY_CONVERSION (cvt, base, p - base, &tbuf))
+		  goto fail;
+	      /* Similar to above: assumes we have a 1:1 correspondence
+		 between bytes in the source encoding and bytes in the
+		 execution encoding.  */
+	      if (loc_reader)
+		{
+		  gcc_assert (cvt.func == convert_no_conversion);
+		  out->add_n_ranges (p - base, *loc_reader);
+		}
 	    }
-	  if (p == limit)
+	  if (p >= limit)
 	    break;
 
-	  p = convert_escape (pfile, p + 1, limit, &tbuf, cvt);
+	  struct _cpp_strbuf *tbuf_ptr = to ? &tbuf : NULL;
+	  p = convert_escape (pfile, p + 1, limit, tbuf_ptr, cvt,
+			      loc_reader, out);
 	}
     }
-  /* NUL-terminate the 'to' buffer and translate it to a cpp_string
-     structure.  */
-  emit_numeric_escape (pfile, 0, &tbuf, cvt);
-  tbuf.text = XRESIZEVEC (uchar, tbuf.text, tbuf.len);
-  to->text = tbuf.text;
-  to->len = tbuf.len;
+
+  if (to)
+    {
+      /* NUL-terminate the 'to' buffer and translate it to a cpp_string
+	 structure.  */
+      emit_numeric_escape (pfile, 0, &tbuf, cvt);
+      tbuf.text = XRESIZEVEC (uchar, tbuf.text, tbuf.len);
+      to->text = tbuf.text;
+      to->len = tbuf.len;
+    }
+
   return true;
 
  fail:
   cpp_errno (pfile, CPP_DL_ERROR, "converting to execution character set");
-  free (tbuf.text);
+  if (to)
+    free (tbuf.text);
   return false;
 }
 
+/* FROM is an array of cpp_string structures of length COUNT.  These
+   are to be converted from the source to the execution character set,
+   escape sequences translated, and finally all are to be
+   concatenated.  TYPE indicates the kind of string to produce (e.g.
+   narrow or wide).  The result is written into TO.  Returns true for
+   success, false for failure.  */
+bool
+cpp_interpret_string (cpp_reader *pfile, const cpp_string *from, size_t count,
+		      cpp_string *to,  enum cpp_ttype type)
+{
+  return cpp_interpret_string_1 (pfile, from, count, to, type, NULL, NULL);
+}
+
+/* A "do nothing" error-handling callback for use by
+   cpp_interpret_string_ranges, so that it can temporarily suppress
+   error-handling.  */
+
+static bool
+noop_error_cb (cpp_reader *, int, int, rich_location *,
+	       const char *, va_list *)
+{
+  /* no-op.  */
+  return true;
+}
+
+/* This function mimics the behavior of cpp_interpret_string, but
+   rather than generating a string in the execution character set,
+   *OUT is written to with the source code ranges of the characters
+   in such a string.
+   FROM and LOC_READERS should both be arrays of length COUNT.
+   Returns NULL for success, or an error message for failure.  */
+
+const char *
+cpp_interpret_string_ranges (cpp_reader *pfile, const cpp_string *from,
+			     cpp_string_location_reader *loc_readers,
+			     size_t count,
+			     cpp_substring_ranges *out,
+			     enum cpp_ttype type)
+{
+  /* There are a couple of cases in the range-handling in
+     cpp_interpret_string_1 that rely on there being a 1:1 correspondence
+     between bytes in the source encoding and bytes in the execution
+     encoding, so that each byte in the execution string can correspond
+     to the location of a byte in the source string.
+
+     This holds for the typical case of a UTF-8 to UTF-8 conversion.
+     Enforce this requirement by only attempting to track substring
+     locations if we have source encoding == execution encoding.
+
+     This is a stronger condition than we need, since we could e.g.
+     have ASCII to EBCDIC (with 1 byte per character before and after),
+     but it seems to be a reasonable restriction.  */
+  struct cset_converter cvt = converter_for_type (pfile, type);
+  if (cvt.func != convert_no_conversion)
+    return "execution character set != source character set";
+
+  /* For on-demand strings we have already lexed the strings, so there
+     should be no errors.  However, if we have bogus source location
+     data (or stringified macro arguments), the attempt to lex the
+     strings could fail with an error.  Temporarily install an
+     error-handler to catch the error, so that it can lead to this call
+     failing, rather than being emitted as a user-visible diagnostic.
+     If an error does occur, we should see it via the return value of
+     cpp_interpret_string_1.  */
+  bool (*saved_error_handler) (cpp_reader *, int, int, rich_location *,
+			       const char *, va_list *)
+    ATTRIBUTE_FPTR_PRINTF(5,0);
+
+  saved_error_handler = pfile->cb.error;
+  pfile->cb.error = noop_error_cb;
+
+  bool result = cpp_interpret_string_1 (pfile, from, count, NULL, type,
+					loc_readers, out);
+
+  /* Restore the saved error-handler.  */
+  pfile->cb.error = saved_error_handler;
+
+  if (!result)
+    return "cpp_interpret_string_1 failed";
+
+  /* Success.  */
+  return NULL;
+}
+
 /* Subroutine of do_line and do_linemarker.  Convert escape sequences
    in a string, but do not perform character set conversion.  */
 bool
@@ -1818,3 +2110,39 @@ _cpp_default_encoding (void)
 
   return current_encoding;
 }
+
+/* Implementation of class cpp_string_location_reader.  */
+
+/* Constructor for cpp_string_location_reader.  */
+
+cpp_string_location_reader::
+cpp_string_location_reader (source_location src_loc,
+			    line_maps *line_table)
+: m_line_table (line_table)
+{
+  src_loc = get_range_from_loc (line_table, src_loc).m_start;
+
+  /* SRC_LOC might be a macro location.  It only makes sense to do
+     column-by-column calculations on ordinary maps, so get the
+     corresponding location in an ordinary map.  */
+  m_loc
+    = linemap_resolve_location (line_table, src_loc,
+				LRK_SPELLING_LOCATION, NULL);
+
+  const line_map_ordinary *map
+    = linemap_check_ordinary (linemap_lookup (line_table, m_loc));
+  m_offset_per_column = (1 << map->m_range_bits);
+}
+
+/* Get the range of the next source byte.  */
+
+source_range
+cpp_string_location_reader::get_next ()
+{
+  source_range result;
+  result.m_start = m_loc;
+  result.m_finish = m_loc;
+  if (m_loc <= LINE_MAP_MAX_LOCATION_WITH_COLS)
+    m_loc += m_offset_per_column;
+  return result;
+}
diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h
index 4e0084c..659686b 100644
--- a/libcpp/include/cpplib.h
+++ b/libcpp/include/cpplib.h
@@ -743,6 +743,51 @@ struct GTY(()) cpp_hashnode {
   union _cpp_hashnode_value GTY ((desc ("CPP_HASHNODE_VALUE_IDX (%1)"))) value;
 };
 
+/* A class for iterating through the source locations within a
+   string token (before escapes are interpreted, and before
+   concatenation).  */
+
+class cpp_string_location_reader {
+ public:
+  cpp_string_location_reader (source_location src_loc,
+			      line_maps *line_table);
+
+  source_range get_next ();
+
+ private:
+  source_location m_loc;
+  int m_offset_per_column;
+  line_maps *m_line_table;
+};
+
+/* A class for storing the source ranges of all of the characters within
+   a string literal, after escapes are interpreted, and after
+   concatenation.
+
+   This is not GTY-marked, as instances are intended to be temporary.  */
+
+class cpp_substring_ranges
+{
+ public:
+  cpp_substring_ranges ();
+  ~cpp_substring_ranges ();
+
+  int get_num_ranges () const { return m_num_ranges; }
+  source_range get_range (int idx) const
+  {
+    linemap_assert (idx < m_num_ranges);
+    return m_ranges[idx];
+  }
+
+  void add_range (source_range range);
+  void add_n_ranges (int num, cpp_string_location_reader &loc_reader);
+
+ private:
+  source_range *m_ranges;
+  int m_num_ranges;
+  int m_alloc_ranges;
+};
+
 /* Call this first to get a handle to pass to other functions.
 
    If you want cpplib to manage its own hashtable, pass in a NULL
@@ -829,6 +874,12 @@ extern cppchar_t cpp_interpret_charconst (cpp_reader *, const cpp_token *,
 extern bool cpp_interpret_string (cpp_reader *,
 				  const cpp_string *, size_t,
 				  cpp_string *, enum cpp_ttype);
+extern const char *cpp_interpret_string_ranges (cpp_reader *pfile,
+						const cpp_string *from,
+						cpp_string_location_reader *,
+						size_t count,
+						cpp_substring_ranges *out,
+						enum cpp_ttype type);
 extern bool cpp_interpret_string_notranslate (cpp_reader *,
 					      const cpp_string *, size_t,
 					      cpp_string *, enum cpp_ttype);
diff --git a/libcpp/internal.h b/libcpp/internal.h
index ca2b498..4a5cd3c 100644
--- a/libcpp/internal.h
+++ b/libcpp/internal.h
@@ -754,7 +754,9 @@ struct normalize_state
 extern bool _cpp_valid_ucn (cpp_reader *, const unsigned char **,
 			    const unsigned char *, int,
 			    struct normalize_state *state,
-			    cppchar_t *);
+			    cppchar_t *,
+			    source_range *char_range,
+			    cpp_string_location_reader *loc_reader);
 extern void _cpp_destroy_iconv (cpp_reader *);
 extern unsigned char *_cpp_convert_input (cpp_reader *, const char *,
 					  unsigned char *, size_t, size_t,
diff --git a/libcpp/lex.c b/libcpp/lex.c
index 236418d..4e71965 100644
--- a/libcpp/lex.c
+++ b/libcpp/lex.c
@@ -1247,7 +1247,7 @@ forms_identifier_p (cpp_reader *pfile, int first,
       cppchar_t s;
       buffer->cur += 2;
       if (_cpp_valid_ucn (pfile, &buffer->cur, buffer->rlimit, 1 + !first,
-			  state, &s))
+			  state, &s, NULL, NULL))
 	return true;
       buffer->cur -= 2;
     }
-- 
1.8.5.3
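
For readers skimming the libcpp part of the patch above, here is a minimal standalone sketch of the idea behind cpp_string_location_reader and cpp_substring_ranges: re-walk the spelling of a literal, take one location per source byte, and merge the bytes of each escape sequence into a single range. Everything in it is an assumption made for illustration (the `range` and `location_reader` types, the `substring_ranges` function, plain column numbers instead of source_location values); it handles only narrow literals with plain characters, single-letter escapes and \x escapes, and it is not the code from the patch.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for source_range: a span of columns on one line.
struct range { int start; int finish; };

// Hypothetical stand-in for cpp_string_location_reader: hands out the
// location of each successive source byte, starting at START_COL.
class location_reader
{
public:
  explicit location_reader (int start_col) : m_col (start_col) {}
  range get_next () { range r = { m_col, m_col }; ++m_col; return r; }
private:
  int m_col;
};

// Walk SPELLING (a narrow literal, including both quotes) and return one
// range per character of the interpreted string.
static std::vector<range>
substring_ranges (const std::string &spelling, int start_col)
{
  assert (spelling.size () >= 2
	  && spelling.front () == '"' && spelling.back () == '"');

  std::vector<range> out;
  location_reader reader (start_col);
  std::size_t limit = spelling.size () - 1;  // Index of the closing quote.
  std::size_t i = 1;
  reader.get_next ();  // Skip the opening quote.

  while (i < limit)
    {
      range r = reader.get_next ();  // This byte, or the backslash.
      if (spelling[i] != '\\')
	{
	  out.push_back (r);
	  ++i;
	  continue;
	}
      ++i;  // Step past the backslash.

      // The escape letter is part of the spelling of this character.
      r.finish = reader.get_next ().finish;
      bool hex = spelling[i] == 'x';
      ++i;

      // Each hex digit extends the range of the same character.
      if (hex)
	while (i < limit && std::isxdigit ((unsigned char) spelling[i]))
	  {
	    r.finish = reader.get_next ().finish;
	    ++i;
	  }
      out.push_back (r);
    }
  return out;
}
```

Under these assumptions, a literal spelled "a\x41z\n" whose opening quote is at column 10 yields four ranges, with the \x41 escape covering columns 12-15 and the \n escape covering columns 17-18.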



* [PATCH] c-format.c: cleanup of check_format_info_main
  2016-08-04 20:22                   ` Jeff Law
@ 2016-08-06  0:56                     ` David Malcolm
  2016-08-08 17:20                       ` Jeff Law
  0 siblings, 1 reply; 61+ messages in thread
From: David Malcolm @ 2016-08-06  0:56 UTC (permalink / raw)
  To: Jeff Law, gcc-patches; +Cc: Joseph Myers, Martin Sebor, David Malcolm

On Thu, 2016-08-04 at 14:22 -0600, Jeff Law wrote:
> On 08/04/2016 01:24 PM, David Malcolm wrote:
>
> > > Do you realize that this isn't used for ~700 lines after this point?
> > > Is there any sensible way to factor some code here to avoid the coding
> > > disconnect.  I realize the function was huge before you got in here,
> > > but if at all possible, I'd like to see a bit of cleanup.
> > >
> > > I think this is OK after that cleanup.
> >
> > format_chars can get modified in numerous places in the intervening
> > lines, which is why I stash the value there.
> Yea, I figured that was the case.  I first noticed the stashed value,
> but didn't see where it was used for far longer than I expected.
>
> >
> > I can do some kind of cleanup of check_format_info_main, maybe
> > splitting out the things in the body of loop, moving them to support
> > functions.
> That's essentially what I was thinking.
>
> >
> > That said, I note that Martin's sprintf patch:
> >   https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00056.html
> > also touches those ~700 lines in check_format_info_main in over a dozen
> > places.  Given that, would you prefer I do the cleanup before or after
> > the substring_loc patch?
> I think you should go first with the cleanup.  It'll cause Martin
> some
> heartburn, but that happens sometimes.
>
> And FWIW, if you hadn't needed to stash away that value I probably
> wouldn't have noticed how badly that function (and the loop in
> particular) needed some refactoring.
>
> jeff

Here's a cleanup of check_format_info_main, which introduces three
new classes to hold state, and moves code from the loop into
methods of those classes, reducing the loop from ~700 lines to
~100 lines.
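
To give a rough idea of the shape of the refactoring, here is a
standalone sketch.  It is not the code from the patch: flag_set,
mini_parser and parse_one are hypothetical, cut-down stand-ins for
flag_chars_t, argument_parser and the loop body, and the flag list
"-+ #0" is hard-coded rather than taken from a format_kind_info.
Unlike the real code, which warns about a repeated flag and carries
on, the sketch simply stops.  The point it illustrates is that the
parser holds a reference to the caller's cursor, so each method
advances format_chars in place and reports whether parsing should
continue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Simplified stand-in for the patch's flag_chars_t: the set of flag
// characters seen so far within the current conversion.
class flag_set
{
 public:
  flag_set () { m_chars[0] = '\0'; }

  // Has CH already been recorded?  '\0' is rejected explicitly,
  // since strchr would otherwise match the terminator.
  bool has_char_p (char ch) const
  {
    return ch != '\0' && std::strchr (m_chars, ch) != nullptr;
  }

  // Record CH, silently ignoring it if the buffer is full.
  void add_char (char ch)
  {
    std::size_t len = std::strlen (m_chars);
    if (len + 1 < sizeof m_chars)
      {
	m_chars[len] = ch;
	m_chars[len + 1] = '\0';
      }
  }

 private:
  char m_chars[256];
};

// Hypothetical miniature of argument_parser.  It keeps a reference
// to the caller's cursor, so every method advances it in place.
class mini_parser
{
 public:
  mini_parser (const char *&cursor, flag_set &flags)
  : m_cursor (cursor), m_flags (flags) {}

  // Consume flag characters.  Return false on a repeated flag
  // (a simplification: the real code warns and keeps going).
  bool read_flags ()
  {
    while (*m_cursor != '\0'
	   && std::strchr ("-+ #0", *m_cursor) != nullptr)
      {
	if (m_flags.has_char_p (*m_cursor))
	  return false;
	m_flags.add_char (*m_cursor);
	++m_cursor;
      }
    return true;
  }

  // Consume a decimal field width, if there is one.
  void read_width ()
  {
    while (*m_cursor >= '0' && *m_cursor <= '9')
      ++m_cursor;
  }

 private:
  const char *&m_cursor;
  flag_set &m_flags;
};

// Parse what follows the '%' of one conversion.  Return the
// conversion character, or '\0' if parsing was abandoned.
char parse_one (const char *&cursor)
{
  flag_set flags;
  mini_parser parser (cursor, flags);
  if (!parser.read_flags ())
    return '\0';
  parser.read_width ();
  return *cursor;
}
```

With this arrangement the caller's loop body shrinks to a sequence of
short method calls, each of which can bail out early, which is roughly
how the ~700 lines become ~100.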

Unfortunately, so much changes in this patch that the before/after
diff is hard to read.  If you like the end-result, but would prefer
better history I could try to split this up into a more readable set
of patches.  (I have such a split, but the patches are messy.)

Successfully bootstrapped and regrtested the updated patch on
x86_64-pc-linux-gnu.

OK for trunk?

gcc/c-family/ChangeLog:
	* c-format.c (class flag_chars_t): New class.
	(struct length_modifier): New struct.
	(class argument_parser): New class.
	(flag_chars_t::flag_chars_t): New ctor.
	(flag_chars_t::has_char_p): New method.
	(flag_chars_t::add_char): New method.
	(flag_chars_t::validate): New method.
	(flag_chars_t::get_alloc_flag): New method.
	(flag_chars_t::assignment_suppression_p): New method.
	(argument_parser::argument_parser): New ctor.
	(argument_parser::read_any_dollar): New method.
	(argument_parser::read_format_flags): New method.
	(argument_parser::read_any_format_width): New method.
	(argument_parser::read_any_format_left_precision): New method.
	(argument_parser::read_any_format_precision): New method.
	(argument_parser::handle_alloc_chars): New method.
	(argument_parser::read_any_length_modifier): New method.
	(argument_parser::read_any_other_modifier): New method.
	(argument_parser::find_format_char_info): New method.
	(argument_parser::validate_flag_pairs): New method.
	(argument_parser::give_y2k_warnings): New method.
	(argument_parser::parse_any_scan_set): New method.
	(argument_parser::handle_conversions): New method.
	(argument_parser::check_argument_type): New method.
	(check_format_info_main): Introduce classes argument_parser
	and flag_chars_t, moving the code within the loop into methods
	of these classes.  Make various locals "const".
---
 gcc/c-family/c-format.c | 1655 +++++++++++++++++++++++++++++------------------
 1 file changed, 1019 insertions(+), 636 deletions(-)

diff --git a/gcc/c-family/c-format.c b/gcc/c-family/c-format.c
index c19c411..92d2c1c 100644
--- a/gcc/c-family/c-format.c
+++ b/gcc/c-family/c-format.c
@@ -1688,740 +1688,1123 @@ check_format_arg (void *ctx, tree format_tree,
 			  params, arg_num, fwt_pool);
 }
 
+/* Support class for argument_parser and check_format_info_main.
+   Tracks any flag characters that have been applied to the
+   current argument.  */
 
-/* Do the main part of checking a call to a format function.  FORMAT_CHARS
-   is the NUL-terminated format string (which at this point may contain
-   internal NUL characters); FORMAT_LENGTH is its length (excluding the
-   terminating NUL character).  ARG_NUM is one less than the number of
-   the first format argument to check; PARAMS points to that format
-   argument in the list of arguments.  */
+class flag_chars_t
+{
+ public:
+  flag_chars_t ();
+  bool has_char_p (char ch) const;
+  void add_char (char ch);
+  void validate (const format_kind_info *fki,
+		 const format_char_info *fci,
+		 const format_flag_spec *flag_specs,
+		 const char * const format_chars,
+		 location_t format_string_loc,
+		 const char * const orig_format_chars,
+		 char format_char);
+  int get_alloc_flag (const format_kind_info *fki);
+  int assignment_suppression_p (const format_kind_info *fki);
+
+ private:
+  char m_flag_chars[256];
+};
 
-static void
-check_format_info_main (format_check_results *res,
-			function_format_info *info, const char *format_chars,
-			int format_length, tree params,
-			unsigned HOST_WIDE_INT arg_num,
-			object_allocator <format_wanted_type> &fwt_pool)
+/* Support struct for argument_parser and check_format_info_main.
+   Encapsulates any length modifier applied to the current argument.  */
+
+struct length_modifier
 {
-  const char *orig_format_chars = format_chars;
-  tree first_fillin_param = params;
+  length_modifier ()
+  : chars (NULL), val (FMT_LEN_none), std (STD_C89),
+    scalar_identity_flag (0)
+  {
+  }
 
-  const format_kind_info *fki = &format_types[info->format_type];
-  const format_flag_spec *flag_specs = fki->flag_specs;
-  const format_flag_pair *bad_flag_pairs = fki->bad_flag_pairs;
-  location_t format_string_loc = res->format_string_loc;
+  length_modifier (const char *chars_,
+		   enum format_lengths val_,
+		   enum format_std_version std_,
+		   int scalar_identity_flag_)
+  : chars (chars_), val (val_), std (std_),
+    scalar_identity_flag (scalar_identity_flag_)
+  {
+  }
 
-  /* -1 if no conversions taking an operand have been found; 0 if one has
-     and it didn't use $; 1 if $ formats are in use.  */
-  int has_operand_number = -1;
+  const char *chars;
+  enum format_lengths val;
+  enum format_std_version std;
+  int scalar_identity_flag;
+};
 
-  init_dollar_format_checking (info->first_arg_num, first_fillin_param);
+/* Parsing one argument within a format string.  */
 
-  while (*format_chars != 0)
-    {
-      int i;
-      int suppressed = FALSE;
-      const char *length_chars = NULL;
-      enum format_lengths length_chars_val = FMT_LEN_none;
-      enum format_std_version length_chars_std = STD_C89;
-      int format_char;
-      tree cur_param;
-      tree wanted_type;
-      int main_arg_num = 0;
-      tree main_arg_params = 0;
-      enum format_std_version wanted_type_std;
-      const char *wanted_type_name;
-      format_wanted_type width_wanted_type;
-      format_wanted_type precision_wanted_type;
-      format_wanted_type main_wanted_type;
-      format_wanted_type *first_wanted_type = NULL;
-      format_wanted_type *last_wanted_type = NULL;
-      const format_length_info *fli = NULL;
-      const format_char_info *fci = NULL;
-      char flag_chars[256];
-      int alloc_flag = 0;
-      int scalar_identity_flag = 0;
-      const char *format_start;
+class argument_parser
+{
+ public:
+  argument_parser (function_format_info *info, const char *&format_chars,
+		   const char * const orig_format_chars,
+		   location_t format_string_loc, flag_chars_t &flag_chars,
+		   int &has_operand_number, tree first_fillin_param,
+		   object_allocator <format_wanted_type> &fwt_pool_);
+
+  bool read_any_dollar ();
+
+  bool read_format_flags ();
+
+  bool
+  read_any_format_width (tree &params,
+			 unsigned HOST_WIDE_INT &arg_num);
+
+  void
+  read_any_format_left_precision ();
+
+  bool
+  read_any_format_precision (tree &params,
+			     unsigned HOST_WIDE_INT &arg_num);
+
+  void handle_alloc_chars ();
+
+  length_modifier read_any_length_modifier ();
+
+  void read_any_other_modifier ();
+
+  const format_char_info *find_format_char_info (char format_char);
+
+  void
+  validate_flag_pairs (const format_char_info *fci,
+		       char format_char);
+
+  void
+  give_y2k_warnings (const format_char_info *fci,
+		     char format_char);
+
+  void parse_any_scan_set (const format_char_info *fci);
+
+  bool handle_conversions (const format_char_info *fci,
+			   const length_modifier &len_modifier,
+			   tree &wanted_type,
+			   const char *&wanted_type_name,
+			   unsigned HOST_WIDE_INT &arg_num,
+			   tree &params,
+			   char format_char);
+
+  bool
+  check_argument_type (const format_char_info *fci,
+		       const length_modifier &len_modifier,
+		       tree &wanted_type,
+		       const char *&wanted_type_name,
+		       const bool suppressed,
+		       unsigned HOST_WIDE_INT &arg_num,
+		       tree &params,
+		       const int alloc_flag,
+		       const char * const format_start);
+
+ private:
+  const function_format_info *const info;
+  const format_kind_info * const fki;
+  const format_flag_spec * const flag_specs;
+  const char *&format_chars;
+  const char * const orig_format_chars;
+  const location_t format_string_loc;
+  object_allocator <format_wanted_type> &fwt_pool;
+  flag_chars_t &flag_chars;
+  int main_arg_num;
+  tree main_arg_params;
+  int &has_operand_number;
+  const tree first_fillin_param;
+  format_wanted_type width_wanted_type;
+  format_wanted_type precision_wanted_type;
+ public:
+  format_wanted_type main_wanted_type;
+ private:
+  format_wanted_type *first_wanted_type;
+  format_wanted_type *last_wanted_type;
+};
 
-      if (*format_chars++ != '%')
+/* flag_chars_t's constructor.  */
+
+flag_chars_t::flag_chars_t ()
+{
+  m_flag_chars[0] = 0;
+}
+
+/* Has CH been seen as a flag within the current argument?  */
+
+bool
+flag_chars_t::has_char_p (char ch) const
+{
+  return strchr (m_flag_chars, ch) != 0;
+}
+
+/* Add CH to the flags seen within the current argument.  */
+
+void
+flag_chars_t::add_char (char ch)
+{
+  int i = strlen (m_flag_chars);
+  m_flag_chars[i++] = ch;
+  m_flag_chars[i] = 0;
+}
+
+/* Validate the individual flags used, removing any that are invalid.  */
+
+void
+flag_chars_t::validate (const format_kind_info *fki,
+			const format_char_info *fci,
+			const format_flag_spec *flag_specs,
+			const char * const format_chars,
+			location_t format_string_loc,
+			const char * const orig_format_chars,
+			char format_char)
+{
+  int i;
+  int d = 0;
+  for (i = 0; m_flag_chars[i] != 0; i++)
+    {
+      const format_flag_spec *s = get_flag_spec (flag_specs,
+						 m_flag_chars[i], NULL);
+      m_flag_chars[i - d] = m_flag_chars[i];
+      if (m_flag_chars[i] == fki->length_code_char)
 	continue;
-      if (*format_chars == 0)
+      if (strchr (fci->flag_chars, m_flag_chars[i]) == 0)
 	{
-          warning_at (location_from_offset (format_string_loc,
-					    format_chars - orig_format_chars),
-		      OPT_Wformat_,
-		      "spurious trailing %<%%%> in format");
+	  warning_at (location_from_offset (format_string_loc,
+					    format_chars
+					    - orig_format_chars),
+		      OPT_Wformat_, "%s used with %<%%%c%> %s format",
+		      _(s->name), format_char, fki->name);
+	  d++;
 	  continue;
 	}
-      if (*format_chars == '%')
+      if (pedantic)
+	{
+	  const format_flag_spec *t;
+	  if (ADJ_STD (s->std) > C_STD_VER)
+	    warning_at (format_string_loc, OPT_Wformat_,
+			"%s does not support %s",
+			C_STD_NAME (s->std), _(s->long_name));
+	  t = get_flag_spec (flag_specs, m_flag_chars[i], fci->flags2);
+	  if (t != NULL && ADJ_STD (t->std) > ADJ_STD (s->std))
+	    {
+	      const char *long_name = (t->long_name != NULL
+				       ? t->long_name
+				       : s->long_name);
+	      if (ADJ_STD (t->std) > C_STD_VER)
+		warning_at (format_string_loc, OPT_Wformat_,
+			    "%s does not support %s with"
+			    " the %<%%%c%> %s format",
+			    C_STD_NAME (t->std), _(long_name),
+			    format_char, fki->name);
+	    }
+	}
+    }
+  m_flag_chars[i - d] = 0;
+}
+
+/* Determine if an assignment-allocation has been set, requiring
+   an extra char ** for writing back a dynamically-allocated char *.
+   This is for handling the optional 'm' character in scanf.  */
+
+int
+flag_chars_t::get_alloc_flag (const format_kind_info *fki)
+{
+  if ((fki->flags & (int) FMT_FLAG_SCANF_A_KLUDGE)
+      && has_char_p ('a'))
+    return 1;
+  if (fki->alloc_char && has_char_p (fki->alloc_char))
+    return 1;
+  return 0;
+}
+
+/* Determine if an assignment-suppression character was seen.
+   ('*' in scanf, for discarding the converted input).  */
+
+int
+flag_chars_t::assignment_suppression_p (const format_kind_info *fki)
+{
+  if (fki->suppression_char
+      && has_char_p (fki->suppression_char))
+    return 1;
+  return 0;
+}
+
+/* Constructor for argument_parser.  Initialize for parsing one
+   argument within a format string.  */
+
+argument_parser::
+argument_parser (function_format_info *info_, const char *&format_chars_,
+		 const char * const orig_format_chars_,
+		 location_t format_string_loc_,
+		 flag_chars_t &flag_chars_,
+		 int &has_operand_number_,
+		 tree first_fillin_param_,
+		 object_allocator <format_wanted_type> &fwt_pool_)
+: info (info_),
+  fki (&format_types[info->format_type]),
+  flag_specs (fki->flag_specs),
+  format_chars (format_chars_),
+  orig_format_chars (orig_format_chars_),
+  format_string_loc (format_string_loc_),
+  fwt_pool (fwt_pool_),
+  flag_chars (flag_chars_),
+  main_arg_num (0),
+  main_arg_params (NULL),
+  has_operand_number (has_operand_number_),
+  first_fillin_param (first_fillin_param_),
+  first_wanted_type (NULL),
+  last_wanted_type (NULL)
+{
+}
+
+/* Handle dollars at the start of format arguments, setting up main_arg_params
+   and main_arg_num.
+
+   Return true if format parsing is to continue, false otherwise.  */
+
+bool
+argument_parser::read_any_dollar ()
+{
+  if ((fki->flags & (int) FMT_FLAG_USE_DOLLAR) && has_operand_number != 0)
+    {
+      /* Possibly read a $ operand number at the start of the format.
+	 If one was previously used, one is required here.  If one
+	 is not used here, we can't immediately conclude this is a
+	 format without them, since it could be printf %m or scanf %*.  */
+      int opnum;
+      opnum = maybe_read_dollar_number (&format_chars, 0,
+					first_fillin_param,
+					&main_arg_params, fki);
+      if (opnum == -1)
+	return false;
+      else if (opnum > 0)
+	{
+	  has_operand_number = 1;
+	  main_arg_num = opnum + info->first_arg_num - 1;
+	}
+    }
+  else if (fki->flags & FMT_FLAG_USE_DOLLAR)
+    {
+      if (avoid_dollar_number (format_chars))
+	return false;
+    }
+  return true;
+}
+
+/* Read any format flags, but do not yet validate them beyond removing
+   duplicates, since in general validation depends on the rest of
+   the format.
+
+   Return true if format parsing is to continue, false otherwise.  */
+
+bool
+argument_parser::read_format_flags ()
+{
+  while (*format_chars != 0
+	 && strchr (fki->flag_chars, *format_chars) != 0)
+    {
+      const format_flag_spec *s = get_flag_spec (flag_specs,
+						 *format_chars, NULL);
+      if (flag_chars.has_char_p (*format_chars))
+	{
+	  warning_at (location_from_offset (format_string_loc,
+					    format_chars + 1
+					    - orig_format_chars),
+		      OPT_Wformat_,
+		      "repeated %s in format", _(s->name));
+	}
+      else
+	flag_chars.add_char (*format_chars);
+
+      if (s->skip_next_char)
 	{
 	  ++format_chars;
-	  continue;
+	  if (*format_chars == 0)
+	    {
+	      warning_at (format_string_loc, OPT_Wformat_,
+			  "missing fill character at end of strfmon format");
+	      return false;
+	    }
 	}
-      flag_chars[0] = 0;
+      ++format_chars;
+    }
+
+  return true;
+}
+
+/* Read any format width, possibly * or *m$.
+
+   Return true if format parsing is to continue, false otherwise.  */
+
+bool
+argument_parser::
+read_any_format_width (tree &params,
+		       unsigned HOST_WIDE_INT &arg_num)
+{
+  if (!fki->width_char)
+    return true;
 
-      if ((fki->flags & (int) FMT_FLAG_USE_DOLLAR) && has_operand_number != 0)
+  if (fki->width_type != NULL && *format_chars == '*')
+    {
+      flag_chars.add_char (fki->width_char);
+      /* "...a field width...may be indicated by an asterisk.
+	 In this case, an int argument supplies the field width..."  */
+      ++format_chars;
+      if (has_operand_number != 0)
 	{
-	  /* Possibly read a $ operand number at the start of the format.
-	     If one was previously used, one is required here.  If one
-	     is not used here, we can't immediately conclude this is a
-	     format without them, since it could be printf %m or scanf %*.  */
 	  int opnum;
-	  opnum = maybe_read_dollar_number (&format_chars, 0,
+	  opnum = maybe_read_dollar_number (&format_chars,
+					    has_operand_number == 1,
 					    first_fillin_param,
-					    &main_arg_params, fki);
+					    &params, fki);
 	  if (opnum == -1)
-	    return;
+	    return false;
 	  else if (opnum > 0)
 	    {
 	      has_operand_number = 1;
-	      main_arg_num = opnum + info->first_arg_num - 1;
+	      arg_num = opnum + info->first_arg_num - 1;
 	    }
+	  else
+	    has_operand_number = 0;
 	}
-      else if (fki->flags & FMT_FLAG_USE_DOLLAR)
+      else
 	{
 	  if (avoid_dollar_number (format_chars))
-	    return;
+	    return false;
 	}
-
-      /* Read any format flags, but do not yet validate them beyond removing
-	 duplicates, since in general validation depends on the rest of
-	 the format.  */
-      while (*format_chars != 0
-	     && strchr (fki->flag_chars, *format_chars) != 0)
+      if (info->first_arg_num != 0)
 	{
-	  const format_flag_spec *s = get_flag_spec (flag_specs,
-						     *format_chars, NULL);
-	  if (strchr (flag_chars, *format_chars) != 0)
-	    {
-	      warning_at (location_from_offset (format_string_loc,
-						format_chars + 1
-						- orig_format_chars),
-			  OPT_Wformat_,
-			  "repeated %s in format", _(s->name));
-	    }
+	  tree cur_param;
+	  if (params == 0)
+	    cur_param = NULL;
 	  else
 	    {
-	      i = strlen (flag_chars);
-	      flag_chars[i++] = *format_chars;
-	      flag_chars[i] = 0;
-	    }
-	  if (s->skip_next_char)
-	    {
-	      ++format_chars;
-	      if (*format_chars == 0)
+	      cur_param = TREE_VALUE (params);
+	      if (has_operand_number <= 0)
 		{
-		  warning_at (format_string_loc, OPT_Wformat_,
-			      "missing fill character at end of strfmon format");
-		  return;
+		  params = TREE_CHAIN (params);
+		  ++arg_num;
 		}
 	    }
+	  width_wanted_type.wanted_type = *fki->width_type;
+	  width_wanted_type.wanted_type_name = NULL;
+	  width_wanted_type.pointer_count = 0;
+	  width_wanted_type.char_lenient_flag = 0;
+	  width_wanted_type.scalar_identity_flag = 0;
+	  width_wanted_type.writing_in_flag = 0;
+	  width_wanted_type.reading_from_flag = 0;
+	  width_wanted_type.kind = CF_KIND_FIELD_WIDTH;
+	  width_wanted_type.format_start = format_chars - 1;
+	  width_wanted_type.format_length = 1;
+	  width_wanted_type.param = cur_param;
+	  width_wanted_type.arg_num = arg_num;
+	  width_wanted_type.offset_loc =
+	    format_chars - orig_format_chars;
+	  width_wanted_type.next = NULL;
+	  if (last_wanted_type != 0)
+	    last_wanted_type->next = &width_wanted_type;
+	  if (first_wanted_type == 0)
+	    first_wanted_type = &width_wanted_type;
+	  last_wanted_type = &width_wanted_type;
+	}
+    }
+  else
+    {
+      /* Possibly read a numeric width.  If the width is zero,
+	 we complain if appropriate.  */
+      int non_zero_width_char = FALSE;
+      int found_width = FALSE;
+      while (ISDIGIT (*format_chars))
+	{
+	  found_width = TRUE;
+	  if (*format_chars != '0')
+	    non_zero_width_char = TRUE;
 	  ++format_chars;
 	}
+      if (found_width && !non_zero_width_char &&
+	  (fki->flags & (int) FMT_FLAG_ZERO_WIDTH_BAD))
+	warning_at (format_string_loc, OPT_Wformat_,
+		    "zero width in %s format", fki->name);
+      if (found_width)
+	flag_chars.add_char (fki->width_char);
+    }
 
-      /* Read any format width, possibly * or *m$.  */
-      if (fki->width_char != 0)
+  return true;
+}
+
+/* Read any format left precision (must be a number, not *).  */
+void
+argument_parser::read_any_format_left_precision ()
+{
+  if (fki->left_precision_char == 0)
+    return;
+  if (*format_chars != '#')
+    return;
+
+  ++format_chars;
+  flag_chars.add_char (fki->left_precision_char);
+  if (!ISDIGIT (*format_chars))
+    warning_at (location_from_offset (format_string_loc,
+				      format_chars - orig_format_chars),
+		OPT_Wformat_,
+		"empty left precision in %s format", fki->name);
+  while (ISDIGIT (*format_chars))
+    ++format_chars;
+}
+
+/* Read any format precision, possibly * or *m$.
+
+   Return true if format parsing is to continue, false otherwise.  */
+
+bool
+argument_parser::
+read_any_format_precision (tree &params,
+			   unsigned HOST_WIDE_INT &arg_num)
+{
+  if (fki->precision_char == 0)
+    return true;
+  if (*format_chars != '.')
+    return true;
+
+  ++format_chars;
+  flag_chars.add_char (fki->precision_char);
+  if (fki->precision_type != NULL && *format_chars == '*')
+    {
+      /* "...a...precision...may be indicated by an asterisk.
+	 In this case, an int argument supplies the...precision."  */
+      ++format_chars;
+      if (has_operand_number != 0)
 	{
-	  if (fki->width_type != NULL && *format_chars == '*')
+	  int opnum;
+	  opnum = maybe_read_dollar_number (&format_chars,
+					    has_operand_number == 1,
+					    first_fillin_param,
+					    &params, fki);
+	  if (opnum == -1)
+	    return false;
+	  else if (opnum > 0)
 	    {
-	      i = strlen (flag_chars);
-	      flag_chars[i++] = fki->width_char;
-	      flag_chars[i] = 0;
-	      /* "...a field width...may be indicated by an asterisk.
-		 In this case, an int argument supplies the field width..."  */
-	      ++format_chars;
-	      if (has_operand_number != 0)
-		{
-		  int opnum;
-		  opnum = maybe_read_dollar_number (&format_chars,
-						    has_operand_number == 1,
-						    first_fillin_param,
-						    &params, fki);
-		  if (opnum == -1)
-		    return;
-		  else if (opnum > 0)
-		    {
-		      has_operand_number = 1;
-		      arg_num = opnum + info->first_arg_num - 1;
-		    }
-		  else
-		    has_operand_number = 0;
-		}
-	      else
-		{
-		  if (avoid_dollar_number (format_chars))
-		    return;
-		}
-	      if (info->first_arg_num != 0)
-		{
-		  if (params == 0)
-                    cur_param = NULL;
-                  else
-                    {
-                      cur_param = TREE_VALUE (params);
-                      if (has_operand_number <= 0)
-                        {
-                          params = TREE_CHAIN (params);
-                          ++arg_num;
-                        }
-                    }
-		  width_wanted_type.wanted_type = *fki->width_type;
-		  width_wanted_type.wanted_type_name = NULL;
-		  width_wanted_type.pointer_count = 0;
-		  width_wanted_type.char_lenient_flag = 0;
-		  width_wanted_type.scalar_identity_flag = 0;
-		  width_wanted_type.writing_in_flag = 0;
-		  width_wanted_type.reading_from_flag = 0;
-                  width_wanted_type.kind = CF_KIND_FIELD_WIDTH;
-		  width_wanted_type.format_start = format_chars - 1;
-		  width_wanted_type.format_length = 1;
-		  width_wanted_type.param = cur_param;
-		  width_wanted_type.arg_num = arg_num;
-		  width_wanted_type.offset_loc =
-		    format_chars - orig_format_chars;
-		  width_wanted_type.next = NULL;
-		  if (last_wanted_type != 0)
-		    last_wanted_type->next = &width_wanted_type;
-		  if (first_wanted_type == 0)
-		    first_wanted_type = &width_wanted_type;
-		  last_wanted_type = &width_wanted_type;
-		}
+	      has_operand_number = 1;
+	      arg_num = opnum + info->first_arg_num - 1;
 	    }
 	  else
-	    {
-	      /* Possibly read a numeric width.  If the width is zero,
-		 we complain if appropriate.  */
-	      int non_zero_width_char = FALSE;
-	      int found_width = FALSE;
-	      while (ISDIGIT (*format_chars))
-		{
-		  found_width = TRUE;
-		  if (*format_chars != '0')
-		    non_zero_width_char = TRUE;
-		  ++format_chars;
-		}
-	      if (found_width && !non_zero_width_char &&
-		  (fki->flags & (int) FMT_FLAG_ZERO_WIDTH_BAD))
-		warning_at (format_string_loc, OPT_Wformat_,
-			    "zero width in %s format", fki->name);
-	      if (found_width)
-		{
-		  i = strlen (flag_chars);
-		  flag_chars[i++] = fki->width_char;
-		  flag_chars[i] = 0;
-		}
-	    }
+	    has_operand_number = 0;
 	}
-
-      /* Read any format left precision (must be a number, not *).  */
-      if (fki->left_precision_char != 0 && *format_chars == '#')
+      else
 	{
-	  ++format_chars;
-	  i = strlen (flag_chars);
-	  flag_chars[i++] = fki->left_precision_char;
-	  flag_chars[i] = 0;
-	  if (!ISDIGIT (*format_chars))
-	    warning_at (location_from_offset (format_string_loc,
-					      format_chars - orig_format_chars),
-			OPT_Wformat_,
-			"empty left precision in %s format", fki->name);
-	  while (ISDIGIT (*format_chars))
-	    ++format_chars;
+	  if (avoid_dollar_number (format_chars))
+	    return false;
 	}
-
-      /* Read any format precision, possibly * or *m$.  */
-      if (fki->precision_char != 0 && *format_chars == '.')
+      if (info->first_arg_num != 0)
 	{
-	  ++format_chars;
-	  i = strlen (flag_chars);
-	  flag_chars[i++] = fki->precision_char;
-	  flag_chars[i] = 0;
-	  if (fki->precision_type != NULL && *format_chars == '*')
+	  tree cur_param;
+	  if (params == 0)
+	    cur_param = NULL;
+	  else
 	    {
-	      /* "...a...precision...may be indicated by an asterisk.
-		 In this case, an int argument supplies the...precision."  */
-	      ++format_chars;
-	      if (has_operand_number != 0)
+	      cur_param = TREE_VALUE (params);
+	      if (has_operand_number <= 0)
 		{
-		  int opnum;
-		  opnum = maybe_read_dollar_number (&format_chars,
-						    has_operand_number == 1,
-						    first_fillin_param,
-						    &params, fki);
-		  if (opnum == -1)
-		    return;
-		  else if (opnum > 0)
-		    {
-		      has_operand_number = 1;
-		      arg_num = opnum + info->first_arg_num - 1;
-		    }
-		  else
-		    has_operand_number = 0;
-		}
-	      else
-		{
-		  if (avoid_dollar_number (format_chars))
-		    return;
-		}
-	      if (info->first_arg_num != 0)
-		{
-		  if (params == 0)
-                    cur_param = NULL;
-                  else
-                    {
-                      cur_param = TREE_VALUE (params);
-                      if (has_operand_number <= 0)
-                        {
-                          params = TREE_CHAIN (params);
-                          ++arg_num;
-                        }
-                    }
-		  precision_wanted_type.wanted_type = *fki->precision_type;
-		  precision_wanted_type.wanted_type_name = NULL;
-		  precision_wanted_type.pointer_count = 0;
-		  precision_wanted_type.char_lenient_flag = 0;
-		  precision_wanted_type.scalar_identity_flag = 0;
-		  precision_wanted_type.writing_in_flag = 0;
-		  precision_wanted_type.reading_from_flag = 0;
-                  precision_wanted_type.kind = CF_KIND_FIELD_PRECISION;
-		  precision_wanted_type.param = cur_param;
-		  precision_wanted_type.format_start = format_chars - 2;
-		  precision_wanted_type.format_length = 2;
-		  precision_wanted_type.arg_num = arg_num;
-		  precision_wanted_type.offset_loc =
-		    format_chars - orig_format_chars;
-		  precision_wanted_type.next = NULL;
-		  if (last_wanted_type != 0)
-		    last_wanted_type->next = &precision_wanted_type;
-		  if (first_wanted_type == 0)
-		    first_wanted_type = &precision_wanted_type;
-		  last_wanted_type = &precision_wanted_type;
+		  params = TREE_CHAIN (params);
+		  ++arg_num;
 		}
 	    }
-	  else
-	    {
-	      if (!(fki->flags & (int) FMT_FLAG_EMPTY_PREC_OK)
-		  && !ISDIGIT (*format_chars))
-		warning_at (location_from_offset (format_string_loc,
-						  format_chars - orig_format_chars),
-			    OPT_Wformat_,
-			    "empty precision in %s format", fki->name);
-	      while (ISDIGIT (*format_chars))
-		++format_chars;
-	    }
+	  precision_wanted_type.wanted_type = *fki->precision_type;
+	  precision_wanted_type.wanted_type_name = NULL;
+	  precision_wanted_type.pointer_count = 0;
+	  precision_wanted_type.char_lenient_flag = 0;
+	  precision_wanted_type.scalar_identity_flag = 0;
+	  precision_wanted_type.writing_in_flag = 0;
+	  precision_wanted_type.reading_from_flag = 0;
+	  precision_wanted_type.kind = CF_KIND_FIELD_PRECISION;
+	  precision_wanted_type.param = cur_param;
+	  precision_wanted_type.format_start = format_chars - 2;
+	  precision_wanted_type.format_length = 2;
+	  precision_wanted_type.arg_num = arg_num;
+	  precision_wanted_type.offset_loc =
+	    format_chars - orig_format_chars;
+	  precision_wanted_type.next = NULL;
+	  if (last_wanted_type != 0)
+	    last_wanted_type->next = &precision_wanted_type;
+	  if (first_wanted_type == 0)
+	    first_wanted_type = &precision_wanted_type;
+	  last_wanted_type = &precision_wanted_type;
 	}
+    }
+  else
+    {
+      if (!(fki->flags & (int) FMT_FLAG_EMPTY_PREC_OK)
+	  && !ISDIGIT (*format_chars))
+	warning_at (location_from_offset (format_string_loc,
+					  format_chars - orig_format_chars),
+		    OPT_Wformat_,
+		    "empty precision in %s format", fki->name);
+      while (ISDIGIT (*format_chars))
+	++format_chars;
+    }
 
-      format_start = format_chars;
-      if (fki->alloc_char && fki->alloc_char == *format_chars)
-	{
-	  i = strlen (flag_chars);
-	  flag_chars[i++] = fki->alloc_char;
-	  flag_chars[i] = 0;
-	  format_chars++;
-	}
+  return true;
+}
 
-      /* Handle the scanf allocation kludge.  */
-      if (fki->flags & (int) FMT_FLAG_SCANF_A_KLUDGE)
+/* Parse any assignment-allocation flags, which request an extra
+   char ** for writing back a dynamically-allocated char *.
+   This is for handling the optional 'm' character in scanf,
+   and, before C99, 'a' (for compatibility with a non-standard
+   GNU libc extension).  */
+
+void
+argument_parser::handle_alloc_chars ()
+{
+  if (fki->alloc_char && fki->alloc_char == *format_chars)
+    {
+      flag_chars.add_char (fki->alloc_char);
+      format_chars++;
+    }
+
+  /* Handle the scanf allocation kludge.  */
+  if (fki->flags & (int) FMT_FLAG_SCANF_A_KLUDGE)
+    {
+      if (*format_chars == 'a' && !flag_isoc99)
 	{
-	  if (*format_chars == 'a' && !flag_isoc99)
+	  if (format_chars[1] == 's' || format_chars[1] == 'S'
+	      || format_chars[1] == '[')
 	    {
-	      if (format_chars[1] == 's' || format_chars[1] == 'S'
-		  || format_chars[1] == '[')
-		{
-		  /* 'a' is used as a flag.  */
-		  i = strlen (flag_chars);
-		  flag_chars[i++] = 'a';
-		  flag_chars[i] = 0;
-		  format_chars++;
-		}
+	      /* 'a' is used as a flag.  */
+	      flag_chars.add_char ('a');
+	      format_chars++;
 	    }
 	}
+    }
+}
 
-      /* Read any length modifier, if this kind of format has them.  */
-      fli = fki->length_char_specs;
-      length_chars = NULL;
-      length_chars_val = FMT_LEN_none;
-      length_chars_std = STD_C89;
-      scalar_identity_flag = 0;
-      if (fli)
+/* Look for length modifiers within the current format argument,
+   returning a length_modifier instance describing it (or the
+   default if one is not found).
+
+   Issue warnings about non-standard modifiers.  */
+
+length_modifier
+argument_parser::read_any_length_modifier ()
+{
+  length_modifier result;
+
+  const format_length_info *fli = fki->length_char_specs;
+  if (!fli)
+    return result;
+
+  while (fli->name != 0
+	 && strncmp (fli->name, format_chars, strlen (fli->name)))
+    fli++;
+  if (fli->name != 0)
+    {
+      format_chars += strlen (fli->name);
+      if (fli->double_name != 0 && fli->name[0] == *format_chars)
 	{
-	  while (fli->name != 0
- 		 && strncmp (fli->name, format_chars, strlen (fli->name)))
-	      fli++;
-	  if (fli->name != 0)
-	    {
- 	      format_chars += strlen (fli->name);
-	      if (fli->double_name != 0 && fli->name[0] == *format_chars)
-		{
-		  format_chars++;
-		  length_chars = fli->double_name;
-		  length_chars_val = fli->double_index;
-		  length_chars_std = fli->double_std;
-		}
-	      else
-		{
-		  length_chars = fli->name;
-		  length_chars_val = fli->index;
-		  length_chars_std = fli->std;
-		  scalar_identity_flag = fli->scalar_identity_flag;
-		}
-	      i = strlen (flag_chars);
-	      flag_chars[i++] = fki->length_code_char;
-	      flag_chars[i] = 0;
-	    }
-	  if (pedantic)
-	    {
-	      /* Warn if the length modifier is non-standard.  */
-	      if (ADJ_STD (length_chars_std) > C_STD_VER)
-		warning_at (format_string_loc, OPT_Wformat_,
-			    "%s does not support the %qs %s length modifier",
-			    C_STD_NAME (length_chars_std), length_chars,
-			    fki->name);
-	    }
+	  format_chars++;
+	  result = length_modifier (fli->double_name, fli->double_index,
+				    fli->double_std, 0);
 	}
-
-      /* Read any modifier (strftime E/O).  */
-      if (fki->modifier_chars != NULL)
+      else
 	{
-	  while (*format_chars != 0
-		 && strchr (fki->modifier_chars, *format_chars) != 0)
-	    {
-	      if (strchr (flag_chars, *format_chars) != 0)
-		{
-		  const format_flag_spec *s = get_flag_spec (flag_specs,
-							     *format_chars, NULL);
-		  warning_at (location_from_offset (format_string_loc,
-						    format_chars 
-						    - orig_format_chars),
-			      OPT_Wformat_,
-			      "repeated %s in format", _(s->name));
-		}
-	      else
-		{
-		  i = strlen (flag_chars);
-		  flag_chars[i++] = *format_chars;
-		  flag_chars[i] = 0;
-		}
-	      ++format_chars;
-	    }
+	  result = length_modifier (fli->name, fli->index, fli->std,
+				    fli->scalar_identity_flag);
 	}
+      flag_chars.add_char (fki->length_code_char);
+    }
+  if (pedantic)
+    {
+      /* Warn if the length modifier is non-standard.  */
+      if (ADJ_STD (result.std) > C_STD_VER)
+	warning_at (format_string_loc, OPT_Wformat_,
+		    "%s does not support the %qs %s length modifier",
+		    C_STD_NAME (result.std), result.chars,
+		    fki->name);
+    }
 
-      format_char = *format_chars;
-      if (format_char == 0
-	  || (!(fki->flags & (int) FMT_FLAG_FANCY_PERCENT_OK)
-	      && format_char == '%'))
+  return result;
+}
+
+/* Read any other modifier (strftime E/O).  */
+
+void
+argument_parser::read_any_other_modifier ()
+{
+  if (fki->modifier_chars == NULL)
+    return;
+
+  while (*format_chars != 0
+	 && strchr (fki->modifier_chars, *format_chars) != 0)
+    {
+      if (flag_chars.has_char_p (*format_chars))
 	{
+	  const format_flag_spec *s = get_flag_spec (flag_specs,
+						     *format_chars, NULL);
 	  warning_at (location_from_offset (format_string_loc,
-					    format_chars - orig_format_chars),
+					    format_chars
+					    - orig_format_chars),
 		      OPT_Wformat_,
-		      "conversion lacks type at end of format");
-	  continue;
+		      "repeated %s in format", _(s->name));
 	}
-      format_chars++;
-      fci = fki->conversion_specs;
-      while (fci->format_chars != 0
-	     && strchr (fci->format_chars, format_char) == 0)
-	  ++fci;
-      if (fci->format_chars == 0)
+      else
+	flag_chars.add_char (*format_chars);
+      ++format_chars;
+    }
+}
+
+/* Return the format_char_info corresponding to FORMAT_CHAR,
+   potentially issuing a warning if the format char is
+   not supported in the C standard version we are checking
+   against.
+
+   Issue a warning and return NULL if it is not found.
+
+   Issue warnings about non-standard modifiers.  */
+
+const format_char_info *
+argument_parser::find_format_char_info (char format_char)
+{
+  const format_char_info *fci = fki->conversion_specs;
+
+  while (fci->format_chars != 0
+	 && strchr (fci->format_chars, format_char) == 0)
+    ++fci;
+  if (fci->format_chars == 0)
+    {
+      if (ISGRAPH (format_char))
+	warning_at (location_from_offset (format_string_loc,
+					  format_chars - orig_format_chars),
+		    OPT_Wformat_,
+		    "unknown conversion type character %qc in format",
+		    format_char);
+      else
+	warning_at (location_from_offset (format_string_loc,
+					  format_chars - orig_format_chars),
+		    OPT_Wformat_,
+		    "unknown conversion type character 0x%x in format",
+		    format_char);
+      return NULL;
+    }
+
+  if (pedantic)
+    {
+      if (ADJ_STD (fci->std) > C_STD_VER)
+	warning_at (location_from_offset (format_string_loc,
+					  format_chars - orig_format_chars),
+		    OPT_Wformat_,
+		    "%s does not support the %<%%%c%> %s format",
+		    C_STD_NAME (fci->std), format_char, fki->name);
+    }
+
+  return fci;
+}
+
+/* Validate the pairs of flags used.
+   Issue warnings about incompatible combinations of flags.  */
+
+void
+argument_parser::validate_flag_pairs (const format_char_info *fci,
+				      char format_char)
+{
+  const format_flag_pair * const bad_flag_pairs = fki->bad_flag_pairs;
+
+  for (int i = 0; bad_flag_pairs[i].flag_char1 != 0; i++)
+    {
+      const format_flag_spec *s, *t;
+      if (!flag_chars.has_char_p (bad_flag_pairs[i].flag_char1))
+	continue;
+      if (!flag_chars.has_char_p (bad_flag_pairs[i].flag_char2))
+	continue;
+      if (bad_flag_pairs[i].predicate != 0
+	  && strchr (fci->flags2, bad_flag_pairs[i].predicate) == 0)
+	continue;
+      s = get_flag_spec (flag_specs, bad_flag_pairs[i].flag_char1, NULL);
+      t = get_flag_spec (flag_specs, bad_flag_pairs[i].flag_char2, NULL);
+      if (bad_flag_pairs[i].ignored)
 	{
-	  if (ISGRAPH (format_char))
-	    warning_at (location_from_offset (format_string_loc,
-					      format_chars - orig_format_chars),
-			OPT_Wformat_,
-			"unknown conversion type character %qc in format",
-			format_char);
+	  if (bad_flag_pairs[i].predicate != 0)
+	    warning_at (format_string_loc, OPT_Wformat_,
+			"%s ignored with %s and %<%%%c%> %s format",
+			_(s->name), _(t->name), format_char,
+			fki->name);
 	  else
-	    warning_at (location_from_offset (format_string_loc,
-					      format_chars - orig_format_chars),
-			OPT_Wformat_,
-			"unknown conversion type character 0x%x in format",
-			format_char);
-	  continue;
+	    warning_at (format_string_loc, OPT_Wformat_,
+			"%s ignored with %s in %s format",
+			_(s->name), _(t->name), fki->name);
 	}
-      if (pedantic)
+      else
 	{
-	  if (ADJ_STD (fci->std) > C_STD_VER)
-	    warning_at (location_from_offset (format_string_loc,
-					      format_chars - orig_format_chars),
-			OPT_Wformat_,
-			"%s does not support the %<%%%c%> %s format",
-			C_STD_NAME (fci->std), format_char, fki->name);
+	  if (bad_flag_pairs[i].predicate != 0)
+	    warning_at (format_string_loc, OPT_Wformat_,
+			"use of %s and %s together with %<%%%c%> %s format",
+			_(s->name), _(t->name), format_char,
+			fki->name);
+	  else
+	    warning_at (format_string_loc, OPT_Wformat_,
+			"use of %s and %s together in %s format",
+			_(s->name), _(t->name), fki->name);
 	}
+    }
+}
 
-      /* Validate the individual flags used, removing any that are invalid.  */
-      {
-	int d = 0;
-	for (i = 0; flag_chars[i] != 0; i++)
-	  {
-	    const format_flag_spec *s = get_flag_spec (flag_specs,
-						       flag_chars[i], NULL);
-	    flag_chars[i - d] = flag_chars[i];
-	    if (flag_chars[i] == fki->length_code_char)
-	      continue;
-	    if (strchr (fci->flag_chars, flag_chars[i]) == 0)
-	      {
-		warning_at (location_from_offset (format_string_loc,
-						  format_chars 
-						  - orig_format_chars),
-			    OPT_Wformat_, "%s used with %<%%%c%> %s format",
-			    _(s->name), format_char, fki->name);
-		d++;
-		continue;
-	      }
-	    if (pedantic)
-	      {
-		const format_flag_spec *t;
-		if (ADJ_STD (s->std) > C_STD_VER)
-		  warning_at (format_string_loc, OPT_Wformat_,
-			      "%s does not support %s",
-                              C_STD_NAME (s->std), _(s->long_name));
-		t = get_flag_spec (flag_specs, flag_chars[i], fci->flags2);
-		if (t != NULL && ADJ_STD (t->std) > ADJ_STD (s->std))
-		  {
-		    const char *long_name = (t->long_name != NULL
-					     ? t->long_name
-					     : s->long_name);
-		    if (ADJ_STD (t->std) > C_STD_VER)
-		      warning_at (format_string_loc, OPT_Wformat_,
-				  "%s does not support %s with the %<%%%c%> %s format",
-				  C_STD_NAME (t->std), _(long_name),
-				  format_char, fki->name);
-		  }
-	      }
-	  }
-	flag_chars[i - d] = 0;
-      }
-
-      if ((fki->flags & (int) FMT_FLAG_SCANF_A_KLUDGE)
-	  && strchr (flag_chars, 'a') != 0)
-	alloc_flag = 1;
-      if (fki->alloc_char && strchr (flag_chars, fki->alloc_char) != 0)
-	alloc_flag = 1;
-
-      if (fki->suppression_char
-	  && strchr (flag_chars, fki->suppression_char) != 0)
-	suppressed = 1;
+/* Give Y2K warnings.  */
 
-      /* Validate the pairs of flags used.  */
-      for (i = 0; bad_flag_pairs[i].flag_char1 != 0; i++)
+void
+argument_parser::give_y2k_warnings (const format_char_info *fci,
+				    char format_char)
+{
+  if (!warn_format_y2k)
+    return;
+
+  int y2k_level = 0;
+  if (strchr (fci->flags2, '4') != 0)
+    if (flag_chars.has_char_p ('E'))
+      y2k_level = 3;
+    else
+      y2k_level = 2;
+  else if (strchr (fci->flags2, '3') != 0)
+    y2k_level = 3;
+  else if (strchr (fci->flags2, '2') != 0)
+    y2k_level = 2;
+  if (y2k_level == 3)
+    warning_at (format_string_loc, OPT_Wformat_y2k,
+		"%<%%%c%> yields only last 2 digits of "
+		"year in some locales", format_char);
+  else if (y2k_level == 2)
+    warning_at (format_string_loc, OPT_Wformat_y2k,
+		"%<%%%c%> yields only last 2 digits of year",
+		format_char);
+}
+
+/* Parse any "scan sets" enclosed in square brackets, e.g.
+   for scanf-style calls.  */
+
+void
+argument_parser::parse_any_scan_set (const format_char_info *fci)
+{
+  if (strchr (fci->flags2, '[') == NULL)
+    return;
+
+  /* Skip over scan set, in case it happens to have '%' in it.  */
+  if (*format_chars == '^')
+    ++format_chars;
+  /* Find closing bracket; if one is hit immediately, then
+     it's part of the scan set rather than a terminator.  */
+  if (*format_chars == ']')
+    ++format_chars;
+  while (*format_chars && *format_chars != ']')
+    ++format_chars;
+  if (*format_chars != ']')
+    /* The end of the format string was reached.  */
+    warning_at (location_from_offset (format_string_loc,
+				      format_chars - orig_format_chars),
+		OPT_Wformat_,
+		"no closing %<]%> for %<%%[%> format");
+}
+
+/* Return true if this argument is to be continued to be parsed,
+   false to skip to next argument.  */
+
+bool
+argument_parser::handle_conversions (const format_char_info *fci,
+				     const length_modifier &len_modifier,
+				     tree &wanted_type,
+				     const char *&wanted_type_name,
+				     unsigned HOST_WIDE_INT &arg_num,
+				     tree &params,
+				     char format_char)
+{
+  enum format_std_version wanted_type_std;
+
+  if (!(fki->flags & (int) FMT_FLAG_ARG_CONVERT))
+    return true;
+
+  wanted_type = (fci->types[len_modifier.val].type
+		 ? *fci->types[len_modifier.val].type : 0);
+  wanted_type_name = fci->types[len_modifier.val].name;
+  wanted_type_std = fci->types[len_modifier.val].std;
+  if (wanted_type == 0)
+    {
+      warning_at (location_from_offset (format_string_loc,
+					format_chars - orig_format_chars),
+		  OPT_Wformat_,
+		  "use of %qs length modifier with %qc type character"
+		  " has either no effect or undefined behavior",
+		  len_modifier.chars, format_char);
+      /* Heuristic: skip one argument when an invalid length/type
+	 combination is encountered.  */
+      arg_num++;
+      if (params != 0)
+	params = TREE_CHAIN (params);
+      return false;
+    }
+  else if (pedantic
+	   /* Warn if non-standard, provided it is more non-standard
+	      than the length and type characters that may already
+	      have been warned for.  */
+	   && ADJ_STD (wanted_type_std) > ADJ_STD (len_modifier.std)
+	   && ADJ_STD (wanted_type_std) > ADJ_STD (fci->std))
+    {
+      if (ADJ_STD (wanted_type_std) > C_STD_VER)
+	warning_at (location_from_offset (format_string_loc,
+					  format_chars - orig_format_chars),
+		    OPT_Wformat_,
+		    "%s does not support the %<%%%s%c%> %s format",
+		    C_STD_NAME (wanted_type_std), len_modifier.chars,
+		    format_char, fki->name);
+    }
+
+  return true;
+}
+
+/* Check type of argument against desired type.
+
+   Return true if format parsing is to continue, false otherwise.  */
+
+bool
+argument_parser::
+check_argument_type (const format_char_info *fci,
+		     const length_modifier &len_modifier,
+		     tree &wanted_type,
+		     const char *&wanted_type_name,
+		     const bool suppressed,
+		     unsigned HOST_WIDE_INT &arg_num,
+		     tree &params,
+		     const int alloc_flag,
+		     const char * const format_start)
+{
+  if (info->first_arg_num == 0)
+    return true;
+
+  if ((fci->pointer_count == 0 && wanted_type == void_type_node)
+      || suppressed)
+    {
+      if (main_arg_num != 0)
 	{
-	  const format_flag_spec *s, *t;
-	  if (strchr (flag_chars, bad_flag_pairs[i].flag_char1) == 0)
-	    continue;
-	  if (strchr (flag_chars, bad_flag_pairs[i].flag_char2) == 0)
-	    continue;
-	  if (bad_flag_pairs[i].predicate != 0
-	      && strchr (fci->flags2, bad_flag_pairs[i].predicate) == 0)
-	    continue;
-	  s = get_flag_spec (flag_specs, bad_flag_pairs[i].flag_char1, NULL);
-	  t = get_flag_spec (flag_specs, bad_flag_pairs[i].flag_char2, NULL);
-	  if (bad_flag_pairs[i].ignored)
-	    {
-	      if (bad_flag_pairs[i].predicate != 0)
-		warning_at (format_string_loc, OPT_Wformat_,
-			    "%s ignored with %s and %<%%%c%> %s format",
-			    _(s->name), _(t->name), format_char,
-			    fki->name);
-	      else
-		warning_at (format_string_loc, OPT_Wformat_,
-			    "%s ignored with %s in %s format",
-			    _(s->name), _(t->name), fki->name);
-	    }
+	  if (suppressed)
+	    warning_at (format_string_loc, OPT_Wformat_,
+			"operand number specified with "
+			"suppressed assignment");
 	  else
-	    {
-	      if (bad_flag_pairs[i].predicate != 0)
-		warning_at (format_string_loc, OPT_Wformat_,
-			    "use of %s and %s together with %<%%%c%> %s format",
-			    _(s->name), _(t->name), format_char,
-			    fki->name);
-	      else
-		warning_at (format_string_loc, OPT_Wformat_,
-			    "use of %s and %s together in %s format",
-			    _(s->name), _(t->name), fki->name);
-	    }
+	    warning_at (format_string_loc, OPT_Wformat_,
+			"operand number specified for format "
+			"taking no argument");
 	}
+    }
+  else
+    {
+      format_wanted_type *wanted_type_ptr;
 
-      /* Give Y2K warnings.  */
-      if (warn_format_y2k)
+      if (main_arg_num != 0)
 	{
-	  int y2k_level = 0;
-	  if (strchr (fci->flags2, '4') != 0)
-	    if (strchr (flag_chars, 'E') != 0)
-	      y2k_level = 3;
-	    else
-	      y2k_level = 2;
-	  else if (strchr (fci->flags2, '3') != 0)
-	    y2k_level = 3;
-	  else if (strchr (fci->flags2, '2') != 0)
-	    y2k_level = 2;
-	  if (y2k_level == 3)
-	    warning_at (format_string_loc, OPT_Wformat_y2k,
-			"%<%%%c%> yields only last 2 digits of "
-			"year in some locales", format_char);
-	  else if (y2k_level == 2)
-	    warning_at (format_string_loc, OPT_Wformat_y2k,
-			"%<%%%c%> yields only last 2 digits of year",
-			format_char);
+	  arg_num = main_arg_num;
+	  params = main_arg_params;
 	}
-
-      if (strchr (fci->flags2, '[') != 0)
+      else
 	{
-	  /* Skip over scan set, in case it happens to have '%' in it.  */
-	  if (*format_chars == '^')
-	    ++format_chars;
-	  /* Find closing bracket; if one is hit immediately, then
-	     it's part of the scan set rather than a terminator.  */
-	  if (*format_chars == ']')
-	    ++format_chars;
-	  while (*format_chars && *format_chars != ']')
-	    ++format_chars;
-	  if (*format_chars != ']')
-	    /* The end of the format string was reached.  */
-	    warning_at (location_from_offset (format_string_loc,
-					      format_chars - orig_format_chars),
-			OPT_Wformat_,
-			"no closing %<]%> for %<%%[%> format");
+	  ++arg_num;
+	  if (has_operand_number > 0)
+	    {
+	      warning_at (format_string_loc, OPT_Wformat_,
+			  "missing $ operand number in format");
+	      return false;
+	    }
+	  else
+	    has_operand_number = 0;
 	}
 
-      wanted_type = 0;
-      wanted_type_name = 0;
-      if (fki->flags & (int) FMT_FLAG_ARG_CONVERT)
+      wanted_type_ptr = &main_wanted_type;
+      while (fci)
 	{
-	  wanted_type = (fci->types[length_chars_val].type
-			 ? *fci->types[length_chars_val].type : 0);
-	  wanted_type_name = fci->types[length_chars_val].name;
-	  wanted_type_std = fci->types[length_chars_val].std;
-	  if (wanted_type == 0)
+	  tree cur_param;
+	  if (params == 0)
+	    cur_param = NULL;
+	  else
 	    {
-	      warning_at (location_from_offset (format_string_loc,
-						format_chars - orig_format_chars),
-			  OPT_Wformat_,
-			  "use of %qs length modifier with %qc type character"
-			  " has either no effect or undefined behavior",
-			  length_chars, format_char);
-	      /* Heuristic: skip one argument when an invalid length/type
-		 combination is encountered.  */
-	      arg_num++;
-	      if (params != 0)
-                params = TREE_CHAIN (params);
-	      continue;
+	      cur_param = TREE_VALUE (params);
+	      params = TREE_CHAIN (params);
 	    }
-	  else if (pedantic
-		   /* Warn if non-standard, provided it is more non-standard
-		      than the length and type characters that may already
-		      have been warned for.  */
-		   && ADJ_STD (wanted_type_std) > ADJ_STD (length_chars_std)
-		   && ADJ_STD (wanted_type_std) > ADJ_STD (fci->std))
+
+	  wanted_type_ptr->wanted_type = wanted_type;
+	  wanted_type_ptr->wanted_type_name = wanted_type_name;
+	  wanted_type_ptr->pointer_count = fci->pointer_count + alloc_flag;
+	  wanted_type_ptr->char_lenient_flag = 0;
+	  if (strchr (fci->flags2, 'c') != 0)
+	    wanted_type_ptr->char_lenient_flag = 1;
+	  wanted_type_ptr->scalar_identity_flag = 0;
+	  if (len_modifier.scalar_identity_flag)
+	    wanted_type_ptr->scalar_identity_flag = 1;
+	  wanted_type_ptr->writing_in_flag = 0;
+	  wanted_type_ptr->reading_from_flag = 0;
+	  if (alloc_flag)
+	    wanted_type_ptr->writing_in_flag = 1;
+	  else
 	    {
-	      if (ADJ_STD (wanted_type_std) > C_STD_VER)
-		warning_at (location_from_offset (format_string_loc,
-						  format_chars - orig_format_chars),
-			    OPT_Wformat_,
-			    "%s does not support the %<%%%s%c%> %s format",
-			    C_STD_NAME (wanted_type_std), length_chars,
-			    format_char, fki->name);
+	      if (strchr (fci->flags2, 'W') != 0)
+		wanted_type_ptr->writing_in_flag = 1;
+	      if (strchr (fci->flags2, 'R') != 0)
+		wanted_type_ptr->reading_from_flag = 1;
+	    }
+	  wanted_type_ptr->kind = CF_KIND_FORMAT;
+	  wanted_type_ptr->param = cur_param;
+	  wanted_type_ptr->arg_num = arg_num;
+	  wanted_type_ptr->format_start = format_start;
+	  wanted_type_ptr->format_length = format_chars - format_start;
+	  wanted_type_ptr->offset_loc = format_chars - orig_format_chars;
+	  wanted_type_ptr->next = NULL;
+	  if (last_wanted_type != 0)
+	    last_wanted_type->next = wanted_type_ptr;
+	  if (first_wanted_type == 0)
+	    first_wanted_type = wanted_type_ptr;
+	  last_wanted_type = wanted_type_ptr;
+
+	  fci = fci->chain;
+	  if (fci)
+	    {
+	      wanted_type_ptr = fwt_pool.allocate ();
+	      arg_num++;
+	      wanted_type = *fci->types[len_modifier.val].type;
+	      wanted_type_name = fci->types[len_modifier.val].name;
 	    }
 	}
+    }
 
-      main_wanted_type.next = NULL;
+  if (first_wanted_type != 0)
+    check_format_types (format_string_loc, first_wanted_type);
 
-      /* Finally. . .check type of argument against desired type!  */
-      if (info->first_arg_num == 0)
+  return true;
+}
+
+/* Do the main part of checking a call to a format function.  FORMAT_CHARS
+   is the NUL-terminated format string (which at this point may contain
+   internal NUL characters); FORMAT_LENGTH is its length (excluding the
+   terminating NUL character).  ARG_NUM is one less than the number of
+   the first format argument to check; PARAMS points to that format
+   argument in the list of arguments.  */
+
+static void
+check_format_info_main (format_check_results *res,
+			function_format_info *info, const char *format_chars,
+			int format_length, tree params,
+			unsigned HOST_WIDE_INT arg_num,
+			object_allocator <format_wanted_type> &fwt_pool)
+{
+  const char * const orig_format_chars = format_chars;
+  const tree first_fillin_param = params;
+
+  const format_kind_info * const fki = &format_types[info->format_type];
+  const format_flag_spec * const flag_specs = fki->flag_specs;
+  const location_t format_string_loc = res->format_string_loc;
+
+  /* -1 if no conversions taking an operand have been found; 0 if one has
+     and it didn't use $; 1 if $ formats are in use.  */
+  int has_operand_number = -1;
+
+  init_dollar_format_checking (info->first_arg_num, first_fillin_param);
+
+  while (*format_chars != 0)
+    {
+      if (*format_chars++ != '%')
 	continue;
-      if ((fci->pointer_count == 0 && wanted_type == void_type_node)
-	  || suppressed)
+      if (*format_chars == 0)
 	{
-	  if (main_arg_num != 0)
-	    {
-	      if (suppressed)
-		warning_at (format_string_loc, OPT_Wformat_,
-			    "operand number specified with "
-			    "suppressed assignment");
-	      else
-		warning_at (format_string_loc, OPT_Wformat_,
-			    "operand number specified for format "
-			    "taking no argument");
-	    }
+          warning_at (location_from_offset (format_string_loc,
+					    format_chars - orig_format_chars),
+		      OPT_Wformat_,
+		      "spurious trailing %<%%%> in format");
+	  continue;
 	}
-      else
+      if (*format_chars == '%')
 	{
-	  format_wanted_type *wanted_type_ptr;
+	  ++format_chars;
+	  continue;
+	}
 
-	  if (main_arg_num != 0)
-	    {
-	      arg_num = main_arg_num;
-	      params = main_arg_params;
-	    }
-	  else
-	    {
-	      ++arg_num;
-	      if (has_operand_number > 0)
-		{
-		  warning_at (format_string_loc, OPT_Wformat_,
-			      "missing $ operand number in format");
-		  return;
-		}
-	      else
-		has_operand_number = 0;
-	    }
+      flag_chars_t flag_chars;
+      argument_parser arg_parser (info, format_chars, orig_format_chars,
+				  format_string_loc,
+				  flag_chars, has_operand_number,
+				  first_fillin_param, fwt_pool);
 
-	  wanted_type_ptr = &main_wanted_type;
-	  while (fci)
-	    {
-	      if (params == 0)
-                cur_param = NULL;
-              else
-                {
-                  cur_param = TREE_VALUE (params);
-                  params = TREE_CHAIN (params);
-                }
-
-	      wanted_type_ptr->wanted_type = wanted_type;
-	      wanted_type_ptr->wanted_type_name = wanted_type_name;
-	      wanted_type_ptr->pointer_count = fci->pointer_count + alloc_flag;
-	      wanted_type_ptr->char_lenient_flag = 0;
-	      if (strchr (fci->flags2, 'c') != 0)
-		wanted_type_ptr->char_lenient_flag = 1;
-	      wanted_type_ptr->scalar_identity_flag = 0;
-	      if (scalar_identity_flag)
-		wanted_type_ptr->scalar_identity_flag = 1;
-	      wanted_type_ptr->writing_in_flag = 0;
-	      wanted_type_ptr->reading_from_flag = 0;
-	      if (alloc_flag)
-		wanted_type_ptr->writing_in_flag = 1;
-	      else
-		{
-		  if (strchr (fci->flags2, 'W') != 0)
-		    wanted_type_ptr->writing_in_flag = 1;
-		  if (strchr (fci->flags2, 'R') != 0)
-		    wanted_type_ptr->reading_from_flag = 1;
-		}
-              wanted_type_ptr->kind = CF_KIND_FORMAT;
-	      wanted_type_ptr->param = cur_param;
-	      wanted_type_ptr->arg_num = arg_num;
-	      wanted_type_ptr->format_start = format_start;
-	      wanted_type_ptr->format_length = format_chars - format_start;
-	      wanted_type_ptr->offset_loc = format_chars - orig_format_chars;
-	      wanted_type_ptr->next = NULL;
-	      if (last_wanted_type != 0)
-		last_wanted_type->next = wanted_type_ptr;
-	      if (first_wanted_type == 0)
-		first_wanted_type = wanted_type_ptr;
-	      last_wanted_type = wanted_type_ptr;
-
-	      fci = fci->chain;
-	      if (fci)
-		{
-		  wanted_type_ptr = fwt_pool.allocate ();
-		  arg_num++;
-		  wanted_type = *fci->types[length_chars_val].type;
-		  wanted_type_name = fci->types[length_chars_val].name;
-		}
-	    }
+      if (!arg_parser.read_any_dollar ())
+	return;
+
+      if (!arg_parser.read_format_flags ())
+	return;
+
+      /* Read any format width, possibly * or *m$.  */
+      if (!arg_parser.read_any_format_width (params, arg_num))
+	return;
+
+      /* Read any format left precision (must be a number, not *).  */
+      arg_parser.read_any_format_left_precision ();
+
+      /* Read any format precision, possibly * or *m$.  */
+      if (!arg_parser.read_any_format_precision (params, arg_num))
+	return;
+
+      const char *format_start = format_chars;
+
+      arg_parser.handle_alloc_chars ();
+
+      /* Read any length modifier, if this kind of format has them.  */
+      const length_modifier len_modifier
+	= arg_parser.read_any_length_modifier ();
+
+      /* Read any modifier (strftime E/O).  */
+      arg_parser.read_any_other_modifier ();
+
+      char format_char = *format_chars;
+      if (format_char == 0
+	  || (!(fki->flags & (int) FMT_FLAG_FANCY_PERCENT_OK)
+	      && format_char == '%'))
+	{
+	  warning_at (location_from_offset (format_string_loc,
+					    format_chars - orig_format_chars),
+		      OPT_Wformat_,
+		      "conversion lacks type at end of format");
+	  continue;
 	}
+      format_chars++;
 
-      if (first_wanted_type != 0)
-        check_format_types (format_string_loc, first_wanted_type);
+      const format_char_info * const fci
+	= arg_parser.find_format_char_info (format_char);
+      if (!fci)
+	continue;
+
+      flag_chars.validate (fki, fci, flag_specs, format_chars,
+			   format_string_loc, orig_format_chars, format_char);
+
+      const int alloc_flag = flag_chars.get_alloc_flag (fki);
+      const bool suppressed = flag_chars.assignment_suppression_p (fki);
+
+      /* Validate the pairs of flags used.  */
+      arg_parser.validate_flag_pairs (fci, format_char);
+
+      arg_parser.give_y2k_warnings (fci, format_char);
+
+      arg_parser.parse_any_scan_set (fci);
+
+      tree wanted_type = NULL;
+      const char *wanted_type_name = NULL;
+
+      if (!arg_parser.handle_conversions (fci, len_modifier,
+					  wanted_type, wanted_type_name,
+					  arg_num,
+					  params,
+					  format_char))
+	continue;
+
+      arg_parser.main_wanted_type.next = NULL;
+
+      /* Finally. . .check type of argument against desired type!  */
+      if (!arg_parser.check_argument_type (fci, len_modifier,
+					   wanted_type, wanted_type_name,
+					   suppressed,
+					   arg_num, params,
+					   alloc_flag,
+					   format_start))
+	return;
     }
 
   if (format_chars - orig_format_chars != format_length)
-- 
1.8.5.3
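
The table lookup done by argument_parser::read_any_length_modifier in the
patch above can be tried out in isolation. What follows is only a minimal
sketch, not GCC code: length_spec, sketch_length_specs and
read_length_modifier are hypothetical names, and the table is a small
illustrative subset rather than the real format_length_info data. It shows
the same two steps: find the first entry whose name prefixes the cursor,
then check whether the first character is repeated to select the doubled
form.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical, simplified table entry; the real format_length_info
// carries more fields (index, std version, scalar identity flag).
struct length_spec
{
  const char *name;         // single spelling, e.g. "l"
  const char *double_name;  // doubled spelling, e.g. "ll", or nullptr
};

// Illustrative subset of printf-style length modifiers, terminated by
// an entry with a null name (an assumption made for this sketch).
static const length_spec sketch_length_specs[] = {
  { "h", "hh" },
  { "l", "ll" },
  { "L", nullptr },
  { "z", nullptr },
  { nullptr, nullptr }
};

// Match a length modifier at CURSOR, advance CURSOR past it, and return
// the matched spelling.  Returns nullptr and leaves CURSOR untouched if
// no modifier is present.
const char *
read_length_modifier (const char *&cursor)
{
  assert (cursor != nullptr);

  const length_spec *spec = sketch_length_specs;
  while (spec->name != nullptr
         && std::strncmp (spec->name, cursor, std::strlen (spec->name)) != 0)
    spec++;
  if (spec->name == nullptr)
    return nullptr;

  cursor += std::strlen (spec->name);
  // A repeated first character selects the doubled form, e.g. "ll".
  if (spec->double_name != nullptr && spec->name[0] == *cursor)
    {
      cursor++;
      return spec->double_name;
    }
  return spec->name;
}
```

Passing the cursor by reference mirrors how the patched code advances the
shared format_chars pointer as each piece of the conversion is consumed.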


* Re: [Committed] [PATCH 2/4] (v4) On-demand locations within string-literals
  2016-08-05 18:17                     ` [Committed] [PATCH 2/4] (v4) " David Malcolm
@ 2016-08-06  5:48                       ` Markus Trippelsdorf
  2016-08-06  5:59                         ` Prathamesh Kulkarni
  2021-09-02 13:59                       ` [Committed] [PATCH 2/4] (v4) On-demand locations within string-literals Thomas Schwinge
  1 sibling, 1 reply; 61+ messages in thread
From: Markus Trippelsdorf @ 2016-08-06  5:48 UTC (permalink / raw)
  To: David Malcolm; +Cc: Jeff Law, gcc-patches, Joseph Myers

On 2016.08.05 at 14:16 -0400, David Malcolm wrote:
> Successfully bootstrapped&regrtested the updated patch on x86_64-pc
> -linux-gnu, and successfully ran the stage 1 selftests on powerpc-ibm
> -aix7.1.3.0 (gcc111)
> 
> Committed to trunk as r239175; I'm attaching the final version of the
> patch for reference.

It breaks the build on ppc64le (gcc112):

/home/trippels/gcc_build_dir/./gcc/xgcc -B/home/trippels/gcc_build_dir/./gcc/ -xc -S -c /dev/null -fself-test
cc1: internal compiler error: Segmentation fault
0x1088b293 crash_signal
        ../../gcc/gcc/toplev.c:335
0x1115c694 cpp_string_location_reader::get_next()
        ../../gcc/libcpp/charset.c:2143
0x1115c694 _cpp_valid_ucn
        ../../gcc/libcpp/charset.c:1079
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <http://gcc.gnu.org/bugs.html> for instructions.
Makefile:1898: recipe for target 's-selftest' failed

-- 
Markus


* Re: [Committed] [PATCH 2/4] (v4) On-demand locations within string-literals
  2016-08-06  5:48                       ` Markus Trippelsdorf
@ 2016-08-06  5:59                         ` Prathamesh Kulkarni
  2016-08-06 18:10                           ` [committed] Fix crash in selftest::test_lexer_string_locations_ucn4 (PR bootstrap/72823) David Malcolm
  0 siblings, 1 reply; 61+ messages in thread
From: Prathamesh Kulkarni @ 2016-08-06  5:59 UTC (permalink / raw)
  To: Markus Trippelsdorf; +Cc: David Malcolm, Jeff Law, gcc Patches, Joseph Myers

On 6 August 2016 at 11:16, Markus Trippelsdorf <markus@trippelsdorf.de> wrote:
> On 2016.08.05 at 14:16 -0400, David Malcolm wrote:
>> Successfully bootstrapped&regrtested the updated patch on x86_64-pc
>> -linux-gnu, and successfully ran the stage 1 selftests on powerpc-ibm
>> -aix7.1.3.0 (gcc111)
>>
>> Committed to trunk as r239175; I'm attaching the final version of the
>> patch for reference.
>
> It breaks the build on ppc64le (gcc112):
FWIW I am observing the same error on x86_64-unknown-linux-gnu:
http://pastebin.com/63k4CRVY

Thanks,
Prathamesh
>
> /home/trippels/gcc_build_dir/./gcc/xgcc -B/home/trippels/gcc_build_dir/./gcc/ -xc -S -c /dev/null -fself-test
> cc1: internal compiler error: Segmentation fault
> 0x1088b293 crash_signal
>         ../../gcc/gcc/toplev.c:335
> 0x1115c694 cpp_string_location_reader::get_next()
>         ../../gcc/libcpp/charset.c:2143
> 0x1115c694 _cpp_valid_ucn
>         ../../gcc/libcpp/charset.c:1079
> Please submit a full bug report,
> with preprocessed source if appropriate.
> Please include the complete backtrace with any bug report.
> See <http://gcc.gnu.org/bugs.html> for instructions.
> Makefile:1898: recipe for target 's-selftest' failed
>
> --
> Markus


* [committed] Fix crash in selftest::test_lexer_string_locations_ucn4 (PR bootstrap/72823)
  2016-08-06  5:59                         ` Prathamesh Kulkarni
@ 2016-08-06 18:10                           ` David Malcolm
  0 siblings, 0 replies; 61+ messages in thread
From: David Malcolm @ 2016-08-06 18:10 UTC (permalink / raw)
  To: Prathamesh Kulkarni, Markus Trippelsdorf
  Cc: Jeff Law, gcc Patches, Joseph Myers

[-- Attachment #1: Type: text/plain, Size: 1930 bytes --]

On Sat, 2016-08-06 at 11:29 +0530, Prathamesh Kulkarni wrote:
> On 6 August 2016 at 11:16, Markus Trippelsdorf <
> markus@trippelsdorf.de> wrote:
> > On 2016.08.05 at 14:16 -0400, David Malcolm wrote:
> > > Successfully bootstrapped&regrtested the updated patch on x86_64
> > > -pc
> > > -linux-gnu, and successfully ran the stage 1 selftests on powerpc
> > > -ibm
> > > -aix7.1.3.0 (gcc111)
> > > 
> > > Committed to trunk as r239175; I'm attaching the final version of
> > > the
> > > patch for reference.
> > 
> > It breaks the build on ppc64le (gcc112):
> FWIW I am observing the same error on x86_64-unknown-linux-gnu:
> http://pastebin.com/63k4CRVY
> 
> Thanks,
> Prathamesh
> > 
> > /home/trippels/gcc_build_dir/./gcc/xgcc 
> > -B/home/trippels/gcc_build_dir/./gcc/ -xc -S -c /dev/null -fself
> > -test
> > cc1: internal compiler error: Segmentation fault
> > 0x1088b293 crash_signal
> >         ../../gcc/gcc/toplev.c:335
> > 0x1115c694 cpp_string_location_reader::get_next()
> >         ../../gcc/libcpp/charset.c:2143
> > 0x1115c694 _cpp_valid_ucn
> >         ../../gcc/libcpp/charset.c:1079
> > Please submit a full bug report,
> > with preprocessed source if appropriate.
> > Please include the complete backtrace with any bug report.
> > See <http://gcc.gnu.org/bugs.html> for instructions.
> > Makefile:1898: recipe for target 's-selftest' failed
> > 
> > --
> > Markus

Sorry about the breakage.  Looks like gcc_assert in libcpp/system.h can
sometimes enforce the assertion, and sometimes be a no-op, depending on
the host compiler.

I was able to reliably reproduce the crash by hacking up my
libcpp/system.h so that gcc_assert was enforced.
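
To illustrate how one assertion macro can be enforced in some builds and
do nothing in others, here is a minimal sketch. It is not the libcpp
definition: SKETCH_ASSERT, SKETCH_ENFORCE_ASSERTS and the helper functions
are hypothetical names, and the compile-time switch stands in for whatever
configuration or host-compiler test the real header uses.

```cpp
#include <cassert>  // for comparison: the standard assert is likewise
                    // compiled out when NDEBUG is defined
#include <cstdlib>

// Hypothetical assert macro with two build-dependent expansions.
#ifdef SKETCH_ENFORCE_ASSERTS
// Enforced: evaluate EXPR and abort if it is false.
#define SKETCH_ASSERT(EXPR) ((void) (!(EXPR) ? (std::abort (), 0) : 0))
#else
// No-op: EXPR stays syntactically used (avoiding unused-variable
// warnings) but is never evaluated, because "0 &&" short-circuits.
#define SKETCH_ASSERT(EXPR) ((void) (0 && (EXPR)))
#endif

static int sketch_evaluations = 0;

// Record that the asserted expression was actually evaluated.
static bool
counted (bool value)
{
  sketch_evaluations++;
  return value;
}

// Return how many times the asserted expression was evaluated:
// 1 when asserts are enforced, 0 when they expand to a no-op.
int
count_assert_evaluations ()
{
  sketch_evaluations = 0;
  SKETCH_ASSERT (counted (true));
  return sketch_evaluations;
}
```

With the no-op expansion a false condition passes silently, so a condition
that is too strict can go unnoticed on one host and fire on another.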

I've committed the attached patch to trunk as r239211.
(survives selftest with hacked-up gcc_assert; reported as fixing the
issue on IRC; survived bootstrap also).

I'll have a look at improving the libcpp assert situation on Monday.

Sorry again about the breakage.

Dave

[-- Attachment #2: 0003-Fix-crash-in-selftest-test_lexer_string_locations_uc.patch --]
[-- Type: text/x-patch, Size: 1895 bytes --]

From 0a35b0da798bb2dd4e5af23505075b74558fe956 Mon Sep 17 00:00:00 2001
From: David Malcolm <dmalcolm@redhat.com>
Date: Sat, 6 Aug 2016 14:05:24 -0400
Subject: [PATCH] Fix crash in selftest::test_lexer_string_locations_ucn4 (PR
 bootstrap/72823)

libcpp/ChangeLog:
	PR bootstrap/72823
	* charset.c (_cpp_valid_ucn): Replace overzealous assert with one
	that allows for char_range to be non-NULL when loc_reader is NULL.
---
 libcpp/charset.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/libcpp/charset.c b/libcpp/charset.c
index 3739d6c..6a92ade 100644
--- a/libcpp/charset.c
+++ b/libcpp/charset.c
@@ -1027,7 +1027,7 @@ ucn_valid_in_identifier (cpp_reader *pfile, cppchar_t c,
    IDENTIFIER_POS is 0 when not in an identifier, 1 for the start of
    an identifier, or 2 otherwise.
 
-   If CHAR_RANGE and LOC_READER are non-NULL, then position information is
+   If LOC_READER is non-NULL, then position information is
    read from *LOC_READER and CHAR_RANGE->m_finish is updated accordingly.  */
 
 bool
@@ -1042,10 +1042,6 @@ _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
   const uchar *str = *pstr;
   const uchar *base = str - 2;
 
-  /* char_range and loc_reader must either be both NULL, or both be
-     non-NULL.  */
-  gcc_assert ((char_range != NULL) == (loc_reader != NULL));
-
   if (!CPP_OPTION (pfile, cplusplus) && !CPP_OPTION (pfile, c99))
     cpp_error (pfile, CPP_DL_WARNING,
 	       "universal character names are only valid in C++ and C99");
@@ -1076,7 +1072,10 @@ _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
 	break;
       str++;
       if (loc_reader)
-	char_range->m_finish = loc_reader->get_next ().m_finish;
+	{
+	  gcc_assert (char_range);
+	  char_range->m_finish = loc_reader->get_next ().m_finish;
+	}
       result = (result << 4) + hex_value (c);
     }
   while (--length && str < limit);
-- 
1.8.5.3



* Re: [PATCH] c-format.c: cleanup of check_format_info_main
  2016-08-06  0:56                     ` [PATCH] c-format.c: cleanup of check_format_info_main David Malcolm
@ 2016-08-08 17:20                       ` Jeff Law
  0 siblings, 0 replies; 61+ messages in thread
From: Jeff Law @ 2016-08-08 17:20 UTC (permalink / raw)
  To: David Malcolm, gcc-patches; +Cc: Joseph Myers, Martin Sebor

On 08/05/2016 07:24 PM, David Malcolm wrote:
> On Thu, 2016-08-04 at 14:22 -0600, Jeff Law wrote:
>> On 08/04/2016 01:24 PM, David Malcolm wrote:
>>
>>>> Do you realize that this isn't used for ~700 lines after this
>>>> point?  Is there any sensible way to factor some code here to
>>>> avoid the coding disconnect.  I realize the function was huge
>>>> before you got in here, but if at all possible, I'd like to see a
>>>> bit of cleanup.
>>>>
>>>> I think this is OK after that cleanup.
>>>
>>> format_chars can get modified in numerous places in the intervening
>>> lines, which is why I stash the value there.
>> Yea, I figured that was the case.  I first noticed the stashed value,
>> but didn't see where it was used for far longer than I expected.
>>
>>>
>>> I can do some kind of cleanup of check_format_info_main, maybe
>>> splitting out the things in the body of loop, moving them to
>>> support functions.
>> That's essentially what I was thinking.
>>
>>>
>>> That said, I note that Martin's sprintf patch:
>>>   https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00056.html
>>> also touches those ~700 lines in check_format_info_main in over a
>>> dozen places.  Given that, would you prefer I do the cleanup before
>>> or after the substring_loc patch?
>> I think you should go first with the cleanup.  It'll cause Martin
>> some heartburn, but that happens sometimes.
>>
>> And FWIW, if you hadn't needed to stash away that value I probably
>> wouldn't have noticed how badly that function (and the loop in
>> particular) needed some refactoring.
>>
>> jeff
>
> Here's a cleanup of check_format_info_main, which introduces three
> new classes to hold state, and moves code from the loop into
> methods of those classes, reducing the loop from ~700 lines to
> ~100 lines.
>
> Unfortunately, so much changes in this patch that the before/after
> diff is hard to read.  If you like the end-result, but would prefer
> better history I could try to split this up into a more readable set
> of patches.  (I have a version of that, but they're messy)
>
> Successfully bootstrapped&regrtested the updated patch on
> x86_64-pc-linux-gnu.
>
> OK for trunk?
>
> gcc/c-family/ChangeLog:
> 	* c-format.c (class flag_chars_t): New class.
> 	(struct length_modifier): New struct.
> 	(class argument_parser): New class.
> 	(flag_chars_t::flag_chars_t): New ctor.
> 	(flag_chars_t::has_char_p): New method.
> 	(flag_chars_t::add_char): New method.
> 	(flag_chars_t::validate): New method.
> 	(flag_chars_t::get_alloc_flag): New method.
> 	(flag_chars_t::assignment_suppression_p): New method.
> 	(argument_parser::argument_parser): New ctor.
> 	(argument_parser::read_any_dollar): New method.
> 	(argument_parser::read_format_flags): New method.
> 	(argument_parser::read_any_format_width): New method.
> 	(argument_parser::read_any_format_left_precision): New method.
> 	(argument_parser::read_any_format_precision): New method.
> 	(argument_parser::handle_alloc_chars): New method.
> 	(argument_parser::read_any_length_modifier): New method.
> 	(argument_parser::read_any_other_modifier): New method.
> 	(argument_parser::find_format_char_info): New method.
> 	(argument_parser::validate_flag_pairs): New method.
> 	(argument_parser::give_y2k_warnings): New method.
> 	(argument_parser::parse_any_scan_set): New method.
> 	(argument_parser::handle_conversions): New method.
> 	(argument_parser::check_argument_type): New method.
> 	(check_format_info_main): Introduce classes argument_parser
> 	and flag_chars_t, moving the code within the loop into methods
> 	of these classes.  Make various locals "const".
OK.  Thanks for cleaning this up.

jeff
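
For readers without the patch to hand, the following is a rough sketch
of the kind of state holder the ChangeLog describes.  It is not the
committed code: sketch_flag_chars and count_repeated_flags are
hypothetical names, and only the has_char_p/add_char pair is modelled,
on the assumption that the class records the flag characters seen so
far for one conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical, cut-down model of a per-conversion flag record.
class sketch_flag_chars
{
 public:
  sketch_flag_chars () { m_flag_chars[0] = '\0'; }

  // True if CH has already been recorded.
  bool has_char_p (char ch) const
  {
    return ch != '\0' && std::strchr (m_flag_chars, ch) != NULL;
  }

  // Record CH; silently ignored if the buffer is already full.
  void add_char (char ch)
  {
    const std::size_t len = std::strlen (m_flag_chars);
    if (len + 1 < sizeof m_flag_chars)
      {
        m_flag_chars[len] = ch;
        m_flag_chars[len + 1] = '\0';
      }
  }

 private:
  char m_flag_chars[256];  // NUL-terminated list of flags seen so far
};

// Walk FLAGS and return how many characters repeat an earlier flag,
// which is where a "repeated ... in format" warning would be issued.
int
count_repeated_flags (const char *flags)
{
  sketch_flag_chars seen;
  int repeats = 0;
  for (; *flags; ++flags)
    {
      if (seen.has_char_p (*flags))
        repeats++;
      else
        seen.add_char (*flags);
    }
  return repeats;
}
```

Keeping this state in a small object is what lets the loop body shrink:
each step asks the object a question instead of indexing into a local
array.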


* Re: [PATCH 3/4] Use class substring_loc in c-format.c (PR c/52952)
  2016-08-04 18:09               ` Jeff Law
  2016-08-04 19:25                 ` David Malcolm
@ 2016-08-08 20:16                 ` David Malcolm
  1 sibling, 0 replies; 61+ messages in thread
From: David Malcolm @ 2016-08-08 20:16 UTC (permalink / raw)
  To: Jeff Law, gcc-patches; +Cc: Joseph Myers

[-- Attachment #1: Type: text/plain, Size: 3340 bytes --]

On Thu, 2016-08-04 at 12:08 -0600, Jeff Law wrote:
> On 08/03/2016 09:45 AM, David Malcolm wrote:
> > This patch updates c-format.c to use the new class substring_loc,
> > added in the previous patch, replacing
> > location_column_from_byte_offset.  Hence with this patch, Wformat can
> > underline the precise erroneous format string in many more cases.
> > 
> > The patch also introduces two new functions for emitting Wformat
> > warnings: format_warning_at_substring and format_warning_at_char,
> > providing an inform in the face of macros where the pertinent part
> > of the format string may be separate from the function call.
> > 
> > Successfully bootstrapped&regrtested in conjunction with the rest
> > of the patch kit on x86_64-pc-linux-gnu.
> > 
> > (The v2 version of the patch had a successful selftest run for
> > stage 1 on powerpc-ibm-aix7.1.3.0 (gcc111) in conjunction with the
> > rest of the patch kit, and a successful build of stage1 for all
> > targets via config-list.mk; the patch has only been rebased since)
> > 
> > OK for trunk if it passes individual testing? (on top of patches
> > 1-2)
> > 
> > gcc/c-family/ChangeLog:
> > 	PR c/52952
> > 	* c-format.c: Include "diagnostic.h".
> > 	(location_column_from_byte_offset): Delete.
> > 	(location_from_offset): Delete.
> > 	(format_warning_va): New function.
> > 	(format_warning_at_substring): New function.
> > 	(format_warning_at_char): New function.
> > 	(check_format_arg): Capture location of format_tree and pass to
> > 	check_format_info_main.
> > 	(check_format_info_main): Add params FMT_PARAM_LOC and
> > 	FORMAT_STRING_CST.  Convert calls to warning_at to calls to
> > 	format_warning_at_char.  Pass a substring_loc instance to
> > 	check_format_types.
> > 	(check_format_types): Convert first param from a location_t
> > 	to a const substring_loc & and rename to "fmt_loc".  Attempt
> > 	to extract the range of the relevant parameter and pass it
> > 	to format_type_warning.
> > 	(format_type_warning): Convert first param from a location_t
> > 	to a const substring_loc & and rename to "fmt_loc".  Add
> > 	params "param_range" and "type".  Replace calls to warning_at
> > 	with calls to format_warning_at_substring.
> > 
> > gcc/testsuite/ChangeLog:
> > 	PR c/52952
> > 	* gcc.dg/cpp/pr66415-1.c: Likewise.
> > 	* gcc.dg/format/asm_fprintf-1.c: Update column numbers.
> > 	* gcc.dg/format/c90-printf-1.c: Likewise.
> > 	* gcc.dg/format/diagnostic-ranges.c: New test case.
> > ---
> > 
> 
> > @@ -1758,6 +1859,7 @@ check_format_info_main (format_check_results *res,
> >  	  ++format_chars;
> >  	  continue;
> >  	}
> > +      const char *start_of_this_format = format_chars;
> Do you realize that this isn't used for ~700 lines after this point?
> Is there any sensible way to factor some code here to avoid the
> coding disconnect.  I realize the function was huge before you got in
> here, but if at all possible, I'd like to see a bit of cleanup.
> 
> I think this is OK after that cleanup.

Thanks.  The patch needed obvious fixups to apply after the cleanup,
given the breakup of check_format_info_main.
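
One such fixup is that the stashed start-of-format pointer is now a
field of argument_parser rather than a local in the loop.  A minimal
sketch of how the stashed pointer and the current pointer turn into
substring offsets, assuming the pointer is captured just past the '%'
(first_conversion_offsets and sketch_substring are hypothetical names,
and the flag/width/length-modifier parsing is reduced to skipping
non-letters):

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>

// Hypothetical result type: offsets of the first and last character of
// one conversion within the format string.
struct sketch_substring { std::ptrdiff_t start_idx; std::ptrdiff_t end_idx; };

// Find the first conversion in ORIG_FORMAT_CHARS and store its offsets
// in *OUT.  Returns false if there is no complete conversion.
bool
first_conversion_offsets (const char *orig_format_chars,
                          sketch_substring *out)
{
  const char *format_chars = orig_format_chars;
  while (*format_chars)
    {
      if (*format_chars++ != '%')
        continue;
      if (*format_chars == '\0')
        return false;           // spurious trailing '%'
      if (*format_chars == '%')
        {
          ++format_chars;       // "%%" is a literal percent sign
          continue;
        }

      // Stash the position just past the '%'.
      const char *start_of_this_format = format_chars;

      // Simplification: skip anything that is not a letter, then treat
      // the first letter as the conversion type character.
      while (*format_chars
             && !std::isalpha (static_cast<unsigned char> (*format_chars)))
        ++format_chars;
      if (*format_chars == '\0')
        return false;           // conversion lacks a type character
      ++format_chars;           // step past the type character

      // Both pointers are one past the character of interest, hence
      // the "- 1" on each side.
      out->start_idx = (start_of_this_format - 1) - orig_format_chars;
      out->end_idx = (format_chars - 1) - orig_format_chars;
      return true;
    }
  return false;
}
```

For "hello %i" this gives offsets 6 and 7, i.e. the '%' and the 'i'.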

I'm attaching what I actually committed to trunk (r239253).

(Successfully bootstrapped&regrtested on x86_64-pc-linux-gnu; 
successful selftest run for stage 1 on powerpc-ibm-aix7.1.3.0, on
gcc111)
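
For reference, the way format_warning_va picks the primary location can
be sketched independently of the diagnostics machinery.  This is a
simplified model, assuming locations are plain integers; sketch_range,
sketch_choice and choose_primary_location are made-up names, and a NULL
substring_range stands in for get_range returning an error:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical integer-based stand-in for a source range.
struct sketch_range { int m_start; int m_finish; };

enum sketch_case
{
  CASE_1_SUBSTRING,        // underline the substring itself
  CASE_2_STRING_AND_NOTE,  // underline the whole string, add a note
  CASE_3_WHOLE_STRING      // no substring info: underline the whole string
};

struct sketch_choice
{
  sketch_case which;
  sketch_range primary;  // range used for the warning itself
  bool emit_note;        // whether a follow-up note shows the substring
};

// Decide where the warning should point, given the range of the whole
// format string and (if it could be computed) the substring's range.
sketch_choice
choose_primary_location (const sketch_range &fmt_string_range,
                         const sketch_range *substring_range)
{
  sketch_choice result;
  if (substring_range == NULL)
    {
      // Case 3: unable to get the substring location.
      result.which = CASE_3_WHOLE_STRING;
      result.primary = fmt_string_range;
      result.emit_note = false;
    }
  else if (substring_range->m_start >= fmt_string_range.m_start
           && substring_range->m_finish <= fmt_string_range.m_finish)
    {
      // Case 1: the substring lies within the format string argument.
      result.which = CASE_1_SUBSTRING;
      result.primary = *substring_range;
      result.emit_note = false;
    }
  else
    {
      // Case 2: the substring lives elsewhere (e.g. in a macro
      // definition).  The real code only emits the note if the warning
      // itself was emitted.
      result.which = CASE_2_STRING_AND_NOTE;
      result.primary = fmt_string_range;
      result.emit_note = true;
    }
  return result;
}
```

The containment test is the same pair of comparisons as in the patch;
everything else about rich_location and the caret position is left out
of the model.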

[-- Attachment #2: r239253.patch --]
[-- Type: text/x-patch, Size: 42467 bytes --]

Index: gcc/c-family/ChangeLog
===================================================================
--- gcc/c-family/ChangeLog	(revision 239252)
+++ gcc/c-family/ChangeLog	(revision 239253)
@@ -1,5 +1,50 @@
 2016-08-08  David Malcolm  <dmalcolm@redhat.com>
 
+	PR c/52952
+	* c-format.c: Include "diagnostic.h".
+	(location_column_from_byte_offset): Delete.
+	(location_from_offset): Delete.
+	(format_warning_va): New function.
+	(format_warning_at_substring): New function.
+	(format_warning_at_char): New function.
+	(check_format_arg): Capture location of format_tree and pass to
+	check_format_info_main.
+	(argument_parser): Add fields "start_of_this_format" and
+	"format_string_cst".
+	(flag_chars_t::validate): Add param "format_string_cst".  Convert
+	warning_at call using location_from_offset to call to
+	format_warning_at_char.
+	(argument_parser::argument_parser): Add param "format_string_cst_"
+	and use it to initialize field "format_string_cst".
+	Initialize new field "start_of_this_format".
+	(argument_parser::read_format_flags): Convert warning_at call
+	using location_from_offset to a call to format_warning_at_char.
+	(argument_parser::read_any_format_left_precision): Likewise.
+	(argument_parser::read_any_format_precision): Likewise.
+	(argument_parser::read_any_other_modifier): Likewise.
+	(argument_parser::find_format_char_info): Likewise, in three places.
+	(argument_parser::parse_any_scan_set): Likewise, in one place.
+	(argument_parser::handle_conversions): Likewise, in two places.
+	(argument_parser::check_argument_type): Add param "fmt_param_loc"
+	and use it to make a substring_loc.  Pass the latter to
+	check_format_types.
+	(check_format_info_main): Add params "fmt_param_loc" and
+	"format_string_cst".  Convert warning_at calls using
+	location_from_offset to calls to format_warning_at_char.  Pass the
+	new params to the arg_parser ctor.  Pass "format_string_cst" to
+	flag_chars.validate.  Pass "fmt_param_loc" to
+	arg_parser.check_argument_type.
+	(check_format_types): Convert first param from a location_t
+	to a const substring_loc & and rename to "fmt_loc".  Attempt
+	to extract the range of the relevant parameter and pass it
+	to format_type_warning.
+	(format_type_warning): Convert first param from a location_t
+	to a const substring_loc & and rename to "fmt_loc".  Add
+	params "param_range" and "type".  Replace calls to warning_at
+	with calls to format_warning_at_substring.
+
+2016-08-08  David Malcolm  <dmalcolm@redhat.com>
+
 	* c-format.c (class flag_chars_t): New class.
 	(struct length_modifier): New struct.
 	(class argument_parser): New class.
Index: gcc/c-family/c-format.c
===================================================================
--- gcc/c-family/c-format.c	(revision 239252)
+++ gcc/c-family/c-format.c	(revision 239253)
@@ -29,6 +29,7 @@
 #include "intl.h"
 #include "langhooks.h"
 #include "c-format.h"
+#include "diagnostic.h"
 
 /* Handle attributes associated with format checking.  */
 
@@ -65,78 +66,169 @@
 static const char *format_name (int format_num);
 static int format_flags (int format_num);
 
-/* Given a string S of length LINE_WIDTH, find the visual column
-   corresponding to OFFSET bytes.   */
+/* Emit a warning governed by option OPT, using GMSGID as the format
+   string and AP as its arguments.
 
-static unsigned int
-location_column_from_byte_offset (const char *s, int line_width,
-				  unsigned int offset)
+   Attempt to obtain precise location information within a string
+   literal from FMT_LOC.
+
+   Case 1: if substring location is available, and is within the range of
+   the format string itself, the primary location of the
+   diagnostic is the substring range obtained from FMT_LOC, with the
+   caret at the *end* of the substring range.
+
+   For example:
+
+     test.c:90:10: warning: problem with '%i' here [-Wformat=]
+     printf ("hello %i", msg);
+                    ~^
+
+   Case 2: if the substring location is available, but is not within
+   the range of the format string, the primary location is that of the
+   format string, and a note is emitted showing the substring location.
+
+   For example:
+     test.c:90:10: warning: problem with '%i' here [-Wformat=]
+     printf("hello " INT_FMT " world", msg);
+            ^~~~~~~~~~~~~~~~~~~~~~~~~
+     test.c:19: note: format string is defined here
+     #define INT_FMT "%i"
+                      ~^
+
+   Case 3: if precise substring information is unavailable, the primary
+   location is that of the whole string passed to FMT_LOC's constructor.
+   For example:
+
+     test.c:90:10: warning: problem with '%i' here [-Wformat=]
+     printf(fmt, msg);
+            ^~~
+
+   For each of cases 1-3, if param_range is non-NULL, then it is used
+   as a secondary range within the warning.  For example, here it
+   is used with case 1:
+
+     test.c:90:16: warning: '%s' here but arg 2 has 'long' type [-Wformat=]
+     printf ("foo %s bar", long_i + long_j);
+                  ~^       ~~~~~~~~~~~~~~~
+
+   and here with case 2:
+
+     test.c:90:16: warning: '%s' here but arg 2 has 'long' type [-Wformat=]
+     printf ("foo " STR_FMT " bar", long_i + long_j);
+             ^~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~~~~
+     test.c:89:16: note: format string is defined here
+     #define STR_FMT "%s"
+                      ~^
+
+   and with case 3:
+
+     test.c:90:10: warning: '%i' here, but arg 2 is 'const char *' [-Wformat=]
+     printf(fmt, msg);
+            ^~~  ~~~
+
+   Return true if a warning was emitted, false otherwise.  */
+
+ATTRIBUTE_GCC_DIAG (4,0)
+static bool
+format_warning_va (const substring_loc &fmt_loc, source_range *param_range,
+		   int opt, const char *gmsgid, va_list *ap)
 {
-  const char * c = s;
-  if (*c != '"')
-    return 0;
-
-  c++, offset--;
-  while (offset > 0)
+  bool substring_within_range = false;
+  location_t primary_loc;
+  location_t substring_loc = UNKNOWN_LOCATION;
+  source_range fmt_loc_range
+    = get_range_from_loc (line_table, fmt_loc.get_fmt_string_loc ());
+  source_range fmt_substring_range;
+  const char *err = fmt_loc.get_range (&fmt_substring_range);
+  if (err)
+    /* Case 3: unable to get substring location.  */
+    primary_loc = fmt_loc.get_fmt_string_loc ();
+  else
     {
-      if (c - s >= line_width)
-	return 0;
+      substring_loc = make_location (fmt_substring_range.m_finish,
+				     fmt_substring_range.m_start,
+				     fmt_substring_range.m_finish);
 
-      switch (*c)
+      if (fmt_substring_range.m_start >= fmt_loc_range.m_start
+	  && fmt_substring_range.m_finish <= fmt_loc_range.m_finish)
+	/* Case 1.  */
 	{
-	case '\\':
-	  c++;
-	  if (c - s >= line_width)
-	    return 0;
-	  switch (*c)
-	    {
-	    case '\\': case '\'': case '"': case '?':
-	    case '(': case '{': case '[': case '%':
-	    case 'a': case 'b': case 'f': case 'n':
-	    case 'r': case 't': case 'v': 
-	    case 'e': case 'E':
-	      c++, offset--;
-	      break;
+	  substring_within_range = true;
+	  primary_loc = substring_loc;
+	}
+      else
+	/* Case 2.  */
+	{
+	  substring_within_range = false;
+	  primary_loc = fmt_loc.get_fmt_string_loc ();
+	}
+    }
 
-	    default:
-	      return 0;
-	    }
-	  break;
+  rich_location richloc (line_table, primary_loc);
 
-	case '"':
-	  /* We found the end of the string too early.  */
-	  return 0;
-	  
-	default:
-	  c++, offset--;
-	  break;
-	}
+  if (param_range)
+    {
+      location_t param_loc = make_location (param_range->m_start,
+					    param_range->m_start,
+					    param_range->m_finish);
+      richloc.add_range (param_loc, false);
     }
-  return c - s;
+
+  diagnostic_info diagnostic;
+  diagnostic_set_info (&diagnostic, gmsgid, ap, &richloc, DK_WARNING);
+  diagnostic.option_index = opt;
+  bool warned = report_diagnostic (&diagnostic);
+
+  if (!err && substring_loc && !substring_within_range)
+    /* Case 2.  */
+    if (warned)
+      inform (substring_loc, "format string is defined here");
+
+  return warned;
 }
 
-/* Return a location that encodes the same location as LOC but shifted
-   by OFFSET bytes.  */
+/* Variadic call to format_warning_va.  */
 
-static location_t
-location_from_offset (location_t loc, int offset)
+ATTRIBUTE_GCC_DIAG (4,0)
+static bool
+format_warning_at_substring (const substring_loc &fmt_loc,
+			     source_range *param_range,
+			     int opt, const char *gmsgid, ...)
 {
-  gcc_checking_assert (offset >= 0);
-  if (linemap_location_from_macro_expansion_p (line_table, loc)
-      || offset < 0)
-    return loc;
+  va_list ap;
+  va_start (ap, gmsgid);
+  bool warned = format_warning_va (fmt_loc, param_range, opt, gmsgid, &ap);
+  va_end (ap);
 
-  expanded_location s = expand_location_to_spelling_point (loc);
-  int line_width;
-  const char *line = location_get_source_line (s.file, s.line, &line_width);
-  if (line == NULL)
-    return loc;
-  line += s.column - 1 ;
-  line_width -= s.column - 1;
-  unsigned int column =
-    location_column_from_byte_offset (line, line_width, (unsigned) offset);
+  return warned;
+}
 
-  return linemap_position_for_loc_and_offset (line_table, loc, column);
+/* Emit a warning as per format_warning_va, but construct the substring_loc
+   for the character at offset (CHAR_IDX - 1) within a string constant
+   FORMAT_STRING_CST at FMT_STRING_LOC.  */
+
+ATTRIBUTE_GCC_DIAG (5,6)
+static bool
+format_warning_at_char (location_t fmt_string_loc, tree format_string_cst,
+			int char_idx, int opt, const char *gmsgid, ...)
+{
+  va_list ap;
+  va_start (ap, gmsgid);
+  tree string_type = TREE_TYPE (format_string_cst);
+
+  /* The callers are of the form:
+       format_warning (format_string_loc, format_string_cst,
+		       format_chars - orig_format_chars,
+      where format_chars has already been incremented, so that
+      CHAR_IDX is one character beyond where the warning should
+      be emitted.  Fix it.  */
+  char_idx -= 1;
+
+  substring_loc fmt_loc (fmt_string_loc, string_type, char_idx, char_idx);
+  bool warned = format_warning_va (fmt_loc, NULL, opt, gmsgid, &ap);
+  va_end (ap);
+
+  return warned;
 }
 
 /* Check that we have a pointer to a string suitable for use as a format.
@@ -1018,8 +1110,9 @@
 static void check_format_info (function_format_info *, tree);
 static void check_format_arg (void *, tree, unsigned HOST_WIDE_INT);
 static void check_format_info_main (format_check_results *,
-				    function_format_info *,
-				    const char *, int, tree,
+				    function_format_info *, const char *,
+				    location_t, tree,
+				    int, tree,
 				    unsigned HOST_WIDE_INT,
 				    object_allocator<format_wanted_type> &);
 
@@ -1032,8 +1125,12 @@
 static const format_flag_spec *get_flag_spec (const format_flag_spec *,
 					      int, const char *);
 
-static void check_format_types (location_t, format_wanted_type *);
-static void format_type_warning (location_t, format_wanted_type *, tree, tree);
+static void check_format_types (const substring_loc &fmt_loc,
+				format_wanted_type *);
+static void format_type_warning (const substring_loc &fmt_loc,
+				 source_range *param_range,
+				 format_wanted_type *, tree,
+				 tree);
 
 /* Decode a format type from a string, returning the type, or
    format_type_error if not valid, in which case the caller should print an
@@ -1509,6 +1606,8 @@
   tree array_size = 0;
   tree array_init;
 
+  location_t fmt_param_loc = EXPR_LOC_OR_LOC (format_tree, input_location);
+
   if (VAR_P (format_tree))
     {
       /* Pull out a constant value if the front end didn't.  */
@@ -1684,8 +1783,8 @@
      need not adjust it for every return.  */
   res->number_other++;
   object_allocator <format_wanted_type> fwt_pool ("format_wanted_type pool");
-  check_format_info_main (res, info, format_chars, format_length,
-			  params, arg_num, fwt_pool);
+  check_format_info_main (res, info, format_chars, fmt_param_loc, format_tree,
+			  format_length, params, arg_num, fwt_pool);
 }
 
 /* Support class for argument_parser and check_format_info_main.
@@ -1702,6 +1801,7 @@
 		 const format_char_info *fci,
 		 const format_flag_spec *flag_specs,
 		 const char * const format_chars,
+		 tree format_string_cst,
 		 location_t format_string_loc,
 		 const char * const orig_format_chars,
 		 char format_char);
@@ -1744,6 +1844,7 @@
 {
  public:
   argument_parser (function_format_info *info, const char *&format_chars,
+		   tree format_string_cst,
 		   const char * const orig_format_chars,
 		   location_t format_string_loc, flag_chars_t &flag_chars,
 		   int &has_operand_number, tree first_fillin_param,
@@ -1799,13 +1900,16 @@
 		       unsigned HOST_WIDE_INT &arg_num,
 		       tree &params,
 		       const int alloc_flag,
-		       const char * const format_start);
+		       const char * const format_start,
+		       location_t fmt_param_loc);
 
  private:
   const function_format_info *const info;
   const format_kind_info * const fki;
   const format_flag_spec * const flag_specs;
+  const char *start_of_this_format;
   const char *&format_chars;
+  const tree format_string_cst;
   const char * const orig_format_chars;
   const location_t format_string_loc;
   object_allocator <format_wanted_type> &fwt_pool;
@@ -1855,6 +1959,7 @@
 			const format_char_info *fci,
 			const format_flag_spec *flag_specs,
 			const char * const format_chars,
+			tree format_string_cst,
 			location_t format_string_loc,
 			const char * const orig_format_chars,
 			char format_char)
@@ -1870,11 +1975,11 @@
 	continue;
       if (strchr (fci->flag_chars, m_flag_chars[i]) == 0)
 	{
-	  warning_at (location_from_offset (format_string_loc,
-					    format_chars
-					    - orig_format_chars),
-		      OPT_Wformat_, "%s used with %<%%%c%> %s format",
-		      _(s->name), format_char, fki->name);
+	  format_warning_at_char (format_string_loc, format_string_cst,
+				  format_chars - orig_format_chars,
+				  OPT_Wformat_,
+				  "%s used with %<%%%c%> %s format",
+				  _(s->name), format_char, fki->name);
 	  d++;
 	  continue;
 	}
@@ -1935,6 +2040,7 @@
 
 argument_parser::
 argument_parser (function_format_info *info_, const char *&format_chars_,
+		 tree format_string_cst_,
 		 const char * const orig_format_chars_,
 		 location_t format_string_loc_,
 		 flag_chars_t &flag_chars_,
@@ -1944,7 +2050,9 @@
 : info (info_),
   fki (&format_types[info->format_type]),
   flag_specs (fki->flag_specs),
+  start_of_this_format (format_chars_),
   format_chars (format_chars_),
+  format_string_cst (format_string_cst_),
   orig_format_chars (orig_format_chars_),
   format_string_loc (format_string_loc_),
   fwt_pool (fwt_pool_),
@@ -2008,11 +2116,10 @@
 						 *format_chars, NULL);
       if (flag_chars.has_char_p (*format_chars))
 	{
-	  warning_at (location_from_offset (format_string_loc,
-					    format_chars + 1
-					    - orig_format_chars),
-		      OPT_Wformat_,
-		      "repeated %s in format", _(s->name));
+	  format_warning_at_char (format_string_loc, format_string_cst,
+				  format_chars + 1 - orig_format_chars,
+				  OPT_Wformat_,
+				  "repeated %s in format", _(s->name));
 	}
       else
 	flag_chars.add_char (*format_chars);
@@ -2145,10 +2252,10 @@
   ++format_chars;
   flag_chars.add_char (fki->left_precision_char);
   if (!ISDIGIT (*format_chars))
-    warning_at (location_from_offset (format_string_loc,
-				      format_chars - orig_format_chars),
-		OPT_Wformat_,
-		"empty left precision in %s format", fki->name);
+    format_warning_at_char (format_string_loc, format_string_cst,
+			    format_chars - orig_format_chars,
+			    OPT_Wformat_,
+			    "empty left precision in %s format", fki->name);
   while (ISDIGIT (*format_chars))
     ++format_chars;
 }
@@ -2236,10 +2343,10 @@
     {
       if (!(fki->flags & (int) FMT_FLAG_EMPTY_PREC_OK)
 	  && !ISDIGIT (*format_chars))
-	warning_at (location_from_offset (format_string_loc,
-					  format_chars - orig_format_chars),
-		    OPT_Wformat_,
-		    "empty precision in %s format", fki->name);
+	format_warning_at_char (format_string_loc, format_string_cst,
+				format_chars - orig_format_chars,
+				OPT_Wformat_,
+				"empty precision in %s format", fki->name);
       while (ISDIGIT (*format_chars))
 	++format_chars;
     }
@@ -2340,11 +2447,10 @@
 	{
 	  const format_flag_spec *s = get_flag_spec (flag_specs,
 						     *format_chars, NULL);
-	  warning_at (location_from_offset (format_string_loc,
-					    format_chars
-					    - orig_format_chars),
-		      OPT_Wformat_,
-		      "repeated %s in format", _(s->name));
+	  format_warning_at_char (format_string_loc, format_string_cst,
+				  format_chars - orig_format_chars,
+				  OPT_Wformat_,
+				  "repeated %s in format", _(s->name));
 	}
       else
 	flag_chars.add_char (*format_chars);
@@ -2372,17 +2478,19 @@
   if (fci->format_chars == 0)
     {
       if (ISGRAPH (format_char))
-	warning_at (location_from_offset (format_string_loc,
-					  format_chars - orig_format_chars),
-		    OPT_Wformat_,
-		    "unknown conversion type character %qc in format",
-		    format_char);
+	format_warning_at_char (format_string_loc, format_string_cst,
+				format_chars - orig_format_chars,
+				OPT_Wformat_,
+				"unknown conversion type character"
+				" %qc in format",
+				format_char);
       else
-	warning_at (location_from_offset (format_string_loc,
-					  format_chars - orig_format_chars),
-		    OPT_Wformat_,
-		    "unknown conversion type character 0x%x in format",
-		    format_char);
+	format_warning_at_char (format_string_loc, format_string_cst,
+				format_chars - orig_format_chars,
+				OPT_Wformat_,
+				"unknown conversion type character"
+				" 0x%x in format",
+				format_char);
       return NULL;
     }
 
@@ -2389,11 +2497,11 @@
   if (pedantic)
     {
       if (ADJ_STD (fci->std) > C_STD_VER)
-	warning_at (location_from_offset (format_string_loc,
-					  format_chars - orig_format_chars),
-		    OPT_Wformat_,
-		    "%s does not support the %<%%%c%> %s format",
-		    C_STD_NAME (fci->std), format_char, fki->name);
+	format_warning_at_char (format_string_loc, format_string_cst,
+				format_chars - orig_format_chars,
+				OPT_Wformat_,
+				"%s does not support the %<%%%c%> %s format",
+				C_STD_NAME (fci->std), format_char, fki->name);
     }
 
   return fci;
@@ -2496,10 +2604,10 @@
     ++format_chars;
   if (*format_chars != ']')
     /* The end of the format string was reached.  */
-    warning_at (location_from_offset (format_string_loc,
-				      format_chars - orig_format_chars),
-		OPT_Wformat_,
-		"no closing %<]%> for %<%%[%> format");
+    format_warning_at_char (format_string_loc, format_string_cst,
+			    format_chars - orig_format_chars,
+			    OPT_Wformat_,
+			    "no closing %<]%> for %<%%[%> format");
 }
 
 /* Return true if this argument is to be continued to be parsed,
@@ -2525,12 +2633,13 @@
   wanted_type_std = fci->types[len_modifier.val].std;
   if (wanted_type == 0)
     {
-      warning_at (location_from_offset (format_string_loc,
-					format_chars - orig_format_chars),
-		  OPT_Wformat_,
-		  "use of %qs length modifier with %qc type character"
-		  " has either no effect or undefined behavior",
-		  len_modifier.chars, format_char);
+      format_warning_at_char (format_string_loc, format_string_cst,
+			      format_chars - orig_format_chars,
+			      OPT_Wformat_,
+			      "use of %qs length modifier with %qc type"
+			      " character has either no effect"
+			      " or undefined behavior",
+			      len_modifier.chars, format_char);
       /* Heuristic: skip one argument when an invalid length/type
 	 combination is encountered.  */
       arg_num++;
@@ -2546,12 +2655,13 @@
 	   && ADJ_STD (wanted_type_std) > ADJ_STD (fci->std))
     {
       if (ADJ_STD (wanted_type_std) > C_STD_VER)
-	warning_at (location_from_offset (format_string_loc,
-					  format_chars - orig_format_chars),
-		    OPT_Wformat_,
-		    "%s does not support the %<%%%s%c%> %s format",
-		    C_STD_NAME (wanted_type_std), len_modifier.chars,
-		    format_char, fki->name);
+	format_warning_at_char (format_string_loc, format_string_cst,
+				format_chars - orig_format_chars,
+				OPT_Wformat_,
+				"%s does not support the %<%%%s%c%> %s format",
+				C_STD_NAME (wanted_type_std),
+				len_modifier.chars,
+				format_char, fki->name);
     }
 
   return true;
@@ -2571,7 +2681,8 @@
 		     unsigned HOST_WIDE_INT &arg_num,
 		     tree &params,
 		     const int alloc_flag,
-		     const char * const format_start)
+		     const char * const format_start,
+		     location_t fmt_param_loc)
 {
   if (info->first_arg_num == 0)
     return true;
@@ -2670,7 +2781,13 @@
     }
 
   if (first_wanted_type != 0)
-    check_format_types (format_string_loc, first_wanted_type);
+    {
+      ptrdiff_t offset_to_format_start = (start_of_this_format - 1) - orig_format_chars;
+      ptrdiff_t offset_to_format_end = (format_chars - 1) - orig_format_chars;
+      substring_loc fmt_loc (fmt_param_loc, TREE_TYPE (format_string_cst),
+			     offset_to_format_start, offset_to_format_end);
+      check_format_types (fmt_loc, first_wanted_type);
+    }
 
   return true;
 }
@@ -2685,6 +2802,7 @@
 static void
 check_format_info_main (format_check_results *res,
 			function_format_info *info, const char *format_chars,
+			location_t fmt_param_loc, tree format_string_cst,
 			int format_length, tree params,
 			unsigned HOST_WIDE_INT arg_num,
 			object_allocator <format_wanted_type> &fwt_pool)
@@ -2708,10 +2826,10 @@
 	continue;
       if (*format_chars == 0)
 	{
-          warning_at (location_from_offset (format_string_loc,
-					    format_chars - orig_format_chars),
-		      OPT_Wformat_,
-		      "spurious trailing %<%%%> in format");
+	  format_warning_at_char (format_string_loc, format_string_cst,
+				  format_chars - orig_format_chars,
+				  OPT_Wformat_,
+				  "spurious trailing %<%%%> in format");
 	  continue;
 	}
       if (*format_chars == '%')
@@ -2721,8 +2839,8 @@
 	}
 
       flag_chars_t flag_chars;
-      argument_parser arg_parser (info, format_chars, orig_format_chars,
-				  format_string_loc,
+      argument_parser arg_parser (info, format_chars, format_string_cst,
+				  orig_format_chars, format_string_loc,
 				  flag_chars, has_operand_number,
 				  first_fillin_param, fwt_pool);
 
@@ -2759,10 +2877,10 @@
 	  || (!(fki->flags & (int) FMT_FLAG_FANCY_PERCENT_OK)
 	      && format_char == '%'))
 	{
-	  warning_at (location_from_offset (format_string_loc,
-					    format_chars - orig_format_chars),
-		      OPT_Wformat_,
-		      "conversion lacks type at end of format");
+	  format_warning_at_char (format_string_loc, format_string_cst,
+			     format_chars - orig_format_chars,
+			     OPT_Wformat_,
+			     "conversion lacks type at end of format");
 	  continue;
 	}
       format_chars++;
@@ -2773,6 +2891,7 @@
 	continue;
 
       flag_chars.validate (fki, fci, flag_specs, format_chars,
+			   format_string_cst,
 			   format_string_loc, orig_format_chars, format_char);
 
       const int alloc_flag = flag_chars.get_alloc_flag (fki);
@@ -2803,15 +2922,15 @@
 					   suppressed,
 					   arg_num, params,
 					   alloc_flag,
-					   format_start))
+					   format_start, fmt_param_loc))
 	return;
     }
 
   if (format_chars - orig_format_chars != format_length)
-    warning_at (location_from_offset (format_string_loc,
-				      format_chars + 1 - orig_format_chars),
-		OPT_Wformat_contains_nul,
-		"embedded %<\\0%> in format");
+    format_warning_at_char (format_string_loc, format_string_cst,
+			    format_chars + 1 - orig_format_chars,
+			    OPT_Wformat_contains_nul,
+			    "embedded %<\\0%> in format");
   if (info->first_arg_num != 0 && params != 0
       && has_operand_number <= 0)
     {
@@ -2822,12 +2941,12 @@
     finish_dollar_format_checking (res, fki->flags & (int) FMT_FLAG_DOLLAR_GAP_POINTER_OK);
 }
 
-
 /* Check the argument types from a single format conversion (possibly
-   including width and precision arguments).  LOC is the location of
-   the format string.  */
+   including width and precision arguments).  FMT_LOC is the
+   location of the format conversion.  */
 static void
-check_format_types (location_t loc, format_wanted_type *types)
+check_format_types (const substring_loc &fmt_loc,
+		    format_wanted_type *types)
 {
   for (; types != 0; types = types->next)
     {
@@ -2854,7 +2973,7 @@
       cur_param = types->param;
       if (!cur_param)
         {
-          format_type_warning (loc, types, wanted_type, NULL);
+          format_type_warning (fmt_loc, NULL, types, wanted_type, NULL);
           continue;
         }
 
@@ -2864,6 +2983,16 @@
       orig_cur_type = cur_type;
       char_type_flag = 0;
 
+      source_range param_range;
+      source_range *param_range_ptr;
+      if (CAN_HAVE_LOCATION_P (cur_param))
+	{
+	  param_range = EXPR_LOCATION_RANGE (cur_param);
+	  param_range_ptr = &param_range;
+	}
+      else
+	param_range_ptr = NULL;
+
       STRIP_NOPS (cur_param);
 
       /* Check the types of any additional pointer arguments
@@ -2928,7 +3057,8 @@
 	    }
 	  else
 	    {
-              format_type_warning (loc, types, wanted_type, orig_cur_type);
+	      format_type_warning (fmt_loc, param_range_ptr,
+				   types, wanted_type, orig_cur_type);
 	      break;
 	    }
 	}
@@ -2996,20 +3126,24 @@
 	  && TYPE_PRECISION (cur_type) == TYPE_PRECISION (wanted_type))
 	continue;
       /* Now we have a type mismatch.  */
-      format_type_warning (loc, types, wanted_type, orig_cur_type);
+      format_type_warning (fmt_loc, param_range_ptr, types,
+			   wanted_type, orig_cur_type);
     }
 }
 
 
-/* Give a warning at LOC about a format argument of different type from that
-   expected.  WANTED_TYPE is the type the argument should have, possibly
-   stripped of pointer dereferences.  The description (such as "field
+/* Give a warning at FMT_LOC about a format argument of different type
+   from that expected.  If non-NULL, PARAM_RANGE is the source range of the
+   relevant argument.  WANTED_TYPE is the type the argument should have,
+   possibly stripped of pointer dereferences.  The description (such as "field
    precision"), the placement in the format string, a possibly more
    friendly name of WANTED_TYPE, and the number of pointer dereferences
    are taken from TYPE.  ARG_TYPE is the type of the actual argument,
    or NULL if it is missing.  */
 static void
-format_type_warning (location_t loc, format_wanted_type *type,
+format_type_warning (const substring_loc &fmt_loc,
+		     source_range *param_range,
+		     format_wanted_type *type,
 		     tree wanted_type, tree arg_type)
 {
   int kind = type->kind;
@@ -3018,7 +3152,6 @@
   int format_length = type->format_length;
   int pointer_count = type->pointer_count;
   int arg_num = type->arg_num;
-  unsigned int offset_loc = type->offset_loc;
 
   char *p;
   /* If ARG_TYPE is a typedef with a misleading name (for example,
@@ -3052,41 +3185,47 @@
       p[pointer_count + 1] = 0;
     }
 
-  loc = location_from_offset (loc, offset_loc);
-		      
   if (wanted_type_name)
     {
       if (arg_type)
-        warning_at (loc, OPT_Wformat_,
-		    "%s %<%s%.*s%> expects argument of type %<%s%s%>, "
-		    "but argument %d has type %qT",
-		    gettext (kind_descriptions[kind]),
-		    (kind == CF_KIND_FORMAT ? "%" : ""),
-		    format_length, format_start, 
-		    wanted_type_name, p, arg_num, arg_type);
+	format_warning_at_substring
+	  (fmt_loc, param_range,
+	   OPT_Wformat_,
+	   "%s %<%s%.*s%> expects argument of type %<%s%s%>, "
+	   "but argument %d has type %qT",
+	   gettext (kind_descriptions[kind]),
+	   (kind == CF_KIND_FORMAT ? "%" : ""),
+	   format_length, format_start,
+	   wanted_type_name, p, arg_num, arg_type);
       else
-        warning_at (loc, OPT_Wformat_,
-		    "%s %<%s%.*s%> expects a matching %<%s%s%> argument",
-		    gettext (kind_descriptions[kind]),
-		    (kind == CF_KIND_FORMAT ? "%" : ""),
-		    format_length, format_start, wanted_type_name, p);
+	format_warning_at_substring
+	  (fmt_loc, param_range,
+	   OPT_Wformat_,
+	   "%s %<%s%.*s%> expects a matching %<%s%s%> argument",
+	   gettext (kind_descriptions[kind]),
+	   (kind == CF_KIND_FORMAT ? "%" : ""),
+	   format_length, format_start, wanted_type_name, p);
     }
   else
     {
       if (arg_type)
-        warning_at (loc, OPT_Wformat_,
-		    "%s %<%s%.*s%> expects argument of type %<%T%s%>, "
-		    "but argument %d has type %qT",
-		    gettext (kind_descriptions[kind]),
-		    (kind == CF_KIND_FORMAT ? "%" : ""),
-		    format_length, format_start, 
-		    wanted_type, p, arg_num, arg_type);
+	format_warning_at_substring
+	  (fmt_loc, param_range,
+	   OPT_Wformat_,
+	   "%s %<%s%.*s%> expects argument of type %<%T%s%>, "
+	   "but argument %d has type %qT",
+	   gettext (kind_descriptions[kind]),
+	   (kind == CF_KIND_FORMAT ? "%" : ""),
+	   format_length, format_start,
+	   wanted_type, p, arg_num, arg_type);
       else
-        warning_at (loc, OPT_Wformat_,
-		    "%s %<%s%.*s%> expects a matching %<%T%s%> argument",
-		    gettext (kind_descriptions[kind]),
-		    (kind == CF_KIND_FORMAT ? "%" : ""),
-		    format_length, format_start, wanted_type, p);
+	format_warning_at_substring
+	  (fmt_loc, param_range,
+	   OPT_Wformat_,
+	   "%s %<%s%.*s%> expects a matching %<%T%s%> argument",
+	   gettext (kind_descriptions[kind]),
+	   (kind == CF_KIND_FORMAT ? "%" : ""),
+	   format_length, format_start, wanted_type, p);
     }
 }
 
Index: gcc/testsuite/gcc.dg/cpp/pr66415-1.c
===================================================================
--- gcc/testsuite/gcc.dg/cpp/pr66415-1.c	(revision 239252)
+++ gcc/testsuite/gcc.dg/cpp/pr66415-1.c	(revision 239253)
@@ -1,9 +1,15 @@
 /* PR c/66415 */
 /* { dg-do compile } */
-/* { dg-options "-Wformat" } */
+/* { dg-options "-Wformat -fdiagnostics-show-caret" } */
 
 void
 fn1 (void)
 {
   __builtin_printf                                ("xxxxxxxxxxxxxxxxx%dxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"); /* { dg-warning "71:format" } */
+
+/* { dg-begin-multiline-output "" }
+   __builtin_printf                                ("xxxxxxxxxxxxxxxxx%dxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx");
+                                                                      ~^
+   { dg-end-multiline-output "" } */
+
 }
Index: gcc/testsuite/gcc.dg/format/diagnostic-ranges.c
===================================================================
--- gcc/testsuite/gcc.dg/format/diagnostic-ranges.c	(revision 0)
+++ gcc/testsuite/gcc.dg/format/diagnostic-ranges.c	(revision 239253)
@@ -0,0 +1,222 @@
+/* { dg-options "-Wformat -fdiagnostics-show-caret" } */
+
+/* See PR 52952. */
+
+#include "format.h"
+
+void test_mismatching_types (const char *msg)
+{
+  printf("hello %i", msg);  /* { dg-warning "format '%i' expects argument of type 'int', but argument 2 has type 'const char \\*' " } */
+
+/* TODO: ideally would also underline "msg".  */
+/* { dg-begin-multiline-output "" }
+   printf("hello %i", msg);
+                 ~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_multiple_arguments (void)
+{
+  printf ("arg0: %i  arg1: %s arg 2: %i", /* { dg-warning "29: format '%s'" } */
+          100, 101, 102);
+/* TODO: ideally would also underline "101".  */
+/* { dg-begin-multiline-output "" }
+   printf ("arg0: %i  arg1: %s arg 2: %i",
+                            ~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_multiple_arguments_2 (int i, int j)
+{
+  printf ("arg0: %i  arg1: %s arg 2: %i", /* { dg-warning "29: format '%s'" } */
+          100, i + j, 102);
+/* { dg-begin-multiline-output "" }
+   printf ("arg0: %i  arg1: %s arg 2: %i",
+                            ~^
+           100, i + j, 102);
+                ~~~~~         
+   { dg-end-multiline-output "" } */
+}
+
+void multiline_format_string (void) {
+  printf ("before the fmt specifier" /* { dg-warning "11: format '%d' expects a matching 'int' argument" } */
+/* { dg-begin-multiline-output "" }
+   printf ("before the fmt specifier"
+           ^~~~~~~~~~~~~~~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+
+          "%"
+          "d" /* { dg-message "12: format string is defined here" } */
+          "after the fmt specifier");
+
+/* { dg-begin-multiline-output "" }
+           "%"
+            ~~
+           "d"
+           ~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_hex (const char *msg)
+{
+  /* "%" is \x25
+     "i" is \x69 */
+  printf("hello \x25\x69", msg);  /* { dg-warning "format '%i' expects argument of type 'int', but argument 2 has type 'const char \\*' " } */
+
+/* TODO: ideally would also underline "msg".  */
+/* { dg-begin-multiline-output "" }
+   printf("hello \x25\x69", msg);
+                 ~~~~~~~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_oct (const char *msg)
+{
+  /* "%" is octal 045
+     "i" is octal 151.  */
+  printf("hello \045\151", msg);  /* { dg-warning "format '%i' expects argument of type 'int', but argument 2 has type 'const char \\*' " } */
+
+/* TODO: ideally would also underline "msg".  */
+/* { dg-begin-multiline-output "" }
+   printf("hello \045\151", msg);
+                 ~~~~~~~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_multiple (const char *msg)
+{
+  /* "%" is \x25 in hex
+     "i" is \151 in octal.  */
+  printf("prefix"  "\x25"  "\151"  "suffix",  /* { dg-warning "format '%i'" } */
+         msg);
+/* { dg-begin-multiline-output "" }
+   printf("prefix"  "\x25"  "\151"  "suffix",
+          ^~~~~~~~
+  { dg-end-multiline-output "" } */
+
+/* TODO: ideally would also underline "msg".  */
+/* { dg-begin-multiline-output "" }
+   printf("prefix"  "\x25"  "\151"  "suffix",
+                     ~~~~~~~~~~~^
+  { dg-end-multiline-output "" } */
+}
+
+void test_u8 (const char *msg)
+{
+  printf(u8"hello %i", msg);/* { dg-warning "format '%i' expects argument of type 'int', but argument 2 has type 'const char \\*' " } */
+/* TODO: ideally would also underline "msg".  */
+/* { dg-begin-multiline-output "" }
+   printf(u8"hello %i", msg);
+                   ~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_param (long long_i, long long_j)
+{
+  printf ("foo %s bar", long_i + long_j); /* { dg-warning "17: format '%s' expects argument of type 'char \\*', but argument 2 has type 'long int'" } */
+/* { dg-begin-multiline-output "" }
+   printf ("foo %s bar", long_i + long_j);
+                ~^       ~~~~~~~~~~~~~~~
+   { dg-end-multiline-output "" } */
+}
+
+void test_field_width_specifier (long l, int i1, int i2)
+{
+  printf (" %*.*d ", l, i1, i2); /* { dg-warning "17: field width specifier '\\*' expects argument of type 'int', but argument 2 has type 'long int'" } */
+/* { dg-begin-multiline-output "" }
+   printf (" %*.*d ", l, i1, i2);
+             ~~~~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_spurious_percent (void)
+{
+  printf("hello world %"); /* { dg-warning "23: spurious trailing" } */
+
+/* { dg-begin-multiline-output "" }
+   printf("hello world %");
+                       ^
+   { dg-end-multiline-output "" } */
+}
+
+void test_empty_precision (char *s, size_t m, double d)
+{
+  strfmon (s, m, "%#.5n", d); /* { dg-warning "20: empty left precision in gnu_strfmon format" } */
+/* { dg-begin-multiline-output "" }
+   strfmon (s, m, "%#.5n", d);
+                    ^
+   { dg-end-multiline-output "" } */
+
+  strfmon (s, m, "%#5.n", d); /* { dg-warning "22: empty precision in gnu_strfmon format" } */
+/* { dg-begin-multiline-output "" }
+   strfmon (s, m, "%#5.n", d);
+                      ^
+   { dg-end-multiline-output "" } */
+}
+
+void test_repeated (int i)
+{
+  printf ("%++d", i); /* { dg-warning "14: repeated '\\+' flag in format" } */
+/* { dg-begin-multiline-output "" }
+   printf ("%++d", i);
+              ^
+   { dg-end-multiline-output "" } */
+}
+
+void test_conversion_lacks_type (void)
+{
+  printf (" %h"); /* { dg-warning "14:conversion lacks type at end of format" } */
+/* { dg-begin-multiline-output "" }
+   printf (" %h");
+              ^
+   { dg-end-multiline-output "" } */
+}
+
+void test_embedded_nul (void)
+{
+  printf (" \0 "); /* { dg-warning "14:embedded" "warning for embedded NUL" } */
+/* { dg-begin-multiline-output "" }
+   printf (" \0 ");
+             ~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_macro (const char *msg)
+{
+#define INT_FMT "%i" /* { dg-message "19: format string is defined here" } */
+  printf("hello " INT_FMT " world", msg);  /* { dg-warning "10: format '%i' expects argument of type 'int', but argument 2 has type 'const char \\*' " } */
+/* { dg-begin-multiline-output "" }
+   printf("hello " INT_FMT " world", msg);
+          ^~~~~~~~
+   { dg-end-multiline-output "" } */
+/* { dg-begin-multiline-output "" }
+ #define INT_FMT "%i"
+                  ~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_non_contiguous_strings (void)
+{
+  __builtin_printf(" %" "d ", 0.5); /* { dg-warning "20: format .%d. expects argument of type .int., but argument 2 has type .double." } */
+                                    /* { dg-message "26: format string is defined here" "" { target *-*-* } 200 } */
+  /* { dg-begin-multiline-output "" }
+   __builtin_printf(" %" "d ", 0.5);
+                    ^~~~
+   { dg-end-multiline-output "" } */
+  /* { dg-begin-multiline-output "" }
+   __builtin_printf(" %" "d ", 0.5);
+                      ~~~~^
+   { dg-end-multiline-output "" } */
+}
+
+void test_const_arrays (void)
+{
+  /* TODO: ideally we'd highlight both the format string *and* the use of
+     it here.  For now, just verify that we gracefully handle this case.  */
+  const char a[] = " %d ";
+  __builtin_printf(a, 0.5); /* { dg-warning "20: format .%d. expects argument of type .int., but argument 2 has type .double." } */
+  /* { dg-begin-multiline-output "" }
+   __builtin_printf(a, 0.5);
+                    ^
+   { dg-end-multiline-output "" } */
+}
Index: gcc/testsuite/gcc.dg/format/asm_fprintf-1.c
===================================================================
--- gcc/testsuite/gcc.dg/format/asm_fprintf-1.c	(revision 239252)
+++ gcc/testsuite/gcc.dg/format/asm_fprintf-1.c	(revision 239253)
@@ -66,9 +66,9 @@
   asm_fprintf ("%d", i, i); /* { dg-warning "16:arguments" "wrong number of args" } */
   /* Miscellaneous bogus constructions.  */
   asm_fprintf (""); /* { dg-warning "16:zero-length" "warning for empty format" } */
-  asm_fprintf ("\0"); /* { dg-warning "17:embedded" "warning for embedded NUL" } */
-  asm_fprintf ("%d\0", i); /* { dg-warning "19:embedded" "warning for embedded NUL" } */
-  asm_fprintf ("%d\0%d", i, i); /* { dg-warning "19:embedded|too many" "warning for embedded NUL" } */
+  asm_fprintf ("\0"); /* { dg-warning "18:embedded" "warning for embedded NUL" } */
+  asm_fprintf ("%d\0", i); /* { dg-warning "20:embedded" "warning for embedded NUL" } */
+  asm_fprintf ("%d\0%d", i, i); /* { dg-warning "20:embedded|too many" "warning for embedded NUL" } */
   asm_fprintf (NULL); /* { dg-warning "null" "null format string warning" } */
   asm_fprintf ("%"); /* { dg-warning "17:trailing" "trailing % warning" } */
   asm_fprintf ("%++d", i); /* { dg-warning "19:repeated" "repeated flag warning" } */
Index: gcc/testsuite/gcc.dg/format/c90-printf-1.c
===================================================================
--- gcc/testsuite/gcc.dg/format/c90-printf-1.c	(revision 239252)
+++ gcc/testsuite/gcc.dg/format/c90-printf-1.c	(revision 239253)
@@ -58,11 +58,11 @@
   printf ("%-%"); /* { dg-warning "13:type" "missing type" } */
   /* { dg-warning "14:trailing" "bogus %%" { target *-*-* } 58 } */
   printf ("%-%\n"); /* { dg-warning "13:format" "bogus %%" } */
-  /* { dg-warning "15:format" "bogus %%" { target *-*-* } 60 } */
+  /* { dg-warning "16:format" "bogus %%" { target *-*-* } 60 } */
   printf ("%5%\n"); /* { dg-warning "13:format" "bogus %%" } */
-  /* { dg-warning "15:format" "bogus %%" { target *-*-* } 62 } */
+  /* { dg-warning "16:format" "bogus %%" { target *-*-* } 62 } */
   printf ("%h%\n"); /* { dg-warning "13:format" "bogus %%" } */
-  /* { dg-warning "15:format" "bogus %%" { target *-*-* } 64 } */
+  /* { dg-warning "16:format" "bogus %%" { target *-*-* } 64 } */
   /* Valid and invalid %h, %l, %L constructions.  */
   printf ("%hd", i);
   printf ("%hi", i);
@@ -184,8 +184,8 @@
   printf ("%-08G", d); /* { dg-warning "11:flags|ignored" "0 flag ignored with - flag" } */
   /* Various tests of bad argument types.  */
   printf ("%d", l); /* { dg-warning "13:format" "bad argument types" } */
-  printf ("%*.*d", l, i2, i); /* { dg-warning "13:field" "bad * argument types" } */
-  printf ("%*.*d", i1, l, i); /* { dg-warning "15:field" "bad * argument types" } */
+  printf ("%*.*d", l, i2, i); /* { dg-warning "16:field" "bad * argument types" } */
+  printf ("%*.*d", i1, l, i); /* { dg-warning "16:field" "bad * argument types" } */
   printf ("%ld", i); /* { dg-warning "14:format" "bad argument types" } */
   printf ("%s", n); /* { dg-warning "13:format" "bad argument types" } */
   printf ("%p", i); /* { dg-warning "13:format" "bad argument types" } */
@@ -231,8 +231,8 @@
   printf ("%d", i, i); /* { dg-warning "11:arguments" "wrong number of args" } */
   /* Miscellaneous bogus constructions.  */
   printf (""); /* { dg-warning "11:zero-length" "warning for empty format" } */
-  printf ("\0"); /* { dg-warning "12:embedded" "warning for embedded NUL" } */
-  printf ("%d\0", i); /* { dg-warning "14:embedded" "warning for embedded NUL" } */
+  printf ("\0"); /* { dg-warning "13:embedded" "warning for embedded NUL" } */
+  printf ("%d\0", i); /* { dg-warning "15:embedded" "warning for embedded NUL" } */
   printf ("%d\0%d", i, i); /* { dg-warning "embedded|too many" "warning for embedded NUL" } */
   printf (NULL); /* { dg-warning "3:null" "null format string warning" } */
   printf ("%"); /* { dg-warning "12:trailing" "trailing % warning" } */
Index: gcc/testsuite/ChangeLog
===================================================================
--- gcc/testsuite/ChangeLog	(revision 239252)
+++ gcc/testsuite/ChangeLog	(revision 239253)
@@ -1,3 +1,11 @@
+2016-08-08  David Malcolm  <dmalcolm@redhat.com>
+
+	PR c/52952
+	* gcc.dg/cpp/pr66415-1.c: Likewise.
+	* gcc.dg/format/asm_fprintf-1.c: Update column numbers.
+	* gcc.dg/format/c90-printf-1.c: Likewise.
+	* gcc.dg/format/diagnostic-ranges.c: New test case.
+
 2016-08-08  Jakub Jelinek  <jakub@redhat.com>
 
 	PR fortran/72716


* Re: [PATCH] RFC: On-demand locations within string-literals
  2016-07-24  0:37   ` David Malcolm
@ 2016-08-23  3:25     ` Martin Sebor
  2016-08-23 13:59       ` David Malcolm
  0 siblings, 1 reply; 61+ messages in thread
From: Martin Sebor @ 2016-08-23  3:25 UTC (permalink / raw)
  To: David Malcolm, gcc-patches

>> Beyond that, the range normally works fine, except when macros
>> are involved like they are in my tests.  You can see the effect
>> in the range.out file.  (This works without your patch but it
>> could very well be because I didn't set it up right.)
>
> Sadly I can't figure out what's going wrong - but the code's changed a
> lot at my end since then.  Sorry.

I have integrated the latest (already committed) version of your
patch into my -Wformat-length patch.  Everything (well almost)
works and I get nice ranges for the format string and for (some)
arguments.

I was surprised at how long it took me to switch from the previous
implementation (also copied from c-format.c) to this new API.  As
before, I had to copy bits and pieces of code from other parts of
the compiler to get it to work.  I was also surprised at how complex
making use of it is.  It added 130 lines of code to the pass, which
is 40 more than what I had before.  It seems that the
format_warning_at_substring function from c-format.c (perhaps
generalized and with some reasonable defaults hardcoded) should
be defined where other parts of GCC (including the middle end)
can reuse it.

I ran into a few minor glitches while testing it and I raised
the following bugs for two of them:

   77328 - incorrect caret location in -Wformat calling printf
           via a macro (this was pre-existing)
   77331 - incorrect range location in -Wformat with a concatenated
           format literal (this is new)

The third issue seems like a limitation that I should be able to
overcome but I couldn't figure out how using the new API.  The
problem is that there doesn't seem to be a way to point the caret
at the closing quote of a string, like for example in the following
test.  Even though by default the whole string is underlined (and
the caret points to the opening quote), there doesn't seem to be
a way to specify a range where the caret points to the other quote.
It's no big deal and I only noticed it because one of my tests
started failing, but it seems like it should be possible.

$ cat t.c && /build/gcc-49905/gcc/xgcc -B /build/gcc-49905/gcc -S -Wformat t.c
char d [2];

void f (void)
{
   __builtin_sprintf (d, "%sX", "1");
}
t.c: In function ‘f’:
t.c:5:25: warning: writing a terminating nul past the end of the destination [-Wformat-length=]
    __builtin_sprintf (d, "%sX", "1");
                          ^~~~~

What I would like to see is similar to what I get when one of
the format string characters is written past the end:

t.c:5:30: warning: writing format character ‘Z’ at offset 4 past the end of the destination [-Wformat-length=]
    __builtin_sprintf (d, "%sXYZ", "");
                               ^

Martin


* Re: [PATCH] RFC: On-demand locations within string-literals
  2016-08-23  3:25     ` Martin Sebor
@ 2016-08-23 13:59       ` David Malcolm
  2016-08-23 15:18         ` Martin Sebor
  0 siblings, 1 reply; 61+ messages in thread
From: David Malcolm @ 2016-08-23 13:59 UTC (permalink / raw)
  To: Martin Sebor, gcc-patches

On Mon, 2016-08-22 at 21:25 -0600, Martin Sebor wrote:
> > > Beyond that, the range normally works fine, except when macros
> > > are involved like they are in my tests.  You can see the effect
> > > in the range.out file.  (This works without your patch but it
> > > could very well be because I didn't set it up right.)
> > 
> > Sadly I can't figure out what's going wrong - but the code's
> > changed a
> > lot at my end since then.  Sorry.
> 
> I have integrated the latest (already committed) version of your
> patch into my -Wformat-length patch.  Everything (well almost)
> works and I get nice ranges for the format string and for (some)
> arguments.
> 
> I was surprised at how long it took me to switch from the previous
> implementation (also copied from c-format.c) to this new API.  As
> before, I had to copy bits and pieces of code from other parts of
> the compiler to get it to work.  I was also surprised at how complex
> making use of it is.  It added 130 lines of code to the pass, which
> is 40 more than what I had before.  It seems that the
> format_warning_at_substring function from c-format.c (perhaps
> generalized and with some reasonable defaults hardcoded) should
> be defined where other parts of GCC (including the middle end)
> can reuse it.

I'm guessing that it was difficult because the most useful parts are
currently in c-format.c, whereas your code is in the middle-end.

Is the latest version of your patch posted somewhere where I can see
it?

The substring_loc class should probably be moved from c-family to gcc
also.   We might need a langhook to support that though (not sure yet).

I'd be up for doing these moves (maybe moving
 format_warning_at_substring to diagnostic.h/c), but I'd prefer to see
your patch first.


> I ran into a few minor glitches while testing it and I raised
> the following bugs for two of them:
> 
>    77328 - incorrect caret location in -Wformat calling printf
>            via a macro (this was pre-existing)
>    77331 - incorrect range location in -Wformat with a concatenated
>            format literal (this is new)
> 
> The third issue seems like a limitation that I should be able to
> overcome but I couldn't figure out how using the new API.  The
> problem is that there doesn't seem to be a way to point the caret
> at the closing quote of a string, like for example in the following
> test.  Even though by default the whole string is underlined (and
> the caret points to the opening quote), there doesn't seem to be
> a way to specify a range where the caret points to the other quote.
> It's no big deal and I only noticed it because one of my tests
> started failing, but it seems like it should be possible.
> 
> $ cat t.c && /build/gcc-49905/gcc/xgcc -B /build/gcc-49905/gcc -S -Wformat t.c
> char d [2];
> 
> void f (void)
> {
>    __builtin_sprintf (d, "%sX", "1");
> }
> t.c: In function ‘f’:
> t.c:5:25: warning: writing a terminating nul past the end of the destination [-Wformat-length=]
>     __builtin_sprintf (d, "%sX", "1");
>                           ^~~~~


> What I would like to see is similar to what I get when one of
> the format string characters is written past the end:
> 
> t.c:5:30: warning: writing format character ‘Z’ at offset 4 past the end of the destination [-Wformat-length=]
>     __builtin_sprintf (d, "%sXYZ", "");
>                                ^
> 

So would you like the output to look like this:


Option (a): underline whole string, with caret at close-quote

t.c: In function ‘f’:
t.c:5:25: warning: writing a terminating nul past the end of the destination [-Wformat-length=]
     __builtin_sprintf (d, "%sX", "1");
                           ~~~~^

or like this:

Option (b): just the close-quote

t.c: In function ‘f’:
t.c:5:25: warning: writing a terminating nul past the end of the destination [-Wformat-length=]
     __builtin_sprintf (d, "%sX", "1");
                               ^
?

(do you also emit a note/inform showing the size of d?)

What API are you using to emit the warning?  Given the location of the
string as a whole expressed as a location_t, you can probably do
something like this:

location_t
option_a (location_t string_as_a_whole_loc)
{
  source_range src_range
    = get_range_from_loc (line_table, string_as_a_whole_loc);
  
  return make_location (src_range.m_finish, /* caret */
                        src_range.m_start, src_range.m_finish);
}

location_t
option_b (location_t string_as_a_whole_loc)
{
  source_range src_range
    = get_range_from_loc (line_table, string_as_a_whole_loc);
 
  return src_range.m_finish;
}

(these could be added to libcpp)

but I get the impression you want something like this integrated into
the format_warning or substring_loc APIs (and it's hard to tell without
seeing your patch).
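
To make the difference between the two options concrete, here is a minimal
standalone sketch.  It does not use GCC's real location_t, line_table or
make_location: the "loc" struct, its 1-based column fields and the "render"
helper are hypothetical stand-ins, with a location modelled as a caret column
plus a start/finish range on a single source line.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

/* Hypothetical stand-in for a location_t: a caret column plus the
   start and finish columns of the underlined range (all 1-based,
   all on one line).  The real location_t is an opaque handle that is
   resolved through the line_table.  */
struct loc
{
  int caret;
  int start;
  int finish;
};

/* Option (a): keep the whole range, but put the caret at its finish.  */
static loc
option_a (const loc &whole)
{
  return loc {whole.finish, whole.start, whole.finish};
}

/* Option (b): collapse the location to just the finish column.  */
static loc
option_b (const loc &whole)
{
  return loc {whole.finish, whole.finish, whole.finish};
}

/* Illustrative only: build the annotation line that would be printed
   under the source line, using '~' for the range and '^' for the
   caret.  */
static std::string
render (const loc &l)
{
  std::string line (std::max (l.finish, l.caret), ' ');
  for (int col = l.start; col <= l.finish; col++)
    line[col - 1] = '~';
  line[l.caret - 1] = '^';
  return line;
}
```

With the whole-string range at columns 25 to 29 and the caret on the
open-quote, render gives "^~~~~" for the default, "~~~~^" for option (a),
and a lone "^" under the close-quote for option (b).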

Hope this is helpful
Dave


* Re: [PATCH] RFC: On-demand locations within string-literals
  2016-08-23 13:59       ` David Malcolm
@ 2016-08-23 15:18         ` Martin Sebor
  0 siblings, 0 replies; 61+ messages in thread
From: Martin Sebor @ 2016-08-23 15:18 UTC (permalink / raw)
  To: David Malcolm, gcc-patches

On 08/23/2016 07:59 AM, David Malcolm wrote:
> On Mon, 2016-08-22 at 21:25 -0600, Martin Sebor wrote:
>>>> Beyond that, the range normally works fine, except when macros
>>>> are involved like they are in my tests.  You can see the effect
>>>> in the range.out file.  (This works without your patch but it
>>>> could very well be because I didn't set it up right.)
>>>
>>> Sadly I can't figure out what's going wrong - but the code's
>>> changed a
>>> lot at my end since then.  Sorry.
>>
>> I have integrated the latest (already committed) version of your
>> patch into my -Wformat-length patch.  Everything (well almost)
>> works and I get nice ranges for the format string and for (some)
>> arguments.
>>
>> I was surprised at how long it took me to switch from the previous
>> implementation (also copied from c-format.c) to this new API.  As
>> before, I had to copy bits and pieces of code from other parts of
>> the compiler to get it to work.  I was also surprised at how complex
>> making use of it is.  It added 130 lines of code to the pass, which
>> is 40 more than what I had before.  It seems that the
>> format_warning_at_substring function from c-format.c (perhaps
>> generalized and with some reasonable defaults hardcoded) should
>> be defined where other parts of GCC (including the middle end)
>> can reuse it.
>
> I'm guessing that it was difficult because the most useful parts are
> currently in c-format.c, whereas your code is in the middle-end.
>
> Is the latest version of your patch posted somewhere where I can see
> it?

I'm planning/hoping to post it this week.  It was only difficult
in that I had to gather bits from different parts of the compiler
and figure out how to make them work together.  I.e., it wasn't
a simple matter of replacing one function call with another.  In
the end, it boiled down to replacing the get_location function
with this:

const char *
substring_loc::get_location (location_t *out_loc) const
{
   gcc_assert (out_loc);

   if (!g_string_concat_db)
     g_string_concat_db
       = new (ggc_alloc <string_concat_db> ()) string_concat_db ();

   static struct cpp_reader* parse_in;
   if (!parse_in)
     {
       /* Create and initialize a preprocessing reader.  */
       parse_in = cpp_create_reader (CLK_GNUC99, ident_hash, line_table);
       cpp_init_iconv (parse_in);
     }

   return get_source_location_for_substring (parse_in, g_string_concat_db,
					    m_fmt_string_loc, CPP_STRING,
					    m_caret_idx, m_start_idx, m_end_idx,
					    out_loc);
}

>
> The substring_loc class should probably be moved from c-family to gcc
> also.   We might need a langhook to support that though (not sure yet).
>
> I'd be up for doing these moves (maybe moving
>   format_warning_at_substring to diagnostic.h/c), but I'd prefer to see
> your patch first.
>

Sure.

> So would you like the output to look like this:
>
>
> Option (a): underline whole string, with caret at close-quote
>
> t.c: In function ‘f’:
> t.c:5:25: warning: writing a terminating nul past the end of the destination [-Wformat-length=]
>       __builtin_sprintf (d, "%sX", "1");
>                             ~~~~^
>
> or like this:
>
> Option (b): just the close-quote
>
> t.c: In function ‘f’:
> t.c:5:25: warning: writing a terminating nul past the end of the
> destination [-Wformat-length=]
>       __builtin_sprintf (d, "%sX", "1");
>                                 ^
> ?
>

Either one would work for me.  If this were to become a general
purpose interface then I think it would be nice to let the caller
decide which end/quote (if any) to point the caret at.

> (do you also emit a note/inform showing the size of d?)

Yes.  The full diagnostic looks like this (with the caret at
the wrong end as we're discussing):

t.c:5:25: warning: writing a terminating nul past the end of the 
destination [-Wformat-length=]
    __builtin_sprintf (d, "%sX", "A");
                          ^~~~~
t.c:5:3: note: format output 3 bytes into a destination of size 2
    __builtin_sprintf (d, "%sX", "A");
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

>
> What API are you using to emit the warning?

I copied the format_warning_va and format_warning_at_substring
functions from c-format.c and I'm calling the latter.  I'm not
using the hint for anything yet (not sure if there's
an opportunity to make use of it).

> Given the location of the
> string as a whole expressed as a location_t, you can probably do
> something like this:
>
> location_t
> option_a (location_t string_as_a_whole_loc)
> {
>    source_range src_range
>      = get_range_from_loc (line_table, string_as_a_whole_loc);
>
>    return make_location (src_range.m_finish, /* caret */
>                          src_range.m_start, src_range.m_finish);
> }
>
> location_t
> option_b (location_t string_as_a_whole_loc)
> {
>    source_range src_range
>      = get_range_from_loc (line_table, string_as_a_whole_loc);
>
>    return src_range.m_finish;
> }
>
> (these could be added to libcpp)
>
> but I get the impression you want something like this integrated into
> the format_warning or substring_loc APIs (and it's hard to tell without
> seeing your patch).

I guess I was hoping for a simple high level interface to warning_at
where I could control the offset of the caret and the underlining
within some bounds.  Completely off the cuff, say if warning_at were
overloaded to take another argument with the offsets, then something
like this (with offsets in characters):

   void my_warning (location_t loc)
   {
     int caret = /* offset of caret from the beginning of loc */;
     int begin = /* optional offset of the start of underlining */;
     int end = /* optional offset of the end of underlining */;

     warning_at (range (loc, caret, begin, end), ...);
   }

Maybe I should prototype it to understand if it can be done and what
the tradeoffs might be.
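
To make the idea concrete, here is a minimal standalone sketch of how
a caret offset plus optional underline offsets could map onto the
annotation line printed under the source.  It does not use any GCC
API: caret_range, render_annotation, caret_at_end and
caret_only_at_end are hypothetical names invented for illustration,
and the offsets are assumed to be 0-based columns relative to the
start of the token.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a caret-plus-range request; not a GCC type.
// All offsets are 0-based columns relative to the start of the token.
struct caret_range
{
  int caret;  // column of the '^'
  int begin;  // first underlined column
  int end;    // last underlined column (inclusive)
};

// Build the annotation line printed under the quoted source line.
// TOKEN_COL is the 0-based column at which the token starts.
std::string
render_annotation (int token_col, const caret_range &r)
{
  assert (token_col >= 0 && r.begin >= 0);
  assert (r.begin <= r.caret && r.caret <= r.end);
  std::string line (token_col + r.begin, ' ');
  for (int col = r.begin; col <= r.end; ++col)
    line += (col == r.caret) ? '^' : '~';
  return line;
}

// Option (a): underline the whole token, caret on its last column.
caret_range
caret_at_end (int token_len)
{
  return { token_len - 1, 0, token_len - 1 };
}

// Option (b): caret on the last column only, no underlining.
caret_range
caret_only_at_end (int token_len)
{
  return { token_len - 1, token_len - 1, token_len - 1 };
}
```

With a five-column token such as "%sX" (quotes included), the first
helper yields "~~~~^" and the second a lone "^" under the closing
quote, so the caller picks which end the caret lands on.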

The most recently posted patch uses the location_from_offset function
that c-format.c used before your changes:

   https://gcc.gnu.org/ml/gcc-patches/2016-08/msg00986.html

In my latest patch I replaced that function, and the calls to
warning_at that used its result, with format_warning_at_substring.
I mainly did it because I thought that was going to be the
new/recommended API for this sort of thing.  I wasn't having any
problems with the previous approach
or looking for enhancements (though the split location for the
format directive and its argument is nice).  Did I misunderstand
what the intent of your changes was?  (I.e., did you not expect
me to make that switch?)

>
> Hope this is helpful
> Dave

Yes, thanks.

Martin

* Re: [Committed] [PATCH 2/4] (v4) On-demand locations within string-literals
  2016-08-05 18:17                     ` [Committed] [PATCH 2/4] (v4) " David Malcolm
  2016-08-06  5:48                       ` Markus Trippelsdorf
@ 2021-09-02 13:59                       ` Thomas Schwinge
  2021-09-02 19:09                         ` Thomas Schwinge
  1 sibling, 1 reply; 61+ messages in thread
From: Thomas Schwinge @ 2021-09-02 13:59 UTC (permalink / raw)
  To: David Malcolm, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1661 bytes --]

Hi!

On 2016-08-05T14:16:58-0400, David Malcolm <dmalcolm@redhat.com> wrote:
> Committed to trunk as r239175; I'm attaching the final version of the
> patch for reference.

David, you've added here 'gcc/input.h:struct location_hash' (see quoted
below), which will be useful elsewhere, so:

> --- a/gcc/input.c
> +++ b/gcc/input.c

> +/* Internal function.  Canonicalize LOC into a form suitable for
> +   use as a key within the database, stripping away macro expansion,
> +   ad-hoc information, and range information, using the location of
> +   the start of LOC within an ordinary linemap.  */
> +
> +location_t
> +string_concat_db::get_key_loc (location_t loc)
> +{
> +  loc = linemap_resolve_location (line_table, loc, LRK_SPELLING_LOCATION,
> +                               NULL);
> +
> +  loc = get_range_from_loc (line_table, loc).m_start;
> +
> +  return loc;
> +}

OK to push the attached
"Harden 'gcc/input.c:string_concat_db::get_key_loc'"?  (This fell out of
my analysis for development work elsewhere.)

> --- a/gcc/input.h
> +++ b/gcc/input.h

> +struct location_hash : int_hash <location_t, UNKNOWN_LOCATION> { };
> +
> +class GTY(()) string_concat_db
> +{
> +[...]
> +  hash_map <location_hash, string_concat *> *m_table;
> +};

OK to push the attached
"Generalize 'gcc/input.h:struct location_hash'"?


Grüße
 Thomas


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Attachment #2: 0001-Harden-gcc-input.c-string_concat_db-get_key_loc.patch --]
[-- Type: text/x-diff, Size: 982 bytes --]

From 521c94471ae2f044f8cca8025bfa8db2d2936aea Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Tue, 31 Aug 2021 23:05:46 +0200
Subject: [PATCH 1/2] Harden 'gcc/input.c:string_concat_db::get_key_loc'

We're using 'UNKNOWN_LOCATION' as a spare value for 'Empty', so should
ascertain that we don't use it as a key additionally.

Follow-up to r239175 (commit 88fa5555a309e5d6c6171b957daaf2f800920869)
"On-demand locations within string-literals".

	gcc/
	* input.c (string_concat_db::get_key_loc): Harden.
---
 gcc/input.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/gcc/input.c b/gcc/input.c
index 4b809862e02..98b8bb64618 100644
--- a/gcc/input.c
+++ b/gcc/input.c
@@ -1483,6 +1483,9 @@ string_concat_db::get_key_loc (location_t loc)
 
   loc = get_range_from_loc (line_table, loc).m_start;
 
+  /* Ascertain that 'loc' is valid as a key in 'm_table'.  */
+  gcc_checking_assert (!RESERVED_LOCATION_P (loc));
+
   return loc;
 }
 
-- 
2.33.0


[-- Attachment #3: 0002-Generalize-gcc-input.h-struct-location_hash.patch --]
[-- Type: text/x-diff, Size: 2065 bytes --]

From 349a3172f64db93ee98ea39b36489b702b6596ab Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Tue, 31 Aug 2021 23:30:25 +0200
Subject: [PATCH 2/2] Generalize 'gcc/input.h:struct location_hash'

This is currently only used here ('gcc/input.h:class string_concat_db'), but is
actually generally useful, so advertize it as such.

Per the rationale given, we may use 'BUILTINS_LOCATION' as spare value for
'Deleted', in addition to the existing use of 'UNKNOWN_LOCATION' as spare value
for 'Empty'.

	gcc/
	* input.h (location_hash): Use 'BUILTINS_LOCATION' as spare value
	for 'Deleted'.  Turn into a '#define'.
---
 gcc/input.h | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/gcc/input.h b/gcc/input.h
index e6881072c5f..46971a2684c 100644
--- a/gcc/input.h
+++ b/gcc/input.h
@@ -36,6 +36,25 @@ extern GTY(()) class line_maps *saved_line_table;
    both UNKNOWN_LOCATION and BUILTINS_LOCATION fit into that.  */
 STATIC_ASSERT (BUILTINS_LOCATION < RESERVED_LOCATION_COUNT);
 
+/* Hasher for 'location_t' values satisfying '!RESERVED_LOCATION_P', thus able
+   to use 'UNKNOWN_LOCATION'/'BUILTINS_LOCATION' as spare values for
+   'Empty'/'Deleted'.  */
+/* If the following is used more than once, 'gengtype' generates duplicate
+   functions (thus: "error: redefinition of 'void gt_ggc_mx(location_hash&)'"
+   etc.):
+
+       struct location_hash
+         : int_hash<location_t, UNKNOWN_LOCATION, BUILTINS_LOCATION> {};
+
+   Likewise for this:
+
+       typedef int_hash<location_t, UNKNOWN_LOCATION, BUILTINS_LOCATION>
+         location_hash;
+
+   Thus, use a plain ol' '#define':
+*/
+#define location_hash int_hash<location_t, UNKNOWN_LOCATION, BUILTINS_LOCATION>
+
 extern bool is_location_from_builtin_token (location_t);
 extern expanded_location expand_location (location_t);
 
@@ -230,8 +249,6 @@ public:
   location_t * GTY ((atomic)) m_locs;
 };
 
-struct location_hash : int_hash <location_t, UNKNOWN_LOCATION> { };
-
 class GTY(()) string_concat_db
 {
  public:
-- 
2.33.0


* Re: [Committed] [PATCH 2/4] (v4) On-demand locations within string-literals
  2021-09-02 13:59                       ` [Committed] [PATCH 2/4] (v4) On-demand locations within string-literals Thomas Schwinge
@ 2021-09-02 19:09                         ` Thomas Schwinge
  2021-09-03 16:33                           ` Thomas Schwinge
  0 siblings, 1 reply; 61+ messages in thread
From: Thomas Schwinge @ 2021-09-02 19:09 UTC (permalink / raw)
  To: David Malcolm, gcc-patches

Hi!

On 2021-09-02T15:59:14+0200, I wrote:
> On 2016-08-05T14:16:58-0400, David Malcolm <dmalcolm@redhat.com> wrote:
>> Committed to trunk as r239175; I'm attaching the final version of the
>> patch for reference.
>
> David, you've added here 'gcc/input.h:struct location_hash' (see quoted
> below), which will be useful elsewhere, so:
>
>> --- a/gcc/input.c
>> +++ b/gcc/input.c
>
>> +/* Internal function.  Canonicalize LOC into a form suitable for
>> +   use as a key within the database, stripping away macro expansion,
>> +   ad-hoc information, and range information, using the location of
>> +   the start of LOC within an ordinary linemap.  */
>> +
>> +location_t
>> +string_concat_db::get_key_loc (location_t loc)
>> +{
>> +  loc = linemap_resolve_location (line_table, loc, LRK_SPELLING_LOCATION,
>> +                              NULL);
>> +
>> +  loc = get_range_from_loc (line_table, loc).m_start;
>> +
>> +  return loc;
>> +}
>
> OK to push the attached
> "Harden 'gcc/input.c:string_concat_db::get_key_loc'"?  (This fell out of
> my analysis for development work elsewhere.)

My suggested patch was:

    --- a/gcc/input.c
    +++ b/gcc/input.c
    @@ -1483,6 +1483,9 @@ string_concat_db::get_key_loc (location_t loc)

       loc = get_range_from_loc (line_table, loc).m_start;

    +  /* Ascertain that 'loc' is valid as a key in 'm_table'.  */
    +  gcc_checking_assert (!RESERVED_LOCATION_P (loc));
    +
       return loc;
     }

Uh, I should've looked at the correct test logs...  This change actually
does regress 'c-c++-common/substring-location-PR-87721.c' and
'gcc.dg/plugin/diagnostic-test-string-literals-1.c': for these, we do see
'BUILTINS_LOCATION' (via 'string_concat_db::record_string_concatenation').
Unless someone tells me that's unexpected (I'm completely lost in this
code...), I shall change/generalize my changes to provide both a
'location_hash' only using 'UNKNOWN_LOCATION' as a spare value for
'Empty' (as currently used here) and another variant additionally using
'BUILTINS_LOCATION' as spare value for 'Deleted'.
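
For readers unfamiliar with the "spare value" scheme, the following
standalone sketch illustrates why a key equal to a marker value cannot
be stored.  It is a toy, not GCC's hash_map or int_hash:
toy_location_set and its eight-slot table are invented for
illustration, and the two constants are mock definitions mirroring
UNKNOWN_LOCATION (0) and BUILTINS_LOCATION (1).

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Mock definitions for illustration only.
typedef unsigned int location_t;
const location_t UNKNOWN_LOCATION = 0;   // doubles as the 'Empty' marker
const location_t BUILTINS_LOCATION = 1;  // doubles as the 'Deleted' marker

// Toy open-addressing set that stores keys directly in its slots and
// reserves two key values as slot markers.
class toy_location_set
{
public:
  toy_location_set () { m_slots.fill (UNKNOWN_LOCATION); }

  // Returns false if KEY equals a marker value (it would be
  // indistinguishable from an empty or deleted slot) or the table is full.
  bool insert (location_t key)
  {
    if (key == UNKNOWN_LOCATION || key == BUILTINS_LOCATION)
      return false;
    if (contains (key))
      return true;
    for (std::size_t i = 0; i < m_slots.size (); ++i)
      {
        std::size_t idx = (key + i) % m_slots.size ();
        if (m_slots[idx] == UNKNOWN_LOCATION
            || m_slots[idx] == BUILTINS_LOCATION)
          {
            m_slots[idx] = key;
            return true;
          }
      }
    return false;
  }

  bool contains (location_t key) const
  {
    if (key == UNKNOWN_LOCATION || key == BUILTINS_LOCATION)
      return false;
    for (std::size_t i = 0; i < m_slots.size (); ++i)
      {
        std::size_t idx = (key + i) % m_slots.size ();
        if (m_slots[idx] == key)
          return true;
        if (m_slots[idx] == UNKNOWN_LOCATION)
          return false;  // an empty slot ends the probe sequence
        // A deleted slot does not end it: keep probing.
      }
    return false;
  }

  void remove (location_t key)
  {
    for (std::size_t i = 0; i < m_slots.size (); ++i)
      {
        std::size_t idx = (key + i) % m_slots.size ();
        if (m_slots[idx] == UNKNOWN_LOCATION)
          return;
        if (m_slots[idx] == key && key != BUILTINS_LOCATION)
          {
            m_slots[idx] = BUILTINS_LOCATION;
            return;
          }
      }
  }

private:
  std::array<location_t, 8> m_slots;
};
```

The toy rejects marker-valued keys explicitly; a table without such a
check would silently treat them as empty or deleted slots, which is
the situation the assertion was meant to catch.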


Grüße
 Thomas


>> --- a/gcc/input.h
>> +++ b/gcc/input.h
>
>> +struct location_hash : int_hash <location_t, UNKNOWN_LOCATION> { };
>> +
>> +class GTY(()) string_concat_db
>> +{
>> +[...]
>> +  hash_map <location_hash, string_concat *> *m_table;
>> +};
>
> OK to push the attached
> "Generalize 'gcc/input.h:struct location_hash'"?

My suggested patch was:

> Subject: [PATCH 2/2] Generalize 'gcc/input.h:struct location_hash'
>
> This is currently only used here ('gcc/input.h:class string_concat_db'), but is
> actually generally useful, so advertize it as such.
>
> Per the rationale given, we may use 'BUILTINS_LOCATION' as spare value for
> 'Deleted', in addition to the existing use of 'UNKNOWN_LOCATION' as spare value
> for 'Empty'.
>
>       gcc/
>       * input.h (location_hash): Use 'BUILTINS_LOCATION' as spare value
>       for 'Deleted'.  Turn into a '#define'.
> ---
>  gcc/input.h | 21 +++++++++++++++++++--
>  1 file changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/input.h b/gcc/input.h
> index e6881072c5f..46971a2684c 100644
> --- a/gcc/input.h
> +++ b/gcc/input.h
> @@ -36,6 +36,25 @@ extern GTY(()) class line_maps *saved_line_table;
>     both UNKNOWN_LOCATION and BUILTINS_LOCATION fit into that.  */
>  STATIC_ASSERT (BUILTINS_LOCATION < RESERVED_LOCATION_COUNT);
>
> +/* Hasher for 'location_t' values satisfying '!RESERVED_LOCATION_P', thus able
> +   to use 'UNKNOWN_LOCATION'/'BUILTINS_LOCATION' as spare values for
> +   'Empty'/'Deleted'.  */
> +/* If the following is used more than once, 'gengtype' generates duplicate
> +   functions (thus: "error: redefinition of 'void gt_ggc_mx(location_hash&)'"
> +   etc.):
> +
> +       struct location_hash
> +         : int_hash<location_t, UNKNOWN_LOCATION, BUILTINS_LOCATION> {};
> +
> +   Likewise for this:
> +
> +       typedef int_hash<location_t, UNKNOWN_LOCATION, BUILTINS_LOCATION>
> +         location_hash;
> +
> +   Thus, use a plain ol' '#define':
> +*/
> +#define location_hash int_hash<location_t, UNKNOWN_LOCATION, BUILTINS_LOCATION>
> +
>  extern bool is_location_from_builtin_token (location_t);
>  extern expanded_location expand_location (location_t);
>
> @@ -230,8 +249,6 @@ public:
>    location_t * GTY ((atomic)) m_locs;
>  };
>
> -struct location_hash : int_hash <location_t, UNKNOWN_LOCATION> { };
> -
>  class GTY(()) string_concat_db
>  {
>   public:

* Re: [Committed] [PATCH 2/4] (v4) On-demand locations within string-literals
  2021-09-02 19:09                         ` Thomas Schwinge
@ 2021-09-03 16:33                           ` Thomas Schwinge
  2021-09-10  7:48                             ` [PING] " Thomas Schwinge
  0 siblings, 1 reply; 61+ messages in thread
From: Thomas Schwinge @ 2021-09-03 16:33 UTC (permalink / raw)
  To: David Malcolm, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 3234 bytes --]

Hi!

On 2021-09-02T21:09:54+0200, I wrote:
> On 2021-09-02T15:59:14+0200, I wrote:
>> On 2016-08-05T14:16:58-0400, David Malcolm <dmalcolm@redhat.com> wrote:
>>> Committed to trunk as r239175; I'm attaching the final version of the
>>> patch for reference.
>>
>> David, you've added here 'gcc/input.h:struct location_hash' (see quoted
>> below), which will be useful elsewhere, so:
>>
>>> --- a/gcc/input.c
>>> +++ b/gcc/input.c
>>
>>> +/* Internal function.  Canonicalize LOC into a form suitable for
>>> +   use as a key within the database, stripping away macro expansion,
>>> +   ad-hoc information, and range information, using the location of
>>> +   the start of LOC within an ordinary linemap.  */
>>> +
>>> +location_t
>>> +string_concat_db::get_key_loc (location_t loc)
>>> +{
>>> +  loc = linemap_resolve_location (line_table, loc, LRK_SPELLING_LOCATION,
>>> +                             NULL);
>>> +
>>> +  loc = get_range_from_loc (line_table, loc).m_start;
>>> +
>>> +  return loc;
>>> +}
>>
>> OK to push the attached
>> "Harden 'gcc/input.c:string_concat_db::get_key_loc'"?  (This fell out of
>> my analysis for development work elsewhere.)
>
> My suggested patch was:
>
>     --- a/gcc/input.c
>     +++ b/gcc/input.c
>     @@ -1483,6 +1483,9 @@ string_concat_db::get_key_loc (location_t loc)
>
>        loc = get_range_from_loc (line_table, loc).m_start;
>
>     +  /* Ascertain that 'loc' is valid as a key in 'm_table'.  */
>     +  gcc_checking_assert (!RESERVED_LOCATION_P (loc));
>     +
>        return loc;
>      }
>
> Uh, I should've looked at the correct test logs...  This change actually
> does regress 'c-c++-common/substring-location-PR-87721.c' and
> 'gcc.dg/plugin/diagnostic-test-string-literals-1.c': for these, we do see
> 'BUILTINS_LOCATION' (via 'string_concat_db::record_string_concatenation').
> Unless someone tells me that's unexpected (I'm completely lost in this
> code...)

I think I convinced myself that the current code doesn't have stable
behavior, so...

> I shall change/generalize my changes to provide both a
> 'location_hash' only using 'UNKNOWN_LOCATION' as a spare value for
> 'Empty' (as currently used here) and another variant additionally using
> 'BUILTINS_LOCATION' as spare value for 'Deleted'.

... I didn't do this, but instead would like to push the attached
"Don't record string concatenation data for 'RESERVED_LOCATION_P'"
(replacing "Harden 'gcc/input.c:string_concat_db::get_key_loc'" as
originally proposed).  OK?
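
As a rough standalone illustration of the guard this patch adds (not
the actual string_concat_db code): toy_concat_db, reserved_location_p
and the std::unordered_map backing store are stand-ins invented for
this sketch, and RESERVED_LOCATION_COUNT is a mock constant.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Mock definitions for illustration only.
typedef unsigned int location_t;
const location_t RESERVED_LOCATION_COUNT = 2;

// Mock of 'RESERVED_LOCATION_P': true for the two reserved values,
// 'UNKNOWN_LOCATION' (0) and 'BUILTINS_LOCATION' (1).
inline bool
reserved_location_p (location_t loc)
{
  return loc < RESERVED_LOCATION_COUNT;
}

// Toy database keyed by the canonicalized start location.
class toy_concat_db
{
public:
  // Record DATA under KEY_LOC.  Reserved keys are skipped: unrelated
  // records would all share such a key and overwrite one another.
  void record (location_t key_loc, const std::string &data)
  {
    if (reserved_location_p (key_loc))
      return;
    m_table[key_loc] = data;
  }

  // Look up KEY_LOC; reserved keys never have data.
  bool get (location_t key_loc, std::string *out) const
  {
    assert (out);
    if (reserved_location_p (key_loc))
      return false;
    auto it = m_table.find (key_loc);
    if (it == m_table.end ())
      return false;
    *out = it->second;
    return true;
  }

private:
  std::unordered_map<location_t, std::string> m_table;
};
```

Without the early returns, two unrelated records whose key resolves
to the same reserved value would land in one slot, and a lookup would
return whichever happened to be recorded last.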


... and then re:

>>> --- a/gcc/input.h
>>> +++ b/gcc/input.h
>>
>>> +struct location_hash : int_hash <location_t, UNKNOWN_LOCATION> { };
>>> +
>>> +class GTY(()) string_concat_db
>>> +{
>>> +[...]
>>> +  hash_map <location_hash, string_concat *> *m_table;
>>> +};
>>
>> OK to push the attached
>> "Generalize 'gcc/input.h:struct location_hash'"?

Attached again.


Grüße
 Thomas



[-- Attachment #2: 0001-Don-t-record-string-concatenation-data-for-RESERVED_.patch --]
[-- Type: text/x-diff, Size: 3253 bytes --]

From 9f1066fcb770397d6e791aa0594f067a755e2ed6 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Fri, 3 Sep 2021 18:25:10 +0200
Subject: [PATCH] Don't record string concatenation data for
 'RESERVED_LOCATION_P'

'RESERVED_LOCATION_P' means 'UNKNOWN_LOCATION' or 'BUILTINS_LOCATION'.
We're using 'UNKNOWN_LOCATION' as a spare value for 'Empty', so should
ascertain that we don't use it as a key additionally.  Similarly for
'BUILTINS_LOCATION' that we'd later like to use as a spare value for
'Deleted'.

As discussed in the source code comment added, for these we didn't have
stable behavior anyway.

Follow-up to r239175 (commit 88fa5555a309e5d6c6171b957daaf2f800920869)
"On-demand locations within string-literals".

	gcc/
	* input.c (string_concat_db::record_string_concatenation)
	(string_concat_db::get_string_concatenation): Skip for
	'RESERVED_LOCATION_P'.
	gcc/testsuite/
	* gcc.dg/plugin/diagnostic-test-string-literals-1.c: Adjust
	expected error diagnostics.
---
 gcc/input.c                                              | 9 +++++++++
 .../gcc.dg/plugin/diagnostic-test-string-literals-1.c    | 4 ++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/gcc/input.c b/gcc/input.c
index 4b809862e02..dd753decfa0 100644
--- a/gcc/input.c
+++ b/gcc/input.c
@@ -1437,6 +1437,11 @@ string_concat_db::record_string_concatenation (int num, location_t *locs)
   gcc_assert (locs);
 
   location_t key_loc = get_key_loc (locs[0]);
+  /* We don't record data for 'RESERVED_LOCATION_P (key_loc)' key values:
+     any data now recorded under key 'key_loc' would be overwritten by a
+     subsequent call with the same key 'key_loc'.  */
+  if (RESERVED_LOCATION_P (key_loc))
+    return;
 
   string_concat *concat
     = new (ggc_alloc <string_concat> ()) string_concat (num, locs);
@@ -1460,6 +1465,10 @@ string_concat_db::get_string_concatenation (location_t loc,
   gcc_assert (out_locs);
 
   location_t key_loc = get_key_loc (loc);
+  /* We don't record data for 'RESERVED_LOCATION_P (key_loc)' key values; see
+     discussion in 'string_concat_db::record_string_concatenation'.  */
+  if (RESERVED_LOCATION_P (key_loc))
+    return false;
 
   string_concat **concat = m_table->get (key_loc);
   if (!concat)
diff --git a/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c b/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c
index 4cba87be2ae..8818192eb45 100644
--- a/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c
+++ b/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c
@@ -332,8 +332,8 @@ pr87652 (const char *stem, int counter)
 				OFFSET + end_idx);		\
   } while (0)
 
-/* { dg-error "unable to read substring location: unable to read source line" "" { target c } 329 } */
-/* { dg-error "unable to read substring location: failed to get ordinary maps" "" { target c++ } 329 } */
+/* { dg-error "unable to read substring location: failed to get ordinary maps" "" { target c } 329 } */
+/* { dg-error "unable to read substring location: macro expansion" "" { target c++ } 329 } */
 /* { dg-begin-multiline-output "" }
      __emit_string_literal_range(__FILE__":%5d: " format,        \
                                  ^~~~~~~~
-- 
2.25.1


[-- Attachment #3: 0002-Generalize-gcc-input.h-struct-location_hash.patch --]
[-- Type: text/x-diff, Size: 2065 bytes --]

From 349a3172f64db93ee98ea39b36489b702b6596ab Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Tue, 31 Aug 2021 23:30:25 +0200
Subject: [PATCH 2/2] Generalize 'gcc/input.h:struct location_hash'

This is currently only used here ('gcc/input.h:class string_concat_db'), but is
actually generally useful, so advertize it as such.

Per the rationale given, we may use 'BUILTINS_LOCATION' as spare value for
'Deleted', in addition to the existing use of 'UNKNOWN_LOCATION' as spare value
for 'Empty'.

	gcc/
	* input.h (location_hash): Use 'BUILTINS_LOCATION' as spare value
	for 'Deleted'.  Turn into a '#define'.
---
 gcc/input.h | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/gcc/input.h b/gcc/input.h
index e6881072c5f..46971a2684c 100644
--- a/gcc/input.h
+++ b/gcc/input.h
@@ -36,6 +36,25 @@ extern GTY(()) class line_maps *saved_line_table;
    both UNKNOWN_LOCATION and BUILTINS_LOCATION fit into that.  */
 STATIC_ASSERT (BUILTINS_LOCATION < RESERVED_LOCATION_COUNT);
 
+/* Hasher for 'location_t' values satisfying '!RESERVED_LOCATION_P', thus able
+   to use 'UNKNOWN_LOCATION'/'BUILTINS_LOCATION' as spare values for
+   'Empty'/'Deleted'.  */
+/* If the following is used more than once, 'gengtype' generates duplicate
+   functions (thus: "error: redefinition of 'void gt_ggc_mx(location_hash&)'"
+   etc.):
+
+       struct location_hash
+         : int_hash<location_t, UNKNOWN_LOCATION, BUILTINS_LOCATION> {};
+
+   Likewise for this:
+
+       typedef int_hash<location_t, UNKNOWN_LOCATION, BUILTINS_LOCATION>
+         location_hash;
+
+   Thus, use a plain ol' '#define':
+*/
+#define location_hash int_hash<location_t, UNKNOWN_LOCATION, BUILTINS_LOCATION>
+
 extern bool is_location_from_builtin_token (location_t);
 extern expanded_location expand_location (location_t);
 
@@ -230,8 +249,6 @@ public:
   location_t * GTY ((atomic)) m_locs;
 };
 
-struct location_hash : int_hash <location_t, UNKNOWN_LOCATION> { };
-
 class GTY(()) string_concat_db
 {
  public:
-- 
2.33.0


* [PING] Re: [Committed] [PATCH 2/4] (v4) On-demand locations within string-literals
  2021-09-03 16:33                           ` Thomas Schwinge
@ 2021-09-10  7:48                             ` Thomas Schwinge
  2021-09-17 11:16                               ` [PING^2] " Thomas Schwinge
  2021-09-19  5:52                               ` [PING] Re: [Committed] [PATCH 2/4] (v4) On-demand locations within string-literals Jeff Law
  0 siblings, 2 replies; 61+ messages in thread
From: Thomas Schwinge @ 2021-09-10  7:48 UTC (permalink / raw)
  To: David Malcolm, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 3463 bytes --]

Hi!

Ping.  My patches again attached, for easy reference.


Grüße
 Thomas


On 2021-09-03T18:33:37+0200, I wrote:
> Hi!
>
> On 2021-09-02T21:09:54+0200, I wrote:
>> On 2021-09-02T15:59:14+0200, I wrote:
>>> On 2016-08-05T14:16:58-0400, David Malcolm <dmalcolm@redhat.com> wrote:
>>>> Committed to trunk as r239175; I'm attaching the final version of the
>>>> patch for reference.
>>>
>>> David, you've added here 'gcc/input.h:struct location_hash' (see quoted
>>> below), which will be useful elsewhere, so:
>>>
>>>> --- a/gcc/input.c
>>>> +++ b/gcc/input.c
>>>
>>>> +/* Internal function.  Canonicalize LOC into a form suitable for
>>>> +   use as a key within the database, stripping away macro expansion,
>>>> +   ad-hoc information, and range information, using the location of
>>>> +   the start of LOC within an ordinary linemap.  */
>>>> +
>>>> +location_t
>>>> +string_concat_db::get_key_loc (location_t loc)
>>>> +{
>>>> +  loc = linemap_resolve_location (line_table, loc, LRK_SPELLING_LOCATION,
>>>> +                             NULL);
>>>> +
>>>> +  loc = get_range_from_loc (line_table, loc).m_start;
>>>> +
>>>> +  return loc;
>>>> +}
>>>
>>> OK to push the attached
>>> "Harden 'gcc/input.c:string_concat_db::get_key_loc'"?  (This fell out of
>>> my analysis for development work elsewhere.)
>>
>> My suggested patch was:
>>
>>     --- a/gcc/input.c
>>     +++ b/gcc/input.c
>>     @@ -1483,6 +1483,9 @@ string_concat_db::get_key_loc (location_t loc)
>>
>>        loc = get_range_from_loc (line_table, loc).m_start;
>>
>>     +  /* Ascertain that 'loc' is valid as a key in 'm_table'.  */
>>     +  gcc_checking_assert (!RESERVED_LOCATION_P (loc));
>>     +
>>        return loc;
>>      }
>>
>> Uh, I should've looked at the correct test logs...  This change actually
>> does regress 'c-c++-common/substring-location-PR-87721.c' and
>> 'gcc.dg/plugin/diagnostic-test-string-literals-1.c': for these, we do see
>> 'BUILTINS_LOCATION' (via 'string_concat_db::record_string_concatenation').
>> Unless someone tells me that's unexpected (I'm completely lost in this
>> code...)
>
> I think I convinced myself that the current code doesn't have stable
> behavior, so...
>
>> I shall change/generalize my changes to provide both a
>> 'location_hash' only using 'UNKNOWN_LOCATION' as a spare value for
>> 'Empty' (as currently used here) and another variant additionally using
>> 'BUILTINS_LOCATION' as spare value for 'Deleted'.
>
> ... I didn't do this, but instead would like to push the attached
> "Don't record string concatenation data for 'RESERVED_LOCATION_P'"
> (replacing "Harden 'gcc/input.c:string_concat_db::get_key_loc'" as
> originally proposed).  OK?
>
>
> ... and then re:
>
>>>> --- a/gcc/input.h
>>>> +++ b/gcc/input.h
>>>
>>>> +struct location_hash : int_hash <location_t, UNKNOWN_LOCATION> { };
>>>> +
>>>> +class GTY(()) string_concat_db
>>>> +{
>>>> +[...]
>>>> +  hash_map <location_hash, string_concat *> *m_table;
>>>> +};
>>>
>>> OK to push the attached
>>> "Generalize 'gcc/input.h:struct location_hash'"?
>
> Attached again.
>
>
> Grüße
>  Thomas



[-- Attachment #2: 0001-Don-t-record-string-concatenation-data-for-RESERVED_.patch --]
[-- Type: text/x-diff, Size: 3253 bytes --]

From 9f1066fcb770397d6e791aa0594f067a755e2ed6 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Fri, 3 Sep 2021 18:25:10 +0200
Subject: [PATCH] Don't record string concatenation data for
 'RESERVED_LOCATION_P'

'RESERVED_LOCATION_P' means 'UNKNOWN_LOCATION' or 'BUILTINS_LOCATION'.
We're using 'UNKNOWN_LOCATION' as a spare value for 'Empty', so should
ascertain that we don't use it as a key additionally.  Similarly for
'BUILTINS_LOCATION' that we'd later like to use as a spare value for
'Deleted'.

As discussed in the source code comment added, for these we didn't have
stable behavior anyway.

Follow-up to r239175 (commit 88fa5555a309e5d6c6171b957daaf2f800920869)
"On-demand locations within string-literals".

	gcc/
	* input.c (string_concat_db::record_string_concatenation)
	(string_concat_db::get_string_concatenation): Skip for
	'RESERVED_LOCATION_P'.
	gcc/testsuite/
	* gcc.dg/plugin/diagnostic-test-string-literals-1.c: Adjust
	expected error diagnostics.
---
 gcc/input.c                                              | 9 +++++++++
 .../gcc.dg/plugin/diagnostic-test-string-literals-1.c    | 4 ++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/gcc/input.c b/gcc/input.c
index 4b809862e02..dd753decfa0 100644
--- a/gcc/input.c
+++ b/gcc/input.c
@@ -1437,6 +1437,11 @@ string_concat_db::record_string_concatenation (int num, location_t *locs)
   gcc_assert (locs);
 
   location_t key_loc = get_key_loc (locs[0]);
+  /* We don't record data for 'RESERVED_LOCATION_P (key_loc)' key values:
+     any data now recorded under key 'key_loc' would be overwritten by a
+     subsequent call with the same key 'key_loc'.  */
+  if (RESERVED_LOCATION_P (key_loc))
+    return;
 
   string_concat *concat
     = new (ggc_alloc <string_concat> ()) string_concat (num, locs);
@@ -1460,6 +1465,10 @@ string_concat_db::get_string_concatenation (location_t loc,
   gcc_assert (out_locs);
 
   location_t key_loc = get_key_loc (loc);
+  /* We don't record data for 'RESERVED_LOCATION_P (key_loc)' key values; see
+     discussion in 'string_concat_db::record_string_concatenation'.  */
+  if (RESERVED_LOCATION_P (key_loc))
+    return false;
 
   string_concat **concat = m_table->get (key_loc);
   if (!concat)
diff --git a/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c b/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c
index 4cba87be2ae..8818192eb45 100644
--- a/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c
+++ b/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c
@@ -332,8 +332,8 @@ pr87652 (const char *stem, int counter)
 				OFFSET + end_idx);		\
   } while (0)
 
-/* { dg-error "unable to read substring location: unable to read source line" "" { target c } 329 } */
-/* { dg-error "unable to read substring location: failed to get ordinary maps" "" { target c++ } 329 } */
+/* { dg-error "unable to read substring location: failed to get ordinary maps" "" { target c } 329 } */
+/* { dg-error "unable to read substring location: macro expansion" "" { target c++ } 329 } */
 /* { dg-begin-multiline-output "" }
      __emit_string_literal_range(__FILE__":%5d: " format,        \
                                  ^~~~~~~~
-- 
2.25.1


[-- Attachment #3: 0002-Generalize-gcc-input.h-struct-location_hash.patch --]
[-- Type: text/x-diff, Size: 2065 bytes --]

From 349a3172f64db93ee98ea39b36489b702b6596ab Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Tue, 31 Aug 2021 23:30:25 +0200
Subject: [PATCH 2/2] Generalize 'gcc/input.h:struct location_hash'

This is currently only used here ('gcc/input.h:class string_concat_db'), but is
actually generally useful, so advertize it as such.

Per the rationale given, we may use 'BUILTINS_LOCATION' as spare value for
'Deleted', in addition to the existing use of 'UNKNOWN_LOCATION' as spare value
for 'Empty'.

	gcc/
	* input.h (location_hash): Use 'BUILTINS_LOCATION' as spare value
	for 'Deleted'.  Turn into a '#define'.
---
 gcc/input.h | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/gcc/input.h b/gcc/input.h
index e6881072c5f..46971a2684c 100644
--- a/gcc/input.h
+++ b/gcc/input.h
@@ -36,6 +36,25 @@ extern GTY(()) class line_maps *saved_line_table;
    both UNKNOWN_LOCATION and BUILTINS_LOCATION fit into that.  */
 STATIC_ASSERT (BUILTINS_LOCATION < RESERVED_LOCATION_COUNT);
 
+/* Hasher for 'location_t' values satisfying '!RESERVED_LOCATION_P', thus able
+   to use 'UNKNOWN_LOCATION'/'BUILTINS_LOCATION' as spare values for
+   'Empty'/'Deleted'.  */
+/* If the following is used more than once, 'gengtype' generates duplicate
+   functions (thus: "error: redefinition of 'void gt_ggc_mx(location_hash&)'"
+   etc.):
+
+       struct location_hash
+         : int_hash<location_t, UNKNOWN_LOCATION, BUILTINS_LOCATION> {};
+
+   Likewise for this:
+
+       typedef int_hash<location_t, UNKNOWN_LOCATION, BUILTINS_LOCATION>
+         location_hash;
+
+   Thus, use a plain ol' '#define':
+*/
+#define location_hash int_hash<location_t, UNKNOWN_LOCATION, BUILTINS_LOCATION>
+
 extern bool is_location_from_builtin_token (location_t);
 extern expanded_location expand_location (location_t);
 
@@ -230,8 +249,6 @@ public:
   location_t * GTY ((atomic)) m_locs;
 };
 
-struct location_hash : int_hash <location_t, UNKNOWN_LOCATION> { };
-
 class GTY(()) string_concat_db
 {
  public:
-- 
2.33.0



* [PING^2] Re: [Committed] [PATCH 2/4] (v4) On-demand locations within string-literals
  2021-09-10  7:48                             ` [PING] " Thomas Schwinge
@ 2021-09-17 11:16                               ` Thomas Schwinge
  2021-09-30  6:47                                 ` [PING^3] Generalize 'gcc/input.h:struct location_hash' (was: [Committed] [PATCH 2/4] (v4) On-demand locations within string-literals) Thomas Schwinge
  2021-09-19  5:52                               ` [PING] Re: [Committed] [PATCH 2/4] (v4) On-demand locations within string-literals Jeff Law
  1 sibling, 1 reply; 61+ messages in thread
From: Thomas Schwinge @ 2021-09-17 11:16 UTC (permalink / raw)
  To: David Malcolm, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 3616 bytes --]

Hi!

On 2021-09-10T09:48:56+0200, I wrote:
> Ping.  My patches again attached, for easy reference.

Ping once again.


Grüße
 Thomas


> On 2021-09-03T18:33:37+0200, I wrote:
>> Hi!
>>
>> On 2021-09-02T21:09:54+0200, I wrote:
>>> On 2021-09-02T15:59:14+0200, I wrote:
>>>> On 2016-08-05T14:16:58-0400, David Malcolm <dmalcolm@redhat.com> wrote:
>>>>> Committed to trunk as r239175; I'm attaching the final version of the
>>>>> patch for reference.
>>>>
>>>> David, you've added here 'gcc/input.h:struct location_hash' (see quoted
>>>> below), which will be useful elsewhere, so:
>>>>
>>>>> --- a/gcc/input.c
>>>>> +++ b/gcc/input.c
>>>>
>>>>> +/* Internal function.  Canonicalize LOC into a form suitable for
>>>>> +   use as a key within the database, stripping away macro expansion,
>>>>> +   ad-hoc information, and range information, using the location of
>>>>> +   the start of LOC within an ordinary linemap.  */
>>>>> +
>>>>> +location_t
>>>>> +string_concat_db::get_key_loc (location_t loc)
>>>>> +{
>>>>> +  loc = linemap_resolve_location (line_table, loc, LRK_SPELLING_LOCATION,
>>>>> +                             NULL);
>>>>> +
>>>>> +  loc = get_range_from_loc (line_table, loc).m_start;
>>>>> +
>>>>> +  return loc;
>>>>> +}
>>>>
>>>> OK to push the attached
>>>> "Harden 'gcc/input.c:string_concat_db::get_key_loc'"?  (This fell out of
>>>> my analysis for development work elsewhere.)
>>>
>>> My suggested patch was:
>>>
>>>     --- a/gcc/input.c
>>>     +++ b/gcc/input.c
>>>     @@ -1483,6 +1483,9 @@ string_concat_db::get_key_loc (location_t loc)
>>>
>>>        loc = get_range_from_loc (line_table, loc).m_start;
>>>
>>>     +  /* Ascertain that 'loc' is valid as a key in 'm_table'.  */
>>>     +  gcc_checking_assert (!RESERVED_LOCATION_P (loc));
>>>     +
>>>        return loc;
>>>      }
>>>
>>> Uh, I should've looked at the correct test logs...  This change actually
>>> does regress 'c-c++-common/substring-location-PR-87721.c' and
>>> 'gcc.dg/plugin/diagnostic-test-string-literals-1.c': for these, we do see
>>> 'BUILTINS_LOCATION' (via 'string_concat_db::record_string_concatenation').
>>> Unless someone tells me that's unexpected (I'm completely lost in this
>>> code...)
>>
>> I think I convinced myself that the current code doesn't have stable
>> behavior, so...
>>
>>> I shall change/generalize my changes to provide both a
>>> 'location_hash' only using 'UNKNOWN_LOCATION' as a spare value for
>>> 'Empty' (as currently used here) and another variant additionally using
>>> 'BUILTINS_LOCATION' as spare value for 'Deleted'.
>>
>> ... I didn't do this, but instead would like to push the attached
>> "Don't record string concatenation data for 'RESERVED_LOCATION_P'"
>> (replacing "Harden 'gcc/input.c:string_concat_db::get_key_loc'" as
>> originally proposed).  OK?
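
The guard proposed in that patch can be illustrated with a minimal sketch.
This is not the code in 'gcc/input.c': 'loc_t', 'kUnknownLoc',
'kBuiltinsLoc', 'is_reserved' and 'concat_db' are hypothetical stand-ins,
'std::unordered_map' replaces GCC's 'hash_map', and the key
canonicalization done by 'get_key_loc' is left out.  It only shows the idea
of skipping reserved keys on both the record and the lookup path.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for 'location_t' and its two reserved values.
using loc_t = std::uint32_t;
constexpr loc_t kUnknownLoc = 0;
constexpr loc_t kBuiltinsLoc = 1;

// Rough analogue of 'RESERVED_LOCATION_P' (an assumption for this sketch).
inline bool is_reserved (loc_t loc) { return loc <= kBuiltinsLoc; }

class concat_db
{
public:
  // Record LOCS under the key LOCS[0], unless that key is reserved.
  void record (const std::vector<loc_t> &locs)
  {
    assert (!locs.empty ());
    loc_t key = locs[0];
    // Reserved keys are shared by unrelated literals, so anything stored
    // under them would be overwritten by the next call; skip them.
    if (is_reserved (key))
      return;
    m_table[key] = locs;
  }

  // Look up the locations recorded for LOC; return false if there are none.
  bool lookup (loc_t loc, std::vector<loc_t> *out) const
  {
    assert (out);
    // Mirror the guard in 'record': nothing is ever stored for these keys.
    if (is_reserved (loc))
      return false;
    auto it = m_table.find (loc);
    if (it == m_table.end ())
      return false;
    *out = it->second;
    return true;
  }

private:
  std::unordered_map<loc_t, std::vector<loc_t>> m_table;
};
```

In this sketch, recording a concatenation whose first location is
'kBuiltinsLoc' is a no-op, and a later lookup for that key reports failure
instead of returning whatever was stored last.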
>>
>>
>> ... and then re:
>>
>>>>> --- a/gcc/input.h
>>>>> +++ b/gcc/input.h
>>>>
>>>>> +struct location_hash : int_hash <location_t, UNKNOWN_LOCATION> { };
>>>>> +
>>>>> +class GTY(()) string_concat_db
>>>>> +{
>>>>> +[...]
>>>>> +  hash_map <location_hash, string_concat *> *m_table;
>>>>> +};
>>>>
>>>> OK to push the attached
>>>> "Generalize 'gcc/input.h:struct location_hash'"?
>>
>> Attached again.
>>
>>
>> Grüße
>>  Thomas


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Attachment #2: 0001-Don-t-record-string-concatenation-data-for-RESERVED_.patch --]
[-- Type: text/x-diff, Size: 3253 bytes --]

From 9f1066fcb770397d6e791aa0594f067a755e2ed6 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Fri, 3 Sep 2021 18:25:10 +0200
Subject: [PATCH] Don't record string concatenation data for
 'RESERVED_LOCATION_P'

'RESERVED_LOCATION_P' means 'UNKNOWN_LOCATION' or 'BUILTINS_LOCATION'.
We're using 'UNKNOWN_LOCATION' as a spare value for 'Empty', so should
ascertain that we don't use it as a key additionally.  Similarly for
'BUILTINS_LOCATION' that we'd later like to use as a spare value for
'Deleted'.

As discussed in the source code comment added, for these we didn't have
stable behavior anyway.

Follow-up to r239175 (commit 88fa5555a309e5d6c6171b957daaf2f800920869)
"On-demand locations within string-literals".

	gcc/
	* input.c (string_concat_db::record_string_concatenation)
	(string_concat_db::get_string_concatenation): Skip for
	'RESERVED_LOCATION_P'.
	gcc/testsuite/
	* gcc.dg/plugin/diagnostic-test-string-literals-1.c: Adjust
	expected error diagnostics.
---
 gcc/input.c                                              | 9 +++++++++
 .../gcc.dg/plugin/diagnostic-test-string-literals-1.c    | 4 ++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/gcc/input.c b/gcc/input.c
index 4b809862e02..dd753decfa0 100644
--- a/gcc/input.c
+++ b/gcc/input.c
@@ -1437,6 +1437,11 @@ string_concat_db::record_string_concatenation (int num, location_t *locs)
   gcc_assert (locs);
 
   location_t key_loc = get_key_loc (locs[0]);
+  /* We don't record data for 'RESERVED_LOCATION_P (key_loc)' key values:
+     any data now recorded under key 'key_loc' would be overwritten by a
+     subsequent call with the same key 'key_loc'.  */
+  if (RESERVED_LOCATION_P (key_loc))
+    return;
 
   string_concat *concat
     = new (ggc_alloc <string_concat> ()) string_concat (num, locs);
@@ -1460,6 +1465,10 @@ string_concat_db::get_string_concatenation (location_t loc,
   gcc_assert (out_locs);
 
   location_t key_loc = get_key_loc (loc);
+  /* We don't record data for 'RESERVED_LOCATION_P (key_loc)' key values; see
+     discussion in 'string_concat_db::record_string_concatenation'.  */
+  if (RESERVED_LOCATION_P (key_loc))
+    return false;
 
   string_concat **concat = m_table->get (key_loc);
   if (!concat)
diff --git a/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c b/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c
index 4cba87be2ae..8818192eb45 100644
--- a/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c
+++ b/gcc/testsuite/gcc.dg/plugin/diagnostic-test-string-literals-1.c
@@ -332,8 +332,8 @@ pr87652 (const char *stem, int counter)
 				OFFSET + end_idx);		\
   } while (0)
 
-/* { dg-error "unable to read substring location: unable to read source line" "" { target c } 329 } */
-/* { dg-error "unable to read substring location: failed to get ordinary maps" "" { target c++ } 329 } */
+/* { dg-error "unable to read substring location: failed to get ordinary maps" "" { target c } 329 } */
+/* { dg-error "unable to read substring location: macro expansion" "" { target c++ } 329 } */
 /* { dg-begin-multiline-output "" }
      __emit_string_literal_range(__FILE__":%5d: " format,        \
                                  ^~~~~~~~
-- 
2.25.1


[-- Attachment #3: 0002-Generalize-gcc-input.h-struct-location_hash.patch --]
[-- Type: text/x-diff, Size: 2065 bytes --]

From 349a3172f64db93ee98ea39b36489b702b6596ab Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Tue, 31 Aug 2021 23:30:25 +0200
Subject: [PATCH 2/2] Generalize 'gcc/input.h:struct location_hash'

This is currently only used here ('gcc/input.h:class string_concat_db'), but is
actually generally useful, so advertize it as such.

Per the rationale given, we may use 'BUILTINS_LOCATION' as spare value for
'Deleted', in addition to the existing use of 'UNKNOWN_LOCATION' as spare value
for 'Empty'.

	gcc/
	* input.h (location_hash): Use 'BUILTINS_LOCATION' as spare value
	for 'Deleted'.  Turn into a '#define'.
---
 gcc/input.h | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/gcc/input.h b/gcc/input.h
index e6881072c5f..46971a2684c 100644
--- a/gcc/input.h
+++ b/gcc/input.h
@@ -36,6 +36,25 @@ extern GTY(()) class line_maps *saved_line_table;
    both UNKNOWN_LOCATION and BUILTINS_LOCATION fit into that.  */
 STATIC_ASSERT (BUILTINS_LOCATION < RESERVED_LOCATION_COUNT);
 
+/* Hasher for 'location_t' values satisfying '!RESERVED_LOCATION_P', thus able
+   to use 'UNKNOWN_LOCATION'/'BUILTINS_LOCATION' as spare values for
+   'Empty'/'Deleted'.  */
+/* If the following is used more than once, 'gengtype' generates duplicate
+   functions (thus: "error: redefinition of 'void gt_ggc_mx(location_hash&)'"
+   etc.):
+
+       struct location_hash
+         : int_hash<location_t, UNKNOWN_LOCATION, BUILTINS_LOCATION> {};
+
+   Likewise for this:
+
+       typedef int_hash<location_t, UNKNOWN_LOCATION, BUILTINS_LOCATION>
+         location_hash;
+
+   Thus, use a plain ol' '#define':
+*/
+#define location_hash int_hash<location_t, UNKNOWN_LOCATION, BUILTINS_LOCATION>
+
 extern bool is_location_from_builtin_token (location_t);
 extern expanded_location expand_location (location_t);
 
@@ -230,8 +249,6 @@ public:
   location_t * GTY ((atomic)) m_locs;
 };
 
-struct location_hash : int_hash <location_t, UNKNOWN_LOCATION> { };
-
 class GTY(()) string_concat_db
 {
  public:
-- 
2.33.0



* Re: [PING] Re: [Committed] [PATCH 2/4] (v4) On-demand locations within string-literals
  2021-09-10  7:48                             ` [PING] " Thomas Schwinge
  2021-09-17 11:16                               ` [PING^2] " Thomas Schwinge
@ 2021-09-19  5:52                               ` Jeff Law
  1 sibling, 0 replies; 61+ messages in thread
From: Jeff Law @ 2021-09-19  5:52 UTC (permalink / raw)
  To: Thomas Schwinge, David Malcolm, gcc-patches



On 9/10/2021 1:48 AM, Thomas Schwinge wrote:
> Hi!
>
> Ping.  My patches again attached, for easy reference.
>
>
> Grüße
>   Thomas
>
>
> On 2021-09-03T18:33:37+0200, I wrote:
>> Hi!
>>
>> On 2021-09-02T21:09:54+0200, I wrote:
>>> On 2021-09-02T15:59:14+0200, I wrote:
>>>> On 2016-08-05T14:16:58-0400, David Malcolm <dmalcolm@redhat.com> wrote:
>>>>> Committed to trunk as r239175; I'm attaching the final version of the
>>>>> patch for reference.
>>>> David, you've added here 'gcc/input.h:struct location_hash' (see quoted
>>>> below), which will be useful elsewhere, so:
>>>>
>>>>> --- a/gcc/input.c
>>>>> +++ b/gcc/input.c
>>>>> +/* Internal function.  Canonicalize LOC into a form suitable for
>>>>> +   use as a key within the database, stripping away macro expansion,
>>>>> +   ad-hoc information, and range information, using the location of
>>>>> +   the start of LOC within an ordinary linemap.  */
>>>>> +
>>>>> +location_t
>>>>> +string_concat_db::get_key_loc (location_t loc)
>>>>> +{
>>>>> +  loc = linemap_resolve_location (line_table, loc, LRK_SPELLING_LOCATION,
>>>>> +                             NULL);
>>>>> +
>>>>> +  loc = get_range_from_loc (line_table, loc).m_start;
>>>>> +
>>>>> +  return loc;
>>>>> +}
>>>> OK to push the attached
>>>> "Harden 'gcc/input.c:string_concat_db::get_key_loc'"?  (This fell out of
>>>> my analysis for development work elsewhere.)
>>> My suggested patch was:
>>>
>>>      --- a/gcc/input.c
>>>      +++ b/gcc/input.c
>>>      @@ -1483,6 +1483,9 @@ string_concat_db::get_key_loc (location_t loc)
>>>
>>>         loc = get_range_from_loc (line_table, loc).m_start;
>>>
>>>      +  /* Ascertain that 'loc' is valid as a key in 'm_table'.  */
>>>      +  gcc_checking_assert (!RESERVED_LOCATION_P (loc));
>>>      +
>>>         return loc;
>>>       }
>>>
>>> Uh, I should've looked at the correct test logs...  This change actually
>>> does regress 'c-c++-common/substring-location-PR-87721.c' and
>>> 'gcc.dg/plugin/diagnostic-test-string-literals-1.c': for these, we do see
>>> 'BUILTINS_LOCATION' (via 'string_concat_db::record_string_concatenation').
>>> Unless someone tells me that's unexpected (I'm completely lost in this
>>> code...)
>> I think I convinced myself that the current code doesn't have stable
>> behavior, so...
>>
>>> I shall change/generalize my changes to provide both a
>>> 'location_hash' only using 'UNKNOWN_LOCATION' as a spare value for
>>> 'Empty' (as currently used here) and another variant additionally using
>>> 'BUILTINS_LOCATION' as spare value for 'Deleted'.
>> ... I didn't do this, but instead would like to push the attached
>> "Don't record string concatenation data for 'RESERVED_LOCATION_P'"
>> (replacing "Harden 'gcc/input.c:string_concat_db::get_key_loc'" as
>> originally proposed).  OK?
>>
>>
>> ... and then re:
>>
>>>>> --- a/gcc/input.h
>>>>> +++ b/gcc/input.h
>>>>> +struct location_hash : int_hash <location_t, UNKNOWN_LOCATION> { };
>>>>> +
>>>>> +class GTY(()) string_concat_db
>>>>> +{
>>>>> +[...]
>>>>> +  hash_map <location_hash, string_concat *> *m_table;
>>>>> +};
>>>> OK to push the attached
>>>> "Generalize 'gcc/input.h:struct location_hash'"?
>> Attached again.
>>
>>
>> Grüße
>>   Thomas
>
> 0001-Don-t-record-string-concatenation-data-for-RESERVED_.patch
>
>  From 9f1066fcb770397d6e791aa0594f067a755e2ed6 Mon Sep 17 00:00:00 2001
> From: Thomas Schwinge <thomas@codesourcery.com>
> Date: Fri, 3 Sep 2021 18:25:10 +0200
> Subject: [PATCH] Don't record string concatenation data for
>   'RESERVED_LOCATION_P'
>
> 'RESERVED_LOCATION_P' means 'UNKNOWN_LOCATION' or 'BUILTINS_LOCATION'.
> We're using 'UNKNOWN_LOCATION' as a spare value for 'Empty', so should
> ascertain that we don't use it as a key additionally.  Similarly for
> 'BUILTINS_LOCATION' that we'd later like to use as a spare value for
> 'Deleted'.
>
> As discussed in the source code comment added, for these we didn't have
> stable behavior anyway.
>
> Follow-up to r239175 (commit 88fa5555a309e5d6c6171b957daaf2f800920869)
> "On-demand locations within string-literals".
>
> 	gcc/
> 	* input.c (string_concat_db::record_string_concatenation)
> 	(string_concat_db::get_string_concatenation): Skip for
> 	'RESERVED_LOCATION_P'.
> 	gcc/testsuite/
> 	* gcc.dg/plugin/diagnostic-test-string-literals-1.c: Adjust
> 	expected error diagnostics.
OK
jeff



* [PING^3] Generalize 'gcc/input.h:struct location_hash' (was: [Committed] [PATCH 2/4] (v4) On-demand locations within string-literals)
  2021-09-17 11:16                               ` [PING^2] " Thomas Schwinge
@ 2021-09-30  6:47                                 ` Thomas Schwinge
  2021-10-17 22:33                                   ` Jeff Law
  0 siblings, 1 reply; 61+ messages in thread
From: Thomas Schwinge @ 2021-09-30  6:47 UTC (permalink / raw)
  To: David Malcolm, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1516 bytes --]

Hi!

On 2021-09-17T13:16:14+0200, I wrote:
> On 2021-09-10T09:48:56+0200, I wrote:
>> Ping.  My patches again attached, for easy reference.
>
> Ping once again.

Jeff had ACKed "Don't record string concatenation data for
'RESERVED_LOCATION_P'" (thanks!), but "Generalize 'gcc/input.h:struct
location_hash'" is still awaiting review:

>> On 2021-09-03T18:33:37+0200, I wrote:
>>> On 2021-09-02T21:09:54+0200, I wrote:
>>>> On 2021-09-02T15:59:14+0200, I wrote:
>>>>> On 2016-08-05T14:16:58-0400, David Malcolm <dmalcolm@redhat.com> wrote:
>>>>>> Committed to trunk as r239175; I'm attaching the final version of the
>>>>>> patch for reference.
>>>>>
>>>>> David, you've added here 'gcc/input.h:struct location_hash' (see quoted
>>>>> below), which will be useful elsewhere, so:

>>>>>> --- a/gcc/input.h
>>>>>> +++ b/gcc/input.h
>>>>>
>>>>>> +struct location_hash : int_hash <location_t, UNKNOWN_LOCATION> { };
>>>>>> +
>>>>>> +class GTY(()) string_concat_db
>>>>>> +{
>>>>>> +[...]
>>>>>> +  hash_map <location_hash, string_concat *> *m_table;
>>>>>> +};
>>>>>
>>>>> OK to push the attached
>>>>> "Generalize 'gcc/input.h:struct location_hash'"?

Attached again, for easy reference.


Grüße
 Thomas


[-- Attachment #2: 0002-Generalize-gcc-input.h-struct-location_hash.patch --]
[-- Type: text/x-diff, Size: 2065 bytes --]

From 349a3172f64db93ee98ea39b36489b702b6596ab Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Tue, 31 Aug 2021 23:30:25 +0200
Subject: [PATCH 2/2] Generalize 'gcc/input.h:struct location_hash'

This is currently only used here ('gcc/input.h:class string_concat_db'), but is
actually generally useful, so advertize it as such.

Per the rationale given, we may use 'BUILTINS_LOCATION' as spare value for
'Deleted', in addition to the existing use of 'UNKNOWN_LOCATION' as spare value
for 'Empty'.

	gcc/
	* input.h (location_hash): Use 'BUILTINS_LOCATION' as spare value
	for 'Deleted'.  Turn into a '#define'.
---
 gcc/input.h | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/gcc/input.h b/gcc/input.h
index e6881072c5f..46971a2684c 100644
--- a/gcc/input.h
+++ b/gcc/input.h
@@ -36,6 +36,25 @@ extern GTY(()) class line_maps *saved_line_table;
    both UNKNOWN_LOCATION and BUILTINS_LOCATION fit into that.  */
 STATIC_ASSERT (BUILTINS_LOCATION < RESERVED_LOCATION_COUNT);
 
+/* Hasher for 'location_t' values satisfying '!RESERVED_LOCATION_P', thus able
+   to use 'UNKNOWN_LOCATION'/'BUILTINS_LOCATION' as spare values for
+   'Empty'/'Deleted'.  */
+/* If the following is used more than once, 'gengtype' generates duplicate
+   functions (thus: "error: redefinition of 'void gt_ggc_mx(location_hash&)'"
+   etc.):
+
+       struct location_hash
+         : int_hash<location_t, UNKNOWN_LOCATION, BUILTINS_LOCATION> {};
+
+   Likewise for this:
+
+       typedef int_hash<location_t, UNKNOWN_LOCATION, BUILTINS_LOCATION>
+         location_hash;
+
+   Thus, use a plain ol' '#define':
+*/
+#define location_hash int_hash<location_t, UNKNOWN_LOCATION, BUILTINS_LOCATION>
+
 extern bool is_location_from_builtin_token (location_t);
 extern expanded_location expand_location (location_t);
 
@@ -230,8 +249,6 @@ public:
   location_t * GTY ((atomic)) m_locs;
 };
 
-struct location_hash : int_hash <location_t, UNKNOWN_LOCATION> { };
-
 class GTY(()) string_concat_db
 {
  public:
-- 
2.33.0



* Re: [PING^3] Generalize 'gcc/input.h:struct location_hash' (was: [Committed] [PATCH 2/4] (v4) On-demand locations within string-literals)
  2021-09-30  6:47                                 ` [PING^3] Generalize 'gcc/input.h:struct location_hash' (was: [Committed] [PATCH 2/4] (v4) On-demand locations within string-literals) Thomas Schwinge
@ 2021-10-17 22:33                                   ` Jeff Law
  2021-11-09 13:48                                     ` Thomas Schwinge
  0 siblings, 1 reply; 61+ messages in thread
From: Jeff Law @ 2021-10-17 22:33 UTC (permalink / raw)
  To: Thomas Schwinge, David Malcolm, gcc-patches



On 9/30/2021 12:47 AM, Thomas Schwinge wrote:
> Hi!
>
> On 2021-09-17T13:16:14+0200, I wrote:
>> On 2021-09-10T09:48:56+0200, I wrote:
>>> Ping.  My patches again attached, for easy reference.
>> Ping once again.
> Jeff had ACKed "Don't record string concatenation data for
> 'RESERVED_LOCATION_P'" (thanks!), but "Generalize 'gcc/input.h:struct
> location_hash'" is still awaiting review:
>
>>> On 2021-09-03T18:33:37+0200, I wrote:
>>>> On 2021-09-02T21:09:54+0200, I wrote:
>>>>> On 2021-09-02T15:59:14+0200, I wrote:
>>>>>> On 2016-08-05T14:16:58-0400, David Malcolm <dmalcolm@redhat.com> wrote:
>>>>>>> Committed to trunk as r239175; I'm attaching the final version of the
>>>>>>> patch for reference.
>>>>>> David, you've added here 'gcc/input.h:struct location_hash' (see quoted
>>>>>> below), which will be useful elsewhere, so:
>>>>>>> --- a/gcc/input.h
>>>>>>> +++ b/gcc/input.h
>>>>>>> +struct location_hash : int_hash <location_t, UNKNOWN_LOCATION> { };
>>>>>>> +
>>>>>>> +class GTY(()) string_concat_db
>>>>>>> +{
>>>>>>> +[...]
>>>>>>> +  hash_map <location_hash, string_concat *> *m_table;
>>>>>>> +};
>>>>>> OK to push the attached
>>>>>> "Generalize 'gcc/input.h:struct location_hash'"?
> Attached again, for easy reference.
>
>
> Grüße
>   Thomas
>
>
> 0002-Generalize-gcc-input.h-struct-location_hash.patch
>
>  From 349a3172f64db93ee98ea39b36489b702b6596ab Mon Sep 17 00:00:00 2001
> From: Thomas Schwinge <thomas@codesourcery.com>
> Date: Tue, 31 Aug 2021 23:30:25 +0200
> Subject: [PATCH 2/2] Generalize 'gcc/input.h:struct location_hash'
>
> This is currently only used here ('gcc/input.h:class string_concat_db'), but is
> actually generally useful, so advertize it as such.
>
> Per the rationale given, we may use 'BUILTINS_LOCATION' as spare value for
> 'Deleted', in addition to the existing use of 'UNKNOWN_LOCATION' as spare value
> for 'Empty'.
>
> 	gcc/
> 	* input.h (location_hash): Use 'BUILTINS_LOCATION' as spare value
> 	for 'Deleted'.  Turn into a '#define'.
OK
jeff



* Re: [PING^3] Generalize 'gcc/input.h:struct location_hash' (was: [Committed] [PATCH 2/4] (v4) On-demand locations within string-literals)
  2021-10-17 22:33                                   ` Jeff Law
@ 2021-11-09 13:48                                     ` Thomas Schwinge
  0 siblings, 0 replies; 61+ messages in thread
From: Thomas Schwinge @ 2021-11-09 13:48 UTC (permalink / raw)
  To: Jeff Law, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1661 bytes --]

Hi!

On 2021-10-17T16:33:03-0600, Jeff Law via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> On 9/30/2021 12:47 AM, Thomas Schwinge wrote:
>> On 2021-09-17T13:16:14+0200, I wrote:
>>> On 2021-09-10T09:48:56+0200, I wrote:
>>>> On 2021-09-03T18:33:37+0200, I wrote:
>>>>> On 2021-09-02T21:09:54+0200, I wrote:
>>>>>> On 2021-09-02T15:59:14+0200, I wrote:
>>>>>>> On 2016-08-05T14:16:58-0400, David Malcolm <dmalcolm@redhat.com> wrote:
>>>>>>>> Committed to trunk as r239175; I'm attaching the final version of the
>>>>>>>> patch for reference.
>>>>>>> David, you've added here 'gcc/input.h:struct location_hash' (see quoted
>>>>>>> below), which will be useful elsewhere, so:
>>>>>>>> --- a/gcc/input.h
>>>>>>>> +++ b/gcc/input.h
>>>>>>>> +struct location_hash : int_hash <location_t, UNKNOWN_LOCATION> { };
>>>>>>>> +
>>>>>>>> +class GTY(()) string_concat_db
>>>>>>>> +{
>>>>>>>> +[...]
>>>>>>>> +  hash_map <location_hash, string_concat *> *m_table;
>>>>>>>> +};
>>>>>>> OK to push the attached
>>>>>>> "Generalize 'gcc/input.h:struct location_hash'"?

> OK

Thanks.  With the commentary slightly updated for PR103157 "'gengtype':
'typedef' causing infinite-recursion code to be generated", I've pushed
to master branch commit 088199e5d0fc0d54f48af0783a2630a773bbb387
"Generalize 'gcc/input.h:struct location_hash'", see attached.
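
For readers wondering why a hasher wants two spare key values, here is a
minimal sketch.  It is not GCC's 'hash_table'/'int_hash' code: 'loc_t',
'kUnknownLoc', 'kBuiltinsLoc', 'is_reserved' and 'loc_set' are made-up
names, and the linear probing with a modulo hash is an assumption chosen
for brevity.  It shows an open-addressing set that encodes the 'Empty' and
'Deleted' slot states in the key itself, which is only safe if those two
values are never used as real keys.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for 'location_t' and its two reserved values.
using loc_t = std::uint32_t;
constexpr loc_t kUnknownLoc = 0;   // marks an 'Empty' slot
constexpr loc_t kBuiltinsLoc = 1;  // marks a 'Deleted' slot

inline bool is_reserved (loc_t loc) { return loc <= kBuiltinsLoc; }

// Toy open-addressing set with linear probing.  No per-slot flag is
// needed, because the slot state lives in the stored key.
class loc_set
{
public:
  explicit loc_set (std::size_t capacity) : m_slots (capacity, kUnknownLoc)
  {
    assert (capacity > 0);
  }

  bool contains (loc_t key) const
  {
    return !is_reserved (key) && find (key) != npos;
  }

  // Returns false if KEY is reserved or the table is full.
  bool insert (loc_t key)
  {
    if (is_reserved (key))
      return false;
    if (find (key) != npos)
      return true;
    for (std::size_t i = 0; i < m_slots.size (); ++i)
      {
        loc_t &slot = m_slots[(key + i) % m_slots.size ()];
        if (is_reserved (slot))  // 'Empty' or 'Deleted': free for reuse.
          {
            slot = key;
            return true;
          }
      }
    return false;
  }

  bool remove (loc_t key)
  {
    if (is_reserved (key))
      return false;
    std::size_t idx = find (key);
    if (idx == npos)
      return false;
    m_slots[idx] = kBuiltinsLoc;  // Leave a 'Deleted' marker behind.
    return true;
  }

private:
  static constexpr std::size_t npos = static_cast<std::size_t> (-1);

  // Probe for KEY; stop at 'Empty', but step over 'Deleted' slots.
  std::size_t find (loc_t key) const
  {
    for (std::size_t i = 0; i < m_slots.size (); ++i)
      {
        std::size_t idx = (key + i) % m_slots.size ();
        if (m_slots[idx] == key)
          return idx;
        if (m_slots[idx] == kUnknownLoc)
          return npos;
      }
    return npos;
  }

  std::vector<loc_t> m_slots;
};
```

With a capacity of 8, keys 10 and 18 collide in this sketch; after removing
10, a lookup of 18 still succeeds because the probe steps over the
'Deleted' marker instead of stopping as it would at 'Empty'.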


Grüße
 Thomas


[-- Attachment #2: 0001-Generalize-gcc-input.h-struct-location_hash.patch --]
[-- Type: text/x-diff, Size: 2438 bytes --]

From 088199e5d0fc0d54f48af0783a2630a773bbb387 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Tue, 31 Aug 2021 23:30:25 +0200
Subject: [PATCH] Generalize 'gcc/input.h:struct location_hash'

This is currently only used here ('gcc/input.h:class string_concat_db'), but is
actually generally useful, so advertize it as such.

Per the rationale given, we may use 'BUILTINS_LOCATION' as spare value for
'Deleted', in addition to the existing use of 'UNKNOWN_LOCATION' as spare value
for 'Empty'.

	gcc/
	* input.h (location_hash): Use 'BUILTINS_LOCATION' as spare value
	for 'Deleted'.  Turn into a '#define'.
---
 gcc/input.h | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/gcc/input.h b/gcc/input.h
index f7b08bdc444..bc44ba2507f 100644
--- a/gcc/input.h
+++ b/gcc/input.h
@@ -36,6 +36,28 @@ extern GTY(()) class line_maps *saved_line_table;
    both UNKNOWN_LOCATION and BUILTINS_LOCATION fit into that.  */
 STATIC_ASSERT (BUILTINS_LOCATION < RESERVED_LOCATION_COUNT);
 
+/* Hasher for 'location_t' values satisfying '!RESERVED_LOCATION_P', thus able
+   to use 'UNKNOWN_LOCATION'/'BUILTINS_LOCATION' as spare values for
+   'Empty'/'Deleted'.  */
+/* Per PR103157 "'gengtype': 'typedef' causing infinite-recursion code to be
+   generated", don't use
+       typedef int_hash<location_t, UNKNOWN_LOCATION, BUILTINS_LOCATION>
+         location_hash;
+   here.
+
+   It works for a single-use case, but when using a 'struct'-based variant
+       struct location_hash
+         : int_hash<location_t, UNKNOWN_LOCATION, BUILTINS_LOCATION> {};
+   in more than one place, 'gengtype' generates duplicate functions (thus:
+   "error: redefinition of 'void gt_ggc_mx(location_hash&)'" etc.).
+   Attempting to mark that one up with GTY options, we run into a 'gengtype'
+   "parse error: expected '{', have '<'", which probably falls into category
+   "understanding of C++ is limited", as documented in 'gcc/doc/gty.texi'.
+
+   Thus, use a plain ol' '#define':
+*/
+#define location_hash int_hash<location_t, UNKNOWN_LOCATION, BUILTINS_LOCATION>
+
 extern bool is_location_from_builtin_token (location_t);
 extern expanded_location expand_location (location_t);
 
@@ -233,8 +255,6 @@ public:
   location_t * GTY ((atomic)) m_locs;
 };
 
-struct location_hash : int_hash <location_t, UNKNOWN_LOCATION> { };
-
 class GTY(()) string_concat_db
 {
  public:
-- 
2.33.0



Thread overview: 61+ messages
2016-07-08 21:22 [PATCH] RFC: On-demand locations within string-literals David Malcolm
2016-07-20 19:38 ` David Malcolm
2016-07-21 16:38   ` Jeff Law
2016-07-26 16:43     ` [PATCH 1/3] (v2) " David Malcolm
2016-07-26 16:43       ` [PATCH 3/3] c-format.c: suggest the correct format string to use (PR c/64955) David Malcolm
2016-07-26 16:43       ` [PATCH 2/3] Use class substring_loc in c-format.c (PR c/52952) David Malcolm
2016-07-26 18:06       ` [PATCH 1/3] (v2) On-demand locations within string-literals Manuel López-Ibáñez
2016-07-27 14:30         ` David Malcolm
2016-07-27 22:42           ` Manuel López-Ibáñez
2016-07-28 20:12             ` David Malcolm
2016-07-28 20:38               ` Martin Sebor
2016-07-28 21:17                 ` Martin Sebor
2016-07-29 12:37                   ` David Malcolm
2016-07-29 14:22                     ` Martin Sebor
2016-07-29 14:46                       ` David Malcolm
2016-07-29 15:26                         ` David Malcolm
2016-07-29 16:54                           ` Manuel López-Ibáñez
2016-07-29 17:27                             ` David Malcolm
2016-07-30  1:18                               ` Manuel López-Ibáñez
2016-08-03 15:56                               ` Jeff Law
2016-08-01 21:13                   ` Joseph Myers
2016-07-29 21:42       ` Joseph Myers
2016-07-30  1:16         ` David Malcolm
2016-08-03 15:17           ` [PATCH 1/4] selftest.h: Add ASSERT_TRUE_AT and ASSERT_FALSE_AT David Malcolm
2016-08-03 15:17             ` [PATCH 3/4] Use class substring_loc in c-format.c (PR c/52952) David Malcolm
2016-08-04 18:09               ` Jeff Law
2016-08-04 19:25                 ` David Malcolm
2016-08-04 20:22                   ` Jeff Law
2016-08-06  0:56                     ` [PATCH] c-format.c: cleanup of check_format_info_main David Malcolm
2016-08-08 17:20                       ` Jeff Law
2016-08-08 20:16                 ` [PATCH 3/4] Use class substring_loc in c-format.c (PR c/52952) David Malcolm
2016-08-03 15:17             ` [PATCH 2/4] (v3) On-demand locations within string-literals David Malcolm
2016-08-04 17:38               ` Jeff Law
2016-08-04 19:21                 ` David Malcolm
2016-08-04 20:18                   ` Jeff Law
2016-08-05 18:17                     ` [Committed] [PATCH 2/4] (v4) " David Malcolm
2016-08-06  5:48                       ` Markus Trippelsdorf
2016-08-06  5:59                         ` Prathamesh Kulkarni
2016-08-06 18:10                           ` [committed] Fix crash in selftest::test_lexer_string_locations_ucn4 (PR bootstrap/72823) David Malcolm
2021-09-02 13:59                       ` [Committed] [PATCH 2/4] (v4) On-demand locations within string-literals Thomas Schwinge
2021-09-02 19:09                         ` Thomas Schwinge
2021-09-03 16:33                           ` Thomas Schwinge
2021-09-10  7:48                             ` [PING] " Thomas Schwinge
2021-09-17 11:16                               ` [PING^2] " Thomas Schwinge
2021-09-30  6:47                                 ` [PING^3] Generalize 'gcc/input.h:struct location_hash' (was: [Committed] [PATCH 2/4] (v4) On-demand locations within string-literals) Thomas Schwinge
2021-10-17 22:33                                   ` Jeff Law
2021-11-09 13:48                                     ` Thomas Schwinge
2021-09-19  5:52                               ` [PING] Re: [Committed] [PATCH 2/4] (v4) On-demand locations within string-literals Jeff Law
2016-08-03 15:17             ` [PATCH 4/4] c-format.c: suggest the correct format string to use (PR c/64955) David Malcolm
2016-08-04 19:55               ` Jeff Law
2016-08-04 21:06                 ` David Malcolm
2016-08-03 16:06             ` [PATCH 1/4] selftest.h: Add ASSERT_TRUE_AT and ASSERT_FALSE_AT Jeff Law
2016-08-04 19:02               ` David Malcolm
2016-08-03 15:59         ` [PATCH 1/3] (v2) On-demand locations within string-literals Jeff Law
2016-08-04 14:27           ` David Malcolm
2016-08-04 17:37             ` Jeff Law
2016-07-23 21:36 ` [PATCH] RFC: " Martin Sebor
2016-07-24  0:37   ` David Malcolm
2016-08-23  3:25     ` Martin Sebor
2016-08-23 13:59       ` David Malcolm
2016-08-23 15:18         ` Martin Sebor
