public inbox for gcc-cvs@sourceware.org
* [gcc r12-4830] contrib: add unicode/utf8-dump.py
@ 2021-11-01 15:52 David Malcolm
From: David Malcolm @ 2021-11-01 15:52 UTC
To: gcc-cvs
https://gcc.gnu.org/g:b050653c4cb63fe46b9727af924f9bc2b6475fba
commit r12-4830-gb050653c4cb63fe46b9727af924f9bc2b6475fba
Author: David Malcolm <dmalcolm@redhat.com>
Date: Fri Oct 8 10:53:42 2021 -0400
contrib: add unicode/utf8-dump.py
This script may be useful when debugging issues relating to Unicode
encoding (e.g. when investigating source files with bidirectional control
characters).
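As a rough illustration of the kind of investigation mentioned above, the
following sketch reports where explicit bidirectional control characters
occur in a piece of text. It is not part of this commit: the function name
find_bidi_controls and the choice of which codepoints to include in
BIDI_CONTROLS are assumptions made for the example.

```python
import unicodedata

# Explicit bidirectional formatting characters: embeddings, overrides,
# isolates and their terminators.  The exact set is an assumption for this
# sketch; marks such as U+200E and U+200F are deliberately left out.
BIDI_CONTROLS = {
    '\u202A', '\u202B', '\u202C', '\u202D', '\u202E',
    '\u2066', '\u2067', '\u2068', '\u2069',
}


def find_bidi_controls(text):
    """Yield (line, column, codepoint label, name) for each bidi control.

    Line and column numbers are 1-based and count characters in logical
    order, splitting lines on '\n' only.
    """
    for line_num, line in enumerate(text.split('\n'), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                yield line_num, col, 'U+%04X' % ord(ch), unicodedata.name(ch)


sample = 'int x;\n/* \u202E evil */\n'
for hit in find_bidi_controls(sample):
    print(hit)
```

Each hit could then be cross-checked against the per-character dump that the
script produces for the same line.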
It dumps a UTF-8 file as a list of numbered lines (mimicking GCC's
diagnostic output format), interleaved with lines per character showing
the Unicode codepoints, the UTF-8 encoding bytes, the name of the
character, and, where printable, the characters themselves.
The lines are printed in logical order, which may help the reader to grok
the relationship between visual and logical ordering in bi-di files.
For example:
$ cat test.c
int གྷ;
const char *אבג = "ALEF-BET-GIMEL";
$ ./contrib/unicode/utf8-dump.py test.c
   1 | int གྷ;
     | U+0069            0x69                     LATIN SMALL LETTER I i
     | U+006E            0x6e                     LATIN SMALL LETTER N n
     | U+0074            0x74                     LATIN SMALL LETTER T t
     | U+0020            0x20                                    SPACE (separator)
     | U+0F43  0xe0 0xbd 0x83                       TIBETAN LETTER GHA གྷ
     | U+003B            0x3b                                SEMICOLON ;
     | U+000A            0x0a                           LINE FEED (LF) (control character)
   2 | const char *אבג = "ALEF-BET-GIMEL";
     | U+0063            0x63                     LATIN SMALL LETTER C c
     | U+006F            0x6f                     LATIN SMALL LETTER O o
     | U+006E            0x6e                     LATIN SMALL LETTER N n
     | U+0073            0x73                     LATIN SMALL LETTER S s
     | U+0074            0x74                     LATIN SMALL LETTER T t
     | U+0020            0x20                                    SPACE (separator)
     | U+0063            0x63                     LATIN SMALL LETTER C c
     | U+0068            0x68                     LATIN SMALL LETTER H h
     | U+0061            0x61                     LATIN SMALL LETTER A a
     | U+0072            0x72                     LATIN SMALL LETTER R r
     | U+0020            0x20                                    SPACE (separator)
     | U+002A            0x2a                                 ASTERISK *
     | U+05D0       0xd7 0x90                       HEBREW LETTER ALEF א
     | U+05D1       0xd7 0x91                        HEBREW LETTER BET ב
     | U+05D2       0xd7 0x92                      HEBREW LETTER GIMEL ג
     | U+0020            0x20                                    SPACE (separator)
     | U+003D            0x3d                              EQUALS SIGN =
     | U+0020            0x20                                    SPACE (separator)
     | U+0022            0x22                           QUOTATION MARK "
     | U+0041            0x41                   LATIN CAPITAL LETTER A A
     | U+004C            0x4c                   LATIN CAPITAL LETTER L L
     | U+0045            0x45                   LATIN CAPITAL LETTER E E
     | U+0046            0x46                   LATIN CAPITAL LETTER F F
     | U+002D            0x2d                             HYPHEN-MINUS -
     | U+0042            0x42                   LATIN CAPITAL LETTER B B
     | U+0045            0x45                   LATIN CAPITAL LETTER E E
     | U+0054            0x54                   LATIN CAPITAL LETTER T T
     | U+002D            0x2d                             HYPHEN-MINUS -
     | U+0047            0x47                   LATIN CAPITAL LETTER G G
     | U+0049            0x49                   LATIN CAPITAL LETTER I I
     | U+004D            0x4d                   LATIN CAPITAL LETTER M M
     | U+0045            0x45                   LATIN CAPITAL LETTER E E
     | U+004C            0x4c                   LATIN CAPITAL LETTER L L
     | U+0022            0x22                           QUOTATION MARK "
     | U+003B            0x3b                                SEMICOLON ;
     | U+000A            0x0a                           LINE FEED (LF) (control character)
Tested with Python 3.8
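In the sample output above, the Hebrew letters are listed in logical order
(ALEF, BET, GIMEL) even though a bidi-aware terminal may draw them right to
left. One way to see why the two orders can differ is to look at each
character's bidirectional class. The sketch below is not part of this
commit, and the helper name bidi_classes is made up for the example; it
relies only on unicodedata.bidirectional() from the standard library.

```python
import unicodedata


def bidi_classes(text):
    """Return (codepoint label, bidi class) pairs in logical order."""
    return [('U+%04X' % ord(ch), unicodedata.bidirectional(ch))
            for ch in text]


# Part of line 2 of test.c: an asterisk, three Hebrew letters, a space and
# an equals sign, iterated in the order they are stored in the file.
for label, bidi_class in bidi_classes('*\u05d0\u05d1\u05d2 ='):
    print(label, bidi_class)
```

Latin letters report the strong left-to-right class L and the Hebrew letters
the strong right-to-left class R, which is what leads a renderer to lay out
the Hebrew run in the opposite direction from the surrounding text.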
contrib/ChangeLog:
* unicode/utf8-dump.py: New file.
Signed-off-by: David Malcolm <dmalcolm@redhat.com>
Diff:
---
contrib/unicode/utf8-dump.py | 69 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 69 insertions(+)
diff --git a/contrib/unicode/utf8-dump.py b/contrib/unicode/utf8-dump.py
new file mode 100755
index 00000000000..f12ee79f9f2
--- /dev/null
+++ b/contrib/unicode/utf8-dump.py
@@ -0,0 +1,69 @@
+#!/usr/bin/env python3
+#
+# Script to dump a UTF-8 file as a list of numbered lines (mimicking GCC's
+# diagnostic output format), interleaved with lines per character showing
+# the Unicode codepoints, the UTF-8 encoding bytes, the name of the
+# character, and, where printable, the characters themselves.
+# The lines are printed in logical order, which may help the reader to grok
+# the relationship between visual and logical ordering in bi-di files.
+#
+# SPDX-License-Identifier: MIT
+#
+# Copyright (C) 2021 David Malcolm <dmalcolm@redhat.com>.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a
+# copy of this software and associated documentation files (the "Software"),
+# to deal in the Software without restriction, including without limitation
+# the rights to use, copy, modify, merge, publish, distribute, sublicense,
+# and/or sell copies of the Software, and to permit persons to whom the
+# Software is furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included
+# in all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+# OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+# IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+# CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT
+# OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE
+# OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
+import sys
+import unicodedata
+
+
+def get_name(ch):
+    try:
+        return unicodedata.name(ch)
+    except ValueError:
+        if ch == '\n':
+            return 'LINE FEED (LF)'
+        return '(unknown)'
+
+
+def get_printable(ch):
+    cat = unicodedata.category(ch)
+    if cat == 'Cc':
+        return '(control character)'
+    elif cat == 'Cf':
+        return '(format control)'
+    elif cat[0] == 'Z':
+        return '(separator)'
+    return ch
+
+
+def dump_file(f_in):
+    line_num = 1
+    for line in f_in:
+        print('%4i | %s' % (line_num, line.rstrip()))
+        for ch in line:
+            utf8_desc = '%15s' % (' '.join(['0x%02x' % b
+                                            for b in ch.encode('utf-8')]))
+            print('%4s | U+%04X %s %40s %s'
+                  % ('', ord(ch), utf8_desc, get_name(ch), get_printable(ch)))
+        line_num += 1
+
+
+with open(sys.argv[1], mode='r') as f_in:
+    dump_file(f_in)