* [PATCH] Make Unicode generation reproducible.
From: Carlos O'Donell @ 2021-04-29 17:27 UTC
To: libc-alpha, fweimer, joseph
The following changes make Unicode generation reproducible.
First we create a UnicodeRelease.txt file with metadata about the
release. This metadata contains the release date for the Unicode
version that we imported into glibc. Then we add APIs to
unicode_utils.py to access the release metadata. Then we refactor
all of the code to use the release metadata, which includes
consistently using the date of the Unicode release for the required
LC_IDENTIFICATION dates. If existing files like i18n_ctype or tr_TR
already have newer dates then we keep those; otherwise we use the
newer date from the Unicode release.
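For reference, the metadata file is a simple comma-separated
key/value list with '%' comment lines; the accessor added to
unicode_utils.py can be sketched roughly as follows (names taken
from the patch below, simplified here):

```python
def release_metadata(data_dir, parameter):
    '''Return the value for PARAMETER from UnicodeRelease.txt in
    DATA_DIR, skipping blank lines and '%' comment lines.'''
    value = ''
    with open(data_dir + '/UnicodeRelease.txt') as metadata_file:
        for line in metadata_file:
            stripped = line.strip()
            if not stripped or stripped.startswith('%'):
                continue
            fields = stripped.split(',')
            if fields[0] == parameter:
                value = fields[1].strip()
    # The metadata file must define every parameter we ask for.
    assert value != ''
    return value
```

The convenience wrappers (release_version, release_date, and the
per-file helpers) each just call this with a fixed parameter name.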
All data files are regenerated with:
cd localedata/unicode-gen
make
make install
Subsequent regenerations will not alter any file dates, which makes
the Unicode generation reproducible.
Tested on x86_64 and i686 without regression.
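The date handling described above (keep an existing
LC_IDENTIFICATION date if it is newer than the Unicode release date,
otherwise take the release date) reduces to a comparison of ISO
dates; a minimal sketch, with a hypothetical helper name not used in
the patch itself:

```python
import datetime

def pick_identification_date(existing_date, release_date):
    '''Return the newer of an existing LC_IDENTIFICATION date and the
    Unicode release date; both are ISO "YYYY-MM-DD" strings.'''
    existing = datetime.date.fromisoformat(existing_date)
    release = datetime.date.fromisoformat(release_date)
    # Keep the existing date only when it is strictly newer.
    if existing > release:
        return existing_date
    return release_date
```

Because the result depends only on the two dates, not on when the
generator runs, repeated regeneration yields identical output.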
---
localedata/locales/i18n_ctype | 4 +-
localedata/locales/tr_TR | 2 +-
localedata/locales/translit_circle | 2 +-
localedata/locales/translit_cjk_compat | 2 +-
localedata/locales/translit_combining | 2 +-
localedata/locales/translit_compat | 2 +-
localedata/locales/translit_font | 2 +-
localedata/locales/translit_fraction | 2 +-
localedata/unicode-gen/Makefile | 66 ++++++++-----------
localedata/unicode-gen/UnicodeRelease.txt | 8 +++
localedata/unicode-gen/gen_translit_circle.py | 20 +++---
.../unicode-gen/gen_translit_cjk_compat.py | 20 +++---
.../unicode-gen/gen_translit_combining.py | 20 +++---
localedata/unicode-gen/gen_translit_compat.py | 20 +++---
localedata/unicode-gen/gen_translit_font.py | 20 +++---
.../unicode-gen/gen_translit_fraction.py | 20 +++---
localedata/unicode-gen/gen_unicode_ctype.py | 50 ++++++--------
localedata/unicode-gen/unicode_utils.py | 38 +++++++++++
localedata/unicode-gen/utf8_compatibility.py | 27 ++++----
localedata/unicode-gen/utf8_gen.py | 61 +++++++----------
20 files changed, 189 insertions(+), 199 deletions(-)
create mode 100644 localedata/unicode-gen/UnicodeRelease.txt
diff --git a/localedata/locales/i18n_ctype b/localedata/locales/i18n_ctype
index c63e0790fc..f5063fe743 100644
--- a/localedata/locales/i18n_ctype
+++ b/localedata/locales/i18n_ctype
@@ -13,7 +13,7 @@ comment_char %
% information, but with different transliterations, can include it
% directly.
-% Generated automatically by gen_unicode_ctype.py for Unicode 12.1.0.
+% Generated automatically by gen_unicode_ctype.py.
LC_IDENTIFICATION
title "Unicode 13.0.0 FDCC-set"
@@ -26,7 +26,7 @@ fax ""
language ""
territory "Earth"
revision "13.0.0"
-date "2020-06-25"
+date "2021-03-10"
category "i18n:2012";LC_CTYPE
END LC_IDENTIFICATION
diff --git a/localedata/locales/tr_TR b/localedata/locales/tr_TR
index 7dbb923228..ff8b315b7b 100644
--- a/localedata/locales/tr_TR
+++ b/localedata/locales/tr_TR
@@ -43,7 +43,7 @@ fax ""
language "Turkish"
territory "Turkey"
revision "1.0"
-date "2020-06-25"
+date "2021-03-10"
category "i18n:2012";LC_IDENTIFICATION
category "i18n:2012";LC_CTYPE
diff --git a/localedata/locales/translit_circle b/localedata/locales/translit_circle
index 5c07b44532..f2ef558e2d 100644
--- a/localedata/locales/translit_circle
+++ b/localedata/locales/translit_circle
@@ -9,7 +9,7 @@ comment_char %
% otherwise be governed by that license.
% Transliterations of encircled characters.
-% Generated automatically from UnicodeData.txt by gen_translit_circle.py on 2020-06-25 for Unicode 13.0.0.
+% Generated automatically from UnicodeData.txt by gen_translit_circle.py for Unicode 13.0.0.
LC_CTYPE
diff --git a/localedata/locales/translit_cjk_compat b/localedata/locales/translit_cjk_compat
index ee0d7f83c6..2696445dbf 100644
--- a/localedata/locales/translit_cjk_compat
+++ b/localedata/locales/translit_cjk_compat
@@ -9,7 +9,7 @@ comment_char %
% otherwise be governed by that license.
% Transliterations of CJK compatibility characters.
-% Generated automatically from UnicodeData.txt by gen_translit_cjk_compat.py on 2020-06-25 for Unicode 13.0.0.
+% Generated automatically from UnicodeData.txt by gen_translit_cjk_compat.py for Unicode 13.0.0.
LC_CTYPE
diff --git a/localedata/locales/translit_combining b/localedata/locales/translit_combining
index 36128f097a..b8e6b7efbd 100644
--- a/localedata/locales/translit_combining
+++ b/localedata/locales/translit_combining
@@ -10,7 +10,7 @@ comment_char %
% Transliterations that remove all combining characters (accents,
% pronounciation marks, etc.).
-% Generated automatically from UnicodeData.txt by gen_translit_combining.py on 2020-06-25 for Unicode 13.0.0.
+% Generated automatically from UnicodeData.txt by gen_translit_combining.py for Unicode 13.0.0.
LC_CTYPE
diff --git a/localedata/locales/translit_compat b/localedata/locales/translit_compat
index ac24c4e938..61cdcccbc9 100644
--- a/localedata/locales/translit_compat
+++ b/localedata/locales/translit_compat
@@ -9,7 +9,7 @@ comment_char %
% otherwise be governed by that license.
% Transliterations of compatibility characters and ligatures.
-% Generated automatically from UnicodeData.txt by gen_translit_compat.py on 2020-06-25 for Unicode 13.0.0.
+% Generated automatically from UnicodeData.txt by gen_translit_compat.py for Unicode 13.0.0.
LC_CTYPE
diff --git a/localedata/locales/translit_font b/localedata/locales/translit_font
index 680c4ed426..c3d7b44772 100644
--- a/localedata/locales/translit_font
+++ b/localedata/locales/translit_font
@@ -9,7 +9,7 @@ comment_char %
% otherwise be governed by that license.
% Transliterations of font equivalents.
-% Generated automatically from UnicodeData.txt by gen_translit_font.py on 2020-06-25 for Unicode 13.0.0.
+% Generated automatically from UnicodeData.txt by gen_translit_font.py for Unicode 13.0.0.
LC_CTYPE
diff --git a/localedata/locales/translit_fraction b/localedata/locales/translit_fraction
index b52244969e..292fe3e806 100644
--- a/localedata/locales/translit_fraction
+++ b/localedata/locales/translit_fraction
@@ -9,7 +9,7 @@ comment_char %
% otherwise be governed by that license.
% Transliterations of fractions.
-% Generated automatically from UnicodeData.txt by gen_translit_fraction.py on 2020-06-25 for Unicode 13.0.0.
+% Generated automatically from UnicodeData.txt by gen_translit_fraction.py for Unicode 13.0.0.
% The replacements have been surrounded with spaces, because fractions are
% often preceded by a decimal number and followed by a unit or a math symbol.
diff --git a/localedata/unicode-gen/Makefile b/localedata/unicode-gen/Makefile
index d0dd1b78a5..b5c9c5517b 100644
--- a/localedata/unicode-gen/Makefile
+++ b/localedata/unicode-gen/Makefile
@@ -18,11 +18,10 @@
# Makefile for generating and updating Unicode-extracted files.
-# This Makefile is NOT used as part of the GNU libc build. It needs
-# to be run manually, within the source tree, at Unicode upgrades
-# (change UNICODE_VERSION below), to update ../locales/i18n_ctype ctype
-# information (part of the file is preserved, so don't wipe it all
-# out), and ../charmaps/UTF-8.
+# This Makefile is NOT used as part of the GNU libc build. It needs to
+# be run manually, within the source tree, at Unicode upgrades, to
+# update ../locales/i18n_ctype ctype information (part of the file is
+# preserved, so don't wipe it all out), and ../charmaps/UTF-8.
# Use make all to generate the files used in the glibc build out of
# the original Unicode files; make check to verify that they are what
@@ -33,13 +32,14 @@
# running afoul of the LGPL corresponding sources requirements, even
# though it's not clear that they are preferred over the generated
# files for making modifications.
-
-
-UNICODE_VERSION = 13.0.0
+#
+# The UnicodeRelease.txt file must be updated manually to include the
+# information about the downloaded Unicode release.
PYTHON3 = python3
WGET = wget
+RELEASEDATA = UnicodeRelease.txt
DOWNLOADS = UnicodeData.txt DerivedCoreProperties.txt EastAsianWidth.txt PropList.txt
GENERATED = i18n_ctype tr_TR UTF-8 translit_combining translit_compat translit_circle translit_cjk_compat translit_font translit_fraction
REPORTS = i18n_ctype-report UTF-8-report
@@ -66,12 +66,10 @@ mostlyclean:
.PHONY: all check clean mostlyclean install
-i18n_ctype: UnicodeData.txt DerivedCoreProperties.txt
+i18n_ctype: UnicodeData.txt DerivedCoreProperties.txt $(RELEASEDATA)
i18n_ctype: ../locales/i18n_ctype # Preserve non-ctype information.
i18n_ctype: gen_unicode_ctype.py
- $(PYTHON3) gen_unicode_ctype.py -u UnicodeData.txt \
- -d DerivedCoreProperties.txt -i ../locales/i18n_ctype -o $@ \
- --unicode_version $(UNICODE_VERSION)
+ $(PYTHON3) gen_unicode_ctype.py -i ../locales/i18n_ctype -o $@
i18n_ctype-report: i18n_ctype ../locales/i18n_ctype
i18n_ctype-report: ctype_compatibility.py ctype_compatibility_test_cases.py
@@ -86,55 +84,45 @@ check-i18n_ctype: i18n_ctype-report
tr_TR: UnicodeData.txt DerivedCoreProperties.txt
tr_TR: ../locales/tr_TR # Preserve non-ctype information.
tr_TR: gen_unicode_ctype.py
- $(PYTHON3) gen_unicode_ctype.py -u UnicodeData.txt \
- -d DerivedCoreProperties.txt -i ../locales/tr_TR -o $@ \
- --unicode_version $(UNICODE_VERSION) --turkish
+ $(PYTHON3) gen_unicode_ctype.py -i ../locales/tr_TR -o $@ \
+ --turkish
-UTF-8: UnicodeData.txt EastAsianWidth.txt
+UTF-8: UnicodeData.txt EastAsianWidth.txt $(RELEASEDATA)
UTF-8: utf8_gen.py
- $(PYTHON3) utf8_gen.py -u UnicodeData.txt \
- -e EastAsianWidth.txt -p PropList.txt \
- --unicode_version $(UNICODE_VERSION)
+ $(PYTHON3) utf8_gen.py
UTF-8-report: UTF-8 ../charmaps/UTF-8
UTF-8-report: utf8_compatibility.py
- $(PYTHON3) ./utf8_compatibility.py -u UnicodeData.txt \
- -e EastAsianWidth.txt -o ../charmaps/UTF-8 \
+ $(PYTHON3) ./utf8_compatibility.py -o ../charmaps/UTF-8 \
-n UTF-8 -a -m -c > $@
check-UTF-8: UTF-8-report
@if grep '^Total.*: [^0]' UTF-8-report; \
then echo manual verification required; false; else true; fi
-translit_combining: UnicodeData.txt
+translit_combining: UnicodeData.txt $(RELEASEDATA)
translit_combining: gen_translit_combining.py
- $(PYTHON3) ./gen_translit_combining.py -u UnicodeData.txt \
- -o $@ --unicode_version $(UNICODE_VERSION)
+ $(PYTHON3) ./gen_translit_combining.py -o $@
-translit_compat: UnicodeData.txt
+translit_compat: UnicodeData.txt $(RELEASEDATA)
translit_compat: gen_translit_compat.py
- $(PYTHON3) ./gen_translit_compat.py -u UnicodeData.txt \
- -o $@ --unicode_version $(UNICODE_VERSION)
+ $(PYTHON3) ./gen_translit_compat.py -o $@
-translit_circle: UnicodeData.txt
+translit_circle: UnicodeData.txt $(RELEASEDATA)
translit_circle: gen_translit_circle.py
- $(PYTHON3) ./gen_translit_circle.py -u UnicodeData.txt \
- -o $@ --unicode_version $(UNICODE_VERSION)
+ $(PYTHON3) ./gen_translit_circle.py -o $@
-translit_cjk_compat: UnicodeData.txt
+translit_cjk_compat: UnicodeData.txt $(RELEASEDATA)
translit_cjk_compat: gen_translit_cjk_compat.py
- $(PYTHON3) ./gen_translit_cjk_compat.py -u UnicodeData.txt \
- -o $@ --unicode_version $(UNICODE_VERSION)
+ $(PYTHON3) ./gen_translit_cjk_compat.py -o $@
-translit_font: UnicodeData.txt
+translit_font: UnicodeData.txt $(RELEASEDATA)
translit_font: gen_translit_font.py
- $(PYTHON3) ./gen_translit_font.py -u UnicodeData.txt \
- -o $@ --unicode_version $(UNICODE_VERSION)
+ $(PYTHON3) ./gen_translit_font.py -o $@
-translit_fraction: UnicodeData.txt
+translit_fraction: UnicodeData.txt $(RELEASEDATA)
translit_fraction: gen_translit_fraction.py
- $(PYTHON3) ./gen_translit_fraction.py -u UnicodeData.txt \
- -o $@ --unicode_version $(UNICODE_VERSION)
+ $(PYTHON3) ./gen_translit_fraction.py -o $@
.PHONY: downloads clean-downloads
downloads: $(DOWNLOADS)
diff --git a/localedata/unicode-gen/UnicodeRelease.txt b/localedata/unicode-gen/UnicodeRelease.txt
new file mode 100644
index 0000000000..bd9cc14ae0
--- /dev/null
+++ b/localedata/unicode-gen/UnicodeRelease.txt
@@ -0,0 +1,8 @@
+% This metadata is used by glibc and updated by the developer(s)
+% carrying out the Unicode update.
+Version,13.0.0
+ReleaseDate,2021-03-10
+Data,UnicodeData.txt
+DcpData,DerivedCoreProperties.txt
+EawData,EastAsianWidth.txt
+PlData,PropList.txt
diff --git a/localedata/unicode-gen/gen_translit_circle.py b/localedata/unicode-gen/gen_translit_circle.py
index a83dccc163..cc897b2f5f 100644
--- a/localedata/unicode-gen/gen_translit_circle.py
+++ b/localedata/unicode-gen/gen_translit_circle.py
@@ -67,7 +67,6 @@ def output_head(translit_file, unicode_version, head=''):
translit_file.write('% Transliterations of encircled characters.\n')
translit_file.write('% Generated automatically from UnicodeData.txt '
+ 'by gen_translit_circle.py '
- + 'on {:s} '.format(time.strftime('%Y-%m-%d'))
+ 'for Unicode {:s}.\n'.format(unicode_version))
translit_file.write('\n')
translit_file.write('LC_CTYPE\n')
@@ -110,11 +109,11 @@ if __name__ == "__main__":
Generate a translit_circle file from UnicodeData.txt.
''')
PARSER.add_argument(
- '-u', '--unicode_data_file',
+ '-u', '--unicode_data_dir',
nargs='?',
type=str,
- default='UnicodeData.txt',
- help=('The UnicodeData.txt file to read, '
+ default='.',
+ help=('The directory containing Unicode data to read, '
+ 'default: %(default)s'))
PARSER.add_argument(
'-i', '--input_file',
@@ -133,19 +132,16 @@ if __name__ == "__main__":
“translit_start” line and the tail from the “translit_end”
line to the end of the file will be copied unchanged into the
output file. ''')
- PARSER.add_argument(
- '--unicode_version',
- nargs='?',
- required=True,
- type=str,
- help='The Unicode version of the input files used.')
ARGS = PARSER.parse_args()
- unicode_utils.fill_attributes(ARGS.unicode_data_file)
+ unicode_version = unicode_utils.release_version(ARGS.unicode_data_dir)
+ unicode_data_file = unicode_utils.release_data_file(ARGS.unicode_data_dir)
+
+ unicode_utils.fill_attributes(unicode_data_file)
HEAD = TAIL = ''
if ARGS.input_file:
(HEAD, TAIL) = read_input_file(ARGS.input_file)
with open(ARGS.output_file, mode='w') as TRANSLIT_FILE:
- output_head(TRANSLIT_FILE, ARGS.unicode_version, head=HEAD)
+ output_head(TRANSLIT_FILE, unicode_version, head=HEAD)
output_transliteration(TRANSLIT_FILE)
output_tail(TRANSLIT_FILE, tail=TAIL)
diff --git a/localedata/unicode-gen/gen_translit_cjk_compat.py b/localedata/unicode-gen/gen_translit_cjk_compat.py
index a040511d06..ac127a8e21 100644
--- a/localedata/unicode-gen/gen_translit_cjk_compat.py
+++ b/localedata/unicode-gen/gen_translit_cjk_compat.py
@@ -69,7 +69,6 @@ def output_head(translit_file, unicode_version, head=''):
translit_file.write('characters.\n')
translit_file.write('% Generated automatically from UnicodeData.txt '
+ 'by gen_translit_cjk_compat.py '
- + 'on {:s} '.format(time.strftime('%Y-%m-%d'))
+ 'for Unicode {:s}.\n'.format(unicode_version))
translit_file.write('\n')
translit_file.write('LC_CTYPE\n')
@@ -180,11 +179,11 @@ if __name__ == "__main__":
Generate a translit_cjk_compat file from UnicodeData.txt.
''')
PARSER.add_argument(
- '-u', '--unicode_data_file',
+ '-u', '--unicode_data_dir',
nargs='?',
type=str,
- default='UnicodeData.txt',
- help=('The UnicodeData.txt file to read, '
+ default='.',
+ help=('The directory containing Unicode data to read, '
+ 'default: %(default)s'))
PARSER.add_argument(
'-i', '--input_file',
@@ -203,19 +202,16 @@ if __name__ == "__main__":
“translit_start” line and the tail from the “translit_end”
line to the end of the file will be copied unchanged into the
output file. ''')
- PARSER.add_argument(
- '--unicode_version',
- nargs='?',
- required=True,
- type=str,
- help='The Unicode version of the input files used.')
ARGS = PARSER.parse_args()
- unicode_utils.fill_attributes(ARGS.unicode_data_file)
+ unicode_version = unicode_utils.release_version(ARGS.unicode_data_dir)
+ unicode_data_file = unicode_utils.release_data_file(ARGS.unicode_data_dir)
+
+ unicode_utils.fill_attributes(unicode_data_file)
HEAD = TAIL = ''
if ARGS.input_file:
(HEAD, TAIL) = read_input_file(ARGS.input_file)
with open(ARGS.output_file, mode='w') as TRANSLIT_FILE:
- output_head(TRANSLIT_FILE, ARGS.unicode_version, head=HEAD)
+ output_head(TRANSLIT_FILE, unicode_version, head=HEAD)
output_transliteration(TRANSLIT_FILE)
output_tail(TRANSLIT_FILE, tail=TAIL)
diff --git a/localedata/unicode-gen/gen_translit_combining.py b/localedata/unicode-gen/gen_translit_combining.py
index 88be8f4b8a..082c0da92c 100644
--- a/localedata/unicode-gen/gen_translit_combining.py
+++ b/localedata/unicode-gen/gen_translit_combining.py
@@ -69,7 +69,6 @@ def output_head(translit_file, unicode_version, head=''):
translit_file.write('% pronounciation marks, etc.).\n')
translit_file.write('% Generated automatically from UnicodeData.txt '
+ 'by gen_translit_combining.py '
- + 'on {:s} '.format(time.strftime('%Y-%m-%d'))
+ 'for Unicode {:s}.\n'.format(unicode_version))
translit_file.write('\n')
translit_file.write('LC_CTYPE\n')
@@ -404,11 +403,11 @@ if __name__ == "__main__":
Generate a translit_combining file from UnicodeData.txt.
''')
PARSER.add_argument(
- '-u', '--unicode_data_file',
+ '-u', '--unicode_data_dir',
nargs='?',
type=str,
- default='UnicodeData.txt',
- help=('The UnicodeData.txt file to read, '
+ default='.',
+ help=('The directory containing Unicode data to read, '
+ 'default: %(default)s'))
PARSER.add_argument(
'-i', '--input_file',
@@ -427,19 +426,16 @@ if __name__ == "__main__":
“translit_start” line and the tail from the “translit_end”
line to the end of the file will be copied unchanged into the
output file. ''')
- PARSER.add_argument(
- '--unicode_version',
- nargs='?',
- required=True,
- type=str,
- help='The Unicode version of the input files used.')
ARGS = PARSER.parse_args()
- unicode_utils.fill_attributes(ARGS.unicode_data_file)
+ unicode_version = unicode_utils.release_version(ARGS.unicode_data_dir)
+ unicode_data_file = unicode_utils.release_data_file(ARGS.unicode_data_dir)
+
+ unicode_utils.fill_attributes(unicode_data_file)
HEAD = TAIL = ''
if ARGS.input_file:
(HEAD, TAIL) = read_input_file(ARGS.input_file)
with open(ARGS.output_file, mode='w') as TRANSLIT_FILE:
- output_head(TRANSLIT_FILE, ARGS.unicode_version, head=HEAD)
+ output_head(TRANSLIT_FILE, unicode_version, head=HEAD)
output_transliteration(TRANSLIT_FILE)
output_tail(TRANSLIT_FILE, tail=TAIL)
diff --git a/localedata/unicode-gen/gen_translit_compat.py b/localedata/unicode-gen/gen_translit_compat.py
index c8c63b23af..ba144e9bee 100644
--- a/localedata/unicode-gen/gen_translit_compat.py
+++ b/localedata/unicode-gen/gen_translit_compat.py
@@ -68,7 +68,6 @@ def output_head(translit_file, unicode_version, head=''):
translit_file.write('and ligatures.\n')
translit_file.write('% Generated automatically from UnicodeData.txt '
+ 'by gen_translit_compat.py '
- + 'on {:s} '.format(time.strftime('%Y-%m-%d'))
+ 'for Unicode {:s}.\n'.format(unicode_version))
translit_file.write('\n')
translit_file.write('LC_CTYPE\n')
@@ -286,11 +285,11 @@ if __name__ == "__main__":
Generate a translit_compat file from UnicodeData.txt.
''')
PARSER.add_argument(
- '-u', '--unicode_data_file',
+ '-u', '--unicode_data_dir',
nargs='?',
type=str,
- default='UnicodeData.txt',
- help=('The UnicodeData.txt file to read, '
+ default='.',
+ help=('The directory containing Unicode data to read, '
+ 'default: %(default)s'))
PARSER.add_argument(
'-i', '--input_file',
@@ -309,19 +308,16 @@ if __name__ == "__main__":
“translit_start” line and the tail from the “translit_end”
line to the end of the file will be copied unchanged into the
output file. ''')
- PARSER.add_argument(
- '--unicode_version',
- nargs='?',
- required=True,
- type=str,
- help='The Unicode version of the input files used.')
ARGS = PARSER.parse_args()
- unicode_utils.fill_attributes(ARGS.unicode_data_file)
+ unicode_version = unicode_utils.release_version(ARGS.unicode_data_dir)
+ unicode_data_file = unicode_utils.release_data_file(ARGS.unicode_data_dir)
+
+ unicode_utils.fill_attributes(unicode_data_file)
HEAD = TAIL = ''
if ARGS.input_file:
(HEAD, TAIL) = read_input_file(ARGS.input_file)
with open(ARGS.output_file, mode='w') as TRANSLIT_FILE:
- output_head(TRANSLIT_FILE, ARGS.unicode_version, head=HEAD)
+ output_head(TRANSLIT_FILE, unicode_version, head=HEAD)
output_transliteration(TRANSLIT_FILE)
output_tail(TRANSLIT_FILE, tail=TAIL)
diff --git a/localedata/unicode-gen/gen_translit_font.py b/localedata/unicode-gen/gen_translit_font.py
index db41b47fab..93b2f128fa 100644
--- a/localedata/unicode-gen/gen_translit_font.py
+++ b/localedata/unicode-gen/gen_translit_font.py
@@ -67,7 +67,6 @@ def output_head(translit_file, unicode_version, head=''):
translit_file.write('% Transliterations of font equivalents.\n')
translit_file.write('% Generated automatically from UnicodeData.txt '
+ 'by gen_translit_font.py '
- + 'on {:s} '.format(time.strftime('%Y-%m-%d'))
+ 'for Unicode {:s}.\n'.format(unicode_version))
translit_file.write('\n')
translit_file.write('LC_CTYPE\n')
@@ -116,11 +115,11 @@ if __name__ == "__main__":
Generate a translit_font file from UnicodeData.txt.
''')
PARSER.add_argument(
- '-u', '--unicode_data_file',
+ '-u', '--unicode_data_dir',
nargs='?',
type=str,
- default='UnicodeData.txt',
- help=('The UnicodeData.txt file to read, '
+ default='.',
+ help=('The directory containing Unicode data to read, '
+ 'default: %(default)s'))
PARSER.add_argument(
'-i', '--input_file',
@@ -139,19 +138,16 @@ if __name__ == "__main__":
“translit_start” line and the tail from the “translit_end”
line to the end of the file will be copied unchanged into the
output file. ''')
- PARSER.add_argument(
- '--unicode_version',
- nargs='?',
- required=True,
- type=str,
- help='The Unicode version of the input files used.')
ARGS = PARSER.parse_args()
- unicode_utils.fill_attributes(ARGS.unicode_data_file)
+ unicode_version = unicode_utils.release_version(ARGS.unicode_data_dir)
+ unicode_data_file = unicode_utils.release_data_file(ARGS.unicode_data_dir)
+
+ unicode_utils.fill_attributes(unicode_data_file)
HEAD = TAIL = ''
if ARGS.input_file:
(HEAD, TAIL) = read_input_file(ARGS.input_file)
with open(ARGS.output_file, mode='w') as TRANSLIT_FILE:
- output_head(TRANSLIT_FILE, ARGS.unicode_version, head=HEAD)
+ output_head(TRANSLIT_FILE, unicode_version, head=HEAD)
output_transliteration(TRANSLIT_FILE)
output_tail(TRANSLIT_FILE, tail=TAIL)
diff --git a/localedata/unicode-gen/gen_translit_fraction.py b/localedata/unicode-gen/gen_translit_fraction.py
index c3c1513eb9..097cb04ea0 100644
--- a/localedata/unicode-gen/gen_translit_fraction.py
+++ b/localedata/unicode-gen/gen_translit_fraction.py
@@ -67,7 +67,6 @@ def output_head(translit_file, unicode_version, head=''):
translit_file.write('% Transliterations of fractions.\n')
translit_file.write('% Generated automatically from UnicodeData.txt '
+ 'by gen_translit_fraction.py '
- + 'on {:s} '.format(time.strftime('%Y-%m-%d'))
+ 'for Unicode {:s}.\n'.format(unicode_version))
translit_file.write('% The replacements have been surrounded ')
translit_file.write('with spaces, because fractions are\n')
@@ -157,11 +156,11 @@ if __name__ == "__main__":
Generate a translit_cjk_compat file from UnicodeData.txt.
''')
PARSER.add_argument(
- '-u', '--unicode_data_file',
+ '-u', '--unicode_data_dir',
nargs='?',
type=str,
- default='UnicodeData.txt',
- help=('The UnicodeData.txt file to read, '
+ default='.',
+ help=('The directory containing Unicode data to read, '
+ 'default: %(default)s'))
PARSER.add_argument(
'-i', '--input_file',
@@ -180,19 +179,16 @@ if __name__ == "__main__":
“translit_start” line and the tail from the “translit_end”
line to the end of the file will be copied unchanged into the
output file. ''')
- PARSER.add_argument(
- '--unicode_version',
- nargs='?',
- required=True,
- type=str,
- help='The Unicode version of the input files used.')
ARGS = PARSER.parse_args()
- unicode_utils.fill_attributes(ARGS.unicode_data_file)
+ unicode_version = unicode_utils.release_version(ARGS.unicode_data_dir)
+ unicode_data_file = unicode_utils.release_data_file(ARGS.unicode_data_dir)
+
+ unicode_utils.fill_attributes(unicode_data_file)
HEAD = TAIL = ''
if ARGS.input_file:
(HEAD, TAIL) = read_input_file(ARGS.input_file)
with open(ARGS.output_file, mode='w') as TRANSLIT_FILE:
- output_head(TRANSLIT_FILE, ARGS.unicode_version, head=HEAD)
+ output_head(TRANSLIT_FILE, unicode_version, head=HEAD)
output_transliteration(TRANSLIT_FILE)
output_tail(TRANSLIT_FILE, tail=TAIL)
diff --git a/localedata/unicode-gen/gen_unicode_ctype.py b/localedata/unicode-gen/gen_unicode_ctype.py
index 7548961df1..41760567cf 100755
--- a/localedata/unicode-gen/gen_unicode_ctype.py
+++ b/localedata/unicode-gen/gen_unicode_ctype.py
@@ -32,6 +32,7 @@ To see how this script is used, call it with the “-h” option:
import argparse
import time
import re
+import datetime
import unicode_utils
def code_point_ranges(is_class_function):
@@ -123,7 +124,7 @@ def output_charmap(i18n_file, map_name, map_function):
i18n_file.write(line+'\n')
i18n_file.write('\n')
-def read_input_file(filename):
+def read_input_file(filename, unicode_release_date):
'''Reads the original glibc i18n file to get the original head
and tail.
@@ -140,8 +141,13 @@ def read_input_file(filename):
r'^(?P<key>date\s+)(?P<value>"[0-9]{4}-[0-9]{2}-[0-9]{2}")',
line)
if match:
- line = match.group('key') \
- + '"{:s}"\n'.format(time.strftime('%Y-%m-%d'))
+ # Update the file date if the Unicode standard date
+ # is newer.
+ orig_date = datetime.date.fromisoformat(match.group('value').strip('"'))
+ new_date = datetime.date.fromisoformat(unicode_release_date)
+ if new_date > orig_date:
+ line = match.group('key') \
+ + '"{:s}"\n'.format(unicode_release_date)
head = head + line
if line.startswith('LC_CTYPE'):
break
@@ -153,7 +159,7 @@ def read_input_file(filename):
tail = tail + line
return (head, tail)
-def output_head(i18n_file, unicode_version, head=''):
+def output_head(i18n_file, unicode_version, unicode_release_date, head=''):
'''Write the header of the output file, i.e. the part of the file
before the “LC_CTYPE” line.
'''
@@ -180,8 +186,7 @@ def output_head(i18n_file, unicode_version, head=''):
i18n_file.write('language ""\n')
i18n_file.write('territory "Earth"\n')
i18n_file.write('revision "{:s}"\n'.format(unicode_version))
- i18n_file.write('date "{:s}"\n'.format(
- time.strftime('%Y-%m-%d')))
+ i18n_file.write('date "{:s}"\n'.format(unicode_release_date))
i18n_file.write('category "i18n:2012";LC_CTYPE\n')
i18n_file.write('END LC_IDENTIFICATION\n')
i18n_file.write('\n')
@@ -267,18 +272,11 @@ if __name__ == "__main__":
UnicodeData.txt and DerivedCoreProperties.txt files.
''')
PARSER.add_argument(
- '-u', '--unicode_data_file',
+ '-u', '--unicode_data_dir',
nargs='?',
type=str,
- default='UnicodeData.txt',
- help=('The UnicodeData.txt file to read, '
- + 'default: %(default)s'))
- PARSER.add_argument(
- '-d', '--derived_core_properties_file',
- nargs='?',
- type=str,
- default='DerivedCoreProperties.txt',
- help=('The DerivedCoreProperties.txt file to read, '
+ default='.',
+ help=('The directory containing Unicode data to read, '
+ 'default: %(default)s'))
PARSER.add_argument(
'-i', '--input_file',
@@ -298,27 +296,21 @@ if __name__ == "__main__":
classes and the date stamp in
LC_IDENTIFICATION will be copied unchanged
into the output file. ''')
- PARSER.add_argument(
- '--unicode_version',
- nargs='?',
- required=True,
- type=str,
- help='The Unicode version of the input files used.')
PARSER.add_argument(
'--turkish',
action='store_true',
help='Use Turkish case conversions.')
ARGS = PARSER.parse_args()
- unicode_utils.fill_attributes(
- ARGS.unicode_data_file)
- unicode_utils.fill_derived_core_properties(
- ARGS.derived_core_properties_file)
+ unicode_version = unicode_utils.release_version(ARGS.unicode_data_dir)
+ unicode_release_date = unicode_utils.release_date(ARGS.unicode_data_dir)
+ unicode_utils.fill_attributes(unicode_utils.release_data_file(ARGS.unicode_data_dir))
+ unicode_utils.fill_derived_core_properties(unicode_utils.release_dcp_file(ARGS.unicode_data_dir))
unicode_utils.verifications()
HEAD = TAIL = ''
if ARGS.input_file:
- (HEAD, TAIL) = read_input_file(ARGS.input_file)
+ (HEAD, TAIL) = read_input_file(ARGS.input_file, unicode_release_date)
with open(ARGS.output_file, mode='w') as I18N_FILE:
- output_head(I18N_FILE, ARGS.unicode_version, head=HEAD)
- output_tables(I18N_FILE, ARGS.unicode_version, ARGS.turkish)
+ output_head(I18N_FILE, unicode_version, unicode_release_date, head=HEAD)
+ output_tables(I18N_FILE, unicode_version, ARGS.turkish)
output_tail(I18N_FILE, tail=TAIL)
diff --git a/localedata/unicode-gen/unicode_utils.py b/localedata/unicode-gen/unicode_utils.py
index 3263f4510b..2b7c6aaa45 100644
--- a/localedata/unicode-gen/unicode_utils.py
+++ b/localedata/unicode-gen/unicode_utils.py
@@ -525,3 +525,41 @@ def verifications():
and (is_graph(code_point) or code_point == 0x0020)):
sys.stderr.write('%(sym)s is graph|<space> but not print\n' %{
'sym': unicode_utils.ucs_symbol(code_point)})
+
+def release_metadata(data_dir, parameter):
+ ''' Parse the UnicodeRelease.txt metadata and return the value for
+ the specified parameter.'''
+ value = ""
+ with open(data_dir + '/' + "UnicodeRelease.txt", "r") as f:
+ for line in f:
if not line.strip() or line.strip()[0] == '%':
continue
+ fields = line.strip().split(",")
+ if fields[0] == parameter:
+ value = fields[1].strip()
+ assert value != ""
+ return value
+
+def release_version(data_dir):
+ ''' Return the Unicode version of the data in use.'''
+ return release_metadata(data_dir, "Version")
+
+def release_date(data_dir):
+ ''' Return the release date for the Unicode version of the data.'''
+ return release_metadata(data_dir, "ReleaseDate")
+
+def release_data_file(data_dir):
+ ''' The name of the primary data file.'''
+ return data_dir + '/' + release_metadata(data_dir, 'Data')
+
+def release_dcp_file(data_dir):
+ ''' The name of the derived core properties data file.'''
+ return data_dir + '/' + release_metadata(data_dir, 'DcpData')
+
+def release_eaw_file(data_dir):
+ ''' The name of the East Asian width data file.'''
+ return data_dir + '/' + release_metadata(data_dir, 'EawData')
+
+def release_pl_file(data_dir):
+ ''' The name of the properties list data file.'''
+ return data_dir + '/' + release_metadata(data_dir, 'PlData')
diff --git a/localedata/unicode-gen/utf8_compatibility.py b/localedata/unicode-gen/utf8_compatibility.py
index eca2e8cddc..7e485ba759 100755
--- a/localedata/unicode-gen/utf8_compatibility.py
+++ b/localedata/unicode-gen/utf8_compatibility.py
@@ -216,6 +216,13 @@ if __name__ == "__main__":
description='''
Compare the contents of LC_CTYPE in two files and check for errors.
''')
+ PARSER.add_argument(
+ '-u', '--unicode_data_dir',
+ nargs='?',
+ type=str,
+ default='.',
+ help=('The directory containing Unicode data to read, '
+ + 'default: %(default)s'))
PARSER.add_argument(
'-o', '--old_utf8_file',
nargs='?',
@@ -228,16 +235,6 @@ if __name__ == "__main__":
required=True,
type=str,
help='The new UTF-8 file.')
- PARSER.add_argument(
- '-u', '--unicode_data_file',
- nargs='?',
- type=str,
- help='The UnicodeData.txt file to read.')
- PARSER.add_argument(
- '-e', '--east_asian_width_file',
- nargs='?',
- type=str,
- help='The EastAsianWidth.txt file to read.')
PARSER.add_argument(
'-a', '--show_added_characters',
action='store_true',
@@ -252,9 +249,11 @@ if __name__ == "__main__":
help='Show characters whose width was changed in detail.')
ARGS = PARSER.parse_args()
- if ARGS.unicode_data_file:
- unicode_utils.fill_attributes(ARGS.unicode_data_file)
- if ARGS.east_asian_width_file:
- unicode_utils.fill_east_asian_widths(ARGS.east_asian_width_file)
+ unicode_data_file = unicode_utils.release_data_file(ARGS.unicode_data_dir)
+ east_asian_width_file = unicode_utils.release_eaw_file(ARGS.unicode_data_dir)
+
+ unicode_utils.fill_attributes(unicode_data_file)
+ unicode_utils.fill_east_asian_widths(east_asian_width_file)
+
check_charmap(ARGS.old_utf8_file, ARGS.new_utf8_file)
check_width(ARGS.old_utf8_file, ARGS.new_utf8_file)
diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
index 899840923a..4fc3038fe0 100755
--- a/localedata/unicode-gen/utf8_gen.py
+++ b/localedata/unicode-gen/utf8_gen.py
@@ -22,7 +22,7 @@
This script generates a glibc/localedata/charmaps/UTF-8 file
from Unicode data.
-Usage: python3 utf8_gen.py UnicodeData.txt EastAsianWidth.txt
+Usage: python3 utf8_gen.py
It will output UTF-8 file
'''
@@ -198,23 +198,27 @@ def write_header_charmap(outfile):
outfile.write("% alias ISO-10646/UTF-8\n")
outfile.write("CHARMAP\n")
-def write_header_width(outfile, unicode_version):
+def write_header_width(outfile, unicode_data_dir):
'''Writes the header on top of the WIDTH section to the output file'''
+ unicode_version = unicode_utils.release_version(unicode_data_dir)
+ unicode_data = unicode_utils.release_metadata(unicode_data_dir, 'Data')
+ eaw_data = unicode_utils.release_metadata(unicode_data_dir, 'EawData')
+ pl_data = unicode_utils.release_metadata(unicode_data_dir, 'PlData')
outfile.write('% Character width according to Unicode '
+ '{:s}.\n'.format(unicode_version))
outfile.write('% - Default width is 1.\n')
outfile.write('% - Double-width characters have width 2; generated from\n')
- outfile.write('% "grep \'^[^;]*;[WF]\' EastAsianWidth.txt"\n')
+ outfile.write('% "grep \'^[^;]*;[WF]\' ' + eaw_data + '"\n')
outfile.write('% - Non-spacing characters have width 0; '
- + 'generated from PropList.txt or\n')
+ + 'generated from ' + pl_data + ' or\n')
outfile.write('% "grep \'^[^;]*;[^;]*;[^;]*;[^;]*;NSM;\' '
- + 'UnicodeData.txt"\n')
+ + unicode_data + '"\n')
outfile.write('% - Format control characters have width 0; '
+ 'generated from\n')
- outfile.write("% \"grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt\"\n")
+ outfile.write("% \"grep '^[^;]*;[^;]*;Cf;' " + unicode_data + "\"\n")
# Not needed covered by Cf
# outfile.write("% - Zero width characters have width 0; generated from\n")
-# outfile.write("% \"grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt\"\n")
+# outfile.write("% \"grep '^[^;]*;ZERO WIDTH ' " + unicode_data + "\"\n")
outfile.write("WIDTH\n")
def process_width(outfile, ulines, elines, plines):
@@ -302,41 +306,26 @@ def process_width(outfile, ulines, elines, plines):
if __name__ == "__main__":
PARSER = argparse.ArgumentParser(
description='''
- Generate a UTF-8 file from UnicodeData.txt, EastAsianWidth.txt, and PropList.txt.
+ Generate a UTF-8 file from the Unicode release data files.
''')
PARSER.add_argument(
- '-u', '--unicode_data_file',
+ '-u', '--unicode_data_dir',
nargs='?',
type=str,
- default='UnicodeData.txt',
- help=('The UnicodeData.txt file to read, '
+ default='.',
+ help=('The directory containing Unicode data to read, '
+ 'default: %(default)s'))
- PARSER.add_argument(
- '-e', '--east_asian_with_file',
- nargs='?',
- type=str,
- default='EastAsianWidth.txt',
- help=('The EastAsianWidth.txt file to read, '
- + 'default: %(default)s'))
- PARSER.add_argument(
- '-p', '--prop_list_file',
- nargs='?',
- type=str,
- default='PropList.txt',
- help=('The PropList.txt file to read, '
- + 'default: %(default)s'))
- PARSER.add_argument(
- '--unicode_version',
- nargs='?',
- required=True,
- type=str,
- help='The Unicode version of the input files used.')
ARGS = PARSER.parse_args()
- unicode_utils.fill_attributes(ARGS.unicode_data_file)
- with open(ARGS.unicode_data_file, mode='r') as UNIDATA_FILE:
+ unicode_version = unicode_utils.release_version(ARGS.unicode_data_dir)
+ unicode_data_file = unicode_utils.release_data_file(ARGS.unicode_data_dir)
+ east_asian_width_file = unicode_utils.release_eaw_file(ARGS.unicode_data_dir)
+ prop_list_file = unicode_utils.release_pl_file(ARGS.unicode_data_dir)
+
+ unicode_utils.fill_attributes(unicode_data_file)
+ with open(unicode_data_file, mode='r') as UNIDATA_FILE:
UNICODE_DATA_LINES = UNIDATA_FILE.readlines()
- with open(ARGS.east_asian_with_file, mode='r') as EAST_ASIAN_WIDTH_FILE:
+ with open(east_asian_width_file, mode='r') as EAST_ASIAN_WIDTH_FILE:
EAST_ASIAN_WIDTH_LINES = []
for LINE in EAST_ASIAN_WIDTH_FILE:
# If characters from EastAsianWidth.txt which are from
@@ -352,7 +341,7 @@ if __name__ == "__main__":
continue
if re.match(r'^[^;]*;[WF]', LINE):
EAST_ASIAN_WIDTH_LINES.append(LINE.strip())
- with open(ARGS.prop_list_file, mode='r') as PROP_LIST_FILE:
+ with open(prop_list_file, mode='r') as PROP_LIST_FILE:
PROP_LIST_LINES = []
for LINE in PROP_LIST_FILE:
if re.match(r'^[^;]*;[\s]*Prepended_Concatenation_Mark', LINE):
@@ -363,7 +352,7 @@ if __name__ == "__main__":
process_charmap(UNICODE_DATA_LINES, OUTFILE)
OUTFILE.write("END CHARMAP\n\n")
# Processing EastAsianWidth.txt and write WIDTH to UTF-8 file
- write_header_width(OUTFILE, ARGS.unicode_version)
+ write_header_width(OUTFILE, ARGS.unicode_data_dir)
process_width(OUTFILE,
UNICODE_DATA_LINES,
EAST_ASIAN_WIDTH_LINES,
--
2.26.3
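[Editorial note: the release_eaw_file and release_pl_file helpers shown at the top of this patch delegate to a release_metadata lookup whose body falls outside the quoted hunks. A minimal sketch of how such a lookup could work, assuming the comma-separated key,value format of UnicodeRelease.txt quoted in the replies below; this is an illustrative reconstruction, not the committed implementation.]

```python
# Hypothetical sketch of the release_metadata lookup used by the
# release_*_file helpers; the actual committed body may differ.
import os

def release_metadata(data_dir, key):
    '''Return the value for KEY from UnicodeRelease.txt in DATA_DIR.'''
    path = os.path.join(data_dir, 'UnicodeRelease.txt')
    with open(path, mode='r') as metadata_file:
        for line in metadata_file:
            line = line.strip()
            # A leading '%' marks a comment line in the metadata file.
            if not line or line.startswith('%'):
                continue
            name, _, value = line.partition(',')
            if name == key:
                return value
    raise KeyError('{} not found in {}'.format(key, path))

def release_eaw_file(data_dir):
    '''The name of the East Asian width data file.'''
    return data_dir + '/' + release_metadata(data_dir, 'EawData')
```

With a UnicodeRelease.txt containing `Version,13.0.0` and `EawData,EastAsianWidth.txt`, `release_metadata(data_dir, 'Version')` returns `'13.0.0'` and `release_eaw_file(data_dir)` the path to EastAsianWidth.txt.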
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [PATCH] Make Unicode generation reproducible.
2021-04-29 17:27 [PATCH] Make Unicode generation reproducible Carlos O'Donell
@ 2021-04-30 9:26 ` Florian Weimer
2021-05-04 21:46 ` Florian Weimer
0 siblings, 1 reply; 3+ messages in thread
From: Florian Weimer @ 2021-04-30 9:26 UTC (permalink / raw)
To: Carlos O'Donell; +Cc: libc-alpha, joseph
* Carlos O'Donell:
> diff --git a/localedata/unicode-gen/UnicodeRelease.txt b/localedata/unicode-gen/UnicodeRelease.txt
> new file mode 100644
> index 0000000000..bd9cc14ae0
> --- /dev/null
> +++ b/localedata/unicode-gen/UnicodeRelease.txt
> @@ -0,0 +1,8 @@
> +% This metadata is used by glibc and updated by the developer(s)
> +% carrying out the Unicode update.
> +Version,13.0.0
> +ReleaseDate,2021-03-10
> +Data,UnicodeData.txt
> +DcpData,DerivedCoreProperties.txt
> +EawData,EastAsianWidth.txt
> +PlData,PropList.txt
I suggest to use <https://www.unicode.org/Public/13.0.0/ucd/ReadMe.txt>
instead. It would give 2020-03-06 as the date. 2021-03-10 is
definitely wrong, it should be 2020-03-10.
Perhaps it's time to move the *.txt files into their own directory and
also include <https://www.unicode.org/copyright.html> and
<https://www.unicode.org/license.html>.
Thanks,
Florian
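[Editorial note: Florian's suggestion to take the date from the UCD ReadMe.txt could be sketched as below. The "# Date: YYYY-MM-DD" header format is an assumption based on the 13.0.0 ReadMe, and release_date_from_readme is a hypothetical helper name, not part of the patch.]

```python
# Hedged sketch: extract the release date from a UCD ReadMe.txt whose
# header carries a "# Date: YYYY-MM-DD" line (format assumed from the
# 13.0.0 ReadMe; other releases may differ).
import re

def release_date_from_readme(readme_path):
    '''Return the YYYY-MM-DD release date found in README_PATH, or None.'''
    with open(readme_path, mode='r') as readme:
        for line in readme:
            match = re.search(r'Date:\s*(\d{4}-\d{2}-\d{2})', line)
            if match:
                return match.group(1)
    return None
```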
* Re: [PATCH] Make Unicode generation reproducible.
2021-04-30 9:26 ` Florian Weimer
@ 2021-05-04 21:46 ` Florian Weimer
0 siblings, 0 replies; 3+ messages in thread
From: Florian Weimer @ 2021-05-04 21:46 UTC (permalink / raw)
To: Carlos O'Donell; +Cc: libc-alpha, joseph
* Florian Weimer:
> * Carlos O'Donell:
>
>> diff --git a/localedata/unicode-gen/UnicodeRelease.txt b/localedata/unicode-gen/UnicodeRelease.txt
>> new file mode 100644
>> index 0000000000..bd9cc14ae0
>> --- /dev/null
>> +++ b/localedata/unicode-gen/UnicodeRelease.txt
>> @@ -0,0 +1,8 @@
>> +% This metadata is used by glibc and updated by the developer(s)
>> +% carrying out the Unicode update.
>> +Version,13.0.0
>> +ReleaseDate,2021-03-10
>> +Data,UnicodeData.txt
>> +DcpData,DerivedCoreProperties.txt
>> +EawData,EastAsianWidth.txt
>> +PlData,PropList.txt
>
> I suggest to use <https://www.unicode.org/Public/13.0.0/ucd/ReadMe.txt>
> instead. It would give 2020-03-06 as the date. 2021-03-10 is
> definitely wrong, it should be 2020-03-10.
>
> Perhaps it's time to move the *.txt files into their own directory and
> also include <https://www.unicode.org/copyright.html> and
> <https://www.unicode.org/license.html>.
Hmm, maybe this is asking for too much, given that this started out as
something else entirely.
Maybe put variables into the generator script along with a comment for
now? I think the custom descriptor file is probably a bit overdesigned.
Thanks,
Florian
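[Editorial note: the alternative Florian proposes, plain variables in the generator script with a comment, might look like the sketch below. The constant names are hypothetical; the values are taken from this thread (13.0.0, with the date Florian read from the UCD ReadMe.txt).]

```python
# Sketch of module-level release metadata in unicode_utils.py itself,
# replacing the UnicodeRelease.txt descriptor file.

# Unicode release metadata; update when importing a new release.
UNICODE_VERSION = '13.0.0'
UNICODE_RELEASE_DATE = '2020-03-06'
UNICODE_DATA_FILE = 'UnicodeData.txt'
EAST_ASIAN_WIDTH_FILE = 'EastAsianWidth.txt'
PROP_LIST_FILE = 'PropList.txt'

def release_version(data_dir):
    '''The Unicode version; kept as a function for API compatibility.'''
    return UNICODE_VERSION

def release_data_file(data_dir):
    '''The name of the UnicodeData file.'''
    return data_dir + '/' + UNICODE_DATA_FILE
```

Keeping the accessor functions means callers such as utf8_gen.py would not need to change if the metadata moved back out to a file later.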