Is it OK to write ASCII strings directly into locale source files?

public inbox for libc-alpha@sourceware.org

* Is it OK to write ASCII strings directly into locale source files?
@ 2017-07-24 13:13 Mike FABIAN
  2017-07-24 13:28 ` Carlos O'Donell
  0 siblings, 1 reply; 20+ messages in thread
From: Mike FABIAN @ 2017-07-24 13:13 UTC (permalink / raw)
  To: libc-alpha

Currently the locale source files use a lot of code points even for
strings which are pure ASCII. For example localedata/locales/de_DE
contains:

%	"%a %d %b %Y %T %Z"
d_t_fmt "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020><U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>"

Would it be OK to write this as

d_t_fmt "%a %d %b %Y %T %Z"

??

This would make the files much more readable.

Stuff that is mostly ASCII can probably be written like this:

% https://oc.wikipedia.org/wiki/Fran%C3%A7a França
country_name "Fran<U00E7>a"

which is already more readable than writing it all in <U00??> code points.

It would be even nicer to write it completely in UTF-8, i.e.:

country_name "França"

but I am not sure whether this is allowed in the locale source files.

But at least for everything which is ASCII, it might be OK already to
write the characters directly.

Is writing ASCII there allowed or not??

-- 
Mike FABIAN <mfabian@redhat.com>
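
To make the readability gap concrete, here is a minimal Python sketch that expands the <Uxxxx> notation back into plain characters. The helper name decode_codepoints is hypothetical, and the pattern assumes symbols of 4 to 8 hex digits; it is not localedef's parser.

```python
import re

# Matches <Uxxxx> symbols; the 4-to-8 hex digit range is an assumption.
_UCS_SYMBOL = re.compile(r"<U([0-9A-Fa-f]{4,8})>")


def decode_codepoints(text: str) -> str:
    """Replace every <Uxxxx> symbol with the character it names."""
    return _UCS_SYMBOL.sub(lambda m: chr(int(m.group(1), 16)), text)


# The d_t_fmt value from localedata/locales/de_DE quoted above.
encoded = (
    "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020>"
    "<U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>"
)

print(decode_codepoints(encoded))
print(decode_codepoints("Fran<U00E7>a"))
```

Literal characters pass through untouched, so the same helper handles the mixed "Fran<U00E7>a" form.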

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Is it OK to write ASCII strings directly into locale source files?
  2017-07-24 13:13 Is it OK to write ASCII strings directly into locale source files? Mike FABIAN
@ 2017-07-24 13:28 ` Carlos O'Donell
  2017-07-24 13:32   ` Mike FABIAN
  2017-07-24 14:49   ` Andreas Schwab
  0 siblings, 2 replies; 20+ messages in thread
From: Carlos O'Donell @ 2017-07-24 13:28 UTC (permalink / raw)
  To: Mike FABIAN, libc-alpha

On 07/24/2017 09:09 AM, Mike FABIAN wrote:
> 
> Currently the locale source files use a lot of code points even for
> strings which are pure ASCII. For example localedata/locales/de_DE
> contains:
> 
> %	"%a %d %b %Y %T %Z"
> d_t_fmt "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020><U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>"
> 
> Would it be OK to write this as
> 
> d_t_fmt "%a %d %b %Y %T %Z"
> 
> ??
> 
> This would make the files much more readable.
> 
> Stuff that is mostly ASCII can probably be written like this:
> 
> % https://oc.wikipedia.org/wiki/Fran%C3%A7a França
> country_name "Fran<U00E7>a"
>
> which is already more readable than writing it all in <U00??> code points.
>
> It would be even nicer to write it completely in UTF-8, i.e.:
>
> country_name "França"
> 
> but I am not sure whether this is allowed in the locale source files.
> 
> But at least for everything which is ASCII, it might be OK already to
> write the characters directly.
> 
> Is writing ASCII there allowed or not??

It's not ASCII though is it? Since '<' and '>' have to be reserved
to support parsing of UTF-8 code points, so it's "almost ASCII."

I'm ok using 'almost' ASCII characters as their 1-byte UTF-8 form
instead of the verbose code-points, but we need to document exactly
which characters are allowed. I believe the answer is everything
except '<>'.

I'm not entirely ready to allow all UTF-8, since that descends into
the much more complex discussion around NFC, NFKC, NFD, NFKD etc. and
which form should be used. Then there are discussions around uniqueness
of decomposition and exactly what did the source author want.
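
As a small illustration of the normalization concern, the Python sketch below shows that the same visible word can be stored as two different code point sequences. The helper as_codepoints is a hypothetical name used only here.

```python
import unicodedata


def as_codepoints(text: str) -> str:
    """Spell out each character of text as a <Uxxxx> symbol."""
    return "".join(f"<U{ord(ch):04X}>" for ch in text)


# Same visible word, two normalization forms.
composed = unicodedata.normalize("NFC", "Fran\u00e7a")
decomposed = unicodedata.normalize("NFD", "Fran\u00e7a")

print(as_codepoints(composed))    # one precomposed c-with-cedilla
print(as_codepoints(decomposed))  # plain c followed by a combining cedilla
```

With raw UTF-8 in the source file the two forms look identical in an editor; with explicit code points the author's choice is visible.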

So let us start slowly and agree with 'ASCII - [<>]' where < denotes
the start of a code point and > the end of the code point.

-- 
Cheers,
Carlos.
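
The 'ASCII - [<>]' rule could be sketched in Python roughly as follows. The function name, the choice to also keep the double quote and the escape character as code points, and the default escape character "/" are all assumptions made for illustration, not anything localedef requires.

```python
def encode_almost_ascii(text: str, escape_char: str = "/") -> str:
    """Write safe printable ASCII literally and everything else as <Uxxxx>.

    Assumption: '<', '>', the double quote, and the escape character are
    kept as code points so the result never needs escaping.
    """
    reserved = {"<", ">", '"', escape_char}
    out = []
    for ch in text:
        if " " <= ch <= "~" and ch not in reserved:
            out.append(ch)  # printable ASCII that is safe to write directly
        else:
            width = 4 if ord(ch) <= 0xFFFF else 8
            out.append(f"<U{ord(ch):0{width}X}>")
    return "".join(out)


print(encode_almost_ascii("%a %d %b %Y %T %Z"))
print(encode_almost_ascii("França"))
print(encode_almost_ascii("a<b"))
```

Control characters fall outside the printable range and therefore stay in <Uxxxx> form.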


* Re: Is it OK to write ASCII strings directly into locale source files?
  2017-07-24 13:28 ` Carlos O'Donell
@ 2017-07-24 13:32   ` Mike FABIAN
  2017-07-24 14:47     ` Carlos O'Donell
  2017-07-24 14:49   ` Andreas Schwab
  1 sibling, 1 reply; 20+ messages in thread
From: Mike FABIAN @ 2017-07-24 13:32 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: libc-alpha

Carlos O'Donell <carlos@redhat.com> wrote:

> On 07/24/2017 09:09 AM, Mike FABIAN wrote:
>> 
>> Currently the locale source files use a lot of code points even for
>> strings which are pure ASCII. For example localedata/locales/de_DE
>> contains:
>> 
>> %	"%a %d %b %Y %T %Z"
>> d_t_fmt
>> "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020><U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>"
>> 
>> Would it be OK to write this as
>> 
>> d_t_fmt "%a %d %b %Y %T %Z"
>> 
>> ??
>> 
>> This would make the files much more readable.
>> 
>> Stuff that is mostly ASCII can probably be written like this:
>> 
>> % https://oc.wikipedia.org/wiki/Fran%C3%A7a França
>> country_name "Fran<U00E7>a"
>>
>> which is already more readable than writing it all in <U00??> code points.
>>
>> It would be even nicer to write it completely in UTF-8, i.e.:
>>
>> country_name "França"
>> 
>> but I am not sure whether this is allowed in the locale source files.
>> 
>> But at least for everything which is ASCII, it might be OK already to
>> write the characters directly.
>> 
>> Is writing ASCII there allowed or not??
>  
> It's not ASCII though is it? Since '<' and '>' have to be reserved
> to support parsing of UTF-8 code points, so it's "almost ASCII."
>
> I'm ok using 'almost' ASCII characters as their 1-byte UTF-8 form
> instead of the verbose code-points, but we need to document exactly
> which characters are allowed. I believe the answer is everything
> except '<>'.
>
> I'm not entirely ready to allow all UTF-8, since that descends into
> the much more complex discussion around NFC, NFKC, NFD, NFKD etc. and
> which form should be used. Then there are discussions around uniqueness
> of decomposition and exactly what did the source author want.
>
> So let us start slowly and agree with 'ASCII - [<>]' where < denotes
> the start of a code point and > the end of the code point.

Yes, that sounds like a very reasonable first step!

Is it OK to use that already *now*?

Or is any change necessary to make that work?

I tried

country_name "Fran<U00E7>a"

and it seems to work:

bash-4.4# LC_ALL=oc_FR.UTF-8 locale -k country_name
country_name="França"

So maybe it is possible to use that right now without having to change
anything in the code parsing the locale source files.

-- 
Mike FABIAN <mfabian@redhat.com>


* Re: Is it OK to write ASCII strings directly into locale source files?
  2017-07-24 13:32   ` Mike FABIAN
@ 2017-07-24 14:47     ` Carlos O'Donell
  2017-07-24 15:03       ` Mike FABIAN
  2017-07-24 22:39       ` Rafal Luzynski
  0 siblings, 2 replies; 20+ messages in thread
From: Carlos O'Donell @ 2017-07-24 14:47 UTC (permalink / raw)
  To: Mike FABIAN; +Cc: libc-alpha

On 07/24/2017 09:28 AM, Mike FABIAN wrote:
> Carlos O'Donell <carlos@redhat.com> wrote:
> 
>> On 07/24/2017 09:09 AM, Mike FABIAN wrote:
>>>
>>> Currently the locale source files use a lot of code points even for
>>> strings which are pure ASCII. For example localedata/locales/de_DE
>>> contains:
>>>
>>> %	"%a %d %b %Y %T %Z"
>>> d_t_fmt
>>> "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020><U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>"
>>>
>>> Would it be OK to write this as
>>>
>>> d_t_fmt "%a %d %b %Y %T %Z"
>>>
>>> ??
>>>
>>> This would make the files much more readable.
>>>
>>> Stuff that is mostly ASCII can probably be written like this:
>>>
>>> % https://oc.wikipedia.org/wiki/Fran%C3%A7a França
>>> country_name "Fran<U00E7>a"
>>>
>>> which is already more readable than writing it all in <U00??> code points.
>>>
>>> It would be even nicer to write it completely in UTF-8, i.e.:
>>>
>>> country_name "França"
>>>
>>> but I am not sure whether this is allowed in the locale source files.
>>>
>>> But at least for everything which is ASCII, it might be OK already to
>>> write the characters directly.
>>>
>>> Is writing ASCII there allowed or not??
>>  
>> It's not ASCII though is it? Since '<' and '>' have to be reserved
>> to support parsing of UTF-8 code points, so it's "almost ASCII."
>>
>> I'm ok using 'almost' ASCII characters as their 1-byte UTF-8 form
>> instead of the verbose code-points, but we need to document exactly
>> which characters are allowed. I believe the answer is everything
>> except '<>'.
>>
>> I'm not entirely ready to allow all UTF-8, since that descends into
>> the much more complex discussion around NFC, NFKC, NFD, NFKD etc. and
>> which form should be used. Then there are discussions around uniqueness
>> of decomposition and exactly what did the source author want.
>>
>> So let us start slowly and agree with 'ASCII - [<>]' where < denotes
>> the start of a code point and > the end of the code point.
> 
> Yes, that sounds like a very reasonable first step!
> 
> Is it OK to use that already *now*?

You and Rafal are localedata maintainers, you can assume consensus, therefore
you can start changing things in whatever way you wish.

Before you change this though I would like to see your list of reasons
for making the change, what benefits do you see it bringing? Is readability
the only one?

> Or is any change necessary to make that work?

I do not know.

> I tried
> 
> country_name "Fran<U00E7>a"
> 
> and it seems to work:
> 
> bash-4.4# LC_ALL=oc_FR.UTF-8 locale -k country_name
> country_name="França"
> 
> So maybe it is possible to use that right now without having to change
> anything in the code parsing the locale source files.
 
You need to document somewhere what is acceptable and what is not and
which ASCII characters cannot be used.

-- 
Cheers,
Carlos.


* Re: Is it OK to write ASCII strings directly into locale source files?
  2017-07-24 13:28 ` Carlos O'Donell
  2017-07-24 13:32   ` Mike FABIAN
@ 2017-07-24 14:49   ` Andreas Schwab
  2017-07-24 15:07     ` Carlos O'Donell
  2017-07-24 17:07     ` Florian Weimer
  1 sibling, 2 replies; 20+ messages in thread
From: Andreas Schwab @ 2017-07-24 14:49 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: Mike FABIAN, libc-alpha

On Jul 24 2017, Carlos O'Donell <carlos@redhat.com> wrote:

> So let us start slowly and agree with 'ASCII - [<>]' where < denotes
> the start of a code point and > the end of the code point.

POSIX says "character in the portable character set" if you want to keep
it portable.

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."


* Re: Is it OK to write ASCII strings directly into locale source files?
  2017-07-24 14:47     ` Carlos O'Donell
@ 2017-07-24 15:03       ` Mike FABIAN
  2017-07-24 15:45         ` Carlos O'Donell
  2017-07-24 22:39       ` Rafal Luzynski
  1 sibling, 1 reply; 20+ messages in thread
From: Mike FABIAN @ 2017-07-24 15:03 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: libc-alpha

Carlos O'Donell <carlos@redhat.com> wrote:

> On 07/24/2017 09:28 AM, Mike FABIAN wrote:
>> Carlos O'Donell <carlos@redhat.com> wrote:
>> 
>>> On 07/24/2017 09:09 AM, Mike FABIAN wrote:

[...]

>> Yes, that sounds like a very reasonable first step!
>> 
>> Is it OK to use that already *now*?
>
> You and Rafal are localedata maintainers, you can assume consensus, therefore
> you can start changing things in whatever way you wish.
>
> Before you change this though I would like to see your list of reasons
> for making the change, what benefits do you see it bringing? Is readability
> the only one?

Readability is the only reason for me at the moment, editing these files
using the code points is very tedious and error prone. 

>> So maybe it is possible to use that right now without having to change
>> anything in the code parsing the locale source files.
>  
> You need to document somewhere what is acceptable and what is not and
> which ASCII characters cannot be used.

Where should that be documented?

-- 
Mike FABIAN <mfabian@redhat.com>


* Re: Is it OK to write ASCII strings directly into locale source files?
  2017-07-24 14:49   ` Andreas Schwab
@ 2017-07-24 15:07     ` Carlos O'Donell
  2017-07-24 17:07     ` Florian Weimer
  1 sibling, 0 replies; 20+ messages in thread
From: Carlos O'Donell @ 2017-07-24 15:07 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Mike FABIAN, libc-alpha

On 07/24/2017 10:47 AM, Andreas Schwab wrote:
> On Jul 24 2017, Carlos O'Donell <carlos@redhat.com> wrote:
> 
>> So let us start slowly and agree with 'ASCII - [<>]' where < denotes
>> the start of a code point and > the end of the code point.
> 
> POSIX says "character in the portable character set" if you want to keep
> it portable.

Yes, that's right, and the "7.3 Locale Definition" already says that "<"
is in the reserved namespace and must be escaped to be used literally.

So we should already support using characters from the POSIX portable
character set with proper escaping.

-- 
Cheers,
Carlos.


* Re: Is it OK to write ASCII strings directly into locale source files?
  2017-07-24 15:03       ` Mike FABIAN
@ 2017-07-24 15:45         ` Carlos O'Donell
  0 siblings, 0 replies; 20+ messages in thread
From: Carlos O'Donell @ 2017-07-24 15:45 UTC (permalink / raw)
  To: Mike FABIAN; +Cc: libc-alpha

On 07/24/2017 10:49 AM, Mike FABIAN wrote:
> Carlos O'Donell <carlos@redhat.com> wrote:
> 
>> On 07/24/2017 09:28 AM, Mike FABIAN wrote:
>>> Carlos O'Donell <carlos@redhat.com> wrote:
>>>
>>>> On 07/24/2017 09:09 AM, Mike FABIAN wrote:
> 
> [...]
> 
>>> Yes, that sounds like a very reasonable first step!
>>>
>>> Is it OK to use that already *now*?
>>
>> You and Rafal are localedata maintainers, you can assume consensus, therefore
>> you can start changing things in whatever way you wish.
>>
>> Before you change this though I would like to see your list of reasons
>> for making the change, what benefits do you see it bringing? Is readability
>> the only one?
> 
> Readability is the only reason for me at the moment, editing these files
> using the code points is very tedious and error prone. 

Sounds like a good reason to me.

>>> So maybe it is possible to use that right now without having to change
>>> anything in the code parsing the locale source files.
>>  
>> You need to document somewhere what is acceptable and what is not and
>> which ASCII characters cannot be used.
> 
> Where should that be documented?

Andreas Schwab just pointed out that we should already technically support
using any character in the POSIX portable character set in the locale definitions.

So I think there is nothing to comment, and we are still within the definition of
POSIX for making these locales portable.

So feel free to change any UTF-8 code points into their equivalent POSIX portable
character set characters.

-- 
Cheers,
Carlos.


* Re: Is it OK to write ASCII strings directly into locale source files?
  2017-07-24 14:49   ` Andreas Schwab
  2017-07-24 15:07     ` Carlos O'Donell
@ 2017-07-24 17:07     ` Florian Weimer
  2017-07-24 20:07       ` Carlos O'Donell
  1 sibling, 1 reply; 20+ messages in thread
From: Florian Weimer @ 2017-07-24 17:07 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Carlos O'Donell, Mike FABIAN, libc-alpha

* Andreas Schwab:

> On Jul 24 2017, Carlos O'Donell <carlos@redhat.com> wrote:
>
>> So let us start slowly and agree with 'ASCII - [<>]' where < denotes
>> the start of a code point and > the end of the code point.
>
> POSIX says "character in the portable character set" if you want to keep
> it portable.

But our locales only have to be compatible with our localedef, right?

I know that the FSF does not claim copyright on our locales, so anyone
is free to take them and use them with their own non-GNU systems (or
sell them as PDFs/books).  But this does not mean we have to make
their lives easier if it comes at a cost to us (e.g., verifying that
we only use the portable character set, or refraining from using full
UTF-8 at a future date).


* Re: Is it OK to write ASCII strings directly into locale source files?
  2017-07-24 17:07     ` Florian Weimer
@ 2017-07-24 20:07       ` Carlos O'Donell
  2017-07-24 22:34         ` Florian Weimer
  0 siblings, 1 reply; 20+ messages in thread
From: Carlos O'Donell @ 2017-07-24 20:07 UTC (permalink / raw)
  To: Florian Weimer, Andreas Schwab; +Cc: Mike FABIAN, libc-alpha

On 07/24/2017 01:05 PM, Florian Weimer wrote:
> * Andreas Schwab:
> 
>> On Jul 24 2017, Carlos O'Donell <carlos@redhat.com> wrote:
>>
>>> So let us start slowly and agree with 'ASCII - [<>]' where < denotes
>>> the start of a code point and > the end of the code point.
>>
>> POSIX says "character in the portable character set" if you want to keep
>> it portable.
> 
> But our locales only have to be compatible with our localedef, right?

Should developers be able to write tools to the POSIX locale spec and parse
our source locale definitions? Supporting more than just GNU/Linux? Do the
BSDs share our locale definitions?

> I know that the FSF does not claim copyright on our locales, so anyone
> is free to take them and use them with their own non-GNU systems (or
> sell them as PDFs/books).  But this does not mean we have to make
> their lives easier if it comes at a cost to us (e.g., verifying that
> we only use the portable character set, or refraining from using full
> UTF-8 at a future date).

I agree with your sentiment, and leave it up to Mike to decide what makes
it ultimately easier for him as a subsystem maintainer to work with. There
is certainly a cost/reward balance.

My only technical objection with writing straight UTF-8 is that it could
lead to more mistakes, and Mike just found one in CLDR where an Arabic
Farsi character was used incorrectly because it displayed the same glyph.
It was caught when harmonizing with glibc where you have to write out the
code points (Mike filed a bug upstream with CLDR).

My preference would be to start small, start using the POSIX portable
character set to its maximum extent for all Latin-based languages, see
how that works out, and then decide if we even need to pursue full UTF-8
and in which form.

-- 
Cheers,
Carlos.


* Re: Is it OK to write ASCII strings directly into locale source files?
  2017-07-24 20:07       ` Carlos O'Donell
@ 2017-07-24 22:34         ` Florian Weimer
  2017-07-24 22:51           ` Rafal Luzynski
  2017-07-25  5:40           ` Carlos O'Donell
  0 siblings, 2 replies; 20+ messages in thread
From: Florian Weimer @ 2017-07-24 22:34 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: Andreas Schwab, Mike FABIAN, libc-alpha

* Carlos O'Donell:

> On 07/24/2017 01:05 PM, Florian Weimer wrote:
>> * Andreas Schwab:
>> 
>>> On Jul 24 2017, Carlos O'Donell <carlos@redhat.com> wrote:
>>>
>>>> So let us start slowly and agree with 'ASCII - [<>]' where < denotes
>>>> the start of a code point and > the end of the code point.
>>>
>>> POSIX says "character in the portable character set" if you want to keep
>>> it portable.
>> 
>> But our locales only have to be compatible with our localedef, right?
>
> Should developers be able to write tools to the POSIX locale spec and parse
> our source locale definitions? Supporting more than just GNU/Linux? Do the
> BSDs share our locale definitions?

No, they don't.  For one thing, they have partially implemented %OB
(without fixing all the locales, creating inconsistencies).

> My only technical objection with writing straight UTF-8 is that it could
> lead to more mistakes, and Mike just found one in CLDR where an Arabic
> Farsi character was used incorrectly because it displayed the same glyph.
> It was caught when harmonizing with glibc where you have to write out the
> code points (Mike filed a bug upstream with CLDR).

Wasn't it caught by locale testing which revealed that the locale
wasn't compatible with ISO-8859-6?  That sanity check would still
apply to locale definitions written in UTF-8.

If we are worried about this kind of problem, I think web browsers
have multi-script detection logic to deal with cross-script homographs
in IDNA labels.  I don't know how hard it would be to extract that
logic from there and run it on locale strings, for cross-verification.

> My preference would be to start small, start using the POSIX portable
> character set to its maximum extent for all Latin-based languages,

I would still prefer the <U…> encoding for control characters which
are in the portable character set.  So I have to object to the
“maximum” part. :)


* Re: Is it OK to write ASCII strings directly into locale source files?
  2017-07-24 14:47     ` Carlos O'Donell
  2017-07-24 15:03       ` Mike FABIAN
@ 2017-07-24 22:39       ` Rafal Luzynski
  2017-07-24 22:55         ` Carlos O'Donell
  1 sibling, 1 reply; 20+ messages in thread
From: Rafal Luzynski @ 2017-07-24 22:39 UTC (permalink / raw)
  To: Mike FABIAN, Carlos O'Donell; +Cc: libc-alpha

24.07.2017 15:32 Carlos O'Donell <carlos@redhat.com> wrote:
> On 07/24/2017 09:28 AM, Mike FABIAN wrote:
> > Carlos O'Donell <carlos@redhat.com> wrote:
> >
> >> [...]
> >> So let us start slowly and agree with 'ASCII - [<>]' where < denotes
> >> the start of a code point and > the end of the code point.
> >
> > Yes, that sounds like a very reasonable first step!
> >
> > Is it OK to use that already *now*?
>
> You and Rafal are localedata maintainers, you can assume consensus, therefore
> you can start changing things in whatever way you wish.

At the moment I would hesitate with this change.  My reasons:

1. 2.26 release is just around the corner.
2. I don't know why this <U00xx> format was introduced.  I'm afraid that
   nobody here knows and also I'm afraid there was a good reason to introduce
   it.  Nobody knows what bugs will (re)appear if we revert to the more
   readable format.

Or maybe somebody understands the reasons and can explain we can safely
revert to the readable format or we can't?  If nobody can then let's
investigate the git history of the repo and find the reasons behind the
change.  If it turns out we can safely switch then let's switch.
If we find a good reason not to switch then we'll just do nothing.
If we still don't know the reasons then let's switch after 2.26 release
so we have enough time to test during the 2.27 development cycle.

> Before you change this though I would like to see your list of reasons
> for making the change, what benefits do you see it bringing? Is readability
> the only one?

Mike has already explained that the readability is a good reason and
to large extent I agree with this.  But which characters can we use in
the source code before the code becomes actually less readable?
What about the languages which use Latin alphabet with lots of
diacritical characters?  Non-European languages using Latin alphabet?
Greek, Cyrillic?  Right-to-left alphabets?  What if another developer
does not yet have or cannot ever have a font which supports a specific
alphabet?

Regards,

Rafal


* Re: Is it OK to write ASCII strings directly into locale source files?
  2017-07-24 22:34         ` Florian Weimer
@ 2017-07-24 22:51           ` Rafal Luzynski
  2017-07-25  5:40           ` Carlos O'Donell
  1 sibling, 0 replies; 20+ messages in thread
From: Rafal Luzynski @ 2017-07-24 22:51 UTC (permalink / raw)
  To: Florian Weimer, Carlos O'Donell
  Cc: libc-alpha, Mike FABIAN, Andreas Schwab

24.07.2017 23:13 Florian Weimer <fw@deneb.enyo.de> wrote:
>
>
> * Carlos O'Donell:
>
> [...]
> > My only technical objection with writing straight UTF-8 is that it could
> > lead to more mistakes, and Mike just found one in CLDR where an Arabic
> > Farsi character was used incorrectly because it displayed the same glyph.
> > It was caught when harmonizing with glibc where you have to write out the
> > code points (Mike filed a bug upstream with CLDR).
>
> Wasn't it caught by locale testing which revealed that the locale
> wasn't compatible with ISO-8859-6? [...]

This is exactly what happened.  The character was not representable in
ISO-8859-6.  There was no problem in UTF-8.

> [...]
> > My preference would be to start small, start using the POSIX portable
> > character set to its maximum extent for all Latin-based languages,
>
> I would still prefer the <U…> encoding for control characters which
> are in the portable character set. So I have to object to the
> “maximum” part. :)

I agree modulo the concerns which I expressed in another email:
let's investigate the history behind it and if we still don't
know then let's just wait for the 2.26 release.

Regards,

Rafal
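
The kind of check that caught the problem (a character not representable in ISO-8859-6) can be approximated with Python's own codecs. This is only a sketch: the function name is hypothetical, and U+06CC is an illustrative Farsi letter chosen here, since the thread does not name the character involved.

```python
def unrepresentable(text: str, charset: str = "iso8859_6") -> list[str]:
    """Return the <Uxxxx> symbols of characters that charset cannot encode."""
    bad = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            bad.append(f"<U{ord(ch):04X}>")
    return bad


# U+064A ARABIC LETTER YEH is part of ISO-8859-6.
print(unrepresentable("\u064a"))
# U+06CC ARABIC LETTER FARSI YEH is not (illustrative choice).
print(unrepresentable("\u06cc"))
```

Such a check works the same whether the locale source holds raw UTF-8 or <Uxxxx> symbols, because it runs on the decoded characters.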


* Re: Is it OK to write ASCII strings directly into locale source files?
  2017-07-24 22:39       ` Rafal Luzynski
@ 2017-07-24 22:55         ` Carlos O'Donell
  0 siblings, 0 replies; 20+ messages in thread
From: Carlos O'Donell @ 2017-07-24 22:55 UTC (permalink / raw)
  To: Rafal Luzynski, Mike FABIAN; +Cc: libc-alpha

On 07/24/2017 06:34 PM, Rafal Luzynski wrote:
> 2. I don't know why this <U00xx> format was introduced.  I'm afraid that
>    nobody here knows and also I'm afraid there was a good reason to introduce
>    it.  Nobody knows what bugs will (re)appear if we revert to the more
>    readable format.

I asked Ulrich Drepper about this and his response was that the original design
was such that the locale source files could be convertible to any encoding
system and still work.

Given the ubiquity of UTF-8 today it seems like this goal is no longer as
important.

My opinion still stands that we should start with a limited set of characters
and see if we run into any problems.

-- 
Cheers,
Carlos.


* Re: Is it OK to write ASCII strings directly into locale source files?
  2017-07-24 22:34         ` Florian Weimer
  2017-07-24 22:51           ` Rafal Luzynski
@ 2017-07-25  5:40           ` Carlos O'Donell
  2017-07-25  6:27             ` Mike FABIAN
  1 sibling, 1 reply; 20+ messages in thread
From: Carlos O'Donell @ 2017-07-25  5:40 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Andreas Schwab, Mike FABIAN, libc-alpha

On 07/24/2017 05:13 PM, Florian Weimer wrote:
>> My only technical objection with writing straight UTF-8 is that it could
>> lead to more mistakes, and Mike just found one in CLDR where an Arabic
>> Farsi character was used incorrectly because it displayed the same glyph.
>> It was caught when harmonizing with glibc where you have to write out the
>> code points (Mike filed a bug upstream with CLDR).
> 
> Wasn't it caught by locale testing which revealed that the locale
> wasn't compatible with ISO-8859-6?  That sanity check would still
> apply to locale definitions written in UTF-8.

My point was that the mistake was made in CLDR upstream where I only
presume the mistake was made because the glyphs are identical.

If we had not been using ISO-8859-6, or if we'd had a mapping from
all the UTF-8 chars into ISO-8859-6 (there was no transliteration for the
Farsi character), then we would not have noticed the error in the 
original source locale.

My only argument is that when you are forced to use <Uxxx> encoding it
is empirically less likely you'll make a mistake. Like reading a sentence
backwards to catch errors since it prevents your brain from filling in
the missing information.

> I would still prefer the <U…> encoding for control characters which
> are in the portable character set.  So I have to object to the
> “maximum” part. :)

Yes, I had ignored the control characters, so I agree, not maximally :}

-- 
Cheers,
Carlos.


* Re: Is it OK to write ASCII strings directly into locale source files?
  2017-07-25  5:40           ` Carlos O'Donell
@ 2017-07-25  6:27             ` Mike FABIAN
  2017-07-25 12:48               ` Carlos O'Donell
  0 siblings, 1 reply; 20+ messages in thread
From: Mike FABIAN @ 2017-07-25  6:27 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: Florian Weimer, Andreas Schwab, libc-alpha

Carlos O'Donell <carlos@redhat.com> wrote:

> My only argument is that when you are forced to use <Uxxx> encoding it
> is empirically less likely you'll make a mistake. Like reading a sentence
> backwards to catch errors since it prevents your brain from filling in
> the missing information.

But there are also many mistakes because somebody mistyped code points.
Several weird typos in things like month names look as if somebody
mistyped code points.

-- 
Mike FABIAN <mfabian@redhat.com>


* Re: Is it OK to write ASCII strings directly into locale source files?
  2017-07-25  6:27             ` Mike FABIAN
@ 2017-07-25 12:48               ` Carlos O'Donell
  2017-07-25 14:21                 ` Florian Weimer
  0 siblings, 1 reply; 20+ messages in thread
From: Carlos O'Donell @ 2017-07-25 12:48 UTC (permalink / raw)
  To: Mike FABIAN; +Cc: Florian Weimer, Andreas Schwab, libc-alpha

On 07/25/2017 02:20 AM, Mike FABIAN wrote:
> Carlos O'Donell <carlos@redhat.com> wrote:
> 
>> My only argument is that when you are forced to use <Uxxx> encoding it
>> is empirically less likely you'll make a mistake. Like reading a sentence
>> backwards to catch errors since it prevents your brain from filling in
>> the missing information.
> 
> But there are also many mistakes because somebody mistyped code points.
> Several weird typos in things like month names look as if somebody
> mistyped code points.

Ultimately I defer to your judgement as localedata maintainer to create
a workflow that is easy for you and benefits your work.

However, I caution against throwing away the compatibility of our locales
with POSIX, which doesn't seem to allow UTF-8 in the specification.

I would suggest the following:

(a) Documentation:

    File an Austin bug to adjust the text of the standard to allow what
    we want. Effectively documenting the de facto glibc standard which
    uses UTF-8.

(b) New process:

    Post-process the locale source before commit, and enforce, that there
    is an auto-generated comment that contains either the UTF-8 or code
    points, for the author to review before commit. If we wrote UTF-8
    in a special markup comment, and auto-generated the locale entry
    with code points then we would remain mostly compatible with POSIX
    and what we have today (less churn for user tools).

-- 
Cheers,
Carlos.
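
Option (b) could look roughly like the Python sketch below: the author supplies readable text, and a tool emits a review comment in UTF-8 plus the entry in code points. The function name, the exact comment layout, and the "%" comment character are assumptions for illustration; no such tool exists in the thread.

```python
def locale_entry(keyword: str, value: str, comment_char: str = "%") -> str:
    """Build a readable review comment followed by the code point entry."""
    encoded = "".join(
        f"<U{ord(ch):0{4 if ord(ch) <= 0xFFFF else 8}X}>" for ch in value
    )
    comment = f'{comment_char} {keyword} "{value}"'
    entry = f'{keyword} "{encoded}"'
    return comment + "\n" + entry


print(locale_entry("country_name", "França"))
```

The entry line stays in the existing <Uxxxx> form, so tools that parse today's locale sources would see no change.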


* Re: Is it OK to write ASCII strings directly into locale source files?
  2017-07-25 12:48               ` Carlos O'Donell
@ 2017-07-25 14:21                 ` Florian Weimer
  2017-07-25 14:37                   ` Carlos O'Donell
  0 siblings, 1 reply; 20+ messages in thread
From: Florian Weimer @ 2017-07-25 14:21 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: Mike FABIAN, Andreas Schwab, libc-alpha

* Carlos O'Donell:

> On 07/25/2017 02:20 AM, Mike FABIAN wrote:
>> Carlos O'Donell <carlos@redhat.com> wrote:
>> 
>>> My only argument is that when you are forced to use <Uxxx> encoding it
>>> is empirically less likely you'll make a mistake. Like reading a sentence
>>> backwards to catch errors since it prevents your brain from filling in
>>> the missing information.
>> 
>> But there are also many mistakes because somebody mistyped code points.
>> Several weird typos in things like month names look as if somebody
>> mistyped code points.
>
> Ultimately I defer to your judgement as localedata maintainer to create
> a workflow that is easy for you and benefits your work.
>
> However, I caution against throwing away the compatibility of our locales
> with POSIX, which doesn't seem to allow UTF-8 in the specification.

It does, to some extent:

| A character in the portable character set can be represented by the
| character itself, in which case the value of the character is
| implementation-defined. (Implementations may allow other characters
| to be represented as themselves, but such locale definitions are not
| portable.)

You'll need a very hostile interpretation to say that this doesn't
allow multi-byte character sequences in localedef input.

But I found this in the guts of localedef:

	      /* The standards leave it up to the implementation to decide
		 what to do with character which stand for themself.  We
		 could jump through hoops to find out the value relative to
		 the charmap and the repertoire map, but instead we leave
		 it up to the locale definition author to write a better
		 definition.  We assume here that every character which
		 stands for itself is encoded using ISO 8859-1.  Using the
		 escape character is allowed.  */

So we currently hard-code ISO 8859-1 (not UTF-8) to avoid the
bootstrapping problem.
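
The practical effect of that assumption can be sketched as follows. This
is only an illustration in Python, not the localedef code itself, and the
function name read_as_latin1 is hypothetical: UTF-8 bytes for a non-ASCII
character are read as several ISO 8859-1 characters, while pure ASCII
text is unaffected.

```python
# Illustration only: what happens when UTF-8 input is interpreted
# byte by byte as ISO 8859-1.

def read_as_latin1(text):
    """Encode text as UTF-8, then decode the bytes as ISO 8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


if __name__ == "__main__":
    original = "Fran\u00e7a"           # U+00E7 is two bytes in UTF-8: C3 A7
    misread = read_as_latin1(original)
    print(len(original), len(misread))  # the misread string is one longer
    print(read_as_latin1("%a %d %b %Y %T %Z"))  # ASCII passes through
```

Every byte below 0x80 means the same thing in both encodings, which is
why writing plain ASCII directly is safe under the current assumption.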

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Is it OK to write ASCII strings directly into locale source files?
  2017-07-25 14:21                 ` Florian Weimer
@ 2017-07-25 14:37                   ` Carlos O'Donell
  2017-07-25 19:05                     ` Florian Weimer
  0 siblings, 1 reply; 20+ messages in thread
From: Carlos O'Donell @ 2017-07-25 14:37 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Mike FABIAN, Andreas Schwab, libc-alpha

On 07/25/2017 10:12 AM, Florian Weimer wrote:
> * Carlos O'Donell:
> 
>> On 07/25/2017 02:20 AM, Mike FABIAN wrote:
>>> Carlos O'Donell <carlos@redhat.com> wrote:
>>>
>>>> My only argument is that when you are forced to use <Uxxx> encoding it
>>>> is empirically less likely you'll make a mistake. Like reading a sentence
>>>> backwards to catch errors since it prevents your brain from filling in
>>>> the missing information.
>>>
>>> But there are also many mistakes because somebody mistyped code points.
>>> Several weird typos in things like month names look as if somebody
>>> mistyped code points.
>>
>> Ultimately I defer to your judgement as localedata maintainer to create
>> a workflow that is easy for you and benefits your work.
>>
>> However, I caution against throwing away the compatibility of our locales
>> with POSIX, which doesn't seem to allow UTF-8 in the specification.
> 
> It does, to some extent:
> 
> | A character in the portable character set can be represented by the
> | character itself, in which case the value of the character is
> | implementation-defined. (Implementations may allow other characters
> | to be represented as themselves, but such locale definitions are not
> | portable.)
> 
> You'll need a very hostile interpretation to say that this doesn't
> allow multi-byte character sequences in localedef input.

I see what you're saying, which is that we are *still* POSIX compliant,
but not portable?

I assume we are focusing on the "()" text, which provides some kind of escape
hatch outside of the portable character set and allows us to use UTF-8?

> But I found this in the guts of localedef:
> 
> 	      /* The standards leave it up to the implementation to decide
> 		 what to do with character which stand for themself.  We
> 		 could jump through hoops to find out the value relative to
> 		 the charmap and the repertoire map, but instead we leave
> 		 it up to the locale definition author to write a better
> 		 definition.  We assume here that every character which
> 		 stands for itself is encoded using ISO 8859-1.  Using the
> 		 escape character is allowed.  */
> 
> So we currently hard-code ISO 8859-1 (not UTF-8) to avoid the
> bootstrapping problem.
 
We could just assume UTF-8, but yes, it looks like this needs a little bit
more looking into.

Either way, I support using the portable character set today, and that's
a step forward.

-- 
Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Is it OK to write ASCII strings directly into locale source files?
  2017-07-25 14:37                   ` Carlos O'Donell
@ 2017-07-25 19:05                     ` Florian Weimer
  0 siblings, 0 replies; 20+ messages in thread
From: Florian Weimer @ 2017-07-25 19:05 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: Mike FABIAN, Andreas Schwab, libc-alpha

* Carlos O'Donell:

>>> However, I caution against throwing away the compatibility of our locales
>>> with POSIX, which doesn't seem to allow UTF-8 in the specification.
>> 
>> It does, to some extent:
>> 
>> | A character in the portable character set can be represented by the
>> | character itself, in which case the value of the character is
>> | implementation-defined. (Implementations may allow other characters
>> | to be represented as themselves, but such locale definitions are not
>> | portable.)
>> 
>> You'll need a very hostile interpretation to say that this doesn't
>> allow multi-byte character sequences in localedef input.
>
> I see what you're saying, which is that we are *still* POSIX compliant,
> but not portable?

Right, and I think that's okay because the glibc locales are for
glibc.

> I assume we are focusing on the "()" text, which provides some kind of escape
> hatch outside of the portable character set and allows us to use UTF-8?

Exactly.

>> But I found this in the guts of localedef:
>> 
>> 	      /* The standards leave it up to the implementation to decide
>> 		 what to do with character which stand for themself.  We
>> 		 could jump through hoops to find out the value relative to
>> 		 the charmap and the repertoire map, but instead we leave
>> 		 it up to the locale definition author to write a better
>> 		 definition.  We assume here that every character which
>> 		 stands for itself is encoded using ISO 8859-1.  Using the
>> 		 escape character is allowed.  */
>> 
>> So we currently hard-code ISO 8859-1 (not UTF-8) to avoid the
>> bootstrapping problem.
>  
> We could just assume UTF-8, but yes, it looks like this needs a little bit
> more looking into.

Yes, and we don't have a real bootstrapping problem: while we have a
charmap file for UTF-8, we also have a separate UTF-8 implementation
in iconv/gconv, and we could use that to break the loop.

> Either way, I support using the portable character set today, and that's
> a step forward.

Agreed.

^ permalink raw reply	[flat|nested] 20+ messages in thread

Thread overview: 20+ messages
2017-07-24 13:13 Is it OK to write ASCII strings directly into locale source files? Mike FABIAN
2017-07-24 13:28 ` Carlos O'Donell
2017-07-24 13:32   ` Mike FABIAN
2017-07-24 14:47     ` Carlos O'Donell
2017-07-24 15:03       ` Mike FABIAN
2017-07-24 15:45         ` Carlos O'Donell
2017-07-24 22:39       ` Rafal Luzynski
2017-07-24 22:55         ` Carlos O'Donell
2017-07-24 14:49   ` Andreas Schwab
2017-07-24 15:07     ` Carlos O'Donell
2017-07-24 17:07     ` Florian Weimer
2017-07-24 20:07       ` Carlos O'Donell
2017-07-24 22:34         ` Florian Weimer
2017-07-24 22:51           ` Rafal Luzynski
2017-07-25  5:40           ` Carlos O'Donell
2017-07-25  6:27             ` Mike FABIAN
2017-07-25 12:48               ` Carlos O'Donell
2017-07-25 14:21                 ` Florian Weimer
2017-07-25 14:37                   ` Carlos O'Donell
2017-07-25 19:05                     ` Florian Weimer
