* Is it OK to write ASCII strings directly into locale source files? @ 2017-07-24 13:13 Mike FABIAN 2017-07-24 13:28 ` Carlos O'Donell 0 siblings, 1 reply; 20+ messages in thread From: Mike FABIAN @ 2017-07-24 13:13 UTC (permalink / raw) To: libc-alpha

Currently the locale source files use a lot of code points even for
strings which are pure ASCII. For example localedata/locales/de_DE
contains:

% "%a %d %b %Y %T %Z"
d_t_fmt "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020><U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>"

Would it be OK to write this as

d_t_fmt "%a %d %b %Y %T %Z"

??

This would make the files much more readable.

Stuff that is mostly ASCII can probably be written like this:

% https://oc.wikipedia.org/wiki/Fran%C3%A7a França
country_name "Fran<U00E7>a"

which is already more readable than writing it all in <U00??> code points.

It would be even nicer to write it completely in UTF-8, i.e.:

country_name "França"

but I am not sure whether this is allowed in the locale source files.

But at least for everything which is ASCII, it might be OK already to
write the characters directly.

Is writing ASCII there allowed or not??

--
Mike FABIAN <mfabian@redhat.com>

^ permalink raw reply [flat|nested] 20+ messages in thread
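The conversion asked about here can be sketched as a small script. This is a minimal illustration in Python and not part of glibc: the function name `asciify` and the set of characters kept in <Uxxxx> form are assumptions. The thread only discusses reserving '<' and '>'; the double quote and '/' are kept encoded here purely as a conservative guess, and control characters and everything outside printable ASCII are left untouched.

```python
import re

# Matches symbolic code point names such as <U0025> or <U00E7>.
UCS_NAME = re.compile(r"<U([0-9A-Fa-f]{4,8})>")

# Assumption: characters that stay in <Uxxxx> form even though they are ASCII.
# '<' and '>' delimit symbolic names; '"' and '/' are kept encoded only as a
# conservative guess (string delimiter and a commonly used escape character).
KEEP_ENCODED = {0x3C, 0x3E, 0x22, 0x2F}


def asciify(text):
    """Replace <Uxxxx> names of printable ASCII characters by the characters."""

    def repl(match):
        code_point = int(match.group(1), 16)
        # Only printable ASCII (space through tilde) is written literally.
        if 0x20 <= code_point <= 0x7E and code_point not in KEEP_ENCODED:
            return chr(code_point)
        # Everything else, including control characters, is left as it was.
        return match.group(0)

    return UCS_NAME.sub(repl, text)


if __name__ == "__main__":
    print(asciify('d_t_fmt "<U0025><U0061><U0020><U0025><U0064>"'))
    print(asciify('country_name "Fran<U00E7>a"'))
```

Running such a script over a locale file would leave non-ASCII names like <U00E7> alone, which matches the "mostly ASCII" form shown in the message.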
* Re: Is it OK to write ASCII strings directly into locale source files? 2017-07-24 13:13 Is it OK to write ASCII strings directly into locale source files? Mike FABIAN @ 2017-07-24 13:28 ` Carlos O'Donell 2017-07-24 13:32 ` Mike FABIAN 2017-07-24 14:49 ` Andreas Schwab 0 siblings, 2 replies; 20+ messages in thread From: Carlos O'Donell @ 2017-07-24 13:28 UTC (permalink / raw) To: Mike FABIAN, libc-alpha On 07/24/2017 09:09 AM, Mike FABIAN wrote: > > Currently the locale source files use a lot of code points even for > strings which are pure ASCII. For example localedata/locales/de_DE > contains: > > % "%a %d %b %Y %T %Z" > d_t_fmt "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020><U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>" > > Would it be OK to write this as > > d_t_fmt "%a %d %b %Y %T %Z" > > ?? > > This would make the files much more readable. > > Stuff that is mostly ASCII can probably be written like this: > > % https://oc.wikipedia.org/wiki/Fran%C3%A7a França > country_name "Fran<U00E7>a" > > which is already more readable then writing it all in <U00??> code points. > > It would be even nicer to write it completely in UTF-8, i.e.: > > country_name "França" > > but I am not sure whether this is allowed in the locale source files. > > But at least for everything which is ASCII, it might be OK already to > write the characters directly. > > Is writing ASCII there allowed or not?? It's not ASCII though is it? Since '<' and '>' have to be reserved to support parsing of UTF-8 code points, so it's "almost ASCII." I'm ok using 'almost' ASCII characters as their 1-byte UTF-8 form instead of the verbose code-points, but we need to document exactly which characters are allowed. I believe the answer is everything except '<>'. I'm not entirely ready to allow all UTF-8, since that descends into the much more complex discussion around NFC, NFKC, NFD, NFKD etc. and which form should be used. 
Then there are discussions around uniqueness of decomposition and exactly what did the source author want. So let us start slowly and agree with 'ASCII - [<>]' where < denotes the start of a code point and > the end of the code point. -- Cheers, Carlos. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Is it OK to write ASCII strings directly into locale source files? 2017-07-24 13:28 ` Carlos O'Donell @ 2017-07-24 13:32 ` Mike FABIAN 2017-07-24 14:47 ` Carlos O'Donell 2017-07-24 14:49 ` Andreas Schwab 1 sibling, 1 reply; 20+ messages in thread From: Mike FABIAN @ 2017-07-24 13:32 UTC (permalink / raw) To: Carlos O'Donell; +Cc: libc-alpha Carlos O'Donell <carlos@redhat.com> wrote: > On 07/24/2017 09:09 AM, Mike FABIAN wrote: >> >> Currently the locale source files use a lot of code points even for >> strings which are pure ASCII. For example localedata/locales/de_DE >> contains: >> >> % "%a %d %b %Y %T %Z" >> d_t_fmt >> "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020><U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>" >> >> Would it be OK to write this as >> >> d_t_fmt "%a %d %b %Y %T %Z" >> >> ?? >> >> This would make the files much more readable. >> >> Stuff that is mostly ASCII can probably be written like this: >> >> % https://oc.wikipedia.org/wiki/Fran%C3%A7a França >> country_name "Fran<U00E7>a" >> >> which is already more readable then writing it all in <U00??> code points. >> >> It would be even nicer to write it completely in UTF-8, i.e.: >> >> country_name "França" >> >> but I am not sure whether this is allowed in the locale source files. >> >> But at least for everything which is ASCII, it might be OK already to >> write the characters directly. >> >> Is writing ASCII there allowed or not?? > > It's not ASCII though is it? Since '<' and '>' have to be reserved > to support parsing of UTF-8 code points, so it's "almost ASCII." > > I'm ok using 'almost' ASCII characters as their 1-byte UTF-8 form > instead of the verbose code-points, but we need to document exactly > which characters are allowed. I believe the answer is everything > except '<>'. > > I'm not entirely ready to allow all UTF-8, since that descends into > the much more complex discussion around NFC, NFKC, NFD, NFKD etc. and > which form should be used. 
Then there are discussions around uniqueness > of decomposition and exactly what did the source author want. > > So let us start slowly and agree with 'ASCII - [<>]' where < denotes > the start of a code point and > the end of the code point. Yes, that sounds like a very reasonable first step! Is it OK to use that already *now*? Or is any change necessary to make that work? I tried country_name "Fran<U00E7>a" and it seems to work: bash-4.4# LC_ALL=oc_FR.UTF-8 locale -k country_name country_name="França" So maybe it is possible to use that right now without having to change anything in the code parsing the locale source files. -- Mike FABIAN <mfabian@redhat.com> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Is it OK to write ASCII strings directly into locale source files? 2017-07-24 13:32 ` Mike FABIAN @ 2017-07-24 14:47 ` Carlos O'Donell 2017-07-24 15:03 ` Mike FABIAN 2017-07-24 22:39 ` Rafal Luzynski 0 siblings, 2 replies; 20+ messages in thread From: Carlos O'Donell @ 2017-07-24 14:47 UTC (permalink / raw) To: Mike FABIAN; +Cc: libc-alpha On 07/24/2017 09:28 AM, Mike FABIAN wrote: > Carlos O'Donell <carlos@redhat.com> wrote: > >> On 07/24/2017 09:09 AM, Mike FABIAN wrote: >>> >>> Currently the locale source files use a lot of code points even for >>> strings which are pure ASCII. For example localedata/locales/de_DE >>> contains: >>> >>> % "%a %d %b %Y %T %Z" >>> d_t_fmt >>> "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020><U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>" >>> >>> Would it be OK to write this as >>> >>> d_t_fmt "%a %d %b %Y %T %Z" >>> >>> ?? >>> >>> This would make the files much more readable. >>> >>> Stuff that is mostly ASCII can probably be written like this: >>> >>> % https://oc.wikipedia.org/wiki/Fran%C3%A7a França >>> country_name "Fran<U00E7>a" >>> >>> which is already more readable then writing it all in <U00??> code points. >>> >>> It would be even nicer to write it completely in UTF-8, i.e.: >>> >>> country_name "França" >>> >>> but I am not sure whether this is allowed in the locale source files. >>> >>> But at least for everything which is ASCII, it might be OK already to >>> write the characters directly. >>> >>> Is writing ASCII there allowed or not?? >> >> It's not ASCII though is it? Since '<' and '>' have to be reserved >> to support parsing of UTF-8 code points, so it's "almost ASCII." >> >> I'm ok using 'almost' ASCII characters as their 1-byte UTF-8 form >> instead of the verbose code-points, but we need to document exactly >> which characters are allowed. I believe the answer is everything >> except '<>'. 
>> >> I'm not entirely ready to allow all UTF-8, since that descends into >> the much more complex discussion around NFC, NFKC, NFD, NFKD etc. and >> which form should be used. Then there are discussions around uniqueness >> of decomposition and exactly what did the source author want. >> >> So let us start slowly and agree with 'ASCII - [<>]' where < denotes >> the start of a code point and > the end of the code point. > > Yes, that sounds like a very reasonable first step! > > Is it OK to use that already *now*? You and Rafal are localedata maintainers, you can assume consensus, therefore you can start changing things in whatever way you wish. Before you change this though I would like to see your list of reasons for making the change, what benefits do you see it bringing? Is readability the only one? > Or is any change necessary to make that work? I do not know. > I tried > > country_name "Fran<U00E7>a" > > and it seems to work: > > bash-4.4# LC_ALL=oc_FR.UTF-8 locale -k country_name > country_name="França" > > So maybe it is possible to use that right now without having to change > anything in the code parsing the locale source files. You need to document somewhere what is acceptable and what is not and which ASCII characters cannot be used. -- Cheers, Carlos. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Is it OK to write ASCII strings directly into locale source files? 2017-07-24 14:47 ` Carlos O'Donell @ 2017-07-24 15:03 ` Mike FABIAN 2017-07-24 15:45 ` Carlos O'Donell 2017-07-24 22:39 ` Rafal Luzynski 1 sibling, 1 reply; 20+ messages in thread From: Mike FABIAN @ 2017-07-24 15:03 UTC (permalink / raw) To: Carlos O'Donell; +Cc: libc-alpha Carlos O'Donell <carlos@redhat.com> wrote: > On 07/24/2017 09:28 AM, Mike FABIAN wrote: >> Carlos O'Donell <carlos@redhat.com> wrote: >> >>> On 07/24/2017 09:09 AM, Mike FABIAN wrote: [...] >> Yes, that sounds like a very reasonable first step! >> >> Is it OK to use that already *now*? > > You and Rafal are localedata maintainers, you can assume consensus, therefore > you can start changing things in whatever way you wish. > > Before you change this though I would like to see your list of reasons > for making the change, what benefits do you see it bringing? Is readability > the only one? Readability is the only reason for me at the moment, editing these files using the code points is very tedious and error prone. >> So maybe it is possible to use that right now without having to change >> anything in the code parsing the locale source files. > > You need to document somewhere what is acceptable and what is not and > which ASCII characters cannot be used. Where should that be documented? -- Mike FABIAN <mfabian@redhat.com> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Is it OK to write ASCII strings directly into locale source files? 2017-07-24 15:03 ` Mike FABIAN @ 2017-07-24 15:45 ` Carlos O'Donell 0 siblings, 0 replies; 20+ messages in thread From: Carlos O'Donell @ 2017-07-24 15:45 UTC (permalink / raw) To: Mike FABIAN; +Cc: libc-alpha On 07/24/2017 10:49 AM, Mike FABIAN wrote: > Carlos O'Donell <carlos@redhat.com> wrote: > >> On 07/24/2017 09:28 AM, Mike FABIAN wrote: >>> Carlos O'Donell <carlos@redhat.com> wrote: >>> >>>> On 07/24/2017 09:09 AM, Mike FABIAN wrote: > > [...] > >>> Yes, that sounds like a very reasonable first step! >>> >>> Is it OK to use that already *now*? >> >> You and Rafal are localedata maintainers, you can assume consensus, therefore >> you can start changing things in whatever way you wish. >> >> Before you change this though I would like to see your list of reasons >> for making the change, what benefits do you see it bringing? Is readability >> the only one? > > Readability is the only reason for me at the moment, editing these files > using the code points is very tedious and error prone. Sounds like a good reason to me. >>> So maybe it is possible to use that right now without having to change >>> anything in the code parsing the locale source files. >> >> You need to document somewhere what is acceptable and what is not and >> which ASCII characters cannot be used. > > Where should that be documented? Andreas Schwab just pointed out that we should already technically support using any character in the POSIX portable character set in the locale definitions. So I think there is nothing to comment, and we are still within the definition of POSIX for making these locales portable. So feel free to change any UTF-8 code points into their equivalent POSIX portable character set characters. -- Cheers, Carlos. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Is it OK to write ASCII strings directly into locale source files? 2017-07-24 14:47 ` Carlos O'Donell 2017-07-24 15:03 ` Mike FABIAN @ 2017-07-24 22:39 ` Rafal Luzynski 2017-07-24 22:55 ` Carlos O'Donell 1 sibling, 1 reply; 20+ messages in thread From: Rafal Luzynski @ 2017-07-24 22:39 UTC (permalink / raw) To: Mike FABIAN, Carlos O'Donell; +Cc: libc-alpha 24.07.2017 15:32 Carlos O'Donell <carlos@redhat.com> wrote: > On 07/24/2017 09:28 AM, Mike FABIAN wrote: > > Carlos O'Donell <carlos@redhat.com> wrote: > > > >> [...] > >> So let us start slowly and agree with 'ASCII - [<>]' where < denotes > >> the start of a code point and > the end of the code point. > > > > Yes, that sounds like a very reasonable first step! > > > > Is it OK to use that already *now*? > > You and Rafal are localedata maintainers, you can assume consensus, therefore > you can start changing things in whatever way you wish. At the moment I would hesitate with this change. My reasons: 1. 2.26 release is just around the corner. 2. I don't know why this <U00xx> format was introduced. I'm afraid that nobody here knows and also I'm afraid there was a good reason to introduce it. Nobody knows what bugs will (re)appear if we revert to the more readable format. Or maybe somebody understands the reasons and can explain we can safely revert to the readable format or we can't? If nobody can then let's investigate the git history of the repo and find the reasons behind the change. If it turns out we can safely switch then let's switch. If we find a good reason not to switch then we'll just do nothing. If we still don't know the reasons then let's switch after 2.26 release so we have enough time to test during the 2.27 development cycle. > Before you change this though I would like to see your list of reasons > for making the change, what benefits do you see it bringing? Is readability > the only one? 
Mike has already explained that readability is a good reason and to a large extent I agree with this. But which characters can we use in the source code before the code becomes actually less readable? What about the languages which use the Latin alphabet with lots of diacritical characters? Non-European languages using the Latin alphabet? Greek, Cyrillic? Right-to-left alphabets? What if another developer does not yet have or cannot ever have a font which supports a specific alphabet? Regards, Rafal ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Is it OK to write ASCII strings directly into locale source files? 2017-07-24 22:39 ` Rafal Luzynski @ 2017-07-24 22:55 ` Carlos O'Donell 0 siblings, 0 replies; 20+ messages in thread From: Carlos O'Donell @ 2017-07-24 22:55 UTC (permalink / raw) To: Rafal Luzynski, Mike FABIAN; +Cc: libc-alpha On 07/24/2017 06:34 PM, Rafal Luzynski wrote: > 2. I don't know why this <U00xx> format was introduced. I'm afraid that > nobody here knows and also I'm afraid there was a good reason to introduce > it. Nobody knows what bugs will (re)appear if we revert to the more > readable format. I asked Ulrich Drepper about this and his response was that the original design was such that the locale source files could be convertible to any encoding system and still work. Given the ubiquity of UTF-8 today it seems like this goal is no longer as important. My opinion still stands that we should start with a limited set of characters and see if we run into any problems. -- Cheers, Carlos. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Is it OK to write ASCII strings directly into locale source files? 2017-07-24 13:28 ` Carlos O'Donell 2017-07-24 13:32 ` Mike FABIAN @ 2017-07-24 14:49 ` Andreas Schwab 2017-07-24 15:07 ` Carlos O'Donell 2017-07-24 17:07 ` Florian Weimer 1 sibling, 2 replies; 20+ messages in thread From: Andreas Schwab @ 2017-07-24 14:49 UTC (permalink / raw) To: Carlos O'Donell; +Cc: Mike FABIAN, libc-alpha On Jul 24 2017, Carlos O'Donell <carlos@redhat.com> wrote: > So let us start slowly and agree with 'ASCII - [<>]' where < denotes > the start of a code point and > the end of the code point. POSIX says "character in the portable character set" if you want to keep it portable. Andreas. -- Andreas Schwab, SUSE Labs, schwab@suse.de GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE 1748 E4D4 88E3 0EEA B9D7 "And now for something completely different." ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Is it OK to write ASCII strings directly into locale source files? 2017-07-24 14:49 ` Andreas Schwab @ 2017-07-24 15:07 ` Carlos O'Donell 2017-07-24 17:07 ` Florian Weimer 1 sibling, 0 replies; 20+ messages in thread From: Carlos O'Donell @ 2017-07-24 15:07 UTC (permalink / raw) To: Andreas Schwab; +Cc: Mike FABIAN, libc-alpha On 07/24/2017 10:47 AM, Andreas Schwab wrote: > On Jul 24 2017, Carlos O'Donell <carlos@redhat.com> wrote: > >> So let us start slowly and agree with 'ASCII - [<>]' where < denotes >> the start of a code point and > the end of the code point. > > POSIX says "character in the portable character set" if you want to keep > it portable. Yes, that's right, and the "7.3 Locale Definition" already says that "<" is in the reserved namespace and must be escaped to be used literally. So we should already support using characters from the POSIX portable character set with proper escaping. -- Cheers, Carlos. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Is it OK to write ASCII strings directly into locale source files? 2017-07-24 14:49 ` Andreas Schwab 2017-07-24 15:07 ` Carlos O'Donell @ 2017-07-24 17:07 ` Florian Weimer 2017-07-24 20:07 ` Carlos O'Donell 1 sibling, 1 reply; 20+ messages in thread From: Florian Weimer @ 2017-07-24 17:07 UTC (permalink / raw) To: Andreas Schwab; +Cc: Carlos O'Donell, Mike FABIAN, libc-alpha * Andreas Schwab: > On Jul 24 2017, Carlos O'Donell <carlos@redhat.com> wrote: > >> So let us start slowly and agree with 'ASCII - [<>]' where < denotes >> the start of a code point and > the end of the code point. > > POSIX says "character in the portable character set" if you want to keep > it portable. But our locales only have to be compatible with our localedef, right? I know that the FSF does not claim copyright on our locales, so anyone is free to take them and use them with their own non-GNU systems (or sell them as PDFs/books). But this does not mean we have to make their lives easier if it comes at a cost to us (e.g., verifying that we only use the portable character set, or refraining from using full UTF-8 at a future date). ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Is it OK to write ASCII strings directly into locale source files? 2017-07-24 17:07 ` Florian Weimer @ 2017-07-24 20:07 ` Carlos O'Donell 2017-07-24 22:34 ` Florian Weimer 0 siblings, 1 reply; 20+ messages in thread From: Carlos O'Donell @ 2017-07-24 20:07 UTC (permalink / raw) To: Florian Weimer, Andreas Schwab; +Cc: Mike FABIAN, libc-alpha On 07/24/2017 01:05 PM, Florian Weimer wrote: > * Andreas Schwab: > >> On Jul 24 2017, Carlos O'Donell <carlos@redhat.com> wrote: >> >>> So let us start slowly and agree with 'ASCII - [<>]' where < denotes >>> the start of a code point and > the end of the code point. >> >> POSIX says "character in the portable character set" if you want to keep >> it portable. > > But our locales only have to be compatible with our localedef, right? Should developers be able to write tools to the POSIX locale spec and parse our source locale definitions? Supporting more than just GNU/Linux? Do the BSDs share our locale definitions? > I know that the FSF does not claim copyright on our locales, so anyone > is free to take them and use them with their own non-GNU systems (or > sell them as PDFs/books). But this does not mean we have to make > their lives easier if it comes at a cost to us (e.g., verifying that > we only use the portable character set, or refraining from using full > UTF-8 at a future date). I agree with your sentiment, and leave it up to Mike to decide what makes it ultimately easier for him as a subsystem maintainer to work with. There is certainly a cost/reward balance. My only technical objection with writing straight UTF-8 is that it could lead to more mistakes, and Mike just found one in CLDR where an Arabic Farsi character was used incorrectly because it displayed the same glyph. It was caught when harmonizing with glibc where you have to write out the code points (Mike filed a bug upstream with CLDR). 
My preference would be to start small, start using the POSIX portable character set to its maximum extent for all Latin-based languages, see how that works out, and then decide if we even need to pursue full UTF-8 and in which form. -- Cheers, Carlos. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Is it OK to write ASCII strings directly into locale source files? 2017-07-24 20:07 ` Carlos O'Donell @ 2017-07-24 22:34 ` Florian Weimer 2017-07-24 22:51 ` Rafal Luzynski 2017-07-25 5:40 ` Carlos O'Donell 0 siblings, 2 replies; 20+ messages in thread From: Florian Weimer @ 2017-07-24 22:34 UTC (permalink / raw) To: Carlos O'Donell; +Cc: Andreas Schwab, Mike FABIAN, libc-alpha * Carlos O'Donell: > On 07/24/2017 01:05 PM, Florian Weimer wrote: >> * Andreas Schwab: >> >>> On Jul 24 2017, Carlos O'Donell <carlos@redhat.com> wrote: >>> >>>> So let us start slowly and agree with 'ASCII - [<>]' where < denotes >>>> the start of a code point and > the end of the code point. >>> >>> POSIX says "character in the portable character set" if you want to keep >>> it portable. >> >> But our locales only have to be compatible with our localedef, right? > > Should developers be able to write tools to the POSIX locale spec and parse > our source locale definitions? Supporting more than just GNU/Linux? Do the > BSDs share our locale definitions? No, they don't. For one thing, they have partially implemented %OB (without fixing all the locales, creating inconsistencies). > My only technical objection with writing straight UTF-8 is that it could > lead to more mistakes, and Mike just found one in CLDR where an Arabic > Farsi character was used incorrectly because it displayed the same glyph. > It was caught when harmonizing with glibc where you have to write out the > code points (Mike filed a bug upstream with CLDR). Wasn't it caught by locale testing which revealed that the locale wasn't compatible with ISO-8859-6? That sanity check would still apply to locale definitions written in UTF-8. If we are worried about this kind of problem, I think web browsers have multi-script detection logic to deal with cross-script homographs in IDNA labels. I don't know how hard it would be to extract that logic from there and run it on locale strings, for cross-verification. 
> My preference would be to start small, start using the POSIX portable > character set to it's maximum extent for all latin-based languages, I would still prefer the <U…> encoding for control characters which are in the portable character set. So I have to object to the “maximum” part. :) ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Is it OK to write ASCII strings directly into locale source files? 2017-07-24 22:34 ` Florian Weimer @ 2017-07-24 22:51 ` Rafal Luzynski 2017-07-25 5:40 ` Carlos O'Donell 1 sibling, 0 replies; 20+ messages in thread From: Rafal Luzynski @ 2017-07-24 22:51 UTC (permalink / raw) To: Florian Weimer, Carlos O'Donell Cc: libc-alpha, Mike FABIAN, Andreas Schwab 24.07.2017 23:13 Florian Weimer <fw@deneb.enyo.de> wrote: > > > * Carlos O'Donell: > > [...] > > My only technical objection with writing straight UTF-8 is that it could > > lead to more mistakes, and Mike just found one in CLDR where an Arabic > > Farsi character was used incorrectly because it displayed the same glyph. > > It was caught when harmonizing with glibc where you have to write out the > > code points (Mike filed a bug upstream with CLDR). > > Wasn't it caught by locale testing which revealed that the locale > wasn't compatible with ISO-8859-6? [...] This is exactly what happened. The character was not representable in ISO-8859-6. There was no problem in UTF-8. > [...] > > My preference would be to start small, start using the POSIX portable > > character set to it's maximum extent for all latin-based languages, > > I would still prefer the <U…> encoding for control characters which > are in the portable character set. So I have to object to the > “maximum” part. :) I agree modulo the concerns which I expressed in another email: let's investigate the history behind it and if we still don't know then let's just wait for the 2.26 release. Regards, Rafal ^ permalink raw reply [flat|nested] 20+ messages in thread
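The representability check described above can be approximated outside of localedef. The following is a minimal sketch in Python, assuming the standard `iso8859_6` codec is an adequate stand-in for the glibc charmap. The function name `unrepresentable` and the sample characters (U+064A ARABIC LETTER YEH and U+06CC ARABIC LETTER FARSI YEH) are illustrative only; they are not necessarily the characters involved in the CLDR bug mentioned in the thread.

```python
def unrepresentable(text, charset="iso8859_6"):
    """Return the characters of text that cannot be encoded in charset."""
    missing = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            # The character has no mapping in the target character set.
            missing.append(ch)
    return missing


if __name__ == "__main__":
    # Illustrative look-alike pair: Arabic YEH versus Farsi YEH.
    for sample in ("\u064a", "\u06cc"):
        bad = unrepresentable(sample)
        print(f"U+{ord(sample):04X}:", "ok" if not bad else "not representable")
```

A check of this kind works the same way whether the locale source spells the string with <Uxxxx> names or with literal UTF-8, which is the point made in the quoted reply.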
* Re: Is it OK to write ASCII strings directly into locale source files? 2017-07-24 22:34 ` Florian Weimer 2017-07-24 22:51 ` Rafal Luzynski @ 2017-07-25 5:40 ` Carlos O'Donell 2017-07-25 6:27 ` Mike FABIAN 1 sibling, 1 reply; 20+ messages in thread From: Carlos O'Donell @ 2017-07-25 5:40 UTC (permalink / raw) To: Florian Weimer; +Cc: Andreas Schwab, Mike FABIAN, libc-alpha On 07/24/2017 05:13 PM, Florian Weimer wrote: >> My only technical objection with writing straight UTF-8 is that it could >> lead to more mistakes, and Mike just found one in CLDR where an Arabic >> Farsi character was used incorrectly because it displayed the same glyph. >> It was caught when harmonizing with glibc where you have to write out the >> code points (Mike filed a bug upstream with CLDR). > > Wasn't it caught by locale testing which revealed that the locale > wasn't compatible with ISO-8859-6? That sanity check would still > apply to locale definitions written in UTF-8. My point was that the mistake was made in CLDR upstream where I only presume the mistake was made because the glyphs are identical. If we had not been using ISO-8859-6, or if we'd had a mapping from all the UTF-8 chars into ISO-8859-6 (there was no transliteration for the Farsi character), then we would not have noticed the error in the original source locale. My only argument is that when you are forced to use <Uxxx> encoding it is empirically less likely you'll make a mistake. Like reading a sentence backwards to catch errors since it prevents your brain from filling in the missing information. > I would still prefer the <U…> encoding for control characters which > are in the portable character set. So I have to object to the > “maximum” part. :) Yes, I had ignored the control characters, so I agree, not maximally :} -- Cheers, Carlos. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Is it OK to write ASCII strings directly into locale source files? 2017-07-25 5:40 ` Carlos O'Donell @ 2017-07-25 6:27 ` Mike FABIAN 2017-07-25 12:48 ` Carlos O'Donell 0 siblings, 1 reply; 20+ messages in thread From: Mike FABIAN @ 2017-07-25 6:27 UTC (permalink / raw) To: Carlos O'Donell; +Cc: Florian Weimer, Andreas Schwab, libc-alpha Carlos O'Donell <carlos@redhat.com> wrote: > My only argument is that when you are forced to use <Uxxx> encoding it > is empirically less likely you'll make a mistake. Like reading a sentence > backwards to catch errors since it prevents your brain from filling in > the missing information. But there are also many mistakes because somebody mistyped code points. Several weird typos in things like month names look as if somebody mistyped code points. -- Mike FABIAN <mfabian@redhat.com> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Is it OK to write ASCII strings directly into locale source files? 2017-07-25 6:27 ` Mike FABIAN @ 2017-07-25 12:48 ` Carlos O'Donell 2017-07-25 14:21 ` Florian Weimer 0 siblings, 1 reply; 20+ messages in thread From: Carlos O'Donell @ 2017-07-25 12:48 UTC (permalink / raw) To: Mike FABIAN; +Cc: Florian Weimer, Andreas Schwab, libc-alpha

On 07/25/2017 02:20 AM, Mike FABIAN wrote:
> Carlos O'Donell <carlos@redhat.com> wrote:
>
>> My only argument is that when you are forced to use <Uxxx> encoding it
>> is empirically less likely you'll make a mistake. Like reading a sentence
>> backwards to catch errors since it prevents your brain from filling in
>> the missing information.
>
> But there are also many mistakes because somebody mistyped code points.
> Several weird typos in things like month names look as if somebody
> mistyped code points.

Ultimately I defer to your judgement as localedata maintainer to create
a workflow that is easy for you and benefits your work.

However, I caution against throwing away the compatibility of our locales
with POSIX, which doesn't seem to allow UTF-8 in the specification.

I would suggest the following:

(a) Documentation: File an Austin bug to adjust the text of the standard
    to allow what we want. Effectively documenting the de facto glibc
    standard which uses UTF-8.

(b) New process: Post-process the locale source before commit, and
    enforce that there is an auto-generated comment that contains
    either the UTF-8 or code points, for the author to review before
    commit.

If we wrote UTF-8 in a special markup comment, and auto-generated the
locale entry with code points then we would remain mostly compatible
with POSIX and what we have today (less churn for user tools).

--
Cheers,
Carlos.

^ permalink raw reply [flat|nested] 20+ messages in thread
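Proposal (b) could look roughly like the following. This is a minimal sketch in Python under stated assumptions: the function name `readable_comment` is hypothetical, the comment character is assumed to be '%' as in the de_DE example earlier in the thread, and no such tool is known to exist in glibc. It generates a human-readable comment from a line written with <Uxxxx> names, so that an author can review the decoded text before committing.

```python
import re

# Matches symbolic code point names such as <U00E7>.
UCS_NAME = re.compile(r"<U([0-9A-Fa-f]{4,8})>")


def readable_comment(line, comment_char="%"):
    """Build a review comment showing line with <Uxxxx> names decoded.

    chr() raises ValueError for values above U+10FFFF, so a malformed
    name is reported instead of being silently passed through.
    """
    decoded = UCS_NAME.sub(lambda m: chr(int(m.group(1), 16)), line)
    return f"{comment_char} {decoded}"


if __name__ == "__main__":
    entry = 'country_name "Fran<U00E7>a"'
    # The comment goes above the entry; the entry itself stays portable.
    print(readable_comment(entry))
    print(entry)
```

The opposite direction described in the message (UTF-8 in a markup comment, code points generated from it) would be the inverse substitution; which direction is authoritative is a workflow decision the thread leaves open.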
* Re: Is it OK to write ASCII strings directly into locale source files?
From: Florian Weimer @ 2017-07-25 14:21 UTC
To: Carlos O'Donell; +Cc: Mike FABIAN, Andreas Schwab, libc-alpha

* Carlos O'Donell:

> On 07/25/2017 02:20 AM, Mike FABIAN wrote:
>> Carlos O'Donell <carlos@redhat.com> wrote:
>>
>>> My only argument is that when you are forced to use <Uxxx> encoding it
>>> is empirically less likely you'll make a mistake. Like reading a sentence
>>> backwards to catch errors since it prevents your brain from filling in
>>> the missing information.
>>
>> But there are also many mistakes because somebody mistyped code points.
>> Several weird typos in things like month names look as if somebody
>> mistyped code points.
>
> Ultimately I defer to your judgement as localedata maintainer to create
> a workflow that is easy for you and benefits your work.
>
> However, I caution against throwing away the compatibility of our locales
> with POSIX, which doesn't seem to allow UTF-8 in the specification.

It does, to some extent:

| A character in the portable character set can be represented by the
| character itself, in which case the value of the character is
| implementation-defined. (Implementations may allow other characters
| to be represented as themselves, but such locale definitions are not
| portable.)

You'll need a very hostile interpretation to say that this doesn't
allow multi-byte character sequences in localedef input.

But I found this in the guts of localedef:

    /* The standards leave it up to the implementation to decide
       what to do with character which stand for themself.  We
       could jump through hoops to find out the value relative to
       the charmap and the repertoire map, but instead we leave
       it up to the locale definition author to write a better
       definition.  We assume here that every character which
       stands for itself is encoded using ISO 8859-1.  Using the
       escape character is allowed.  */

So we currently hard-code ISO 8859-1 (not UTF-8) to avoid the
bootstrapping problem.
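[Editor's sketch, not part of the original message] The practical effect of the ISO 8859-1 assumption can be modelled in a few lines. This is not localedef's code; the function name `read_as_iso_8859_1` is hypothetical, and the model simply takes the quoted comment at face value: each input byte of a literal character is treated as one ISO 8859-1 character.

```python
def read_as_iso_8859_1(text: str) -> str:
    """Model a UTF-8 encoded literal being read one byte per character."""
    return text.encode("utf-8").decode("iso-8859-1")


if __name__ == "__main__":
    for sample in ("%a %d %b %Y %T %Z", "Fran\u00e7a"):
        seen = read_as_iso_8859_1(sample)
        # Show the code points a byte-per-character reader would end up with.
        print(sample, "->", " ".join(f"U+{ord(c):04X}" for c in seen))
```

Under this model, pure ASCII strings come through unchanged, because ASCII bytes mean the same thing in UTF-8 and ISO 8859-1. The two UTF-8 bytes of U+00E7 (0xC3 0xA7), however, would be read as the two characters U+00C3 and U+00A7, which is why literal UTF-8 needs more work than literal ASCII.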
* Re: Is it OK to write ASCII strings directly into locale source files?
From: Carlos O'Donell @ 2017-07-25 14:37 UTC
To: Florian Weimer; +Cc: Mike FABIAN, Andreas Schwab, libc-alpha

On 07/25/2017 10:12 AM, Florian Weimer wrote:
> * Carlos O'Donell:
>
>> On 07/25/2017 02:20 AM, Mike FABIAN wrote:
>>> Carlos O'Donell <carlos@redhat.com> wrote:
>>>
>>>> My only argument is that when you are forced to use <Uxxx> encoding it
>>>> is empirically less likely you'll make a mistake. Like reading a sentence
>>>> backwards to catch errors since it prevents your brain from filling in
>>>> the missing information.
>>>
>>> But there are also many mistakes because somebody mistyped code points.
>>> Several weird typos in things like month names look as if somebody
>>> mistyped code points.
>>
>> Ultimately I defer to your judgement as localedata maintainer to create
>> a workflow that is easy for you and benefits your work.
>>
>> However, I caution against throwing away the compatibility of our locales
>> with POSIX, which doesn't seem to allow UTF-8 in the specification.
>
> It does, to some extent:
>
> | A character in the portable character set can be represented by the
> | character itself, in which case the value of the character is
> | implementation-defined. (Implementations may allow other characters
> | to be represented as themselves, but such locale definitions are not
> | portable.)
>
> You'll need a very hostile interpretation to say that this doesn't
> allow multi-byte character sequences in localedef input.

I see what you're saying, which is that we are *still* POSIX compliant,
but not portable?

I assume we are focusing on the "()" text which allows some kind of escape
hatch outside of the portable character set and allows us to use UTF-8?

> But I found this in the guts of localedef:
>
>     /* The standards leave it up to the implementation to decide
>        what to do with character which stand for themself.  We
>        could jump through hoops to find out the value relative to
>        the charmap and the repertoire map, but instead we leave
>        it up to the locale definition author to write a better
>        definition.  We assume here that every character which
>        stands for itself is encoded using ISO 8859-1.  Using the
>        escape character is allowed.  */
>
> So we currently hard-code ISO 8859-1 (not UTF-8) to avoid the
> bootstrapping problem.

We could just assume UTF-8, but yes, it looks like this needs a little bit
more looking into.

Either way, I support using the portable character set today, and that's
a step forward.

--
Cheers,
Carlos.
* Re: Is it OK to write ASCII strings directly into locale source files?
From: Florian Weimer @ 2017-07-25 19:05 UTC
To: Carlos O'Donell; +Cc: Mike FABIAN, Andreas Schwab, libc-alpha

* Carlos O'Donell:

>>> However, I caution against throwing away the compatibility of our locales
>>> with POSIX, which doesn't seem to allow UTF-8 in the specification.
>>
>> It does, to some extent:
>>
>> | A character in the portable character set can be represented by the
>> | character itself, in which case the value of the character is
>> | implementation-defined. (Implementations may allow other characters
>> | to be represented as themselves, but such locale definitions are not
>> | portable.)
>>
>> You'll need a very hostile interpretation to say that this doesn't
>> allow multi-byte character sequences in localedef input.
>
> I see what you're saying, which is that we are *still* POSIX compliant,
> but not portable?

Right, and I think that's okay because the glibc locales are for
glibc.

> I assume we are focusing on the "()" text which allows some kind of escape
> hatch outside of the portable character set and allows us to use UTF-8?

Exactly.

>> But I found this in the guts of localedef:
>>
>>     /* The standards leave it up to the implementation to decide
>>        what to do with character which stand for themself.  We
>>        could jump through hoops to find out the value relative to
>>        the charmap and the repertoire map, but instead we leave
>>        it up to the locale definition author to write a better
>>        definition.  We assume here that every character which
>>        stands for itself is encoded using ISO 8859-1.  Using the
>>        escape character is allowed.  */
>>
>> So we currently hard-code ISO 8859-1 (not UTF-8) to avoid the
>> bootstrapping problem.
>
> We could just assume UTF-8, but yes, it looks like this needs a little bit
> more looking into.

Yes, and we don't have a real bootstrapping problem because while we
have a charmap file for UTF-8, we have a separate UTF-8 implementation
in iconv/gconv, and we could use that to break the loop.

> Either way, I support using the portable character set today, and that's
> a step forward.

Agreed.
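[Editor's sketch, not part of the original message] The interim position both sides agree on (write portable characters directly, keep code points for everything else) could be approximated by a converter like the one below. Several things here are assumptions rather than anything stated in the thread: printable ASCII (0x20 to 0x7E) stands in for the POSIX portable character set, `<`, `>` and `"` are treated as reserved, `/` is assumed to be the escape character, and the name `mixed_form` is hypothetical.

```python
def mixed_form(text: str, escape_char: str = "/") -> str:
    """Keep safe ASCII characters literal; write the rest as <Uxxxx>."""
    # Assumption: these characters are unsafe to write literally in a string.
    reserved = set('<>"') | {escape_char}
    parts = []
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp <= 0x7E and ch not in reserved:
            parts.append(ch)
        else:
            # Assumption: 4 hex digits inside the BMP, 8 hex digits above it.
            width = 4 if cp <= 0xFFFF else 8
            parts.append(f"<U{cp:0{width}X}>")
    return "".join(parts)


if __name__ == "__main__":
    print(mixed_form("%a %d %b %Y %T %Z"))
    print(mixed_form("Fran\u00e7a"))
```

With these assumptions the date format stays fully readable, and the country name comes out in the "mostly ASCII" form proposed at the start of the thread, with only the non-ASCII character escaped.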
Thread overview: 20+ messages

2017-07-24 13:13 Is it OK to write ASCII strings directly into locale source files? Mike FABIAN
2017-07-24 13:28 ` Carlos O'Donell
2017-07-24 13:32 ` Mike FABIAN
2017-07-24 14:47 ` Carlos O'Donell
2017-07-24 15:03 ` Mike FABIAN
2017-07-24 15:45 ` Carlos O'Donell
2017-07-24 22:39 ` Rafal Luzynski
2017-07-24 22:55 ` Carlos O'Donell
2017-07-24 14:49 ` Andreas Schwab
2017-07-24 15:07 ` Carlos O'Donell
2017-07-24 17:07 ` Florian Weimer
2017-07-24 20:07 ` Carlos O'Donell
2017-07-24 22:34 ` Florian Weimer
2017-07-24 22:51 ` Rafal Luzynski
2017-07-25 5:40 ` Carlos O'Donell
2017-07-25 6:27 ` Mike FABIAN
2017-07-25 12:48 ` Carlos O'Donell
2017-07-25 14:21 ` Florian Weimer
2017-07-25 14:37 ` Carlos O'Donell
2017-07-25 19:05 ` Florian Weimer