Hi Carlos et al.

On Friday, June 7, 2019 2:57 AM, Carlos O'Donell wrote:
> I have a weak preference for 1. However, I would change my preference if
> someone showed me existing prior implementations that did 1 or 2.

1. glibc already transliterates capital letters and ligatures to fully
capitalized sequences in locale/C-translit.h.in:

"\x00c6" "AE" # LATIN CAPITAL LETTER AE
"\x0132" "IJ" # LATIN CAPITAL LIGATURE IJ

2. I would just like to quote myself from 2018:
"collisions due to "one symbol capitalization" would cause irreversible
damage to data. For a library like glibc this seems to be a very
relevant issue to consider..."

On 03.11.18 00:27, Egor Kobylkin wrote:
> On 02.11.18 23:22, Rafal Luzynski wrote:
>>> * Consistently transliterate single uppercase Cyrillic letters to
>>> sequences of all uppercase Latin letters in all languages
>>> (whenever a Cyrillic letter is transliterated to more than one
>>> Latin letter), for example "Ї" is now transliterated as "YI" rather
>>> than "Yi".
>> I think you have not yet explained whether this is required by any
>> existing standard (please provide links) or whether this is your
>> genuine idea to distinguish between the cases like "Ш" transliterated
>> to "Sh" and "Сх" also transliterated to "Sh".
>
> I remember seeing this form of capitalization in actual
> transliterated texts a long time ago but can't find a formal description
> as of now. Just don't want to claim this to be my original idea.
>
>>> The choice for YO, SH, YA, ZH etc. is to avoid naming collisions for
>>> example for "Сх" and "Ш" that would both transliterate to Sh:
>>> With SH:"Схема"->"Shema" but "Шема"->"SHema"
>>> With Sh:"Схема"->"Shema" and "Шема"->"Shema". Collision!
>>> This is important e.g. for renaming files, grouping as in using uniq
>>> etc.
> As for the users - I am a user and I have demonstrated the use cases
> where the collisions due to "one symbol capitalization" would cause
> irreversible damage to data.
> For a library like glibc this seems like a
> relevant issue to consider.
>
> The "two symbol capitalization" on the other hand would prevent
> collision and can be easily corrected in the userspace if needed
> with something like
>
> foo="SHema"
> foo="${foo:0:1}$(tr '[:upper:]' '[:lower:]' <<<${foo:1})"
> echo "$foo"
> Shema
>
> It looks like everyone really using transliteration for something
> sensitive has already done it in userspace since at least 2006 when
> this bug was first logged. So we won't break the official use cases
> where the capitalization should be done in a certain way. But we will
> prevent new bugs due to collision if we use "two symbol capitalization"
> indeed.
>
> Happy to hear arguments to the contrary.

Bests,
Diego

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, June 7, 2019 2:57 AM, Carlos O'Donell wrote:

> On 6/6/19 5:31 PM, Rafal Luzynski wrote:
>
> > > > Possible answers (Cyrillic -> Latin Extended -> ASCII):
> > > >
> > > > 1. "Ш" -> "Š" -> "SH"
> > > > e.g.: "Шема" -> "Šema" -> "SHema"
> > > > "Схема" ----------> "Shema"
> > > >
> > > > 2. "Ш" -> "Š" -> "Sh"
> > > > e.g.: "Шема" -> "Šema" -> "Shema"
> > > > "Схема" ----------> "Shema"
> > > >
> > > > Personally I don't like the answer 1. because "SHema" looks weird
> > > > to me. Egor in turn does not like the answer 2. because the output
> > > > string becomes ambiguous.
> > > > Should we maybe have a smart algorithm which would select the title
> > > > case or the upper case of the output characters depending on the
> > > > context in the word? Note that it would not resolve the problem of
> > > > the output text being ambiguous.
> > >
> > > It seems clear that there is no one right/wrong answer but it's a matter
> > > of preference, especially the way this currently works.
> > > It might be an
> > > improvement to output (for instance) SH instead of Sh if all the other
> > > letters of a word are upper-case as well but not sure what would help
> > > with the result being unambiguous.
> >
> > I think you refer to the idea of implementing a smart algorithm which would
> > adapt the lower/upper case depending on the context but indeed it would
> > not resolve the problem of ambiguity.
> > So, the smart algorithm aside, what should be the preferred transliteration
> > rule?
>
> I have a weak preference for 1. However, I would change my preference if
> someone showed me existing prior implementations that did 1 or 2.
>
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
> Cheers,
> Carlos.
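
P.S. For anyone who wants to reproduce the collision from the thread
without touching locale data, here is a minimal Python sketch. The two
tables are hypothetical and cover only the letters of the two example
words; they are not glibc's actual transliteration tables, and the
function names are made up for illustration.

```python
# Hypothetical toy tables covering only the letters in the two example
# words from the thread; this is NOT glibc's real locale data.
BASE = dict(zip("Схема", "Shema"))   # С->S, х->h, е->e, м->m, а->a
TITLE_CASE = {**BASE, "Ш": "Sh"}     # answer 2: "Ш" -> "Sh"
UPPER_CASE = {**BASE, "Ш": "SH"}     # answer 1: "Ш" -> "SH"


def translit(word, table):
    """Map each character through the table; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in word)


def to_title(word):
    """Userspace fix-up, like the shell snippet: keep the first character, lowercase the rest."""
    return word[:1] + word[1:].lower()


words = ["Схема", "Шема"]
title = [translit(w, TITLE_CASE) for w in words]
upper = [translit(w, UPPER_CASE) for w in words]

print(title)                       # ['Shema', 'Shema']  (two inputs, one output)
print(upper)                       # ['Shema', 'SHema']  (outputs stay distinct)
print([to_title(w) for w in upper])  # ['Shema', 'Shema']  (opt-in normalization)
```

The point of the sketch is only that "SH" keeps the two inputs apart
and the caller can still fold the result to "Sh" afterwards, whereas
nothing can recover the distinction once the library has emitted "Sh".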