On 07/27/2017 01:56 PM, Steven Penny wrote:
> On Thu, 27 Jul 2017 12:08:53, Eric Blake wrote:
>> I've got some time today to look at building readline, but for the life
>> of me, I can't figure out what I'm supposed to be debugging. You have
>> so many emails saying "see this earlier URL" that I am lost in what you
>> are saying is wrong or how to reproduce it.
>
> Thanks for this. Between your 2 emails, youve put a lot on the table.
> Instead of getting overwhelmed, I will just start my side of the convo
> by replaying the problem. Then if you need more from me I am happy to
> help. So, here is an example problem using LATIN SMALL LETTER O WITH
> DIAERESIS' (U+00F6):
>
> $ chcp.com 65001

I still don't know your environment (it's really hard to reproduce
issues if I don't know the steps to reproduce them). This looks like a
bash prompt, but are you running bash inside mintty, or directly in a
cmd window?

When I first open a mintty window to get bash, I see:

$ chcp.com
Active code page: 437

and in that environment, typing the alt- sequence displays nothing, but
hitting Enter then displays:

-bash: $'\302\224': command not found

which maps to \xc2\x94; I can confirm that with 'od -tx1'. Trying the
other alt- sequence gives a different character (¦), as \xc2\xa6.

When I then do

$ chcp.com 65001
Active code page 65001

I don't see any change in behavior.

But if I first open a cmd window, with NO bash in the mix, I see:

c:\cygwin\bin> chcp
Active code page: 437

where both alt- sequences output ö, and where 'od -tx1' confirms both
sequences produce \xc3\xb6. Then switching code pages:

c:\cygwin\bin> chcp 65001
Active code page: 65001

directly typing the alt- sequence prints nothing, while 'od -tx1' still
shows that it received \xc3\xb6.

I have no idea how alt- sequences are mapped to code points (it is not
as trivial as a conversion of base to get either the Unicode code-point
of 0x96 or to the UTF-8 encoding), but it appears that the input within
cmd is the same, while the choice of code page determines what the
output will be.
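To make the byte sequences reported by 'od -tx1' above easier to compare, here is a minimal Python sketch that decodes each one as UTF-8 and names the code point it carries. The helper name describe() is hypothetical, chosen only for this illustration; it is not part of bash, Cygwin, or mintty.

```python
import unicodedata


def describe(raw: bytes) -> str:
    """Decode a UTF-8 byte sequence holding one character and name it."""
    ch = raw.decode("utf-8")
    # Control characters such as U+0094 have no Unicode name, so supply
    # a fallback string instead of letting unicodedata.name() raise.
    name = unicodedata.name(ch, "<unnamed control>")
    return f"U+{ord(ch):04X} {name}"


# The three sequences seen in the mintty and cmd experiments.
for seq in (b"\xc2\x94", b"\xc2\xa6", b"\xc3\xb6"):
    print(seq, describe(seq))
```

Only \xc3\xb6 decodes to ö (U+00F6). The \xc2\x94 seen under mintty decodes to U+0094, a C1 control character, which may be why nothing visible was displayed there.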
I also have no idea why the alt- sequences produce different inputs
under cmd than under mintty. So knowing WHAT environment you are using
is VITAL to me understanding the results you are seeing.

At any rate, I definitely know that U+00F6 is encoded as \xc3\xb6 in
UTF-8 (I confirmed that on Linux, with echo $'\xc3\xb6'). I _don't_
know what it is encoded as in Windows code page 437 or 65001. But a
quick google later, and I see that for code page 437
(https://en.wikipedia.org/wiki/Code_page_437), ö is at codepoint 0x94
(decimal 148, octal 0224); meanwhile, 0xf6 is equal to decimal 246.

Aha - maybe that explains the two alt- sequences under codepage 437:
without a leading zero, you are typing the decimal position which looks
up the character from the current code page; WITH a leading zero you are
directly requesting the decimal encoding of a Unicode character.

And trying some other sequences, I note that õ ('LATIN SMALL LETTER O
WITH TILDE' (U+00F5)) is not part of code page 437; so there is nothing
I can type without a leading 0 to print one; conversely, trying the
leading-zero sequence, which requests the same Unicode character,
displays merely 'o' (apparently U+006f), which, when you lack
o-with-tilde, is a reasonable fallback compared to printing nothing at
all. Either way, the character requested by the alt-sequence in the cmd
window is then transformed by Cygwin into the appropriate UTF-8 input
for the tty stdin of the Cygwin child process.

Hmm; repeating those sequences under 'od -tx1', when I try the
leading-zero sequence for õ, I see something interesting: the moment I
press 5 (while still holding alt), the display prints [G; then releasing
alt prints o; the transcription is then

0000000 1b 1b 5b 47 c3 b5 0a

which is ESC ESC [ G (hmm - that's the ANSI terminal escape sequence for
moving the cursor to the first column), followed by the actual Unicode
õ, before my ending newline. No idea why that is leaking through to
Cygwin to pick up as input.
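The numbers above can be cross-checked with a short Python sketch. It models only the guess made in this message about the leading zero, not documented Windows behaviour, and it assumes Python's built-in cp437 codec agrees with the Wikipedia table for code page 437. The function names are hypothetical.

```python
def alt_without_zero(n: int) -> str:
    """Guess: decimal n is a position in code page 437."""
    return bytes([n]).decode("cp437")


def alt_with_zero(n: int) -> str:
    """Guess: decimal n is taken directly as a Unicode code point."""
    return chr(n)


def in_cp437(ch: str) -> bool:
    """Report whether code page 437 has a slot for this character."""
    try:
        ch.encode("cp437")
        return True
    except UnicodeEncodeError:
        return False


# Decimal 148 via the code page and decimal 246 taken directly should
# name the same character, whose UTF-8 form is c3 b6.
print(ascii(alt_without_zero(148)), ascii(alt_with_zero(246)))
print(alt_with_zero(246).encode("utf-8"))

# U+00F5 (o with tilde) has no position in code page 437.
print(in_cp437("\u00f5"))

# The od transcript: two ESC bytes, "[G", the UTF-8 for U+00F5, newline.
transcript = bytes.fromhex("1b 1b 5b 47 c3 b5 0a")
print(ascii(transcript.decode("utf-8")))
```

Under these assumptions both routes yield U+00F6, and the transcript decodes cleanly as UTF-8, so the stray ESC ESC [ G bytes sit in front of a well-formed õ rather than corrupting it.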
Is Windows trying to beep at me to tell me my Unicode request doesn't
exist in the current code page? Except that beep is Ctrl-G (U+0007).

But when I switch to code page 65001, Wikipedia redirects me to UTF-8.
So in that code page, presumably all ALT sequences represent themselves,
whether or not there is a leading 0? No, experimentation shows
otherwise: one alt- sequence shows nothing (and not the smiley face from
codepage 437); while another shows ^B (where ctrl-b really is code point
2). I have no idea WHAT sequence would thus give you ö.

> Now you might say, why not just use codepage 437? Which is exactly what
> Corinna did say:
>
> http://cygwin.com/ml/cygwin/2017-03/msg00193.html

Well, obviously, the code page matters to cmd; and I have no idea what
alt- sequences do (or are supposed to do) under mintty. So there may
STILL be some lingering craziness on what Cygwin itself should do when
it recognizes an alt- sequence coming in (if Cygwin translates from the
current code page to Unicode, where the current code page definitely
affects which character is desired); and that's _in addition_ to what
appears to be the craziness in bash when reconstructing the UTF-8
sequence for omega Ω as mentioned in my other mail.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org