From mboxrd@z Thu Jan  1 00:00:00 1970
From: Per Bothner
To: gcc2@gnu.org, egcs@cygnus.com
Subject: Re: thoughts on martin's proposed patch for GCC and UTF-8
Date: Mon, 21 Dec 1998 19:16:00 -0000
Message-id: <199812220316.TAA06758@cygnus.com>
References: <199812220210.SAA08262@shade.twinsun.com>
X-SW-Source: 1998-12/msg00826.html

I too am rather leery of using #pragma locale or any other in-band
indicator of the character set.

Paul mentions the problem of converting a set of text files from one
encoding to another. Perhaps someone in Western Europe wants to
examine a program with its documentation, but both were written in
China. It makes sense to convert it to the local character set first.
If the original program contains #pragma locale statements, these have
to be translated also, but expecting a character-set translation tool
to understand C syntax seems a bit much. If you *don't* do the
translation, all your other tools (emacs, less, grep, etc) need to
understand the #pragma locale statement, which again seems
unreasonable.

Another problem is that switching character encoding in-band may be
difficult. Many libraries do not support it. The Java FileReader
class requires you to specify the encoding at *open* time. Of course
there are various work-arounds. For example, you can try opening the
file in UTF-8 mode, and if you see a #pragma locale statement, re-open
it in the appropriate mode. Still, this is not something application
programmers should have to deal with.

The only general solution, I think, is for the *file system* and/or
input library to do the translation. Preferably each file should
specify its encoding out-of-band, just like MIME does. As a back-up,
the user should be able to specify a default encoding (based on their
locale), and perhaps override it for individual files.

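To make the work-around and the fallback order above concrete, here is
a rough Python sketch. It is only an illustration under assumptions of
mine: the quoted-argument form of #pragma locale, the helper names
sniff_pragma_encoding and decode_source, and the idea that the pragma
names a codec directly (a real locale name such as ja_JP.eucJP would
need a mapping step) are all hypothetical. Instead of opening the file
twice, it scans the raw bytes, which only works when the pragma line
itself is ASCII-compatible.

```python
import re

# Hypothetical in-band marker syntax; nothing in this thread fixes the exact form.
PRAGMA_RE = re.compile(
    rb'^[ \t]*#[ \t]*pragma[ \t]+locale[ \t]+"?([A-Za-z0-9._\-]+)"?',
    re.MULTILINE,
)


def sniff_pragma_encoding(raw: bytes):
    """Return the name given by the first '#pragma locale' line, or None."""
    match = PRAGMA_RE.search(raw)
    return match.group(1).decode("ascii") if match else None


def decode_source(raw: bytes, override=None, default="utf-8"):
    """Choose an encoding by priority: per-file override, in-band pragma,
    then the user's default. Returns (decoded text, encoding used)."""
    encoding = override or sniff_pragma_encoding(raw) or default
    return raw.decode(encoding), encoding


# Usage: a Latin-1 source file that announces its own encoding in-band.
raw = b'#pragma locale "latin-1"\nchar *s = "caf\xe9";\n'
text, used = decode_source(raw)
print(used)  # latin-1
```

Note that the caller still has to hold the undecoded bytes and decide
the encoding itself, which is exactly the burden I would rather see
handled by the file system or the input library.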
Still, while #pragma locale does have its problems, and we must also
support other ways of getting character encoding information, it might
still be a useful *alternative* method for specifying the encoding.
One useful data point is that the XML specification provides a
declaration for specifying the character encoding in use. See:
http://www.w3.org/TR/PR-xml-971208#NT-EncodingDecl

The XML spec also includes an appendix on auto-detection:
http://www.w3.org/TR/PR-xml-971208#sec-guessing

	--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner