From mboxrd@z Thu Jan  1 00:00:00 1970
From: Per Bothner <bothner@cygnus.com>
To: gcc2@gnu.org, egcs@cygnus.com
Subject: Re: thoughts on martin's proposed patch for GCC and UTF-8 
Date: Mon, 21 Dec 1998 19:16:00 -0000
Message-id: <199812220316.TAA06758@cygnus.com>
References: <199812220210.SAA08262@shade.twinsun.com>
X-SW-Source: 1998-12/msg00826.html

I too am rather leery of using #pragma locale or any other in-band
indicator of the character set.

Paul mentions the problem of converting a set of text files from one
encoding to another.  Perhaps someone in Western Europe wants
to examine a program with its documentation, but both were written
in China.  It makes sense to convert it to the local character
set first.  If the original program contains #pragma locale statements,
these have to be translated also, but expecting a character-set
translation tool to understand C syntax seems a bit much.

If you *don't* do the translation, all your other tools (emacs,
less, grep, etc) need to understand the #pragma locale statement,
which again seems unreasonable.

Another problem is that switching character encoding
in-band may be difficult.  Many libraries do not support it.
The Java FileReader class requires you to specify the encoding
at *open* time.  Of course there are various work-arounds.
For example, you can try opening the file in UTF-8 mode,
and if you see a #pragma locale statement, re-open it in the
appropriate mode.  Still, this is not something application
programmers should have to deal with.
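
A rough sketch of that re-open work-around, in Python rather than
Java for brevity.  The pragma syntax matched here (#pragma locale
followed by an optionally quoted name) and the idea that the name
is directly usable as a codec name are both assumptions; the
thread does not pin down either.  open_source is a hypothetical
helper, not part of any proposed patch.

```python
import re

# Assumed syntax: #pragma locale "NAME", where NAME is a codec name.
PRAGMA = re.compile(r'^\s*#\s*pragma\s+locale\s+"?([\w.@-]+)"?')

def open_source(path, first_guess="utf-8"):
    """Open path for reading, honoring an in-band pragma if one is found."""
    # First pass: decode leniently so stray non-UTF-8 bytes do not
    # abort the scan before the pragma line is reached.
    with open(path, encoding=first_guess, errors="replace") as f:
        for line in f:
            m = PRAGMA.match(line)
            if m:
                # Second pass: re-open with the encoding the pragma names.
                return open(path, encoding=m.group(1))
    # No pragma seen: fall back to the first guess.
    return open(path, encoding=first_guess)
```

This only works when the pragma line itself survives decoding with
the first guess, which holds for ASCII-compatible encodings but not
for, say, UTF-16.  Every tool that reads the file would need the
same two-pass logic.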

The only general solution I think is for the *file system*
and/or input library to do the translation.  Preferably
each file should specify its encoding out-of-band,
just like MIME does.  As a back-up, the user should be
able to specify a default encoding (based on their locale),
and perhaps override it for individual files.
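
The lookup order described above could be sketched as follows, in
Python.  The function names, the overrides mapping, and the use of
a MIME Content-Type string as the out-of-band label are all
assumptions made for illustration; no file system or compiler
interface is implied.

```python
import locale
from email.message import Message

def charset_from_content_type(content_type):
    """Return the charset parameter of a MIME Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

def resolve_encoding(path, overrides=None, content_type=None):
    """Pick an encoding for path without looking inside the file."""
    # 1. An explicit per-file override from the user wins.
    if overrides and path in overrides:
        return overrides[path]
    # 2. Otherwise use an out-of-band label attached to the file.
    if content_type:
        charset = charset_from_content_type(content_type)
        if charset:
            return charset
    # 3. Otherwise fall back to the default from the user's locale.
    return locale.getpreferredencoding(False)
```

For example, resolve_encoding("b.c", content_type="text/x-c;
charset=EUC-JP") yields "euc-jp" without the reader ever having to
parse the file's contents.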

Still, while #pragma locale does have its problems, and
we must also support other ways for getting character
encoding information, it might still be a useful
*alternative* method for specifying the encoding.

One useful data point is that the XML specification provides
an in-band declaration for naming the character encoding in use.
See: http://www.w3.org/TR/PR-xml-971208#NT-EncodingDecl
The XML spec also includes an appendix on auto-detection:
http://www.w3.org/TR/PR-xml-971208#sec-guessing
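
A simplified Python sketch of that kind of auto-detection: look at
the first few bytes for a byte-order mark or a recognizable
encoding of "<?", and only then read the encoding declaration.
This covers a subset of the cases in the appendix (the UCS-4 and
EBCDIC cases are left out), and guess_xml_encoding is a
hypothetical name, not something defined by the spec.

```python
import re

# Encoding declaration inside an ASCII-compatible XML declaration.
ENCODING_DECL = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)

def guess_xml_encoding(head):
    """Guess the encoding of an XML entity from its leading bytes."""
    # Byte-order marks, then "<?" as it appears in 16-bit encodings.
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if head.startswith(b"\xfe\xff") or head.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if head.startswith(b"\xff\xfe") or head.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    # ASCII-compatible start: the declaration itself can now be read.
    m = ENCODING_DECL.match(head)
    if m:
        return m.group(1).decode("ascii").lower()
    # No mark and no declaration: XML defaults to UTF-8.
    return "utf-8"
```

The trick works for XML because every document starts with a known
sequence of characters.  A C source file has no such fixed prefix,
so the same approach would be less reliable for #pragma locale.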

	--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner