questions about new multibyte character support in EGCS/GCC2

public inbox for gcc@gcc.gnu.org

* questions about new multibyte character support in EGCS/GCC2
@ 1998-12-05 22:09 Paul Eggert
  1998-12-07  8:39 ` Dave Brolley
  0 siblings, 1 reply; 5+ messages in thread
From: Paul Eggert @ 1998-12-05 22:09 UTC
  To: egcs, gcc2; +Cc: Dave Brolley

In July multibyte character support was added to EGCS, and these
changes recently got folded into GCC2.  E.g. now strings can contain
shift-JIS (which formerly was troublesome in strings since it uses '\'
bytes to encode Japanese characters).
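
To make the shift-JIS problem concrete: the second byte of some
two-byte shift-JIS characters is 0x5C, the same value as ASCII '\'.
The following is a minimal sketch in C, not the EGCS lexer code.  The
function names are hypothetical, and the lead-byte ranges are the
commonly used ones (0x81-0x9F and 0xE0-0xFC).

```c
#include <stddef.h>

/* Nonzero if C is the first byte of a two-byte shift-JIS character
   (commonly used ranges; an assumption of this sketch).  */
static int
sjis_lead_byte (unsigned char c)
{
  return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* S points just past the opening '"' of a string literal and N is the
   number of bytes available.  Return the index of the closing '"', or
   N if there is none.  If SJIS is nonzero, skip two-byte characters
   whole so that a 0x5C trail byte is not mistaken for an escape.  */
size_t
find_closing_quote (const char *s, size_t n, int sjis)
{
  size_t i = 0;

  while (i < n)
    {
      unsigned char c = (unsigned char) s[i];

      if (sjis && sjis_lead_byte (c) && i + 1 < n)
        i += 2;                 /* two-byte character */
      else if (c == '\\' && i + 1 < n)
        i += 2;                 /* backslash escape */
      else if (c == '"')
        return i;
      else
        i++;
    }
  return n;
}
```

Given the three bytes 0x95 0x5C '"', the byte-wise scan (sjis == 0)
treats 0x5C as an escape and runs past the closing quote, while the
shift-JIS-aware scan finds the quote at index 2.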

I'm looking into adding draft-C9x support to the C preprocessor and
lexer.  Among other things, draft C9x specifies the relationship
between multibyte chars and \u escapes.  I have some questions about
the EGCS/GCC2 multibyte support, though.

* As far as I can tell, the multibyte functionality isn't documented;
  is this intentional?  Is it documented somewhere outside the EGCS
  distribution?

* The cccp.c startup code currently looks like this:

    literal_codeset = getenv ("LANG");

  but the usual way in other programs is to look at LC_ALL first, then
  LC_CTYPE, and then LANG last of all.  Why are LC_ALL and LC_CTYPE
  being ignored here?

* mbchar.c supports the quasi-LC_CTYPE locales "C-SJIS", "C-EUCJP",
  and "C-JIS".  Apparently one is supposed to set LANG to one of these
  values if you want to use this functionality -- if you use an
  ordinary value for LANG (e.g. "ja" in Solaris) then you get its
  interpretation.  Are the "C-*" quasi-locales meant for
  cross-compiling or something like that?  Is this undocumented
  functionality being used?

  It seems awkward to usurp LANG for something that is not strictly
  locale-related.  If this functionality is needed, perhaps it should
  be a compiler option instead?  Another possibility might be to use a
  different environment variable (e.g. CROSS_LANG) but allow it to use
  the same values as LANG.  If the functionality is not needed, it might
  be simpler to rename local_mblen to mblen, which would bypass the need
  for separately maintained multibyte functions; one could simply use
  the system functions.

* It appears to me that the multibyte lexing code could be sped up quite
  a bit by using the draft C9x multibyte functions, if available.  Any
  thoughts before I start hacking in this direction?
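
The lookup order mentioned in the second question (LC_ALL, then
LC_CTYPE, then LANG) can be sketched as follows.  This is a minimal
illustration, not a patch to cccp.c, and the function name
ctype_locale_name is hypothetical.  A variable that is set but empty
is treated as unset, which matches how POSIX describes these
variables.

```c
#include <stddef.h>
#include <stdlib.h>

/* Return the value of the first of LC_ALL, LC_CTYPE and LANG that is
   set and non-empty, or NULL if none of them is.  */
const char *
ctype_locale_name (void)
{
  static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
  size_t i;

  for (i = 0; i < sizeof vars / sizeof vars[0]; i++)
    {
      const char *value = getenv (vars[i]);

      if (value && *value)
        return value;
    }
  return NULL;
}
```

The result could then be used wherever the value of getenv ("LANG")
is used now.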


* Re: questions about new multibyte character support in EGCS/GCC2
  1998-12-05 22:09 questions about new multibyte character support in EGCS/GCC2 Paul Eggert
@ 1998-12-07  8:39 ` Dave Brolley
  1998-12-07 17:47   ` Paul Eggert
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Brolley @ 1998-12-07  8:39 UTC
  To: Paul Eggert; +Cc: egcs, gcc2

Paul Eggert wrote:

> In July multibyte character support was added to EGCS, and these
> changes recently got folded into GCC2.  E.g. now strings can contain
> shift-JIS (which formerly was troublesome in strings since it uses '\'
> bytes to encode Japanese characters).
>
> I'm looking into adding draft-C9x support to the C preprocessor and
> lexer.  Among other things, draft C9x specifies the relationship
> between multibyte chars and \u escapes.  I have some questions about
> the EGCS/GCC2 multibyte support, though.
>
> * As far as I can tell, the multibyte functionality isn't documented;
>   is this intentional?  Is it documented somewhere outside the EGCS
>   distribution?

It's in gcc/invoke.texi.

> * The cccp.c startup code currently looks like this:
>
>     literal_codeset = getenv ("LANG");
>
>   but the usual way in other programs is to look at LC_ALL first, then
>   LC_CTYPE, and then LANG last of all.  Why are LC_ALL and LC_CTYPE
>   being ignored here?

Hmm, I have to admit that I didn't know this.  If this is in fact the
standard sequence of variables, then feel free to fix it.

> * mbchar.c supports the quasi-LC_CTYPE locales "C-SJIS", "C-EUCJP",
>   and "C-JIS".  Apparently one is supposed to set LANG to one of these
>   values if you want to use this functionality -- if you use an
>   ordinary value for LANG (e.g. "ja" in Solaris) then you get its
>   interpretation.  Are the "C-*" quasi-locales meant for
>   cross-compiling or something like that?  Is this undocumented
>   functionality being used?

Yes, this is being used. It was meant for platforms where full locale
support is not present.

>   It seems awkward to usurp LANG for something that is not strictly
>   locale-related.  If this functionality is needed, perhaps it should
>   be a compiler option instead?  Another possibility might be to use a
>   different environment variable (e.g. CROSS_LANG) but allow it to use
>   the same values as LANG.  If the functionality is not needed, it might
>   be simpler to rename local_mblen to mblen, which would bypass the need
>   for separately maintained multibyte functions; one could simply use
>   the system functions.

Well, LANG was chosen specifically because it is the normal way to pass this
kind of information to the compiler, even though the systems where the "C-*"
names are used have no native locale support. I don't see a problem, since on
platforms where full locale support exists, the user can specify the native
locale name.

> * It appears to me that the multibyte lexing code could be sped up quite
>   a bit by using the draft C9x multibyte functions, if available.  Any
>   thoughts before I start hacking in this direction?

As long as things still work where these functions are not available, then I
think it's preferable to use the actual functions where possible.

Dave


* Re: questions about new multibyte character support in EGCS/GCC2
  1998-12-07  8:39 ` Dave Brolley
@ 1998-12-07 17:47   ` Paul Eggert
  1998-12-08  9:22     ` Dave Brolley
  0 siblings, 1 reply; 5+ messages in thread
From: Paul Eggert @ 1998-12-07 17:47 UTC
  To: brolley; +Cc: egcs, gcc2

   Date: Mon, 07 Dec 1998 11:42:00 -0500
   From: Dave Brolley <brolley@cygnus.com>

   > * It appears to me that the multibyte lexing code could be sped up quite
   >   a bit by using the draft C9x multibyte functions, if available.  Any
   >   thoughts before I start hacking in this direction?

   As long as things still work where these functions are not
   available, then I think it's preferable to use the actual functions
   where possible.

Sounds good, but there is a corollary.  For performance reasons, the
GCC multibyte code should invoke the actual multibyte functions
directly, instead of via an intermediary.  E.g. the code should invoke
mblen directly, instead of having local_mblen invoke mblen.  This is
because these functions are often implemented inline for speed, and
introducing an intermediary kills this performance optimization.  (The
GNU C library uses tricks like this.)

GCC should use local_mblen etc. only on translation hosts that don't
have proper multibyte functions.  I'll look into making this
performance optimization.  The documentation would have to be changed
slightly: it would say that e.g. LANG='C-EUCJP' is supported only on
hosts that don't have proper localization, and that if you're on, say,
Solaris you should do as the Solarians do and use LANG='ja'.  I don't
think this will be much of a problem, as it's standard practice.
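
One way to arrange the direct call is to pick the function at compile
time.  In the sketch below, USE_LOCAL_MBLEN and LEX_MBLEN are
hypothetical names, and fallback_mblen is a single-byte stand-in, not
the real local_mblen from mbchar.c.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical configuration macro: define it to 1 on hosts that lack
   usable multibyte functions.  */
#ifndef USE_LOCAL_MBLEN
#define USE_LOCAL_MBLEN 0
#endif

#if USE_LOCAL_MBLEN
/* Stand-in fallback that treats every byte as one character.  */
static int
fallback_mblen (const char *s, size_t n)
{
  if (s == NULL)
    return 0;                   /* encoding is not state-dependent */
  if (n == 0)
    return -1;
  return *s != '\0';
}
#define LEX_MBLEN(s, n) fallback_mblen (s, n)
#else
/* Expand straight to the system function, so that an inline or macro
   implementation of mblen is still used.  */
#define LEX_MBLEN(s, n) mblen (s, n)
#endif
```

Callers would write LEX_MBLEN (p, len) everywhere, and no wrapper
function sits between the lexer and the system mblen on hosts that
have one.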

Also, by the way, cpplib should not invoke mblen either directly or
indirectly, since mblen is not reentrant.  It should invoke mbrlen
instead, if available.  It might be more convenient to make the
local_* multibyte functions reentrant too.  I guess I'll look into
doing this as well...
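
A sketch of the reentrant style: the caller owns the mbstate_t
conversion state, so nothing depends on hidden static state of the
kind mblen may keep.  The function name count_mbchars is hypothetical;
mbrlen is the standard function declared in <wchar.h>.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count the multibyte characters in the N bytes starting at S, using
   the current locale.  Return (size_t) -1 if the bytes contain an
   invalid or incomplete character.  */
size_t
count_mbchars (const char *s, size_t n)
{
  mbstate_t state;
  size_t count = 0;

  memset (&state, 0, sizeof state);     /* initial conversion state */
  while (n > 0)
    {
      size_t len = mbrlen (s, n, &state);

      if (len == (size_t) -1 || len == (size_t) -2)
        return (size_t) -1;
      if (len == 0)
        len = 1;        /* null character; assumed to be one byte */
      s += len;
      n -= len;
      count++;
    }
  return count;
}
```

Because the state is a local variable, two scans can be in progress
at once without interfering with each other.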


* Re: questions about new multibyte character support in EGCS/GCC2
  1998-12-07 17:47   ` Paul Eggert
@ 1998-12-08  9:22     ` Dave Brolley
  1998-12-08 21:48       ` Paul Eggert
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Brolley @ 1998-12-08  9:22 UTC
  To: Paul Eggert; +Cc: egcs, gcc2

Paul Eggert wrote:

> GCC should use local_mblen etc. only on translation hosts that don't
> have proper multibyte functions.

local_mblen etc. are also required on systems that may have proper multibyte
functions but no support for S-JIS, JIS and EUCJP. As you mentioned, we also
use them for cross compilers, where the native multibyte functions are of no
use.

Dave


* Re: questions about new multibyte character support in EGCS/GCC2
  1998-12-08  9:22     ` Dave Brolley
@ 1998-12-08 21:48       ` Paul Eggert
  0 siblings, 0 replies; 5+ messages in thread
From: Paul Eggert @ 1998-12-08 21:48 UTC
  To: brolley; +Cc: egcs, gcc2

   Date: Tue, 08 Dec 1998 12:24:21 -0500
   From: Dave Brolley <brolley@cygnus.com>

   Paul Eggert wrote:

   > GCC should use local_mblen etc. only on translation hosts that don't
   > have proper multibyte functions.

   local_mblen etc. are also required on systems that may have proper multibyte
   functions but no support for S-JIS, JIS and EUCJP.

Yes.  It may be best to have an installer's switch to specify whether
local_mblen is desired.  It's a tradeoff between functionality
(typically for cross-compilers, it sounds like) and efficiency.
