[Bug locale/31030] New: ERA segments are nul-separated rather than semicolon-separated

public inbox for glibc-bugs@sourceware.org

* [Bug locale/31030] New: ERA segments are nul-separated rather than semicolon-separated
@ 2023-11-03 17:17 bugzilla at tecnocode dot co.uk
  2023-12-05 14:15 ` [Bug locale/31030] " schwab@linux-m68k.org
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: bugzilla at tecnocode dot co.uk @ 2023-11-03 17:17 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=31030

            Bug ID: 31030
           Summary: ERA segments are nul-separated rather than
                    semicolon-separated
           Product: glibc
           Version: unspecified
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: locale
          Assignee: unassigned at sourceware dot org
          Reporter: bugzilla at tecnocode dot co.uk
  Target Milestone: ---

nl_langinfo(3) says “Era description segments are separated by semicolons”, but
it appears that `nl_langinfo (ERA)` returns them separated by nul bytes
instead, and the only way to parse them is to additionally call the
(undocumented) `nl_langinfo (_NL_TIME_ERA_NUM_ENTRIES)` to find out how many
segments there are meant to be.

This can be demonstrated with the following example program (compile with `gcc
-o era era.c -Wall`):
```
#include <langinfo.h>
#include <locale.h>
#include <stdint.h>
#include <stdio.h>

int
main (void)
{
  setlocale (LC_ALL, "");
  const char *era = nl_langinfo (ERA);
  int n_entries = (int) (intptr_t) nl_langinfo (_NL_TIME_ERA_NUM_ENTRIES);

  printf ("n_entries: %d, era: %s\n", n_entries, era);

  return 0;
}
```

If run with `LANG=th_TH.UTF-8 ./era` it prints
```
n_entries: 1, era: +:1:-543/01/01:+*:พ.ศ.:%EC %Ey
```
which is all good.

However, if run with `LANG=ja_JP.UTF-8 ./era`, it prints:
```
n_entries: 11, era: +:2:2020/01/01:+*:令和:%EC%Ey年
```

There clearly aren’t 11 segments in the ’era’ description there — only one.
Looking at the ja_JP locale definition, there are correctly 11 segments defined
in it:
https://github.com/bminor/glibc/blob/master/localedata/locales/ja_JP#L14949-L14977

If I read past 10 nul terminators in the `era` string, I can retrieve all 11
segments. So the locale definition does work. It just doesn’t match up to what
nl_langinfo(3) says, and requires using the undocumented
`_NL_TIME_ERA_NUM_ENTRIES` to read all segments.

Is there a reason `era` is using nul separators? Could it be switched to using
semicolons please? :)
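
For readers unfamiliar with the segment layout in the output above: POSIX describes each era segment as six colon-separated fields (direction, offset, start_date, end_date, era_name, era_format). The following is only a rough sketch of splitting one segment into those fields; `split_era_segment` and `ERA_FIELDS` are hypothetical names for illustration, not glibc or GLib API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ERA_FIELDS 6

/* Split one era segment in place into its colon-separated fields.
   Returns the number of fields found (6 for a well-formed segment).
   The input buffer is modified: separators become NUL bytes.  */
static size_t
split_era_segment (char *segment, char *fields[ERA_FIELDS])
{
  size_t n = 0;
  char *p = segment;

  assert (segment != NULL);
  while (n < ERA_FIELDS)
    {
      fields[n++] = p;
      if (n == ERA_FIELDS)
        break;                  /* last field keeps any remaining colons */
      p = strchr (p, ':');
      if (p == NULL)
        break;                  /* malformed: fewer than six fields */
      *p++ = '\0';
    }
  return n;
}
```

Applied to a writable copy of the Thai segment shown above, `fields[2]` would be `-543/01/01` and `fields[5]` would be `%EC %Ey`.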

-- 
You are receiving this mail because:
You are on the CC list for the bug.


* [Bug locale/31030] ERA segments are nul-separated rather than semicolon-separated
From: schwab@linux-m68k.org @ 2023-12-05 14:15 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=31030

--- Comment #1 from Andreas Schwab <schwab@linux-m68k.org> ---
The ERA entry is actually terminated by a double null, and
_NL_TIME_ERA_NUM_ENTRIES is only related to _NL_TIME_ERA_ENTRIES (which
contains a structured representation of ERA and what is used internally
instead).

It is not clear whether changing the format of ERA would constitute an ABI
break.  The ALT_DIGITS entry has the same issue.
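
As a minimal sketch of what the double-NUL termination described above would allow: if the buffer really is a run of NUL-terminated segments followed by an empty string, the segments can be counted without `_NL_TIME_ERA_NUM_ENTRIES`. This assumes that layout holds, which nl_langinfo(3) does not document; `count_nul_separated` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the strings in a buffer of consecutive NUL-terminated strings
   that ends with an empty string (i.e. a double NUL).
   ASSUMPTION: the buffer really has that layout; if it is a plain
   single-NUL-terminated string this reads past its end.  */
static size_t
count_nul_separated (const char *buf)
{
  size_t n = 0;

  assert (buf != NULL);
  for (const char *p = buf; *p != '\0'; p += strlen (p) + 1)
    n++;
  return n;
}
```

The same loop would apply to ALT_DIGITS if it shares the format, with the same caveat about relying on undocumented behaviour.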


* [Bug locale/31030] ERA segments are nul-separated rather than semicolon-separated
From: smcv at collabora dot com @ 2024-01-15 13:23 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=31030

Simon McVittie <smcv at collabora dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |smcv at collabora dot com

--- Comment #2 from Simon McVittie <smcv at collabora dot com> ---
GLib currently uses (intptr_t) nl_langinfo(_NL_TIME_ERA_NUM_ENTRIES) in its
workaround for the ERA not being as documented, but it seems that there is a
separate problem with that on 64-bit big-endian architectures:
https://gitlab.gnome.org/GNOME/glib/-/issues/3225

On x86_64, in the Japanese (ja_JP.utf8) locale, (intptr_t)
nl_langinfo(_NL_TIME_ERA_NUM_ENTRIES) is 0x0000'0000'0000'000b (decimal 11)
which seems to be what is expected.

But on s390x, the result I see is 0x0000'000b'0000'0000 (a very large number):
it looks as though the two 32-bit words have been swapped? There is a test
program attached to the GLib issue which might be useful.


* [Bug locale/31030] ERA segments are nul-separated rather than semicolon-separated
From: smcv at collabora dot com @ 2024-01-15 13:31 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=31030

--- Comment #3 from Simon McVittie <smcv at collabora dot com> ---
(In reply to Andreas Schwab from comment #1)
> The ERA entry is actually terminated by a double null

Is that an API guarantee that glibc users can rely on, or an implementation
detail?

My concern is that if we rely on a double NUL, and then a future glibc version
changes to the documented behaviour with semicolon delimiters and single NUL
termination, we will be reading beyond the valid bounds of the returned string
and probably crash.

If glibc is willing to document/guarantee that the returned pointer is to
memory delimited by semicolons and/or single NULs, terminated by a double NUL,
then we could do something like this pseudocode:

    era = nl_langinfo (ERA);

    #ifdef __GLIBC__
    find length by searching for double NUL;
    era_copy = malloc (length);
    memcpy (era_copy, era, length);
    replace internal NULs with semicolons in era_copy;
    #else
    era_copy = strdup (era);
    #endif

The cost of doing that is that if glibc ever changes to returning the
documented format "segment;...;segment\0", it would have to use a slightly
longer buffer with a double NUL at the end, "segment;...;segment\0\0", to keep
that guarantee true.

(Obviously in real life the segments are more complicated than the word
"segment" but I'm sure you see what I mean.)
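
The glibc branch of the pseudocode above could look roughly like the following C sketch. It is only valid under the guarantee being asked for (semicolon and/or single-NUL delimiters, double-NUL terminator), which glibc has not promised; `era_to_semicolons` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Return a malloc'd, single-NUL-terminated copy of ERA in which the
   internal NUL separators are replaced by semicolons.
   ASSUMPTION: ERA ends with a double NUL.  Returns NULL on allocation
   failure; the caller frees the result.  */
static char *
era_to_semicolons (const char *era)
{
  size_t len = 0;
  char *copy;

  assert (era != NULL);
  /* Length of all segments including each segment's own NUL,
     excluding the final empty string.  */
  while (era[len] != '\0')
    len += strlen (era + len) + 1;

  if (len == 0)
    return calloc (1, 1);       /* no segments: return "" */

  copy = malloc (len);
  if (copy == NULL)
    return NULL;
  memcpy (copy, era, len);

  /* Keep the last NUL as the terminator, turn the others into ';'.  */
  for (size_t i = 0; i + 1 < len; i++)
    if (copy[i] == '\0')
      copy[i] = ';';
  return copy;
}
```

Input that already uses semicolons passes through unchanged, provided it is followed by the extra NUL.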


* [Bug locale/31030] ERA segments are nul-separated rather than semicolon-separated
From: schwab@linux-m68k.org @ 2024-01-15 13:34 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=31030

--- Comment #4 from Andreas Schwab <schwab@linux-m68k.org> ---
You cannot use _NL_TIME_ERA_NUM_ENTRIES like this since it is not a
pointer-sized entry.  Like all _NL_* definitions it is internal-only, not for
use with nl_langinfo.
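
To illustrate why a value that is not pointer-sized can come out differently on s390x, here is a small byte-order simulation. It assumes the 32-bit count occupies the first four bytes of an eight-byte slot that is then read back whole as a 64-bit value; that layout is a guess at the mechanism, not something confirmed in this thread, and `widen_word_slot` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Store WORD in the first four bytes of a zeroed eight-byte slot using
   the given byte order, then decode all eight bytes as one 64-bit
   integer in the same byte order.  This simulates reading a 32-bit
   field through a pointer-sized type.  */
static uint64_t
widen_word_slot (uint32_t word, int big_endian)
{
  unsigned char slot[8] = { 0 };
  uint64_t result = 0;

  assert (big_endian == 0 || big_endian == 1);
  for (int i = 0; i < 4; i++)
    {
      int shift = big_endian ? 8 * (3 - i) : 8 * i;
      slot[i] = (unsigned char) (word >> shift);
    }
  for (int i = 0; i < 8; i++)
    {
      int shift = big_endian ? 8 * (7 - i) : 8 * i;
      result |= (uint64_t) slot[i] << shift;
    }
  return result;
}
```

With `word` set to 11, the little-endian case yields 11 and the big-endian case yields 0x0000000b00000000, which matches the x86_64 and s390x values reported in comment #2.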


* [Bug locale/31030] ERA segments are nul-separated rather than semicolon-separated
From: schwab@linux-m68k.org @ 2024-01-15 13:35 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=31030

--- Comment #5 from Andreas Schwab <schwab@linux-m68k.org> ---
Like I said, it is an ABI problem.


* [Bug locale/31030] ERA segments are nul-separated rather than semicolon-separated
From: bugzilla at tecnocode dot co.uk @ 2024-01-15 13:50 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=31030

--- Comment #6 from Philip Withnall <bugzilla at tecnocode dot co.uk> ---
Do you have any recommendations for how `nl_langinfo (ERA)` is meant to be used
with glibc in a POSIX-compliant way?


* [Bug locale/31030] ERA segments are nul-separated rather than semicolon-separated
From: aurelien at aurel32 dot net @ 2024-01-16 21:58 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=31030

Aurelien Jarno <aurelien at aurel32 dot net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |aurelien at aurel32 dot net

