public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
From: Alexandre Oliva <oliva@adacore.com>
To: Jakub Jelinek <jakub@redhat.com>
Cc: Jonathan Wakely <jwakely@redhat.com>,
	gcc-patches@gcc.gnu.org, libstdc++@gcc.gnu.org
Subject: Re: [PATCH] [PR77760] [libstdc++] encode __time_get_state in tm
Date: Wed, 22 Feb 2023 11:27:29 -0300	[thread overview]
Message-ID: <orzg9522su.fsf@lxoliva.fsfla.org> (raw)
In-Reply-To: <Y+95gEhz3zQj3ngU@tucnak> (Jakub Jelinek's message of "Fri, 17 Feb 2023 13:56:32 +0100")

On Feb 17, 2023, Jakub Jelinek <jakub@redhat.com> wrote:

> My worry is that people often invoke the time_get APIs on uninitialized
> struct tm.

Yeah.  I thought you meant get(), but it looks like you meant do_get()
as well.  I seem to have overread the permissions to overwrite tm
members, to update its current contents, and to write unspecified values
in its members.

When I started down this path, I thought we might be able to hold the
state bits only in fields that were to be updated, but the century field
and allowing for 16-bit ints made that a bit challenging.


So, back to the drawing board, the other possibility that occurs to me
is to use a thread-local state-keeper chained stack, set up by get, and
that enabled state to be recovered and updated by do_get if relevant
incoming arguments match those recorded at the top of the stack (e.g.,
at least that we're looking at the same tm object).


Another approach is to encode state in tm in-place and progressively,
and compute derived fields when (i) we've just set one of the input
fields, and (ii) any other input fields are in range.  E.g., if we parse
%p, we make sure tm_hour will be in the 00..11 range for am or 12..23
for pm, using the value previously-held in tm_hour, modulo 12, only if
it was in the 00..23 range.  If it wasn't in range, we record 00 or 12.
When we parse %I, if we find that tm_hour is in the 12..23 range, we
take that as meaning a "pm" was parsed for %p before, and add 12 to the
parsed value.  For century, we'd encode it immediately in tm_year
(preserving the previous modulo-100 value), and when parsing %Y we'd use
the century data, both at the risk of using garbage in case of
uninitialized data.  That appears to be in line with the
standard-specified behavior.  Maybe this is what the specification
expects implementations to do?

> For the encoding in get, the questions are:
> 1) if it is ok if we clear some normally unused bits of certain tm_*
>    fields even if they wouldn't be otherwise touched; perhaps nothing
>    will care, but it is still a user visible change

Rereading the relevant passages in standards and drafts, ISTM that this
unconditional zeroing would be unexpected and not permitted by the
standard, indeed.

> 2) when those tm_* fields we want to encode the state into are unitialized,
>    don't we effectively invoke UB by using the uninitialized member
>    (admittedly we just overwrite some bits in it and leave others as is,
>    but won't that e.g. cause -Wuninitialized warnings)?

ISTM that https://eel.is/c++draft/locale.time.get#virtuals-15 allows
do_get to read values of members, and even to expect them to have been
zero-initialized.  That, and the allowance for unspecified values to be
stored in tm in case of error, seem to me to hint at using out-of-range
values of struct tm.

However, at the end of do_get, we *should* have specified values in
place in the just-read member, and get rules out modifying other fields.
Could get() build a temporary, automatic tm object, zero-initialize it,
set a bit in each field that indicates it is not set, use it throughout
multiple do_get calls to hold parsed values and internal state, and then
copy to the user-supplied tm object only fields that are marked as set
(the bit is clear) and in range?

> More importantly, in the do_get the questions are:
> 3) while do_get is protected, are we guaranteed that it will never be called
>    except from the public members of time_get?

That's IMHO the main element of risk in using bits from unrelated fields
to hold state.  Storing the expected values only implies encoding state
in tm where it would ultimately end up (i.e., am/pm in tm_hour / 12,
century in tm_year / 100 + 19), using a zero (or an equal-to-sign) bit
to flag "set", and leaving it for get() to compute derived fields (if
that's even standard-compliant; it's not clear that it is, desirable as
it seems)

> 4) I think we even shouldn't __state._M_finalize_state(__tm); in do_get
>    if it is nested, then you wouldn't need to tweak src/c++98/locale_facets.cc;
>    generally state should be finalized once everything is parsed together,
>    not many times in between (of course if user code will parse stuff
>    separately that will not work, say get with "%I" in one call, then get
>    with "%h" in another one etc., but that is user's decision).

Parsing "%h%p%I%p%h" should get into tm_hour the value that time_put
read to format with the same format string, so there's an argument for
"finalizing" tm_hour based on more than the state fields we hold atm,
perhaps even for progressive in-place encoding, as I have suggested
above.

> Just curious, why doesn't the pmf hack work on arm-vxworks7?

At first, I thought we were running into this just because we have to
define __clang__ because of some vxworks system headers aimed at clang.
But even as I tried to drop the #ifndef, the test still failed; I
suspected it had to do with ARM's encoding of ptrmemfunc_vbit_in_delta,
but I did not confirm that this was the case.

> Sure, clang++ doesn't support it and if one derives a class from time_get
> and overrides do_get and/or get things will not work properly either.

Yeah, I reasoned along these lines to try an alternate solution, even if
the alternate solution was to be used only when the test did not pass.

-- 
Alexandre Oliva, happy hacker                https://FSFLA.org/blogs/lxo/
   Free Software Activist                       GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>

  reply	other threads:[~2023-02-22 14:38 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-17  7:47 Alexandre Oliva
2023-02-17 12:56 ` Jakub Jelinek
2023-02-22 14:27   ` Alexandre Oliva [this message]
2023-02-23 17:55     ` Alexandre Oliva
2023-02-23 18:21       ` Jonathan Wakely

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=orzg9522su.fsf@lxoliva.fsfla.org \
    --to=oliva@adacore.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=jakub@redhat.com \
    --cc=jwakely@redhat.com \
    --cc=libstdc++@gcc.gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).