public inbox for libc-alpha@sourceware.org
From: Dmitry Vyukov <dvyukov@google.com>
To: mathieu.desnoyers@efficios.com
Cc: David.Laight@aculab.com, alexander@mihalicyn.com,
	andrealmeid@igalia.com,  boqun.feng@gmail.com,
	brauner@kernel.org, carlos@redhat.com,  ckennelly@google.com,
	corbet@lwn.net, dave@stgolabs.net, dvhart@infradead.org,
	 fweimer@redhat.com, goldstein.w.n@gmail.com, hpa@zytor.com,
	 libc-alpha@sourceware.org, linux-api@vger.kernel.org,
	 linux-kernel@vger.kernel.org, longman@redhat.com,
	mingo@redhat.com,  paulmck@kernel.org, peterz@infradead.org,
	pjt@google.com, posk@posk.io,  rostedt@goodmis.org,
	tglx@linutronix.de
Subject: Re: [RFC PATCH v2 1/4] rseq: Add sched_state field to struct rseq
Date: Thu, 28 Sep 2023 07:47:15 -0700	[thread overview]
Message-ID: <CACT4Y+bJpUMpp9YNt4wvQdCHV0-br1d9rH8y01Q=Y_1hQW5yCA@mail.gmail.com> (raw)
In-Reply-To: <CACT4Y+beLh1qnHF9bxhMUcva8KyuvZs7Mg_31SGK5xSoR=3m1A@mail.gmail.com>

On Tue, 26 Sept 2023 at 16:49, Dmitry Vyukov <dvyukov@google.com> wrote:
>
> > >> I don't see why we can't stick this directly into struct rseq because
> > >> it's all public anyway.
> > >
> > > The motivation for moving this to a different cache line is to handle
> > > the prior comment from Boqun, who is concerned that busy-waiting
> > > repeatedly loading a field from struct rseq will cause false-sharing and
> > > make other stores to that cache line slower, especially stores to
> > > rseq_cs to begin rseq critical sections, thus slightly increasing the
> > > overhead of rseq critical sections taken while mutexes are held.
> > >
> > > If we want to embed this field into struct rseq with its own cache line,
> > > then we need to add a lot of padding, which is inconvenient.
> > >
> > > That being said, perhaps this is premature optimization, what do you think ?
> >
> > Hi Mathieu, Florian,
> >
> > This is exciting!
> >
> > I thought the motivation for moving rseq_sched_state out of struct rseq
> > is lifetime management problem. I assume when a thread locks a mutex,
> > it stores pointer to rseq_sched_state in the mutex state for other
> > threads to poll. So the waiting thread would do something along the following
> > lines:
> >
> > rseq_sched_state* state = __atomic_load_n(&mutex->sched_state, __ATOMIC_RELAXED);
> > if (state && !(state->state & RSEQ_SCHED_STATE_FLAG_ON_CPU))
> >         futex_wait();
> >
> > Now if the state is struct rseq, which is stored in TLS,
> > then the owning thread can unlock the mutex, exit, and unmap its TLS in between.
> > Consequently, the load of state->state will cause a page fault.
> >
> > And we do want rseq in TLS to save 1 indirection.
> >
> > If rseq_sched_state is separated from struct rseq, then it can be allocated
> > in type stable memory that is never unmapped.
> >
> > What am I missing here?
> >
> > However, if we can store this state in struct rseq, then an alternative
> > interface would be for the kernel to do:
> >
> > rseq->cpu_id = -1;
> >
> > to denote that the thread is not running on any CPU.
> > I think it kinda makes sense, rseq->cpu_id is the thread's current CPU,
> > and -1 naturally means "not running at all". And we already store -1
> > right after init, so it shouldn't be a surprising value.
>
> As you may know we experimented with "virtual CPUs" in tcmalloc. The
> extension allows kernel to assign dense virtual CPU numbers to running
> threads instead of real sparse CPU numbers:
>
> https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/linux_syscall_support.h#L30-L41
>
> Recently I added another change that [ab]uses rseq in an interesting
> way. We want to get notifications about thread re-scheduling. A bit
> simplified version of this is as follows:
> we don't use rseq.cpu_id_start for its original purpose, so instead we
> store something else there with a high bit set. Real CPU numbers don't
> have a high bit set (at least while you have less than 2B CPUs :)).
> This allows us to distinguish the value we stored in rseq.cpu_id_start
> from real CPU id stored by the kernel.
> Inside of rseq critical section we check if rseq.cpu_id_start has high
> bit set, and if not, then we know that we were just rescheduled, so we
> can do some additional work and update rseq.cpu_id_start to have high
> bit set.
>
> In reality it's a bit more involved, since the field is actually 8
> bytes and only partially overlaps rseq.cpu_id_start (it's an
> 8-byte pointer whose high 4 bytes overlap rseq.cpu_id_start):
>
> https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/percpu.h#L101-L165
>
> I am thinking if we could extend the current proposed interface in a
> way that would be more flexible and would satisfy all of these use
> cases (spinlocks, and possibility of using virtual CPUs and
> rescheduling notifications). In the end they all need a very similar
> thing: kernel writing some value at some user address when a thread is
> de-scheduled.
>
> The minimal support we need for tcmalloc is an 8-byte user address +
> kernel writing 0 at that address when a thread is descheduled.
>
> The most flexible option to support multiple users
> (malloc/spinlocks/something else) would be as follows:
>
> User-space passes an array of structs with address + size (1/2/4/8
> bytes) + value.
> The kernel iterates over the array when the thread is de-scheduled and
> writes the specified value at the provided address with the given size.
> Something along the following lines (pseudo-code):
>
> struct rseq {
>     ...
>     struct rseq_desched_notif_t* desched_notifs;
>     int desched_notif_count;
> };
>
> struct rseq_desched_notif_t {
>     void* addr;
>     uint64_t value;
>     int size;
> };
>
> static inline void rseq_preempt(struct task_struct *t)
> {
>     ...
>     for (int i = 0; i < t->rseq->desched_notif_count; i++) {
>         switch (t->rseq->desched_notifs[i].size) {
>         case 1: put_user1(t->rseq->desched_notifs[i].addr,
>                           t->rseq->desched_notifs[i].value);
>                 break;
>         case 2: put_user2(t->rseq->desched_notifs[i].addr,
>                           t->rseq->desched_notifs[i].value);
>                 break;
>         case 4: put_user4(t->rseq->desched_notifs[i].addr,
>                           t->rseq->desched_notifs[i].value);
>                 break;
>         case 8: put_user8(t->rseq->desched_notifs[i].addr,
>                           t->rseq->desched_notifs[i].value);
>                 break;
>         }
>     }
> }

One thing I forgot to mention: ideally the kernel also writes a
timestamp of descheduling somewhere.
We are using this logic to assign per-CPU malloc caches to threads,
and it's useful to know which caches were used very recently (still
hot in cache) and which ones were not used for a long time.

Thread overview: 34+ messages
2023-05-29 19:14 [RFC PATCH v2 0/4] Extend rseq with sched_state_ptr field Mathieu Desnoyers
2023-05-29 19:14 ` [RFC PATCH v2 1/4] rseq: Add sched_state field to struct rseq Mathieu Desnoyers
2023-05-29 19:35   ` Florian Weimer
2023-05-29 19:48     ` Mathieu Desnoyers
2023-05-30  8:20       ` Florian Weimer
2023-05-30 14:25         ` Mathieu Desnoyers
2023-05-30 15:13           ` Mathieu Desnoyers
2023-09-26 20:52       ` Dmitry Vyukov
2023-09-26 23:49         ` Dmitry Vyukov
2023-09-26 23:54           ` Dmitry Vyukov
2023-09-27  4:51           ` Florian Weimer
2023-09-27 15:58             ` Dmitry Vyukov
2023-09-28  8:52               ` Florian Weimer
2023-09-28 14:44                 ` Dmitry Vyukov
2023-09-28 14:47           ` Dmitry Vyukov [this message]
2023-09-28 10:39   ` Peter Zijlstra
2023-09-28 11:22     ` David Laight
2023-09-28 13:20       ` Mathieu Desnoyers
2023-09-28 14:26         ` Peter Zijlstra
2023-09-28 14:33         ` David Laight
2023-09-28 15:05         ` André Almeida
2023-09-28 14:43     ` Steven Rostedt
2023-09-28 15:51       ` David Laight
2023-10-02 16:51         ` Steven Rostedt
2023-10-02 17:22           ` David Laight
2023-10-02 17:56             ` Steven Rostedt
2023-09-28 20:21   ` Thomas Gleixner
2023-09-28 20:43     ` Mathieu Desnoyers
2023-09-28 20:54   ` Thomas Gleixner
2023-09-28 22:11     ` Mathieu Desnoyers
2023-05-29 19:14 ` [RFC PATCH v2 2/4] selftests/rseq: Add sched_state rseq field and getter Mathieu Desnoyers
2023-05-29 19:14 ` [RFC PATCH v2 3/4] selftests/rseq: Implement sched state test program Mathieu Desnoyers
2023-05-29 19:14 ` [RFC PATCH v2 4/4] selftests/rseq: Implement rseq_mutex " Mathieu Desnoyers
2023-09-28 19:55   ` Thomas Gleixner
