public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
From: Nicholas Piggin <npiggin@gmail.com>
To: "André Almeida" <andrealmeid@collabora.com>,
	"Sebastian Andrzej Siewior" <bigeasy@linutronix.de>,
	"Darren Hart" <dvhart@infradead.org>,
	linux-kernel@vger.kernel.org, "Ingo Molnar" <mingo@redhat.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Thomas Gleixner" <tglx@linutronix.de>
Cc: acme@kernel.org, Andrey Semashev <andrey.semashev@gmail.com>,
	corbet@lwn.net, Davidlohr Bueso <dave@stgolabs.net>,
	fweimer@redhat.com, joel@joelfernandes.org, kernel@collabora.com,
	krisman@collabora.com, libc-alpha@sourceware.org,
	linux-api@vger.kernel.org, linux-kselftest@vger.kernel.org,
	malteskarupke@fastmail.fm, pgriffais@valvesoftware.com,
	Peter Oskolkov <posk@posk.io>,
	shuah@kernel.org,  z.figura12@gmail.com
Subject: Re: [PATCH v4 00/15] Add futex2 syscalls
Date: Fri, 04 Jun 2021 21:36:38 +1000	[thread overview]
Message-ID: <1622799088.hsuspipe84.astroid@bobo.none> (raw)
In-Reply-To: <20210603195924.361327-1-andrealmeid@collabora.com>

Excerpts from André Almeida's message of June 4, 2021 5:59 am:
> Hi,
> 
> This patch series introduces the futex2 syscalls.
> 
> * What happened to the current futex()?
> 
> For some years now, developers have been trying to add new features to
> futex, but maintainers have been reluctant to accept then, given the
> multiplexed interface full of legacy features and tricky to do big
> changes. Some problems that people tried to address with patchsets are:
> NUMA-awareness[0], smaller sized futexes[1], wait on multiple futexes[2].
> NUMA, for instance, just doesn't fit the current API in a reasonable
> way. Considering that, it's not possible to merge new features into the
> current futex.
> 
>  ** The NUMA problem
> 
>  At the current implementation, all futex kernel side infrastructure is
>  stored on a single node. Given that, all futex() calls issued by
>  processors that aren't located on that node will have a memory access
>  penalty when doing it.
> 
>  ** The 32bit sized futex problem
> 
>  Futexes are used to implement atomic operations in userspace.
>  Supporting 8, 16, 32 and 64 bit sized futexes allows user libraries to
>  implement all those sizes in a performant way. Thanks Boost devs for
>  feedback: https://lists.boost.org/Archives/boost/2021/05/251508.php
> 
>  Embedded systems or anything with memory constrains could benefit of
>  using smaller sizes for the futex userspace integer.
> 
>  ** The wait on multiple problem
> 
>  The use case lies in the Wine implementation of the Windows NT interface
>  WaitMultipleObjects. This Windows API function allows a thread to sleep
>  waiting on the first of a set of event sources (mutexes, timers, signal,
>  console input, etc) to signal.  Considering this is a primitive
>  synchronization operation for Windows applications, being able to quickly
>  signal events on the producer side, and quickly go to sleep on the
>  consumer side is essential for good performance of those running over Wine.
> 
> [0] https://lore.kernel.org/lkml/20160505204230.932454245@linutronix.de/
> [1] https://lore.kernel.org/lkml/20191221155659.3159-2-malteskarupke@web.de/
> [2] https://lore.kernel.org/lkml/20200213214525.183689-1-andrealmeid@collabora.com/
> 
> * The solution
> 
> As proposed by Peter Zijlstra and Florian Weimer[3], a new interface
> is required to solve this, which must be designed with those features in
> mind. futex2() is that interface. As opposed to the current multiplexed
> interface, the new one should have one syscall per operation. This will
> allow the maintainability of the API if it gets extended, and will help
> users with type checking of arguments.
> 
> In particular, the new interface is extended to support the ability to
> wait on any of a list of futexes at a time, which could be seen as a
> vectored extension of the FUTEX_WAIT semantics.
> 
> [3] https://lore.kernel.org/lkml/20200303120050.GC2596@hirez.programming.kicks-ass.net/
> 
> * The interface
> 
> The new interface can be seen in details in the following patches, but
> this is a high level summary of what the interface can do:
> 
>  - Supports wake/wait semantics, as in futex()
>  - Supports requeue operations, similarly as FUTEX_CMP_REQUEUE, but with
>    individual flags for each address
>  - Supports waiting for a vector of futexes, using a new syscall named
>    futex_waitv()
>  - Supports variable sized futexes (8bits, 16bits, 32bits and 64bits)
>  - Supports NUMA-awareness operations, where the user can specify on
>    which memory node would like to operate

A few comments.

- Do you need a syscall for wait and for waitv, or can wait just be a
degenerate case of waitv (which could use the stack variable)?  I guess
it does save the user access unlock.

- Did you consider a wakev interface? An example is a read-write mutex 
which has read-blocking futexes split (e.g., per-node) for scalability 
then the writer may unlock and wake all readers with one call. We 
actually have some scalability challenges of this nature with certain 
large database programs.

- Great to see 64-bit support in, it is helpful to do some interesting 
things with locks without hacks (e.g., putting an address in the lock 
word).

- Are we really keen on squashing node ID into flags in this day and age?
I guess okay but seems like it would be nice to allow a bit more space
in general for the operations. I don't want to turn it into a whole big
multiplexing nightmare again with lots of such flags, or propose
complexity with no code behind it, but I think we need a bit of leeway
for unforeseen locking innovations to be added carefully. The pthread
locking today is still fairly primitive really, I don't think we know
what will work best for the next 10 years.

One scalability issue we are starting to hit and will only get worse is 
futex queue spinlock contention. Perhaps this is better addressed in 
userspace but the kernel could play a part so I like to leave some doors
open. One example is that the wait (or wake) side may like to depend not
just on the memory value, but on the success of a cmpxchg to avoid 
unqueueing and queueing spuriously, which increases lock contention but
also ends up putting the poor task on the back of the list -- yes RT
priorities can help the realtime case, but non-RT cases can get bad
outlier latencies if lock stealing is allowed (which can be very good
for performance).

- The private global futex hash table sucks for various reasons, and
having 128 waiters per thread makes it orders of magnitude easier for
userspace to DoS stuff with hash collisions. NUMA doesn't fix that, the
per process hashing that Thomas suggested does fix the DoS but the
non-deterministic hash collisions still seem to be a problem for real
time response, and at the other end of the scale some apps (certain 
databases, etc) can have ten thousand futex waiters at once so birthday
paradox can also lead to guaranteed (low level) variable beahviour 
within a single process.

I know the kernel in general is not very robust against this kind of 
DoS/nondeterminism, but it's a bit sad to introduce new APIs with the 
problem still there. Yes we could address it later, but I think it's 
better done first because the solution might influence what the best 
syscall API is.

For example the idea of using the address as the handle for the wait 
queue _and_ the value is very convenient but hashing is annoying for
all the above reasons and the shared wait queue case is pretty clunky. 
It's also constraining in some corner cases to have the wait queue 
associated with the address 1:1. For example a type of NUMA mutex might 
want to have per-node waitqueues associated with a lock word, and wake
local node waiters 99% of the time. Not trivial to do with futexes and
seems to at least require bouncing of more cache lines, possibly more
atomics, etc.

Could anything else be done?

I'll be burned at the stake for suggesting it but it would be great if 
we could use file descriptors. At least for the shared futex, maybe 
private could use a per-process futex allocator. It solves all of the
above, although I'm sure has many of its own problem. It may not play
so nicely with the pthread mutex API because of the whole static
initialiser problem, but the first futex proposal did use fds. But it's
an example of an alternate API.

And.. thanks for the great work, apologies if I missed some discussion
related to the above comments, I did some reading but I know there has
been a lot of discussions going on.

Thanks,
Nick

  parent reply	other threads:[~2021-06-04 11:36 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-03 19:59 André Almeida
2021-06-03 19:59 ` [PATCH v4 01/15] futex2: Implement wait and wake functions André Almeida
2021-06-03 19:59 ` [PATCH v4 02/15] futex2: Add support for shared futexes André Almeida
2021-06-03 19:59 ` [PATCH v4 03/15] futex2: Implement vectorized wait André Almeida
2021-06-03 19:59 ` [PATCH v4 04/15] futex2: Implement requeue operation André Almeida
2021-06-03 19:59 ` [PATCH v4 05/15] futex2: Implement support for different futex sizes André Almeida
2021-06-06 19:12   ` Davidlohr Bueso
2021-06-06 23:01     ` Andrey Semashev
2021-06-03 19:59 ` [PATCH v4 06/15] futex2: Add compatibility entry point for x86_x32 ABI André Almeida
2021-06-03 19:59 ` [PATCH v4 07/15] docs: locking: futex2: Add documentation André Almeida
2021-06-06 19:23   ` Davidlohr Bueso
2021-06-03 19:59 ` [PATCH v4 08/15] selftests: futex2: Add wake/wait test André Almeida
2021-06-03 19:59 ` [PATCH v4 09/15] selftests: futex2: Add timeout test André Almeida
2021-06-03 19:59 ` [PATCH v4 10/15] selftests: futex2: Add wouldblock test André Almeida
2021-06-03 19:59 ` [PATCH v4 11/15] selftests: futex2: Add waitv test André Almeida
2021-06-03 19:59 ` [PATCH v4 12/15] selftests: futex2: Add requeue test André Almeida
2021-06-03 19:59 ` [PATCH v4 13/15] selftests: futex2: Add futex sizes test André Almeida
2021-06-03 19:59 ` [PATCH v4 14/15] perf bench: Add futex2 benchmark tests André Almeida
2021-06-03 19:59 ` [PATCH v4 15/15] kernel: Enable waitpid() for futex2 André Almeida
2021-06-04  4:51 ` [PATCH v4 00/15] Add futex2 syscalls Zebediah Figura
2021-06-04 17:04   ` André Almeida
2021-06-04 11:36 ` Nicholas Piggin [this message]
2021-06-04 20:01   ` André Almeida
2021-06-05  1:09     ` Nicholas Piggin
2021-06-05  8:56       ` Andrey Semashev
2021-06-06 11:57         ` Nicholas Piggin
2021-06-06 13:15           ` Andrey Semashev
2021-06-08  1:25             ` Nicholas Piggin
2021-06-08 11:03               ` Andrey Semashev
2021-06-08 11:13                 ` Greg KH
2021-06-08 11:44                   ` Peter Zijlstra
2021-06-08 14:31                     ` Davidlohr Bueso
2021-06-08 12:06                   ` Andrey Semashev
2021-06-08 12:33                     ` Greg KH
2021-06-08 12:35                     ` Greg KH
2021-06-08 13:18                       ` Andrey Semashev
2021-06-08 13:27                         ` Greg KH
2021-06-08 13:41                           ` Andrey Semashev
2021-06-08 17:06                         ` Zebediah Figura
2021-06-08 14:14                   ` André Almeida
2021-06-07 15:40       ` André Almeida
2021-06-08  1:31         ` Nicholas Piggin
2021-06-08  2:33         ` Davidlohr Bueso
2021-06-08  4:45           ` Nicholas Piggin
2021-06-08 12:26         ` Sebastian Andrzej Siewior
2021-06-08 14:23           ` Peter Zijlstra
2021-06-08 14:57             ` Sebastian Andrzej Siewior
2021-06-08 15:04             ` André Almeida
2021-06-08 18:08             ` Adhemerval Zanella
2021-06-08 18:19               ` Florian Weimer
2021-06-08 18:22                 ` Adhemerval Zanella
2021-06-09 16:26             ` David Laight

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1622799088.hsuspipe84.astroid@bobo.none \
    --to=npiggin@gmail.com \
    --cc=acme@kernel.org \
    --cc=andrealmeid@collabora.com \
    --cc=andrey.semashev@gmail.com \
    --cc=bigeasy@linutronix.de \
    --cc=corbet@lwn.net \
    --cc=dave@stgolabs.net \
    --cc=dvhart@infradead.org \
    --cc=fweimer@redhat.com \
    --cc=joel@joelfernandes.org \
    --cc=kernel@collabora.com \
    --cc=krisman@collabora.com \
    --cc=libc-alpha@sourceware.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=malteskarupke@fastmail.fm \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=pgriffais@valvesoftware.com \
    --cc=posk@posk.io \
    --cc=rostedt@goodmis.org \
    --cc=shuah@kernel.org \
    --cc=tglx@linutronix.de \
    --cc=z.figura12@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).