public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
From: "André Almeida" <andrealmeid@collabora.com>
To: Nicholas Piggin <npiggin@gmail.com>
Cc: acme@kernel.org, Andrey Semashev <andrey.semashev@gmail.com>,
	corbet@lwn.net, Darren Hart <dvhart@infradead.org>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel@vger.kernel.org,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	fweimer@redhat.com, joel@joelfernandes.org, kernel@collabora.com,
	krisman@collabora.com, libc-alpha@sourceware.org,
	linux-api@vger.kernel.org, linux-kselftest@vger.kernel.org,
	malteskarupke@fastmail.fm, pgriffais@valvesoftware.com,
	Peter Oskolkov <posk@posk.io>,
	shuah@kernel.org, z.figura12@gmail.com,
	Steven Rostedt <rostedt@goodmis.org>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: Re: [PATCH v4 00/15] Add futex2 syscalls
Date: Fri, 4 Jun 2021 17:01:01 -0300	[thread overview]
Message-ID: <fb85fb20-5421-b095-e68b-955afa105467@collabora.com> (raw)
In-Reply-To: <1622799088.hsuspipe84.astroid@bobo.none>

Às 08:36 de 04/06/21, Nicholas Piggin escreveu:
> Excerpts from André Almeida's message of June 4, 2021 5:59 am:
>> Hi,
>>
>> This patch series introduces the futex2 syscalls.
>>
>> * What happened to the current futex()?
>>
>> For some years now, developers have been trying to add new features to
>> futex, but maintainers have been reluctant to accept then, given the
>> multiplexed interface full of legacy features and tricky to do big
>> changes. Some problems that people tried to address with patchsets are:
>> NUMA-awareness[0], smaller sized futexes[1], wait on multiple futexes[2].
>> NUMA, for instance, just doesn't fit the current API in a reasonable
>> way. Considering that, it's not possible to merge new features into the
>> current futex.
>>
>>  ** The NUMA problem
>>
>>  At the current implementation, all futex kernel side infrastructure is
>>  stored on a single node. Given that, all futex() calls issued by
>>  processors that aren't located on that node will have a memory access
>>  penalty when doing it.
>>
>>  ** The 32bit sized futex problem
>>
>>  Futexes are used to implement atomic operations in userspace.
>>  Supporting 8, 16, 32 and 64 bit sized futexes allows user libraries to
>>  implement all those sizes in a performant way. Thanks Boost devs for
>>  feedback: https://lists.boost.org/Archives/boost/2021/05/251508.php
>>
>>  Embedded systems or anything with memory constrains could benefit of
>>  using smaller sizes for the futex userspace integer.
>>
>>  ** The wait on multiple problem
>>
>>  The use case lies in the Wine implementation of the Windows NT interface
>>  WaitMultipleObjects. This Windows API function allows a thread to sleep
>>  waiting on the first of a set of event sources (mutexes, timers, signal,
>>  console input, etc) to signal.  Considering this is a primitive
>>  synchronization operation for Windows applications, being able to quickly
>>  signal events on the producer side, and quickly go to sleep on the
>>  consumer side is essential for good performance of those running over Wine.
>>
>> [0] https://lore.kernel.org/lkml/20160505204230.932454245@linutronix.de/
>> [1] https://lore.kernel.org/lkml/20191221155659.3159-2-malteskarupke@web.de/
>> [2] https://lore.kernel.org/lkml/20200213214525.183689-1-andrealmeid@collabora.com/
>>
>> * The solution
>>
>> As proposed by Peter Zijlstra and Florian Weimer[3], a new interface
>> is required to solve this, which must be designed with those features in
>> mind. futex2() is that interface. As opposed to the current multiplexed
>> interface, the new one should have one syscall per operation. This will
>> allow the maintainability of the API if it gets extended, and will help
>> users with type checking of arguments.
>>
>> In particular, the new interface is extended to support the ability to
>> wait on any of a list of futexes at a time, which could be seen as a
>> vectored extension of the FUTEX_WAIT semantics.
>>
>> [3] https://lore.kernel.org/lkml/20200303120050.GC2596@hirez.programming.kicks-ass.net/
>>
>> * The interface
>>
>> The new interface can be seen in details in the following patches, but
>> this is a high level summary of what the interface can do:
>>
>>  - Supports wake/wait semantics, as in futex()
>>  - Supports requeue operations, similarly as FUTEX_CMP_REQUEUE, but with
>>    individual flags for each address
>>  - Supports waiting for a vector of futexes, using a new syscall named
>>    futex_waitv()
>>  - Supports variable sized futexes (8bits, 16bits, 32bits and 64bits)
>>  - Supports NUMA-awareness operations, where the user can specify on
>>    which memory node would like to operate
> 
> A few comments.
> 
> - Do you need a syscall for wait and for waitv, or can wait just be a
> degenerate case of waitv (which could use the stack variable)?  I guess
> it does save the user access unlock.
> 

Yes. waitv with one element has a overhead compared to wait, given the
extra user_copy(). Given that waiting on a single futex is the common
case, I think it's worth to have it.

> - Did you consider a wakev interface? An example is a read-write mutex 
> which has read-blocking futexes split (e.g., per-node) for scalability 
> then the writer may unlock and wake all readers with one call. We 
> actually have some scalability challenges of this nature with certain 
> large database programs.
> 

Not really, I haven't heard any use case for that until now. It should
be easy to implement it, though, and I think you have an interesting use
case here. Could you point me some of those database programs?

> - Great to see 64-bit support in, it is helpful to do some interesting 
> things with locks without hacks (e.g., putting an address in the lock 
> word).
> 
> - Are we really keen on squashing node ID into flags in this day and age?
> I guess okay but seems like it would be nice to allow a bit more space
> in general for the operations. I don't want to turn it into a whole big
> multiplexing nightmare again with lots of such flags, or propose
> complexity with no code behind it, but I think we need a bit of leeway
> for unforeseen locking innovations to be added carefully. The pthread
> locking today is still fairly primitive really, I don't think we know
> what will work best for the next 10 years.

In the interface that I'd proposed, the node ID isn't part of the flags.
You have a flag FUTEX_FLAG_NUMA, and when that is used, you pass in
`void *uaddr` a pointer to a `struct futex_numa { int value, int hint
}`, where hint should be the node ID you would like to work on, and
value is just the userspace futex. This is documented in more details in
patch 7 "docs: locking: futex2: Add documentation".

If you have any feedback about how this NUMA interface looks like, I
would like to hear.

Also, did something in my writing indicated that the node ID would be
part of the flags? I'll improve this it if so.

> 
> One scalability issue we are starting to hit and will only get worse is 
> futex queue spinlock contention. Perhaps this is better addressed in 
> userspace but the kernel could play a part so I like to leave some doors
> open. One example is that the wait (or wake) side may like to depend not
> just on the memory value, but on the success of a cmpxchg to avoid 
> unqueueing and queueing spuriously, which increases lock contention but
> also ends up putting the poor task on the back of the list -- yes RT
> priorities can help the realtime case, but non-RT cases can get bad
> outlier latencies if lock stealing is allowed (which can be very good
> for performance).
> 

Sorry, I'm not sure what do you mean here. Are you proposing to have a
cmpxchg in kernel side, so the lock would be taken by the kernel, and
not by the userspace like it's now?

> - The private global futex hash table sucks for various reasons, and
> having 128 waiters per thread makes it orders of magnitude easier for
> userspace to DoS stuff with hash collisions. NUMA doesn't fix that, the
> per process hashing that Thomas suggested does fix the DoS but the
> non-deterministic hash collisions still seem to be a problem for real
> time response, and at the other end of the scale some apps (certain 
> databases, etc) can have ten thousand futex waiters at once so birthday
> paradox can also lead to guaranteed (low level) variable beahviour 
> within a single process.
> 
> I know the kernel in general is not very robust against this kind of 
> DoS/nondeterminism, but it's a bit sad to introduce new APIs with the 
> problem still there. Yes we could address it later, but I think it's 
> better done first because the solution might influence what the best 
> syscall API is.
> 
> For example the idea of using the address as the handle for the wait 
> queue _and_ the value is very convenient but hashing is annoying for
> all the above reasons and the shared wait queue case is pretty clunky. 
> It's also constraining in some corner cases to have the wait queue 
> associated with the address 1:1. For example a type of NUMA mutex might 
> want to have per-node waitqueues associated with a lock word, and wake
> local node waiters 99% of the time. Not trivial to do with futexes and
> seems to at least require bouncing of more cache lines, possibly more
> atomics, etc.
> 
> Could anything else be done?

I wasn't aware that userspace doing DoS is something to be concerned
from the kernel point of view. Is this scenario considering a malicious
actor? If so, there are plenty of resources to be denied, so not sure
how futex could be protected of this. Or is this just a program that
uses tons of futexes?

> 
> I'll be burned at the stake for suggesting it but it would be great if 
> we could use file descriptors. At least for the shared futex, maybe 
> private could use a per-process futex allocator. It solves all of the
> above, although I'm sure has many of its own problem. It may not play
> so nicely with the pthread mutex API because of the whole static
> initialiser problem, but the first futex proposal did use fds. But it's
> an example of an alternate API.
> 

FDs and futex doesn't play well, because for futex_wait() you need to
tell the kernel the expected value in the futex address to avoid
sleeping in a free lock. FD operations (poll, select) don't have this
`value` argument, so they could sleep forever, but I'm not sure if you
had taken this in consideration.

> And.. thanks for the great work, apologies if I missed some discussion
> related to the above comments, I did some reading but I know there has
> been a lot of discussions going on.
> 

:) and thank you for the valuable feedback and detailed ideas!

> Thanks,
> Nick
> 

  reply	other threads:[~2021-06-04 20:01 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-03 19:59 André Almeida
2021-06-03 19:59 ` [PATCH v4 01/15] futex2: Implement wait and wake functions André Almeida
2021-06-03 19:59 ` [PATCH v4 02/15] futex2: Add support for shared futexes André Almeida
2021-06-03 19:59 ` [PATCH v4 03/15] futex2: Implement vectorized wait André Almeida
2021-06-03 19:59 ` [PATCH v4 04/15] futex2: Implement requeue operation André Almeida
2021-06-03 19:59 ` [PATCH v4 05/15] futex2: Implement support for different futex sizes André Almeida
2021-06-06 19:12   ` Davidlohr Bueso
2021-06-06 23:01     ` Andrey Semashev
2021-06-03 19:59 ` [PATCH v4 06/15] futex2: Add compatibility entry point for x86_x32 ABI André Almeida
2021-06-03 19:59 ` [PATCH v4 07/15] docs: locking: futex2: Add documentation André Almeida
2021-06-06 19:23   ` Davidlohr Bueso
2021-06-03 19:59 ` [PATCH v4 08/15] selftests: futex2: Add wake/wait test André Almeida
2021-06-03 19:59 ` [PATCH v4 09/15] selftests: futex2: Add timeout test André Almeida
2021-06-03 19:59 ` [PATCH v4 10/15] selftests: futex2: Add wouldblock test André Almeida
2021-06-03 19:59 ` [PATCH v4 11/15] selftests: futex2: Add waitv test André Almeida
2021-06-03 19:59 ` [PATCH v4 12/15] selftests: futex2: Add requeue test André Almeida
2021-06-03 19:59 ` [PATCH v4 13/15] selftests: futex2: Add futex sizes test André Almeida
2021-06-03 19:59 ` [PATCH v4 14/15] perf bench: Add futex2 benchmark tests André Almeida
2021-06-03 19:59 ` [PATCH v4 15/15] kernel: Enable waitpid() for futex2 André Almeida
2021-06-04  4:51 ` [PATCH v4 00/15] Add futex2 syscalls Zebediah Figura
2021-06-04 17:04   ` André Almeida
2021-06-04 11:36 ` Nicholas Piggin
2021-06-04 20:01   ` André Almeida [this message]
2021-06-05  1:09     ` Nicholas Piggin
2021-06-05  8:56       ` Andrey Semashev
2021-06-06 11:57         ` Nicholas Piggin
2021-06-06 13:15           ` Andrey Semashev
2021-06-08  1:25             ` Nicholas Piggin
2021-06-08 11:03               ` Andrey Semashev
2021-06-08 11:13                 ` Greg KH
2021-06-08 11:44                   ` Peter Zijlstra
2021-06-08 14:31                     ` Davidlohr Bueso
2021-06-08 12:06                   ` Andrey Semashev
2021-06-08 12:33                     ` Greg KH
2021-06-08 12:35                     ` Greg KH
2021-06-08 13:18                       ` Andrey Semashev
2021-06-08 13:27                         ` Greg KH
2021-06-08 13:41                           ` Andrey Semashev
2021-06-08 17:06                         ` Zebediah Figura
2021-06-08 14:14                   ` André Almeida
2021-06-07 15:40       ` André Almeida
2021-06-08  1:31         ` Nicholas Piggin
2021-06-08  2:33         ` Davidlohr Bueso
2021-06-08  4:45           ` Nicholas Piggin
2021-06-08 12:26         ` Sebastian Andrzej Siewior
2021-06-08 14:23           ` Peter Zijlstra
2021-06-08 14:57             ` Sebastian Andrzej Siewior
2021-06-08 15:04             ` André Almeida
2021-06-08 18:08             ` Adhemerval Zanella
2021-06-08 18:19               ` Florian Weimer
2021-06-08 18:22                 ` Adhemerval Zanella
2021-06-09 16:26             ` David Laight

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fb85fb20-5421-b095-e68b-955afa105467@collabora.com \
    --to=andrealmeid@collabora.com \
    --cc=acme@kernel.org \
    --cc=andrey.semashev@gmail.com \
    --cc=bigeasy@linutronix.de \
    --cc=corbet@lwn.net \
    --cc=dave@stgolabs.net \
    --cc=dvhart@infradead.org \
    --cc=fweimer@redhat.com \
    --cc=joel@joelfernandes.org \
    --cc=kernel@collabora.com \
    --cc=krisman@collabora.com \
    --cc=libc-alpha@sourceware.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=malteskarupke@fastmail.fm \
    --cc=mingo@redhat.com \
    --cc=npiggin@gmail.com \
    --cc=peterz@infradead.org \
    --cc=pgriffais@valvesoftware.com \
    --cc=posk@posk.io \
    --cc=rostedt@goodmis.org \
    --cc=shuah@kernel.org \
    --cc=tglx@linutronix.de \
    --cc=z.figura12@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).