Date: Sat, 05 Jun 2021 11:09:45 +1000
From: Nicholas Piggin
Subject: Re: [PATCH v4 00/15] Add futex2 syscalls
To: André Almeida
Cc: acme@kernel.org, Andrey Semashev, Sebastian Andrzej Siewior, corbet@lwn.net, Davidlohr Bueso, Darren Hart, fweimer@redhat.com, joel@joelfernandes.org, kernel@collabora.com, krisman@collabora.com, libc-alpha@sourceware.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, malteskarupke@fastmail.fm, Ingo Molnar, Peter Zijlstra, pgriffais@valvesoftware.com, Peter Oskolkov, Steven Rostedt, shuah@kernel.org, Thomas Gleixner, z.figura12@gmail.com
References: <20210603195924.361327-1-andrealmeid@collabora.com> <1622799088.hsuspipe84.astroid@bobo.none>
Message-Id: <1622853816.mokf23xgnt.astroid@bobo.none>

Excerpts from André Almeida's message of June 5, 2021 6:01 am:
> At 08:36 on 04/06/21, Nicholas Piggin wrote:
>> Excerpts from André Almeida's message of June 4, 2021 5:59 am:
>>> Hi,
>>>
>>> This patch series introduces the futex2 syscalls.
>>>
>>> * What happened to the current futex()?
>>>
>>> For some years now, developers have been trying to add new features to futex, but maintainers have been reluctant to accept them, given the multiplexed interface full of legacy features that makes big changes tricky. Some problems that people tried to address with patchsets are: NUMA-awareness [0], smaller-sized futexes [1], and waiting on multiple futexes [2]. NUMA, for instance, just doesn't fit the current API in a reasonable way. Considering that, it's not possible to merge new features into the current futex.
>>>
>>> ** The NUMA problem
>>>
>>> In the current implementation, all futex kernel-side infrastructure is stored on a single node. Given that, all futex() calls issued by processors that aren't located on that node pay a memory access penalty.
>>>
>>> ** The 32-bit sized futex problem
>>>
>>> Futexes are used to implement atomic operations in userspace. Supporting 8-, 16-, 32- and 64-bit sized futexes allows user libraries to implement all those sizes in a performant way. Thanks to the Boost devs for feedback: https://lists.boost.org/Archives/boost/2021/05/251508.php
>>>
>>> Embedded systems, or anything with memory constraints, could benefit from using smaller sizes for the futex userspace integer.
>>>
>>> ** The wait-on-multiple problem
>>>
>>> The use case lies in the Wine implementation of the Windows NT interface WaitForMultipleObjects. This Windows API function allows a thread to sleep waiting on the first of a set of event sources (mutexes, timers, signals, console input, etc.) to signal. Considering this is a primitive synchronization operation for Windows applications, being able to quickly signal events on the producer side and quickly go to sleep on the consumer side is essential for good performance of applications running over Wine.
>>>
>>> [0] https://lore.kernel.org/lkml/20160505204230.932454245@linutronix.de/
>>> [1] https://lore.kernel.org/lkml/20191221155659.3159-2-malteskarupke@web.de/
>>> [2] https://lore.kernel.org/lkml/20200213214525.183689-1-andrealmeid@collabora.com/
>>>
>>> * The solution
>>>
>>> As proposed by Peter Zijlstra and Florian Weimer [3], a new interface is required to solve this, one designed with those features in mind. futex2() is that interface. As opposed to the current multiplexed interface, the new one has one syscall per operation. This will keep the API maintainable as it gets extended, and will help users with type checking of arguments.
>>>
>>> In particular, the new interface is extended to support the ability to wait on any of a list of futexes at a time, which could be seen as a vectored extension of the FUTEX_WAIT semantics.
>>>
>>> [3] https://lore.kernel.org/lkml/20200303120050.GC2596@hirez.programming.kicks-ass.net/
>>>
>>> * The interface
>>>
>>> The new interface can be seen in detail in the following patches, but this is a high-level summary of what it can do:
>>>
>>> - Supports wake/wait semantics, as in futex()
>>> - Supports requeue operations, similar to FUTEX_CMP_REQUEUE, but with individual flags for each address
>>> - Supports waiting on a vector of futexes, using a new syscall named futex_waitv()
>>> - Supports variable-sized futexes (8, 16, 32 and 64 bits)
>>> - Supports NUMA-aware operations, where the user can specify which memory node they would like to operate on
>>
>> A few comments.
>>
>> - Do you need a syscall for wait and for waitv, or can wait just be a degenerate case of waitv (which could use the stack variable)? I guess it does save the user access unlock.
>>
>
> Yes. waitv with one element has an overhead compared to wait, given the extra user_copy(). Given that waiting on a single futex is the common case, I think it's worth having it.

Okay.
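(For readers following along, this is roughly the call-shape difference being discussed, as I understand the series. Sketch only: futex_wait() and futex_waitv() stand in for thin wrappers around the new syscalls, and the struct layout shown is illustrative rather than the final ABI.)

struct futex_waitv {
        void *uaddr;            /* futex address */
        unsigned int val;       /* expected value */
        unsigned int flags;     /* e.g. FUTEX_32 for a 32-bit futex */
};

void wait_example(unsigned int *f1, unsigned int *f2)
{
        /* Common case: wait on one futex. Nothing to copy in beyond
         * the syscall arguments themselves. */
        futex_wait(f1, /* expected */ 0, FUTEX_32, /* timeout */ NULL);

        /* Vectored case: sleep until any entry is woken. The kernel
         * must copy the whole vector from userspace first -- the
         * extra user_copy() overhead mentioned above. */
        struct futex_waitv v[] = {
                { .uaddr = f1, .val = 0, .flags = FUTEX_32 },
                { .uaddr = f2, .val = 0, .flags = FUTEX_32 },
        };
        futex_waitv(v, 2, /* flags */ 0, /* timeout */ NULL);
}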
>> - Did you consider a wakev interface? An example is a read-write mutex which has read-blocking futexes split (e.g., per-node) for scalability; then the writer may unlock and wake all readers with one call. We actually have some scalability challenges of this nature with certain large database programs.
>>
>
> Not really, I haven't heard any use case for that until now. It should be easy to implement it, though, and I think you have an interesting use case here. Could you point me to some of those database programs?

Not source code, unfortunately. I know that's not a very good answer, but they are far ahead of what most open source apps are doing scalability-wise today, and they end up rolling their own complex locking. Hopefully the example I gave is simple enough to understand.

>
>> - Great to see 64-bit support in; it is helpful for doing some interesting things with locks without hacks (e.g., putting an address in the lock word).
>>
>> - Are we really keen on squashing the node ID into flags in this day and age? I guess okay, but it seems like it would be nice to allow a bit more space in general for the operations. I don't want to turn it into a whole big multiplexing nightmare again with lots of such flags, or propose complexity with no code behind it, but I think we need a bit of leeway for unforeseen locking innovations to be added carefully. The pthread locking today is still fairly primitive, really; I don't think we know what will work best for the next 10 years.
>
> In the interface that I'd proposed, the node ID isn't part of the flags. You have a flag FUTEX_FLAG_NUMA, and when that is used, you pass in `void *uaddr` a pointer to a `struct futex_numa { int value; int hint; }`, where hint should be the node ID you would like to work on, and value is just the userspace futex. This is documented in more detail in patch 7, "docs: locking: futex2: Add documentation".
>
> If you have any feedback about what this NUMA interface looks like, I would like to hear it.
>
> Also, did something in my writing indicate that the node ID would be part of the flags? I'll improve it if so.

Oh, I did miss this, thank you. No, it wasn't your writing; I think it was me trying to read through a lot of messages and getting confused with some earlier conversations. I'll look a bit more at the NUMA interface.

>
>>
>> One scalability issue we are starting to hit, and which will only get worse, is futex queue spinlock contention. Perhaps this is better addressed in userspace, but the kernel could play a part, so I like to leave some doors open. One example is that the wait (or wake) side may like to depend not just on the memory value, but on the success of a cmpxchg, to avoid unqueueing and queueing spuriously, which increases lock contention but also ends up putting the poor task on the back of the list -- yes, RT priorities can help the realtime case, but non-RT cases can get bad outlier latencies if lock stealing is allowed (which can be very good for performance).
>>
>
> Sorry, I'm not sure what you mean here. Are you proposing to have a cmpxchg on the kernel side, so the lock would be taken by the kernel, and not by userspace as it is now?

Yes. Only in slow paths, of course, to reduce starvation / erratic latencies and spurious wait queue manipulations.
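(To make that concrete: something along these lines, where futex_wait_cmpxchg() is an invented name to illustrate the semantics, not anything in the posted series.)

void lock_slowpath(unsigned int *lock)
{
        /* Ask the kernel to attempt cmpxchg(lock, 0, 1) on our behalf
         * in its slow path. If the cmpxchg succeeds, we own the lock
         * and are never queued; only if it fails (the lock is still
         * held) are we put on the wait queue. A woken waiter that
         * loses the race to a lock stealer is handled the same way,
         * rather than unqueueing, failing, and queueing again. */
        while (futex_wait_cmpxchg(lock, /* old */ 0, /* new */ 1,
                                  FUTEX_32, /* timeout */ NULL) != 0)
                ;       /* lost the race or spurious wakeup: retry */
}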
Actually, one other scalability thing while I remember it:

futex_wait currently requires that the lock word is tested under the queue spinlock (to avoid consuming a wakeup). The problem with this is that the lock word can be a very hot cache line if you have a lot of concurrency, so accessing it under the queue lock can increase queue lock hold time. I would prefer if the new API were relaxed to avoid this restriction (e.g., any wait call may consume a wakeup, so it's up to userspace to avoid that if it is a problem).

>> - The private global futex hash table sucks for various reasons, and having 128 waiters per thread makes it orders of magnitude easier for userspace to DoS stuff with hash collisions. NUMA doesn't fix that; the per-process hashing that Thomas suggested does fix the DoS, but the non-deterministic hash collisions still seem to be a problem for real-time response, and at the other end of the scale some apps (certain databases, etc.) can have ten thousand futex waiters at once, so the birthday paradox can also lead to guaranteed (low-level) variable behaviour within a single process.
>>
>> I know the kernel in general is not very robust against this kind of DoS/nondeterminism, but it's a bit sad to introduce new APIs with the problem still there. Yes, we could address it later, but I think it's better done first, because the solution might influence what the best syscall API is.
>>
>> For example, the idea of using the address as the handle for the wait queue _and_ the value is very convenient, but hashing is annoying for all the above reasons, and the shared wait queue case is pretty clunky. It's also constraining in some corner cases to have the wait queue associated with the address 1:1. For example, a type of NUMA mutex might want to have per-node waitqueues associated with a lock word, and wake local-node waiters 99% of the time. That is not trivial to do with futexes, and it seems to at least require bouncing more cache lines, possibly more atomics, etc.
>>
>> Could anything else be done?
>
> I wasn't aware that userspace doing DoS is something to be concerned about from the kernel point of view. Is this scenario considering a malicious actor? If so, there are plenty of resources to be denied, so I'm not sure how futex could be protected from this. Or is this just a program that uses tons of futexes?

Both, really. AFAIKS, one of the efforts that prompted the futex modernisation work was the RT latency issues raised by Thomas in 2016, when the per-process table was proposed.

>> I'll be burned at the stake for suggesting it, but it would be great if we could use file descriptors. At least for the shared futex; maybe private could use a per-process futex allocator. It solves all of the above, although I'm sure it has many of its own problems. It may not play so nicely with the pthread mutex API because of the whole static initialiser problem, but the first futex proposal did use fds. It's an example of an alternate API, anyway.
>>
>
> FDs and futexes don't play well together, because for futex_wait() you need to tell the kernel the expected value at the futex address to avoid sleeping on a free lock. FD operations (poll, select) don't have this `value` argument, so they could sleep forever; I'm not sure if you had taken this into consideration.

I had. The futex wait API would take an additional fd. The only difference is that the waitqueue used when a sleep or wake is required is derived from the fd, not from an address.
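(Roughly this shape -- futex_create(), futex_fd_wait() and futex_fd_wake() are invented names for illustration only, and of course the waiter and waker would be different threads; this just shows the call shapes.)

#include <unistd.h>

static unsigned int word;       /* the lock word stays in user memory */

void fd_example(void)
{
        /* A kernel wait queue object named by an fd, instead of one
         * found by hashing the futex address. */
        int fd = futex_create(0);

        /* The waiter still passes the address and the expected value,
         * so there is no sleeping-on-a-free-lock problem; the fd only
         * selects which wait queue to block on. */
        futex_fd_wait(fd, &word, /* expected */ 1, FUTEX_32, NULL);

        /* The waker goes through the same wait queue object. */
        futex_fd_wake(fd, /* nr_wake */ 1);

        close(fd);              /* wait queue lifetime follows the fd */
}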
I think the bigger sticking points would be whether it's too heavyweight an object to use (which could be somewhat mitigated with a simpler ida allocator, although that's difficult to do with shared futexes), and whether libc could sanely use fds due to the static initialiser problem of pthread mutexes.

Thanks,
Nick