From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dave@stgolabs.net>
Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.220.28])
 by sourceware.org (Postfix) with ESMTPS id 169053857406
 for <libc-alpha@sourceware.org>; Sun,  6 Jun 2021 19:24:08 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 169053857406
Authentication-Results: sourceware.org;
 dmarc=none (p=none dis=none) header.from=stgolabs.net
Authentication-Results: sourceware.org; spf=fail smtp.mailfrom=stgolabs.net
Received: from imap.suse.de (imap-alt.suse-dmz.suse.de [192.168.254.47])
 (using TLSv1.2 with cipher ECDHE-ECDSA-AES128-GCM-SHA256 (128/128 bits))
 (No client certificate requested)
 by smtp-out1.suse.de (Postfix) with ESMTPS id E5E3421A4B;
 Sun,  6 Jun 2021 19:24:06 +0000 (UTC)
Received: from imap3-int (imap-alt.suse-dmz.suse.de [192.168.254.47])
 by imap.suse.de (Postfix) with ESMTP id 01BAF118DD;
 Sun,  6 Jun 2021 19:24:01 +0000 (UTC)
Received: from director2.suse.de ([192.168.254.72]) by imap3-int with ESMTPSA
 id ukBFLdEgvWCcXwAALh3uQQ
 (envelope-from <dave@stgolabs.net>); Sun, 06 Jun 2021 19:24:01 +0000
Date: Sun, 6 Jun 2021 12:23:56 -0700
From: Davidlohr Bueso <dave@stgolabs.net>
Cc: Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
 Peter Zijlstra <peterz@infradead.org>,
 Darren Hart <dvhart@infradead.org>, linux-kernel@vger.kernel.org,
 Steven Rostedt <rostedt@goodmis.org>,
 Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
 kernel@collabora.com, krisman@collabora.com,
 pgriffais@valvesoftware.com, z.figura12@gmail.com,
 joel@joelfernandes.org, malteskarupke@fastmail.fm,
 linux-api@vger.kernel.org, fweimer@redhat.com,
 libc-alpha@sourceware.org, linux-kselftest@vger.kernel.org,
 shuah@kernel.org, acme@kernel.org, corbet@lwn.net,
 Peter Oskolkov <posk@posk.io>,
 Andrey Semashev <andrey.semashev@gmail.com>, mtk.manpages@gmail.com
Subject: Re: [PATCH v4 07/15] docs: locking: futex2: Add documentation
Message-ID: <20210606192356.4sjhhowa45bo6g4j@offworld>
65;5803;1cTo: =?utf-8?B?QW5kcsOvwr8=?= =?utf-8?B?wr0=?= Almeida
 <andrealmeid@collabora.com>
References: <20210603195924.361327-1-andrealmeid@collabora.com>
 <20210603195924.361327-8-andrealmeid@collabora.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1; format=flowed
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
In-Reply-To: <20210603195924.361327-8-andrealmeid@collabora.com>
User-Agent: NeoMutt/20201120
X-Spam-Status: No, score=-11.9 required=5.0 tests=BAYES_00, GIT_PATCH_0,
 KAM_DMARC_STATUS, MISSING_HEADERS, SPF_HELO_NONE, SPF_SOFTFAIL,
 TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Sun, 06 Jun 2021 19:24:10 -0000

On Thu, 03 Jun 2021, Andr=EF=BF=BD Almeida wrote:

>Add a new documentation file specifying both userspace API and internal
>implementation details of futex2 syscalls.

I think equally important would be to provide a manpage for each new
syscall you are introducing, and keep mkt in the loop as in the past he
extensively documented and improved futex manpages, and overall has a
lot of experience with dealing with kernel interfaces.

Thanks,
Davidlohr

>
>Signed-off-by: Andr=E9 Almeida <andrealmeid@collabora.com>
>---
> Documentation/locking/futex2.rst | 198 +++++++++++++++++++++++++++++++
> Documentation/locking/index.rst  |   1 +
> 2 files changed, 199 insertions(+)
> create mode 100644 Documentation/locking/futex2.rst
>
>diff --git a/Documentation/locking/futex2.rst b/Documentation/locking/fute=
x2.rst
>new file mode 100644
>index 000000000000..2f74d7c97a55
>--- /dev/null
>+++ b/Documentation/locking/futex2.rst
>@@ -0,0 +1,198 @@
>+.. SPDX-License-Identifier: GPL-2.0
>+
>+=3D=3D=3D=3D=3D=3D
>+futex2
>+=3D=3D=3D=3D=3D=3D
>+
>+:Author: Andr=E9 Almeida <andrealmeid@collabora.com>
>+
>+futex, or fast user mutex, is a set of syscalls to allow userspace to cre=
ate
>+performant synchronization mechanisms, such as mutexes, semaphores and
>+conditional variables in userspace. C standard libraries, like glibc, use=
s it
>+as a means to implement more high level interfaces like pthreads.
>+
>+The interface
>+=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
>+
>+uAPI functions
>+--------------
>+
>+.. kernel-doc:: kernel/futex2.c
>+   :identifiers: sys_futex_wait sys_futex_wake sys_futex_waitv sys_futex_=
requeue
>+
>+uAPI structures
>+---------------
>+
>+.. kernel-doc:: include/uapi/linux/futex.h
>+
>+The ``flag`` argument
>+---------------------
>+
>+The flag is used to specify the size of the futex word
>+(FUTEX_[8, 16, 32, 64]). It's mandatory to define one, since there's no
>+default size.
>+
>+By default, the timeout uses a monotonic clock, but can be used as a real=
time
>+one by using the FUTEX_REALTIME_CLOCK flag.
>+
>+By default, futexes are of the private type, that means that this user ad=
dress
>+will be accessed by threads that share the same memory region. This allow=
s for
>+some internal optimizations, so they are faster. However, if the address =
needs
>+to be shared with different processes (like using ``mmap()`` or ``shm()``=
), they
>+need to be defined as shared and the flag FUTEX_SHARED_FLAG is used to se=
t that.
>+
>+By default, the operation has no NUMA-awareness, meaning that the user ca=
n't
>+choose the memory node where the kernel side futex data will be stored. T=
he
>+user can choose the node where it wants to operate by setting the
>+FUTEX_NUMA_FLAG and using the following structure (where X can be 8, 16, =
32 or
>+64)::
>+
>+ struct futexX_numa {
>+         __uX value;
>+         __sX hint;
>+ };
>+
>+This structure should be passed at the ``void *uaddr`` of futex functions=
=2E The
>+address of the structure will be used to be waited on/waken on, and the
>+``value`` will be compared to ``val`` as usual. The ``hint`` member is us=
ed to
>+define which node the futex will use. When waiting, the futex will be
>+registered on a kernel-side table stored on that node; when waking, the f=
utex
>+will be searched for on that given table. That means that there's no redu=
ndancy
>+between tables, and the wrong ``hint`` value will lead to undesired behav=
ior.
>+Userspace is responsible for dealing with node migrations issues that may
>+occur. ``hint`` can range from [0, MAX_NUMA_NODES), for specifying a node=
, or
>+-1, to use the same node the current process is using.
>+
>+When not using FUTEX_NUMA_FLAG on a NUMA system, the futex will be stored=
 on a
>+global table on allocated on the first node.
>+
>+The ``timo`` argument
>+---------------------
>+
>+As per the Y2038 work done in the kernel, new interfaces shouldn't add ti=
meout
>+options known to be buggy. Given that, ``timo`` should be a 64-bit timeou=
t at
>+all platforms, using an absolute timeout value.
>+
>+Implementation
>+=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
>+
>+The internal implementation follows a similar design to the original fute=
x.
>+Given that we want to replicate the same external behavior of current fut=
ex,
>+this should be somewhat expected.
>+
>+Waiting
>+-------
>+
>+For the wait operations, they are all treated as if you want to wait on N
>+futexes, so the path for futex_wait and futex_waitv is the basically the =
same.
>+For both syscalls, the first step is to prepare an internal list for the =
list
>+of futexes to wait for (using struct futexv_head). For futex_wait() calls=
, this
>+list will have a single object.
>+
>+We have a hash table, where waiters register themselves before sleeping. =
Then
>+the wake function checks this table looking for waiters at uaddr.  The ha=
sh
>+bucket to be used is determined by a struct futex_key, that stores inform=
ation
>+to uniquely identify an address from a given process. Given the huge addr=
ess
>+space, there'll be hash collisions, so we store information to be later u=
sed on
>+collision treatment.
>+
>+First, for every futex we want to wait on, we check if (``*uaddr =3D=3D v=
al``).
>+This check is done holding the bucket lock, so we are correctly serialize=
d with
>+any futex_wake() calls. If any waiter fails the check above, we dequeue a=
ll
>+futexes. The check (``*uaddr =3D=3D val``) can fail for two reasons:
>+
>+- The values are different, and we return -EAGAIN. However, if while
>+  dequeueing we found that some futexes were awakened, we prioritize this
>+  and return success.
>+
>+- When trying to access the user address, we do so with page faults
>+  disabled because we are holding a bucket's spin lock (and can't sleep
>+  while holding a spin lock). If there's an error, it might be a page
>+  fault, or an invalid address. We release the lock, dequeue everyone
>+  (because it's illegal to sleep while there are futexes enqueued, we
>+  could lose wakeups) and try again with page fault enabled. If we
>+  succeed, this means that the address is valid, but we need to do
>+  all the work again. For serialization reasons, we need to have the
>+  spin lock when getting the user value. Additionally, for shared
>+  futexes, we also need to recalculate the hash, since the underlying
>+  mapping mechanisms could have changed when dealing with page fault.
>+  If, even with page fault enabled, we can't access the address, it
>+  means it's an invalid user address, and we return -EFAULT. For this
>+  case, we prioritize the error, even if some futexes were awaken.
>+
>+If the check is OK, they are enqueued on a linked list in our bucket, and
>+proceed to the next one. If all waiters succeed, we put the thread to sle=
ep
>+until a futex_wake() call, timeout expires or we get a signal. After waki=
ng up,
>+we dequeue everyone, and check if some futex was awakened. This dequeue i=
s done
>+by iteratively walking at each element of struct futex_head list.
>+
>+All enqueuing/dequeuing operations requires to hold the bucket lock, to a=
void
>+racing while modifying the list.
>+
>+Waking
>+------
>+
>+We get the bucket that's storing the waiters at uaddr, and wake the requi=
red
>+number of waiters, checking for hash collision.
>+
>+There's an optimization that makes futex_wake() not take the bucket lock =
if
>+there's no one to be woken on that bucket. It checks an atomic counter th=
at each
>+bucket has, if it says 0, then the syscall exits. In order for this to wo=
rk, the
>+waiter thread increases it before taking the lock, so the wake thread will
>+correctly see that there's someone waiting and will continue the path to =
take
>+the bucket lock. To get the correct serialization, the waiter issues a me=
mory
>+barrier after increasing the bucket counter and the waker issues a memory
>+barrier before checking it.
>+
>+Requeuing
>+---------
>+
>+The requeue path first checks for each struct futex_requeue and their fla=
gs.
>+Then, it will compare the expected value with the one at uaddr1::uaddr.
>+Following the same serialization explained at Waking_, we increase the at=
omic
>+counter for the bucket of uaddr2 before taking the lock. We need to have =
both
>+buckets locks at same time so we don't race with other futex operation. To
>+ensure the locks are taken in the same order for all threads (and thus av=
oiding
>+deadlocks), every requeue operation takes the "smaller" bucket first, when
>+comparing both addresses.
>+
>+If the compare with user value succeeds, we proceed by waking ``nr_wake``
>+futexes, and then requeuing ``nr_requeue`` from bucket of uaddr1 to the u=
addr2.
>+This consists in a simple list deletion/addition and replacing the old fu=
tex key
>+with the new one.
>+
>+Futex keys
>+----------
>+
>+There are two types of futexes: private and shared ones. The private are =
futexes
>+meant to be used by threads that share the same memory space, are easier =
to be
>+uniquely identified and thus can have some performance optimization. The
>+elements for identifying one are: the start address of the page where the
>+address is, the address offset within the page and the current->mm pointe=
r.
>+
>+Now, for uniquely identifying a shared futex:
>+
>+- If the page containing the user address is an anonymous page, we can
>+  just use the same data used for private futexes (the start address of
>+  the page, the address offset within the page and the current->mm
>+  pointer); that will be enough for uniquely identifying such futex. We
>+  also set one bit at the key to differentiate if a private futex is
>+  used on the same address (mixing shared and private calls does not
>+  work).
>+
>+- If the page is file-backed, current->mm maybe isn't the same one for
>+  every user of this futex, so we need to use other data: the
>+  page->index, a UUID for the struct inode and the offset within the
>+  page.
>+
>+Note that members of futex_key don't have any particular meaning after th=
ey
>+are part of the struct - they are just bytes to identify a futex.  Given =
that,
>+we don't need to use a particular name or type that matches the original =
data,
>+we only need to care about the bitsize of each component and make both pr=
ivate
>+and shared fit in the same memory space.
>+
>+Source code documentation
>+=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D
>+
>+.. kernel-doc:: kernel/futex2.c
>+   :no-identifiers: sys_futex_wait sys_futex_wake sys_futex_waitv sys_fut=
ex_requeue
>diff --git a/Documentation/locking/index.rst b/Documentation/locking/index=
=2Erst
>index 7003bd5aeff4..9bf03c7fa1ec 100644
>--- a/Documentation/locking/index.rst
>+++ b/Documentation/locking/index.rst
>@@ -24,6 +24,7 @@ locking
>     percpu-rw-semaphore
>     robust-futexes
>     robust-futex-ABI
>+    futex2
>
> .. only::  subproject and html
>
>--
>2.31.1
>