public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
From: "André Almeida" <andrealmeid@collabora.com>
To: Davidlohr Bueso <dave@stgolabs.net>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Darren Hart <dvhart@infradead.org>,
	linux-kernel@vger.kernel.org,
	Steven Rostedt <rostedt@goodmis.org>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	kernel@collabora.com, krisman@collabora.com,
	pgriffais@valvesoftware.com, z.figura12@gmail.com,
	joel@joelfernandes.org, malteskarupke@fastmail.fm,
	linux-api@vger.kernel.org, fweimer@redhat.com,
	libc-alpha@sourceware.org, linux-kselftest@vger.kernel.org,
	shuah@kernel.org, acme@kernel.org, corbet@lwn.net,
	Peter Oskolkov <posk@posk.io>,
	Andrey Semashev <andrey.semashev@gmail.com>,
	mtk.manpages@gmail.com
Subject: Re:
Date: Mon, 7 Jun 2021 13:02:10 -0300	[thread overview]
Message-ID: <ed1ab464-7362-e1ee-fe3d-d0b1b910b81a@collabora.com> (raw)
In-Reply-To: <20210606191932.osytlsvsnks2cxwa@offworld>

Às 16:19 de 06/06/21, Davidlohr Bueso escreveu:
> Bcc:
> Subject: Re: [PATCH v4 07/15] docs: locking: futex2: Add documentation
> Reply-To:
> In-Reply-To: <20210603195924.361327-8-andrealmeid@collabora.com>
> 
> On Thu, 03 Jun 2021, Andr� Almeida wrote:
> 
>> Add a new documentation file specifying both userspace API and internal
>> implementation details of futex2 syscalls.
> 
> I think equally important would be to provide a manpage for each new
> syscall you are introducing, and keep mkt in the loop as in the past he
> extensively documented and improved futex manpages, and overall has a
> lot of experience with dealing with kernel interfaces.

Right, I'll add the man pages in a future version and make sure to have
mkt in the loop, thanks for the tip.

> 
> Thanks,
> Davidlohr
> 
>>
>> Signed-off-by: André Almeida <andrealmeid@collabora.com>
>> ---
>> Documentation/locking/futex2.rst | 198 +++++++++++++++++++++++++++++++
>> Documentation/locking/index.rst  |   1 +
>> 2 files changed, 199 insertions(+)
>> create mode 100644 Documentation/locking/futex2.rst
>>
>> diff --git a/Documentation/locking/futex2.rst
>> b/Documentation/locking/futex2.rst
>> new file mode 100644
>> index 000000000000..2f74d7c97a55
>> --- /dev/null
>> +++ b/Documentation/locking/futex2.rst
>> @@ -0,0 +1,198 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +======
>> +futex2
>> +======
>> +
>> +:Author: André Almeida <andrealmeid@collabora.com>
>> +
>> +futex, or fast user mutex, is a set of syscalls to allow userspace to
>> create
>> +performant synchronization mechanisms, such as mutexes, semaphores and
>> +conditional variables in userspace. C standard libraries, like glibc,
>> uses it
>> +as a means to implement more high level interfaces like pthreads.
>> +
>> +The interface
>> +=============
>> +
>> +uAPI functions
>> +--------------
>> +
>> +.. kernel-doc:: kernel/futex2.c
>> +   :identifiers: sys_futex_wait sys_futex_wake sys_futex_waitv
>> sys_futex_requeue
>> +
>> +uAPI structures
>> +---------------
>> +
>> +.. kernel-doc:: include/uapi/linux/futex.h
>> +
>> +The ``flag`` argument
>> +---------------------
>> +
>> +The flag is used to specify the size of the futex word
>> +(FUTEX_[8, 16, 32, 64]). It's mandatory to define one, since there's no
>> +default size.
>> +
>> +By default, the timeout uses a monotonic clock, but can be used as a
>> realtime
>> +one by using the FUTEX_REALTIME_CLOCK flag.
>> +
>> +By default, futexes are of the private type, that means that this
>> user address
>> +will be accessed by threads that share the same memory region. This
>> allows for
>> +some internal optimizations, so they are faster. However, if the
>> address needs
>> +to be shared with different processes (like using ``mmap()`` or
>> ``shm()``), they
>> +need to be defined as shared and the flag FUTEX_SHARED_FLAG is used
>> to set that.
>> +
>> +By default, the operation has no NUMA-awareness, meaning that the
>> user can't
>> +choose the memory node where the kernel side futex data will be
>> stored. The
>> +user can choose the node where it wants to operate by setting the
>> +FUTEX_NUMA_FLAG and using the following structure (where X can be 8,
>> 16, 32 or
>> +64)::
>> +
>> + struct futexX_numa {
>> +         __uX value;
>> +         __sX hint;
>> + };
>> +
>> +This structure should be passed at the ``void *uaddr`` of futex
>> functions. The
>> +address of the structure will be used to be waited on/waken on, and the
>> +``value`` will be compared to ``val`` as usual. The ``hint`` member
>> is used to
>> +define which node the futex will use. When waiting, the futex will be
>> +registered on a kernel-side table stored on that node; when waking,
>> the futex
>> +will be searched for on that given table. That means that there's no
>> redundancy
>> +between tables, and the wrong ``hint`` value will lead to undesired
>> behavior.
>> +Userspace is responsible for dealing with node migrations issues that
>> may
>> +occur. ``hint`` can range from [0, MAX_NUMA_NODES), for specifying a
>> node, or
>> +-1, to use the same node the current process is using.
>> +
>> +When not using FUTEX_NUMA_FLAG on a NUMA system, the futex will be
>> stored on a
>> +global table on allocated on the first node.
>> +
>> +The ``timo`` argument
>> +---------------------
>> +
>> +As per the Y2038 work done in the kernel, new interfaces shouldn't
>> add timeout
>> +options known to be buggy. Given that, ``timo`` should be a 64-bit
>> timeout at
>> +all platforms, using an absolute timeout value.
>> +
>> +Implementation
>> +==============
>> +
>> +The internal implementation follows a similar design to the original
>> futex.
>> +Given that we want to replicate the same external behavior of current
>> futex,
>> +this should be somewhat expected.
>> +
>> +Waiting
>> +-------
>> +
>> +For the wait operations, they are all treated as if you want to wait
>> on N
>> +futexes, so the path for futex_wait and futex_waitv is the basically
>> the same.
>> +For both syscalls, the first step is to prepare an internal list for
>> the list
>> +of futexes to wait for (using struct futexv_head). For futex_wait()
>> calls, this
>> +list will have a single object.
>> +
>> +We have a hash table, where waiters register themselves before
>> sleeping. Then
>> +the wake function checks this table looking for waiters at uaddr. 
>> The hash
>> +bucket to be used is determined by a struct futex_key, that stores
>> information
>> +to uniquely identify an address from a given process. Given the huge
>> address
>> +space, there'll be hash collisions, so we store information to be
>> later used on
>> +collision treatment.
>> +
>> +First, for every futex we want to wait on, we check if (``*uaddr ==
>> val``).
>> +This check is done holding the bucket lock, so we are correctly
>> serialized with
>> +any futex_wake() calls. If any waiter fails the check above, we
>> dequeue all
>> +futexes. The check (``*uaddr == val``) can fail for two reasons:
>> +
>> +- The values are different, and we return -EAGAIN. However, if while
>> +  dequeueing we found that some futexes were awakened, we prioritize
>> this
>> +  and return success.
>> +
>> +- When trying to access the user address, we do so with page faults
>> +  disabled because we are holding a bucket's spin lock (and can't sleep
>> +  while holding a spin lock). If there's an error, it might be a page
>> +  fault, or an invalid address. We release the lock, dequeue everyone
>> +  (because it's illegal to sleep while there are futexes enqueued, we
>> +  could lose wakeups) and try again with page fault enabled. If we
>> +  succeed, this means that the address is valid, but we need to do
>> +  all the work again. For serialization reasons, we need to have the
>> +  spin lock when getting the user value. Additionally, for shared
>> +  futexes, we also need to recalculate the hash, since the underlying
>> +  mapping mechanisms could have changed when dealing with page fault.
>> +  If, even with page fault enabled, we can't access the address, it
>> +  means it's an invalid user address, and we return -EFAULT. For this
>> +  case, we prioritize the error, even if some futexes were awaken.
>> +
>> +If the check is OK, they are enqueued on a linked list in our bucket,
>> and
>> +proceed to the next one. If all waiters succeed, we put the thread to
>> sleep
>> +until a futex_wake() call, timeout expires or we get a signal. After
>> waking up,
>> +we dequeue everyone, and check if some futex was awakened. This
>> dequeue is done
>> +by iteratively walking at each element of struct futex_head list.
>> +
>> +All enqueuing/dequeuing operations requires to hold the bucket lock,
>> to avoid
>> +racing while modifying the list.
>> +
>> +Waking
>> +------
>> +
>> +We get the bucket that's storing the waiters at uaddr, and wake the
>> required
>> +number of waiters, checking for hash collision.
>> +
>> +There's an optimization that makes futex_wake() not take the bucket
>> lock if
>> +there's no one to be woken on that bucket. It checks an atomic
>> counter that each
>> +bucket has, if it says 0, then the syscall exits. In order for this
>> to work, the
>> +waiter thread increases it before taking the lock, so the wake thread
>> will
>> +correctly see that there's someone waiting and will continue the path
>> to take
>> +the bucket lock. To get the correct serialization, the waiter issues
>> a memory
>> +barrier after increasing the bucket counter and the waker issues a
>> memory
>> +barrier before checking it.
>> +
>> +Requeuing
>> +---------
>> +
>> +The requeue path first checks for each struct futex_requeue and their
>> flags.
>> +Then, it will compare the expected value with the one at uaddr1::uaddr.
>> +Following the same serialization explained at Waking_, we increase
>> the atomic
>> +counter for the bucket of uaddr2 before taking the lock. We need to
>> have both
>> +buckets locks at same time so we don't race with other futex
>> operation. To
>> +ensure the locks are taken in the same order for all threads (and
>> thus avoiding
>> +deadlocks), every requeue operation takes the "smaller" bucket first,
>> when
>> +comparing both addresses.
>> +
>> +If the compare with user value succeeds, we proceed by waking
>> ``nr_wake``
>> +futexes, and then requeuing ``nr_requeue`` from bucket of uaddr1 to
>> the uaddr2.
>> +This consists in a simple list deletion/addition and replacing the
>> old futex key
>> +with the new one.
>> +
>> +Futex keys
>> +----------
>> +
>> +There are two types of futexes: private and shared ones. The private
>> are futexes
>> +meant to be used by threads that share the same memory space, are
>> easier to be
>> +uniquely identified and thus can have some performance optimization. The
>> +elements for identifying one are: the start address of the page where
>> the
>> +address is, the address offset within the page and the current->mm
>> pointer.
>> +
>> +Now, for uniquely identifying a shared futex:
>> +
>> +- If the page containing the user address is an anonymous page, we can
>> +  just use the same data used for private futexes (the start address of
>> +  the page, the address offset within the page and the current->mm
>> +  pointer); that will be enough for uniquely identifying such futex. We
>> +  also set one bit at the key to differentiate if a private futex is
>> +  used on the same address (mixing shared and private calls does not
>> +  work).
>> +
>> +- If the page is file-backed, current->mm maybe isn't the same one for
>> +  every user of this futex, so we need to use other data: the
>> +  page->index, a UUID for the struct inode and the offset within the
>> +  page.
>> +
>> +Note that members of futex_key don't have any particular meaning
>> after they
>> +are part of the struct - they are just bytes to identify a futex. 
>> Given that,
>> +we don't need to use a particular name or type that matches the
>> original data,
>> +we only need to care about the bitsize of each component and make
>> both private
>> +and shared fit in the same memory space.
>> +
>> +Source code documentation
>> +=========================
>> +
>> +.. kernel-doc:: kernel/futex2.c
>> +   :no-identifiers: sys_futex_wait sys_futex_wake sys_futex_waitv
>> sys_futex_requeue
>> diff --git a/Documentation/locking/index.rst
>> b/Documentation/locking/index.rst
>> index 7003bd5aeff4..9bf03c7fa1ec 100644
>> --- a/Documentation/locking/index.rst
>> +++ b/Documentation/locking/index.rst
>> @@ -24,6 +24,7 @@ locking
>>     percpu-rw-semaphore
>>     robust-futexes
>>     robust-futex-ABI
>> +    futex2
>>
>> .. only::  subproject and html
>>
>> -- 
>> 2.31.1
>>

  reply	other threads:[~2021-06-07 16:02 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-06 19:19 Davidlohr Bueso
2021-06-07 16:02 ` André Almeida [this message]
  -- strict thread matches above, loose matches on Subject: below --
2023-05-27 18:46 [PATCH v10 3/3] x86: Make the divisor in setting `non_temporal_threshold` cpu specific Noah Goldstein
2023-07-10  5:23 ` Sajan Karumanchi
2023-07-10 15:58   ` Noah Goldstein
2023-07-14  2:21     ` Re: Noah Goldstein
2023-07-14  7:39     ` Re: sajan karumanchi
2023-01-13  5:41 Re: father.dominic
     [not found] <1481545990-7247-1-git-send-email-adhemerval.zanella@linaro.org>
2016-12-19 18:18 ` Re: Adhemerval Zanella
2005-07-15 21:51 Re: ИнфоПространство

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ed1ab464-7362-e1ee-fe3d-d0b1b910b81a@collabora.com \
    --to=andrealmeid@collabora.com \
    --cc=acme@kernel.org \
    --cc=andrey.semashev@gmail.com \
    --cc=bigeasy@linutronix.de \
    --cc=corbet@lwn.net \
    --cc=dave@stgolabs.net \
    --cc=dvhart@infradead.org \
    --cc=fweimer@redhat.com \
    --cc=joel@joelfernandes.org \
    --cc=kernel@collabora.com \
    --cc=krisman@collabora.com \
    --cc=libc-alpha@sourceware.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=malteskarupke@fastmail.fm \
    --cc=mingo@redhat.com \
    --cc=mtk.manpages@gmail.com \
    --cc=peterz@infradead.org \
    --cc=pgriffais@valvesoftware.com \
    --cc=posk@posk.io \
    --cc=rostedt@goodmis.org \
    --cc=shuah@kernel.org \
    --cc=tglx@linutronix.de \
    --cc=z.figura12@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).