From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.220.28]) by sourceware.org (Postfix) with ESMTPS id 169053857406 for ; Sun, 6 Jun 2021 19:24:08 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 169053857406 Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=stgolabs.net Authentication-Results: sourceware.org; spf=fail smtp.mailfrom=stgolabs.net Received: from imap.suse.de (imap-alt.suse-dmz.suse.de [192.168.254.47]) (using TLSv1.2 with cipher ECDHE-ECDSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp-out1.suse.de (Postfix) with ESMTPS id E5E3421A4B; Sun, 6 Jun 2021 19:24:06 +0000 (UTC) Received: from imap3-int (imap-alt.suse-dmz.suse.de [192.168.254.47]) by imap.suse.de (Postfix) with ESMTP id 01BAF118DD; Sun, 6 Jun 2021 19:24:01 +0000 (UTC) Received: from director2.suse.de ([192.168.254.72]) by imap3-int with ESMTPSA id ukBFLdEgvWCcXwAALh3uQQ (envelope-from ); Sun, 06 Jun 2021 19:24:01 +0000 Date: Sun, 6 Jun 2021 12:23:56 -0700 From: Davidlohr Bueso Cc: Thomas Gleixner , Ingo Molnar , Peter Zijlstra , Darren Hart , linux-kernel@vger.kernel.org, Steven Rostedt , Sebastian Andrzej Siewior , kernel@collabora.com, krisman@collabora.com, pgriffais@valvesoftware.com, z.figura12@gmail.com, joel@joelfernandes.org, malteskarupke@fastmail.fm, linux-api@vger.kernel.org, fweimer@redhat.com, libc-alpha@sourceware.org, linux-kselftest@vger.kernel.org, shuah@kernel.org, acme@kernel.org, corbet@lwn.net, Peter Oskolkov , Andrey Semashev , mtk.manpages@gmail.com Subject: Re: [PATCH v4 07/15] docs: locking: futex2: Add documentation Message-ID: <20210606192356.4sjhhowa45bo6g4j@offworld> 65;5803;1cTo: =?utf-8?B?QW5kcsOvwr8=?= =?utf-8?B?wr0=?= Almeida References: <20210603195924.361327-1-andrealmeid@collabora.com> <20210603195924.361327-8-andrealmeid@collabora.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1; format=flowed Content-Disposition: inline Content-Transfer-Encoding: quoted-printable In-Reply-To: <20210603195924.361327-8-andrealmeid@collabora.com> User-Agent: NeoMutt/20201120 X-Spam-Status: No, score=-11.9 required=5.0 tests=BAYES_00, GIT_PATCH_0, KAM_DMARC_STATUS, MISSING_HEADERS, SPF_HELO_NONE, SPF_SOFTFAIL, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 06 Jun 2021 19:24:10 -0000 On Thu, 03 Jun 2021, Andr=EF=BF=BD Almeida wrote: >Add a new documentation file specifying both userspace API and internal >implementation details of futex2 syscalls. I think equally important would be to provide a manpage for each new syscall you are introducing, and keep mkt in the loop as in the past he extensively documented and improved futex manpages, and overall has a lot of experience with dealing with kernel interfaces. Thanks, Davidlohr > >Signed-off-by: Andr=E9 Almeida >--- > Documentation/locking/futex2.rst | 198 +++++++++++++++++++++++++++++++ > Documentation/locking/index.rst | 1 + > 2 files changed, 199 insertions(+) > create mode 100644 Documentation/locking/futex2.rst > >diff --git a/Documentation/locking/futex2.rst b/Documentation/locking/fute= x2.rst >new file mode 100644 >index 000000000000..2f74d7c97a55 >--- /dev/null >+++ b/Documentation/locking/futex2.rst >@@ -0,0 +1,198 @@ >+.. SPDX-License-Identifier: GPL-2.0 >+ >+=3D=3D=3D=3D=3D=3D >+futex2 >+=3D=3D=3D=3D=3D=3D >+ >+:Author: Andr=E9 Almeida >+ >+futex, or fast user mutex, is a set of syscalls to allow userspace to cre= ate >+performant synchronization mechanisms, such as mutexes, semaphores and >+conditional variables in userspace. C standard libraries, like glibc, use= s it >+as a means to implement more high level interfaces like pthreads. >+ >+The interface >+=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D >+ >+uAPI functions >+-------------- >+ >+.. kernel-doc:: kernel/futex2.c >+ :identifiers: sys_futex_wait sys_futex_wake sys_futex_waitv sys_futex_= requeue >+ >+uAPI structures >+--------------- >+ >+.. kernel-doc:: include/uapi/linux/futex.h >+ >+The ``flag`` argument >+--------------------- >+ >+The flag is used to specify the size of the futex word >+(FUTEX_[8, 16, 32, 64]). It's mandatory to define one, since there's no >+default size. >+ >+By default, the timeout uses a monotonic clock, but can be used as a real= time >+one by using the FUTEX_REALTIME_CLOCK flag. >+ >+By default, futexes are of the private type, that means that this user ad= dress >+will be accessed by threads that share the same memory region. This allow= s for >+some internal optimizations, so they are faster. However, if the address = needs >+to be shared with different processes (like using ``mmap()`` or ``shm()``= ), they >+need to be defined as shared and the flag FUTEX_SHARED_FLAG is used to se= t that. >+ >+By default, the operation has no NUMA-awareness, meaning that the user ca= n't >+choose the memory node where the kernel side futex data will be stored. T= he >+user can choose the node where it wants to operate by setting the >+FUTEX_NUMA_FLAG and using the following structure (where X can be 8, 16, = 32 or >+64):: >+ >+ struct futexX_numa { >+ __uX value; >+ __sX hint; >+ }; >+ >+This structure should be passed at the ``void *uaddr`` of futex functions= =2E The >+address of the structure will be used to be waited on/waken on, and the >+``value`` will be compared to ``val`` as usual. The ``hint`` member is us= ed to >+define which node the futex will use. When waiting, the futex will be >+registered on a kernel-side table stored on that node; when waking, the f= utex >+will be searched for on that given table. That means that there's no redu= ndancy >+between tables, and the wrong ``hint`` value will lead to undesired behav= ior. >+Userspace is responsible for dealing with node migrations issues that may >+occur. ``hint`` can range from [0, MAX_NUMA_NODES), for specifying a node= , or >+-1, to use the same node the current process is using. >+ >+When not using FUTEX_NUMA_FLAG on a NUMA system, the futex will be stored= on a >+global table on allocated on the first node. >+ >+The ``timo`` argument >+--------------------- >+ >+As per the Y2038 work done in the kernel, new interfaces shouldn't add ti= meout >+options known to be buggy. Given that, ``timo`` should be a 64-bit timeou= t at >+all platforms, using an absolute timeout value. >+ >+Implementation >+=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D >+ >+The internal implementation follows a similar design to the original fute= x. >+Given that we want to replicate the same external behavior of current fut= ex, >+this should be somewhat expected. >+ >+Waiting >+------- >+ >+For the wait operations, they are all treated as if you want to wait on N >+futexes, so the path for futex_wait and futex_waitv is the basically the = same. >+For both syscalls, the first step is to prepare an internal list for the = list >+of futexes to wait for (using struct futexv_head). For futex_wait() calls= , this >+list will have a single object. >+ >+We have a hash table, where waiters register themselves before sleeping. = Then >+the wake function checks this table looking for waiters at uaddr. The ha= sh >+bucket to be used is determined by a struct futex_key, that stores inform= ation >+to uniquely identify an address from a given process. Given the huge addr= ess >+space, there'll be hash collisions, so we store information to be later u= sed on >+collision treatment. >+ >+First, for every futex we want to wait on, we check if (``*uaddr =3D=3D v= al``). >+This check is done holding the bucket lock, so we are correctly serialize= d with >+any futex_wake() calls. If any waiter fails the check above, we dequeue a= ll >+futexes. The check (``*uaddr =3D=3D val``) can fail for two reasons: >+ >+- The values are different, and we return -EAGAIN. However, if while >+ dequeueing we found that some futexes were awakened, we prioritize this >+ and return success. >+ >+- When trying to access the user address, we do so with page faults >+ disabled because we are holding a bucket's spin lock (and can't sleep >+ while holding a spin lock). If there's an error, it might be a page >+ fault, or an invalid address. We release the lock, dequeue everyone >+ (because it's illegal to sleep while there are futexes enqueued, we >+ could lose wakeups) and try again with page fault enabled. If we >+ succeed, this means that the address is valid, but we need to do >+ all the work again. For serialization reasons, we need to have the >+ spin lock when getting the user value. Additionally, for shared >+ futexes, we also need to recalculate the hash, since the underlying >+ mapping mechanisms could have changed when dealing with page fault. >+ If, even with page fault enabled, we can't access the address, it >+ means it's an invalid user address, and we return -EFAULT. For this >+ case, we prioritize the error, even if some futexes were awaken. >+ >+If the check is OK, they are enqueued on a linked list in our bucket, and >+proceed to the next one. If all waiters succeed, we put the thread to sle= ep >+until a futex_wake() call, timeout expires or we get a signal. After waki= ng up, >+we dequeue everyone, and check if some futex was awakened. This dequeue i= s done >+by iteratively walking at each element of struct futex_head list. >+ >+All enqueuing/dequeuing operations requires to hold the bucket lock, to a= void >+racing while modifying the list. >+ >+Waking >+------ >+ >+We get the bucket that's storing the waiters at uaddr, and wake the requi= red >+number of waiters, checking for hash collision. >+ >+There's an optimization that makes futex_wake() not take the bucket lock = if >+there's no one to be woken on that bucket. It checks an atomic counter th= at each >+bucket has, if it says 0, then the syscall exits. In order for this to wo= rk, the >+waiter thread increases it before taking the lock, so the wake thread will >+correctly see that there's someone waiting and will continue the path to = take >+the bucket lock. To get the correct serialization, the waiter issues a me= mory >+barrier after increasing the bucket counter and the waker issues a memory >+barrier before checking it. >+ >+Requeuing >+--------- >+ >+The requeue path first checks for each struct futex_requeue and their fla= gs. >+Then, it will compare the expected value with the one at uaddr1::uaddr. >+Following the same serialization explained at Waking_, we increase the at= omic >+counter for the bucket of uaddr2 before taking the lock. We need to have = both >+buckets locks at same time so we don't race with other futex operation. To >+ensure the locks are taken in the same order for all threads (and thus av= oiding >+deadlocks), every requeue operation takes the "smaller" bucket first, when >+comparing both addresses. >+ >+If the compare with user value succeeds, we proceed by waking ``nr_wake`` >+futexes, and then requeuing ``nr_requeue`` from bucket of uaddr1 to the u= addr2. >+This consists in a simple list deletion/addition and replacing the old fu= tex key >+with the new one. >+ >+Futex keys >+---------- >+ >+There are two types of futexes: private and shared ones. The private are = futexes >+meant to be used by threads that share the same memory space, are easier = to be >+uniquely identified and thus can have some performance optimization. The >+elements for identifying one are: the start address of the page where the >+address is, the address offset within the page and the current->mm pointe= r. >+ >+Now, for uniquely identifying a shared futex: >+ >+- If the page containing the user address is an anonymous page, we can >+ just use the same data used for private futexes (the start address of >+ the page, the address offset within the page and the current->mm >+ pointer); that will be enough for uniquely identifying such futex. We >+ also set one bit at the key to differentiate if a private futex is >+ used on the same address (mixing shared and private calls does not >+ work). >+ >+- If the page is file-backed, current->mm maybe isn't the same one for >+ every user of this futex, so we need to use other data: the >+ page->index, a UUID for the struct inode and the offset within the >+ page. >+ >+Note that members of futex_key don't have any particular meaning after th= ey >+are part of the struct - they are just bytes to identify a futex. Given = that, >+we don't need to use a particular name or type that matches the original = data, >+we only need to care about the bitsize of each component and make both pr= ivate >+and shared fit in the same memory space. >+ >+Source code documentation >+=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D >+ >+.. kernel-doc:: kernel/futex2.c >+ :no-identifiers: sys_futex_wait sys_futex_wake sys_futex_waitv sys_fut= ex_requeue >diff --git a/Documentation/locking/index.rst b/Documentation/locking/index= =2Erst >index 7003bd5aeff4..9bf03c7fa1ec 100644 >--- a/Documentation/locking/index.rst >+++ b/Documentation/locking/index.rst >@@ -24,6 +24,7 @@ locking > percpu-rw-semaphore > robust-futexes > robust-futex-ABI >+ futex2 > > .. only:: subproject and html > >-- >2.31.1 >