From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 2 May 2024 10:45:41 +0200
From: Christian Brauner
To: André Almeida
Cc: Mathieu Desnoyers, Peter Zijlstra, Thomas Gleixner,
 linux-kernel@vger.kernel.org, "Paul E. McKenney", Boqun Feng,
 "H. Peter Anvin", Paul Turner, linux-api@vger.kernel.org, Florian Weimer,
 David.Laight@aculab.com, carlos@redhat.com, Peter Oskolkov,
 Alexander Mikhalitsyn, Chris Kennelly, Ingo Molnar, Darren Hart,
 Davidlohr Bueso, libc-alpha@sourceware.org, Steven Rostedt,
 Jonathan Corbet, Noah Goldstein, Daniel Colascione, longman@redhat.com,
 kernel-dev@igalia.com
Subject: Re: [RFC PATCH 0/1] Add FUTEX_SPIN operation
Message-ID: <20240502-gezeichnet-besonderen-d277879cd669@brauner>
References: <20240425204332.221162-1-andrealmeid@igalia.com>
 <20240426-gaumen-zweibeinig-3490b06e86c2@brauner>

On Wed, May 01, 2024 at 08:44:36PM -0300, André Almeida wrote:
> Hi Christian,
> 
> On 26/04/2024 07:26, Christian Brauner wrote:
> > On Thu, Apr 25, 2024 at 05:43:31PM -0300, André Almeida wrote:
> > > Hi,
> > > 
> > > In the last LPC, Mathieu Desnoyers and I presented[0] a proposal to
> > > extend the rseq interface to be able to implement spin locks in
> > > userspace correctly.
> > > Thomas Gleixner agreed that this is something that Linux could improve,
> > > but asked for an alternative proposal first: a futex operation that
> > > allows spinning on a user lock inside the kernel. This patchset
> > > implements a prototype of this idea for further discussion.
> > > 
> > > With the FUTEX2_SPIN flag set during a futex_wait(), the futex value is
> > > expected to be the PID of the lock owner. The kernel then gets the
> > > task_struct of the corresponding PID and checks whether it is running.
> > > It spins until the futex is woken, the task is scheduled out, or a
> > > timeout happens. If the lock owner is scheduled out at any time, the
> > > syscall falls back to the normal path of sleeping as usual.
> > > 
> > > If the futex is woken while we are spinning, we can return to userspace
> > > quickly and avoid being scheduled out and back in again to wake from a
> > > futex_wait(), thus speeding up the wait operation.
> > > 
> > > I didn't manage to find a good mechanism to prevent race conditions
> > > between setting *futex = PID in userspace and doing
> > > find_get_task_by_vpid(PID) in kernel space, given that there is enough
> > > room for the original PID owner to exit and for that PID to be
> > > reallocated to another, unrelated task in the system. I didn't perform
> > 
> > One option would be to also allow pidfds. Starting with v6.9 they can be
> > used to reference individual threads.
> > 
> > So for the really fast case where you have multiple threads and you
> > somehow may really care about the impact of the atomic_long_inc() on
> > pidfd_file->f_count during fdget() (for the single-threaded case the
> > increment is elided), callers can pass the TID. But in cases where the
> > inc and put aren't performance sensitive, you can use pidfds.
> 
> Thank you very much for making the effort here, much appreciated :)
> 
> While I agree that pidfds would fix the PID race conditions, I will move
> this interface to support TIDs instead, as noted by Florian and Peter.
> With TID the race conditions are diminished, I reckon?

Unless I'm missing something, the question here is PID (as in TGID, aka the
thread-group leader id gotten via getpid()) vs TID (the thread-specific id
gotten via gettid()). You want the thread-specific id, as you want to
interact with the futex state of a specific thread, not the thread-group
leader.

Aside from that, TIDs are subject to the same race conditions that PIDs are:
they are allocated from the same pool (see alloc_pid()).
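To make the PID/TID distinction concrete: every thread in a process shares one
PID (the TGID that getpid() returns), while gettid() returns a per-thread id,
so a futex word that names a lock-owning *thread* must hold the TID. A small
standalone demo (not code from the patchset):

```c
/* PID vs TID: all threads of a process share getpid() (the TGID), but each
 * thread has its own gettid(). Demo code, not from the FUTEX_SPIN patchset. */
#include <pthread.h>
#include <unistd.h>
#include <sys/syscall.h>

struct ids { pid_t tid; pid_t pid; };

static pid_t my_tid(void)
{
    return (pid_t)syscall(SYS_gettid); /* raw syscall: works on old glibc too */
}

static void *record_ids(void *arg)
{
    struct ids *out = arg;
    out->tid = my_tid();   /* per-thread id */
    out->pid = getpid();   /* shared thread-group id */
    return NULL;
}

/* Returns 1 iff a spawned thread shares our PID but has a distinct TID. */
static int tid_demo(void)
{
    struct ids other = { 0, 0 };
    pthread_t t;

    if (pthread_create(&t, NULL, record_ids, &other) != 0)
        return -1;
    pthread_join(t, NULL);
    return other.pid == getpid() && other.tid != my_tid();
}
```

This also illustrates why storing getpid() in the futex word cannot identify
the owning thread: the value would be identical for every thread in the
process.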
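For illustration, the userspace side of the proposed protocol might look like
the sketch below: the futex word holds the owner's thread id (0 when free),
and the contended path would call into the kernel with the proposed spin
behaviour. FUTEX2_SPIN is only a proposal from this patchset, so its value
here is a placeholder and the slow-path syscall is shown as a comment, not as
a real UAPI call:

```c
/* Sketch of a TID-owner lock for the proposed FUTEX2_SPIN scheme.
 * The futex word is 0 when unlocked, or the owner's TID when held. */
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#define FUTEX2_SPIN 0x80 /* placeholder: proposed flag, not a released UAPI value */

static void tid_lock(atomic_uint *futex, uint32_t self_tid)
{
    uint32_t expected = 0;

    while (!atomic_compare_exchange_strong(futex, &expected, self_tid)) {
        /* Contended: under the proposal the kernel would spin while the
         * owner named by *futex is on-CPU and sleep otherwise, roughly:
         *
         *   syscall(SYS_futex_wait, futex, expected, ~0u,
         *           FUTEX2_SIZE_U32 | FUTEX2_SPIN, NULL, 0);
         *
         * Left as a comment since the flag does not exist in any
         * released kernel; this fallback just retries the CAS. */
        expected = 0;
    }
}

static void tid_unlock(atomic_uint *futex)
{
    /* A real implementation would also futex_wake() any waiters. */
    atomic_store(futex, 0);
}
```

The uncontended fast path never enters the kernel at all, which is the case
the spin proposal leaves untouched.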