public inbox for glibc-bugs@sourceware.org
* [Bug nptl/12683] New: Race conditions in pthread cancellation
@ 2011-04-18 22:28 bugdal at aerifal dot cx
  2011-04-18 22:35 ` [Bug nptl/12683] " bugdal at aerifal dot cx
                   ` (33 more replies)
  0 siblings, 34 replies; 35+ messages in thread
From: bugdal at aerifal dot cx @ 2011-04-18 22:28 UTC (permalink / raw)
  To: glibc-bugs

http://sourceware.org/bugzilla/show_bug.cgi?id=12683

           Summary: Race conditions in pthread cancellation
           Product: glibc
           Version: unspecified
            Status: NEW
          Severity: critical
          Priority: P2
         Component: nptl
        AssignedTo: drepper.fsp@gmail.com
        ReportedBy: bugdal@aerifal.cx


Created attachment 5676
  --> http://sourceware.org/bugzilla/attachment.cgi?id=5676
Demonstration of file descriptor leak due to problem 1

The current approach to implementing pthread cancellation points is to enable
asynchronous cancellation prior to making the syscall, and restore the previous
cancellation type once the syscall returns. I've asked around and heard
conflicting answers as to whether this violates the requirements in POSIX (I
believe it does), but either way, from a quality of implementation standpoint
this approach is very undesirable due to at least 2 problems, the latter of
which is very serious:

1. Cancellation can act after the syscall has returned from kernelspace, but
before userspace saves the return value. This results in a resource leak if the
syscall allocated a resource, and there is no way to patch over it with
cancellation handlers. Even if the syscall did not allocate a resource, it may
have had an effect (like consuming data from a socket/pipe/terminal buffer)
which the application will never see.

2. If a signal is handled while the thread is blocked at a cancellable syscall,
the entire signal handler runs with asynchronous cancellation enabled. This
could be extremely dangerous, since the signal handler may call functions which
are async-signal-safe but not async-cancel-safe. Even worse, the signal handler
may call functions which are not even async-signal-safe (like stdio) if it
knows the interrupted code could only be using async-signal-safe functions, and
having a thread asynchronously terminated while modifying such functions'
internal data structures could lead to serious program malfunction.
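
As a rough illustration of the pattern described above (not glibc's actual
code, which uses internal helpers rather than the public API), a cancellation
point implemented this way behaves much like the hypothetical wrapper below.
The name cancellable_read_sketch is invented for this sketch; the comments mark
where each of the two problems arises.

```c
#include <pthread.h>
#include <unistd.h>

/* Hypothetical wrapper, for illustration only: NOT glibc's real code. */
ssize_t cancellable_read_sketch(int fd, void *buf, size_t len)
{
    int oldtype, ignored;

    /* Switch to asynchronous cancellation for the duration of the call. */
    pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, &oldtype);

    /* Problem 2: a signal handler that runs while the thread blocks here
       executes entirely with asynchronous cancellation enabled. */
    ssize_t ret = read(fd, buf, len);
    /* Problem 1: the kernel may already have consumed data (or allocated a
       resource), but a cancellation acted upon right here discards ret. */

    /* Restore the caller's cancellation type. */
    pthread_setcanceltype(oldtype, &ignored);
    return ret;
}
```

Without cancellation the wrapper behaves exactly like read(); the hazard is
confined to the two commented windows.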

I am attaching simple programs which demonstrate both issues.

The solution to problem 2 is making the thread's current execution context
(e.g. stack pointer) at syscall time part of the cancellability state, so that
cancellation requests received while the cancellation point is interrupted by a
signal handler can identify that the thread is not presently in the cancellable
context.

The solution to problem 1 is making successful return from kernelspace and
exiting the cancellable state an atomic operation. While at first this seems
impossible without kernel support, I have a working implementation in musl
(http://www.etalabs.net/musl) which solves both problems.

-- 
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.



* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
@ 2011-04-18 22:35 ` bugdal at aerifal dot cx
  2011-09-21 18:30 ` bugdal at aerifal dot cx
                   ` (32 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: bugdal at aerifal dot cx @ 2011-04-18 22:35 UTC (permalink / raw)
  To: glibc-bugs

http://sourceware.org/bugzilla/show_bug.cgi?id=12683

--- Comment #1 from Rich Felker <bugdal at aerifal dot cx> 2011-04-18 22:34:44 UTC ---
Created attachment 5677
  --> http://sourceware.org/bugzilla/attachment.cgi?id=5677
Demonstration of problem 2

This program should hang, or possibly print x=0 if scheduling is really wacky.
If it exits printing a nonzero value of the volatile variable x, this means the
signal handler wrongly executed under asynchronous cancellation.


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
  2011-04-18 22:35 ` [Bug nptl/12683] " bugdal at aerifal dot cx
@ 2011-09-21 18:30 ` bugdal at aerifal dot cx
  2012-04-29  2:56 ` bugdal at aerifal dot cx
                   ` (31 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: bugdal at aerifal dot cx @ 2011-09-21 18:30 UTC (permalink / raw)
  To: glibc-bugs

http://sourceware.org/bugzilla/show_bug.cgi?id=12683

--- Comment #2 from Rich Felker <bugdal at aerifal dot cx> 2011-09-21 18:30:01 UTC ---
It's been 5 months since I filed this bug and there's been no response. I
believe this issue is important enough to at least deserve a response. From my
perspective, it makes NPTL's pthread_cancel essentially unusable. I've even
included a proposed solution (albeit not a patch). Getting a confirmation that
you acknowledge the issue exists and are open to a solution would open the door
for somebody to start the work to integrate the solution with glibc/NPTL and
eventually get it fixed.


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
  2011-04-18 22:35 ` [Bug nptl/12683] " bugdal at aerifal dot cx
  2011-09-21 18:30 ` bugdal at aerifal dot cx
@ 2012-04-29  2:56 ` bugdal at aerifal dot cx
  2012-04-29  2:57 ` bugdal at aerifal dot cx
                   ` (30 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: bugdal at aerifal dot cx @ 2012-04-29  2:56 UTC (permalink / raw)
  To: glibc-bugs

http://sourceware.org/bugzilla/show_bug.cgi?id=12683

--- Comment #3 from Rich Felker <bugdal at aerifal dot cx> 2012-04-29 02:55:59 UTC ---
Ping. Now that there's some will to revisit bugs that have been long-ignored,
is anyone willing to look into and confirm the problem I've reported here? I
believe problem 2 is extremely serious and could lead to timing-based attacks
that corrupt memory and result in deadlocks or worse. Problem 1 is also serious
for long-lived processes with high reliability requirements that use thread
cancellation, as rare but undetectable resource leaks are nearly inevitable and
will accumulate over time.


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (2 preceding siblings ...)
  2012-04-29  2:56 ` bugdal at aerifal dot cx
@ 2012-04-29  2:57 ` bugdal at aerifal dot cx
  2012-09-22 23:13 ` bugdal at aerifal dot cx
                   ` (29 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: bugdal at aerifal dot cx @ 2012-04-29  2:57 UTC (permalink / raw)
  To: glibc-bugs

http://sourceware.org/bugzilla/show_bug.cgi?id=12683

Rich Felker <bugdal at aerifal dot cx> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|drepper.fsp at gmail dot    |unassigned at sourceware
                   |com                         |dot org


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (3 preceding siblings ...)
  2012-04-29  2:57 ` bugdal at aerifal dot cx
@ 2012-09-22 23:13 ` bugdal at aerifal dot cx
  2013-08-16 15:32 ` carlos at redhat dot com
                   ` (28 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: bugdal at aerifal dot cx @ 2012-09-22 23:13 UTC (permalink / raw)
  To: glibc-bugs


http://sourceware.org/bugzilla/show_bug.cgi?id=12683

--- Comment #4 from Rich Felker <bugdal at aerifal dot cx> 2012-09-22 23:13:30 UTC ---
I just added a detailed analysis of this bug on my blog at
http://ewontfix.com/2/


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (4 preceding siblings ...)
  2012-09-22 23:13 ` bugdal at aerifal dot cx
@ 2013-08-16 15:32 ` carlos at redhat dot com
  2013-08-16 15:34 ` carlos at redhat dot com
                   ` (27 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: carlos at redhat dot com @ 2013-08-16 15:32 UTC (permalink / raw)
  To: glibc-bugs

http://sourceware.org/bugzilla/show_bug.cgi?id=12683

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |carlos at redhat dot com

--- Comment #5 from Carlos O'Donell <carlos at redhat dot com> ---
There is interest in fixing this issue. It's just that it's complicated :-)


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (6 preceding siblings ...)
  2013-08-16 15:34 ` carlos at redhat dot com
@ 2013-08-16 15:34 ` carlos at redhat dot com
  2013-08-16 16:22 ` bugdal at aerifal dot cx
                   ` (25 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: carlos at redhat dot com @ 2013-08-16 15:34 UTC (permalink / raw)
  To: glibc-bugs

http://sourceware.org/bugzilla/show_bug.cgi?id=12683

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |siddhesh at redhat dot com


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (5 preceding siblings ...)
  2013-08-16 15:32 ` carlos at redhat dot com
@ 2013-08-16 15:34 ` carlos at redhat dot com
  2013-08-16 15:34 ` carlos at redhat dot com
                   ` (26 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: carlos at redhat dot com @ 2013-08-16 15:34 UTC (permalink / raw)
  To: glibc-bugs

http://sourceware.org/bugzilla/show_bug.cgi?id=12683

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |2.19

--- Comment #6 from Carlos O'Donell <carlos at redhat dot com> ---
Let us see if I can't get resources to fix this for 2.19. We've seen some
tst-cancel17 problems specifically around this issue where cancellation is
delivered between syscall return and error storage and that causes problems.


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (7 preceding siblings ...)
  2013-08-16 15:34 ` carlos at redhat dot com
@ 2013-08-16 16:22 ` bugdal at aerifal dot cx
  2013-08-16 16:59 ` carlos at redhat dot com
                   ` (24 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: bugdal at aerifal dot cx @ 2013-08-16 16:22 UTC (permalink / raw)
  To: glibc-bugs

http://sourceware.org/bugzilla/show_bug.cgi?id=12683

--- Comment #7 from Rich Felker <bugdal at aerifal dot cx> ---
Glad to hear that. Have you taken a look at musl's cancellation implementation?
The same mechanism could be used in glibc, or I think it could be modified
somewhat to use DWARF2 CFI instead of the asm labels. The basic approach is
that the cancellation signal handler examines the saved program counter
register and determines whether it's in the critical range that starts just
before the pre-syscall check of the cancellation flag and ends at the syscall
instruction (based on asm labels for these two endpoints). The kernel then
handles the atomicity of side effects for us: if the signal interrupts the
syscall, the kernel must either complete what it's doing and return
(positioning the program counter just past the address range that would allow
cancellation to be acted upon), or reset the program counter to just before the
syscall instruction and set up the register contents for restarting after the
signal handler (in which case cancellation can be acted upon).
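
The range test itself is small. Here is a minimal sketch, assuming cp_begin is
the address of the label placed just before the cancellation-flag check and
cp_end is the address immediately after the syscall instruction. Both
parameter names and the helper name are made up for this illustration, and
reading the saved program counter out of the signal context is
architecture-specific, so it is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper for illustration.
   cp_begin: assumed address just before the pre-syscall flag check.
   cp_end:   assumed address immediately after the syscall instruction. */
bool pc_in_cancellable_range(uintptr_t pc, uintptr_t cp_begin, uintptr_t cp_end)
{
    /* A completed syscall leaves the saved PC at cp_end, outside the range;
       a restarted syscall leaves it on the syscall instruction, inside it. */
    return pc >= cp_begin && pc < cp_end;
}
```

With a half-open range like this, a syscall that has already returned can
never be mistaken for one that is still cancellable.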


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (8 preceding siblings ...)
  2013-08-16 16:22 ` bugdal at aerifal dot cx
@ 2013-08-16 16:59 ` carlos at redhat dot com
  2013-08-16 17:14 ` bugdal at aerifal dot cx
                   ` (23 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: carlos at redhat dot com @ 2013-08-16 16:59 UTC (permalink / raw)
  To: glibc-bugs

http://sourceware.org/bugzilla/show_bug.cgi?id=12683

--- Comment #8 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Rich Felker from comment #7)
> Glad to hear that. Have you taken a look at musl's cancellation
> implementation? The same mechanism could be used in glibc, or I think it
> could be modified somewhat to use DWARF2 CFI instead of the asm labels. The
> basic approach is that the cancellation signal handler examines the saved
> program counter register and determines whether it's in the critical range
> starting just before the pre-syscall check of the cancellation flag and the
> syscall instruction (based on asm labels for these two endpoints). The
> kernel then handles the atomicity of side effects for us: if the signal
> interrupts the syscall, the kernel must either complete what it's doing and
> return (positioning the program counter just past the address range that
> would allow cancellation to be acted upon), or reset the program counter to
> just before the syscall instruction and setup the register contents for
> restarting after the signal handler (in which case cancellation can be acted
> upon).

I have not looked at musl's cancellation implementation.

I assume you are parsing the rt_sigframe set down by the Linux kernel and
extracting information from that?


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (9 preceding siblings ...)
  2013-08-16 16:59 ` carlos at redhat dot com
@ 2013-08-16 17:14 ` bugdal at aerifal dot cx
  2013-08-16 18:09 ` carlos at redhat dot com
                   ` (22 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: bugdal at aerifal dot cx @ 2013-08-16 17:14 UTC (permalink / raw)
  To: glibc-bugs

http://sourceware.org/bugzilla/show_bug.cgi?id=12683

--- Comment #9 from Rich Felker <bugdal at aerifal dot cx> ---
> I have not looked at musl's cancellation implementation.
> 
> I assume you are parsing the rt_sigframe set down by the Linux kernel and
> extracting information from that?

We are using the ucontext_t received via the third argument to the
SA_SIGINFO type signal handler.


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (10 preceding siblings ...)
  2013-08-16 17:14 ` bugdal at aerifal dot cx
@ 2013-08-16 18:09 ` carlos at redhat dot com
  2014-01-10 20:25 ` carlos at redhat dot com
                   ` (21 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: carlos at redhat dot com @ 2013-08-16 18:09 UTC (permalink / raw)
  To: glibc-bugs

http://sourceware.org/bugzilla/show_bug.cgi?id=12683

--- Comment #10 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Rich Felker from comment #9)
> > I have not looked at musl's cancellation implementation.
> > 
> > I assume you are parsing the rt_sigframe set down by the Linux kernel and
> > extracting information from that?
> 
> We are using the ucontext_t received via the third argument to the
> SA_SIGINFO type signal handler.

Same thing. Thanks.


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (11 preceding siblings ...)
  2013-08-16 18:09 ` carlos at redhat dot com
@ 2014-01-10 20:25 ` carlos at redhat dot com
  2014-01-10 21:31 ` carlos at redhat dot com
                   ` (20 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: carlos at redhat dot com @ 2014-01-10 20:25 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=12683

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |14147


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (12 preceding siblings ...)
  2014-01-10 20:25 ` carlos at redhat dot com
@ 2014-01-10 21:31 ` carlos at redhat dot com
  2014-01-10 22:37 ` bugdal at aerifal dot cx
                   ` (19 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: carlos at redhat dot com @ 2014-01-10 21:31 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=12683

--- Comment #11 from Carlos O'Donell <carlos at redhat dot com> ---
Alex Oliva and I talked about this particular issue today.

We believe that an entirely userspace solution is possible without assistance
from the kernel, but it requires signal wrappers.

Signal wrappers are code that executes before and after a signal handler and
does things like save and restore errno (the one use we have for them
currently). The signal wrappers would assist in handling deferred cancellation.

The proposed solution would look like this:

* Stop enabling/disabling asynchronous cancellation around syscalls.

* When a blocking library function that is also a cancellation point is entered,
a word in the thread's TCB (call it IN_SYSCALL) is set to the value of the
stack pointer (we assume no further stack adjustments are made before the
function exits). The value of IN_SYSCALL is cleared just before the function
returns. Deferred cancellation is still checked before and after the syscall.

* Add a signal wrapper to all signals that checks to see if IN_SYSCALL == SP
stored in the ucontext_t and, if so, immediately cancels the thread. The
check is done upon entry and exit of the wrapper to reduce cancellation
latency. Just before unwinding, the IN_SYSCALL value is cleared.

* When a thread starts we install a SIGCANCEL (SIGRTMIN) handler like we did
before, but this handler checks to see if the thread's IN_SYSCALL matches the
SP stored in ucontext_t, indicating that cancellation was requested while
executing in the cancellation region of a blocking syscall (and no other signal
handler is executing). In that case the signal handler cancels the thread
immediately. If IN_SYSCALL != SP then another signal handler is running and we
defer the cancellation to the signal wrapper or syscall wrapper. The SIGCANCEL
handler operates as it previously did when asynchronous cancellation was
enabled.
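
The decision the SIGCANCEL handler makes in this proposal can be modeled as a
small pure function. This is only a sketch of the logic listed above: the
names are hypothetical, and using 0 as the "cleared" value of IN_SYSCALL is an
assumption of the sketch, not something the proposal specifies.

```c
#include <stdint.h>

enum cancel_action { CANCEL_NOW, DEFER_TO_WRAPPER };

/* Hypothetical model of the proposed SIGCANCEL handler decision.
   in_syscall: the TCB word; assumed 0 when cleared, otherwise the stack
               pointer recorded on entry to the cancellation point.
   saved_sp:   the stack pointer found in the handler's ucontext_t. */
enum cancel_action sigcancel_decide(uintptr_t in_syscall, uintptr_t saved_sp)
{
    /* Equal values mean the signal interrupted the cancellation region
       directly, with no other signal handler frame in between. */
    if (in_syscall != 0 && in_syscall == saved_sp)
        return CANCEL_NOW;
    /* Otherwise leave the request for the signal or syscall wrapper. */
    return DEFER_TO_WRAPPER;
}
```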

Resolved use cases:

- Cancellation delivered between first instruction of function and IN_SYSCALL
set: Syscall wrapper code will check for cancellation and act upon it.

- Cancellation delivered between IN_SYSCALL set and syscall: The SIGCANCEL
handler will immediately cancel the thread.

- Cancellation delivered between syscall and clearing IN_SYSCALL: The SIGCANCEL
handler will immediately cancel the thread.

- Cancellation delivered between clearing of IN_SYSCALL and function return:
The next cancellation point will act upon the cancellation (still meets POSIX
requirement given escape clause of "The thread is suspended at a cancellation
point and the event for which it is waiting occurs").

- Cancellation delivered and thread stopped at syscall is executing multiple
nested signal handlers and the first signal handler has not checked IN_SYSCALL
yet: Only the first signal delivered will have IN_SYSCALL == SP be true. The
SIGCANCEL handler will do nothing. The first signal handler's wrapper will
detect the cancellation is active and act upon it as it exits (only after all
the other signal handlers have completed).

- Cancellation delivered and thread stopped at syscall is executing multiple
nested signal handlers and the first signal handler is exiting and has already
checked IN_SYSCALL: The syscall will be interrupted and return. The syscall
wrapper will act upon the cancellation request. The goal here is to have the
signal handlers finish executing without interruption.

Unresolved use cases:

- Related to bug 14147 -- Cancellation delivered while thread is blocked on an
async-safe function (in fact it's only executing async-safe functions during
the time a signal can be delivered for this to be valid) and executing a signal
handler that longjmp's out of the function. In this case IN_SYSCALL is still
set to SP and not cleared. If by luck SP ends up the same, and another thread
delivers a cancellation request the SIGCANCEL handler will immediately cancel
the thread even though it was not in a cancellation region.

- What if you are executing fork and someone tries to cancel you?

A potential resolution to the first unresolved use case is to use a cleanup
handler to reset IN_SYSCALL since such a handler is run when longjmp unwinds
the frames. However we then need to consider cancellation during the execution
of the cleanup.

I haven't fully thought through what to do with the forking a multithreaded
program case, but we should try to see if we can make it work.

Note: Setting IN_SYSCALL must be atomic.


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (13 preceding siblings ...)
  2014-01-10 21:31 ` carlos at redhat dot com
@ 2014-01-10 22:37 ` bugdal at aerifal dot cx
  2014-01-12 18:31 ` carlos at redhat dot com
                   ` (18 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: bugdal at aerifal dot cx @ 2014-01-10 22:37 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=12683

--- Comment #12 from Rich Felker <bugdal at aerifal dot cx> ---
Your proposed solution is a lot more complex and invasive than mine; it's
actually almost equivalent to the first-generation solution I used in musl for
the problem, which turned out to be a bad idea, and thus got scrapped.

Most importantly, aside from being complex and ugly, it does not actually solve
the worst problem, because this case is wrong:

"Cancellation delivered between syscall and clearing IN_SYSCALL: The SIGCANCEL
handler will immediately cancel the thread."

In this case, unless the syscall failed with EINTR, you must not act on the
cancellation request. Doing so is non-conforming to the requirement that the
side effects upon cancellation match the side effects on EINTR (which is just a
fancy way of saying, approximately, that cancellation can only take place if
the syscall has already done its job, e.g. closing a fd, transferring some
bytes, etc.).
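
Put as code, the rule for that window is a one-line check. The sketch below
assumes the Linux raw syscall convention, where a failure is returned as a
negated errno value; the function name is hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical check for the window between syscall return and leaving the
   cancellation point. raw_ret is assumed to follow the Linux raw syscall
   convention: a negated errno value on failure. */
bool may_act_on_cancel_after_syscall(long raw_ret)
{
    /* Only an EINTR failure guarantees the side effects match those of an
       interrupted call; any other result must be reported to the caller. */
    return raw_ret == -EINTR;
}
```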

In addition, I suspect your solution has further flaws like what happens when
you longjmp out of a signal handler that interrupted an AS-safe syscall which
is a cancellation point. These issues can be solved with more complexity (extra
work in longjmp), but the solution I've proposed is much simpler and has no
corner cases that are difficult to handle.


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (14 preceding siblings ...)
  2014-01-10 22:37 ` bugdal at aerifal dot cx
@ 2014-01-12 18:31 ` carlos at redhat dot com
  2014-01-12 23:55 ` bugdal at aerifal dot cx
                   ` (17 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: carlos at redhat dot com @ 2014-01-12 18:31 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=12683

--- Comment #13 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Rich Felker from comment #12)
> Your proposed solution is a lot more complex and invasive than mine; it's
> actually almost equivalent to the first-generation solution I used in musl
> for the problem, which turned out to be a bad idea, and thus got scrapped.

Experience is knowing what not to do :-)

> Most importantly, aside from being complex and ugly, it does not actually
> solve the worst problem, because this case is wrong:

I like it when we can talk concretely about use cases.

> "Cancellation delivered between syscall and clearing IN_SYSCALL: The
> SIGCANCEL handler will immediately cancel the thread."
> 
> In this case, unless the syscall failed with EINTR, you must not act on the
> cancellation request. Doing so is non-conforming to the requirement that the
> side effects upon cancellation match the side effects on EINTR (which is
> just a fancy way of saying, approximately, that cancellation can only take
> place if the syscall has already done its job, e.g. closing a fd,
> transferring some bytes, etc.).

I was not aware of this requirement. Is this written in POSIX or did this come
about from discussion with the Austin group around the problems with close()
being cancelled? Can you provide a reference to this?

> In addition, I suspect your solution has further flaws like what happens
> when you longjmp out of a signal handler that interrupted an AS-safe syscall
> which is a cancellation point. These issues can be solved with more
> complexity (extra work in longjmp), but the solution I've proposed is much
> simpler and has no corner cases that are difficult to handle.

Could you propose your design as a glibc wiki page so that we can look at it
and critique it? I'd be happy to adopt your solution, but I want to review it
and put it through the same kind of use-cases as we discussed here.


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (15 preceding siblings ...)
  2014-01-12 18:31 ` carlos at redhat dot com
@ 2014-01-12 23:55 ` bugdal at aerifal dot cx
  2014-01-13  1:52 ` carlos at redhat dot com
                   ` (16 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: bugdal at aerifal dot cx @ 2014-01-12 23:55 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=12683

--- Comment #14 from Rich Felker <bugdal at aerifal dot cx> ---
From XSH 2.9.5 Thread Cancellation:

"The side-effects of acting upon a cancellation request while suspended during
a call of a function are the same as the side-effects that may be seen in a
single-threaded program when a call to a function is interrupted by a signal
and the given function returns [EINTR]. Any such side-effects occur before any
cancellation cleanup handlers are called."

This is the important paragraph. By requiring that the side-effects on
cancellation match the side effects on EINTR, the standard requires that
cancellation cannot be acted upon if other irreversible side effects have
already taken place. For example, if a file descriptor has been closed, data
transferred, etc. then cancellation can't happen. The following paragraph
explains further:

"Whenever a thread has cancelability enabled and a cancellation request has
been made with that thread as the target, and the thread then calls any
function that is a cancellation point (such as pthread_testcancel() or read()),
the cancellation request shall be acted upon before the function returns."

This is simple. If cancellation is already pending when a cancellation point is
called, it must be acted upon. The next part is less clear:

"If a thread has cancelability enabled and a cancellation request is made with
the thread as a target while the thread is suspended at a cancellation point,
the thread shall be awakened and the cancellation request shall be acted upon.
It is unspecified whether the cancellation request is acted upon or whether the
cancellation request remains pending and the thread resumes normal execution
if:

* The thread is suspended at a cancellation point and the event for which it is
waiting occurs

* A specified timeout expired

before the cancellation request is acted upon."

This is covering the case of blocking syscalls. If a cancellation request
arrives during a blocking syscall, it's normally acted upon, but there's one
race condition being described: it's possible that the "event being waited for"
arrives just before the cancellation request does, while the target thread has
not yet unblocked. In this case, it's unspecified whether cancellation is acted
upon (in which case, by the first paragraph, the event remains pending) or the
event is acted upon (in which case the cancellation request remains pending).
The reason for there being two bullet points above is that some blocking
syscalls wait for either an event or a timeout (think of sem_timedwait or recv
with a timeout set by setsockopt), and in that case, the timeout can also be
'consumed' and cause the cancellation to remain pending.

Anyway, this race condition is the whole matter at hand here. The two
possibilities allowed (again, due to the limitations imposed by the first
paragraph) are acting on the event while leaving cancellation pending, or acting
on cancellation while leaving the event pending. But glibc also has a race
window where it can act on both the event, producing side effects (because the
kernel already has), and act on cancellation. This makes it non-conforming and
makes it impossible to use cancellation safely.
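
As a rough illustration of one of the two permitted outcomes (acting on
cancellation while leaving the event pending), the sketch below makes a
cancellation request pending before the target thread calls read() on a pipe
that already holds one byte. All names here (pending_demo, pending_worker,
pending_cancel_leaves_data) are hypothetical and only serve the example. On an
implementation that follows the quoted text, the thread should be cancelled at
the read() and the byte should still be in the pipe afterwards.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical shared state for the demonstration. */
struct pending_demo {
    int read_fd;        /* read end of a pipe that already holds one byte */
    sem_t ready;        /* worker -> main: cancellation is now disabled */
    sem_t cancel_sent;  /* main -> worker: pthread_cancel() has been called */
    int cleanup_ran;    /* set by the cleanup handler */
    int read_returned;  /* set only if read() returns to the caller */
};

static void pending_cleanup(void *arg)
{
    ((struct pending_demo *)arg)->cleanup_ran = 1;
}

static void *pending_worker(void *arg)
{
    struct pending_demo *d = arg;
    int old;
    char byte;

    /* Hold off cancellation until the request is known to be pending. */
    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &old);
    sem_post(&d->ready);
    while (sem_wait(&d->cancel_sent) == -1 && errno == EINTR)
        ;

    pthread_cleanup_push(pending_cleanup, d);
    pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, &old);
    /* read() is a cancellation point and a request is already pending. */
    if (read(d->read_fd, &byte, 1) == 1)
        d->read_returned = 1;
    pthread_cleanup_pop(0);
    return NULL;
}

/* Returns 1 if the worker was cancelled without consuming the byte. */
int pending_cancel_leaves_data(void)
{
    struct pending_demo d = {0};
    int fds[2];
    pthread_t t;
    void *res = NULL;
    char in = 0;
    ssize_t n;

    if (pipe(fds) != 0 || write(fds[1], "x", 1) != 1)
        return -1;
    d.read_fd = fds[0];
    sem_init(&d.ready, 0, 0);
    sem_init(&d.cancel_sent, 0, 0);
    if (pthread_create(&t, NULL, pending_worker, &d) != 0)
        return -1;

    while (sem_wait(&d.ready) == -1 && errno == EINTR)
        ;
    pthread_cancel(t);          /* request is pending from here on */
    sem_post(&d.cancel_sent);
    pthread_join(t, &res);

    n = read(fds[0], &in, 1);   /* the "event" should still be there */
    close(fds[0]);
    close(fds[1]);
    sem_destroy(&d.ready);
    sem_destroy(&d.cancel_sent);
    return res == PTHREAD_CANCELED && d.cleanup_ran && !d.read_returned
           && n == 1 && in == 'x';
}
```

This only exercises the deterministic "already pending" case. The race window
described above cannot be reproduced reliably from a small test like this.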

-- 
You are receiving this mail because:
You are on the CC list for the bug.



* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (16 preceding siblings ...)
  2014-01-12 23:55 ` bugdal at aerifal dot cx
@ 2014-01-13  1:52 ` carlos at redhat dot com
  2014-01-13  4:37 ` bugdal at aerifal dot cx
                   ` (15 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: carlos at redhat dot com @ 2014-01-13  1:52 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=12683

--- Comment #15 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Rich Felker from comment #14)

Thanks for the recap.

> Anyway, this race condition is the whole matter at hand here. The two
> possibilities allowed (again, due to the limitations imposed by the first
> paragraph) are acting on the event while leaving cancellation pending, or
> acting on cancellation while leaving the event pending. But glibc also has a
> race window where it can act on both the event, producing side effects
> (because the kernel already has), and act on cancellation. This makes it
> non-conforming and makes it impossible to use cancellation safely.

So does this imply that the cancellation *must* happen at some point after
errno is known? Thus if a cancellation arrives and we're already in the syscall
there is nothing to do but let the syscall return and let the syscall wrapper
handle the cancellation. That seems reasonable to me.


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (17 preceding siblings ...)
  2014-01-13  1:52 ` carlos at redhat dot com
@ 2014-01-13  4:37 ` bugdal at aerifal dot cx
  2014-01-14 14:51 ` carlos at redhat dot com
                   ` (14 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: bugdal at aerifal dot cx @ 2014-01-13  4:37 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=12683

--- Comment #16 from Rich Felker <bugdal at aerifal dot cx> ---
There are several points at which the cancellation signal could arrive:

1. Before the final "testcancel" before the syscall is made.
2. Between the "testcancel" and the syscall.
3. While the syscall is blocked and no side effects have yet taken place.
4. While the syscall is blocked but with some side effects already having taken
place (e.g. a partial read or write).
5. After the syscall has returned.

You want to act on cancellation in cases 1-3 but not in case 4 or 5. Handling
case 1 is of course trivial, since you're about to do a conditional branch
based on whether the thread has received a cancellation request; nothing needs
to be done in the signal handler (but it also wouldn't hurt to handle it from
the signal handler). Case 2 can be caught by the signal handler determining
that the saved program counter (from the ucontext_t) is in some address range
beginning just before the "testcancel" and ending with the syscall instruction.

The rest of the cases are the "tricky" part but it turns out they too are easy:

Case 3: In this case, except for certain syscalls that ALWAYS fail with EINTR
even for non-interrupting signals, the kernel will reset the program counter to
point at the syscall instruction during signal handling, so that the syscall is
restarted when the signal handler returns. So, from the signal handler's
standpoint, this looks the same as case 2, and thus it's taken care of.

Case 4: In this case, the kernel cannot restart the syscall; when it's
interrupted by a signal, the kernel must cause the syscall to return with
whatever partial result it obtained (e.g. partial read or write). In this case,
the saved program counter points just after the syscall instruction, so the
signal handler won't act on cancellation.

Case 5: OK, I lied. This one is trivial too since the program counter is past
the syscall instruction already.
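
The decision described for cases 2 through 5 can be sketched as a pure range
check on the saved program counter. This is a minimal sketch under stated
assumptions, not glibc or musl code: struct cancel_window and
pc_permits_cancel are made-up names, and a real handler would obtain the saved
PC from the ucontext_t in an architecture-specific way.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one cancellable syscall stub. */
struct cancel_window {
    uintptr_t check_start;   /* just before the "testcancel" check */
    uintptr_t syscall_insn;  /* address of the syscall instruction itself */
};

/*
 * Returns 1 if a cancellation signal that interrupted the thread at saved_pc
 * may be acted upon from the handler, 0 otherwise.
 *
 * Cases 2 and 3: the PC lies between the check and the syscall instruction
 * (case 3 because the kernel rewinds the PC to the syscall instruction when
 * it is going to restart the call).
 * Cases 4 and 5: the PC is already past the syscall instruction.
 */
int pc_permits_cancel(const struct cancel_window *w, uintptr_t saved_pc)
{
    return saved_pc >= w->check_start && saved_pc <= w->syscall_insn;
}

/* Example window with made-up addresses, used only for illustration. */
static const struct cancel_window example_window = { 0x1000, 0x1010 };

int example_pc_permits_cancel(uintptr_t saved_pc)
{
    return pc_permits_cancel(&example_window, saved_pc);
}
```

The upper bound is inclusive here because the text describes the range as
ending with the syscall instruction. A PC just past it means the kernel has
committed to returning a result.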

What about syscalls that fail with EINTR even when the signal handler is
non-interrupting? In this case, the syscall wrapper code can just check the
cancellation flag when the errno result is EINTR, and act on cancellation if
it's set. Note that an exception needs to be made for close(), where EINTR
should be treated as EINPROGRESS and thus not permit cancellation to take
place.
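
A minimal sketch of that post-syscall check, assuming the wrapper knows the
error code, whether a cancellation request is pending, and whether the call was
close(). The function name and parameters are hypothetical, chosen only for
this example.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper for a syscall wrapper: decides whether to act on
 * cancellation after the syscall has returned.
 *
 * err              error code from the syscall (0 if it succeeded)
 * cancel_requested nonzero if a cancellation request is pending
 * is_close         nonzero if the syscall was close()
 */
int cancel_after_eintr(int err, int cancel_requested, int is_close)
{
    if (err != EINTR)
        return 0;   /* the call produced a result; do not cancel here */
    if (is_close)
        return 0;   /* treat EINTR from close() like EINPROGRESS */
    return cancel_requested != 0;
}
```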

BTW, I should justify why the signal handler should be non-interrupting
(SA_RESTART): if it weren't, you would risk causing spurious EINTR in programs
not written to handle it, e.g. if the user incorrectly sends signal 32/33 to the
process or if pthread_cancel were called while cancellation is disabled in the
target thread. The kernel folks have spent a great deal of effort getting rid
of spurious EINTRs (which cause all sorts of ugly bugs) and it would be a shame
to reintroduce them. Also it doesn't buy you anything moving the cancellation
action to the EINTR check after the syscall returns; the same check in the
signal handler that handles case 2 above also handles the case of restartable
syscalls correctly, for free.
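
For reference, installing a non-interrupting handler looks roughly like the
following. This is only a sketch: the handler body is empty, the function names
are made up, and the usage note uses SIGUSR1 as a stand-in, since the signal
number a libc reserves for cancellation is internal to that libc.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Placeholder handler; a real one would inspect the saved PC in ucontext. */
static void cancel_signal_handler(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)info;
    (void)ucontext;
}

/* Installs the handler for signo with SA_RESTART; returns 0 on success. */
int install_restarting_handler(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = cancel_signal_handler;
    /* SA_SIGINFO gives access to the ucontext; SA_RESTART avoids EINTR. */
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Reports whether the handler currently installed for signo restarts syscalls. */
int handler_restarts_syscalls(int signo)
{
    struct sigaction cur;

    if (sigaction(signo, NULL, &cur) != 0)
        return -1;
    return (cur.sa_flags & SA_RESTART) != 0;
}
```

Calling install_restarting_handler(SIGUSR1) and then
handler_restarts_syscalls(SIGUSR1) is enough to try it out in an application.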


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (18 preceding siblings ...)
  2014-01-13  4:37 ` bugdal at aerifal dot cx
@ 2014-01-14 14:51 ` carlos at redhat dot com
  2014-02-16 19:42 ` jackie.rosen at hushmail dot com
                   ` (13 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: carlos at redhat dot com @ 2014-01-14 14:51 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=12683

--- Comment #17 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Rich Felker from comment #16)
> There are several points at which the cancellation signal could arrive:
> 
> 1. Before the final "testcancel" before the syscall is made.
> 2. Between the "testcancel" and the syscall.
> 3. While the syscall is blocked and no side effects have yet taken place.
> 4. While the syscall is blocked but with some side effects already having
> taken place (e.g. a partial read or write).
> 5. After the syscall has returned.
> 
> You want to act on cancellation in cases 1-3 but not in case 4 or 5.
> Handling case 1 is of course trivial, since you're about to do a conditional
> branch based on whether the thread has received a cancellation request;
> nothing needs to be done in the signal handler (but it also wouldn't hurt to
> handle it from the signal handler). Case 2 can be caught by the signal
> handler determining that the saved program counter (from the ucontext_t) is
> in some address range beginning just before the "testcancel" and ending with
> the syscall instruction.
> 
> The rest of the cases are the "tricky" part but it turns out they too are
> easy:
> 
> Case 3: In this case, except for certain syscalls that ALWAYS fail with
> EINTR even for non-interrupting signals, the kernel will reset the program
> counter to point at the syscall instruction during signal handling, so that
> the syscall is restarted when the signal handler returns. So, from the
> signal handler's standpoint, this looks the same as case 2, and thus it's
> taken care of.
> 
> Case 4: In this case, the kernel cannot restart the syscall; when it's
> interrupted by a signal, the kernel must cause the syscall to return with
> whatever partial result it obtained (e.g. partial read or write). In this
> case, the saved program counter points just after the syscall instruction,
> so the signal handler won't act on cancellation.
> 
> Case 5: OK, I lied. This one is trivial too since the program counter is
> past the syscall instruction already.

Excellent. I like your idea then. It seems like a list of PCs using either
markers or DWARF2 is the way to go here.

> What about syscalls that fail with EINTR even when the signal handler is
> non-interrupting? In this case, the syscall wrapper code can just check the
> cancellation flag when the errno result is EINTR, and act on cancellation if
> it's set. Note that an exception needs to be made for close(), where EINTR
> should be treated as EINPROGRESS and thus not permit cancellation to take
> place.

We'll need a big disclaimer about close and a detailed comment. I know some of
the details there, specifically that although EINTR has been returned the close
will complete.

> BTW, I should justify why the signal handler should be non-interrupting
> (SA_RESTART): if it weren't, you would risk causing spurious EINTR in
> programs not written to handle it, e.g. if the user incorrectly sends signal
> 32/33 to the process or if pthread_cancel were called while cancellation is
> disabled in the target thread. The kernel folks have spent a great deal of
> effort getting rid of spurious EINTRs (which cause all sorts of ugly bugs)
> and it would be a shame to reintroduce them. Also it doesn't buy you
> anything moving the cancellation action to the EINTR check after the syscall
> returns; the same check in the signal handler that handles case 2 above also
> handles the case of restartable syscalls correctly, for free.

That makes sense.


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (19 preceding siblings ...)
  2014-01-14 14:51 ` carlos at redhat dot com
@ 2014-02-16 19:42 ` jackie.rosen at hushmail dot com
  2014-05-28 19:47 ` schwab at sourceware dot org
                   ` (12 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: jackie.rosen at hushmail dot com @ 2014-02-16 19:42 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=12683

Jackie Rosen <jackie.rosen at hushmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jackie.rosen at hushmail dot com



* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (21 preceding siblings ...)
  2014-05-28 19:47 ` schwab at sourceware dot org
@ 2014-05-28 19:47 ` schwab at sourceware dot org
  2014-06-27 13:35 ` fweimer at redhat dot com
                   ` (10 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: schwab at sourceware dot org @ 2014-05-28 19:47 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=12683

Andreas Schwab <schwab at sourceware dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|jackie.rosen at hushmail dot com   |


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (22 preceding siblings ...)
  2014-05-28 19:47 ` schwab at sourceware dot org
@ 2014-06-27 13:35 ` fweimer at redhat dot com
  2014-07-19 18:44 ` sstewartgallus00 at mylangara dot bc.ca
                   ` (9 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: fweimer at redhat dot com @ 2014-06-27 13:35 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=12683

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fweimer at redhat dot com
              Flags|                            |security-


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (23 preceding siblings ...)
  2014-06-27 13:35 ` fweimer at redhat dot com
@ 2014-07-19 18:44 ` sstewartgallus00 at mylangara dot bc.ca
  2014-07-19 18:54 ` bugdal at aerifal dot cx
                   ` (8 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: sstewartgallus00 at mylangara dot bc.ca @ 2014-07-19 18:44 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=12683

Steven Stewart-Gallus <sstewartgallus00 at mylangara dot bc.ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sstewartgallus00@mylangara.
                   |                            |bc.ca

--- Comment #19 from Steven Stewart-Gallus <sstewartgallus00 at mylangara dot bc.ca> ---
I am confused, but if the proposed fix for this bug is implemented, then
that means that my bug at
https://sourceware.org/bugzilla/show_bug.cgi?id=17168, where I can't
cancel FUTEX_WAITs, would be automatically fixed, right? I would need
no extra effort to make my blocking system call cancellable? So
this proposed fix would have a side benefit of giving me
cancellability for free? Or would this be a bug or at least a breaking
change?


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (24 preceding siblings ...)
  2014-07-19 18:44 ` sstewartgallus00 at mylangara dot bc.ca
@ 2014-07-19 18:54 ` bugdal at aerifal dot cx
  2014-07-20 18:15 ` sstewartgallus00 at mylangara dot bc.ca
                   ` (7 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: bugdal at aerifal dot cx @ 2014-07-19 18:54 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=12683

--- Comment #20 from Rich Felker <bugdal at aerifal dot cx> ---
Steven, I don't think this bug is related to your issue. If bug 9712
(of which your 17168 seems to be a duplicate) is resolved by adding
futex and glibc makes futex a cancellation point, THEN the resolution
of this bug (12683) would make it safe to use cancellation with the
futex function in a way that's race-free.

I think this is a strong argument for resolving 9712 by adding futex:
unless it's part of libc, there's no safe way to make it cancellable,
because wrapping syscall() with async cancellation will introduce an
application-level bug comparable to this bug.
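
To make the hazard concrete, this is roughly what such an application-level
wrapper would look like. It is shown only to mark where the race sits and is
not a recommended pattern. The function names are hypothetical, and the code
assumes Linux, where SYS_futex and FUTEX_WAIT are available.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* NOT safe: illustrates the racy "async cancellation around syscall()" idea. */
long racy_cancellable_futex_wait(uint32_t *addr, uint32_t expected)
{
    int old_type, ignored;
    long ret;

    pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, &old_type);
    ret = syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
    /*
     * Race window: the kernel may already have completed the wait, yet a
     * cancellation arriving here is still acted upon, so the caller never
     * sees the result. A signal handler running during the wait would also
     * run with asynchronous cancellation enabled.
     */
    pthread_setcanceltype(old_type, &ignored);
    return ret;
}

/* Non-blocking check: a value mismatch makes FUTEX_WAIT fail immediately. */
int futex_wait_mismatch_errno(void)
{
    uint32_t word = 1;

    errno = 0;
    if (racy_cancellable_futex_wait(&word, 0) != -1)
        return 0;
    return errno;
}
```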


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (25 preceding siblings ...)
  2014-07-19 18:54 ` bugdal at aerifal dot cx
@ 2014-07-20 18:15 ` sstewartgallus00 at mylangara dot bc.ca
  2014-07-20 18:41 ` bugdal at aerifal dot cx
                   ` (6 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: sstewartgallus00 at mylangara dot bc.ca @ 2014-07-20 18:15 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=12683

--- Comment #21 from Steven Stewart-Gallus <sstewartgallus00 at mylangara dot bc.ca> ---
Okay Rich Felker, I was confused because the implementation in musl
looks like it would turn callers of syscall() that block into cancellation
points. So, if glibc does something similar to your solution, they
would have to explicitly block cancellation in the syscall function to
preserve compatibility (and in the future glibc might possibly
consider making syscall automatically cancellable for
blocking system calls, but that'd be a separate issue)?


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (26 preceding siblings ...)
  2014-07-20 18:15 ` sstewartgallus00 at mylangara dot bc.ca
@ 2014-07-20 18:41 ` bugdal at aerifal dot cx
  2014-08-19 14:08 ` azanella at linux dot vnet.ibm.com
                   ` (5 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: bugdal at aerifal dot cx @ 2014-07-20 18:41 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=12683

--- Comment #22 from Rich Felker <bugdal at aerifal dot cx> ---
Steven, I'm not sure I understand what you're saying. This issue report is not
about changing which syscalls/functions are cancellable. For the standard
functions that is specified by POSIX, and for extensions, the natural choices
were already made and changing them would be problematic to their users. The
topic at hand is just fixing the mechanism by which cancellation is performed
so that there are not race conditions.

If your question is about the syscall() function that applications can use to
make syscalls directly, there is no open issue for making it cancellable, and
as above, changing this would be problematic. One could envision a request for
a separate version of the syscall() function which is cancellable, but as far
as I know nobody has requested this and I think it's a bad idea to be adding
features that encourage applications to make syscalls directly (since this is
usually non-portable between archs due to subtle differences in the calling
conventions and other issues like whether the libc-level structs match the
syscall-level ones for a given arch).


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (27 preceding siblings ...)
  2014-07-20 18:41 ` bugdal at aerifal dot cx
@ 2014-08-19 14:08 ` azanella at linux dot vnet.ibm.com
  2014-08-28 15:02 ` carlos at redhat dot com
                   ` (4 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: azanella at linux dot vnet.ibm.com @ 2014-08-19 14:08 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=12683

Adhemerval Zanella Netto <azanella at linux dot vnet.ibm.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |azanella at linux dot vnet.ibm.com


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (28 preceding siblings ...)
  2014-08-19 14:08 ` azanella at linux dot vnet.ibm.com
@ 2014-08-28 15:02 ` carlos at redhat dot com
  2015-01-15 13:20 ` dan at censornet dot com
                   ` (3 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: carlos at redhat dot com @ 2014-08-28 15:02 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=12683

--- Comment #24 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Adhemerval Zanella Netto from comment #23)
> I am currently working on a fix based on the musl implementation, which from
> comments #16 and #17 seems a good approach.  My initial idea is to use PC
> markers instead of DWARF2, since I see it as a cleaner approach. However,
> this will require a lot of cleanup.
> 
> I plan to push implementations for powerpc64 and x86_64 and ask arch
> maintainers for more arch-specific work. I also plan to write a wiki page
> describing the work done and summarizing the discussion on this bug report.

The biggest problem with DWARF2 is the parser, and making it accessible from
the signal handler. I strongly suggest using PC, and a list of exception
regions generated from markers in the assembly (similar to kernel exception
regions).
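
A table-driven lookup of that kind might look like the sketch below. The
addresses are made up, and the type and function names are hypothetical. In a
real libc the entries would come from markers placed around each cancellable
syscall stub in the assembly. A plain linear scan keeps the lookup free of
locks and allocation, which matters inside a signal handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One region per cancellable syscall stub, as a half-open range [start, end). */
struct cancel_region {
    uintptr_t start;  /* marker before the pre-syscall cancellation check */
    uintptr_t end;    /* marker just past the syscall instruction */
};

/* Made-up addresses standing in for marker-generated entries. */
static const struct cancel_region demo_regions[] = {
    { 0x1000, 0x1014 },
    { 0x2000, 0x200c },
    { 0x3000, 0x3020 },
};

/* Returns 1 if pc falls inside any region of the table, 0 otherwise. */
int pc_in_cancel_region(const struct cancel_region *tab, size_t n, uintptr_t pc)
{
    for (size_t i = 0; i < n; i++)
        if (pc >= tab[i].start && pc < tab[i].end)
            return 1;
    return 0;
}

/* Convenience wrapper over the demo table. */
int demo_pc_cancellable(uintptr_t pc)
{
    return pc_in_cancel_region(demo_regions,
                               sizeof demo_regions / sizeof demo_regions[0],
                               pc);
}
```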


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (29 preceding siblings ...)
  2014-08-28 15:02 ` carlos at redhat dot com
@ 2015-01-15 13:20 ` dan at censornet dot com
  2015-01-15 13:31 ` bugdal at aerifal dot cx
                   ` (2 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: dan at censornet dot com @ 2015-01-15 13:20 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=12683

Dan Searle <dan at censornet dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dan at censornet dot com

--- Comment #25 from Dan Searle <dan at censornet dot com> ---
I think we have stumbled upon this bug, or something related to it. Can someone
please confirm I'm on the right track here?

We have a multithreaded server application which calls recv() and poll() from
async cancellable threads. Each thread handles a single connection, with a
master thread accepting new connections and adding them to a job queue.

More and more often now we are seeing the server lock up, and on inspection two
or more threads seem deadlocked in some race condition inside libc recv()
and/or poll().

One example here shows two back traces from gdb from the two threads that
seemed deadlocked chewing 100% CPU:

Thread 1 bt:
#0  __pthread_disable_asynccancel () at
../nptl/sysdeps/unix/sysv/linux/x86_64/cancellation.S:98
#1  0x00007f895ba987fd in __libc_recv (fd=0, fd@entry=33,
buf=buf@entry=0x7cada02b, n=n@entry=1024, flags=1537837035,
    flags@entry=16384) at ../sysdeps/unix/sysv/linux/x86_64/recv.c:35
#2  0x000000000040ec54 in recv (__flags=16384, __n=1024, __buf=0x7cada02b,
__fd=33)
    at /usr/include/x86_64-linux-gnu/bits/socket2.h:44
[snip]

Thread 2 bt:
#0  0x00007f895ba987eb in __libc_recv (fd=fd@entry=31,
buf=buf@entry=0x7ca5e02b, n=n@entry=1024, flags=-1, flags@entry=16384)
    at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33
#1  0x000000000040ec54 in recv (__flags=16384, __n=1024, __buf=0x7ca5e02b,
__fd=31)
    at /usr/include/x86_64-linux-gnu/bits/socket2.h:44
[snip]

There can be more than two threads involved, and I'm unsure whether it can
happen with just one thread locked up, but it's always inside recv() or poll(),
and sometimes in __pthread_disable_asynccancel() within either of those.

Could I work around this problem by changing the threads to synchronous
(deferred) cancellation, or by working around the need to cancel the threads at
all?


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (30 preceding siblings ...)
  2015-01-15 13:20 ` dan at censornet dot com
@ 2015-01-15 13:31 ` bugdal at aerifal dot cx
  2015-01-15 14:01 ` dan at censornet dot com
  2020-06-08 14:04 ` fweimer at redhat dot com
  33 siblings, 0 replies; 35+ messages in thread
From: bugdal at aerifal dot cx @ 2015-01-15 13:31 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=12683

--- Comment #26 from Rich Felker <bugdal at aerifal dot cx> ---
I don't think this is the bug you're seeing. If it were, use of async
cancellation would only make it worse. But the symptoms you'd see from this bug
would be things like side effects of a function having happened despite it
getting cancelled.

If you're seeing 100% cpu load from threads in recv, the most likely
explanation is that the socket you're reading from is in EOF status (remote
sending end closed), so that recv immediately returns zero. Repeatedly
attempting to read in this situation would be an application bug, not anything
related to glibc.
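
A receive loop that avoids this spin treats a zero return from recv() as end of
stream and stops. The sketch below is illustrative only. drain_until_eof and
eof_demo are made-up names, and the demo uses a local socketpair() rather than
a real network connection.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Reads until the peer closes; returns total bytes read, or -1 on error. */
long drain_until_eof(int fd)
{
    char buf[1024];
    long total = 0;

    for (;;) {
        ssize_t n = recv(fd, buf, sizeof buf, 0);
        if (n > 0) {
            total += n;
            continue;
        }
        if (n == 0)
            return total;   /* EOF: stop instead of calling recv() again */
        if (errno == EINTR)
            continue;       /* interrupted before any data; retry */
        return -1;
    }
}

/* Sends five bytes over a socketpair, closes the sender, drains the rest. */
long eof_demo(void)
{
    int sv[2];
    long got;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (send(sv[1], "hello", 5, 0) != 5)
        return -1;
    close(sv[1]);           /* the reader will see EOF after the data */
    got = drain_until_eof(sv[0]);
    close(sv[0]);
    return got;
}
```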


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (31 preceding siblings ...)
  2015-01-15 13:31 ` bugdal at aerifal dot cx
@ 2015-01-15 14:01 ` dan at censornet dot com
  2020-06-08 14:04 ` fweimer at redhat dot com
  33 siblings, 0 replies; 35+ messages in thread
From: dan at censornet dot com @ 2015-01-15 14:01 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=12683

--- Comment #27 from Dan Searle <dan at censornet dot com> ---
(In reply to Rich Felker from comment #26)
> I don't think this is the bug you're seeing. If it were, use of async
> cancellation would only make it worse. But the symptoms you'd see from this
> bug would be things like side effects of a function having happened despite
> it getting cancelled.
> 
> If you're seeing 100% cpu load from threads in recv, the most likely
> explanation is that the socket you're reading from is in EOF status (remote
> sending end closed), so that recv immediately returns zero. Repeatedly
> attempting to read in this situation would be an application bug, not
> anything related to glibc.

Thanks Rich, your suggestion made me think to look through the code paths again
and you are quite right, there was an infinite loop in there, not obvious but I
found it.

In light of the current problems with cancellable threads and syscalls, I'm
going to disable cancellation during the main job execution (where all the
recv() and poll() calls are), just in case this bug is causing problems I'm
unaware of.
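
A minimal sketch of that workaround, assuming the job can be expressed as a
callback. The helper names are hypothetical. Restoring the previous state and
then calling pthread_testcancel() is one possible choice, so that a request
which arrived during the job is acted upon at a known point afterwards.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Runs job(arg) with cancellation disabled, then restores the old state. */
int run_with_cancel_disabled(int (*job)(void *), void *arg)
{
    int old_state, ignored, result;

    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &old_state);
    result = job(arg);      /* recv()/poll() in here cannot be cancelled */
    pthread_setcancelstate(old_state, &ignored);
    pthread_testcancel();   /* act on any request that arrived meanwhile */
    return result;
}

/* Example job: reports whether cancellation is disabled while it runs. */
static int report_cancel_state(void *arg)
{
    int state, ignored;

    (void)arg;
    /* Query the current state by setting it and putting it straight back. */
    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &state);
    pthread_setcancelstate(state, &ignored);
    return state == PTHREAD_CANCEL_DISABLE;
}

int cancel_disabled_inside_job(void)
{
    return run_with_cancel_disabled(report_cancel_state, NULL);
}
```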

Many thanks, you saved me a lot of hair pulling :)


* [Bug nptl/12683] Race conditions in pthread cancellation
  2011-04-18 22:28 [Bug nptl/12683] New: Race conditions in pthread cancellation bugdal at aerifal dot cx
                   ` (32 preceding siblings ...)
  2015-01-15 14:01 ` dan at censornet dot com
@ 2020-06-08 14:04 ` fweimer at redhat dot com
  33 siblings, 0 replies; 35+ messages in thread
From: fweimer at redhat dot com @ 2020-06-08 14:04 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=12683

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://bugzilla.redhat.com
                   |                            |/show_bug.cgi?id=976368


