From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail.ispras.ru (mail.ispras.ru [83.149.199.84]) by sourceware.org (Postfix) with ESMTPS id 623263856DED for ; Tue, 3 May 2022 17:36:31 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 623263856DED Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=ispras.ru Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=ispras.ru Received: from mail.ispras.ru (unknown [83.149.199.84]) by mail.ispras.ru (Postfix) with ESMTPSA id C2E23407624D; Tue, 3 May 2022 17:36:28 +0000 (UTC) MIME-Version: 1.0 Date: Tue, 03 May 2022 20:36:28 +0300 From: Alexey Izbyshev To: Adhemerval Zanella Cc: libc-alpha@sourceware.org, Carlos O'Donell , Florian Weimer Subject: Re: [PATCH v3] linux: Add fallback for clone failure on posix_spawn (BZ#29115) In-Reply-To: <20220503170032.2531471-1-adhemerval.zanella@linaro.org> References: <20220503170032.2531471-1-adhemerval.zanella@linaro.org> User-Agent: Roundcube Webmail/1.4.4 Message-ID: <66c881094d4f4d4ff1aa47b8d4f94230@ispras.ru> X-Sender: izbyshev@ispras.ru Content-Type: text/plain; charset=US-ASCII; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-8.3 required=5.0 tests=BAYES_00, GIT_PATCH_0, KAM_DMARC_STATUS, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 03 May 2022 17:36:33 -0000 On 2022-05-03 20:00, Adhemerval Zanella wrote: > Even though it might characterizeis as a kernel bug (incomplete kernel > support for CLONE_VM used along with CLONE_VFORK), time namespace > support > is already presented in a some kernel releases (since 2020). > > Although I agree with Carlos that it results in unexpected surprise > behaviour, I also see that providing a reasonable fallback that does > not > add possible issues like race conditions (as some syscall fallbacks in > the > past) provides a easier transition and make the interfaca to work on > multiple scenarios. > > The fix itself does adds some corner cases that need to be handled, but > I still think it is maintanable and a general improvement. > > -- > > Linux clone with CLONE_VM may fail for some namespace restriction (for > instance if kernel does not allow processes in different time > namespaces > to share the sameaddress space). > > In this case clone fails with EINVAL and posix_spawn can not spawn a > new > process. However the same process can be spawned with fork and exec. > > The patch fixes by retrying the clone syscall with just CLONE_VFORK > if clone fails with a non transient failure (ENOMEM and EAGAIN still > returns failure to caller). > > Error on preparation phase or execve is returned by a pipe now that > there is no shared memory that ca be used. It requires some extra care > for some file operations: > > * If the file operation would clobber the pipe file descriptor, the > helper process dup the pipe onto an unoccupied file descriptor. > > * dup2 file action that targets the pipe file descriptor returns > EBADF. > > * If closefrom argument overlaps the pipe file descriptor, it is > splited in two calls: [lowdp, pipe - 1] and [pipe + 1, ~Ou]. > > Failure on prepare phase in helper process does not trigger the fork > and exec fallback. > > Checked on x86_64-linux-gnu. > --- > v3: Fixed closerange fallback,. > v2: Fixed closerange file actions, handle EPIPE, return all errors, > handle if pipe fd is used in file actions. > --- > include/unistd.h | 4 +- > io/closefrom.c | 2 +- > sysdeps/unix/sysv/linux/closefrom_fallback.c | 4 +- > sysdeps/unix/sysv/linux/spawni.c | 235 +++++++++++++++---- > 4 files changed, 196 insertions(+), 49 deletions(-) > > diff --git a/include/unistd.h b/include/unistd.h > index af795a37c8..2643991a09 100644 > --- a/include/unistd.h > +++ b/include/unistd.h > @@ -167,7 +167,9 @@ static inline _Bool __closefrom_fallback (int > __lowfd, _Bool dirfd_fallback) > return false; > } > # else > -extern _Bool __closefrom_fallback (int __lowfd, _Bool) > attribute_hidden; > +extern _Bool __closefrom_fallback (unsigned int __from, unsigned int > __to, > + _Bool) > + attribute_hidden; > # endif > extern ssize_t __read (int __fd, void *__buf, size_t __nbytes); > libc_hidden_proto (__read) > diff --git a/io/closefrom.c b/io/closefrom.c > index 48cac2e3d1..a07de06e9b 100644 > --- a/io/closefrom.c > +++ b/io/closefrom.c > @@ -30,7 +30,7 @@ __closefrom (int lowfd) > if (r == 0) > return ; > > - if (!__closefrom_fallback (l, true)) > + if (!__closefrom_fallback (l, ~0U, true)) > __fortify_fail ("closefrom failed to close a file descriptor"); > } > weak_alias (__closefrom, closefrom) > diff --git a/sysdeps/unix/sysv/linux/closefrom_fallback.c > b/sysdeps/unix/sysv/linux/closefrom_fallback.c > index a9dd0c46b2..ad301e1910 100644 > --- a/sysdeps/unix/sysv/linux/closefrom_fallback.c > +++ b/sysdeps/unix/sysv/linux/closefrom_fallback.c > @@ -28,7 +28,7 @@ > /proc/self/fd open will trigger a fallback that tries to close a > file > descriptor before proceed. */ > _Bool > -__closefrom_fallback (int from, _Bool dirfd_fallback) > +__closefrom_fallback (unsigned int from, unsigned int to, _Bool > dirfd_fallback) > { > int dirfd = __open_nocancel (FD_TO_FILENAME_PREFIX, O_RDONLY | > O_DIRECTORY, > 0); > @@ -84,6 +84,8 @@ __closefrom_fallback (int from, _Bool dirfd_fallback) > > if (fd == dirfd || fd < from) > continue; > + if (fd > to) > + break; > __closefrom_fallback() currently ignores "to" if it can't open /proc/self/fd and dirfd_fallback is true. This path is not used by posix_spawn(), but it's a potential future bug. > /* We ignore close errors because EBADF, EINTR, and EIO > means the > descriptor has been released. */ > diff --git a/sysdeps/unix/sysv/linux/spawni.c > b/sysdeps/unix/sysv/linux/spawni.c > index d6f5ca89cd..0481791013 100644 > --- a/sysdeps/unix/sysv/linux/spawni.c > +++ b/sysdeps/unix/sysv/linux/spawni.c > @@ -44,7 +44,11 @@ > third issue is done by a stack allocation in parent, and by using a > field in struct spawn_args where the child can write an error > code. CLONE_VFORK ensures that the parent does not run until the > - child has either exec'ed successfully or exited. */ > + child has either exec'ed successfully or exited. > + > + If the clone with CLONE_VM and CLONE_VFORK fails (due any kernel > limitation > + such as time namespace), only CLONE_VFORK is used instead and the > + preparation and execve failures are communicated with a pipe. */ > > > /* The Unix standard contains a long explanation of the way to signal > @@ -67,6 +71,7 @@ struct posix_spawn_args > char *const *envp; > int xflags; > int err; > + int pipe[2]; > }; > > /* Older version requires that shell script without shebang definition > @@ -94,15 +99,63 @@ maybe_script_execute (struct posix_spawn_args > *args) > } > } > > +/* If the file operation would clobber the pipe fd used to communicate > with > + parent, dup the pipe onto an unoccupied file descriptor. */ > +static inline bool > +spawni_fa_handle_pipe (const struct __spawn_action *fa, int p[]) > +{ > + int fd; > + > + switch (fa->tag) > + { > + case spawn_do_close: > + fd = fa->action.close_action.fd; > + break; > + case spawn_do_open: > + fd = fa->action.open_action.fd; > + break; > + case spawn_do_dup2: > + fd = fa->action.dup2_action.newfd; > + break; > + case spawn_do_fchdir: > + fd = fa->action.fchdir_action.fd; > + default: > + return true; > + } > + > + if (fd == p[1]) > + { > + int r = __fcntl (p[1], F_DUPFD_CLOEXEC); > + if (r < 0) > + return false; > + __close_nocancel (p[1]); > + p[1] = r; > + } > + > + return true; > +} > + > +static inline bool > +spawni_fa_closerange (int from, int to) > +{ > +#if 0 > + int r = INLINE_SYSCALL_CALL (close_range, from, to, 0); > + return r == 0 || __closefrom_fallback (from, to, false); > +#else > + return __closefrom_fallback (from, to, false); > +#endif > +} > + > /* Function used in the clone call to setup the signals mask, > posix_spawn > attributes, and file actions. It run on its own stack (provided by > the > posix_spawn call). */ > -static int > -__spawni_child (void *arguments) > +static _Noreturn int > +spawni_child (void *arguments) > { > struct posix_spawn_args *args = arguments; > const posix_spawnattr_t *restrict attr = args->attr; > const posix_spawn_file_actions_t *file_actions = args->fa; > + bool use_pipe = args->pipe[0] != -1; > > /* The child must ensure that no signal handler are enabled because > it shared > memory with parent, so the signal disposition must be either > SIG_DFL or > @@ -113,6 +166,9 @@ __spawni_child (void *arguments) > struct sigaction sa; > memset (&sa, '\0', sizeof (sa)); > > + if (use_pipe) > + __close (args->pipe[0]); > + > sigset_t hset; > __sigprocmask (SIG_BLOCK, 0, &hset); > for (int sig = 1; sig < _NSIG; ++sig) > @@ -181,6 +237,9 @@ __spawni_child (void *arguments) > { > struct __spawn_action *action = &file_actions->__actions[cnt]; > > + if (use_pipe && !spawni_fa_handle_pipe (action, args->pipe)) > + goto fail; > + > switch (action->tag) > { > case spawn_do_close: > @@ -233,6 +292,11 @@ __spawni_child (void *arguments) > break; > > case spawn_do_dup2: > + if (use_pipe && action->action.dup2_action.fd == args->pipe[1]) > + { > + errno = EBADF; > + goto fail; > + } > /* Austin Group issue #411 requires adddup2 action with source > and destination being equal to remove close-on-exec flag. */ > if (action->action.dup2_action.fd > @@ -264,8 +328,18 @@ __spawni_child (void *arguments) > case spawn_do_closefrom: > { > int lowfd = action->action.closefrom_action.from; > - int r = INLINE_SYSCALL_CALL (close_range, lowfd, ~0U, 0); > - if (r != 0 && !__closefrom_fallback (lowfd, false)) > + /* Skip the pipe descriptor if it is used. No need to handle > + it since it is created with O_CLOEXEC. */ > + if (use_pipe && args->pipe[1] >= lowfd) > + { > + if (args->pipe[1] == lowfd > + && !spawni_fa_closerange (lowfd + 1, ~0U)) > + goto fail; > + else if (!spawni_fa_closerange (lowfd, args->pipe[1] - 1) > + || !spawni_fa_closerange (args->pipe[1] + 1, ~0U)) The "else" branch will also run if "args->pipe[1] == lowfd" and spawni_fa_closerange (lowfd + 1, ~0U) succeeds. > + goto fail; > + } > + else if (!spawni_fa_closerange (lowfd, ~0U)) > goto fail; > } break; > > @@ -300,10 +374,112 @@ fail: > (EINTERNALBUG) describing that, use ECHILD. Another option would > be to set args->err to some negative sentinel and have the parent > abort(), but that seems needlessly harsh. */ > - args->err = errno ? : ECHILD; > + int ret = errno ? : ECHILD; > + if (use_pipe) > + { > + while (__write_nocancel (args->pipe[1], &ret, sizeof (ret)) < 0) > + if (errno == EPIPE || errno == EBADF) > + break; > + } > + else > + args->err = ret; > + > _exit (SPAWN_ERROR); > } > > +static pid_t > +clone_call (struct posix_spawn_args *args, int flags, void *stack, > + size_t stack_size) > +{ > + struct clone_args clone_args = > + { > + .flags = flags, > + .exit_signal = SIGCHLD, > + .stack = (uintptr_t) stack, > + .stack_size = stack_size, > + }; > + return __clone_internal (&clone_args, spawni_child, args); > +} > + > +/* Spawn a new process using clone with CLONE_VM | CLONE_VFORK (to > optimize > + memory and overcommit) and return TRUE if the helper was created or > if the > + failure was not due resource exhaustion. */ > +static bool > +spawni_clone (struct posix_spawn_args *args, pid_t *new_pid, int *ec, > + void *stack, size_t stack_size) > +{ > + /* The clone flags used will create a new child that will run in the > same > + memory space (CLONE_VM) and the execution of calling thread will > be > + suspend until the child calls execve or _exit. > + > + Also since the calling thread execution will be suspend, there is > not > + need for CLONE_SETTLS. Although parent and child share the same > TLS > + namespace, there will be no concurrent access for TLS variables > (errno > + for instance). */ > + *new_pid = clone_call (args, CLONE_VM | CLONE_VFORK, stack, > stack_size); > + > + /* It needs to collect the case where the auxiliary process was > created > + but failed to execute the file (due either any preparation step > or > + for execve itself). */ > + if (*new_pid > 0) > + { > + /* Also, it handles the unlikely case where the auxiliary > process was > + terminated before calling execve as if it was successfully. The > + args.err is set to 0 as default and changed to a positive value > + only in case of failure, so in case of premature termination > + due a signal args.err will remain zeroed and it will be up to > + caller to actually collect it. */ > + *ec = args->err; > + if (*ec > 0) > + /* There still an unlikely case where the child is cancelled after > + setting args.err, due to a positive error value. Also there is > + possible pid reuse race (where the kernel allocated the same pid > + to an unrelated process). Unfortunately due synchronization > + issues where the kernel might not have the process collected > + the waitpid below can not use WNOHANG. */ > + __waitpid (*new_pid, NULL, 0); > + } > + else > + *ec = errno; > + > + /* There is no much point in retrying with fork and exec if kernel > returns a > + failure due resource exhaustion. */ > + return *new_pid > 0 || (errno == ENOMEM || errno == EAGAIN); > +} > + > +/* Fallback spawn case which does not use CLONE_VM. Any preparation > step or > + execve failure is passed with a pipe, which requires additional > care by > + the helper stating process since it additional file descriptors > handle. */ > +static void > +spawni_fork (struct posix_spawn_args *args, pid_t *new_pid, int *ec, > + char *stack, size_t stack_size) > +{ > + if (__pipe2 (args->pipe, O_CLOEXEC) != 0) > + { > + *ec = errno; > + return; > + } > + > + /* Do not trigger atfork handler nor any internal state reset since > the > + helper process will call execve. */ > + *new_pid = clone_call (args, CLONE_VFORK, stack, stack_size); > + > + __close (args->pipe[1]); > + > + if (*new_pid > 0) > + { > + if (__read (args->pipe[0], ec, sizeof *ec) != sizeof *ec) > + /* A successful execve will close the helper process pipe end. */ > + *ec = 0; > + else > + __waitpid (*new_pid, NULL, 0); > + } > + else > + *ec = errno; > + > + __close (args->pipe[0]); > +} > + > /* Spawn a new process executing PATH with the attributes describes in > *ATTRP. > Before running the process perform the actions described in > FILE-ACTIONS. */ > static int > @@ -367,49 +543,16 @@ __spawnix (pid_t * pid, const char *file, > args.argc = argc; > args.envp = envp; > args.xflags = xflags; > + args.pipe[0] = args.pipe[1] = -1; > > __libc_signal_block_all (&args.oldmask); > > - /* The clone flags used will create a new child that will run in the > same > - memory space (CLONE_VM) and the execution of calling thread will > be > - suspend until the child calls execve or _exit. > - > - Also since the calling thread execution will be suspend, there is > not > - need for CLONE_SETTLS. Although parent and child share the same > TLS > - namespace, there will be no concurrent access for TLS variables > (errno > - for instance). */ > - struct clone_args clone_args = > - { > - .flags = CLONE_VM | CLONE_VFORK, > - .exit_signal = SIGCHLD, > - .stack = (uintptr_t) stack, > - .stack_size = stack_size, > - }; > - new_pid = __clone_internal (&clone_args, __spawni_child, &args); > - > - /* It needs to collect the case where the auxiliary process was > created > - but failed to execute the file (due either any preparation step > or > - for execve itself). */ > - if (new_pid > 0) > - { > - /* Also, it handles the unlikely case where the auxiliary > process was > - terminated before calling execve as if it was successfully. The > - args.err is set to 0 as default and changed to a positive value > - only in case of failure, so in case of premature termination > - due a signal args.err will remain zeroed and it will be up to > - caller to actually collect it. */ > - ec = args.err; > - if (ec > 0) > - /* There still an unlikely case where the child is cancelled after > - setting args.err, due to a positive error value. Also there is > - possible pid reuse race (where the kernel allocated the same pid > - to an unrelated process). Unfortunately due synchronization > - issues where the kernel might not have the process collected > - the waitpid below can not use WNOHANG. */ > - __waitpid (new_pid, NULL, 0); > - } > - else > - ec = errno; > + /* clone with CLONE_VM | CLONE_VFORK may fail for some namespace > restriction > + (for instance Linux does not allow processes in different time > namespaces > + to share address space) and in this case clone fails with EINVAL. > Retry > + with fork and exec. */ > + if (!spawni_clone (&args, &new_pid, &ec, stack, stack_size)) > + spawni_fork (&args, &new_pid, &ec, stack, stack_size); > > __munmap (stack, stack_size);