From: Alexey Izbyshev <izbyshev@ispras.ru>
To: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Cc: libc-alpha@sourceware.org, Carlos O'Donell <carlos@redhat.com>,
Florian Weimer <fweimer@redhat.com>
Subject: Re: [PATCH v3] linux: Add fallback for clone failure on posix_spawn (BZ#29115)
Date: Tue, 03 May 2022 20:36:28 +0300 [thread overview]
Message-ID: <66c881094d4f4d4ff1aa47b8d4f94230@ispras.ru> (raw)
In-Reply-To: <20220503170032.2531471-1-adhemerval.zanella@linaro.org>
On 2022-05-03 20:00, Adhemerval Zanella wrote:
> Even though it might characterizeis as a kernel bug (incomplete kernel
> support for CLONE_VM used along with CLONE_VFORK), time namespace
> support
> is already presented in a some kernel releases (since 2020).
>
> Although I agree with Carlos that it results in unexpected surprise
> behaviour, I also see that providing a reasonable fallback that does
> not
> add possible issues like race conditions (as some syscall fallbacks in
> the
> past) provides a easier transition and make the interfaca to work on
> multiple scenarios.
>
> The fix itself does adds some corner cases that need to be handled, but
> I still think it is maintanable and a general improvement.
>
> --
>
> Linux clone with CLONE_VM may fail for some namespace restriction (for
> instance if kernel does not allow processes in different time
> namespaces
> to share the sameaddress space).
>
> In this case clone fails with EINVAL and posix_spawn can not spawn a
> new
> process. However the same process can be spawned with fork and exec.
>
> The patch fixes by retrying the clone syscall with just CLONE_VFORK
> if clone fails with a non transient failure (ENOMEM and EAGAIN still
> returns failure to caller).
>
> Error on preparation phase or execve is returned by a pipe now that
> there is no shared memory that ca be used. It requires some extra care
> for some file operations:
>
> * If the file operation would clobber the pipe file descriptor, the
> helper process dup the pipe onto an unoccupied file descriptor.
>
> * dup2 file action that targets the pipe file descriptor returns
> EBADF.
>
> * If closefrom argument overlaps the pipe file descriptor, it is
> splited in two calls: [lowdp, pipe - 1] and [pipe + 1, ~Ou].
>
> Failure on prepare phase in helper process does not trigger the fork
> and exec fallback.
>
> Checked on x86_64-linux-gnu.
> ---
> v3: Fixed closerange fallback,.
> v2: Fixed closerange file actions, handle EPIPE, return all errors,
> handle if pipe fd is used in file actions.
> ---
> include/unistd.h | 4 +-
> io/closefrom.c | 2 +-
> sysdeps/unix/sysv/linux/closefrom_fallback.c | 4 +-
> sysdeps/unix/sysv/linux/spawni.c | 235 +++++++++++++++----
> 4 files changed, 196 insertions(+), 49 deletions(-)
>
> diff --git a/include/unistd.h b/include/unistd.h
> index af795a37c8..2643991a09 100644
> --- a/include/unistd.h
> +++ b/include/unistd.h
> @@ -167,7 +167,9 @@ static inline _Bool __closefrom_fallback (int
> __lowfd, _Bool dirfd_fallback)
> return false;
> }
> # else
> -extern _Bool __closefrom_fallback (int __lowfd, _Bool)
> attribute_hidden;
> +extern _Bool __closefrom_fallback (unsigned int __from, unsigned int
> __to,
> + _Bool)
> + attribute_hidden;
> # endif
> extern ssize_t __read (int __fd, void *__buf, size_t __nbytes);
> libc_hidden_proto (__read)
> diff --git a/io/closefrom.c b/io/closefrom.c
> index 48cac2e3d1..a07de06e9b 100644
> --- a/io/closefrom.c
> +++ b/io/closefrom.c
> @@ -30,7 +30,7 @@ __closefrom (int lowfd)
> if (r == 0)
> return ;
>
> - if (!__closefrom_fallback (l, true))
> + if (!__closefrom_fallback (l, ~0U, true))
> __fortify_fail ("closefrom failed to close a file descriptor");
> }
> weak_alias (__closefrom, closefrom)
> diff --git a/sysdeps/unix/sysv/linux/closefrom_fallback.c
> b/sysdeps/unix/sysv/linux/closefrom_fallback.c
> index a9dd0c46b2..ad301e1910 100644
> --- a/sysdeps/unix/sysv/linux/closefrom_fallback.c
> +++ b/sysdeps/unix/sysv/linux/closefrom_fallback.c
> @@ -28,7 +28,7 @@
> /proc/self/fd open will trigger a fallback that tries to close a
> file
> descriptor before proceed. */
> _Bool
> -__closefrom_fallback (int from, _Bool dirfd_fallback)
> +__closefrom_fallback (unsigned int from, unsigned int to, _Bool
> dirfd_fallback)
> {
> int dirfd = __open_nocancel (FD_TO_FILENAME_PREFIX, O_RDONLY |
> O_DIRECTORY,
> 0);
> @@ -84,6 +84,8 @@ __closefrom_fallback (int from, _Bool dirfd_fallback)
>
> if (fd == dirfd || fd < from)
> continue;
> + if (fd > to)
> + break;
>
__closefrom_fallback() currently ignores "to" if it can't open
/proc/self/fd and dirfd_fallback is true. This path is not used by
posix_spawn(), but it's a potential future bug.
> /* We ignore close errors because EBADF, EINTR, and EIO
> means the
> descriptor has been released. */
> diff --git a/sysdeps/unix/sysv/linux/spawni.c
> b/sysdeps/unix/sysv/linux/spawni.c
> index d6f5ca89cd..0481791013 100644
> --- a/sysdeps/unix/sysv/linux/spawni.c
> +++ b/sysdeps/unix/sysv/linux/spawni.c
> @@ -44,7 +44,11 @@
> third issue is done by a stack allocation in parent, and by using a
> field in struct spawn_args where the child can write an error
> code. CLONE_VFORK ensures that the parent does not run until the
> - child has either exec'ed successfully or exited. */
> + child has either exec'ed successfully or exited.
> +
> + If the clone with CLONE_VM and CLONE_VFORK fails (due any kernel
> limitation
> + such as time namespace), only CLONE_VFORK is used instead and the
> + preparation and execve failures are communicated with a pipe. */
>
>
> /* The Unix standard contains a long explanation of the way to signal
> @@ -67,6 +71,7 @@ struct posix_spawn_args
> char *const *envp;
> int xflags;
> int err;
> + int pipe[2];
> };
>
> /* Older version requires that shell script without shebang definition
> @@ -94,15 +99,63 @@ maybe_script_execute (struct posix_spawn_args
> *args)
> }
> }
>
> +/* If the file operation would clobber the pipe fd used to communicate
> with
> + parent, dup the pipe onto an unoccupied file descriptor. */
> +static inline bool
> +spawni_fa_handle_pipe (const struct __spawn_action *fa, int p[])
> +{
> + int fd;
> +
> + switch (fa->tag)
> + {
> + case spawn_do_close:
> + fd = fa->action.close_action.fd;
> + break;
> + case spawn_do_open:
> + fd = fa->action.open_action.fd;
> + break;
> + case spawn_do_dup2:
> + fd = fa->action.dup2_action.newfd;
> + break;
> + case spawn_do_fchdir:
> + fd = fa->action.fchdir_action.fd;
> + default:
> + return true;
> + }
> +
> + if (fd == p[1])
> + {
> + int r = __fcntl (p[1], F_DUPFD_CLOEXEC);
> + if (r < 0)
> + return false;
> + __close_nocancel (p[1]);
> + p[1] = r;
> + }
> +
> + return true;
> +}
> +
> +static inline bool
> +spawni_fa_closerange (int from, int to)
> +{
> +#if 0
> + int r = INLINE_SYSCALL_CALL (close_range, from, to, 0);
> + return r == 0 || __closefrom_fallback (from, to, false);
> +#else
> + return __closefrom_fallback (from, to, false);
> +#endif
> +}
> +
> /* Function used in the clone call to setup the signals mask,
> posix_spawn
> attributes, and file actions. It run on its own stack (provided by
> the
> posix_spawn call). */
> -static int
> -__spawni_child (void *arguments)
> +static _Noreturn int
> +spawni_child (void *arguments)
> {
> struct posix_spawn_args *args = arguments;
> const posix_spawnattr_t *restrict attr = args->attr;
> const posix_spawn_file_actions_t *file_actions = args->fa;
> + bool use_pipe = args->pipe[0] != -1;
>
> /* The child must ensure that no signal handler are enabled because
> it shared
> memory with parent, so the signal disposition must be either
> SIG_DFL or
> @@ -113,6 +166,9 @@ __spawni_child (void *arguments)
> struct sigaction sa;
> memset (&sa, '\0', sizeof (sa));
>
> + if (use_pipe)
> + __close (args->pipe[0]);
> +
> sigset_t hset;
> __sigprocmask (SIG_BLOCK, 0, &hset);
> for (int sig = 1; sig < _NSIG; ++sig)
> @@ -181,6 +237,9 @@ __spawni_child (void *arguments)
> {
> struct __spawn_action *action = &file_actions->__actions[cnt];
>
> + if (use_pipe && !spawni_fa_handle_pipe (action, args->pipe))
> + goto fail;
> +
> switch (action->tag)
> {
> case spawn_do_close:
> @@ -233,6 +292,11 @@ __spawni_child (void *arguments)
> break;
>
> case spawn_do_dup2:
> + if (use_pipe && action->action.dup2_action.fd == args->pipe[1])
> + {
> + errno = EBADF;
> + goto fail;
> + }
> /* Austin Group issue #411 requires adddup2 action with source
> and destination being equal to remove close-on-exec flag. */
> if (action->action.dup2_action.fd
> @@ -264,8 +328,18 @@ __spawni_child (void *arguments)
> case spawn_do_closefrom:
> {
> int lowfd = action->action.closefrom_action.from;
> - int r = INLINE_SYSCALL_CALL (close_range, lowfd, ~0U, 0);
> - if (r != 0 && !__closefrom_fallback (lowfd, false))
> + /* Skip the pipe descriptor if it is used. No need to handle
> + it since it is created with O_CLOEXEC. */
> + if (use_pipe && args->pipe[1] >= lowfd)
> + {
> + if (args->pipe[1] == lowfd
> + && !spawni_fa_closerange (lowfd + 1, ~0U))
> + goto fail;
> + else if (!spawni_fa_closerange (lowfd, args->pipe[1] - 1)
> + || !spawni_fa_closerange (args->pipe[1] + 1, ~0U))
The "else" branch will also run if "args->pipe[1] == lowfd" and
spawni_fa_closerange (lowfd + 1, ~0U) succeeds.
> + goto fail;
> + }
> + else if (!spawni_fa_closerange (lowfd, ~0U))
> goto fail;
> } break;
>
> @@ -300,10 +374,112 @@ fail:
> (EINTERNALBUG) describing that, use ECHILD. Another option would
> be to set args->err to some negative sentinel and have the parent
> abort(), but that seems needlessly harsh. */
> - args->err = errno ? : ECHILD;
> + int ret = errno ? : ECHILD;
> + if (use_pipe)
> + {
> + while (__write_nocancel (args->pipe[1], &ret, sizeof (ret)) < 0)
> + if (errno == EPIPE || errno == EBADF)
> + break;
> + }
> + else
> + args->err = ret;
> +
> _exit (SPAWN_ERROR);
> }
>
> +static pid_t
> +clone_call (struct posix_spawn_args *args, int flags, void *stack,
> + size_t stack_size)
> +{
> + struct clone_args clone_args =
> + {
> + .flags = flags,
> + .exit_signal = SIGCHLD,
> + .stack = (uintptr_t) stack,
> + .stack_size = stack_size,
> + };
> + return __clone_internal (&clone_args, spawni_child, args);
> +}
> +
> +/* Spawn a new process using clone with CLONE_VM | CLONE_VFORK (to
> optimize
> + memory and overcommit) and return TRUE if the helper was created or
> if the
> + failure was not due resource exhaustion. */
> +static bool
> +spawni_clone (struct posix_spawn_args *args, pid_t *new_pid, int *ec,
> + void *stack, size_t stack_size)
> +{
> + /* The clone flags used will create a new child that will run in the
> same
> + memory space (CLONE_VM) and the execution of calling thread will
> be
> + suspend until the child calls execve or _exit.
> +
> + Also since the calling thread execution will be suspend, there is
> not
> + need for CLONE_SETTLS. Although parent and child share the same
> TLS
> + namespace, there will be no concurrent access for TLS variables
> (errno
> + for instance). */
> + *new_pid = clone_call (args, CLONE_VM | CLONE_VFORK, stack,
> stack_size);
> +
> + /* It needs to collect the case where the auxiliary process was
> created
> + but failed to execute the file (due either any preparation step
> or
> + for execve itself). */
> + if (*new_pid > 0)
> + {
> + /* Also, it handles the unlikely case where the auxiliary
> process was
> + terminated before calling execve as if it was successfully. The
> + args.err is set to 0 as default and changed to a positive value
> + only in case of failure, so in case of premature termination
> + due a signal args.err will remain zeroed and it will be up to
> + caller to actually collect it. */
> + *ec = args->err;
> + if (*ec > 0)
> + /* There still an unlikely case where the child is cancelled after
> + setting args.err, due to a positive error value. Also there is
> + possible pid reuse race (where the kernel allocated the same pid
> + to an unrelated process). Unfortunately due synchronization
> + issues where the kernel might not have the process collected
> + the waitpid below can not use WNOHANG. */
> + __waitpid (*new_pid, NULL, 0);
> + }
> + else
> + *ec = errno;
> +
> + /* There is no much point in retrying with fork and exec if kernel
> returns a
> + failure due resource exhaustion. */
> + return *new_pid > 0 || (errno == ENOMEM || errno == EAGAIN);
> +}
> +
> +/* Fallback spawn case which does not use CLONE_VM. Any preparation
> step or
> + execve failure is passed with a pipe, which requires additional
> care by
> + the helper stating process since it additional file descriptors
> handle. */
> +static void
> +spawni_fork (struct posix_spawn_args *args, pid_t *new_pid, int *ec,
> + char *stack, size_t stack_size)
> +{
> + if (__pipe2 (args->pipe, O_CLOEXEC) != 0)
> + {
> + *ec = errno;
> + return;
> + }
> +
> + /* Do not trigger atfork handler nor any internal state reset since
> the
> + helper process will call execve. */
> + *new_pid = clone_call (args, CLONE_VFORK, stack, stack_size);
> +
> + __close (args->pipe[1]);
> +
> + if (*new_pid > 0)
> + {
> + if (__read (args->pipe[0], ec, sizeof *ec) != sizeof *ec)
> + /* A successful execve will close the helper process pipe end. */
> + *ec = 0;
> + else
> + __waitpid (*new_pid, NULL, 0);
> + }
> + else
> + *ec = errno;
> +
> + __close (args->pipe[0]);
> +}
> +
> /* Spawn a new process executing PATH with the attributes describes in
> *ATTRP.
> Before running the process perform the actions described in
> FILE-ACTIONS. */
> static int
> @@ -367,49 +543,16 @@ __spawnix (pid_t * pid, const char *file,
> args.argc = argc;
> args.envp = envp;
> args.xflags = xflags;
> + args.pipe[0] = args.pipe[1] = -1;
>
> __libc_signal_block_all (&args.oldmask);
>
> - /* The clone flags used will create a new child that will run in the
> same
> - memory space (CLONE_VM) and the execution of calling thread will
> be
> - suspend until the child calls execve or _exit.
> -
> - Also since the calling thread execution will be suspend, there is
> not
> - need for CLONE_SETTLS. Although parent and child share the same
> TLS
> - namespace, there will be no concurrent access for TLS variables
> (errno
> - for instance). */
> - struct clone_args clone_args =
> - {
> - .flags = CLONE_VM | CLONE_VFORK,
> - .exit_signal = SIGCHLD,
> - .stack = (uintptr_t) stack,
> - .stack_size = stack_size,
> - };
> - new_pid = __clone_internal (&clone_args, __spawni_child, &args);
> -
> - /* It needs to collect the case where the auxiliary process was
> created
> - but failed to execute the file (due either any preparation step
> or
> - for execve itself). */
> - if (new_pid > 0)
> - {
> - /* Also, it handles the unlikely case where the auxiliary
> process was
> - terminated before calling execve as if it was successfully. The
> - args.err is set to 0 as default and changed to a positive value
> - only in case of failure, so in case of premature termination
> - due a signal args.err will remain zeroed and it will be up to
> - caller to actually collect it. */
> - ec = args.err;
> - if (ec > 0)
> - /* There still an unlikely case where the child is cancelled after
> - setting args.err, due to a positive error value. Also there is
> - possible pid reuse race (where the kernel allocated the same pid
> - to an unrelated process). Unfortunately due synchronization
> - issues where the kernel might not have the process collected
> - the waitpid below can not use WNOHANG. */
> - __waitpid (new_pid, NULL, 0);
> - }
> - else
> - ec = errno;
> + /* clone with CLONE_VM | CLONE_VFORK may fail for some namespace
> restriction
> + (for instance Linux does not allow processes in different time
> namespaces
> + to share address space) and in this case clone fails with EINVAL.
> Retry
> + with fork and exec. */
> + if (!spawni_clone (&args, &new_pid, &ec, stack, stack_size))
> + spawni_fork (&args, &new_pid, &ec, stack, stack_size);
>
> __munmap (stack, stack_size);
next prev parent reply other threads:[~2022-05-03 17:36 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-05-03 17:00 Adhemerval Zanella
2022-05-03 17:36 ` Alexey Izbyshev [this message]
2022-05-03 17:50 ` Adhemerval Zanella
2022-05-03 22:49 ` Alexey Izbyshev
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=66c881094d4f4d4ff1aa47b8d4f94230@ispras.ru \
--to=izbyshev@ispras.ru \
--cc=adhemerval.zanella@linaro.org \
--cc=carlos@redhat.com \
--cc=fweimer@redhat.com \
--cc=libc-alpha@sourceware.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).