public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
From: Alexey Izbyshev <izbyshev@ispras.ru>
To: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Cc: libc-alpha@sourceware.org, Carlos O'Donell <carlos@redhat.com>,
	Florian Weimer <fweimer@redhat.com>
Subject: Re: [PATCH v3] linux: Add fallback for clone failure on posix_spawn (BZ#29115)
Date: Tue, 03 May 2022 20:36:28 +0300	[thread overview]
Message-ID: <66c881094d4f4d4ff1aa47b8d4f94230@ispras.ru> (raw)
In-Reply-To: <20220503170032.2531471-1-adhemerval.zanella@linaro.org>

On 2022-05-03 20:00, Adhemerval Zanella wrote:
> Even though it might characterizeis as a kernel bug (incomplete kernel
> support for CLONE_VM used along with CLONE_VFORK), time namespace
> support
> is already presented in a some kernel releases (since 2020).
> 
> Although I agree with Carlos that it results in unexpected surprise
> behaviour, I also see that providing a reasonable fallback that does 
> not
> add possible issues like race conditions (as some syscall fallbacks in
> the
> past) provides a easier transition and make the interfaca to work on
> multiple scenarios.
> 
> The fix itself does adds some corner cases that need to be handled, but
> I still think it is maintanable and a general improvement.
> 
> --
> 
> Linux clone with CLONE_VM may fail for some namespace restriction (for
> instance if kernel does not allow processes in different time 
> namespaces
> to share the sameaddress space).
> 
> In this case clone fails with EINVAL and posix_spawn can not spawn a 
> new
> process.  However the same process can be spawned with fork and exec.
> 
> The patch fixes by retrying the clone syscall with just CLONE_VFORK
> if clone fails with a non transient failure (ENOMEM and EAGAIN still
> returns failure to caller).
> 
> Error on preparation phase or execve is returned by a pipe now that
> there is no shared memory that ca be used.  It requires some extra care
> for some file operations:
> 
>   * If the file operation would clobber the pipe file descriptor, the
>     helper process dup the pipe onto an unoccupied file descriptor.
> 
>   * dup2 file action that targets the pipe file descriptor returns
>     EBADF.
> 
>   * If closefrom argument overlaps the pipe file descriptor, it is
>     splited in two calls: [lowdp, pipe - 1] and [pipe + 1, ~Ou].
> 
> Failure on prepare phase in helper process does not trigger the fork
> and exec fallback.
> 
> Checked on x86_64-linux-gnu.
> ---
> v3: Fixed closerange fallback,.
> v2: Fixed closerange file actions, handle EPIPE, return all errors,
>     handle if pipe fd is used in file actions.
> ---
>  include/unistd.h                             |   4 +-
>  io/closefrom.c                               |   2 +-
>  sysdeps/unix/sysv/linux/closefrom_fallback.c |   4 +-
>  sysdeps/unix/sysv/linux/spawni.c             | 235 +++++++++++++++----
>  4 files changed, 196 insertions(+), 49 deletions(-)
> 
> diff --git a/include/unistd.h b/include/unistd.h
> index af795a37c8..2643991a09 100644
> --- a/include/unistd.h
> +++ b/include/unistd.h
> @@ -167,7 +167,9 @@ static inline _Bool __closefrom_fallback (int
> __lowfd, _Bool dirfd_fallback)
>    return false;
>  }
>  #  else
> -extern _Bool __closefrom_fallback (int __lowfd, _Bool) 
> attribute_hidden;
> +extern _Bool __closefrom_fallback (unsigned int __from, unsigned int 
> __to,
> +				   _Bool)
> +     attribute_hidden;
>  #  endif
>  extern ssize_t __read (int __fd, void *__buf, size_t __nbytes);
>  libc_hidden_proto (__read)
> diff --git a/io/closefrom.c b/io/closefrom.c
> index 48cac2e3d1..a07de06e9b 100644
> --- a/io/closefrom.c
> +++ b/io/closefrom.c
> @@ -30,7 +30,7 @@ __closefrom (int lowfd)
>    if (r == 0)
>      return ;
> 
> -  if (!__closefrom_fallback (l, true))
> +  if (!__closefrom_fallback (l, ~0U, true))
>      __fortify_fail ("closefrom failed to close a file descriptor");
>  }
>  weak_alias (__closefrom, closefrom)
> diff --git a/sysdeps/unix/sysv/linux/closefrom_fallback.c
> b/sysdeps/unix/sysv/linux/closefrom_fallback.c
> index a9dd0c46b2..ad301e1910 100644
> --- a/sysdeps/unix/sysv/linux/closefrom_fallback.c
> +++ b/sysdeps/unix/sysv/linux/closefrom_fallback.c
> @@ -28,7 +28,7 @@
>     /proc/self/fd open will trigger a fallback that tries to close a 
> file
>     descriptor before proceed.  */
>  _Bool
> -__closefrom_fallback (int from, _Bool dirfd_fallback)
> +__closefrom_fallback (unsigned int from, unsigned int to, _Bool 
> dirfd_fallback)
>  {
>    int dirfd = __open_nocancel (FD_TO_FILENAME_PREFIX, O_RDONLY | 
> O_DIRECTORY,
>                                 0);
> @@ -84,6 +84,8 @@ __closefrom_fallback (int from, _Bool dirfd_fallback)
> 
>            if (fd == dirfd || fd < from)
>              continue;
> +	  if (fd > to)
> +	    break;
> 
__closefrom_fallback() currently ignores "to" if it can't open 
/proc/self/fd and dirfd_fallback is true. This path is not used by 
posix_spawn(), but it's a potential future bug.

>            /* We ignore close errors because EBADF, EINTR, and EIO 
> means the
>               descriptor has been released.  */
> diff --git a/sysdeps/unix/sysv/linux/spawni.c 
> b/sysdeps/unix/sysv/linux/spawni.c
> index d6f5ca89cd..0481791013 100644
> --- a/sysdeps/unix/sysv/linux/spawni.c
> +++ b/sysdeps/unix/sysv/linux/spawni.c
> @@ -44,7 +44,11 @@
>     third issue is done by a stack allocation in parent, and by using a
>     field in struct spawn_args where the child can write an error
>     code. CLONE_VFORK ensures that the parent does not run until the
> -   child has either exec'ed successfully or exited.  */
> +   child has either exec'ed successfully or exited.
> +
> +   If the clone with CLONE_VM and CLONE_VFORK fails (due any kernel 
> limitation
> +   such as time namespace), only CLONE_VFORK is used instead and the
> +   preparation and execve failures are communicated with a pipe.  */
> 
> 
>  /* The Unix standard contains a long explanation of the way to signal
> @@ -67,6 +71,7 @@ struct posix_spawn_args
>    char *const *envp;
>    int xflags;
>    int err;
> +  int pipe[2];
>  };
> 
>  /* Older version requires that shell script without shebang definition
> @@ -94,15 +99,63 @@ maybe_script_execute (struct posix_spawn_args 
> *args)
>      }
>  }
> 
> +/* If the file operation would clobber the pipe fd used to communicate 
> with
> +   parent, dup the pipe onto an unoccupied file descriptor.  */
> +static inline bool
> +spawni_fa_handle_pipe (const struct __spawn_action *fa, int p[])
> +{
> +  int fd;
> +
> +  switch (fa->tag)
> +    {
> +    case spawn_do_close:
> +      fd = fa->action.close_action.fd;
> +      break;
> +    case spawn_do_open:
> +      fd = fa->action.open_action.fd;
> +      break;
> +    case spawn_do_dup2:
> +      fd = fa->action.dup2_action.newfd;
> +      break;
> +    case spawn_do_fchdir:
> +      fd = fa->action.fchdir_action.fd;
> +    default:
> +      return true;
> +    }
> +
> +  if (fd == p[1])
> +    {
> +      int r = __fcntl (p[1], F_DUPFD_CLOEXEC);
> +      if (r < 0)
> +	return false;
> +      __close_nocancel (p[1]);
> +      p[1] = r;
> +    }
> +
> +  return true;
> +}
> +
> +static inline bool
> +spawni_fa_closerange (int from, int to)
> +{
> +#if 0
> +  int r = INLINE_SYSCALL_CALL (close_range, from, to, 0);
> +  return r == 0 || __closefrom_fallback (from, to, false);
> +#else
> +  return __closefrom_fallback (from, to, false);
> +#endif
> +}
> +
>  /* Function used in the clone call to setup the signals mask, 
> posix_spawn
>     attributes, and file actions.  It run on its own stack (provided by 
> the
>     posix_spawn call).  */
> -static int
> -__spawni_child (void *arguments)
> +static _Noreturn int
> +spawni_child (void *arguments)
>  {
>    struct posix_spawn_args *args = arguments;
>    const posix_spawnattr_t *restrict attr = args->attr;
>    const posix_spawn_file_actions_t *file_actions = args->fa;
> +  bool use_pipe = args->pipe[0] != -1;
> 
>    /* The child must ensure that no signal handler are enabled because 
> it shared
>       memory with parent, so the signal disposition must be either 
> SIG_DFL or
> @@ -113,6 +166,9 @@ __spawni_child (void *arguments)
>    struct sigaction sa;
>    memset (&sa, '\0', sizeof (sa));
> 
> +  if (use_pipe)
> +    __close (args->pipe[0]);
> +
>    sigset_t hset;
>    __sigprocmask (SIG_BLOCK, 0, &hset);
>    for (int sig = 1; sig < _NSIG; ++sig)
> @@ -181,6 +237,9 @@ __spawni_child (void *arguments)
>  	{
>  	  struct __spawn_action *action = &file_actions->__actions[cnt];
> 
> +	  if (use_pipe && !spawni_fa_handle_pipe (action, args->pipe))
> +	    goto fail;
> +
>  	  switch (action->tag)
>  	    {
>  	    case spawn_do_close:
> @@ -233,6 +292,11 @@ __spawni_child (void *arguments)
>  	      break;
> 
>  	    case spawn_do_dup2:
> +	      if (use_pipe && action->action.dup2_action.fd == args->pipe[1])
> +		{
> +		  errno = EBADF;
> +		  goto fail;
> +		}
>  	      /* Austin Group issue #411 requires adddup2 action with source
>  		 and destination being equal to remove close-on-exec flag.  */
>  	      if (action->action.dup2_action.fd
> @@ -264,8 +328,18 @@ __spawni_child (void *arguments)
>  	    case spawn_do_closefrom:
>  	      {
>  		int lowfd = action->action.closefrom_action.from;
> -	        int r = INLINE_SYSCALL_CALL (close_range, lowfd, ~0U, 0);
> -		if (r != 0 && !__closefrom_fallback (lowfd, false))
> +		/* Skip the pipe descriptor if it is used.  No need to handle
> +		   it since it is created with O_CLOEXEC.  */
> +		if (use_pipe && args->pipe[1] >= lowfd)
> +		  {
> +		    if (args->pipe[1] == lowfd
> +			 && !spawni_fa_closerange (lowfd + 1, ~0U))
> +		      goto fail;
> +		    else if (!spawni_fa_closerange (lowfd, args->pipe[1] - 1)
> +			     || !spawni_fa_closerange (args->pipe[1] + 1, ~0U))

The "else" branch will also run if "args->pipe[1] == lowfd" and 
spawni_fa_closerange (lowfd + 1, ~0U) succeeds.

> +		      goto fail;
> +		  }
> +		else if (!spawni_fa_closerange (lowfd, ~0U))
>  		  goto fail;
>  	      } break;
> 
> @@ -300,10 +374,112 @@ fail:
>       (EINTERNALBUG) describing that, use ECHILD.  Another option would
>       be to set args->err to some negative sentinel and have the parent
>       abort(), but that seems needlessly harsh.  */
> -  args->err = errno ? : ECHILD;
> +  int ret = errno ? : ECHILD;
> +  if (use_pipe)
> +    {
> +      while (__write_nocancel (args->pipe[1], &ret, sizeof (ret)) < 0)
> +	if (errno == EPIPE || errno == EBADF)
> +	  break;
> +    }
> +  else
> +    args->err = ret;
> +
>    _exit (SPAWN_ERROR);
>  }
> 
> +static pid_t
> +clone_call (struct posix_spawn_args *args, int flags, void *stack,
> +	    size_t stack_size)
> +{
> +  struct clone_args clone_args =
> +    {
> +      .flags = flags,
> +      .exit_signal = SIGCHLD,
> +      .stack = (uintptr_t) stack,
> +      .stack_size = stack_size,
> +    };
> +  return __clone_internal (&clone_args, spawni_child, args);
> +}
> +
> +/* Spawn a new process using clone with CLONE_VM | CLONE_VFORK (to 
> optimize
> +   memory and overcommit) and return TRUE if the helper was created or 
> if the
> +   failure was not due resource exhaustion.  */
> +static bool
> +spawni_clone (struct posix_spawn_args *args, pid_t *new_pid, int *ec,
> +	      void *stack, size_t stack_size)
> +{
> +  /* The clone flags used will create a new child that will run in the 
> same
> +     memory space (CLONE_VM) and the execution of calling thread will 
> be
> +     suspend until the child calls execve or _exit.
> +
> +     Also since the calling thread execution will be suspend, there is 
> not
> +     need for CLONE_SETTLS.  Although parent and child share the same 
> TLS
> +     namespace, there will be no concurrent access for TLS variables 
> (errno
> +     for instance).  */
> +  *new_pid = clone_call (args, CLONE_VM | CLONE_VFORK, stack, 
> stack_size);
> +
> +  /* It needs to collect the case where the auxiliary process was 
> created
> +     but failed to execute the file (due either any preparation step 
> or
> +     for execve itself).  */
> +  if (*new_pid > 0)
> +    {
> +      /* Also, it handles the unlikely case where the auxiliary 
> process was
> +	 terminated before calling execve as if it was successfully.  The
> +	 args.err is set to 0 as default and changed to a positive value
> +	 only in case of failure, so in case of premature termination
> +	 due a signal args.err will remain zeroed and it will be up to
> +	 caller to actually collect it.  */
> +      *ec = args->err;
> +      if (*ec > 0)
> +	/* There still an unlikely case where the child is cancelled after
> +	   setting args.err, due to a positive error value.  Also there is
> +	   possible pid reuse race (where the kernel allocated the same pid
> +	   to an unrelated process).  Unfortunately due synchronization
> +	   issues where the kernel might not have the process collected
> +	   the waitpid below can not use WNOHANG.  */
> +	__waitpid (*new_pid, NULL, 0);
> +    }
> +  else
> +    *ec = errno;
> +
> +  /* There is no much point in retrying with fork and exec if kernel 
> returns a
> +     failure due resource exhaustion.  */
> +  return *new_pid > 0 || (errno == ENOMEM || errno == EAGAIN);
> +}
> +
> +/* Fallback spawn case which does not use CLONE_VM.  Any preparation 
> step or
> +   execve failure is passed with a pipe, which requires additional 
> care by
> +   the helper stating process since it additional file descriptors 
> handle.  */
> +static void
> +spawni_fork (struct posix_spawn_args *args, pid_t *new_pid, int *ec,
> +	     char *stack, size_t stack_size)
> +{
> +  if (__pipe2 (args->pipe, O_CLOEXEC) != 0)
> +    {
> +      *ec = errno;
> +      return;
> +    }
> +
> +  /* Do not trigger atfork handler nor any internal state reset since 
> the
> +     helper process will call execve.  */
> +  *new_pid = clone_call (args, CLONE_VFORK, stack, stack_size);
> +
> +  __close (args->pipe[1]);
> +
> +  if (*new_pid > 0)
> +    {
> +      if (__read (args->pipe[0], ec, sizeof *ec) != sizeof *ec)
> +	/* A successful execve will close the helper process pipe end.  */
> +	*ec = 0;
> +      else
> +	__waitpid (*new_pid, NULL, 0);
> +    }
> +  else
> +    *ec = errno;
> +
> +  __close (args->pipe[0]);
> +}
> +
>  /* Spawn a new process executing PATH with the attributes describes in 
> *ATTRP.
>     Before running the process perform the actions described in 
> FILE-ACTIONS. */
>  static int
> @@ -367,49 +543,16 @@ __spawnix (pid_t * pid, const char *file,
>    args.argc = argc;
>    args.envp = envp;
>    args.xflags = xflags;
> +  args.pipe[0] = args.pipe[1] = -1;
> 
>    __libc_signal_block_all (&args.oldmask);
> 
> -  /* The clone flags used will create a new child that will run in the 
> same
> -     memory space (CLONE_VM) and the execution of calling thread will 
> be
> -     suspend until the child calls execve or _exit.
> -
> -     Also since the calling thread execution will be suspend, there is 
> not
> -     need for CLONE_SETTLS.  Although parent and child share the same 
> TLS
> -     namespace, there will be no concurrent access for TLS variables 
> (errno
> -     for instance).  */
> -  struct clone_args clone_args =
> -    {
> -      .flags = CLONE_VM | CLONE_VFORK,
> -      .exit_signal = SIGCHLD,
> -      .stack = (uintptr_t) stack,
> -      .stack_size = stack_size,
> -    };
> -  new_pid = __clone_internal (&clone_args, __spawni_child, &args);
> -
> -  /* It needs to collect the case where the auxiliary process was 
> created
> -     but failed to execute the file (due either any preparation step 
> or
> -     for execve itself).  */
> -  if (new_pid > 0)
> -    {
> -      /* Also, it handles the unlikely case where the auxiliary 
> process was
> -	 terminated before calling execve as if it was successfully.  The
> -	 args.err is set to 0 as default and changed to a positive value
> -	 only in case of failure, so in case of premature termination
> -	 due a signal args.err will remain zeroed and it will be up to
> -	 caller to actually collect it.  */
> -      ec = args.err;
> -      if (ec > 0)
> -	/* There still an unlikely case where the child is cancelled after
> -	   setting args.err, due to a positive error value.  Also there is
> -	   possible pid reuse race (where the kernel allocated the same pid
> -	   to an unrelated process).  Unfortunately due synchronization
> -	   issues where the kernel might not have the process collected
> -	   the waitpid below can not use WNOHANG.  */
> -	__waitpid (new_pid, NULL, 0);
> -    }
> -  else
> -    ec = errno;
> +  /* clone with CLONE_VM | CLONE_VFORK may fail for some namespace 
> restriction
> +     (for instance Linux does not allow processes in different time 
> namespaces
> +     to share address space) and in this case clone fails with EINVAL. 
>  Retry
> +     with fork and exec.  */
> +  if (!spawni_clone (&args, &new_pid, &ec, stack, stack_size))
> +    spawni_fork (&args, &new_pid, &ec, stack, stack_size);
> 
>    __munmap (stack, stack_size);

  reply	other threads:[~2022-05-03 17:36 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-05-03 17:00 Adhemerval Zanella
2022-05-03 17:36 ` Alexey Izbyshev [this message]
2022-05-03 17:50   ` Adhemerval Zanella
2022-05-03 22:49     ` Alexey Izbyshev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=66c881094d4f4d4ff1aa47b8d4f94230@ispras.ru \
    --to=izbyshev@ispras.ru \
    --cc=adhemerval.zanella@linaro.org \
    --cc=carlos@redhat.com \
    --cc=fweimer@redhat.com \
    --cc=libc-alpha@sourceware.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).