From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <izbyshev@ispras.ru>
Received: from mail.ispras.ru (mail.ispras.ru [83.149.199.84])
 by sourceware.org (Postfix) with ESMTPS id 3628D3858C51
 for <libc-alpha@sourceware.org>; Tue,  3 May 2022 22:50:01 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 3628D3858C51
Authentication-Results: sourceware.org;
 dmarc=pass (p=none dis=none) header.from=ispras.ru
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=ispras.ru
Received: from mail.ispras.ru (unknown [83.149.199.84])
 by mail.ispras.ru (Postfix) with ESMTPSA id 3D1EE40D403D;
 Tue,  3 May 2022 22:49:55 +0000 (UTC)
MIME-Version: 1.0
Date: Wed, 04 May 2022 01:49:55 +0300
From: Alexey Izbyshev <izbyshev@ispras.ru>
To: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Cc: libc-alpha@sourceware.org, Carlos O'Donell <carlos@redhat.com>, Florian
 Weimer <fweimer@redhat.com>
Subject: Re: [PATCH v3] linux: Add fallback for clone failure on posix_spawn
 (BZ#29115)
In-Reply-To: <f8a61700-e89e-f38e-8d3d-818d641ca56a@linaro.org>
References: <20220503170032.2531471-1-adhemerval.zanella@linaro.org>
 <66c881094d4f4d4ff1aa47b8d4f94230@ispras.ru>
 <f8a61700-e89e-f38e-8d3d-818d641ca56a@linaro.org>
User-Agent: Roundcube Webmail/1.4.4
Message-ID: <2f2030c8b774aac2e09c6538fab780a1@ispras.ru>
X-Sender: izbyshev@ispras.ru
Content-Type: text/plain; charset=UTF-8;
 format=flowed
Content-Transfer-Encoding: 8bit
X-Spam-Status: No, score=-7.0 required=5.0 tests=BAYES_00, BODY_8BITS,
 GIT_PATCH_0, KAM_DMARC_STATUS, SPF_HELO_NONE, SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Tue, 03 May 2022 22:50:05 -0000

On 2022-05-03 20:50, Adhemerval Zanella wrote:
> On 03/05/2022 14:36, Alexey Izbyshev wrote:
>> On 2022-05-03 20:00, Adhemerval Zanella wrote:
>>> Even though it might characterizeis as a kernel bug (incomplete 
>>> kernel
>>> support for CLONE_VM used along with CLONE_VFORK), time namespace
>>> support
>>> is already presented in a some kernel releases (since 2020).
>>> 
>>> Although I agree with Carlos that it results in unexpected surprise
>>> behaviour, I also see that providing a reasonable fallback that does 
>>> not
>>> add possible issues like race conditions (as some syscall fallbacks 
>>> in
>>> the
>>> past) provides a easier transition and make the interfaca to work on
>>> multiple scenarios.
>>> 
>>> The fix itself does adds some corner cases that need to be handled, 
>>> but
>>> I still think it is maintanable and a general improvement.
>>> 
>>> --
>>> 
>>> Linux clone with CLONE_VM may fail for some namespace restriction 
>>> (for
>>> instance if kernel does not allow processes in different time 
>>> namespaces
>>> to share the sameaddress space).
>>> 
>>> In this case clone fails with EINVAL and posix_spawn can not spawn a 
>>> new
>>> process.  However the same process can be spawned with fork and exec.
>>> 
>>> The patch fixes by retrying the clone syscall with just CLONE_VFORK
>>> if clone fails with a non transient failure (ENOMEM and EAGAIN still
>>> returns failure to caller).
>>> 
>>> Error on preparation phase or execve is returned by a pipe now that
>>> there is no shared memory that ca be used.  It requires some extra 
>>> care
>>> for some file operations:
>>> 
>>>   * If the file operation would clobber the pipe file descriptor, the
>>>     helper process dup the pipe onto an unoccupied file descriptor.
>>> 
>>>   * dup2 file action that targets the pipe file descriptor returns
>>>     EBADF.
>>> 
>>>   * If closefrom argument overlaps the pipe file descriptor, it is
>>>     splited in two calls: [lowdp, pipe - 1] and [pipe + 1, ~Ou].
>>> 
>>> Failure on prepare phase in helper process does not trigger the fork
>>> and exec fallback.
>>> 
>>> Checked on x86_64-linux-gnu.
>>> ---
>>> v3: Fixed closerange fallback,.
>>> v2: Fixed closerange file actions, handle EPIPE, return all errors,
>>>     handle if pipe fd is used in file actions.
>>> ---
>>>  include/unistd.h                             |   4 +-
>>>  io/closefrom.c                               |   2 +-
>>>  sysdeps/unix/sysv/linux/closefrom_fallback.c |   4 +-
>>>  sysdeps/unix/sysv/linux/spawni.c             | 235 
>>> +++++++++++++++----
>>>  4 files changed, 196 insertions(+), 49 deletions(-)
>>> 
>>> diff --git a/include/unistd.h b/include/unistd.h
>>> index af795a37c8..2643991a09 100644
>>> --- a/include/unistd.h
>>> +++ b/include/unistd.h
>>> @@ -167,7 +167,9 @@ static inline _Bool __closefrom_fallback (int
>>> __lowfd, _Bool dirfd_fallback)
>>>    return false;
>>>  }
>>>  #  else
>>> -extern _Bool __closefrom_fallback (int __lowfd, _Bool) 
>>> attribute_hidden;
>>> +extern _Bool __closefrom_fallback (unsigned int __from, unsigned int 
>>> __to,
>>> +                   _Bool)
>>> +     attribute_hidden;
>>>  #  endif
>>>  extern ssize_t __read (int __fd, void *__buf, size_t __nbytes);
>>>  libc_hidden_proto (__read)
>>> diff --git a/io/closefrom.c b/io/closefrom.c
>>> index 48cac2e3d1..a07de06e9b 100644
>>> --- a/io/closefrom.c
>>> +++ b/io/closefrom.c
>>> @@ -30,7 +30,7 @@ __closefrom (int lowfd)
>>>    if (r == 0)
>>>      return ;
>>> 
>>> -  if (!__closefrom_fallback (l, true))
>>> +  if (!__closefrom_fallback (l, ~0U, true))
>>>      __fortify_fail ("closefrom failed to close a file descriptor");
>>>  }
>>>  weak_alias (__closefrom, closefrom)
>>> diff --git a/sysdeps/unix/sysv/linux/closefrom_fallback.c
>>> b/sysdeps/unix/sysv/linux/closefrom_fallback.c
>>> index a9dd0c46b2..ad301e1910 100644
>>> --- a/sysdeps/unix/sysv/linux/closefrom_fallback.c
>>> +++ b/sysdeps/unix/sysv/linux/closefrom_fallback.c
>>> @@ -28,7 +28,7 @@
>>>     /proc/self/fd open will trigger a fallback that tries to close a 
>>> file
>>>     descriptor before proceed.  */
>>>  _Bool
>>> -__closefrom_fallback (int from, _Bool dirfd_fallback)
>>> +__closefrom_fallback (unsigned int from, unsigned int to, _Bool 
>>> dirfd_fallback)
>>>  {
>>>    int dirfd = __open_nocancel (FD_TO_FILENAME_PREFIX, O_RDONLY | 
>>> O_DIRECTORY,
>>>                                 0);
>>> @@ -84,6 +84,8 @@ __closefrom_fallback (int from, _Bool 
>>> dirfd_fallback)
>>> 
>>>            if (fd == dirfd || fd < from)
>>>              continue;
>>> +      if (fd > to)
>>> +        break;
>>> 
>> __closefrom_fallback() currently ignores "to" if it can't open 
>> /proc/self/fd and dirfd_fallback is true. This path is not used by 
>> posix_spawn(), but it's a potential future bug.
> 
> Indeed, I think iterate over [from, to] instead of [from, INT_MAX]
> should fix it.
> 
Yes, with some care for the fact that "to" is unsigned.
> 
>> 
>>>            /* We ignore close errors because EBADF, EINTR, and EIO 
>>> means the
>>>               descriptor has been released.  */
>>> diff --git a/sysdeps/unix/sysv/linux/spawni.c 
>>> b/sysdeps/unix/sysv/linux/spawni.c
>>> index d6f5ca89cd..0481791013 100644
>>> --- a/sysdeps/unix/sysv/linux/spawni.c
>>> +++ b/sysdeps/unix/sysv/linux/spawni.c
>>> @@ -44,7 +44,11 @@
>>>     third issue is done by a stack allocation in parent, and by using 
>>> a
>>>     field in struct spawn_args where the child can write an error
>>>     code. CLONE_VFORK ensures that the parent does not run until the
>>> -   child has either exec'ed successfully or exited.  */
>>> +   child has either exec'ed successfully or exited.
>>> +
>>> +   If the clone with CLONE_VM and CLONE_VFORK fails (due any kernel 
>>> limitation
>>> +   such as time namespace), only CLONE_VFORK is used instead and the
>>> +   preparation and execve failures are communicated with a pipe.  */
>>> 
>>> 
>>>  /* The Unix standard contains a long explanation of the way to 
>>> signal
>>> @@ -67,6 +71,7 @@ struct posix_spawn_args
>>>    char *const *envp;
>>>    int xflags;
>>>    int err;
>>> +  int pipe[2];
>>>  };
>>> 
>>>  /* Older version requires that shell script without shebang 
>>> definition
>>> @@ -94,15 +99,63 @@ maybe_script_execute (struct posix_spawn_args 
>>> *args)
>>>      }
>>>  }
>>> 
>>> +/* If the file operation would clobber the pipe fd used to 
>>> communicate with
>>> +   parent, dup the pipe onto an unoccupied file descriptor.  */
>>> +static inline bool
>>> +spawni_fa_handle_pipe (const struct __spawn_action *fa, int p[])
>>> +{
>>> +  int fd;
>>> +
>>> +  switch (fa->tag)
>>> +    {
>>> +    case spawn_do_close:
>>> +      fd = fa->action.close_action.fd;
>>> +      break;
>>> +    case spawn_do_open:
>>> +      fd = fa->action.open_action.fd;
>>> +      break;
>>> +    case spawn_do_dup2:
>>> +      fd = fa->action.dup2_action.newfd;
>>> +      break;
>>> +    case spawn_do_fchdir:
>>> +      fd = fa->action.fchdir_action.fd;
>>> +    default:
>>> +      return true;
>>> +    }
>>> +
>>> +  if (fd == p[1])
>>> +    {
>>> +      int r = __fcntl (p[1], F_DUPFD_CLOEXEC);
>>> +      if (r < 0)
>>> +    return false;
>>> +      __close_nocancel (p[1]);
>>> +      p[1] = r;
>>> +    }
>>> +
>>> +  return true;
>>> +}
>>> +
>>> +static inline bool
>>> +spawni_fa_closerange (int from, int to)
>>> +{
>>> +#if 0
>>> +  int r = INLINE_SYSCALL_CALL (close_range, from, to, 0);
>>> +  return r == 0 || __closefrom_fallback (from, to, false);
>>> +#else
>>> +  return __closefrom_fallback (from, to, false);
>>> +#endif
>>> +}
>>> +
>>>  /* Function used in the clone call to setup the signals mask, 
>>> posix_spawn
>>>     attributes, and file actions.  It run on its own stack (provided 
>>> by the
>>>     posix_spawn call).  */
>>> -static int
>>> -__spawni_child (void *arguments)
>>> +static _Noreturn int
>>> +spawni_child (void *arguments)
>>>  {
>>>    struct posix_spawn_args *args = arguments;
>>>    const posix_spawnattr_t *restrict attr = args->attr;
>>>    const posix_spawn_file_actions_t *file_actions = args->fa;
>>> +  bool use_pipe = args->pipe[0] != -1;
>>> 
>>>    /* The child must ensure that no signal handler are enabled 
>>> because it shared
>>>       memory with parent, so the signal disposition must be either 
>>> SIG_DFL or
>>> @@ -113,6 +166,9 @@ __spawni_child (void *arguments)
>>>    struct sigaction sa;
>>>    memset (&sa, '\0', sizeof (sa));
>>> 
>>> +  if (use_pipe)
>>> +    __close (args->pipe[0]);
>>> +
>>>    sigset_t hset;
>>>    __sigprocmask (SIG_BLOCK, 0, &hset);
>>>    for (int sig = 1; sig < _NSIG; ++sig)
>>> @@ -181,6 +237,9 @@ __spawni_child (void *arguments)
>>>      {
>>>        struct __spawn_action *action = &file_actions->__actions[cnt];
>>> 
>>> +      if (use_pipe && !spawni_fa_handle_pipe (action, args->pipe))
>>> +        goto fail;
>>> +
>>>        switch (action->tag)
>>>          {
>>>          case spawn_do_close:
>>> @@ -233,6 +292,11 @@ __spawni_child (void *arguments)
>>>            break;
>>> 
>>>          case spawn_do_dup2:
>>> +          if (use_pipe && action->action.dup2_action.fd == 
>>> args->pipe[1])
>>> +        {
>>> +          errno = EBADF;
>>> +          goto fail;
>>> +        }
>>>            /* Austin Group issue #411 requires adddup2 action with 
>>> source
>>>           and destination being equal to remove close-on-exec flag.  
>>> */
>>>            if (action->action.dup2_action.fd
>>> @@ -264,8 +328,18 @@ __spawni_child (void *arguments)
>>>          case spawn_do_closefrom:
>>>            {
>>>          int lowfd = action->action.closefrom_action.from;
>>> -            int r = INLINE_SYSCALL_CALL (close_range, lowfd, ~0U, 
>>> 0);
>>> -        if (r != 0 && !__closefrom_fallback (lowfd, false))
>>> +        /* Skip the pipe descriptor if it is used.  No need to 
>>> handle
>>> +           it since it is created with O_CLOEXEC.  */
>>> +        if (use_pipe && args->pipe[1] >= lowfd)
>>> +          {
>>> +            if (args->pipe[1] == lowfd
>>> +             && !spawni_fa_closerange (lowfd + 1, ~0U))
>>> +              goto fail;
>>> +            else if (!spawni_fa_closerange (lowfd, args->pipe[1] - 
>>> 1)
>>> +                 || !spawni_fa_closerange (args->pipe[1] + 1, ~0U))
>> 
>> The "else" branch will also run if "args->pipe[1] == lowfd" and 
>> spawni_fa_closerange (lowfd + 1, ~0U) succeeds.
> 
> I was trying to be too clever here, I think the following is more 
> clear:
> 
>                 if (use_pipe && args->pipe[1] == lowfd)
>                   {
>                     if (!spawni_fa_closerange (lowfd + 1, ~0U))
>                       goto fail;
>                   }
>                 else if (use_pipe && args->pipe[1] > lowfd)
>                   {
>                     if (!spawni_fa_closerange (lowfd, args->pipe[1] - 
> 1)
>                         || !spawni_fa_closerange (args->pipe[1] + 1, 
> ~0U))
>                       goto fail;
>                   }
>                 else if (!spawni_fa_closerange (lowfd, ~0U))
>                   goto fail;
> 
Yeah, it should work. In a (theoretical?) case when args->pipe[1] == 
INT_MAX it might also make sense to avoid signed overflow (UB) on 
additions (e.g. by using "lowfd + 1u").

>>>            } break;
>>> 
>>> @@ -300,10 +374,112 @@ fail:
>>>       (EINTERNALBUG) describing that, use ECHILD.  Another option 
>>> would
>>>       be to set args->err to some negative sentinel and have the 
>>> parent
>>>       abort(), but that seems needlessly harsh.  */
>>> -  args->err = errno ? : ECHILD;
>>> +  int ret = errno ? : ECHILD;
>>> +  if (use_pipe)
>>> +    {
>>> +      while (__write_nocancel (args->pipe[1], &ret, sizeof (ret)) < 
>>> 0)
>>> +    if (errno == EPIPE || errno == EBADF)
>>> +      break;
>>> +    }
>>> +  else
>>> +    args->err = ret;
>>> +
>>>    _exit (SPAWN_ERROR);
>>>  }
>>> 
>>> +static pid_t
>>> +clone_call (struct posix_spawn_args *args, int flags, void *stack,
>>> +        size_t stack_size)
>>> +{
>>> +  struct clone_args clone_args =
>>> +    {
>>> +      .flags = flags,
>>> +      .exit_signal = SIGCHLD,
>>> +      .stack = (uintptr_t) stack,
>>> +      .stack_size = stack_size,
>>> +    };
>>> +  return __clone_internal (&clone_args, spawni_child, args);
>>> +}
>>> +
>>> +/* Spawn a new process using clone with CLONE_VM | CLONE_VFORK (to 
>>> optimize
>>> +   memory and overcommit) and return TRUE if the helper was created 
>>> or if the
>>> +   failure was not due resource exhaustion.  */
>>> +static bool
>>> +spawni_clone (struct posix_spawn_args *args, pid_t *new_pid, int 
>>> *ec,
>>> +          void *stack, size_t stack_size)
>>> +{
>>> +  /* The clone flags used will create a new child that will run in 
>>> the same
>>> +     memory space (CLONE_VM) and the execution of calling thread 
>>> will be
>>> +     suspend until the child calls execve or _exit.
>>> +
>>> +     Also since the calling thread execution will be suspend, there 
>>> is not
>>> +     need for CLONE_SETTLS.  Although parent and child share the 
>>> same TLS
>>> +     namespace, there will be no concurrent access for TLS variables 
>>> (errno
>>> +     for instance).  */
>>> +  *new_pid = clone_call (args, CLONE_VM | CLONE_VFORK, stack, 
>>> stack_size);
>>> +
>>> +  /* It needs to collect the case where the auxiliary process was 
>>> created
>>> +     but failed to execute the file (due either any preparation step 
>>> or
>>> +     for execve itself).  */
>>> +  if (*new_pid > 0)
>>> +    {
>>> +      /* Also, it handles the unlikely case where the auxiliary 
>>> process was
>>> +     terminated before calling execve as if it was successfully.  
>>> The
>>> +     args.err is set to 0 as default and changed to a positive value
>>> +     only in case of failure, so in case of premature termination
>>> +     due a signal args.err will remain zeroed and it will be up to
>>> +     caller to actually collect it.  */
>>> +      *ec = args->err;
>>> +      if (*ec > 0)
>>> +    /* There still an unlikely case where the child is cancelled 
>>> after
>>> +       setting args.err, due to a positive error value.  Also there 
>>> is
>>> +       possible pid reuse race (where the kernel allocated the same 
>>> pid
>>> +       to an unrelated process).  Unfortunately due synchronization
>>> +       issues where the kernel might not have the process collected
>>> +       the waitpid below can not use WNOHANG.  */
>>> +    __waitpid (*new_pid, NULL, 0);
>>> +    }
>>> +  else
>>> +    *ec = errno;
>>> +
>>> +  /* There is no much point in retrying with fork and exec if kernel 
>>> returns a
>>> +     failure due resource exhaustion.  */
>>> +  return *new_pid > 0 || (errno == ENOMEM || errno == EAGAIN);
>>> +}
>>> +
>>> +/* Fallback spawn case which does not use CLONE_VM.  Any preparation 
>>> step or
>>> +   execve failure is passed with a pipe, which requires additional 
>>> care by
>>> +   the helper stating process since it additional file descriptors 
>>> handle.  */
>>> +static void
>>> +spawni_fork (struct posix_spawn_args *args, pid_t *new_pid, int *ec,
>>> +         char *stack, size_t stack_size)
>>> +{
>>> +  if (__pipe2 (args->pipe, O_CLOEXEC) != 0)
>>> +    {
>>> +      *ec = errno;
>>> +      return;
>>> +    }
>>> +
>>> +  /* Do not trigger atfork handler nor any internal state reset 
>>> since the
>>> +     helper process will call execve.  */
>>> +  *new_pid = clone_call (args, CLONE_VFORK, stack, stack_size);
>>> +
>>> +  __close (args->pipe[1]);
>>> +
>>> +  if (*new_pid > 0)
>>> +    {
>>> +      if (__read (args->pipe[0], ec, sizeof *ec) != sizeof *ec)
>>> +    /* A successful execve will close the helper process pipe end.  
>>> */
>>> +    *ec = 0;
>>> +      else
>>> +    __waitpid (*new_pid, NULL, 0);
>>> +    }
>>> +  else
>>> +    *ec = errno;
>>> +
>>> +  __close (args->pipe[0]);
>>> +}
>>> +
>>>  /* Spawn a new process executing PATH with the attributes describes 
>>> in *ATTRP.
>>>     Before running the process perform the actions described in 
>>> FILE-ACTIONS. */
>>>  static int
>>> @@ -367,49 +543,16 @@ __spawnix (pid_t * pid, const char *file,
>>>    args.argc = argc;
>>>    args.envp = envp;
>>>    args.xflags = xflags;
>>> +  args.pipe[0] = args.pipe[1] = -1;
>>> 
>>>    __libc_signal_block_all (&args.oldmask);
>>> 
>>> -  /* The clone flags used will create a new child that will run in 
>>> the same
>>> -     memory space (CLONE_VM) and the execution of calling thread 
>>> will be
>>> -     suspend until the child calls execve or _exit.
>>> -
>>> -     Also since the calling thread execution will be suspend, there 
>>> is not
>>> -     need for CLONE_SETTLS.  Although parent and child share the 
>>> same TLS
>>> -     namespace, there will be no concurrent access for TLS variables 
>>> (errno
>>> -     for instance).  */
>>> -  struct clone_args clone_args =
>>> -    {
>>> -      .flags = CLONE_VM | CLONE_VFORK,
>>> -      .exit_signal = SIGCHLD,
>>> -      .stack = (uintptr_t) stack,
>>> -      .stack_size = stack_size,
>>> -    };
>>> -  new_pid = __clone_internal (&clone_args, __spawni_child, &args);
>>> -
>>> -  /* It needs to collect the case where the auxiliary process was 
>>> created
>>> -     but failed to execute the file (due either any preparation step 
>>> or
>>> -     for execve itself).  */
>>> -  if (new_pid > 0)
>>> -    {
>>> -      /* Also, it handles the unlikely case where the auxiliary 
>>> process was
>>> -     terminated before calling execve as if it was successfully.  
>>> The
>>> -     args.err is set to 0 as default and changed to a positive value
>>> -     only in case of failure, so in case of premature termination
>>> -     due a signal args.err will remain zeroed and it will be up to
>>> -     caller to actually collect it.  */
>>> -      ec = args.err;
>>> -      if (ec > 0)
>>> -    /* There still an unlikely case where the child is cancelled 
>>> after
>>> -       setting args.err, due to a positive error value.  Also there 
>>> is
>>> -       possible pid reuse race (where the kernel allocated the same 
>>> pid
>>> -       to an unrelated process).  Unfortunately due synchronization
>>> -       issues where the kernel might not have the process collected
>>> -       the waitpid below can not use WNOHANG.  */
>>> -    __waitpid (new_pid, NULL, 0);
>>> -    }
>>> -  else
>>> -    ec = errno;
>>> +  /* clone with CLONE_VM | CLONE_VFORK may fail for some namespace 
>>> restriction
>>> +     (for instance Linux does not allow processes in different time 
>>> namespaces
>>> +     to share address space) and in this case clone fails with 
>>> EINVAL.  Retry
>>> +     with fork and exec.  */
>>> +  if (!spawni_clone (&args, &new_pid, &ec, stack, stack_size))
>>> +    spawni_fork (&args, &new_pid, &ec, stack, stack_size);
>>> 
>>>    __munmap (stack, stack_size);