From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail.ispras.ru (mail.ispras.ru [83.149.199.84]) by sourceware.org (Postfix) with ESMTPS id 3628D3858C51 for ; Tue, 3 May 2022 22:50:01 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 3628D3858C51 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=ispras.ru Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=ispras.ru Received: from mail.ispras.ru (unknown [83.149.199.84]) by mail.ispras.ru (Postfix) with ESMTPSA id 3D1EE40D403D; Tue, 3 May 2022 22:49:55 +0000 (UTC) MIME-Version: 1.0 Date: Wed, 04 May 2022 01:49:55 +0300 From: Alexey Izbyshev To: Adhemerval Zanella Cc: libc-alpha@sourceware.org, Carlos O'Donell , Florian Weimer Subject: Re: [PATCH v3] linux: Add fallback for clone failure on posix_spawn (BZ#29115) In-Reply-To: References: <20220503170032.2531471-1-adhemerval.zanella@linaro.org> <66c881094d4f4d4ff1aa47b8d4f94230@ispras.ru> User-Agent: Roundcube Webmail/1.4.4 Message-ID: <2f2030c8b774aac2e09c6538fab780a1@ispras.ru> X-Sender: izbyshev@ispras.ru Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-7.0 required=5.0 tests=BAYES_00, BODY_8BITS, GIT_PATCH_0, KAM_DMARC_STATUS, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 03 May 2022 22:50:05 -0000 On 2022-05-03 20:50, Adhemerval Zanella wrote: > On 03/05/2022 14:36, Alexey Izbyshev wrote: >> On 2022-05-03 20:00, Adhemerval Zanella wrote: >>> Even though it might characterizeis as a kernel bug (incomplete >>> kernel >>> support for CLONE_VM used along with CLONE_VFORK), time namespace >>> support >>> is already presented in a some kernel releases (since 2020). >>> >>> Although I agree with Carlos that it results in unexpected surprise >>> behaviour, I also see that providing a reasonable fallback that does >>> not >>> add possible issues like race conditions (as some syscall fallbacks >>> in >>> the >>> past) provides a easier transition and make the interfaca to work on >>> multiple scenarios. >>> >>> The fix itself does adds some corner cases that need to be handled, >>> but >>> I still think it is maintanable and a general improvement. >>> >>> -- >>> >>> Linux clone with CLONE_VM may fail for some namespace restriction >>> (for >>> instance if kernel does not allow processes in different time >>> namespaces >>> to share the sameaddress space). >>> >>> In this case clone fails with EINVAL and posix_spawn can not spawn a >>> new >>> process.  However the same process can be spawned with fork and exec. >>> >>> The patch fixes by retrying the clone syscall with just CLONE_VFORK >>> if clone fails with a non transient failure (ENOMEM and EAGAIN still >>> returns failure to caller). >>> >>> Error on preparation phase or execve is returned by a pipe now that >>> there is no shared memory that ca be used.  It requires some extra >>> care >>> for some file operations: >>> >>>   * If the file operation would clobber the pipe file descriptor, the >>>     helper process dup the pipe onto an unoccupied file descriptor. >>> >>>   * dup2 file action that targets the pipe file descriptor returns >>>     EBADF. >>> >>>   * If closefrom argument overlaps the pipe file descriptor, it is >>>     splited in two calls: [lowdp, pipe - 1] and [pipe + 1, ~Ou]. >>> >>> Failure on prepare phase in helper process does not trigger the fork >>> and exec fallback. >>> >>> Checked on x86_64-linux-gnu. >>> --- >>> v3: Fixed closerange fallback,. >>> v2: Fixed closerange file actions, handle EPIPE, return all errors, >>>     handle if pipe fd is used in file actions. >>> --- >>>  include/unistd.h                             |   4 +- >>>  io/closefrom.c                               |   2 +- >>>  sysdeps/unix/sysv/linux/closefrom_fallback.c |   4 +- >>>  sysdeps/unix/sysv/linux/spawni.c             | 235 >>> +++++++++++++++---- >>>  4 files changed, 196 insertions(+), 49 deletions(-) >>> >>> diff --git a/include/unistd.h b/include/unistd.h >>> index af795a37c8..2643991a09 100644 >>> --- a/include/unistd.h >>> +++ b/include/unistd.h >>> @@ -167,7 +167,9 @@ static inline _Bool __closefrom_fallback (int >>> __lowfd, _Bool dirfd_fallback) >>>    return false; >>>  } >>>  #  else >>> -extern _Bool __closefrom_fallback (int __lowfd, _Bool) >>> attribute_hidden; >>> +extern _Bool __closefrom_fallback (unsigned int __from, unsigned int >>> __to, >>> +                   _Bool) >>> +     attribute_hidden; >>>  #  endif >>>  extern ssize_t __read (int __fd, void *__buf, size_t __nbytes); >>>  libc_hidden_proto (__read) >>> diff --git a/io/closefrom.c b/io/closefrom.c >>> index 48cac2e3d1..a07de06e9b 100644 >>> --- a/io/closefrom.c >>> +++ b/io/closefrom.c >>> @@ -30,7 +30,7 @@ __closefrom (int lowfd) >>>    if (r == 0) >>>      return ; >>> >>> -  if (!__closefrom_fallback (l, true)) >>> +  if (!__closefrom_fallback (l, ~0U, true)) >>>      __fortify_fail ("closefrom failed to close a file descriptor"); >>>  } >>>  weak_alias (__closefrom, closefrom) >>> diff --git a/sysdeps/unix/sysv/linux/closefrom_fallback.c >>> b/sysdeps/unix/sysv/linux/closefrom_fallback.c >>> index a9dd0c46b2..ad301e1910 100644 >>> --- a/sysdeps/unix/sysv/linux/closefrom_fallback.c >>> +++ b/sysdeps/unix/sysv/linux/closefrom_fallback.c >>> @@ -28,7 +28,7 @@ >>>     /proc/self/fd open will trigger a fallback that tries to close a >>> file >>>     descriptor before proceed.  */ >>>  _Bool >>> -__closefrom_fallback (int from, _Bool dirfd_fallback) >>> +__closefrom_fallback (unsigned int from, unsigned int to, _Bool >>> dirfd_fallback) >>>  { >>>    int dirfd = __open_nocancel (FD_TO_FILENAME_PREFIX, O_RDONLY | >>> O_DIRECTORY, >>>                                 0); >>> @@ -84,6 +84,8 @@ __closefrom_fallback (int from, _Bool >>> dirfd_fallback) >>> >>>            if (fd == dirfd || fd < from) >>>              continue; >>> +      if (fd > to) >>> +        break; >>> >> __closefrom_fallback() currently ignores "to" if it can't open >> /proc/self/fd and dirfd_fallback is true. This path is not used by >> posix_spawn(), but it's a potential future bug. > > Indeed, I think iterate over [from, to] instead of [from, INT_MAX] > should fix it. > Yes, with some care for the fact that "to" is unsigned. > >> >>>            /* We ignore close errors because EBADF, EINTR, and EIO >>> means the >>>               descriptor has been released.  */ >>> diff --git a/sysdeps/unix/sysv/linux/spawni.c >>> b/sysdeps/unix/sysv/linux/spawni.c >>> index d6f5ca89cd..0481791013 100644 >>> --- a/sysdeps/unix/sysv/linux/spawni.c >>> +++ b/sysdeps/unix/sysv/linux/spawni.c >>> @@ -44,7 +44,11 @@ >>>     third issue is done by a stack allocation in parent, and by using >>> a >>>     field in struct spawn_args where the child can write an error >>>     code. CLONE_VFORK ensures that the parent does not run until the >>> -   child has either exec'ed successfully or exited.  */ >>> +   child has either exec'ed successfully or exited. >>> + >>> +   If the clone with CLONE_VM and CLONE_VFORK fails (due any kernel >>> limitation >>> +   such as time namespace), only CLONE_VFORK is used instead and the >>> +   preparation and execve failures are communicated with a pipe.  */ >>> >>> >>>  /* The Unix standard contains a long explanation of the way to >>> signal >>> @@ -67,6 +71,7 @@ struct posix_spawn_args >>>    char *const *envp; >>>    int xflags; >>>    int err; >>> +  int pipe[2]; >>>  }; >>> >>>  /* Older version requires that shell script without shebang >>> definition >>> @@ -94,15 +99,63 @@ maybe_script_execute (struct posix_spawn_args >>> *args) >>>      } >>>  } >>> >>> +/* If the file operation would clobber the pipe fd used to >>> communicate with >>> +   parent, dup the pipe onto an unoccupied file descriptor.  */ >>> +static inline bool >>> +spawni_fa_handle_pipe (const struct __spawn_action *fa, int p[]) >>> +{ >>> +  int fd; >>> + >>> +  switch (fa->tag) >>> +    { >>> +    case spawn_do_close: >>> +      fd = fa->action.close_action.fd; >>> +      break; >>> +    case spawn_do_open: >>> +      fd = fa->action.open_action.fd; >>> +      break; >>> +    case spawn_do_dup2: >>> +      fd = fa->action.dup2_action.newfd; >>> +      break; >>> +    case spawn_do_fchdir: >>> +      fd = fa->action.fchdir_action.fd; >>> +    default: >>> +      return true; >>> +    } >>> + >>> +  if (fd == p[1]) >>> +    { >>> +      int r = __fcntl (p[1], F_DUPFD_CLOEXEC); >>> +      if (r < 0) >>> +    return false; >>> +      __close_nocancel (p[1]); >>> +      p[1] = r; >>> +    } >>> + >>> +  return true; >>> +} >>> + >>> +static inline bool >>> +spawni_fa_closerange (int from, int to) >>> +{ >>> +#if 0 >>> +  int r = INLINE_SYSCALL_CALL (close_range, from, to, 0); >>> +  return r == 0 || __closefrom_fallback (from, to, false); >>> +#else >>> +  return __closefrom_fallback (from, to, false); >>> +#endif >>> +} >>> + >>>  /* Function used in the clone call to setup the signals mask, >>> posix_spawn >>>     attributes, and file actions.  It run on its own stack (provided >>> by the >>>     posix_spawn call).  */ >>> -static int >>> -__spawni_child (void *arguments) >>> +static _Noreturn int >>> +spawni_child (void *arguments) >>>  { >>>    struct posix_spawn_args *args = arguments; >>>    const posix_spawnattr_t *restrict attr = args->attr; >>>    const posix_spawn_file_actions_t *file_actions = args->fa; >>> +  bool use_pipe = args->pipe[0] != -1; >>> >>>    /* The child must ensure that no signal handler are enabled >>> because it shared >>>       memory with parent, so the signal disposition must be either >>> SIG_DFL or >>> @@ -113,6 +166,9 @@ __spawni_child (void *arguments) >>>    struct sigaction sa; >>>    memset (&sa, '\0', sizeof (sa)); >>> >>> +  if (use_pipe) >>> +    __close (args->pipe[0]); >>> + >>>    sigset_t hset; >>>    __sigprocmask (SIG_BLOCK, 0, &hset); >>>    for (int sig = 1; sig < _NSIG; ++sig) >>> @@ -181,6 +237,9 @@ __spawni_child (void *arguments) >>>      { >>>        struct __spawn_action *action = &file_actions->__actions[cnt]; >>> >>> +      if (use_pipe && !spawni_fa_handle_pipe (action, args->pipe)) >>> +        goto fail; >>> + >>>        switch (action->tag) >>>          { >>>          case spawn_do_close: >>> @@ -233,6 +292,11 @@ __spawni_child (void *arguments) >>>            break; >>> >>>          case spawn_do_dup2: >>> +          if (use_pipe && action->action.dup2_action.fd == >>> args->pipe[1]) >>> +        { >>> +          errno = EBADF; >>> +          goto fail; >>> +        } >>>            /* Austin Group issue #411 requires adddup2 action with >>> source >>>           and destination being equal to remove close-on-exec flag.  >>> */ >>>            if (action->action.dup2_action.fd >>> @@ -264,8 +328,18 @@ __spawni_child (void *arguments) >>>          case spawn_do_closefrom: >>>            { >>>          int lowfd = action->action.closefrom_action.from; >>> -            int r = INLINE_SYSCALL_CALL (close_range, lowfd, ~0U, >>> 0); >>> -        if (r != 0 && !__closefrom_fallback (lowfd, false)) >>> +        /* Skip the pipe descriptor if it is used.  No need to >>> handle >>> +           it since it is created with O_CLOEXEC.  */ >>> +        if (use_pipe && args->pipe[1] >= lowfd) >>> +          { >>> +            if (args->pipe[1] == lowfd >>> +             && !spawni_fa_closerange (lowfd + 1, ~0U)) >>> +              goto fail; >>> +            else if (!spawni_fa_closerange (lowfd, args->pipe[1] - >>> 1) >>> +                 || !spawni_fa_closerange (args->pipe[1] + 1, ~0U)) >> >> The "else" branch will also run if "args->pipe[1] == lowfd" and >> spawni_fa_closerange (lowfd + 1, ~0U) succeeds. > > I was trying to be too clever here, I think the following is more > clear: > > if (use_pipe && args->pipe[1] == lowfd) > { > if (!spawni_fa_closerange (lowfd + 1, ~0U)) > goto fail; > } > else if (use_pipe && args->pipe[1] > lowfd) > { > if (!spawni_fa_closerange (lowfd, args->pipe[1] - > 1) > || !spawni_fa_closerange (args->pipe[1] + 1, > ~0U)) > goto fail; > } > else if (!spawni_fa_closerange (lowfd, ~0U)) > goto fail; > Yeah, it should work. In a (theoretical?) case when args->pipe[1] == INT_MAX it might also make sense to avoid signed overflow (UB) on additions (e.g. by using "lowfd + 1u"). >>>            } break; >>> >>> @@ -300,10 +374,112 @@ fail: >>>       (EINTERNALBUG) describing that, use ECHILD.  Another option >>> would >>>       be to set args->err to some negative sentinel and have the >>> parent >>>       abort(), but that seems needlessly harsh.  */ >>> -  args->err = errno ? : ECHILD; >>> +  int ret = errno ? : ECHILD; >>> +  if (use_pipe) >>> +    { >>> +      while (__write_nocancel (args->pipe[1], &ret, sizeof (ret)) < >>> 0) >>> +    if (errno == EPIPE || errno == EBADF) >>> +      break; >>> +    } >>> +  else >>> +    args->err = ret; >>> + >>>    _exit (SPAWN_ERROR); >>>  } >>> >>> +static pid_t >>> +clone_call (struct posix_spawn_args *args, int flags, void *stack, >>> +        size_t stack_size) >>> +{ >>> +  struct clone_args clone_args = >>> +    { >>> +      .flags = flags, >>> +      .exit_signal = SIGCHLD, >>> +      .stack = (uintptr_t) stack, >>> +      .stack_size = stack_size, >>> +    }; >>> +  return __clone_internal (&clone_args, spawni_child, args); >>> +} >>> + >>> +/* Spawn a new process using clone with CLONE_VM | CLONE_VFORK (to >>> optimize >>> +   memory and overcommit) and return TRUE if the helper was created >>> or if the >>> +   failure was not due resource exhaustion.  */ >>> +static bool >>> +spawni_clone (struct posix_spawn_args *args, pid_t *new_pid, int >>> *ec, >>> +          void *stack, size_t stack_size) >>> +{ >>> +  /* The clone flags used will create a new child that will run in >>> the same >>> +     memory space (CLONE_VM) and the execution of calling thread >>> will be >>> +     suspend until the child calls execve or _exit. >>> + >>> +     Also since the calling thread execution will be suspend, there >>> is not >>> +     need for CLONE_SETTLS.  Although parent and child share the >>> same TLS >>> +     namespace, there will be no concurrent access for TLS variables >>> (errno >>> +     for instance).  */ >>> +  *new_pid = clone_call (args, CLONE_VM | CLONE_VFORK, stack, >>> stack_size); >>> + >>> +  /* It needs to collect the case where the auxiliary process was >>> created >>> +     but failed to execute the file (due either any preparation step >>> or >>> +     for execve itself).  */ >>> +  if (*new_pid > 0) >>> +    { >>> +      /* Also, it handles the unlikely case where the auxiliary >>> process was >>> +     terminated before calling execve as if it was successfully.  >>> The >>> +     args.err is set to 0 as default and changed to a positive value >>> +     only in case of failure, so in case of premature termination >>> +     due a signal args.err will remain zeroed and it will be up to >>> +     caller to actually collect it.  */ >>> +      *ec = args->err; >>> +      if (*ec > 0) >>> +    /* There still an unlikely case where the child is cancelled >>> after >>> +       setting args.err, due to a positive error value.  Also there >>> is >>> +       possible pid reuse race (where the kernel allocated the same >>> pid >>> +       to an unrelated process).  Unfortunately due synchronization >>> +       issues where the kernel might not have the process collected >>> +       the waitpid below can not use WNOHANG.  */ >>> +    __waitpid (*new_pid, NULL, 0); >>> +    } >>> +  else >>> +    *ec = errno; >>> + >>> +  /* There is no much point in retrying with fork and exec if kernel >>> returns a >>> +     failure due resource exhaustion.  */ >>> +  return *new_pid > 0 || (errno == ENOMEM || errno == EAGAIN); >>> +} >>> + >>> +/* Fallback spawn case which does not use CLONE_VM.  Any preparation >>> step or >>> +   execve failure is passed with a pipe, which requires additional >>> care by >>> +   the helper stating process since it additional file descriptors >>> handle.  */ >>> +static void >>> +spawni_fork (struct posix_spawn_args *args, pid_t *new_pid, int *ec, >>> +         char *stack, size_t stack_size) >>> +{ >>> +  if (__pipe2 (args->pipe, O_CLOEXEC) != 0) >>> +    { >>> +      *ec = errno; >>> +      return; >>> +    } >>> + >>> +  /* Do not trigger atfork handler nor any internal state reset >>> since the >>> +     helper process will call execve.  */ >>> +  *new_pid = clone_call (args, CLONE_VFORK, stack, stack_size); >>> + >>> +  __close (args->pipe[1]); >>> + >>> +  if (*new_pid > 0) >>> +    { >>> +      if (__read (args->pipe[0], ec, sizeof *ec) != sizeof *ec) >>> +    /* A successful execve will close the helper process pipe end.  >>> */ >>> +    *ec = 0; >>> +      else >>> +    __waitpid (*new_pid, NULL, 0); >>> +    } >>> +  else >>> +    *ec = errno; >>> + >>> +  __close (args->pipe[0]); >>> +} >>> + >>>  /* Spawn a new process executing PATH with the attributes describes >>> in *ATTRP. >>>     Before running the process perform the actions described in >>> FILE-ACTIONS. */ >>>  static int >>> @@ -367,49 +543,16 @@ __spawnix (pid_t * pid, const char *file, >>>    args.argc = argc; >>>    args.envp = envp; >>>    args.xflags = xflags; >>> +  args.pipe[0] = args.pipe[1] = -1; >>> >>>    __libc_signal_block_all (&args.oldmask); >>> >>> -  /* The clone flags used will create a new child that will run in >>> the same >>> -     memory space (CLONE_VM) and the execution of calling thread >>> will be >>> -     suspend until the child calls execve or _exit. >>> - >>> -     Also since the calling thread execution will be suspend, there >>> is not >>> -     need for CLONE_SETTLS.  Although parent and child share the >>> same TLS >>> -     namespace, there will be no concurrent access for TLS variables >>> (errno >>> -     for instance).  */ >>> -  struct clone_args clone_args = >>> -    { >>> -      .flags = CLONE_VM | CLONE_VFORK, >>> -      .exit_signal = SIGCHLD, >>> -      .stack = (uintptr_t) stack, >>> -      .stack_size = stack_size, >>> -    }; >>> -  new_pid = __clone_internal (&clone_args, __spawni_child, &args); >>> - >>> -  /* It needs to collect the case where the auxiliary process was >>> created >>> -     but failed to execute the file (due either any preparation step >>> or >>> -     for execve itself).  */ >>> -  if (new_pid > 0) >>> -    { >>> -      /* Also, it handles the unlikely case where the auxiliary >>> process was >>> -     terminated before calling execve as if it was successfully.  >>> The >>> -     args.err is set to 0 as default and changed to a positive value >>> -     only in case of failure, so in case of premature termination >>> -     due a signal args.err will remain zeroed and it will be up to >>> -     caller to actually collect it.  */ >>> -      ec = args.err; >>> -      if (ec > 0) >>> -    /* There still an unlikely case where the child is cancelled >>> after >>> -       setting args.err, due to a positive error value.  Also there >>> is >>> -       possible pid reuse race (where the kernel allocated the same >>> pid >>> -       to an unrelated process).  Unfortunately due synchronization >>> -       issues where the kernel might not have the process collected >>> -       the waitpid below can not use WNOHANG.  */ >>> -    __waitpid (new_pid, NULL, 0); >>> -    } >>> -  else >>> -    ec = errno; >>> +  /* clone with CLONE_VM | CLONE_VFORK may fail for some namespace >>> restriction >>> +     (for instance Linux does not allow processes in different time >>> namespaces >>> +     to share address space) and in this case clone fails with >>> EINVAL.  Retry >>> +     with fork and exec.  */ >>> +  if (!spawni_clone (&args, &new_pid, &ec, stack, stack_size)) >>> +    spawni_fork (&args, &new_pid, &ec, stack, stack_size); >>> >>>    __munmap (stack, stack_size);