public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
* [PATCH v4 0/3] Linux: Fix posix_spawn when user with time namespaces
@ 2022-05-10 19:11 Adhemerval Zanella
  2022-05-10 19:11 ` [PATCH v4 1/3] linux: Add CLONE_NEWTIME from Linux 5.6 to bits/sched.h Adhemerval Zanella
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Adhemerval Zanella @ 2022-05-10 19:11 UTC (permalink / raw)
  To: libc-alpha, Alexey Izbyshev, Carlos O'Donell, Florian Weimer

The patchset adds some support to tests the fallback code to
use only use CLONE_VFORK.  It uses unshare directly because
it simpler than add container support.

Adhemerval Zanella (3):
  linux: Add CLONE_NEWTIME from Linux 5.6 to bits/sched.h
  support: Add support_enter_time_namespace
  linux: Add fallback for clone failure on posix_spawn (BZ #29115)

 include/unistd.h                             |   4 +-
 io/closefrom.c                               |   2 +-
 posix/Makefile                               |  16 +-
 posix/tst-spawn-chdir-timens.c               |   2 +
 posix/tst-spawn-chdir.c                      |   8 +
 posix/tst-spawn-timens.c                     |   2 +
 posix/tst-spawn.c                            |   6 +
 posix/tst-spawn2-timens.c                    |   2 +
 posix/tst-spawn2.c                           |   8 +
 posix/tst-spawn4-timens.c                    |   2 +
 posix/tst-spawn4.c                           |  10 +-
 posix/tst-spawn5-timens.c                    |   2 +
 posix/tst-spawn5.c                           |  10 +-
 posix/tst-spawn6-timens.c                    |   2 +
 posix/tst-spawn6.c                           |  12 +-
 support/Makefile                             |   1 +
 support/namespace.h                          |   5 +
 support/support_enter_time_namespace.c       |  34 +++
 sysdeps/unix/sysv/linux/bits/sched.h         |   4 +
 sysdeps/unix/sysv/linux/closefrom_fallback.c |   6 +-
 sysdeps/unix/sysv/linux/spawni.c             | 233 +++++++++++++++----
 21 files changed, 315 insertions(+), 56 deletions(-)
 create mode 100644 posix/tst-spawn-chdir-timens.c
 create mode 100644 posix/tst-spawn-timens.c
 create mode 100644 posix/tst-spawn2-timens.c
 create mode 100644 posix/tst-spawn4-timens.c
 create mode 100644 posix/tst-spawn5-timens.c
 create mode 100644 posix/tst-spawn6-timens.c
 create mode 100644 support/support_enter_time_namespace.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v4 1/3] linux: Add CLONE_NEWTIME from Linux 5.6 to bits/sched.h
  2022-05-10 19:11 [PATCH v4 0/3] Linux: Fix posix_spawn when user with time namespaces Adhemerval Zanella
@ 2022-05-10 19:11 ` Adhemerval Zanella
  2022-05-23 13:45   ` Adhemerval Zanella
  2022-05-10 19:11 ` [PATCH v4 2/3] support: Add support_enter_time_namespace Adhemerval Zanella
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Adhemerval Zanella @ 2022-05-10 19:11 UTC (permalink / raw)
  To: libc-alpha, Alexey Izbyshev, Carlos O'Donell, Florian Weimer

It was added in commit 769071ac9f20b6a447410c7eaa55d1a5233ef40c.
---
 sysdeps/unix/sysv/linux/bits/sched.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/sysdeps/unix/sysv/linux/bits/sched.h b/sysdeps/unix/sysv/linux/bits/sched.h
index f13c569863..d79359eaeb 100644
--- a/sysdeps/unix/sysv/linux/bits/sched.h
+++ b/sysdeps/unix/sysv/linux/bits/sched.h
@@ -71,6 +71,10 @@
 # define CLONE_NEWPID	0x20000000	/* New pid namespace.  */
 # define CLONE_NEWNET	0x40000000	/* New network namespace.  */
 # define CLONE_IO	0x80000000	/* Clone I/O context.  */
+
+/* cloning flags intersect with CSIGNAL so can be used only with unshare and
+   clone3 syscalls.  */
+#define CLONE_NEWTIME	0x00000080      /* New time namespace */
 #endif
 
 #include <bits/types/struct_sched_param.h>
-- 
2.34.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v4 2/3] support: Add support_enter_time_namespace
  2022-05-10 19:11 [PATCH v4 0/3] Linux: Fix posix_spawn when user with time namespaces Adhemerval Zanella
  2022-05-10 19:11 ` [PATCH v4 1/3] linux: Add CLONE_NEWTIME from Linux 5.6 to bits/sched.h Adhemerval Zanella
@ 2022-05-10 19:11 ` Adhemerval Zanella
  2022-05-10 19:11 ` [PATCH v4 3/3] linux: Add fallback for clone failure on posix_spawn (BZ #29115) Adhemerval Zanella
  2022-05-10 19:18 ` [PATCH v4 0/3] Linux: Fix posix_spawn when user with time namespaces Florian Weimer
  3 siblings, 0 replies; 15+ messages in thread
From: Adhemerval Zanella @ 2022-05-10 19:11 UTC (permalink / raw)
  To: libc-alpha, Alexey Izbyshev, Carlos O'Donell, Florian Weimer

Enter a time namespace, where the new namespace isolates clock
values.  It requires either a root-like privileges (done with
support_become_root) or a previous user namespace (CLONE_NEWUSER).

A time namespace is similar to a pid namespace in the way how it is
created: unshare(CLONE_NEWTIME) system call creates a new time
namespace, but doesn't set it to the current process. Then all
children of the process will be born in the new time namespace.

It will be used on posix_spawn tests to exercise the BZ #29115
fix, where clone (CLONE_VFORK | CLONE_VM) fails if the process
enter a time namespace.
---
 support/Makefile                       |  1 +
 support/namespace.h                    |  5 ++++
 support/support_enter_time_namespace.c | 34 ++++++++++++++++++++++++++
 3 files changed, 40 insertions(+)
 create mode 100644 support/support_enter_time_namespace.c

diff --git a/support/Makefile b/support/Makefile
index 9b50eac117..e4a1402c36 100644
--- a/support/Makefile
+++ b/support/Makefile
@@ -56,6 +56,7 @@ libsupport-routines = \
   support_descriptors \
   support_enter_mount_namespace \
   support_enter_network_namespace \
+  support_enter_time_namespace \
   support_format_address_family \
   support_format_addrinfo \
   support_format_dns_packet \
diff --git a/support/namespace.h b/support/namespace.h
index 23bad6403b..338000547c 100644
--- a/support/namespace.h
+++ b/support/namespace.h
@@ -56,6 +56,11 @@ bool support_enter_network_namespace (void);
    not affect the host system afterwards.  */
 bool support_enter_mount_namespace (void);
 
+/* Enter a time namespace, where the new namespace isolates clock
+   values.  It requires either a root-like privileges (done with
+   support_become_root) or a previous user namespace (CLONE_NEWUSER).  */
+bool support_enter_time_namespace (void);
+
 /* Return true if support_enter_network_namespace managed to enter a
    UTS namespace.  */
 bool support_in_uts_namespace (void);
diff --git a/support/support_enter_time_namespace.c b/support/support_enter_time_namespace.c
new file mode 100644
index 0000000000..a18caa878a
--- /dev/null
+++ b/support/support_enter_time_namespace.c
@@ -0,0 +1,34 @@
+/* Enter a time namespace.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <support/namespace.h>
+
+#include <sched.h>
+#include <stdio.h>
+
+bool
+support_enter_time_namespace (void)
+{
+#ifdef CLONE_NEWTIME
+  if (unshare (CLONE_NEWTIME) == 0)
+    return true;
+  else
+    printf ("warning: unshare (CLONE_NEWTIME) failed: %m\n");
+#endif /* CLONE_NEWNS */
+  return false;
+}
-- 
2.34.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v4 3/3] linux: Add fallback for clone failure on posix_spawn (BZ #29115)
  2022-05-10 19:11 [PATCH v4 0/3] Linux: Fix posix_spawn when user with time namespaces Adhemerval Zanella
  2022-05-10 19:11 ` [PATCH v4 1/3] linux: Add CLONE_NEWTIME from Linux 5.6 to bits/sched.h Adhemerval Zanella
  2022-05-10 19:11 ` [PATCH v4 2/3] support: Add support_enter_time_namespace Adhemerval Zanella
@ 2022-05-10 19:11 ` Adhemerval Zanella
  2022-05-10 19:18 ` [PATCH v4 0/3] Linux: Fix posix_spawn when user with time namespaces Florian Weimer
  3 siblings, 0 replies; 15+ messages in thread
From: Adhemerval Zanella @ 2022-05-10 19:11 UTC (permalink / raw)
  To: libc-alpha, Alexey Izbyshev, Carlos O'Donell, Florian Weimer

Linux clone with CLONE_VM may fail for some namespace restriction (for
instance if kernel does not allow processes in different time namespaces
to share the sameaddress space).

In this case clone fails with EINVAL and posix_spawn can not spawn a new
process.  However the same process can be spawned with fork and exec.

The patch fixes by retrying the clone syscall with just CLONE_VFORK
if clone fails with a non transient failure (ENOMEM and EAGAIN still
returns failure to caller).

Error on preparation phase or execve is returned by a pipe now that
there is no shared memory that ca be used.  It requires some extra care
for some file operations:

  * If the file operation would clobber the pipe file descriptor, the
    helper process dup the pipe onto an unoccupied file descriptor.

  * dup2 file action that targets the pipe file descriptor returns
    EBADF.

  * If closefrom argument overlaps the pipe file descriptor, it is
    splited in two calls: [lowdp, pipe - 1] and [pipe + 1, ~Ou].

Failure on prepare phase in helper process does not trigger the fork
and exec fallback.

Some spawn tests were adapted to run inside a created time namespace.

Checked on x86_64-linux-gnu.
---
 include/unistd.h                             |   4 +-
 io/closefrom.c                               |   2 +-
 posix/Makefile                               |  16 +-
 posix/tst-spawn-chdir-timens.c               |   2 +
 posix/tst-spawn-chdir.c                      |   8 +
 posix/tst-spawn-timens.c                     |   2 +
 posix/tst-spawn.c                            |   6 +
 posix/tst-spawn2-timens.c                    |   2 +
 posix/tst-spawn2.c                           |   8 +
 posix/tst-spawn4-timens.c                    |   2 +
 posix/tst-spawn4.c                           |  10 +-
 posix/tst-spawn5-timens.c                    |   2 +
 posix/tst-spawn5.c                           |  10 +-
 posix/tst-spawn6-timens.c                    |   2 +
 posix/tst-spawn6.c                           |  12 +-
 sysdeps/unix/sysv/linux/closefrom_fallback.c |   6 +-
 sysdeps/unix/sysv/linux/spawni.c             | 233 +++++++++++++++----
 17 files changed, 271 insertions(+), 56 deletions(-)
 create mode 100644 posix/tst-spawn-chdir-timens.c
 create mode 100644 posix/tst-spawn-timens.c
 create mode 100644 posix/tst-spawn2-timens.c
 create mode 100644 posix/tst-spawn4-timens.c
 create mode 100644 posix/tst-spawn5-timens.c
 create mode 100644 posix/tst-spawn6-timens.c

diff --git a/include/unistd.h b/include/unistd.h
index af795a37c8..2643991a09 100644
--- a/include/unistd.h
+++ b/include/unistd.h
@@ -167,7 +167,9 @@ static inline _Bool __closefrom_fallback (int __lowfd, _Bool dirfd_fallback)
   return false;
 }
 #  else
-extern _Bool __closefrom_fallback (int __lowfd, _Bool) attribute_hidden;
+extern _Bool __closefrom_fallback (unsigned int __from, unsigned int __to,
+				   _Bool)
+     attribute_hidden;
 #  endif
 extern ssize_t __read (int __fd, void *__buf, size_t __nbytes);
 libc_hidden_proto (__read)
diff --git a/io/closefrom.c b/io/closefrom.c
index 48cac2e3d1..a07de06e9b 100644
--- a/io/closefrom.c
+++ b/io/closefrom.c
@@ -30,7 +30,7 @@ __closefrom (int lowfd)
   if (r == 0)
     return ;
 
-  if (!__closefrom_fallback (l, true))
+  if (!__closefrom_fallback (l, ~0U, true))
     __fortify_fail ("closefrom failed to close a file descriptor");
 }
 weak_alias (__closefrom, closefrom)
diff --git a/posix/Makefile b/posix/Makefile
index 9b30b53a7c..3597aa354b 100644
--- a/posix/Makefile
+++ b/posix/Makefile
@@ -109,7 +109,9 @@ tests		:= test-errno tstgetopt testfnm runtests runptests \
 		   tst-glob-tilde test-ssize-max tst-spawn4 bug-regex37 \
 		   bug-regex38 tst-regcomp-truncated tst-spawn-chdir \
 		   tst-wordexp-nocmd tst-execveat tst-spawn5 \
-		   tst-sched_getaffinity tst-spawn6
+		   tst-sched_getaffinity tst-spawn6 \
+		   tst-spawn-chdir-timens tst-spawn4-timens tst-spawn5-timens \
+		   tst-spawn6-timens
 
 # Test for the glob symbol version that was replaced in glibc 2.27.
 ifeq ($(have-GLIBC_2.26)$(build-shared),yesyes)
@@ -130,7 +132,15 @@ xtests		:= tst-getaddrinfo4 tst-getaddrinfo5 tst-sched_rr_get_interval
 xtests-time64	:= tst-sched_rr_get_interval-time64
 ifeq (yes,$(build-shared))
 test-srcs	:= globtest
-tests           += wordexp-test tst-exec tst-spawn tst-spawn2 tst-spawn3
+tests           += \
+  tst-exec \
+  tst-spawn \
+  tst-spawn-timens \
+  tst-spawn2 \
+  tst-spawn2-timens \
+  tst-spawn3 \
+  wordexp-test \
+  # tests
 endif
 ifeq (yesyes,$(build-shared)$(have-thread-library))
 tests		+= tst-getopt-cancel tst-_Fork
@@ -290,7 +300,9 @@ tst-execvpe5-ARGS = -- $(host-test-program-cmd)
 tst-spawn-ARGS = -- $(host-test-program-cmd)
 tst-spawn-static-ARGS = $(tst-spawn-ARGS)
 tst-spawn5-ARGS = -- $(host-test-program-cmd)
+tst-spawn5-timens-ARGS = -- $(host-test-program-cmd)
 tst-spawn6-ARGS = -- $(host-test-program-cmd)
+tst-spawn6-timens-ARGS = -- $(host-test-program-cmd)
 tst-dir-ARGS = `pwd` `cd $(common-objdir)/$(subdir); pwd` `cd $(common-objdir); pwd` $(objpfx)tst-dir
 tst-chmod-ARGS = $(objdir)
 tst-vfork3-ARGS = --test-dir=$(objpfx)
diff --git a/posix/tst-spawn-chdir-timens.c b/posix/tst-spawn-chdir-timens.c
new file mode 100644
index 0000000000..a659e3201e
--- /dev/null
+++ b/posix/tst-spawn-chdir-timens.c
@@ -0,0 +1,2 @@
+#define CHECK_TIMENAMESPACE
+#include "tst-spawn-chdir.c"
diff --git a/posix/tst-spawn-chdir.c b/posix/tst-spawn-chdir.c
index 18e90c2957..ecb0634816 100644
--- a/posix/tst-spawn-chdir.c
+++ b/posix/tst-spawn-chdir.c
@@ -29,6 +29,7 @@
 #include <support/test-driver.h>
 #include <support/xstdio.h>
 #include <support/xunistd.h>
+#include <support/namespace.h>
 #include <unistd.h>
 
 /* Reads the file at PATH, which must consist of exactly one line.
@@ -66,6 +67,7 @@ get_pwd_program (void)
   FAIL_EXIT1 ("cannot find pwd program");
 }
 
+
 /* Adds chdir operations to ACTIONS, using PATH.  If DO_FCHDIR, use
    the open function and TMPFD to emulate chdir using fchdir.  */
 static void
@@ -87,6 +89,12 @@ add_chdir (posix_spawn_file_actions_t *actions, const char *path,
 static int
 do_test (void)
 {
+#ifdef CHECK_TIMENAMESPACE
+  support_become_root();
+  if (!support_enter_time_namespace ())
+    return EXIT_UNSUPPORTED;
+#endif
+
   /* Directory for temporary file data.  Each subtest uses a numeric
      subdirectory.  */
   char *directory = support_create_temp_directory ("tst-spawn-chdir-");
diff --git a/posix/tst-spawn-timens.c b/posix/tst-spawn-timens.c
new file mode 100644
index 0000000000..a659e3201e
--- /dev/null
+++ b/posix/tst-spawn-timens.c
@@ -0,0 +1,2 @@
+#define CHECK_TIMENAMESPACE
+#include "tst-spawn-chdir.c"
diff --git a/posix/tst-spawn.c b/posix/tst-spawn.c
index e378c72f1b..2007384a90 100644
--- a/posix/tst-spawn.c
+++ b/posix/tst-spawn.c
@@ -178,6 +178,12 @@ do_test (int argc, char *argv[])
     return handle_restart (argv[1], argv[2], argv[3], argv[4], argv[5],
 			   argv[6]);
 
+#ifdef CHECK_TIMENAMESPACE
+  support_become_root();
+  if (!support_enter_time_namespace ())
+    return EXIT_UNSUPPORTED;
+#endif
+
   /* Prepare the test.  We are creating four files: two which file descriptor
      will be marked with FD_CLOEXEC, another which is not.  */
 
diff --git a/posix/tst-spawn2-timens.c b/posix/tst-spawn2-timens.c
new file mode 100644
index 0000000000..612531fc86
--- /dev/null
+++ b/posix/tst-spawn2-timens.c
@@ -0,0 +1,2 @@
+#define CHECK_TIMENAMESPACE
+#include "tst-spawn2.c"
diff --git a/posix/tst-spawn2.c b/posix/tst-spawn2.c
index 45ff17f9a7..16fcbe4295 100644
--- a/posix/tst-spawn2.c
+++ b/posix/tst-spawn2.c
@@ -26,10 +26,18 @@
 #include <stdio.h>
 
 #include <support/check.h>
+#include <support/test-driver.h>
+#include <support/namespace.h>
 
 int
 do_test (void)
 {
+#ifdef CHECK_TIMENAMESPACE
+  support_become_root();
+  if (!support_enter_time_namespace ())
+    return EXIT_UNSUPPORTED;
+#endif
+
   /* Check if posix_spawn correctly returns an error and an invalid pid
      by trying to spawn an invalid binary.  */
 
diff --git a/posix/tst-spawn4-timens.c b/posix/tst-spawn4-timens.c
new file mode 100644
index 0000000000..3f4e8ea089
--- /dev/null
+++ b/posix/tst-spawn4-timens.c
@@ -0,0 +1,2 @@
+#define CHECK_TIMENAMESPACE
+#include "tst-spawn4.c"
diff --git a/posix/tst-spawn4.c b/posix/tst-spawn4.c
index 9eac2d1080..29b7a6215d 100644
--- a/posix/tst-spawn4.c
+++ b/posix/tst-spawn4.c
@@ -21,13 +21,21 @@
 #include <unistd.h>
 #include <sys/stat.h>
 
-#include <support/xunistd.h>
 #include <support/check.h>
+#include <support/namespace.h>
 #include <support/temp_file.h>
+#include <support/test-driver.h>
+#include <support/xunistd.h>
 
 static int
 do_test (void)
 {
+#ifdef CHECK_TIMENAMESPACE
+  support_become_root();
+  if (!support_enter_time_namespace ())
+    return EXIT_UNSUPPORTED;
+#endif
+
   char *scriptname;
   int fd = create_temp_file ("tst-spawn4.", &scriptname);
   TEST_VERIFY_EXIT (fd >= 0);
diff --git a/posix/tst-spawn5-timens.c b/posix/tst-spawn5-timens.c
new file mode 100644
index 0000000000..8d46819222
--- /dev/null
+++ b/posix/tst-spawn5-timens.c
@@ -0,0 +1,2 @@
+#define CHECK_TIMENAMESPACE
+#include "tst-spawn5.c"
diff --git a/posix/tst-spawn5.c b/posix/tst-spawn5.c
index dd5b3f9357..045ab15358 100644
--- a/posix/tst-spawn5.c
+++ b/posix/tst-spawn5.c
@@ -28,8 +28,10 @@
 #include <limits.h>
 
 #include <support/check.h>
-#include <support/xunistd.h>
+#include <support/namespace.h>
 #include <support/support.h>
+#include <support/test-driver.h>
+#include <support/xunistd.h>
 
 #include <arch-fd_to_filename.h>
 #include <array_length.h>
@@ -278,6 +280,12 @@ do_test (int argc, char *argv[])
     /* Ignore the application name. */
     handle_restart (argc - 1, &argv[1]);
 
+#ifdef CHECK_TIMENAMESPACE
+  support_become_root();
+  if (!support_enter_time_namespace ())
+    return EXIT_UNSUPPORTED;
+#endif
+
   TEST_VERIFY_EXIT (argc == 2 || argc == 5);
 
   int i;
diff --git a/posix/tst-spawn6-timens.c b/posix/tst-spawn6-timens.c
new file mode 100644
index 0000000000..a23d068ce5
--- /dev/null
+++ b/posix/tst-spawn6-timens.c
@@ -0,0 +1,2 @@
+#define CHECK_TIMENAMESPACE
+#include "tst-spawn6.c"
diff --git a/posix/tst-spawn6.c b/posix/tst-spawn6.c
index 044abd8535..4c55a1c110 100644
--- a/posix/tst-spawn6.c
+++ b/posix/tst-spawn6.c
@@ -25,12 +25,14 @@
 #include <spawn.h>
 #include <stdbool.h>
 #include <stdio.h>
+#include <stdlib.h>
 #include <string.h>
 #include <support/check.h>
+#include <support/namespace.h>
+#include <support/test-driver.h>
 #include <support/xunistd.h>
-#include <sys/wait.h>
 #include <sys/ioctl.h>
-#include <stdlib.h>
+#include <sys/wait.h>
 #include <termios.h>
 
 #ifndef PATH_MAX
@@ -124,6 +126,12 @@ run_subprogram (int argc, char *argv[], const posix_spawnattr_t *attr,
 static int
 run_test (int argc, char *argv[])
 {
+#ifdef CHECK_TIMENAMESPACE
+  support_become_root();
+  if (!support_enter_time_namespace ())
+    return EXIT_UNSUPPORTED;
+#endif
+
   /* We must have either:
      - four parameters left if called initially:
        + path to ld.so         optional
diff --git a/sysdeps/unix/sysv/linux/closefrom_fallback.c b/sysdeps/unix/sysv/linux/closefrom_fallback.c
index a9dd0c46b2..cb0642c637 100644
--- a/sysdeps/unix/sysv/linux/closefrom_fallback.c
+++ b/sysdeps/unix/sysv/linux/closefrom_fallback.c
@@ -28,7 +28,7 @@
    /proc/self/fd open will trigger a fallback that tries to close a file
    descriptor before proceed.  */
 _Bool
-__closefrom_fallback (int from, _Bool dirfd_fallback)
+__closefrom_fallback (unsigned int from, unsigned int to, _Bool dirfd_fallback)
 {
   int dirfd = __open_nocancel (FD_TO_FILENAME_PREFIX, O_RDONLY | O_DIRECTORY,
                                0);
@@ -40,7 +40,7 @@ __closefrom_fallback (int from, _Bool dirfd_fallback)
 	return false;
 
       /* The closefrom should work even when process can't open new files.  */
-      for (int i = from; i < INT_MAX; i++)
+      for (int i = from; i <= to; i++)
         {
           int r = __close_nocancel (i);
           if (r == 0 || (r == -1 && errno != EBADF))
@@ -84,6 +84,8 @@ __closefrom_fallback (int from, _Bool dirfd_fallback)
 
           if (fd == dirfd || fd < from)
             continue;
+	  if (fd > to)
+	    break;
 
           /* We ignore close errors because EBADF, EINTR, and EIO means the
              descriptor has been released.  */
diff --git a/sysdeps/unix/sysv/linux/spawni.c b/sysdeps/unix/sysv/linux/spawni.c
index d6f5ca89cd..4ba7c2fc4f 100644
--- a/sysdeps/unix/sysv/linux/spawni.c
+++ b/sysdeps/unix/sysv/linux/spawni.c
@@ -44,7 +44,11 @@
    third issue is done by a stack allocation in parent, and by using a
    field in struct spawn_args where the child can write an error
    code. CLONE_VFORK ensures that the parent does not run until the
-   child has either exec'ed successfully or exited.  */
+   child has either exec'ed successfully or exited.
+
+   If the clone with CLONE_VM and CLONE_VFORK fails (due any kernel limitation
+   such as time namespace), only CLONE_VFORK is used instead and the
+   preparation and execve failures are communicated with a pipe.  */
 
 
 /* The Unix standard contains a long explanation of the way to signal
@@ -67,6 +71,7 @@ struct posix_spawn_args
   char *const *envp;
   int xflags;
   int err;
+  int pipe[2];
 };
 
 /* Older version requires that shell script without shebang definition
@@ -94,15 +99,59 @@ maybe_script_execute (struct posix_spawn_args *args)
     }
 }
 
+/* If the file operation would clobber the pipe fd used to communicate with
+   parent, dup the pipe onto an unoccupied file descriptor.  */
+static inline bool
+spawni_fa_handle_pipe (const struct __spawn_action *fa, int p[])
+{
+  int fd;
+
+  switch (fa->tag)
+    {
+    case spawn_do_close:
+      fd = fa->action.close_action.fd;
+      break;
+    case spawn_do_open:
+      fd = fa->action.open_action.fd;
+      break;
+    case spawn_do_dup2:
+      fd = fa->action.dup2_action.newfd;
+      break;
+    case spawn_do_fchdir:
+      fd = fa->action.fchdir_action.fd;
+    default:
+      return true;
+    }
+
+  if (fd == p[1])
+    {
+      int r = __fcntl (p[1], F_DUPFD_CLOEXEC);
+      if (r < 0)
+	return false;
+      __close_nocancel (p[1]);
+      p[1] = r;
+    }
+
+  return true;
+}
+
+static inline bool
+spawni_fa_closerange (int from, int to)
+{
+  int r = INLINE_SYSCALL_CALL (close_range, from, to, 0);
+  return r == 0 || __closefrom_fallback (from, to, false);
+}
+
 /* Function used in the clone call to setup the signals mask, posix_spawn
    attributes, and file actions.  It run on its own stack (provided by the
    posix_spawn call).  */
-static int
-__spawni_child (void *arguments)
+static _Noreturn int
+spawni_child (void *arguments)
 {
   struct posix_spawn_args *args = arguments;
   const posix_spawnattr_t *restrict attr = args->attr;
   const posix_spawn_file_actions_t *file_actions = args->fa;
+  bool use_pipe = args->pipe[0] != -1;
 
   /* The child must ensure that no signal handler are enabled because it shared
      memory with parent, so the signal disposition must be either SIG_DFL or
@@ -113,6 +162,9 @@ __spawni_child (void *arguments)
   struct sigaction sa;
   memset (&sa, '\0', sizeof (sa));
 
+  if (use_pipe)
+    __close (args->pipe[0]);
+
   sigset_t hset;
   __sigprocmask (SIG_BLOCK, 0, &hset);
   for (int sig = 1; sig < _NSIG; ++sig)
@@ -181,6 +233,9 @@ __spawni_child (void *arguments)
 	{
 	  struct __spawn_action *action = &file_actions->__actions[cnt];
 
+	  if (use_pipe && !spawni_fa_handle_pipe (action, args->pipe))
+	    goto fail;
+
 	  switch (action->tag)
 	    {
 	    case spawn_do_close:
@@ -233,6 +288,11 @@ __spawni_child (void *arguments)
 	      break;
 
 	    case spawn_do_dup2:
+	      if (use_pipe && action->action.dup2_action.fd == args->pipe[1])
+		{
+		  errno = EBADF;
+		  goto fail;
+		}
 	      /* Austin Group issue #411 requires adddup2 action with source
 		 and destination being equal to remove close-on-exec flag.  */
 	      if (action->action.dup2_action.fd
@@ -264,8 +324,20 @@ __spawni_child (void *arguments)
 	    case spawn_do_closefrom:
 	      {
 		int lowfd = action->action.closefrom_action.from;
-	        int r = INLINE_SYSCALL_CALL (close_range, lowfd, ~0U, 0);
-		if (r != 0 && !__closefrom_fallback (lowfd, false))
+		/* Skip the pipe descriptor if it is used.  No need to handle
+		   it since it is created with O_CLOEXEC.  */
+		if (use_pipe && args->pipe[1] == lowfd)
+		  {
+		    if (!spawni_fa_closerange (lowfd + 1u, ~0U))
+		      goto fail;
+		  }
+		else if (use_pipe && args->pipe[1] > lowfd)
+		  {
+		    if (!spawni_fa_closerange (lowfd, args->pipe[1] - 1)
+			|| !spawni_fa_closerange (args->pipe[1] + 1u, ~0U))
+		      goto fail;
+		  }
+		else if (!spawni_fa_closerange (lowfd, ~0U))
 		  goto fail;
 	      } break;
 
@@ -300,10 +372,112 @@ fail:
      (EINTERNALBUG) describing that, use ECHILD.  Another option would
      be to set args->err to some negative sentinel and have the parent
      abort(), but that seems needlessly harsh.  */
-  args->err = errno ? : ECHILD;
+  int ret = errno ? : ECHILD;
+  if (use_pipe)
+    {
+      while (__write_nocancel (args->pipe[1], &ret, sizeof (ret)) < 0)
+	if (errno == EPIPE || errno == EBADF)
+	  break;
+    }
+  else
+    args->err = ret;
+
   _exit (SPAWN_ERROR);
 }
 
+static pid_t
+clone_call (struct posix_spawn_args *args, int flags, void *stack,
+	    size_t stack_size)
+{
+  struct clone_args clone_args =
+    {
+      .flags = flags,
+      .exit_signal = SIGCHLD,
+      .stack = (uintptr_t) stack,
+      .stack_size = stack_size,
+    };
+  return __clone_internal (&clone_args, spawni_child, args);
+}
+
+/* Spawn a new process using clone with CLONE_VM | CLONE_VFORK (to optimize
+   memory and overcommit) and return TRUE if the helper was created or if the
+   failure was not due resource exhaustion.  */
+static bool
+spawni_clone (struct posix_spawn_args *args, pid_t *new_pid, int *ec,
+	      void *stack, size_t stack_size)
+{
+  /* The clone flags used will create a new child that will run in the same
+     memory space (CLONE_VM) and the execution of calling thread will be
+     suspend until the child calls execve or _exit.
+
+     Also since the calling thread execution will be suspend, there is not
+     need for CLONE_SETTLS.  Although parent and child share the same TLS
+     namespace, there will be no concurrent access for TLS variables (errno
+     for instance).  */
+  *new_pid = clone_call (args, CLONE_VM | CLONE_VFORK, stack, stack_size);
+
+  /* It needs to collect the case where the auxiliary process was created
+     but failed to execute the file (due either any preparation step or
+     for execve itself).  */
+  if (*new_pid > 0)
+    {
+      /* Also, it handles the unlikely case where the auxiliary process was
+	 terminated before calling execve as if it was successfully.  The
+	 args.err is set to 0 as default and changed to a positive value
+	 only in case of failure, so in case of premature termination
+	 due a signal args.err will remain zeroed and it will be up to
+	 caller to actually collect it.  */
+      *ec = args->err;
+      if (*ec > 0)
+	/* There still an unlikely case where the child is cancelled after
+	   setting args.err, due to a positive error value.  Also there is
+	   possible pid reuse race (where the kernel allocated the same pid
+	   to an unrelated process).  Unfortunately due synchronization
+	   issues where the kernel might not have the process collected
+	   the waitpid below can not use WNOHANG.  */
+	__waitpid (*new_pid, NULL, 0);
+    }
+  else
+    *ec = errno;
+
+  /* There is no much point in retrying with fork and exec if kernel returns a
+     failure due resource exhaustion.  */
+  return *new_pid > 0 || (errno == ENOMEM || errno == EAGAIN);
+}
+
+/* Fallback spawn case which does not use CLONE_VM.  Any preparation step or
+   execve failure is passed with a pipe, which requires additional care by
+   the helper stating process since it additional file descriptors handle.  */
+static void
+spawni_fork (struct posix_spawn_args *args, pid_t *new_pid, int *ec,
+	     char *stack, size_t stack_size)
+{
+  if (__pipe2 (args->pipe, O_CLOEXEC) != 0)
+    {
+      *ec = errno;
+      return;
+    }
+
+  /* Do not trigger atfork handler nor any internal state reset since the
+     helper process will call execve.  */
+  *new_pid = clone_call (args, CLONE_VFORK, stack, stack_size);
+
+  __close (args->pipe[1]);
+
+  if (*new_pid > 0)
+    {
+      if (__read (args->pipe[0], ec, sizeof *ec) != sizeof *ec)
+	/* A successful execve will close the helper process pipe end.  */
+	*ec = 0;
+      else
+	__waitpid (*new_pid, NULL, 0);
+    }
+  else
+    *ec = errno;
+
+  __close (args->pipe[0]);
+}
+
 /* Spawn a new process executing PATH with the attributes describes in *ATTRP.
    Before running the process perform the actions described in FILE-ACTIONS. */
 static int
@@ -367,49 +541,16 @@ __spawnix (pid_t * pid, const char *file,
   args.argc = argc;
   args.envp = envp;
   args.xflags = xflags;
+  args.pipe[0] = args.pipe[1] = -1;
 
   __libc_signal_block_all (&args.oldmask);
 
-  /* The clone flags used will create a new child that will run in the same
-     memory space (CLONE_VM) and the execution of calling thread will be
-     suspend until the child calls execve or _exit.
-
-     Also since the calling thread execution will be suspend, there is not
-     need for CLONE_SETTLS.  Although parent and child share the same TLS
-     namespace, there will be no concurrent access for TLS variables (errno
-     for instance).  */
-  struct clone_args clone_args =
-    {
-      .flags = CLONE_VM | CLONE_VFORK,
-      .exit_signal = SIGCHLD,
-      .stack = (uintptr_t) stack,
-      .stack_size = stack_size,
-    };
-  new_pid = __clone_internal (&clone_args, __spawni_child, &args);
-
-  /* It needs to collect the case where the auxiliary process was created
-     but failed to execute the file (due either any preparation step or
-     for execve itself).  */
-  if (new_pid > 0)
-    {
-      /* Also, it handles the unlikely case where the auxiliary process was
-	 terminated before calling execve as if it was successfully.  The
-	 args.err is set to 0 as default and changed to a positive value
-	 only in case of failure, so in case of premature termination
-	 due a signal args.err will remain zeroed and it will be up to
-	 caller to actually collect it.  */
-      ec = args.err;
-      if (ec > 0)
-	/* There still an unlikely case where the child is cancelled after
-	   setting args.err, due to a positive error value.  Also there is
-	   possible pid reuse race (where the kernel allocated the same pid
-	   to an unrelated process).  Unfortunately due synchronization
-	   issues where the kernel might not have the process collected
-	   the waitpid below can not use WNOHANG.  */
-	__waitpid (new_pid, NULL, 0);
-    }
-  else
-    ec = errno;
+  /* clone with CLONE_VM | CLONE_VFORK may fail for some namespace restriction
+     (for instance Linux does not allow processes in different time namespaces
+     to share address space) and in this case clone fails with EINVAL.  Retry
+     with fork and exec.  */
+  if (!spawni_clone (&args, &new_pid, &ec, stack, stack_size))
+    spawni_fork (&args, &new_pid, &ec, stack, stack_size);
 
   __munmap (stack, stack_size);
 
-- 
2.34.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 0/3] Linux: Fix posix_spawn when user with time namespaces
  2022-05-10 19:11 [PATCH v4 0/3] Linux: Fix posix_spawn when user with time namespaces Adhemerval Zanella
                   ` (2 preceding siblings ...)
  2022-05-10 19:11 ` [PATCH v4 3/3] linux: Add fallback for clone failure on posix_spawn (BZ #29115) Adhemerval Zanella
@ 2022-05-10 19:18 ` Florian Weimer
  2022-05-11  9:21   ` Christian Brauner
  3 siblings, 1 reply; 15+ messages in thread
From: Florian Weimer @ 2022-05-10 19:18 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Adhemerval Zanella, libc-alpha, Alexey Izbyshev, Carlos O'Donell

* Adhemerval Zanella:

> The patchset adds some support to tests the fallback code to
> use only use CLONE_VFORK.  It uses unshare directly because
> it simpler than add container support.
>
> Adhemerval Zanella (3):
>   linux: Add CLONE_NEWTIME from Linux 5.6 to bits/sched.h
>   support: Add support_enter_time_namespace
>   linux: Add fallback for clone failure on posix_spawn (BZ #29115)

Christan, how likely is it that we'd get another time namespace variant
that would only become effective after execve (when the DSO is remapped
anyway)?

This is a lot of complexity in posix_spawn implementations for what is a
rather limited use case.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 0/3] Linux: Fix posix_spawn when user with time namespaces
  2022-05-10 19:18 ` [PATCH v4 0/3] Linux: Fix posix_spawn when user with time namespaces Florian Weimer
@ 2022-05-11  9:21   ` Christian Brauner
  2022-05-11 16:01     ` Alexey Izbyshev
  2022-05-25 12:24     ` Florian Weimer
  0 siblings, 2 replies; 15+ messages in thread
From: Christian Brauner @ 2022-05-11  9:21 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Adhemerval Zanella, libc-alpha, Alexey Izbyshev, Carlos O'Donell

On Tue, May 10, 2022 at 09:18:46PM +0200, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
> > The patchset adds some support to tests the fallback code to
> > use only use CLONE_VFORK.  It uses unshare directly because
> > it simpler than add container support.
> >
> > Adhemerval Zanella (3):
> >   linux: Add CLONE_NEWTIME from Linux 5.6 to bits/sched.h
> >   support: Add support_enter_time_namespace
> >   linux: Add fallback for clone failure on posix_spawn (BZ #29115)
> 
> Christan, how likely is it that we'd get another time namespace variant
> that would only become effective after execve (when the DSO is remapped
> anyway)?

Not unlikely if it helps you avoid a lot of complexity. I will need some
time to track down Andrei and others to discuss though.

Thanks,
Christian

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 0/3] Linux: Fix posix_spawn when user with time namespaces
  2022-05-11  9:21   ` Christian Brauner
@ 2022-05-11 16:01     ` Alexey Izbyshev
  2022-05-25 12:24     ` Florian Weimer
  1 sibling, 0 replies; 15+ messages in thread
From: Alexey Izbyshev @ 2022-05-11 16:01 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Florian Weimer, Adhemerval Zanella, libc-alpha, Carlos O'Donell

On 2022-05-11 12:21, Christian Brauner wrote:
> On Tue, May 10, 2022 at 09:18:46PM +0200, Florian Weimer wrote:
>> * Adhemerval Zanella:
>> 
>> > The patchset adds some support to tests the fallback code to
>> > use only use CLONE_VFORK.  It uses unshare directly because
>> > it simpler than add container support.
>> >
>> > Adhemerval Zanella (3):
>> >   linux: Add CLONE_NEWTIME from Linux 5.6 to bits/sched.h
>> >   support: Add support_enter_time_namespace
>> >   linux: Add fallback for clone failure on posix_spawn (BZ #29115)
>> 
>> Christan, how likely is it that we'd get another time namespace 
>> variant
>> that would only become effective after execve (when the DSO is 
>> remapped
>> anyway)?
> 
> Not unlikely if it helps you avoid a lot of complexity. I will need 
> some
> time to track down Andrei and others to discuss though.
> 
A more general question is whether the kernel aims to maintain the 
property that vfork() can always be used in place of fork() in the 
future. If it doesn't, code that currently relies on vfork() (even 
unwittingly via high-level APIs like posix_spawn()) will still have to 
grow workarounds.

Thanks,
Alexey

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/3] linux: Add CLONE_NEWTIME from Linux 5.6 to bits/sched.h
  2022-05-10 19:11 ` [PATCH v4 1/3] linux: Add CLONE_NEWTIME from Linux 5.6 to bits/sched.h Adhemerval Zanella
@ 2022-05-23 13:45   ` Adhemerval Zanella
  0 siblings, 0 replies; 15+ messages in thread
From: Adhemerval Zanella @ 2022-05-23 13:45 UTC (permalink / raw)
  To: libc-alpha, Alexey Izbyshev, Carlos O'Donell, Florian Weimer

Although the BZ #29115 fix might result in WONTFIX, this change is
unrelated.  I will commit this shortly if no one opposes it.

On 10/05/2022 16:11, Adhemerval Zanella wrote:
> It was added in commit 769071ac9f20b6a447410c7eaa55d1a5233ef40c.
> ---
>  sysdeps/unix/sysv/linux/bits/sched.h | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/sysdeps/unix/sysv/linux/bits/sched.h b/sysdeps/unix/sysv/linux/bits/sched.h
> index f13c569863..d79359eaeb 100644
> --- a/sysdeps/unix/sysv/linux/bits/sched.h
> +++ b/sysdeps/unix/sysv/linux/bits/sched.h
> @@ -71,6 +71,10 @@
>  # define CLONE_NEWPID	0x20000000	/* New pid namespace.  */
>  # define CLONE_NEWNET	0x40000000	/* New network namespace.  */
>  # define CLONE_IO	0x80000000	/* Clone I/O context.  */
> +
> +/* cloning flags intersect with CSIGNAL so can be used only with unshare and
> +   clone3 syscalls.  */
> +#define CLONE_NEWTIME	0x00000080      /* New time namespace */
>  #endif
>  
>  #include <bits/types/struct_sched_param.h>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 0/3] Linux: Fix posix_spawn when user with time namespaces
  2022-05-11  9:21   ` Christian Brauner
  2022-05-11 16:01     ` Alexey Izbyshev
@ 2022-05-25 12:24     ` Florian Weimer
  2022-05-27 15:53       ` Andrei Vagin
  1 sibling, 1 reply; 15+ messages in thread
From: Florian Weimer @ 2022-05-25 12:24 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Adhemerval Zanella, libc-alpha, Alexey Izbyshev,
	Carlos O'Donell, Andrei Vagin

* Christian Brauner:

> On Tue, May 10, 2022 at 09:18:46PM +0200, Florian Weimer wrote:
>> * Adhemerval Zanella:
>> 
>> > The patchset adds some support to tests the fallback code to
>> > use only use CLONE_VFORK.  It uses unshare directly because
>> > it simpler than add container support.
>> >
>> > Adhemerval Zanella (3):
>> >   linux: Add CLONE_NEWTIME from Linux 5.6 to bits/sched.h
>> >   support: Add support_enter_time_namespace
>> >   linux: Add fallback for clone failure on posix_spawn (BZ #29115)
>> 
>> Christan, how likely is it that we'd get another time namespace variant
>> that would only become effective after execve (when the DSO is remapped
>> anyway)?
>
> Not unlikely if it helps you avoid a lot of complexity. I will need some
> time to track down Andrei and others to discuss though.

Any progress with that?  (I hope I guessed the right Andrei.)

Breaking vfork is really a bit of a hassle for us, and the workaround
code is quite non-trivial and will have to implemented across many
projects (not just glibc).  An unshare request that takes effect on
execve only would really help.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 0/3] Linux: Fix posix_spawn when user with time namespaces
  2022-05-25 12:24     ` Florian Weimer
@ 2022-05-27 15:53       ` Andrei Vagin
  2022-05-28  0:03         ` Andrei Vagin
  2022-05-30 12:58         ` Florian Weimer
  0 siblings, 2 replies; 15+ messages in thread
From: Andrei Vagin @ 2022-05-27 15:53 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Christian Brauner, Adhemerval Zanella, libc-alpha,
	Alexey Izbyshev, Carlos O'Donell

On Wed, May 25, 2022 at 5:24 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Christian Brauner:
>
> > On Tue, May 10, 2022 at 09:18:46PM +0200, Florian Weimer wrote:
> >> * Adhemerval Zanella:
> >>
> >> > The patchset adds some support to tests the fallback code to
> >> > use only use CLONE_VFORK.  It uses unshare directly because
> >> > it simpler than add container support.
> >> >
> >> > Adhemerval Zanella (3):
> >> >   linux: Add CLONE_NEWTIME from Linux 5.6 to bits/sched.h
> >> >   support: Add support_enter_time_namespace
> >> >   linux: Add fallback for clone failure on posix_spawn (BZ #29115)
> >>
> >> Christan, how likely is it that we'd get another time namespace variant
> >> that would only become effective after execve (when the DSO is remapped
> >> anyway)?
> >
> > Not unlikely if it helps you avoid a lot of complexity. I will need some
> > time to track down Andrei and others to discuss though.
>
> Any progress with that?  (I hope I guessed the right Andrei.)

I think this is the right me. Have I missed something?

>
> Breaking vfork is really a bit of a hassle for us, and the workaround
> code is quite non-trivial and will have to implemented across many
> projects (not just glibc).  An unshare request that takes effect on
> execve only would really help.

Is the problem that vfork fails if a process has half-entered a time namespace?

>
> Thanks,
> Florian
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 0/3] Linux: Fix posix_spawn when user with time namespaces
  2022-05-27 15:53       ` Andrei Vagin
@ 2022-05-28  0:03         ` Andrei Vagin
  2022-05-29 14:33           ` Christian Brauner
  2022-05-30 12:58         ` Florian Weimer
  1 sibling, 1 reply; 15+ messages in thread
From: Andrei Vagin @ 2022-05-28  0:03 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Christian Brauner, Adhemerval Zanella, libc-alpha,
	Alexey Izbyshev, Carlos O'Donell, Dmitry Safonov

On Fri, May 27, 2022 at 08:53:54AM -0700, Andrei Vagin wrote:
> On Wed, May 25, 2022 at 5:24 AM Florian Weimer <fweimer@redhat.com> wrote:
> >
> > * Christian Brauner:
> >
> > > On Tue, May 10, 2022 at 09:18:46PM +0200, Florian Weimer wrote:
> > >> * Adhemerval Zanella:
> > >>
> > >> > The patchset adds some support to tests the fallback code to
> > >> > use only use CLONE_VFORK.  It uses unshare directly because
> > >> > it simpler than add container support.
> > >> >
> > >> > Adhemerval Zanella (3):
> > >> >   linux: Add CLONE_NEWTIME from Linux 5.6 to bits/sched.h
> > >> >   support: Add support_enter_time_namespace
> > >> >   linux: Add fallback for clone failure on posix_spawn (BZ #29115)
> > >>
> > >> Christan, how likely is it that we'd get another time namespace variant
> > >> that would only become effective after execve (when the DSO is remapped
> > >> anyway)?
> > >
> > > Not unlikely if it helps you avoid a lot of complexity. I will need some
> > > time to track down Andrei and others to discuss though.
> >
> > Any progress with that?  (I hope I guessed the right Andrei.)
> 
> I think this is the right me. Have I missed something?
> 
> >
> > Breaking vfork is really a bit of a hassle for us, and the workaround
> > code is quite non-trivial and will have to implemented across many
> > projects (not just glibc).  An unshare request that takes effect on
> > execve only would really help.
> 
> Is the problem that vfork fails if a process has half-entered a time namespace?

I like the idea of entering a time namespace on exec and don't fail
vfork.  The only worry is that the behavior becomes more complicated and
less obvious.

Here is the untested patch:

diff --git a/fs/exec.c b/fs/exec.c
index 14b4b3755580..96f650ec7533 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -65,6 +65,7 @@
 #include <linux/io_uring.h>
 #include <linux/syscall_user_dispatch.h>
 #include <linux/coredump.h>
+#include <linux/time_namespace.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1266,6 +1267,7 @@ int begin_new_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+	timens_on_fork(me->nsproxy, me);
 	/*
 	 * Cancel any io_uring activity across execve
 	 */
diff --git a/kernel/fork.c b/kernel/fork.c
index 124829ed0163..653b80524a54 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2030,15 +2030,6 @@ static __latent_entropy struct task_struct
*copy_process(
 			return ERR_PTR(-EINVAL);
 	}
 
-	/*
-	 * If the new process will be in a different time namespace
-	 * do not allow it to share VM or a thread group with the
	 forking task.
-	 */
-	if (clone_flags & (CLONE_THREAD | CLONE_VM)) {
-		if (nsp->time_ns != nsp->time_ns_for_children)
-			return ERR_PTR(-EINVAL);
-	}
-
 	if (clone_flags & CLONE_PIDFD) {
 		/*
 		 * - CLONE_DETACHED is blocked so that we can
 		 * potentially
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index eec72ca962e2..b4cbb406bc28 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -179,7 +179,8 @@ int copy_namespaces(unsigned long flags, struct
task_struct *tsk)
 	if (IS_ERR(new_ns))
 		return  PTR_ERR(new_ns);
 
-	timens_on_fork(new_ns, tsk);
+	if ((flags & CLONE_VM) == 0)
+		timens_on_fork(new_ns, tsk);
 
 	tsk->nsproxy = new_ns;
 	return 0;

If this is what we want, I can prepare a final patch. But I will be on
vacation next week, so it will happen after it.

Thanks,
Andrei

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 0/3] Linux: Fix posix_spawn when user with time namespaces
  2022-05-28  0:03         ` Andrei Vagin
@ 2022-05-29 14:33           ` Christian Brauner
  0 siblings, 0 replies; 15+ messages in thread
From: Christian Brauner @ 2022-05-29 14:33 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Florian Weimer, Adhemerval Zanella, libc-alpha, Alexey Izbyshev,
	Carlos O'Donell, Dmitry Safonov

On Fri, May 27, 2022 at 05:03:14PM -0700, Andrei Vagin wrote:
> On Fri, May 27, 2022 at 08:53:54AM -0700, Andrei Vagin wrote:
> > On Wed, May 25, 2022 at 5:24 AM Florian Weimer <fweimer@redhat.com> wrote:
> > >
> > > * Christian Brauner:
> > >
> > > > On Tue, May 10, 2022 at 09:18:46PM +0200, Florian Weimer wrote:
> > > >> * Adhemerval Zanella:
> > > >>
> > > >> > The patchset adds some support to tests the fallback code to
> > > >> > use only use CLONE_VFORK.  It uses unshare directly because
> > > >> > it simpler than add container support.
> > > >> >
> > > >> > Adhemerval Zanella (3):
> > > >> >   linux: Add CLONE_NEWTIME from Linux 5.6 to bits/sched.h
> > > >> >   support: Add support_enter_time_namespace
> > > >> >   linux: Add fallback for clone failure on posix_spawn (BZ #29115)
> > > >>
> > > >> Christan, how likely is it that we'd get another time namespace variant
> > > >> that would only become effective after execve (when the DSO is remapped
> > > >> anyway)?
> > > >
> > > > Not unlikely if it helps you avoid a lot of complexity. I will need some
> > > > time to track down Andrei and others to discuss though.
> > >
> > > Any progress with that?  (I hope I guessed the right Andrei.)
> > 
> > I think this is the right me. Have I missed something?
> > 
> > >
> > > Breaking vfork is really a bit of a hassle for us, and the workaround
> > > code is quite non-trivial and will have to implemented across many
> > > projects (not just glibc).  An unshare request that takes effect on
> > > execve only would really help.
> > 
> > Is the problem that vfork fails if a process has half-entered a time namespace?
> 
> I like the idea of entering a time namespace on exec and don't fail
> vfork.  The only worry is that the behavior becomes more complicated and
> less obvious.
> 
> Here is the untested patch:
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 14b4b3755580..96f650ec7533 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -65,6 +65,7 @@
>  #include <linux/io_uring.h>
>  #include <linux/syscall_user_dispatch.h>
>  #include <linux/coredump.h>
> +#include <linux/time_namespace.h>
>  
>  #include <linux/uaccess.h>
>  #include <asm/mmu_context.h>
> @@ -1266,6 +1267,7 @@ int begin_new_exec(struct linux_binprm * bprm)
>  	if (retval)
>  		goto out;
>  
> +	timens_on_fork(me->nsproxy, me);
>  	/*
>  	 * Cancel any io_uring activity across execve
>  	 */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 124829ed0163..653b80524a54 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2030,15 +2030,6 @@ static __latent_entropy struct task_struct
> *copy_process(
>  			return ERR_PTR(-EINVAL);
>  	}
>  
> -	/*
> -	 * If the new process will be in a different time namespace
> -	 * do not allow it to share VM or a thread group with the
> 	 forking task.
> -	 */
> -	if (clone_flags & (CLONE_THREAD | CLONE_VM)) {
> -		if (nsp->time_ns != nsp->time_ns_for_children)
> -			return ERR_PTR(-EINVAL);
> -	}
> -
>  	if (clone_flags & CLONE_PIDFD) {
>  		/*
>  		 * - CLONE_DETACHED is blocked so that we can
>  		 * potentially
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index eec72ca962e2..b4cbb406bc28 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -179,7 +179,8 @@ int copy_namespaces(unsigned long flags, struct
> task_struct *tsk)
>  	if (IS_ERR(new_ns))
>  		return  PTR_ERR(new_ns);
>  
> -	timens_on_fork(new_ns, tsk);
> +	if ((flags & CLONE_VM) == 0)
> +		timens_on_fork(new_ns, tsk);
>  
>  	tsk->nsproxy = new_ns;
>  	return 0;
> 
> If this is what we want, I can prepare a final patch. But I will be on
> vacation next week, so it will happen after it.

No problem. It's the merge window anyway and it's May which means
bank holiday galore in parts of Europe.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 0/3] Linux: Fix posix_spawn when user with time namespaces
  2022-05-27 15:53       ` Andrei Vagin
  2022-05-28  0:03         ` Andrei Vagin
@ 2022-05-30 12:58         ` Florian Weimer
  2022-06-08  7:21           ` Andrei Vagin
  1 sibling, 1 reply; 15+ messages in thread
From: Florian Weimer @ 2022-05-30 12:58 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Christian Brauner, Adhemerval Zanella, libc-alpha,
	Alexey Izbyshev, Carlos O'Donell

* Andrei Vagin:

[CLONE_NEWTIME and vfork]

>> Breaking vfork is really a bit of a hassle for us, and the workaround
>> code is quite non-trivial and will have to implemented across many
>> projects (not just glibc).  An unshare request that takes effect on
>> execve only would really help.
>
> Is the problem that vfork fails if a process has half-entered a time
> namespace?

Exactly.  Anything that implements a general-purpose process launching
facility on top of vfork now needs to implement fork fallback (after
vfork failure), so that launching processes still works if the original
process has called unshare(CLONE_NEWTIME).  

In glibc, this affects posix_spawn, but other libcs are also impacted,
and so is any custom posix_spawn-like interface that uses vfork
internally.  Without fork fallback, they turn unusable if anything in
the process has previously called unshare(CLONE_NEWTIME).  The fallback
implementation tends to be complicated if it's necessary to report
execve and other errors to the caller.  There is a choice between the
O_CLOEXEC pipe hack (which has become more complex to implement due to
close_range support), or a shared mapping has to be created using
MAP_SHARED, and the subprocess writes error information to that (which
adds more potentially costly MM updates).  MAP_SHARED is probably easier
to implement than the pipe approach (no interference possible from file
actions), but for glibc, Adhemerval wrote something based on the pipe
approach.

But the key point is that any general-purpose wrapper around vfork now
has to implement fork fallback.

Regarding the patch you sketched, we'd probably have to introduce a new
flag (not CLONE_NEWTIME) for this because the difference in behavior is
quite visible.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 0/3] Linux: Fix posix_spawn when user with time namespaces
  2022-05-30 12:58         ` Florian Weimer
@ 2022-06-08  7:21           ` Andrei Vagin
  2022-06-08  8:35             ` Florian Weimer
  0 siblings, 1 reply; 15+ messages in thread
From: Andrei Vagin @ 2022-06-08  7:21 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Christian Brauner, Adhemerval Zanella, libc-alpha,
	Alexey Izbyshev, Carlos O'Donell

On Mon, May 30, 2022 at 02:58:01PM +0200, Florian Weimer wrote:
> * Andrei Vagin:
> 
> [CLONE_NEWTIME and vfork]
> 
> >> Breaking vfork is really a bit of a hassle for us, and the workaround
> >> code is quite non-trivial and will have to implemented across many
> >> projects (not just glibc).  An unshare request that takes effect on
> >> execve only would really help.
> >
> > Is the problem that vfork fails if a process has half-entered a time
> > namespace?
> 
> Exactly.  Anything that implements a general-purpose process launching
> facility on top of vfork now needs to implement fork fallback (after
> vfork failure), so that launching processes still works if the original
> process has called unshare(CLONE_NEWTIME).  
> 
> In glibc, this affects posix_spawn, but other libcs are also impacted,
> and so is any custom posix_spawn-like interface that uses vfork
> internally.  Without fork fallback, they turn unusable if anything in
> the process has previously called unshare(CLONE_NEWTIME).  The fallback
> implementation tends to be complicated if it's necessary to report
> execve and other errors to the caller.  There is a choice between the
> O_CLOEXEC pipe hack (which has become more complex to implement due to
> close_range support), or a shared mapping has to be created using
> MAP_SHARED, and the subprocess writes error information to that (which
> adds more potentially costly MM updates).  MAP_SHARED is probably easier
> to implement than the pipe approach (no interference possible from file
> actions), but for glibc, Adhemerval wrote something based on the pipe
> approach.
> 
> But the key point is that any general-purpose wrapper around vfork now
> has to implement fork fallback.
> 
> Regarding the patch you sketched, we'd probably have to introduce a new
> flag (not CLONE_NEWTIME) for this because the difference in behavior is
> quite visible.

I agree that if we want to switch timens on exec, we need to introduce a
new clone flag. But I would like to avoid doing that. I think we can
live with the current clone flag if we allow switching timens only when
exec is executed by a vfork-ed process. In this case, the chance to
affect existing users is very low, isn't it?


Without the new change, vfork fails if the currect process has unshared
timens. With the change, vfork creates a new process, and the following
exec executes a binary in the target timens.

Thanks,
Andrei

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 0/3] Linux: Fix posix_spawn when user with time namespaces
  2022-06-08  7:21           ` Andrei Vagin
@ 2022-06-08  8:35             ` Florian Weimer
  0 siblings, 0 replies; 15+ messages in thread
From: Florian Weimer @ 2022-06-08  8:35 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Christian Brauner, Adhemerval Zanella, libc-alpha,
	Alexey Izbyshev, Carlos O'Donell

* Andrei Vagin:

>> Regarding the patch you sketched, we'd probably have to introduce a new
>> flag (not CLONE_NEWTIME) for this because the difference in behavior is
>> quite visible.
>
> I agree that if we want to switch timens on exec, we need to introduce a
> new clone flag. But I would like to avoid doing that. I think we can
> live with the current clone flag if we allow switching timens only when
> exec is executed by a vfork-ed process. In this case, the chance to
> affect existing users is very low, isn't it?
>
> Without the new change, vfork fails if the currect process has unshared
> timens. With the change, vfork creates a new process, and the following
> exec executes a binary in the target timens.

This is an interesting observation.  The behavior is a bit non-obvious
(but so is the vfork failure), but I agree that it could work.  There
are already restrictions and oddities after vfork, so I think it's not
too bad after all.  And I think it will make posix_spawn-like functions
magically work that are currently broken, in pretty much all cases.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-06-08  8:35 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-10 19:11 [PATCH v4 0/3] Linux: Fix posix_spawn when user with time namespaces Adhemerval Zanella
2022-05-10 19:11 ` [PATCH v4 1/3] linux: Add CLONE_NEWTIME from Linux 5.6 to bits/sched.h Adhemerval Zanella
2022-05-23 13:45   ` Adhemerval Zanella
2022-05-10 19:11 ` [PATCH v4 2/3] support: Add support_enter_time_namespace Adhemerval Zanella
2022-05-10 19:11 ` [PATCH v4 3/3] linux: Add fallback for clone failure on posix_spawn (BZ #29115) Adhemerval Zanella
2022-05-10 19:18 ` [PATCH v4 0/3] Linux: Fix posix_spawn when user with time namespaces Florian Weimer
2022-05-11  9:21   ` Christian Brauner
2022-05-11 16:01     ` Alexey Izbyshev
2022-05-25 12:24     ` Florian Weimer
2022-05-27 15:53       ` Andrei Vagin
2022-05-28  0:03         ` Andrei Vagin
2022-05-29 14:33           ` Christian Brauner
2022-05-30 12:58         ` Florian Weimer
2022-06-08  7:21           ` Andrei Vagin
2022-06-08  8:35             ` Florian Weimer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).