public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
* RFC PATCH: Don't use /proc/self/maps to calculate size of initial thread stack
@ 2022-09-09 21:03 Zack Weinberg
  2022-09-13  9:52 ` Florian Weimer
  0 siblings, 1 reply; 11+ messages in thread
From: Zack Weinberg @ 2022-09-09 21:03 UTC (permalink / raw)
  To: GNU libc development

When pthread_getattr_np is applied to the initial thread, it has to figure out how big the initial thread's stack is.  Since the initial thread's stack is lazily allocated and the kernel reuses that memory region for the "information block" (argv, environ, etc) there are several different ways one could define "the size of the initial thread's stack"; for many years, the NPTL implementation has said that the stack starts at __libc_stack_end, rounded in the opposite direction from stack growth to the nearest page boundary, and extends for getrlimit(RLIMIT_STACK).rlim_cur bytes, *minus the size of the information block*, which is beyond __libc_stack_end.  The rationale is that the resource limit is enforced against the entire memory area, so if we don't subtract the size of the information block, then the program will run out of stack a few pages before pthread_attr_getstack says it will.

The trouble with this definition is determining how big the information block is.  NPTL has, again for many years, done it by parsing /proc/self/maps, which is less than ideal because (a) it's a text parser deep in the guts of libpthread (and it uses _scanf_, eeew) and (b) it means programs that call pthread_getattr_np on the initial thread have a hidden dependency on /proc being mounted.  This came to my attention because the Rust runtime _always_ calls pthread_getattr_np on the initial thread during startup, in order to set up stack overflow monitoring, causing problems for use of Rust programs during early boot.

This patch changes things so that instead of parsing /proc/self/maps, we take advantage of the fact that process_vm_readv is guaranteed to return a short read when it encounters an unmapped page somewhere in the _middle_ of the "remote" iovec.  So we set up an iovec that reads one byte from each of N pages starting just before __libc_stack_end, feed it to process_vm_readv, and count how many pages were actually readable.

This is arguably just as dirty of a hack -- in particular, process_vm_readv(getpid(), ...) _ought_ to always be allowed, since you can't do anything with it that you couldn't do with direct memory access instructions and possibly a SIGSEGV handler, but it's not apparent to me if it _is_ always allowed; process_vm_readv is a spinoff of ptrace and it uses that set of security checks which are a bit of a mess.  Also it _may_ not be computing exactly the same values.  I wrote a test to compare the new algorithm to the old one, and it passes, but I'm not trying very hard to set up exotic conditions (mmap areas close to the initial stack, for instance).

Since this is a proof-of-concept, and also because the test suite is thoroughly broken for unrelated reasons right now, I haven't bothered setting up PLT bypass for process_vm_readv.

Thoughts?
zw

diff --git a/nptl/Makefile b/nptl/Makefile
index 3d2ce8af8a..fdb6011c34 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -295,6 +295,7 @@ tests = tst-attr2 tst-attr3 tst-default-attr \
 	tst-dlsym1 \
 	tst-context1 \
 	tst-sched1 \
+	tst-initial-thread-stacksize \
 	tst-initializers1 $(addprefix tst-initializers1-,\
 			    c89 gnu89 c99 gnu99 c11 gnu11) \
 	tst-thread_local1 \
diff --git a/nptl/pthread_getattr_np.c b/nptl/pthread_getattr_np.c
index 9c5b73b452..4764c885ab 100644
--- a/nptl/pthread_getattr_np.c
+++ b/nptl/pthread_getattr_np.c
@@ -18,18 +18,125 @@
 #include <assert.h>
 #include <errno.h>
 #include <inttypes.h>
-#include <stdio.h>
-#include <stdio_ext.h>
 #include <stdlib.h>
 #include <string.h>
+#include <unistd.h>
 #include <sys/resource.h>
+#include <sys/uio.h>
 #include "pthreadP.h"
 #include <lowlevellock.h>
 #include <ldsodefs.h>
+#include <shlib-compat.h>
 
+/* Number of pages to probe in each call to process_vm_readv.
+   Must be at least 2 for probe_initial_thread_stack to work
+   correctly.  Larger numbers trade off larger arrays allocated
+   in probe_initial_thread_stack's stack frame for fewer calls
+   to process_vm_readv.
+
+   The current choice of 32 allows us to probe past up to 124KiB
+   of argv, environ, etc. for the end of the stack in a single call
+   to process_vm_readv, while requiring only a little more than
+   1KiB of scratch space.  */
+#define PROBES_PER_CALL 32
+
+/* Subroutine of pthread_getattr_np:
+   Determine the size and base address of the initial thread's stack.
+
+   We assume that the initial thread's stack can grow to fill all the
+   space permitted by the RLIMIT_STACK resource limit, and we treat it
+   as having ended at __libc_stack_end, rounded to a page boundary.
+   Because __libc_stack_end is not the actual end of the memory region
+   containing the stack -- the program arguments, environment
+   variables, auxiliary vector, etc. are all "below" __libc_stack_end
+   -- but the resource limit applies to the entire memory region, we
+   must find the actual end in order to report the size accurately.
+
+   Currently, we find the actual end by taking advantage of the fact
+   that process_vm_readv is guaranteed to stop at the first "remote"
+   iovec entry that touches an unmapped page.  */
+static int
+probe_initial_thread_stack (struct pthread_attr *iattr)
+{
+  uintptr_t pagesize = GLRO(dl_pagesize);
+  uintptr_t stack_end_page =
+    _STACK_GROWS_DOWN
+    ? ((uintptr_t) __libc_stack_end + pagesize - 1) & -pagesize
+    : (uintptr_t) __libc_stack_end & -pagesize;
+
+  struct rlimit rl;
+  if (__getrlimit (RLIMIT_STACK, &rl) != 0)
+    return -1;
+
+  char local_buf[PROBES_PER_CALL];
+  struct iovec local_iov[PROBES_PER_CALL];
+  struct iovec stack_iov[PROBES_PER_CALL];
+
+  /* We're going to read one byte per page of potential stack into
+     local_buf.  The local_iov can be set up for this once in advance,
+     as can the stack_iov's iov_len fields. */
+  for (int i = 0; i < PROBES_PER_CALL; i++)
+    {
+      local_iov[i].iov_base = &local_buf[i];
+      local_iov[i].iov_len = 1;
+      stack_iov[i].iov_len = 1;
+    }
+
+  /* We're looking for the bottommost page of the stack, so the probes
+     should advance in the opposite direction from the direction the
+     stack grows.  */
+  intptr_t probe_stride = pagesize * (_STACK_GROWS_DOWN ? 1 : -1);
+  uintptr_t first_probe_point = stack_end_page;
+  pid_t me = __getpid ();
+
+  /* If necessary, keep probing until we wrap around the address space.  */
+  int ret = -1;
+  while (_STACK_GROWS_DOWN
+	 ? (first_probe_point >= stack_end_page)
+	 : (first_probe_point <= stack_end_page))
+    {
+      /* first_probe_point marks a boundary between addresses known to
+         be accessible, and addresses that might not be accessible.
+         In order to ensure that process_vm_readv gives us a partial
+         read rather than a complete failure when we hit the first
+         address that isn't accessible, start the probes half a page
+         backward (i.e. in the opposite direction from probe_stride)
+         from there.  */
+      for (int i = 0; i < PROBES_PER_CALL; i++)
+	stack_iov[i].iov_base =
+	  (void *) (first_probe_point + i * probe_stride - probe_stride / 2);
+
+      ssize_t n = process_vm_readv (me, local_iov, PROBES_PER_CALL,
+				    stack_iov, PROBES_PER_CALL, 0);
+      if (n < 0)
+	break;
+      if (n < PROBES_PER_CALL)
+	{
+	  /* The 0th probe is _before_ first_probe_point.  */
+	  first_probe_point += (n - 1) * probe_stride;
+	  ret = 0;
+	  break;
+	}
+
+      first_probe_point += PROBES_PER_CALL * probe_stride;
+    }
+
+  if (ret == -1)
+    return -1;
+
+  /* first_probe_point marks the actual boundary of the stack's VM
+     region, and is known to be page-aligned.  Calculate the apparent
+     stack size.  */
+  iattr->stackaddr = (void *) stack_end_page;
+  iattr->stacksize = rl.rlim_cur
+    - (_STACK_GROWS_DOWN
+       ? first_probe_point - stack_end_page
+       : stack_end_page - first_probe_point);
+  return 0;
+}
 
 int
-__pthread_getattr_np (pthread_t thread_id, pthread_attr_t *attr)
+__pthread_getattr_np (pthread_t thread_id, pthread_attr_t * attr)
 {
   struct pthread *thread = (struct pthread *) thread_id;
 
@@ -62,112 +169,22 @@ __pthread_getattr_np (pthread_t thread_id, pthread_attr_t *attr)
   if (__glibc_likely (thread->stackblock != NULL))
     {
       /* The stack size reported to the user should not include the
-	 guard size.  */
+         guard size.  */
       iattr->stacksize = thread->stackblock_size - thread->guardsize;
-#if _STACK_GROWS_DOWN
-      iattr->stackaddr = (char *) thread->stackblock
-			 + thread->stackblock_size;
-#else
       iattr->stackaddr = (char *) thread->stackblock;
+#if _STACK_GROWS_DOWN
+      iattr->stackaddr += thread->stackblock_size;
 #endif
     }
   else
     {
       /* No stack information available.  This must be for the initial
-	 thread.  Get the info in some magical way.  */
-
-      /* Stack size limit.  */
-      struct rlimit rl;
-
-      /* The safest way to get the top of the stack is to read
-	 /proc/self/maps and locate the line into which
-	 __libc_stack_end falls.  */
-      FILE *fp = fopen ("/proc/self/maps", "rce");
-      if (fp == NULL)
+         thread.  Get the info in some magical way.  */
+      if (probe_initial_thread_stack (iattr))
 	ret = errno;
-      /* We need the limit of the stack in any case.  */
-      else
-	{
-	  if (__getrlimit (RLIMIT_STACK, &rl) != 0)
-	    ret = errno;
-	  else
-	    {
-	      /* We consider the main process stack to have ended with
-	         the page containing __libc_stack_end.  There is stuff below
-		 it in the stack too, like the program arguments, environment
-		 variables and auxv info, but we ignore those pages when
-		 returning size so that the output is consistent when the
-		 stack is marked executable due to a loaded DSO requiring
-		 it.  */
-	      void *stack_end = (void *) ((uintptr_t) __libc_stack_end
-					  & -(uintptr_t) GLRO(dl_pagesize));
-#if _STACK_GROWS_DOWN
-	      stack_end += GLRO(dl_pagesize);
-#endif
-	      /* We need no locking.  */
-	      __fsetlocking (fp, FSETLOCKING_BYCALLER);
-
-	      /* Until we found an entry (which should always be the case)
-		 mark the result as a failure.  */
-	      ret = ENOENT;
-
-	      char *line = NULL;
-	      size_t linelen = 0;
-#if _STACK_GROWS_DOWN
-	      uintptr_t last_to = 0;
-#endif
-
-	      while (! feof_unlocked (fp))
-		{
-		  if (__getline (&line, &linelen, fp) <= 0)
-		    break;
-
-		  uintptr_t from;
-		  uintptr_t to;
-		  if (sscanf (line, "%" SCNxPTR "-%" SCNxPTR, &from, &to) != 2)
-		    continue;
-		  if (from <= (uintptr_t) __libc_stack_end
-		      && (uintptr_t) __libc_stack_end < to)
-		    {
-		      /* Found the entry.  Now we have the info we need.  */
-		      iattr->stackaddr = stack_end;
-		      iattr->stacksize =
-		        rl.rlim_cur - (size_t) (to - (uintptr_t) stack_end);
-
-		      /* Cut it down to align it to page size since otherwise we
-		         risk going beyond rlimit when the kernel rounds up the
-		         stack extension request.  */
-		      iattr->stacksize = (iattr->stacksize
-					  & -(intptr_t) GLRO(dl_pagesize));
-#if _STACK_GROWS_DOWN
-		      /* The limit might be too high.  */
-		      if ((size_t) iattr->stacksize
-			  > (size_t) iattr->stackaddr - last_to)
-			iattr->stacksize = (size_t) iattr->stackaddr - last_to;
-#else
-		      /* The limit might be too high.  */
-		      if ((size_t) iattr->stacksize
-			  > to - (size_t) iattr->stackaddr)
-			iattr->stacksize = to - (size_t) iattr->stackaddr;
-#endif
-		      /* We succeed and no need to look further.  */
-		      ret = 0;
-		      break;
-		    }
-#if _STACK_GROWS_DOWN
-		  last_to = to;
-#endif
-		}
-
-	      free (line);
-	    }
-
-	  fclose (fp);
-	}
+      iattr->flags |= ATTR_FLAG_STACKADDR;
     }
 
-  iattr->flags |= ATTR_FLAG_STACKADDR;
-
   if (ret == 0)
     {
       size_t size = 16;
@@ -205,6 +222,7 @@ __pthread_getattr_np (pthread_t thread_id, pthread_attr_t *attr)
 
   return ret;
 }
+
 versioned_symbol (libc, __pthread_getattr_np, pthread_getattr_np, GLIBC_2_32);
 
 #if SHLIB_COMPAT (libc, GLIBC_2_2_3, GLIBC_2_32)
diff --git a/nptl/tst-initial-thread-stacksize.c b/nptl/tst-initial-thread-stacksize.c
new file mode 100644
index 0000000000..3b47e32822
--- /dev/null
+++ b/nptl/tst-initial-thread-stacksize.c
@@ -0,0 +1,163 @@
+/* Copyright (C) 2002-2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <inttypes.h>
+#include <pthread.h>
+#include <stackinfo.h>
+#include <stdio.h>
+#include <stdio_ext.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/resource.h>
+
+#include <support/test-driver.h>
+
+extern void *__libc_stack_end;
+
+/* The old algorithm for determining the stack size of the initial
+   thread.  It relies on parsing /proc/self/maps.
+
+   Writes the base address and size of the stack to STACKADDR_P and
+   STACKSIZE_P respectively; returns 0 on success or -1 on failure.  */
+static int
+get_stack_size_proc (void **stackaddr_p, size_t *stacksize_p)
+{
+  uintptr_t pagesize = sysconf (_SC_PAGESIZE);
+  if (pagesize == (uintptr_t) - 1)
+    return -1;
+
+  /* Page-align the guess.  */
+  uintptr_t stack_end_page =
+    _STACK_GROWS_DOWN
+    ? ((uintptr_t) __libc_stack_end + pagesize - 1) & -pagesize
+    : (uintptr_t) __libc_stack_end & -pagesize;
+
+  struct rlimit rl;
+  if (getrlimit (RLIMIT_STACK, &rl))
+    return -1;
+
+  FILE *fp = fopen ("/proc/self/maps", "rce");
+  if (fp == NULL)
+    return -1;
+  /* We need no locking.  */
+  __fsetlocking (fp, FSETLOCKING_BYCALLER);
+
+  int ret = -1;
+  char *line = NULL;
+  size_t linelen = 0;
+  uintptr_t stackaddr = 0;
+  uintptr_t stacksize = 0;
+  uintptr_t from;
+  uintptr_t to;
+#if _STACK_GROWS_DOWN
+  uintptr_t last_to = 0;
+#endif
+
+  while (!feof_unlocked (fp))
+    {
+      if (getline (&line, &linelen, fp) <= 0)
+	break;
+      if (sscanf (line, "%" SCNxPTR "-%" SCNxPTR, &from, &to) != 2)
+	continue;
+      if (from <= stack_end_page && stack_end_page < to)
+	{
+	  /* Found the entry.  Now we have the info we need.  */
+	  stackaddr = stack_end_page;
+	  stacksize = rl.rlim_cur - (to - stack_end_page);
+	  /* We succeed and no need to look further.  */
+	  ret = 0;
+	  break;
+	}
+#if _STACK_GROWS_DOWN
+      last_to = to;
+#endif
+    }
+
+  free (line);
+  fclose (fp);
+  if (ret == 0)
+    {
+      /* Cut stacksize down to align it to page size since otherwise we
+         risk going beyond rlimit when the kernel rounds up the
+         stack extension request.  */
+      stacksize = stacksize & -pagesize;
+      /* The limit might be too high.  */
+#if _STACK_GROWS_DOWN
+      if (stacksize > stackaddr - last_to)
+	stacksize = stackaddr - last_to;
+#else
+      if (stacksize > to - stackaddr)
+	stacksize = to - stackaddr;
+#endif
+      *stackaddr_p =
+        (void *) (_STACK_GROWS_DOWN ? stackaddr - stacksize : stackaddr);
+      *stacksize_p = stacksize;
+    }
+  return ret;
+}
+
+static int
+do_test (void)
+{
+  void *stackaddr_old, *stackaddr_new;
+  size_t stacksize_old, stacksize_new;
+
+  int perr;
+  pthread_attr_t attr;
+  if ((perr = pthread_getattr_np (pthread_self (), &attr)) != 0)
+    {
+      fprintf (stderr, "pthread_getattr_np: %s\n", strerror (perr));
+      return 1;
+    }
+  if ((perr =
+       pthread_attr_getstack (&attr, &stackaddr_new, &stacksize_new)) != 0)
+    {
+      fprintf (stderr, "pthread_attr_getstack: %s\n", strerror (perr));
+      return 1;
+    }
+
+  if (get_stack_size_proc (&stackaddr_old, &stacksize_old))
+    {
+      perror ("get_stack_size_proc");
+      return 1;
+    }
+
+  int ret = 0;
+  if (stackaddr_old != stackaddr_new)
+    {
+      puts ("ERROR: old != new stackaddr");
+      ret = 1;
+    }
+  if (stacksize_old != stacksize_new)
+    {
+      puts ("ERROR: old != new stacksize");
+      ret = 1;
+    }
+
+  if (ret || test_verbose)
+    {
+      printf ("old: stackaddr=%p stacksize=%zu\n",
+              stackaddr_old, stacksize_old);
+      printf ("new: stackaddr=%p stacksize=%zu\n",
+              stackaddr_new, stacksize_new);
+    }
+
+  return ret;
+}
+
+#include <support/test-driver.c>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC PATCH: Don't use /proc/self/maps to calculate size of initial thread stack
  2022-09-09 21:03 RFC PATCH: Don't use /proc/self/maps to calculate size of initial thread stack Zack Weinberg
@ 2022-09-13  9:52 ` Florian Weimer
  2022-09-13 22:03   ` Michael Hudson-Doyle
  2022-09-15 16:09   ` Zack Weinberg
  0 siblings, 2 replies; 11+ messages in thread
From: Florian Weimer @ 2022-09-13  9:52 UTC (permalink / raw)
  To: Zack Weinberg via Libc-alpha; +Cc: Zack Weinberg

* Zack Weinberg via Libc-alpha:

> When pthread_getattr_np is applied to the initial thread, it has to
> figure out how big the initial thread's stack is.  Since the initial
> thread's stack is lazily allocated and the kernel reuses that memory
> region for the "information block" (argv, environ, etc) there are
> several different ways one could define "the size of the initial
> thread's stack"; for many years, the NPTL implementation has said that
> the stack starts at __libc_stack_end, rounded in the opposite
> direction from stack growth to the nearest page boundary, and extends
> for getrlimit(RLIMIT_STACK).rlim_cur bytes, *minus the size of the
> information block*, which is beyond __libc_stack_end.  The rationale
> is that the resource limit is enforced against the entire memory area,
> so if we don't subtract the size of the information block, then the
> program will run out of stack a few pages before pthread_attr_getstack
> says it will.

Do we actually have to subtract the size of the information block?
One could argue that this is just part of the arguments passed to main,
so sort-of-but-not-quite part of main's stack frame.

process_vm_readv seems quite likely to get blocked by seccomp filters.

Maybe we can get the kernel to pass the end of the stack in the
auxiliary vector?

Thanks,
Florian


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC PATCH: Don't use /proc/self/maps to calculate size of initial thread stack
  2022-09-13  9:52 ` Florian Weimer
@ 2022-09-13 22:03   ` Michael Hudson-Doyle
  2022-09-15 16:09   ` Zack Weinberg
  1 sibling, 0 replies; 11+ messages in thread
From: Michael Hudson-Doyle @ 2022-09-13 22:03 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Zack Weinberg via Libc-alpha, Zack Weinberg

[-- Attachment #1: Type: text/plain, Size: 1621 bytes --]

On Tue, 13 Sept 2022 at 21:53, Florian Weimer via Libc-alpha <
libc-alpha@sourceware.org> wrote:

> * Zack Weinberg via Libc-alpha:
>
> > When pthread_getattr_np is applied to the initial thread, it has to
> > figure out how big the initial thread's stack is.  Since the initial
> > thread's stack is lazily allocated and the kernel reuses that memory
> > region for the "information block" (argv, environ, etc) there are
> > several different ways one could define "the size of the initial
> > thread's stack"; for many years, the NPTL implementation has said that
> > the stack starts at __libc_stack_end, rounded in the opposite
> > direction from stack growth to the nearest page boundary, and extends
> > for getrlimit(RLIMIT_STACK).rlim_cur bytes, *minus the size of the
> > information block*, which is beyond __libc_stack_end.  The rationale
> > is that the resource limit is enforced against the entire memory area,
> > so if we don't subtract the size of the information block, then the
> > program will run out of stack a few pages before pthread_attr_getstack
> > says it will.
>
> Do we actually have to subtract the size of the information block?
> One could argue that this is just part of the arguments passed to main,
> so sort-of-but-not-quite part of main's stack frame.
>
> process_vm_readv seems quite likely to get blocked by seccomp filters.
>
> Maybe we can get the kernel to pass the end of the stack in the
> auxiliary vector?
>

I wondered the same when reading the first mail fwiw -- guessing what the
kernel did is always going to be more annoying than just having it tell
us...

Cheers,
mwh

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC PATCH: Don't use /proc/self/maps to calculate size of initial thread stack
  2022-09-13  9:52 ` Florian Weimer
  2022-09-13 22:03   ` Michael Hudson-Doyle
@ 2022-09-15 16:09   ` Zack Weinberg
  2022-09-20 12:16     ` Florian Weimer
  2022-09-21 20:58     ` Adhemerval Zanella Netto
  1 sibling, 2 replies; 11+ messages in thread
From: Zack Weinberg @ 2022-09-15 16:09 UTC (permalink / raw)
  To: Florian Weimer, GNU libc development

On Tue, Sep 13, 2022, at 5:52 AM, Florian Weimer wrote:
> * Zack Weinberg via Libc-alpha:
>> for many years, the NPTL implementation has said that
>> the stack starts at __libc_stack_end, rounded in the opposite
>> direction from stack growth to the nearest page boundary, and extends
>> for getrlimit(RLIMIT_STACK).rlim_cur bytes, *minus the size of the
>> information block*, which is beyond __libc_stack_end.  The rationale
>> is that the resource limit is enforced against the entire memory area,
>> so if we don't subtract the size of the information block, then the
>> program will run out of stack a few pages before pthread_attr_getstack
>> says it will.
>
> Do we actually have to subtract the size of the information block?
> One could argue that this is just part of the arguments passed to main,
> so sort-of-but-not-quite part of main's stack frame.

We could make that change, but we'd need to make other changes as well
to keep everything consistent, and I'm not sure _how_ to make that
change without having the information that pthread_getattr_np is probing for.

Suppose 'stackaddr' and 'stacksize' are the values reported by
pthread_attr_getstack when applied to the initial thread. Then the
invariants I think we need to preserve are:

  stacksize <= getrlimit(RLIMIT_STACK).rlim_cur
  stackaddr % getpagesize() == 0
  if the stack grows downward in memory, it must be OK to grow the
     stack down to, but not necessarily beyond, stackaddr
  conversely, if the stack grows upward, it must be OK to grow the
     stack up to, but not necessarily beyond, stackaddr + stacksize

Now, the entire headache here is that __libc_stack_end is *not*
necessarily page aligned and (on an architecture where the stack grows
downward in memory)

  __libc_stack_end - getrlimit(RLIMIT_STACK).rlim_cur

will be a pointer to somewhere *beyond* the lowest address that the
kernel will enlarge the stack to, even if you round __libc_stack_end
up to the next page boundary before the subtraction.  The function of
the code changed by my patch -- before and after -- is to determine
the actual boundaries of the lazy-allocation region for the initial
thread's stack.

If we changed __libc_stack_end to point to the "bottom" (opposite the
direction of stack growth) of the entire stack region, then we could
simply subtract the rlimit size from it and have stackaddr.  But
that's exactly the challenge: how do we know where that "bottom" is?

I don't know where __libc_stack_end is set.  Early startup code should
be able to do things that pthread_attr_t can't, like "find the
end-most address among all the pointers in argv, envp, and auxv, then
round end-wards to a page boundary" (where "end-most" and "end-wards"
mean "in the direction opposite to stack growth") but that might not
always give the right answer.  I also don't know if there's any
existing code in libc that depends on __libc_stack_end _not_ pointing
past the information block (of course we could always add a new
__libc_info_block_end, or just fill in the initial thread's pthread_t
more thoroughly).

> process_vm_readv seems quite likely to get blocked by seccomp filters.

I was worried about that too :-/

> Maybe we can get the kernel to pass the end of the stack in the
> auxiliary vector?

Sure, but then what do we do on older kernels?  I'm reluctant to say
"keep the old code" because we know this is breaking for people right
now (although honestly "mount /proc earlier" isn't a terrible
suggestion for a workaround).

zw

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC PATCH: Don't use /proc/self/maps to calculate size of initial thread stack
  2022-09-15 16:09   ` Zack Weinberg
@ 2022-09-20 12:16     ` Florian Weimer
  2022-09-21 12:41       ` Zack Weinberg
  2022-09-21 20:58     ` Adhemerval Zanella Netto
  1 sibling, 1 reply; 11+ messages in thread
From: Florian Weimer @ 2022-09-20 12:16 UTC (permalink / raw)
  To: Zack Weinberg via Libc-alpha; +Cc: Zack Weinberg

* Zack Weinberg via Libc-alpha:

>> process_vm_readv seems quite likely to get blocked by seccomp filters.
>
> I was worried about that too :-/

I think we see that in the pre-commit CI builder.

>> Maybe we can get the kernel to pass the end of the stack in the
>> auxiliary vector?
>
> Sure, but then what do we do on older kernels?  I'm reluctant to say
> "keep the old code" because we know this is breaking for people right
> now (although honestly "mount /proc earlier" isn't a terrible
> suggestion for a workaround).

We can keep doing what we are doing on older kernels.  I don't think we
should add yet another fallback path for this in case /proc isn't
available and the kernel doesn't provide the (future) AT_* entry.  *Two*
fallback paths instead of one seems a bit over the top.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC PATCH: Don't use /proc/self/maps to calculate size of initial thread stack
  2022-09-20 12:16     ` Florian Weimer
@ 2022-09-21 12:41       ` Zack Weinberg
  2022-09-21 13:01         ` Florian Weimer
  0 siblings, 1 reply; 11+ messages in thread
From: Zack Weinberg @ 2022-09-21 12:41 UTC (permalink / raw)
  To: GNU libc development

On Tue, Sep 20, 2022, at 8:16 AM, Florian Weimer via Libc-alpha wrote:
>>> Maybe we can get the kernel to pass the end of the stack in the
>>> auxiliary vector?
>>
>> Sure, but then what do we do on older kernels?  I'm reluctant to say
>> "keep the old code" because we know this is breaking for people right
>> now (although honestly "mount /proc earlier" isn't a terrible
>> suggestion for a workaround).
>
> We can keep doing what we are doing on older kernels.  I don't think we
> should add yet another fallback path for this in case /proc isn't
> available and the kernel doesn't provide the (future) AT_* entry.  *Two*
> fallback paths instead of one seems a bit over the top.

Fair enough.  I have no experience writing kernel patches, do you know anyone who can make the additional AT entry happen?

zw

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC PATCH: Don't use /proc/self/maps to calculate size of initial thread stack
  2022-09-21 12:41       ` Zack Weinberg
@ 2022-09-21 13:01         ` Florian Weimer
  0 siblings, 0 replies; 11+ messages in thread
From: Florian Weimer @ 2022-09-21 13:01 UTC (permalink / raw)
  To: Zack Weinberg via Libc-alpha; +Cc: Zack Weinberg

* Zack Weinberg via Libc-alpha:

> On Tue, Sep 20, 2022, at 8:16 AM, Florian Weimer via Libc-alpha wrote:
>>>> Maybe we can get the kernel to pass the end of the stack in the
>>>> auxiliary vector?
>>>
>>> Sure, but then what do we do on older kernels?  I'm reluctant to say
>>> "keep the old code" because we know this is breaking for people right
>>> now (although honestly "mount /proc earlier" isn't a terrible
>>> suggestion for a workaround).
>>
>> We can keep doing what we are doing on older kernels.  I don't think we
>> should add yet another fallback path for this in case /proc isn't
>> available and the kernel doesn't provide the (future) AT_* entry.  *Two*
>> fallback paths instead of one seems a bit over the top.
>
> Fair enough.  I have no experience writing kernel patches, do you know
> anyone who can make the additional AT entry happen?

Sorry, no idea.  The easiest way to get people's attention seems to be,
post a slightly broken patch to fs/binfmt_elf.c.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC PATCH: Don't use /proc/self/maps to calculate size of initial thread stack
  2022-09-15 16:09   ` Zack Weinberg
  2022-09-20 12:16     ` Florian Weimer
@ 2022-09-21 20:58     ` Adhemerval Zanella Netto
  2022-09-23 14:59       ` Zack Weinberg
  1 sibling, 1 reply; 11+ messages in thread
From: Adhemerval Zanella Netto @ 2022-09-21 20:58 UTC (permalink / raw)
  To: libc-alpha



On 15/09/22 13:09, Zack Weinberg via Libc-alpha wrote:
> On Tue, Sep 13, 2022, at 5:52 AM, Florian Weimer wrote:
>> * Zack Weinberg via Libc-alpha:
>>> for many years, the NPTL implementation has said that
>>> the stack starts at __libc_stack_end, rounded in the opposite
>>> direction from stack growth to the nearest page boundary, and extends
>>> for getrlimit(RLIMIT_STACK).rlim_cur bytes, *minus the size of the
>>> information block*, which is beyond __libc_stack_end.  The rationale
>>> is that the resource limit is enforced against the entire memory area,
>>> so if we don't subtract the size of the information block, then the
>>> program will run out of stack a few pages before pthread_attr_getstack
>>> says it will.
>>
>> Do we actually have to subtract the size of the information block?
>> One could argue that this is just part of the arguments passed to main,
>> so sort-of-but-not-quite part of main's stack frame.
> 
> We could make that change, but we'd need to make other changes as well
> to keep everything consistent, and I'm not sure _how_ to make that
> change without having the information that pthread_getattr_np is probing for.
> 
> Suppose 'stackaddr' and 'stacksize' are the values reported by
> pthread_attr_getstack when applied to the initial thread. Then the
> invariants I think we need to preserve are:
> 
>   stacksize <= getrlimit(RLIMIT_STACK).rlim_cur
>   stackaddr % getpagesize() == 0
>   if the stack grows downward in memory, it must be OK to grow the
>      stack down to, but not necessarily beyond, stackaddr
>   conversely, if the stack grows upward, it must be OK to grow the
>      stack up to, but not necessarily beyond, stackaddr + stacksize
> 
> Now, the entire headache here is that __libc_stack_end is *not*
> necessarily page aligned and (on an architecture where the stack grows
> downward in memory)
> 
>   __libc_stack_end - getrlimit(RLIMIT_STACK).rlim_cur
> 
> will be a pointer to somewhere *beyond* the lowest address that the
> kernel will enlarge the stack to, even if you round __libc_stack_end
> up to the next page boundary before the subtraction.  The function of
> the code changed by my patch -- before and after -- is to determine
> the actual boundaries of the lazy-allocation region for the initial
> thread's stack.
> 
> If we changed __libc_stack_end to point to the "bottom" (opposite the
> direction of stack growth) of the entire stack region, then we could
> simply subtract the rlimit size from it and have stackaddr.  But
> that's exactly the challenge: how do we know where that "bottom" is?
> 
> I don't know where __libc_stack_end is set.  Early startup code should
> be able to do things that pthread_attr_t can't, like "find the
> end-most address among all the pointers in argv, envp, and auxv, then
> round end-wards to a page boundary" (where "end-most" and "end-wards"
> mean "in the direction opposite to stack growth") but that might not
> always give the right answer.  I also don't know if there's any
> existing code in libc that depends on __libc_stack_end _not_ pointing
> past the information block (of course we could always add a new
> __libc_info_block_end, or just fill in the initial thread's pthread_t
> more thoroughly).
> 
>> process_vm_readv seems quite likely to get blocked by seccomp filters.
> 
> I was worried about that too :-/
> 
>> Maybe we can get the kernel to pass the end of the stack in the
>> auxiliary vector?
> 
> Sure, but then what do we do on older kernels?  I'm reluctant to say
> "keep the old code" because we know this is breaking for people right
> now (although honestly "mount /proc earlier" isn't a terrible
> suggestion for a workaround).
> 
> zw

I wonder if we could use inplace mremap (which should be a nop) to inform
a more approximate value for the stack (the code only handles grown down
architecture):

  uintptr_t pagesize = GLRO(dl_pagesize);
  char *stack_end_page = (char*) ALIGN_UP ((uintptr_t) __libc_stack_end,
                                           pagesize);

  size_t stacksize = pagesize;
  while (mremap (stack_end_page - stacksize - pagesize, pagesize,
                 2 * pagesize, 0)
         == MAP_FAILED && errno == ENOMEM)
    stacksize += pagesize;

  iattr->stackaddr = (void *) stack_end_page;
  iattr->stacksize = stacksize;


On x86_64 it does show a value more similar to what [stack] segment reports,
for instance with:

  7ffffffdd000-7ffffffff000 rw-p 00000000 00:00 0                          [stack]

It returns stackaddr as 0x7ffffffdd000 with stacksize as 0x21000.  It does 
not return the same value as current implementation, but I also see that
current implementation returns both the address and size way large than
what /proc/self/maps actually maps for the process.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC PATCH: Don't use /proc/self/maps to calculate size of initial thread stack
  2022-09-21 20:58     ` Adhemerval Zanella Netto
@ 2022-09-23 14:59       ` Zack Weinberg
  2022-09-23 15:24         ` Adhemerval Zanella Netto
  2022-09-23 18:57         ` Florian Weimer
  0 siblings, 2 replies; 11+ messages in thread
From: Zack Weinberg @ 2022-09-23 14:59 UTC (permalink / raw)
  To: GNU libc development

On Wed, Sep 21, 2022, at 4:58 PM, Adhemerval Zanella Netto via Libc-alpha wrote:
> I wonder if we could use inplace mremap (which should be a nop) to inform
> a more approximate value for the stack (the code only handles grown down
> architecture):

mremap is unusual enough that I think we'd have the same problem with seccomp filters that Florian pointed out regarding process_vm_readv.

zw

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC PATCH: Don't use /proc/self/maps to calculate size of initial thread stack
  2022-09-23 14:59       ` Zack Weinberg
@ 2022-09-23 15:24         ` Adhemerval Zanella Netto
  2022-09-23 18:57         ` Florian Weimer
  1 sibling, 0 replies; 11+ messages in thread
From: Adhemerval Zanella Netto @ 2022-09-23 15:24 UTC (permalink / raw)
  To: libc-alpha



On 23/09/22 11:59, Zack Weinberg via Libc-alpha wrote:
> On Wed, Sep 21, 2022, at 4:58 PM, Adhemerval Zanella Netto via Libc-alpha wrote:
>> I wonder if we could use inplace mremap (which should be a nop) to inform
>> a more approximate value for the stack (the code only handles grown down
>> architecture):
> 
> mremap is unusual enough that I think we'd have the same problem with seccomp filters that Florian pointed out regarding process_vm_readv.
> 
> zw

At least with debian code search [1], it seems mremap is used on some common 
projects and it way simpler than process_vm_readv. 

[1] https://codesearch.debian.net/search?q=mremap&literal=1

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC PATCH: Don't use /proc/self/maps to calculate size of initial thread stack
  2022-09-23 14:59       ` Zack Weinberg
  2022-09-23 15:24         ` Adhemerval Zanella Netto
@ 2022-09-23 18:57         ` Florian Weimer
  1 sibling, 0 replies; 11+ messages in thread
From: Florian Weimer @ 2022-09-23 18:57 UTC (permalink / raw)
  To: Zack Weinberg via Libc-alpha; +Cc: Zack Weinberg

* Zack Weinberg via Libc-alpha:

> On Wed, Sep 21, 2022, at 4:58 PM, Adhemerval Zanella Netto via Libc-alpha wrote:
>> I wonder if we could use inplace mremap (which should be a nop) to inform
>> a more approximate value for the stack (the code only handles grown down
>> architecture):
>
> mremap is unusual enough that I think we'd have the same problem with
> seccomp filters that Florian pointed out regarding process_vm_readv.

glibc uses mremap for realloc in some cases.  And the system call does
not directly permit mapping memory of a separate process.
(process_vm_readv is one of the system calls that looks like it would be
deliberately blocked, not just accidentally.)

But I'd worry about consistent mremap semantics for this particular use
case here.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-09-23 18:57 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-09 21:03 RFC PATCH: Don't use /proc/self/maps to calculate size of initial thread stack Zack Weinberg
2022-09-13  9:52 ` Florian Weimer
2022-09-13 22:03   ` Michael Hudson-Doyle
2022-09-15 16:09   ` Zack Weinberg
2022-09-20 12:16     ` Florian Weimer
2022-09-21 12:41       ` Zack Weinberg
2022-09-21 13:01         ` Florian Weimer
2022-09-21 20:58     ` Adhemerval Zanella Netto
2022-09-23 14:59       ` Zack Weinberg
2022-09-23 15:24         ` Adhemerval Zanella Netto
2022-09-23 18:57         ` Florian Weimer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).