* [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
@ 2015-04-24 13:45 Florian Weimer
2015-05-05 15:37 ` Florian Weimer
2015-05-05 20:28 ` Carlos O'Donell
0 siblings, 2 replies; 25+ messages in thread
From: Florian Weimer @ 2015-04-24 13:45 UTC (permalink / raw)
To: libc-alpha
The previous implementation could result in silent data corruption,
and this has been observed to happen with application code.
---
ChangeLog | 18 ++++
NEWS | 22 ++--
sysdeps/posix/posix_fallocate.c | 93 -----------------
sysdeps/posix/posix_fallocate64.c | 113 ---------------------
.../sysv/linux/mips/mips64/n32/posix_fallocate.c | 8 +-
.../sysv/linux/mips/mips64/n32/posix_fallocate64.c | 9 +-
sysdeps/unix/sysv/linux/posix_fallocate.c | 8 +-
sysdeps/unix/sysv/linux/posix_fallocate64.c | 26 +++--
.../unix/sysv/linux/wordsize-64/posix_fallocate.c | 10 +-
9 files changed, 56 insertions(+), 251 deletions(-)
delete mode 100644 sysdeps/posix/posix_fallocate.c
delete mode 100644 sysdeps/posix/posix_fallocate64.c
diff --git a/ChangeLog b/ChangeLog
index b927022..9219d8b 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,23 @@
2015-04-24 Florian Weimer <fweimer@redhat.com>
+ [BZ#15661]
+ * sysdeps/posix/posix_fallocate.c: Remove.
+ * sysdeps/posix/posix_fallocate64.c: Likewise.
+ * sysdeps/unix/sysv/linux/posix_fallocate.c (posix_fallocate):
+ Remove internal_fallocate function and fallback.
+ * sysdeps/unix/sysv/linux/posix_fallocate64.c
+ (__posix_fallocate64_l64): Likewise. Establish aliases previously
+ defined in sysdeps/posix/posix_fallocate64.c.
+ * sysdeps/unix/sysv/linux/mips/mips64/n32/posix_fallocate.c
+ (posix_fallocate): Remove internal_fallocate function and
+ fallback.
+ * sysdeps/unix/sysv/linux/mips/mips64/n32/posix_fallocate64.c
+ (__posix_fallocate64_l64): Likewise.
+ * sysdeps/unix/sysv/linux/wordsize-64/posix_fallocate.c
+ (posix_fallocate): Likewise.
+
+2015-04-24 Florian Weimer <fweimer@redhat.com>
+
* sysdeps/unix/sysv/linux/posix_fallocate.c (posix_fallocate):
Assume __ASSUME_FALLOCATE is always true.
* sysdeps/unix/sysv/linux/posix_fallocate64.c
diff --git a/NEWS b/NEWS
index ccc4d13..016629f 100644
--- a/NEWS
+++ b/NEWS
@@ -9,14 +9,14 @@ Version 2.22
* The following bugs are resolved with this release:
- 4719, 6792, 13064, 14094, 14841, 14906, 15319, 15467, 15790, 15969, 16351,
- 16512, 16560, 16783, 16850, 17090, 17195, 17269, 17523, 17542, 17569,
- 17588, 17596, 17620, 17621, 17628, 17631, 17711, 17776, 17779, 17792,
- 17836, 17912, 17916, 17930, 17932, 17944, 17949, 17964, 17965, 17967,
- 17969, 17978, 17987, 17991, 17996, 17998, 17999, 18019, 18020, 18029,
- 18030, 18032, 18036, 18038, 18039, 18042, 18043, 18046, 18047, 18068,
- 18080, 18093, 18100, 18104, 18110, 18111, 18128, 18138, 18185, 18197,
- 18206, 18210, 18211, 18247, 18287.
+ 4719, 6792, 13064, 14094, 14841, 14906, 15319, 15467, 15661, 15790, 15969,
+ 16351, 16512, 16560, 16783, 16850, 17090, 17195, 17269, 17523, 17542,
+ 17569, 17588, 17596, 17620, 17621, 17628, 17631, 17711, 17776, 17779,
+ 17792, 17836, 17912, 17916, 17930, 17932, 17944, 17949, 17964, 17965,
+ 17967, 17969, 17978, 17987, 17991, 17996, 17998, 17999, 18019, 18020,
+ 18029, 18030, 18032, 18036, 18038, 18039, 18042, 18043, 18046, 18047,
+ 18068, 18080, 18093, 18100, 18104, 18110, 18111, 18128, 18138, 18185,
+ 18197, 18206, 18210, 18211, 18247, 18287.
* A buffer overflow in gethostbyname_r and related functions performing DNS
requests has been fixed. If the NSS functions were called with a
@@ -25,6 +25,12 @@ Version 2.22
potentially arbitrary code execution, using crafted, but syntactically
valid DNS responses. (CVE-2015-1781)
+* The fallback emulation of posix_fallocate and posix_fallocate64 was
+ removed because it could result in silent data corruption on file systems
+ which do not implement fallocate support in the kernel. posix_fallocate
+ and posix_fallocate64 will now fail and return ENOTSUP if the file system
+ does not support fallocate operations.
+
* A powerpc and powerpc64 optimization for TLS, similar to TLS descriptors
for LD and GD on x86 and x86-64, has been implemented. You will need
binutils-2.24 or later to enable this optimization.
diff --git a/sysdeps/posix/posix_fallocate.c b/sysdeps/posix/posix_fallocate.c
deleted file mode 100644
index d15d603..0000000
--- a/sysdeps/posix/posix_fallocate.c
+++ /dev/null
@@ -1,93 +0,0 @@
-/* Copyright (C) 2000-2015 Free Software Foundation, Inc.
- This file is part of the GNU C Library.
-
- The GNU C Library is free software; you can redistribute it and/or
- modify it under the terms of the GNU Lesser General Public
- License as published by the Free Software Foundation; either
- version 2.1 of the License, or (at your option) any later version.
-
- The GNU C Library is distributed in the hope that it will be useful,
- but WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
- Lesser General Public License for more details.
-
- You should have received a copy of the GNU Lesser General Public
- License along with the GNU C Library; if not, see
- <http://www.gnu.org/licenses/>. */
-
-#include <errno.h>
-#include <fcntl.h>
-#include <unistd.h>
-#include <sys/stat.h>
-#include <sys/statfs.h>
-
-/* Reserve storage for the data of the file associated with FD. */
-
-int
-posix_fallocate (int fd, __off_t offset, __off_t len)
-{
- struct stat64 st;
- struct statfs f;
-
- /* `off_t' is a signed type. Therefore we can determine whether
- OFFSET + LEN is too large if it is a negative value. */
- if (offset < 0 || len < 0)
- return EINVAL;
- if (offset + len < 0)
- return EFBIG;
-
- /* First thing we have to make sure is that this is really a regular
- file. */
- if (__fxstat64 (_STAT_VER, fd, &st) != 0)
- return EBADF;
- if (S_ISFIFO (st.st_mode))
- return ESPIPE;
- if (! S_ISREG (st.st_mode))
- return ENODEV;
-
- if (len == 0)
- {
- if (st.st_size < offset)
- {
- int ret = __ftruncate (fd, offset);
-
- if (ret != 0)
- ret = errno;
- return ret;
- }
- return 0;
- }
-
- /* We have to know the block size of the filesystem to get at least some
- sort of performance. */
- if (__fstatfs (fd, &f) != 0)
- return errno;
-
- /* Try to play safe. */
- if (f.f_bsize == 0)
- f.f_bsize = 512;
-
- /* Write something to every block. */
- for (offset += (len - 1) % f.f_bsize; len > 0; offset += f.f_bsize)
- {
- len -= f.f_bsize;
-
- if (offset < st.st_size)
- {
- unsigned char c;
- ssize_t rsize = __pread (fd, &c, 1, offset);
-
- if (rsize < 0)
- return errno;
- /* If there is a non-zero byte, the block must have been
- allocated already. */
- else if (rsize == 1 && c != 0)
- continue;
- }
-
- if (__pwrite (fd, "", 1, offset) != 1)
- return errno;
- }
-
- return 0;
-}
diff --git a/sysdeps/posix/posix_fallocate64.c b/sysdeps/posix/posix_fallocate64.c
deleted file mode 100644
index b845df7..0000000
--- a/sysdeps/posix/posix_fallocate64.c
+++ /dev/null
@@ -1,113 +0,0 @@
-/* Copyright (C) 2000-2015 Free Software Foundation, Inc.
- This file is part of the GNU C Library.
-
- The GNU C Library is free software; you can redistribute it and/or
- modify it under the terms of the GNU Lesser General Public
- License as published by the Free Software Foundation; either
- version 2.1 of the License, or (at your option) any later version.
-
- The GNU C Library is distributed in the hope that it will be useful,
- but WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
- Lesser General Public License for more details.
-
- You should have received a copy of the GNU Lesser General Public
- License along with the GNU C Library; if not, see
- <http://www.gnu.org/licenses/>. */
-
-#include <errno.h>
-#include <fcntl.h>
-#include <unistd.h>
-#include <sys/stat.h>
-#include <sys/statfs.h>
-
-/* Reserve storage for the data of the file associated with FD. */
-
-int
-__posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len)
-{
- struct stat64 st;
- struct statfs64 f;
-
- /* `off64_t' is a signed type. Therefore we can determine whether
- OFFSET + LEN is too large if it is a negative value. */
- if (offset < 0 || len < 0)
- return EINVAL;
- if (offset + len < 0)
- return EFBIG;
-
- /* First thing we have to make sure is that this is really a regular
- file. */
- if (__fxstat64 (_STAT_VER, fd, &st) != 0)
- return EBADF;
- if (S_ISFIFO (st.st_mode))
- return ESPIPE;
- if (! S_ISREG (st.st_mode))
- return ENODEV;
-
- if (len == 0)
- {
- if (st.st_size < offset)
- {
- int ret = __ftruncate64 (fd, offset);
-
- if (ret != 0)
- ret = errno;
- return ret;
- }
- return 0;
- }
-
- /* We have to know the block size of the filesystem to get at least some
- sort of performance. */
- if (__fstatfs64 (fd, &f) != 0)
- return errno;
-
- /* Try to play safe. */
- if (f.f_bsize == 0)
- f.f_bsize = 512;
-
- /* Write something to every block. */
- for (offset += (len - 1) % f.f_bsize; len > 0; offset += f.f_bsize)
- {
- len -= f.f_bsize;
-
- if (offset < st.st_size)
- {
- unsigned char c;
- ssize_t rsize = __libc_pread64 (fd, &c, 1, offset);
-
- if (rsize < 0)
- return errno;
- /* If there is a non-zero byte, the block must have been
- allocated already. */
- else if (rsize == 1 && c != 0)
- continue;
- }
-
- if (__libc_pwrite64 (fd, "", 1, offset) != 1)
- return errno;
- }
-
- return 0;
-}
-
-#undef __posix_fallocate64_l64
-#include <shlib-compat.h>
-#include <bits/wordsize.h>
-
-#if __WORDSIZE == 32 && SHLIB_COMPAT(libc, GLIBC_2_2, GLIBC_2_3_3)
-
-int
-attribute_compat_text_section
-__posix_fallocate64_l32 (int fd, off64_t offset, size_t len)
-{
- return __posix_fallocate64_l64 (fd, offset, len);
-}
-
-versioned_symbol (libc, __posix_fallocate64_l64, posix_fallocate64,
- GLIBC_2_3_3);
-compat_symbol (libc, __posix_fallocate64_l32, posix_fallocate64, GLIBC_2_2);
-#else
-strong_alias (__posix_fallocate64_l64, posix_fallocate64);
-#endif
diff --git a/sysdeps/unix/sysv/linux/mips/mips64/n32/posix_fallocate.c b/sysdeps/unix/sysv/linux/mips/mips64/n32/posix_fallocate.c
index a9c8d73..5d926f5 100644
--- a/sysdeps/unix/sysv/linux/mips/mips64/n32/posix_fallocate.c
+++ b/sysdeps/unix/sysv/linux/mips/mips64/n32/posix_fallocate.c
@@ -18,10 +18,6 @@
#include <fcntl.h>
#include <sysdep.h>
-#define posix_fallocate static internal_fallocate
-#include <sysdeps/posix/posix_fallocate.c>
-#undef posix_fallocate
-
/* Reserve storage for the data of the file associated with FD. */
int
posix_fallocate (int fd, __off_t offset, __off_t len)
@@ -31,7 +27,5 @@ posix_fallocate (int fd, __off_t offset, __off_t len)
if (! INTERNAL_SYSCALL_ERROR_P (res, err))
return 0;
- if (INTERNAL_SYSCALL_ERRNO (res, err) != EOPNOTSUPP)
- return INTERNAL_SYSCALL_ERRNO (res, err);
- return internal_fallocate (fd, offset, len);
+ return INTERNAL_SYSCALL_ERRNO (res, err);
}
diff --git a/sysdeps/unix/sysv/linux/mips/mips64/n32/posix_fallocate64.c b/sysdeps/unix/sysv/linux/mips/mips64/n32/posix_fallocate64.c
index 503e918..5d3a636 100644
--- a/sysdeps/unix/sysv/linux/mips/mips64/n32/posix_fallocate64.c
+++ b/sysdeps/unix/sysv/linux/mips/mips64/n32/posix_fallocate64.c
@@ -18,11 +18,6 @@
#include <fcntl.h>
#include <sysdep.h>
-extern int __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len);
-#define __posix_fallocate64_l64 static internal_fallocate64
-#include <sysdeps/posix/posix_fallocate64.c>
-#undef __posix_fallocate64_l64
-
/* Reserve storage for the data of the file associated with FD. */
int
__posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len)
@@ -32,7 +27,5 @@ __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len)
if (! INTERNAL_SYSCALL_ERROR_P (res, err))
return 0;
- if (INTERNAL_SYSCALL_ERRNO (res, err) != EOPNOTSUPP)
- return INTERNAL_SYSCALL_ERRNO (res, err);
- return internal_fallocate64 (fd, offset, len);
+ return INTERNAL_SYSCALL_ERRNO (res, err);
}
diff --git a/sysdeps/unix/sysv/linux/posix_fallocate.c b/sysdeps/unix/sysv/linux/posix_fallocate.c
index 4587029..b6124db 100644
--- a/sysdeps/unix/sysv/linux/posix_fallocate.c
+++ b/sysdeps/unix/sysv/linux/posix_fallocate.c
@@ -18,10 +18,6 @@
#include <fcntl.h>
#include <sysdep.h>
-#define posix_fallocate static internal_fallocate
-#include <sysdeps/posix/posix_fallocate.c>
-#undef posix_fallocate
-
/* Reserve storage for the data of the file associated with FD. */
int
posix_fallocate (int fd, __off_t offset, __off_t len)
@@ -33,7 +29,5 @@ posix_fallocate (int fd, __off_t offset, __off_t len)
if (! INTERNAL_SYSCALL_ERROR_P (res, err))
return 0;
- if (INTERNAL_SYSCALL_ERRNO (res, err) != EOPNOTSUPP)
- return INTERNAL_SYSCALL_ERRNO (res, err);
- return internal_fallocate (fd, offset, len);
+ return INTERNAL_SYSCALL_ERRNO (res, err);
}
diff --git a/sysdeps/unix/sysv/linux/posix_fallocate64.c b/sysdeps/unix/sysv/linux/posix_fallocate64.c
index 771e59c..97c5a57 100644
--- a/sysdeps/unix/sysv/linux/posix_fallocate64.c
+++ b/sysdeps/unix/sysv/linux/posix_fallocate64.c
@@ -15,14 +15,11 @@
License along with the GNU C Library; if not, see
<http://www.gnu.org/licenses/>. */
+#include <bits/wordsize.h>
#include <fcntl.h>
+#include <shlib-compat.h>
#include <sysdep.h>
-extern int __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len);
-#define __posix_fallocate64_l64 static internal_fallocate64
-#include <sysdeps/posix/posix_fallocate64.c>
-#undef __posix_fallocate64_l64
-
/* Reserve storage for the data of the file associated with FD. */
int
__posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len)
@@ -36,7 +33,20 @@ __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len)
if (! INTERNAL_SYSCALL_ERROR_P (res, err))
return 0;
- if (INTERNAL_SYSCALL_ERRNO (res, err) != EOPNOTSUPP)
- return INTERNAL_SYSCALL_ERRNO (res, err);
- return internal_fallocate64 (fd, offset, len);
+ return INTERNAL_SYSCALL_ERRNO (res, err);
+}
+
+#if __WORDSIZE == 32 && SHLIB_COMPAT(libc, GLIBC_2_2, GLIBC_2_3_3)
+int
+attribute_compat_text_section
+__posix_fallocate64_l32 (int fd, off64_t offset, size_t len)
+{
+ return __posix_fallocate64_l64 (fd, offset, len);
}
+
+versioned_symbol (libc, __posix_fallocate64_l64, posix_fallocate64,
+ GLIBC_2_3_3);
+compat_symbol (libc, __posix_fallocate64_l32, posix_fallocate64, GLIBC_2_2);
+#else
+strong_alias (__posix_fallocate64_l64, posix_fallocate64);
+#endif
diff --git a/sysdeps/unix/sysv/linux/wordsize-64/posix_fallocate.c b/sysdeps/unix/sysv/linux/wordsize-64/posix_fallocate.c
index 8ae8a29..992d8cb 100644
--- a/sysdeps/unix/sysv/linux/wordsize-64/posix_fallocate.c
+++ b/sysdeps/unix/sysv/linux/wordsize-64/posix_fallocate.c
@@ -16,13 +16,10 @@
<http://www.gnu.org/licenses/>. */
#include <fcntl.h>
+#include <errno.h>
#include <kernel-features.h>
#include <sysdep.h>
-#define posix_fallocate static internal_fallocate
-#include <sysdeps/posix/posix_fallocate.c>
-#undef posix_fallocate
-
/* The alpha architecture introduced the fallocate system call in
2.6.33-rc1, so we still need the fallback code. */
#if !defined __ASSUME_FALLOCATE && defined __NR_fallocate
@@ -56,11 +53,10 @@ posix_fallocate (int fd, __off_t offset, __off_t len)
__have_fallocate = -1;
else
# endif
- if (INTERNAL_SYSCALL_ERRNO (res, err) != EOPNOTSUPP)
- return INTERNAL_SYSCALL_ERRNO (res, err);
+ return INTERNAL_SYSCALL_ERRNO (res, err);
}
#endif
- return internal_fallocate (fd, offset, len);
+ return ENOSYS;
}
weak_alias (posix_fallocate, posix_fallocate64)
--
2.1.0
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-04-24 13:45 [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661] Florian Weimer
@ 2015-05-05 15:37 ` Florian Weimer
2015-05-05 15:59 ` Paul Eggert
2015-05-05 20:28 ` Carlos O'Donell
1 sibling, 1 reply; 25+ messages in thread
From: Florian Weimer @ 2015-05-05 15:37 UTC (permalink / raw)
To: libc-alpha
On 04/24/2015 02:53 PM, Florian Weimer wrote:
> The previous implementation could result in silent data corruption,
> and this has been observed to happen with application code.
I'd appreciate some comment on this patch. Do you agree that this is
the right approach?
--
Florian Weimer / Red Hat Product Security
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-05-05 15:37 ` Florian Weimer
@ 2015-05-05 15:59 ` Paul Eggert
0 siblings, 0 replies; 25+ messages in thread
From: Paul Eggert @ 2015-05-05 15:59 UTC (permalink / raw)
To: Florian Weimer, libc-alpha
On 05/05/2015 08:37 AM, Florian Weimer wrote:
> On 04/24/2015 02:53 PM, Florian Weimer wrote:
>> The previous implementation could result in silent data corruption,
>> and this has been observed to happen with application code.
> I'd appreciate some comment on this patch. Do you agree that this is
> the right approach?
>
I just now read the bug report and patch and agree that the patch is the
right way to go.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-04-24 13:45 [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661] Florian Weimer
2015-05-05 15:37 ` Florian Weimer
@ 2015-05-05 20:28 ` Carlos O'Donell
2015-05-05 20:48 ` Christoph Hellwig
2015-05-06 7:19 ` Florian Weimer
1 sibling, 2 replies; 25+ messages in thread
From: Carlos O'Donell @ 2015-05-05 20:28 UTC (permalink / raw)
To: Florian Weimer, libc-alpha
On 04/24/2015 08:53 AM, Florian Weimer wrote:
> The previous implementation could result in silent data corruption,
> and this has been observed to happen with application code.
In principle I agree with the removal of all of the fallback fallocate
code, it simply can't work reliably, and a reliable solution is ridiculously
expensive (see Rich's comments in the BZ about CAS over all the mmap'd pages).
The bug with O_APPEND files is real, and yet another reason to remove the
fallback code.
My opinion is that some of the failure modes talked about in the bugzilla
are invalid, for example having another thread or process calling truncate
is already a race condition, you don't need the fallocate fallback to
expose a race that corrupts data. The other thread might truncate after
you had written data to the file, resulting in the loss of data since
there was no synchronization. If there was synchronization then there would
be no problem since the thread calling truncate would wait for posix_fallocate
to complete before truncating.
The other side of the coin is that POSIX goes on further to say in
"2.9.7 Thread Interactions with Regular File Operations" that threads
should never see interleaving sets of file operations, but it is insane
to do anything like that because it kills performance, so you don't get
those guarantees in Linux.
What worries me though is that this change could break existing systems
that relied on this emulation to do something sensible for filesystems
that don't support fallocate. These binaries could easily be single threaded
systems with no other process touching their files and writing to filesystems
that don't support fallocate. If that is a sensible class of users, then we
need to version the interface, with the old version continuing to call the
fallback code and the new version not calling the fallback code.
In summary:
OK to checkin as long as you version the interface to prevent breaking
existing applications. Unless you can show that all filesystems a sensible
person might care about support fallocate, making versioning a waste of
time.
Thoughts?
Cheers,
Carlos.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-05-05 20:28 ` Carlos O'Donell
@ 2015-05-05 20:48 ` Christoph Hellwig
2015-05-06 20:59 ` Carlos O'Donell
2015-05-06 7:19 ` Florian Weimer
1 sibling, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2015-05-05 20:48 UTC (permalink / raw)
To: Carlos O'Donell; +Cc: Florian Weimer, libc-alpha
On Tue, May 05, 2015 at 04:28:41PM -0400, Carlos O'Donell wrote:
> The other side of the coin is that POSIX goes on further to say in
> "2.9.7 Thread Interactions with Regular File Operations" that threads
> should never see interleaving sets of file operations, but it is insane
> to do anything like that because it kills performance, so you don't get
> those guarantees in Linux.
Which specific guarantees do you see violated with a sane filesystem like
XFS?
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-05-05 20:28 ` Carlos O'Donell
2015-05-05 20:48 ` Christoph Hellwig
@ 2015-05-06 7:19 ` Florian Weimer
2015-05-06 22:48 ` Paul Eggert
2015-05-07 19:01 ` Carlos O'Donell
1 sibling, 2 replies; 25+ messages in thread
From: Florian Weimer @ 2015-05-06 7:19 UTC (permalink / raw)
To: Carlos O'Donell, libc-alpha
On 05/05/2015 10:28 PM, Carlos O'Donell wrote:
> On 04/24/2015 08:53 AM, Florian Weimer wrote:
>> The previous implementation could result in silent data corruption,
>> and this has been observed to happen with application code.
>
> In principle I agree with the removal of all of the fallback fallocate
> code, it simply can't work reliably, and a reliable solution is ridiculously
> expensive (see Rich's comments in the BZ about CAS over all the mmap'd pages).
It's also not covered by the memory model, I think.
> The bug with O_APPEND files is real, and yet another reason to remove the
> fallback code.
We should handle that better at the very least.
We could clear O_APPEND, but only in single-threaded mode; I don't think
it's worth the effort. Re-opening the descriptor through /proc/self/fd
does not work because closing that descriptor would release POSIX
advisory locks.
> What worries me though is that this change could break existing systems
> that relied on this emulation to do something sensible for filesystems
> that don't support fallocate. These binaries could easily be single threaded
> systems with no other process touching their files and writing to filesystems
> that don't support fallocate. If that is a sensible class of users, then we
> need to version the interface, with the old version continuing to call the
> fallback code and the new version not calling the fallback code.
After sleeping over your comments, I actually did my homework. The gist
is that we cannot remove fallback, I think not even with the
compatibility symbol.
Various file systems do not support fallocate. This includes NFS, where
even the most recent version makes it optional to implement in the server.
SQLite ignores the posix_fallocate return value, but MariaDB does not.
A recompiled MariaDB would suddenly start to fail, and the DBA would
have to disable pre-allocation in the configuration. If I read the
source correctly, systemd-journald will stop logging, and there is no
knob to turn off fallocate. Same for libvirt, it will fail to create
backing files for storage devices.
Both MariaDB and libvirt are often run on NFS storage, so a glibc change
which removes fallback would actually affect them. For the code we
ship, we can move the fallback to the applications, but there is no good
way to make sure that happens with third-party applications. I do not
believe the compatibility symbol mechanism is a good alternative because
the breakage will be file-system-dependent and may not be noticed during
testing. (I'm generally skeptical of using compatibility symbols this way.)
Maybe we could remove the write loop and perform only an ftruncate call
which (hopefully) increases the file size. This would take care of the
O_APPEND issue and remove most of the races. Using posix_fallocate to
avoid ENOSPC later would not work, but with thin provisioning,
deduplicating storage and compression going around these days, I don't
think writing zero blocks has that effect in practice anyway
(particularly not on NFS). I'll ask around.
--
Florian Weimer / Red Hat Product Security
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-05-05 20:48 ` Christoph Hellwig
@ 2015-05-06 20:59 ` Carlos O'Donell
2015-05-06 23:29 ` Rich Felker
0 siblings, 1 reply; 25+ messages in thread
From: Carlos O'Donell @ 2015-05-06 20:59 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Florian Weimer, libc-alpha
On 05/05/2015 04:48 PM, Christoph Hellwig wrote:
> On Tue, May 05, 2015 at 04:28:41PM -0400, Carlos O'Donell wrote:
>> The other side of the coin is that POSIX goes on further to say in
>> "2.9.7 Thread Interactions with Regular File Operations" that threads
>> should never see interleaving sets of file operations, but it is insane
>> to do anything like that because it kills performance, so you don't get
>> those guarantees in Linux.
>
> Which specific guarantees do you see violated with a sane filesystem like
> XFS?
I have not verified that XFS behaves as is expected by POSIX, but I was
going by Linus's comments when this issue was discussed and then fixed
in 3.14.
In particular:
http://article.gmane.org/gmane.linux.kernel/398249
With the original thread here:
http://thread.gmane.org/gmane.linux.kernel/397980
Would an fstat on XFS show the in-progress IO being done by a call to
write? If it does, then it violates POSIX, which requires that none
or all of the write show up in the fstat call.
The standard statement in question is:
~~~
2.9.7 Thread Interactions with Regular File Operations
All of the functions chmod( ), close( ), fchmod( ), fcntl( ), fstat( ),
ftruncate( ), lseek( ), open( ), read( ), readlink( ), stat( ), symlink( ),
and write( ) shall be atomic with respect to each other in the effects
specified in IEEE Std 1003.1-2001 when they operate on regular files. If two
threads each call one of these functions, each call shall either see all of
the specified effects of the other call, or none of them.
~~~
Cheers,
Carlos.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-05-06 7:19 ` Florian Weimer
@ 2015-05-06 22:48 ` Paul Eggert
2015-05-06 23:31 ` Rich Felker
2015-05-07 19:01 ` Carlos O'Donell
1 sibling, 1 reply; 25+ messages in thread
From: Paul Eggert @ 2015-05-06 22:48 UTC (permalink / raw)
To: Florian Weimer, libc-alpha
Florian Weimer wrote:
> Maybe we could remove the write loop and perform only an ftruncate call
> which (hopefully) increases the file size. This would take care of the
> O_APPEND issue and remove most of the races.
I like this idea.
> Using posix_fallocate to
> avoid ENOSPC later would not work, but with thin provisioning,
> deduplicating storage and compression going around these days, I don't
> think writing zero blocks has that effect in practice anyway
That's right.
> (particularly not on NFS).
It's in draft NFS v4.2 as the ALLOCATE operation; see:
https://tools.ietf.org/html/draft-ietf-nfsv4-minorversion2-38
This is pretty much bleeding-edge of course.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-05-06 20:59 ` Carlos O'Donell
@ 2015-05-06 23:29 ` Rich Felker
0 siblings, 0 replies; 25+ messages in thread
From: Rich Felker @ 2015-05-06 23:29 UTC (permalink / raw)
To: Carlos O'Donell; +Cc: Christoph Hellwig, Florian Weimer, libc-alpha
On Wed, May 06, 2015 at 04:58:56PM -0400, Carlos O'Donell wrote:
> On 05/05/2015 04:48 PM, Christoph Hellwig wrote:
> > On Tue, May 05, 2015 at 04:28:41PM -0400, Carlos O'Donell wrote:
> >> The other side of the coin is that POSIX goes on further to say in
> >> "2.9.7 Thread Interactions with Regular File Operations" that threads
> >> should never see interleaving sets of file operations, but it is insane
> >> to do anything like that because it kills performance, so you don't get
> >> those guarantees in Linux.
> >
> > Which specific guarantees do you see violated with a sane filesystem like
> > XFS?
>
> I have not verified that XFS behaves as is expected by POSIX, but I was
> going by Linus's comments when this issue was discussed and then fixed
> in 3.14.
>
> In particular:
> http://article.gmane.org/gmane.linux.kernel/398249
>
> With the original thread here:
> http://thread.gmane.org/gmane.linux.kernel/397980
>
> Would an fstat on XFS show the in-progress IO being done by a call to
> write? If it does, then it violates POSIX, which requires that none
> or all of the write show up in the fstat call.
>
> The standard statement in question is:
> ~~~
> 2.9.7 Thread Interactions with Regular File Operations
> All of the functions chmod( ), close( ), fchmod( ), fcntl( ), fstat( ),
> ftruncate( ), lseek( ), open( ), read( ), readlink( ), stat( ), symlink( ),
> and write( ) shall be atomic with respect to each other in the effects
> specified in IEEE Std 1003.1-2001 when they operate on regular files. If two
> threads each call one of these functions, each call shall either see all of
> the specified effects of the other call, or none of them.
> ~~~
I'm pretty sure Linux has a lot of bugs in this regard. Unless the
standard is to be relaxed, I think the right solution is either for
the kernel to simulate atomicity or to break out of the long write and
return a short write when another operation tries to access the file
state while it's in progress. Sadly there does not seem to be anything
userspace can do to work around the kernel bugs, though.
Rich
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-05-06 22:48 ` Paul Eggert
@ 2015-05-06 23:31 ` Rich Felker
2015-05-07 18:19 ` Roland McGrath
0 siblings, 1 reply; 25+ messages in thread
From: Rich Felker @ 2015-05-06 23:31 UTC (permalink / raw)
To: Paul Eggert; +Cc: Florian Weimer, libc-alpha
On Wed, May 06, 2015 at 03:48:38PM -0700, Paul Eggert wrote:
> Florian Weimer wrote:
> >Maybe we could remove the write loop and perform only an ftruncate call
> >which (hopefully) increases the file size. This would take care of the
> >O_APPEND issue and remove most of the races.
>
> I like this idea.
If I'm not mistaken ftruncate could still reduce the file size if it
races with another operation that would extend the file. This is also
a data loss bug.
Rich
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-05-06 23:31 ` Rich Felker
@ 2015-05-07 18:19 ` Roland McGrath
2015-05-07 19:05 ` Florian Weimer
0 siblings, 1 reply; 25+ messages in thread
From: Roland McGrath @ 2015-05-07 18:19 UTC (permalink / raw)
To: Rich Felker; +Cc: Paul Eggert, Florian Weimer, libc-alpha
> If I'm not mistaken ftruncate could still reduce the file size if it
> races with another operation that would extend the file. This is also
> a data loss bug.
I concur.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-05-06 7:19 ` Florian Weimer
2015-05-06 22:48 ` Paul Eggert
@ 2015-05-07 19:01 ` Carlos O'Donell
1 sibling, 0 replies; 25+ messages in thread
From: Carlos O'Donell @ 2015-05-07 19:01 UTC (permalink / raw)
To: Florian Weimer, libc-alpha
On 05/06/2015 03:19 AM, Florian Weimer wrote:
> On 05/05/2015 10:28 PM, Carlos O'Donell wrote:
>> On 04/24/2015 08:53 AM, Florian Weimer wrote:
>>> The previous implementation could result in silent data corruption,
>>> and this has been observed to happen with application code.
>>
>> In principle I agree with the removal of all of the fallback fallocate
>> code, it simply can't work reliably, and a reliable solution is ridiculously
>> expensive (see Rich's comments in the BZ about CAS over all the mmap'd pages).
>
> It's also not covered by the memory model, I think.
>
>> The bug with O_APPEND files is real, and yet another reason to remove the
>> fallback code.
>
> We should handle that better at the very least.
>
> We could clear O_APPEND, but only in single-threaded mode; I don't think
> it's worth the effort. Re-opening the descriptor through /proc/self/fd
> does not work because closing that descriptor would release POSIX
> advisory locks.
I do not think we need to do that, and I agree with some of your comments
below.
Keep in mind that we need only assure that subsequent writes succeed
and that the files is the right length on the filesystem. This in my mind
means we need only call `ftruncate` successfully.
>> What worries me though is that this change could break existing systems
>> that relied on this emulation to do something sensible for filesystems
>> that don't support fallocate. These binaries could easily be single threaded
>> systems with no other process touching their files and writing to filesystems
>> that don't support fallocate. If that is a sensible class of users, then we
>> need to version the interface, with the old version continuing to call the
>> fallback code and the new version not calling the fallback code.
>
> After sleeping over your comments, I actually did my homework. The gist
> is that we cannot remove fallback, I think not even with the
> compatibility symbol.
>
> Various file systems do not support fallocate. This includes NFS, where
> even the most recent version makes it optional to implement in the server.
OK.
> SQLite ignores the posix_fallocate return value, but MariaDB does not.
> A recompiled MariaDB would suddenly start to fail, and the DBA would
> have to disable pre-allocation in the configuration. If I read the
> source correctly, systemd-journald will stop logging, and there is no
> knob to turn off fallocate. Same for libvirt, it will fail to create
> backing files for storage devices.
OK.
> Both MariaDB and libvirt are often run on NFS storage, so a glibc change
> which removes fallback would actually affect them. For the code we
> ship, we can move the fallback to the applications, but there is no good
> way to make sure that happens with third-party applications. I do not
> believe the compatibility symbol mechanism is a good alternative because
> the breakage will be file-system-dependent and may not be noticed during
> testing. (I'm generally skeptical of using compatibility symbols this way.)
That is a difference of opinion, but I buy your analysis, despite our best
efforts with compatibility symbols the NFS use case would remain and users
would see failures everywhere after a recompilation. It would not be prudent
of us to do this, and it is exactly what I worried about.
> Maybe we could remove the write loop and perform only an ftruncate call
> which (hopefully) increases the file size. This would take care of the
> O_APPEND issue and remove most of the races. Using posix_fallocate to
> avoid ENOSPC later would not work, but with thin provisioning,
> deduplicating storage and compression going around these days, I don't
> think writing zero blocks has that effect in practice anyway
> (particularly not on NFS). I'll ask around.
I agree. I was thinking exactly the same thing when I saw the write loop.
Unfortunately only fallocate at the kernel fs layer is going to guarantee
you never see ENOSPC in all reasonable situations.
Cheers,
Carlos.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-05-07 18:19 ` Roland McGrath
@ 2015-05-07 19:05 ` Florian Weimer
2015-05-18 13:14 ` Florian Weimer
0 siblings, 1 reply; 25+ messages in thread
From: Florian Weimer @ 2015-05-07 19:05 UTC (permalink / raw)
To: Roland McGrath, Rich Felker; +Cc: Paul Eggert, libc-alpha
On 05/07/2015 08:19 PM, Roland McGrath wrote:
>> If I'm not mistaken ftruncate could still reduce the file size if it
>> races with another operation that would extend the file. This is also
>> a data loss bug.
>
> I concur.
It happens with length == 0. We could error out with EINVAL instead of
calling ftruncate.
Daniel Berrange pointed me to these bugs:
https://sourceware.org/bugzilla/show_bug.cgi?id=17322
https://bugzilla.redhat.com/show_bug.cgi?id=1140250
https://bugzilla.redhat.com/show_bug.cgi?id=1077068
This suggests that people actually rely on the current allocation
behavior. Combined with my previous analysis that applications will
start to fail if we remove the fallback and return EINVAL, I now think
we need to keep the allocation loop.
I don't like this situation. It's a strong argument against providing
approximate user-space emulation (setxid is another example, I'm sure
there are others). These experiences may be relevant to the getrandom
debate.
I'm working on a patch with a few minor fixes to posix_fallocate and an
update to the manual. I don't think we can do better at present,
unfortunately.
--
Florian Weimer / Red Hat Product Security
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-05-07 19:05 ` Florian Weimer
@ 2015-05-18 13:14 ` Florian Weimer
2015-05-22 15:03 ` Mark Wielaard
2015-06-03 6:59 ` Carlos O'Donell
0 siblings, 2 replies; 25+ messages in thread
From: Florian Weimer @ 2015-05-18 13:14 UTC (permalink / raw)
To: Roland McGrath, Rich Felker
Cc: Paul Eggert, libc-alpha, Carlos O'Donell, Mark Wielaard
[-- Attachment #1: Type: text/plain, Size: 1171 bytes --]
On 05/07/2015 09:05 PM, Florian Weimer wrote:
> On 05/07/2015 08:19 PM, Roland McGrath wrote:
>>> If I'm not mistaken ftruncate could still reduce the file size if it
>>> races with another operation that would extend the file. This is also
>>> a data loss bug.
>>
>> I concur.
>
> It happens with length == 0. We could error out with EINVAL instead of
> calling ftruncate.
>
> Daniel Berrange pointed me to these bugs:
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=17322
> https://bugzilla.redhat.com/show_bug.cgi?id=1140250
> https://bugzilla.redhat.com/show_bug.cgi?id=1077068
Another very recent example is here:
https://lists.fedorahosted.org/pipermail/elfutils-devel/2015-May/004868.html
> This suggests that people actually rely on the current allocation
> behavior. Combined with my previous analysis that applications will
> start to fail if we remove the fallback and return EINVAL, I now think
> we need to keep the allocation loop.
This is the patch I currently have. It fixes the avoidable bugs. I
still think we are in a bad situation here, that even a compatibility
symbol cannot fix.
--
Florian Weimer / Red Hat Product Security
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-posix_fallocate-Emulation-fixes-and-documentation-BZ.patch --]
[-- Type: text/x-patch; name="0001-posix_fallocate-Emulation-fixes-and-documentation-BZ.patch", Size: 13642 bytes --]
From f25f46c0e0223d9f6e9e8ead27a37caa34f33631 Mon Sep 17 00:00:00 2001
Message-Id: <f25f46c0e0223d9f6e9e8ead27a37caa34f33631.1431945166.git.fweimer@redhat.com>
From: Florian Weimer <fweimer@redhat.com>
Date: Mon, 18 May 2015 11:32:44 +0100
Subject: [PATCH] posix_fallocate: Emulation fixes and documentation [BZ
#15661]
To: libc-alpha@sourceware.org
Handle signed integer overflow correctly. Detect and reject O_APPEND.
Document drawbacks of emulation.
This does not completely address bug 15661, but improves the situation
somewhat.
---
ChangeLog | 9 ++++
manual/filesys.texi | 94 +++++++++++++++++++++++++++++++++++++++
sysdeps/posix/posix_fallocate.c | 67 ++++++++++++++++++++--------
sysdeps/posix/posix_fallocate64.c | 67 ++++++++++++++++++++--------
4 files changed, 199 insertions(+), 38 deletions(-)
diff --git a/ChangeLog b/ChangeLog
index 4de8a25..603847b 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,12 @@
+2015-05-18 Florian Weimer <fweimer@redhat.com>
+
+ [BZ #15661]
+ * sysdeps/posix/posix_fallocate64.c (__posix_fallocate64_l64):
+ Check for overflow properly. Check for O_APPEND. Ignore large
+ file system block sizes. Add comments about problems.
+ * sysdeps/posix/posix_fallocate.c (posix_fallocate): Likewise.
+ * manual/filesys.texi (Storage Allocation): New node.
+
2015-05-18 Arjun Shankar <arjun.is@lostca.se>
* include/stdio.h: Define __need_wint_t.
diff --git a/manual/filesys.texi b/manual/filesys.texi
index 7d55b43..0f2e3dc 100644
--- a/manual/filesys.texi
+++ b/manual/filesys.texi
@@ -1723,6 +1723,7 @@ modify the attributes of a file.
access a file.
* File Times:: About the time attributes of a file.
* File Size:: Manually changing the size of a file.
+* Storage Allocation:: Allocate backing storage for files.
@end menu
@node Attribute Meanings
@@ -3233,6 +3234,99 @@ is a requirement of @code{mmap}. The program has to keep track of the
real size, and when it has finished a final @code{ftruncate} call should
set the real size of the file.
+@node Storage Allocation
+@subsection Storage Allocation
+@cindex allocating file storage
+@cindex file allocation
+@cindex storage allocating
+
+@cindex file fragmentation
+@cindex fragmentation of files
+@cindex sparse files
+@cindex files, sparse
+Most file systems support allocating large files in a non-contiguous
+fashion: the file is split into @emph{fragments} which are allocated
+sequentially, but the fragments themselves can be scattered across the
+disk. File systems generally try to avoid such fragmentation because it
+decreases performance, but if a file gradually increases in size, there
+might be no other option than to fragment it. In addition, many file
+systems support @emph{sparse files} with @emph{holes}: regions of null
+bytes for which no backing storage has been allocated by the file
+system. When the holes are finally overwritten with data, fragmentation
+can occur as well.
+
+Explicit allocation of storage for yet-unwritten parts of the file can
+help the system to avoid fragmentation. Additionally, if storage
+pre-allocation fails, it is possible to report the out-of-disk error
+early, often without filling up the entire disk. However, due to
+deduplication, copy-on-write semantics, and file compression, such
+pre-allocation may not reliably prevent the out-of-disk-space error from
+occurring later. Checking for write errors is still required, and
+writes to memory-mapped regions created with @code{mmap} can still
+result in @code{SIGBUS}.
+
+@deftypefun int posix_fallocate (int @var{fd}, off_t @var{offset}, off_t @var{length})
+@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
+@c If the file system does not support allocation,
+@c @code{posix_fallocate} has a race with file extension (if
+@c @var{length} is zero) or with concurrent writes of non-NUL bytes (if
+@c @var{length} is positive).
+
+Allocate backing store for the region of @var{length} bytes starting at
+byte @var{offset} in the file for the descriptor @var{fd}. The file
+length is increased to @samp{@var{length} + @var{offset}} if necessary.
+
+@var{fd} must be a regular file opened for writing, or @code{EBADF} is
+returned. If there is insufficient disk space to fulfill the allocation
+request, @code{ENOSPC} is returned.
+
+@strong{Note:} If @code{fallocate} is not available (because the file
+system does not support it), @code{posix_fallocate} is emulated, which
+has the following drawbacks:
+
+@itemize @bullet
+@item
+It is very inefficient because all file system blocks in the requested
+range need to be examined (even if they have been allocated before) and
+potentially rewritten. In contrast, with proper @code{fallocate}
+support (see below), the file system can examine the internal file
+allocation data structures and eliminate holes directly, maybe even
+using unwritten extents (which are pre-allocated but uninitialized on
+disk).
+
+@item
+There is a race condition if another thread or process modifies the
+underlying file in the to-be-allocated area. Non-null bytes could be
+overwritten with null bytes.
+
+@item
+If @var{fd} has been opened with the @code{O_APPEND} flag, the function
+will fail with an @code{errno} value of @code{EBADF}.
+
+@item
+If @var{length} is zero, @code{ftruncate} is used to increase the file
+size as requested, without allocating file system blocks. There is a
+race condition which means that @code{ftruncate} can accidentally
+truncate the file if it has been extended concurrently.
+@end itemize
+
+On Linux, if an application does not benefit from emulation or if the
+emulation is harmful due to its inherent race conditions, the
+application can use the Linux-specific @code{fallocate} function, with a
+zero flag argument. For the @code{fallocate} function, @theglibc{} does
+not perform allocation emulation if the file system does not support
+allocation. Instead, an @code{EOPNOTSUPP} is returned to the caller.
+
+@end deftypefun
+
+@deftypefun int posix_fallocate64 (int @var{fd}, off64_t @var{length}, off64_t @var{offset})
+@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
+
+This function is a variant of @code{posix_fallocate64} which accepts
+64-bit file offsets on all platforms.
+
+@end deftypefun
+
@node Making Special Files
@section Making Special Files
@cindex creating special files
diff --git a/sysdeps/posix/posix_fallocate.c b/sysdeps/posix/posix_fallocate.c
index d15d603..e7fe201 100644
--- a/sysdeps/posix/posix_fallocate.c
+++ b/sysdeps/posix/posix_fallocate.c
@@ -18,26 +18,36 @@
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
+#include <stdint.h>
+#include <sys/fcntl.h>
#include <sys/stat.h>
#include <sys/statfs.h>
-/* Reserve storage for the data of the file associated with FD. */
+/* Reserve storage for the data of the file associated with FD. This
+ emulation is far from perfect, but the kernel cannot do not much
+ better for network file systems, either. */
int
posix_fallocate (int fd, __off_t offset, __off_t len)
{
struct stat64 st;
- struct statfs f;
- /* `off_t' is a signed type. Therefore we can determine whether
- OFFSET + LEN is too large if it is a negative value. */
if (offset < 0 || len < 0)
return EINVAL;
- if (offset + len < 0)
+
+ /* Perform overflow check. The outer cast relies on a GCC
+ extension. */
+ if ((__off_t) ((uint64_t) offset) + ((uint64_t) len) < 0)
return EFBIG;
- /* First thing we have to make sure is that this is really a regular
- file. */
+ /* pwrite below will not do the right thing in O_APPEND mode. */
+ {
+ int flags = __fcntl (fd, F_GETFL, 0);
+ if (flags < 0 || (flags & O_APPEND) != 0)
+ return EBADF;
+ }
+
+ /* We have to make sure that this is really a regular file. */
if (__fxstat64 (_STAT_VER, fd, &st) != 0)
return EBADF;
if (S_ISFIFO (st.st_mode))
@@ -47,6 +57,8 @@ posix_fallocate (int fd, __off_t offset, __off_t len)
if (len == 0)
{
+ /* This is racy, but there is no good way to satisfy a
+ zero-length allocation request. */
if (st.st_size < offset)
{
int ret = __ftruncate (fd, offset);
@@ -58,19 +70,36 @@ posix_fallocate (int fd, __off_t offset, __off_t len)
return 0;
}
- /* We have to know the block size of the filesystem to get at least some
- sort of performance. */
- if (__fstatfs (fd, &f) != 0)
- return errno;
-
- /* Try to play safe. */
- if (f.f_bsize == 0)
- f.f_bsize = 512;
-
- /* Write something to every block. */
- for (offset += (len - 1) % f.f_bsize; len > 0; offset += f.f_bsize)
+ /* Minimize data transfer for network file systems, by issuing
+ single-byte write requests spaced by the file system block size.
+ (Most local file systems have fallocate support, so this fallback
+ code is not used there.) */
+
+ unsigned increment;
+ {
+ struct statfs64 f;
+
+ if (__fstatfs64 (fd, &f) != 0)
+ return errno;
+ if (f.f_bsize == 0)
+ increment = 512;
+ else if (f.f_bsize < 4096)
+ increment = f.f_bsize;
+ else
+ /* NFS does not propagate the block size of the underlying
+ storage and may report a much larger value which would still
+ leave holes after the loop below, so we cap the increment at
+ 4096. */
+ increment = 4096;
+ }
+
+ /* Write a null byte to every block. This is racy; we currently
+ lack a better option. Compare-and-swap against a file mapping
+ might additional local races, but requires interposition of a
+ signal handler to catch SIGBUS. */
+ for (offset += (len - 1) % increment; len > 0; offset += increment)
{
- len -= f.f_bsize;
+ len -= increment;
if (offset < st.st_size)
{
diff --git a/sysdeps/posix/posix_fallocate64.c b/sysdeps/posix/posix_fallocate64.c
index b845df7..ee32679 100644
--- a/sysdeps/posix/posix_fallocate64.c
+++ b/sysdeps/posix/posix_fallocate64.c
@@ -18,26 +18,36 @@
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
+#include <stdint.h>
+#include <sys/fcntl.h>
#include <sys/stat.h>
#include <sys/statfs.h>
-/* Reserve storage for the data of the file associated with FD. */
+/* Reserve storage for the data of the file associated with FD. This
+ emulation is far from perfect, but the kernel cannot do not much
+ better for network file systems, either. */
int
__posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len)
{
struct stat64 st;
- struct statfs64 f;
- /* `off64_t' is a signed type. Therefore we can determine whether
- OFFSET + LEN is too large if it is a negative value. */
if (offset < 0 || len < 0)
return EINVAL;
- if (offset + len < 0)
+
+ /* Perform overflow check. The outer cast relies on a GCC
+ extension. */
+ if ((__off64_t) ((uint64_t) offset) + ((uint64_t) len) < 0)
return EFBIG;
- /* First thing we have to make sure is that this is really a regular
- file. */
+ /* pwrite64 below will not do the right thing in O_APPEND mode. */
+ {
+ int flags = __fcntl (fd, F_GETFL, 0);
+ if (flags < 0 || (flags & O_APPEND) != 0)
+ return EBADF;
+ }
+
+ /* We have to make sure that this is really a regular file. */
if (__fxstat64 (_STAT_VER, fd, &st) != 0)
return EBADF;
if (S_ISFIFO (st.st_mode))
@@ -47,6 +57,8 @@ __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len)
if (len == 0)
{
+ /* This is racy, but there is no good way to satisfy a
+ zero-length allocation request. */
if (st.st_size < offset)
{
int ret = __ftruncate64 (fd, offset);
@@ -58,19 +70,36 @@ __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len)
return 0;
}
- /* We have to know the block size of the filesystem to get at least some
- sort of performance. */
- if (__fstatfs64 (fd, &f) != 0)
- return errno;
-
- /* Try to play safe. */
- if (f.f_bsize == 0)
- f.f_bsize = 512;
-
- /* Write something to every block. */
- for (offset += (len - 1) % f.f_bsize; len > 0; offset += f.f_bsize)
+ /* Minimize data transfer for network file systems, by issuing
+ single-byte write requests spaced by the file system block size.
+ (Most local file systems have fallocate support, so this fallback
+ code is not used there.) */
+
+ unsigned increment;
+ {
+ struct statfs64 f;
+
+ if (__fstatfs64 (fd, &f) != 0)
+ return errno;
+ if (f.f_bsize == 0)
+ increment = 512;
+ else if (f.f_bsize < 4096)
+ increment = f.f_bsize;
+ else
+ /* NFS clients do not propagate the block size of the underlying
+ storage and may report a much larger value which would still
+ leave holes after the loop below, so we cap the increment at
+ 4096. */
+ increment = 4096;
+ }
+
+ /* Write a null byte to every block. This is racy; we currently
+ lack a better option. Compare-and-swap against a file mapping
+ might address local races, but requires interposition of a signal
+ handler to catch SIGBUS. */
+ for (offset += (len - 1) % increment; len > 0; offset += increment)
{
- len -= f.f_bsize;
+ len -= increment;
if (offset < st.st_size)
{
--
2.1.0
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-05-18 13:14 ` Florian Weimer
@ 2015-05-22 15:03 ` Mark Wielaard
2015-05-26 12:24 ` Florian Weimer
2015-06-03 6:59 ` Carlos O'Donell
1 sibling, 1 reply; 25+ messages in thread
From: Mark Wielaard @ 2015-05-22 15:03 UTC (permalink / raw)
To: Florian Weimer
Cc: Roland McGrath, Rich Felker, Paul Eggert, libc-alpha,
Carlos O'Donell
On Mon, 2015-05-18 at 12:43 +0200, Florian Weimer wrote:
> Another very recent example is here:
>
> https://lists.fedorahosted.org/pipermail/elfutils-devel/2015-May/004868.html
>
> > This suggests that people actually rely on the current allocation
> > behavior. Combined with my previous analysis that applications will
> > start to fail if we remove the fallback and return EINVAL, I now think
> > we need to keep the allocation loop.
I should point out that the above patch isn't in elfutils yet. It is
waiting on how this discussion turns out.
At the moment we simply use ftruncate. The problem that is solved by
using posix_fallocate is that we are about the write to the memory of an
mmapped file. Since ftruncate doesn't guarantee that the backing store
is really there we risk getting a SIGBUS if the disk is full and we
write to a memory area that hasn't been allocated yet. Since this is in
library code, we cannot simply catch the SIGBUS. And we cannot use
fallocate since that doesn't guarantee that the backing store is really
allocated since it depends on whether the underlying file system support
fallocate. As far as I know posix_fallocate is the only way to guarantee
that the file is fully allocated without spurious failures depending on
where the file resides in the file system. And posix_fallocate has been
available since forever with this functionality. We only expect an error
if there was a real error, like ENOSPACE, allocating the file. But if
there is another way to get that guarantee we would be happy to switch
to it.
Cheers,
Mark
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-05-22 15:03 ` Mark Wielaard
@ 2015-05-26 12:24 ` Florian Weimer
2015-05-26 12:29 ` Mark Wielaard
2015-05-26 16:57 ` Rich Felker
0 siblings, 2 replies; 25+ messages in thread
From: Florian Weimer @ 2015-05-26 12:24 UTC (permalink / raw)
To: Mark Wielaard
Cc: Roland McGrath, Rich Felker, Paul Eggert, libc-alpha,
Carlos O'Donell
On 05/22/2015 12:18 PM, Mark Wielaard wrote:
> On Mon, 2015-05-18 at 12:43 +0200, Florian Weimer wrote:
>> Another very recent example is here:
>>
>> https://lists.fedorahosted.org/pipermail/elfutils-devel/2015-May/004868.html
>>
>>> This suggests that people actually rely on the current allocation
>>> behavior. Combined with my previous analysis that applications will
>>> start to fail if we remove the fallback and return EINVAL, I now think
>>> we need to keep the allocation loop.
>
> I should point out that the above patch isn't in elfutils yet. It is
> waiting on how this discussion turns out.
>
> At the moment we simply use ftruncate. The problem that is solved by
> using posix_fallocate is that we are about the write to the memory of an
> mmapped file. Since ftruncate doesn't guarantee that the backing store
> is really there we risk getting a SIGBUS if the disk is full and we
> write to a memory area that hasn't been allocated yet. Since this is in
> library code, we cannot simply catch the SIGBUS. And we cannot use
> fallocate since that doesn't guarantee that the backing store is really
> allocated since it depends on whether the underlying file system support
> fallocate.
posix_fallocate does not guarantee this, either. See my patch with the
documentation update. Compression, COW, thin provision all can result
in ENOSPC. And obviously, there can be other I/O errors.
If you absolutely, truly need to use mmap, we need either have to
provide a way to intercept SIGBUS (perhaps à la SHEâthe technology is
there, it's just not available to C code in a deeply nested library
right now), or another mmap flag that prevents the kernel from sending
SIGBUS, and some way to tell if a mapping had been subject to write
errors (perhaps revive msync(MS_ASYNC)?).
> As far as I know posix_fallocate is the only way to guarantee
> that the file is fully allocated without spurious failures depending on
> where the file resides in the file system. And posix_fallocate has been
> available since forever with this functionality.
Thin provisioning etc. has become more common lately, so while
posix_fallocate did provide this functionality in the past, it doesn't
seem to do it right now.
--
Florian Weimer / Red Hat Product Security
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-05-26 12:24 ` Florian Weimer
@ 2015-05-26 12:29 ` Mark Wielaard
2015-05-26 12:38 ` Florian Weimer
2015-05-26 16:57 ` Rich Felker
1 sibling, 1 reply; 25+ messages in thread
From: Mark Wielaard @ 2015-05-26 12:29 UTC (permalink / raw)
To: Florian Weimer
Cc: Roland McGrath, Rich Felker, Paul Eggert, libc-alpha,
Carlos O'Donell
On Tue, 2015-05-26 at 11:12 +0200, Florian Weimer wrote:
> On 05/22/2015 12:18 PM, Mark Wielaard wrote:
> > At the moment we simply use ftruncate. The problem that is solved by
> > using posix_fallocate is that we are about the write to the memory of an
> > mmapped file. Since ftruncate doesn't guarantee that the backing store
> > is really there we risk getting a SIGBUS if the disk is full and we
> > write to a memory area that hasn't been allocated yet. Since this is in
> > library code, we cannot simply catch the SIGBUS. And we cannot use
> > fallocate since that doesn't guarantee that the backing store is really
> > allocated since it depends on whether the underlying file system support
> > fallocate.
>
> posix_fallocate does not guarantee this, either. See my patch with the
> documentation update. Compression, COW, thin provision all can result
> in ENOSPC. And obviously, there can be other I/O errors.
But that is precisely what we want. We want to get an ENOSPC (or some
other I/O error as documented for posix_fallocate) before we start
poking at the mmap backed memory. It seems that is what the glibc
posix_fallocate guarantees, unlike ftruncate (or fallocate which can
return an error even though there is space, but the file system just
doesn't implement fallocate support).
If not posix_allocate, what would you recommend we use to make sure the
file has a backing store before we start poking at it?
> If you absolutely, truly need to use mmap, we need either have to
> provide a way to intercept SIGBUS (perhaps à la SHE—the technology is
> there, it's just not available to C code in a deeply nested library
> right now), or another mmap flag that prevents the kernel from sending
> SIGBUS, and some way to tell if a mapping had been subject to write
> errors (perhaps revive msync(MS_ASYNC)?).
Yes, having msync report the error instead of having to deal with
signals (which is a pain in a library) would definitely be preferred. Or
any other mechanism to make sure that the mmap backed file area really
exists and/or gets written to the backing store.
> Thin provisioning etc. has become more common lately, so while
> posix_fallocate did provide this functionality in the past, it doesn't
> seem to do it right now.
I might not understand precisely what you mean by thin-provisioning. But
wouldn't all fancy new filesystems also support fallocate directly
anyway? In which case the fallback part doesn't even trigger. And if it
does trigger then because it touches and explicitly writes any block
that doesn't have a backing store right now, would get it, doesn't it?
Thanks,
Mark
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-05-26 12:29 ` Mark Wielaard
@ 2015-05-26 12:38 ` Florian Weimer
0 siblings, 0 replies; 25+ messages in thread
From: Florian Weimer @ 2015-05-26 12:38 UTC (permalink / raw)
To: Mark Wielaard
Cc: Roland McGrath, Rich Felker, Paul Eggert, libc-alpha,
Carlos O'Donell
On 05/26/2015 12:44 PM, Mark Wielaard wrote:
>> posix_fallocate does not guarantee this, either. See my patch with the
>> documentation update. Compression, COW, thin provision all can result
>> in ENOSPC. And obviously, there can be other I/O errors.
>
> But that is precisely what we want. We want to get an ENOSPC (or some
> other I/O error as documented for posix_fallocate) before we start
> poking at the mmap backed memory. It seems that is what the glibc
> posix_fallocate guarantees, unlike ftruncate (or fallocate which can
> return an error even though there is space, but the file system just
> doesn't implement fallocate support).
Sorry, what I meant is this: Even if you actually *write* zeros now, the
increasing depth of the storage stack means that this will not reliable
prevent you from getting ENOSPC later (which is reported, in the context
of mmap, as SIGBUS).
> If not posix_allocate, what would you recommend we use to make sure the
> file has a backing store before we start poking at it?
Due to the possibility of copy-on-write storage backends, I don't think
you can get what you want anymore.
>> Thin provisioning etc. has become more common lately, so while
>> posix_fallocate did provide this functionality in the past, it doesn't
>> seem to do it right now.
>
> I might not understand precisely what you mean by thin-provisioning. But
> wouldn't all fancy new filesystems also support fallocate directly
> anyway?
Thin provisioning often operates at the block layer. An fallocate call
typically only updates metadata, creating unwritten extents instead of
actually writing out zeros.
> In which case the fallback part doesn't even trigger. And if it
> does trigger then because it touches and explicitly writes any block
> that doesn't have a backing store right now, would get it, doesn't it?
I don't know if anything implements this optimization, but it's possible
that if you write zeros to a thin-provisioned hole (think sparse file at
the block layer), then the device discards the write operation. It's
definitely an issue with compression. A block of zeros likely
compresses better than anything you will write later, so you can still
get into an ENOSPC situation.
--
Florian Weimer / Red Hat Product Security
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-05-26 12:24 ` Florian Weimer
2015-05-26 12:29 ` Mark Wielaard
@ 2015-05-26 16:57 ` Rich Felker
2015-05-27 19:48 ` Mark Wielaard
2015-05-28 14:43 ` Florian Weimer
1 sibling, 2 replies; 25+ messages in thread
From: Rich Felker @ 2015-05-26 16:57 UTC (permalink / raw)
To: Florian Weimer
Cc: Mark Wielaard, Roland McGrath, Paul Eggert, libc-alpha,
Carlos O'Donell
On Tue, May 26, 2015 at 11:12:15AM +0200, Florian Weimer wrote:
> On 05/22/2015 12:18 PM, Mark Wielaard wrote:
> > On Mon, 2015-05-18 at 12:43 +0200, Florian Weimer wrote:
> >> Another very recent example is here:
> >>
> >> https://lists.fedorahosted.org/pipermail/elfutils-devel/2015-May/004868.html
> >>
> >>> This suggests that people actually rely on the current allocation
> >>> behavior. Combined with my previous analysis that applications will
> >>> start to fail if we remove the fallback and return EINVAL, I now think
> >>> we need to keep the allocation loop.
> >
> > I should point out that the above patch isn't in elfutils yet. It is
> > waiting on how this discussion turns out.
> >
> > At the moment we simply use ftruncate. The problem that is solved by
> > using posix_fallocate is that we are about the write to the memory of an
> > mmapped file. Since ftruncate doesn't guarantee that the backing store
> > is really there we risk getting a SIGBUS if the disk is full and we
> > write to a memory area that hasn't been allocated yet. Since this is in
> > library code, we cannot simply catch the SIGBUS. And we cannot use
> > fallocate since that doesn't guarantee that the backing store is really
> > allocated since it depends on whether the underlying file system support
> > fallocate.
>
> posix_fallocate does not guarantee this, either. See my patch with the
> documentation update. Compression, COW, thin provision all can result
> in ENOSPC.
These are all buggy, non-conforming implementations. That doesn't mean
users can't use them, but when applications malfunction, it's because
they're using a low-quality, wrong implementation, not because the
application has a bug. When the implementation intentionally cuts
corners on correctness, it's not the application's job to make up for
it.
> And obviously, there can be other I/O errors.
That's faulty hardware which is not covered by the specification.
Technically, having faulty hardware is non-conforming too.
> If you absolutely, truly need to use mmap, we need either have to
> provide a way to intercept SIGBUS (perhaps à la SHEâthe technology is
> there, it's just not available to C code in a deeply nested library
> right now), or another mmap flag that prevents the kernel from sending
> SIGBUS, and some way to tell if a mapping had been subject to write
> errors (perhaps revive msync(MS_ASYNC)?).
This would be really nice. Handling SIGBUS is not a viable approach
because it's not library-safe and difficult to make thread-safe.
Rich
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-05-26 16:57 ` Rich Felker
@ 2015-05-27 19:48 ` Mark Wielaard
2015-05-28 14:43 ` Florian Weimer
1 sibling, 0 replies; 25+ messages in thread
From: Mark Wielaard @ 2015-05-27 19:48 UTC (permalink / raw)
To: Rich Felker
Cc: Florian Weimer, Roland McGrath, Paul Eggert, libc-alpha,
Carlos O'Donell
On Tue, 2015-05-26 at 11:13 -0400, Rich Felker wrote:
> > posix_fallocate does not guarantee this, either. See my patch with the
> > documentation update. Compression, COW, thin provision all can result
> > in ENOSPC.
>
> These are all buggy, non-conforming implementations. That doesn't mean
> users can't use them, but when applications malfunction, it's because
> they're using a low-quality, wrong implementation, not because the
> application has a bug. When the implementation intentionally cuts
> corners on correctness, it's not the application's job to make up for
> it.
Well, if there are issues application/library writers certainly would
like to know or have a way to detect them. In general relying on the
glibc implementation details is a must, even if there are standards that
define some corner cases differently. It looks like the guarantees that
are needed in this case are (mostly) guaranteed by the glibc fallback
code in those cases that the kernel doesn't have direct support. At
least it seems that using posix_fallocate in this particular case is
always the better choice over ftruncate. posix_fallocate will at least
give a sensible error up front instead of just causing a SIGBUS much
later in the code.
> > If you absolutely, truly need to use mmap, we need either have to
> > provide a way to intercept SIGBUS (perhaps à la SHE—the technology is
> > there, it's just not available to C code in a deeply nested library
> > right now), or another mmap flag that prevents the kernel from sending
> > SIGBUS, and some way to tell if a mapping had been subject to write
> > errors (perhaps revive msync(MS_ASYNC)?).
>
> This would be really nice. Handling SIGBUS is not a viable approach
> because it's not library-safe and difficult to make thread-safe.
Yes!
Thanks,
Mark
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-05-26 16:57 ` Rich Felker
2015-05-27 19:48 ` Mark Wielaard
@ 2015-05-28 14:43 ` Florian Weimer
1 sibling, 0 replies; 25+ messages in thread
From: Florian Weimer @ 2015-05-28 14:43 UTC (permalink / raw)
To: Rich Felker
Cc: Mark Wielaard, Roland McGrath, Paul Eggert, libc-alpha,
Carlos O'Donell
On 05/26/2015 05:13 PM, Rich Felker wrote:
> On Tue, May 26, 2015 at 11:12:15AM +0200, Florian Weimer wrote:
>> On 05/22/2015 12:18 PM, Mark Wielaard wrote:
>>> On Mon, 2015-05-18 at 12:43 +0200, Florian Weimer wrote:
>>>> Another very recent example is here:
>>>>
>>>> https://lists.fedorahosted.org/pipermail/elfutils-devel/2015-May/004868.html
>>>>
>>>>> This suggests that people actually rely on the current allocation
>>>>> behavior. Combined with my previous analysis that applications will
>>>>> start to fail if we remove the fallback and return EINVAL, I now think
>>>>> we need to keep the allocation loop.
>>>
>>> I should point out that the above patch isn't in elfutils yet. It is
>>> waiting on how this discussion turns out.
>>>
>>> At the moment we simply use ftruncate. The problem that is solved by
>>> using posix_fallocate is that we are about the write to the memory of an
>>> mmapped file. Since ftruncate doesn't guarantee that the backing store
>>> is really there we risk getting a SIGBUS if the disk is full and we
>>> write to a memory area that hasn't been allocated yet. Since this is in
>>> library code, we cannot simply catch the SIGBUS. And we cannot use
>>> fallocate since that doesn't guarantee that the backing store is really
>>> allocated since it depends on whether the underlying file system support
>>> fallocate.
>>
>> posix_fallocate does not guarantee this, either. See my patch with the
>> documentation update. Compression, COW, thin provision all can result
>> in ENOSPC.
>
> These are all buggy, non-conforming implementations.
It's a trade-off. A malloc with over-commitment is buggy and
non-conforming as well, but it allows you to run more applications at
the same time, actually eliminating failure scenarios.
The same thing is true for posix_fallocate. You can at least make sure
that quota and file-system-level limits are not in the way. And if you
want pure fallocate, that's still available.
> That doesn't mean
> users can't use them, but when applications malfunction, it's because
> they're using a low-quality, wrong implementation, not because the
> application has a bug. When the implementation intentionally cuts
> corners on correctness, it's not the application's job to make up for
> it.
I think this isn't really helpful. There is a need for storage
allocation, and there is empirical evidence that a userspace allocation
loop is worthwhile. If the kernel does not provide proper fallocate
support, it does not matter from the user perspective if it's in libc or
in the application. I'd rather provide the best possible implementation
in glibc.
--
Florian Weimer / Red Hat Product Security
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-05-18 13:14 ` Florian Weimer
2015-05-22 15:03 ` Mark Wielaard
@ 2015-06-03 6:59 ` Carlos O'Donell
2015-06-05 9:55 ` Florian Weimer
1 sibling, 1 reply; 25+ messages in thread
From: Carlos O'Donell @ 2015-06-03 6:59 UTC (permalink / raw)
To: Florian Weimer, Roland McGrath, Rich Felker
Cc: Paul Eggert, libc-alpha, Mark Wielaard
On 05/18/2015 06:43 AM, Florian Weimer wrote:
> On 05/07/2015 09:05 PM, Florian Weimer wrote:
>> > On 05/07/2015 08:19 PM, Roland McGrath wrote:
>>>> >>> If I'm not mistaken ftruncate could still reduce the file size if it
>>>> >>> races with another operation that would extend the file. This is also
>>>> >>> a data loss bug.
>>> >>
>>> >> I concur.
>> >
>> > It happens with length == 0. We could error out with EINVAL instead of
>> > calling ftruncate.
>> >
>> > Daniel Berrange pointed me to these bugs:
>> >
>> > https://sourceware.org/bugzilla/show_bug.cgi?id=17322
>> > https://bugzilla.redhat.com/show_bug.cgi?id=1140250
>> > https://bugzilla.redhat.com/show_bug.cgi?id=1077068
> Another very recent example is here:
>
>
> https://lists.fedorahosted.org/pipermail/elfutils-devel/2015-May/004868.html
>
>> > This suggests that people actually rely on the current allocation
>> > behavior. Combined with my previous analysis that applications will
>> > start to fail if we remove the fallback and return EINVAL, I now think
>> > we need to keep the allocation loop.
> This is the patch I currently have. It fixes the avoidable bugs. I
> still think we are in a bad situation here, that even a compatibility
> symbol cannot fix.
>
> -- Florian Weimer / Red Hat Product Security
OK to checkin.
I believe this is the best patch.
It fixes the O_APPEND case (real bug) and makes the NFS usages of
posix_fallocate sensible.
It leaves the fallback code in place for applications that need it
and do not wish to write their own allocation loops, as all
applications would need to do.
Lastly I believe the race conditions outlined in bug 15661 are
application bugs.
Given that posix_fallocate is not listed in POSIX section 2.9.7
"Thread Interactions with Regular File Operations" the call is not
required to be atomic with respect to any other operation. Therefore
it is the responsibility of the "other thread" to synchronize the
file operations with the thread calling posix_fallocate to ensure
posix_fallocate has returned before it starts writing to the file.
Despite the fact that posix_fallocate does not explicitly say that
it will write to the newly allocated storage, it also doesn't say
that it won't, and thus if you are extending the file with posix_fallocate
and also writing to the file, it's safest to be conservative and
synchronize until and when POSIX clarifies that posix_fallocate does
not modify the contents (as it does in at least according to FreeBSD[1]).
[1] https://www.freebsd.org/cgi/man.cgi?query=posix_fallocate&sektion=2&manpath=FreeBSD+8.3-RELEASE
"... Any existing file data in the specified range is unmodified. ..."
> 0001-posix_fallocate-Emulation-fixes-and-documentation-BZ.patch
>
>
> From f25f46c0e0223d9f6e9e8ead27a37caa34f33631 Mon Sep 17 00:00:00 2001
> Message-Id: <f25f46c0e0223d9f6e9e8ead27a37caa34f33631.1431945166.git.fweimer@redhat.com>
> From: Florian Weimer <fweimer@redhat.com>
> Date: Mon, 18 May 2015 11:32:44 +0100
> Subject: [PATCH] posix_fallocate: Emulation fixes and documentation [BZ
> #15661]
> To: libc-alpha@sourceware.org
>
> Handle signed integer overflow correctly. Detect and reject O_APPEND.
> Document drawbacks of emulation.
>
> This does not completely address bug 15661, but improves the situation
> somewhat.
> ---
> ChangeLog | 9 ++++
> manual/filesys.texi | 94 +++++++++++++++++++++++++++++++++++++++
> sysdeps/posix/posix_fallocate.c | 67 ++++++++++++++++++++--------
> sysdeps/posix/posix_fallocate64.c | 67 ++++++++++++++++++++--------
> 4 files changed, 199 insertions(+), 38 deletions(-)
>
> diff --git a/ChangeLog b/ChangeLog
> index 4de8a25..603847b 100644
> --- a/ChangeLog
> +++ b/ChangeLog
> @@ -1,3 +1,12 @@
> +2015-05-18 Florian Weimer <fweimer@redhat.com>
> +
> + [BZ #15661]
> + * sysdeps/posix/posix_fallocate64.c (__posix_fallocate64_l64):
> + Check for overflow properly. Check for O_APPEND. Ignore large
> + file system block sizes. Add comments about problems.
> + * sysdeps/posix/posix_fallocate.c (posix_fallocate): Likewise.
> + * manual/filesys.texi (Storage Allocation): New node.
> +
> 2015-05-18 Arjun Shankar <arjun.is@lostca.se>
>
> * include/stdio.h: Define __need_wint_t.
> diff --git a/manual/filesys.texi b/manual/filesys.texi
> index 7d55b43..0f2e3dc 100644
> --- a/manual/filesys.texi
> +++ b/manual/filesys.texi
> @@ -1723,6 +1723,7 @@ modify the attributes of a file.
> access a file.
> * File Times:: About the time attributes of a file.
> * File Size:: Manually changing the size of a file.
> +* Storage Allocation:: Allocate backing storage for files.
OK.
> @end menu
>
> @node Attribute Meanings
> @@ -3233,6 +3234,99 @@ is a requirement of @code{mmap}. The program has to keep track of the
> real size, and when it has finished a final @code{ftruncate} call should
> set the real size of the file.
>
> +@node Storage Allocation
> +@subsection Storage Allocation
> +@cindex allocating file storage
> +@cindex file allocation
> +@cindex storage allocating
> +
> +@cindex file fragmentation
> +@cindex fragmentation of files
> +@cindex sparse files
> +@cindex files, sparse
> +Most file systems support allocating large files in a non-contiguous
> +fashion: the file is split into @emph{fragments} which are allocated
> +sequentially, but the fragments themselves can be scattered across the
> +disk. File systems generally try to avoid such fragmentation because it
> +decreases performance, but if a file gradually increases in size, there
> +might be no other option than to fragment it. In addition, many file
> +systems support @emph{sparse files} with @emph{holes}: regions of null
> +bytes for which no backing storage has been allocated by the file
> +system. When the holes are finally overwritten with data, fragmentation
> +can occur as well.
> +
> +Explicit allocation of storage for yet-unwritten parts of the file can
> +help the system to avoid fragmentation. Additionally, if storage
> +pre-allocation fails, it is possible to report the out-of-disk error
> +early, often without filling up the entire disk. However, due to
> +deduplication, copy-on-write semantics, and file compression, such
> +pre-allocation may not reliably prevent the out-of-disk-space error from
> +occurring later. Checking for write errors is still required, and
> +writes to memory-mapped regions created with @code{mmap} can still
> +result in @code{SIGBUS}.
> +
> +@deftypefun int posix_fallocate (int @var{fd}, off_t @var{offset}, off_t @var{length})
> +@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
OK.
> +@c If the file system does not support allocation,
> +@c @code{posix_fallocate} has a race with file extension (if
> +@c @var{length} is zero) or with concurrent writes of non-NUL bytes (if
> +@c @var{length} is positive).
> +
> +Allocate backing store for the region of @var{length} bytes starting at
> +byte @var{offset} in the file for the descriptor @var{fd}. The file
> +length is increased to @samp{@var{length} + @var{offset}} if necessary.
> +
> +@var{fd} must be a regular file opened for writing, or @code{EBADF} is
> +returned. If there is insufficient disk space to fulfill the allocation
> +request, @code{ENOSPC} is returned.
> +
> +@strong{Note:} If @code{fallocate} is not available (because the file
> +system does not support it), @code{posix_fallocate} is emulated, which
> +has the following drawbacks:
> +
> +@itemize @bullet
> +@item
> +It is very inefficient because all file system blocks in the requested
> +range need to be examined (even if they have been allocated before) and
> +potentially rewritten. In contrast, with proper @code{fallocate}
> +support (see below), the file system can examine the internal file
> +allocation data structures and eliminate holes directly, maybe even
> +using unwritten extents (which are pre-allocated but uninitialized on
> +disk).
> +
> +@item
> +There is a race condition if another thread or process modifies the
> +underlying file in the to-be-allocated area. Non-null bytes could be
> +overwritten with null bytes.
> +
> +@item
> +If @var{fd} has been opened with the @code{O_APPEND} flag, the function
> +will fail with an @code{errno} value of @code{EBADF}.
> +
> +@item
> +If @var{length} is zero, @code{ftruncate} is used to increase the file
> +size as requested, without allocating file system blocks. There is a
> +race condition which means that @code{ftruncate} can accidentally
> +truncate the file if it has been extended concurrently.
> +@end itemize
> +
> +On Linux, if an application does not benefit from emulation or if the
> +emulation is harmful due to its inherent race conditions, the
> +application can use the Linux-specific @code{fallocate} function, with a
> +zero flag argument. For the @code{fallocate} function, @theglibc{} does
> +not perform allocation emulation if the file system does not support
> +allocation. Instead, an @code{EOPNOTSUPP} is returned to the caller.
> +
> +@end deftypefun
OK.
> +
> +@deftypefun int posix_fallocate64 (int @var{fd}, off64_t @var{length}, off64_t @var{offset})
> +@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
OK.
> +
> +This function is a variant of @code{posix_fallocate64} which accepts
> +64-bit file offsets on all platforms.
> +
> +@end deftypefun
> +
> @node Making Special Files
> @section Making Special Files
> @cindex creating special files
> diff --git a/sysdeps/posix/posix_fallocate.c b/sysdeps/posix/posix_fallocate.c
> index d15d603..e7fe201 100644
> --- a/sysdeps/posix/posix_fallocate.c
> +++ b/sysdeps/posix/posix_fallocate.c
> @@ -18,26 +18,36 @@
> #include <errno.h>
> #include <fcntl.h>
> #include <unistd.h>
> +#include <stdint.h>
> +#include <sys/fcntl.h>
> #include <sys/stat.h>
> #include <sys/statfs.h>
>
> -/* Reserve storage for the data of the file associated with FD. */
> +/* Reserve storage for the data of the file associated with FD. This
> + emulation is far from perfect, but the kernel cannot do not much
> + better for network file systems, either. */
>
> int
> posix_fallocate (int fd, __off_t offset, __off_t len)
> {
> struct stat64 st;
> - struct statfs f;
>
> - /* `off_t' is a signed type. Therefore we can determine whether
> - OFFSET + LEN is too large if it is a negative value. */
> if (offset < 0 || len < 0)
> return EINVAL;
> - if (offset + len < 0)
> +
> + /* Perform overflow check. The outer cast relies on a GCC
> + extension. */
> + if ((__off_t) ((uint64_t) offset) + ((uint64_t) len) < 0)
> return EFBIG;
>
> - /* First thing we have to make sure is that this is really a regular
> - file. */
> + /* pwrite below will not do the right thing in O_APPEND mode. */
> + {
> + int flags = __fcntl (fd, F_GETFL, 0);
> + if (flags < 0 || (flags & O_APPEND) != 0)
> + return EBADF;
> + }
> +
> + /* We have to make sure that this is really a regular file. */
> if (__fxstat64 (_STAT_VER, fd, &st) != 0)
> return EBADF;
> if (S_ISFIFO (st.st_mode))
> @@ -47,6 +57,8 @@ posix_fallocate (int fd, __off_t offset, __off_t len)
>
> if (len == 0)
> {
> + /* This is racy, but there is no good way to satisfy a
> + zero-length allocation request. */
> if (st.st_size < offset)
> {
> int ret = __ftruncate (fd, offset);
> @@ -58,19 +70,36 @@ posix_fallocate (int fd, __off_t offset, __off_t len)
> return 0;
> }
>
> - /* We have to know the block size of the filesystem to get at least some
> - sort of performance. */
> - if (__fstatfs (fd, &f) != 0)
> - return errno;
> -
> - /* Try to play safe. */
> - if (f.f_bsize == 0)
> - f.f_bsize = 512;
> -
> - /* Write something to every block. */
> - for (offset += (len - 1) % f.f_bsize; len > 0; offset += f.f_bsize)
> + /* Minimize data transfer for network file systems, by issuing
> + single-byte write requests spaced by the file system block size.
> + (Most local file systems have fallocate support, so this fallback
> + code is not used there.) */
> +
> + unsigned increment;
> + {
> + struct statfs64 f;
> +
> + if (__fstatfs64 (fd, &f) != 0)
> + return errno;
> + if (f.f_bsize == 0)
> + increment = 512;
> + else if (f.f_bsize < 4096)
> + increment = f.f_bsize;
> + else
> + /* NFS does not propagate the block size of the underlying
> + storage and may report a much larger value which would still
> + leave holes after the loop below, so we cap the increment at
> + 4096. */
> + increment = 4096;
OK.
> + }
> +
> + /* Write a null byte to every block. This is racy; we currently
> + lack a better option. Compare-and-swap against a file mapping
> + might additional local races, but requires interposition of a
> + signal handler to catch SIGBUS. */
> + for (offset += (len - 1) % increment; len > 0; offset += increment)
> {
> - len -= f.f_bsize;
> + len -= increment;
>
> if (offset < st.st_size)
> {
> diff --git a/sysdeps/posix/posix_fallocate64.c b/sysdeps/posix/posix_fallocate64.c
> index b845df7..ee32679 100644
> --- a/sysdeps/posix/posix_fallocate64.c
> +++ b/sysdeps/posix/posix_fallocate64.c
> @@ -18,26 +18,36 @@
> #include <errno.h>
> #include <fcntl.h>
> #include <unistd.h>
> +#include <stdint.h>
> +#include <sys/fcntl.h>
> #include <sys/stat.h>
> #include <sys/statfs.h>
>
> -/* Reserve storage for the data of the file associated with FD. */
> +/* Reserve storage for the data of the file associated with FD. This
> + emulation is far from perfect, but the kernel cannot do not much
> + better for network file systems, either. */
>
> int
> __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len)
> {
> struct stat64 st;
> - struct statfs64 f;
>
> - /* `off64_t' is a signed type. Therefore we can determine whether
> - OFFSET + LEN is too large if it is a negative value. */
> if (offset < 0 || len < 0)
> return EINVAL;
> - if (offset + len < 0)
> +
> + /* Perform overflow check. The outer cast relies on a GCC
> + extension. */
> + if ((__off64_t) ((uint64_t) offset) + ((uint64_t) len) < 0)
> return EFBIG;
>
> - /* First thing we have to make sure is that this is really a regular
> - file. */
> + /* pwrite64 below will not do the right thing in O_APPEND mode. */
> + {
> + int flags = __fcntl (fd, F_GETFL, 0);
> + if (flags < 0 || (flags & O_APPEND) != 0)
> + return EBADF;
> + }
> +
> + /* We have to make sure that this is really a regular file. */
> if (__fxstat64 (_STAT_VER, fd, &st) != 0)
> return EBADF;
> if (S_ISFIFO (st.st_mode))
> @@ -47,6 +57,8 @@ __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len)
>
> if (len == 0)
> {
> + /* This is racy, but there is no good way to satisfy a
> + zero-length allocation request. */
> if (st.st_size < offset)
> {
> int ret = __ftruncate64 (fd, offset);
> @@ -58,19 +70,36 @@ __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len)
> return 0;
> }
>
> - /* We have to know the block size of the filesystem to get at least some
> - sort of performance. */
> - if (__fstatfs64 (fd, &f) != 0)
> - return errno;
> -
> - /* Try to play safe. */
> - if (f.f_bsize == 0)
> - f.f_bsize = 512;
> -
> - /* Write something to every block. */
> - for (offset += (len - 1) % f.f_bsize; len > 0; offset += f.f_bsize)
> + /* Minimize data transfer for network file systems, by issuing
> + single-byte write requests spaced by the file system block size.
> + (Most local file systems have fallocate support, so this fallback
> + code is not used there.) */
> +
> + unsigned increment;
> + {
> + struct statfs64 f;
> +
> + if (__fstatfs64 (fd, &f) != 0)
> + return errno;
> + if (f.f_bsize == 0)
> + increment = 512;
> + else if (f.f_bsize < 4096)
> + increment = f.f_bsize;
> + else
> + /* NFS clients do not propagate the block size of the underlying
> + storage and may report a much larger value which would still
> + leave holes after the loop below, so we cap the increment at
> + 4096. */
> + increment = 4096;
> + }
> +
> + /* Write a null byte to every block. This is racy; we currently
> + lack a better option. Compare-and-swap against a file mapping
> + might address local races, but requires interposition of a signal
> + handler to catch SIGBUS. */
> + for (offset += (len - 1) % increment; len > 0; offset += increment)
> {
> - len -= f.f_bsize;
> + len -= increment;
>
> if (offset < st.st_size)
> {
> -- 2.1.0
OK.
Cheers,
Carlos.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-06-03 6:59 ` Carlos O'Donell
@ 2015-06-05 9:55 ` Florian Weimer
2015-06-05 13:55 ` Carlos O'Donell
0 siblings, 1 reply; 25+ messages in thread
From: Florian Weimer @ 2015-06-05 9:55 UTC (permalink / raw)
To: Carlos O'Donell, Roland McGrath, Rich Felker
Cc: Paul Eggert, libc-alpha, Mark Wielaard
On 06/03/2015 06:06 AM, Carlos O'Donell wrote:
> OK to checkin.
Thanks, committed. I did not at bug 15661 to the NEWS file and resolved
it as WONTFIX.
--
Florian Weimer / Red Hat Product Security
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-06-05 9:55 ` Florian Weimer
@ 2015-06-05 13:55 ` Carlos O'Donell
2015-06-05 13:57 ` Florian Weimer
0 siblings, 1 reply; 25+ messages in thread
From: Carlos O'Donell @ 2015-06-05 13:55 UTC (permalink / raw)
To: Florian Weimer, Roland McGrath, Rich Felker
Cc: Paul Eggert, libc-alpha, Mark Wielaard
On 06/05/2015 05:52 AM, Florian Weimer wrote:
> On 06/03/2015 06:06 AM, Carlos O'Donell wrote:
>
>> OK to checkin.
>
> Thanks, committed. I did not at bug 15661 to the NEWS file and resolved
> it as WONTFIX.
Since this commit fixes a user-visible issue it should have a bug filed
and that bug should be marked in the fixed bug list. This way users
can reopen if they have problems, or review the bugs fixed in the release.
Would you mind opening a bug for the NFS block size issue and then
marking it closed, and updating your ChangeLog entry and NEWS?
Cheers,
Carlos.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]
2015-06-05 13:55 ` Carlos O'Donell
@ 2015-06-05 13:57 ` Florian Weimer
0 siblings, 0 replies; 25+ messages in thread
From: Florian Weimer @ 2015-06-05 13:57 UTC (permalink / raw)
To: Carlos O'Donell, Roland McGrath, Rich Felker
Cc: Paul Eggert, libc-alpha, Mark Wielaard
On 06/05/2015 03:11 PM, Carlos O'Donell wrote:
> On 06/05/2015 05:52 AM, Florian Weimer wrote:
>> On 06/03/2015 06:06 AM, Carlos O'Donell wrote:
>>
>>> OK to checkin.
>>
>> Thanks, committed. I did not at bug 15661 to the NEWS file and resolved
>> it as WONTFIX.
>
> Since this commit fixes a user-visible issue it should have a bug filed
> and that bug should be marked in the fixed bug list. This way users
> can reopen if they have problems, or review the bugs fixed in the release.
>
> Would you mind opening a bug for the NFS block size issue and then
> marking it closed, and updating your ChangeLog entry and NEWS?
Turns out there was already a bug specifically for the NFS issue, 17322,
so used that.
--
Florian Weimer / Red Hat Product Security
^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2015-06-05 13:22 UTC | newest]
Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-24 13:45 [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661] Florian Weimer
2015-05-05 15:37 ` Florian Weimer
2015-05-05 15:59 ` Paul Eggert
2015-05-05 20:28 ` Carlos O'Donell
2015-05-05 20:48 ` Christoph Hellwig
2015-05-06 20:59 ` Carlos O'Donell
2015-05-06 23:29 ` Rich Felker
2015-05-06 7:19 ` Florian Weimer
2015-05-06 22:48 ` Paul Eggert
2015-05-06 23:31 ` Rich Felker
2015-05-07 18:19 ` Roland McGrath
2015-05-07 19:05 ` Florian Weimer
2015-05-18 13:14 ` Florian Weimer
2015-05-22 15:03 ` Mark Wielaard
2015-05-26 12:24 ` Florian Weimer
2015-05-26 12:29 ` Mark Wielaard
2015-05-26 12:38 ` Florian Weimer
2015-05-26 16:57 ` Rich Felker
2015-05-27 19:48 ` Mark Wielaard
2015-05-28 14:43 ` Florian Weimer
2015-06-03 6:59 ` Carlos O'Donell
2015-06-05 9:55 ` Florian Weimer
2015-06-05 13:55 ` Carlos O'Donell
2015-06-05 13:57 ` Florian Weimer
2015-05-07 19:01 ` Carlos O'Donell
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).