* (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? @ 2021-02-14 11:13 Tobias Bading 2021-02-14 12:18 ` Tobias Bading 0 siblings, 1 reply; 20+ messages in thread From: Tobias Bading @ 2021-02-14 11:13 UTC (permalink / raw) To: libc-help Hi. A few days ago I encountered strange problems at home while using GNU Emacs on GNU/Linux with Windows SMB shares in the office automounted through the company's VPN. Details and my current findings on the Emacs devel mailing list: https://lists.gnu.org/archive/html/emacs-devel/2021-02/msg00835.html. TL;DR: Emacs' stat() and faccessat() calls sometimes return -1 and set errno to EINTR (according to strace). Most of the time the interruption seems to come from a SIGIO, but at least once a SIGCHLD was in the mix as well. The behavior of returning with errno == EINTR seems to be new and/or some kind of bug, because otherwise Emacs would handle that case with TEMP_FAILURE_RETRY() or something to that effect, right? I've used pretty much the same Emacs in the office (on an older GNU/Linux release) for ages with the same Windows SMB shares and never had this problem. The GNU/Linux man pages of stat() and faccessat() don't mention EINTR at all, and neither does POSIX (https://pubs.opengroup.org/onlinepubs/9699919799/). So I guess the one million dollar question is this: Are stat() and faccessat() permitted to return -1 with errno == EINTR, or does that violate the POSIX (or some other) spec? I'll probably write a small test program next to check if I can reproduce this problem outside of Emacs, maybe with a simple SIGALRM and an endless loop of stat()/faccessat() calls. If that reproduces the problem it would be nice to know whether using SA_RESTART in sigaction() has any effect. Any ideas are very welcome... :) Tobias PS: I'm running Ubuntu focal (20.04.2 LTS, amd64), which currently uses libc-bin 2.31-0ubuntu9.2, kernel linux-image-5.4.0-64-generic and openconnect 8.05-1 to access the VPN. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? 2021-02-14 11:13 (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? Tobias Bading @ 2021-02-14 12:18 ` Tobias Bading 2021-02-14 13:17 ` Tobias Bading 2021-02-14 17:56 ` Konstantin Kharlamov 0 siblings, 2 replies; 20+ messages in thread From: Tobias Bading @ 2021-02-14 12:18 UTC (permalink / raw) To: libc-help [-- Attachment #1: Type: text/plain, Size: 2496 bytes --] Hello again. I've been able to reproduce the problem with the attached program. With SIGALRMs firing 10 times per second, I get maybe a dozen "handler called" lines before stat() or faccessat() fails with errno EINTR. When I increase the rate to 50 SIGALRMs per second, the very first stat() fails in every test run. Specifying SA_RESTART in sigaction() has no effect. Unfortunately, I don't have any other network shares available at the moment to test whether only CIFS through VPN is affected, or the problem would occur with e.g. NFS or CIFS without a VPN as well. Tobias --- On 14.02.21 12:13, Tobias Bading wrote: > Hi. > > A few days ago I encountered strange problems at home while using GNU > Emacs on GNU/Linux with Windows SMB shares in the office automounted > through the company's VPN. > > Details and my current findings on the Emacs devel mailing list: > https://lists.gnu.org/archive/html/emacs-devel/2021-02/msg00835.html. > > TL;DR: > Emacs' stat() and faccessat() calls sometimes return -1 and set errno > to EINTR (according to strace). Most of the time the interruption > seems to come from a SIGIO, but at least once a SIGCHLD was in the mix > as well. > > The behavior of returning with errno == EINTR seems to be new and/or > some kind of bug, because otherwise Emacs would handle that case with > TEMP_FAILURE_RETRY() or something to that effect, right? I've used > pretty much the same Emacs in the office (on an older GNU/Linux > release) for ages with the same Windows SMB shares and never had this > problem. > > The GNU/Linux man pages of stat() and faccessat() don't mention EINTR > at all, and neither does POSIX > (https://pubs.opengroup.org/onlinepubs/9699919799/). > > So I guess the one million dollar question is this: > Are stat() and faccessat() permitted to return -1 with errno == EINTR, > or does that violate the POSIX (or some other) spec? > > I'll probably write a small test program next to check if I can > reproduce this problem outside of Emacs, maybe with a simple SIGALRM > and an endless loop of stat()/faccessat() calls. If that reproduces > the problem it would be nice to know whether using SA_RESTART in > sigaction() has any effect. > > Any ideas are very welcome... :) > > Tobias > > PS: I'm running Ubuntu focal (20.04.2 LTS, amd64), which currently > uses libc-bin 2.31-0ubuntu9.2, kernel linux-image-5.4.0-64-generic and > openconnect 8.05-1 to access the VPN. > [-- Attachment #2: dont-interrupt-me.c --] [-- Type: text/x-csrc, Size: 1104 bytes --] #include <fcntl.h> #include <signal.h> #include <stdio.h> #include <string.h> #include <sys/stat.h> #include <sys/time.h> #include <sys/types.h> #include <unistd.h> static const char path[] = "/smb/server/share/dir/subdir"; void handler (int signum) { static const char t[] = "handler called\n"; write (1, t, sizeof t); } int main () { struct stat st; struct sigaction action; struct itimerval timer; memset (&action, 0, sizeof action); action.sa_handler = handler; action.sa_flags = SA_RESTART; if (sigaction (SIGALRM, &action, NULL)) { perror ("sigaction() failed"); return 1; } timer.it_value.tv_sec = timer.it_interval.tv_sec = 0; timer.it_value.tv_usec = timer.it_interval.tv_usec = 1000000 / 10; if (setitimer (ITIMER_REAL, &timer, NULL)) { perror ("setitimer() failed"); return 1; } for (;;) { int r = stat (path, &st); if (r) { perror ("stat() failed"); return 1; } r = faccessat (AT_FDCWD, path, R_OK, 0); if (r) { perror ("faccessat() failed"); return 1; } } return 0; } ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? 2021-02-14 12:18 ` Tobias Bading @ 2021-02-14 13:17 ` Tobias Bading 2021-02-14 15:14 ` Godmar Back 2021-02-14 17:56 ` Konstantin Kharlamov 1 sibling, 1 reply; 20+ messages in thread From: Tobias Bading @ 2021-02-14 13:17 UTC (permalink / raw) To: libc-help o/ I ran the test program against different servers and shares in the office. The share on which I encountered the problem first seems to be affected the worst: 50 SIGALRMs per second and the first stat() has no chance. Other shares, some on Windows servers and one on an older GNU/Linux with Samba, seemed not to have this problem at first. But when I removed the teminal output from the SIGALRM handler and increased the SIGALRM rate to 10000 per second, all those other shares produced the error as well sooner or later. Hmm... Tobias On 14.02.21 13:18, Tobias Bading wrote: > Hello again. > > I've been able to reproduce the problem with the attached program. > With SIGALRMs firing 10 times per second, I get maybe a dozen "handler > called" lines before stat() or faccessat() fails with errno EINTR. > When I increase the rate to 50 SIGALRMs per second, the very first > stat() fails in every test run. > > Specifying SA_RESTART in sigaction() has no effect. > > Unfortunately, I don't have any other network shares available at the > moment to test whether only CIFS through VPN is affected, or the > problem would occur with e.g. NFS or CIFS without a VPN as well. > > Tobias > > --- > > On 14.02.21 12:13, Tobias Bading wrote: >> Hi. >> >> A few days ago I encountered strange problems at home while using GNU >> Emacs on GNU/Linux with Windows SMB shares in the office automounted >> through the company's VPN. >> >> Details and my current findings on the Emacs devel mailing list: >> https://lists.gnu.org/archive/html/emacs-devel/2021-02/msg00835.html. >> >> TL;DR: >> Emacs' stat() and faccessat() calls sometimes return -1 and set errno >> to EINTR (according to strace). Most of the time the interruption >> seems to come from a SIGIO, but at least once a SIGCHLD was in the >> mix as well. >> >> The behavior of returning with errno == EINTR seems to be new and/or >> some kind of bug, because otherwise Emacs would handle that case with >> TEMP_FAILURE_RETRY() or something to that effect, right? I've used >> pretty much the same Emacs in the office (on an older GNU/Linux >> release) for ages with the same Windows SMB shares and never had this >> problem. >> >> The GNU/Linux man pages of stat() and faccessat() don't mention EINTR >> at all, and neither does POSIX >> (https://pubs.opengroup.org/onlinepubs/9699919799/). >> >> So I guess the one million dollar question is this: >> Are stat() and faccessat() permitted to return -1 with errno == >> EINTR, or does that violate the POSIX (or some other) spec? >> >> I'll probably write a small test program next to check if I can >> reproduce this problem outside of Emacs, maybe with a simple SIGALRM >> and an endless loop of stat()/faccessat() calls. If that reproduces >> the problem it would be nice to know whether using SA_RESTART in >> sigaction() has any effect. >> >> Any ideas are very welcome... :) >> >> Tobias >> >> PS: I'm running Ubuntu focal (20.04.2 LTS, amd64), which currently >> uses libc-bin 2.31-0ubuntu9.2, kernel linux-image-5.4.0-64-generic >> and openconnect 8.05-1 to access the VPN. >> > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? 2021-02-14 13:17 ` Tobias Bading @ 2021-02-14 15:14 ` Godmar Back 2021-02-14 15:20 ` Godmar Back 2021-02-14 17:30 ` Tobias Bading 0 siblings, 2 replies; 20+ messages in thread From: Godmar Back @ 2021-02-14 15:14 UTC (permalink / raw) To: Tobias Bading; +Cc: William Tambe via Libc-help Tobias, my assessment would be that this is the wrong email list - what you describe, if correct, would be a kernel bug. Specifically, inside the kernel, most system calls are restartable (and should be restarted if `SA_RESTART` is given), but occasionally you find a place where the kernel implementor decided that they can't restart the system call, in which case `EINTR` is propagated to user land. I've encountered this personally in the past with `tcsetattr()`. Now whether it is indeed impossible for the implementor to safely restart the call or not depends on the specific implementation details (and path taken through the kernel). Your choices are, in my opinion: - to handle `EINTR` in user mode - to file a bug (and propose a fix) against the kernel. You'd need to look in https://github.com/torvalds/linux/tree/master/fs/cifs for places where they set something to `-EINTR` and then fail to turn it into `-ERESTARTSYS` and do not have good reason to do so. I suspect this would require in-depth knowledge of the cifs file system code. Bringing up POSIX may not be the most fruitful course of action, and asking for compensation in the C library (as in doing an automatic restart in the C library system call wrapper) is generally not a good idea to my knowledge. My 2c. - Godmar ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? 2021-02-14 15:14 ` Godmar Back @ 2021-02-14 15:20 ` Godmar Back 2021-02-14 17:30 ` Tobias Bading 1 sibling, 0 replies; 20+ messages in thread From: Godmar Back @ 2021-02-14 15:20 UTC (permalink / raw) To: Tobias Bading; +Cc: William Tambe via Libc-help PS: This discussion may be helpful https://www.spinics.net/lists/stable/msg440182.html - maybe it's your bug? Also, a link to when I brought up a similar issue in 2011 about tcsetattr(): http://lkml.iu.edu/hypermail/linux/kernel/1110.1/02954.html ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? 2021-02-14 15:14 ` Godmar Back 2021-02-14 15:20 ` Godmar Back @ 2021-02-14 17:30 ` Tobias Bading 2021-02-14 22:04 ` Godmar Back 2021-02-15 8:33 ` Tobias Bading 1 sibling, 2 replies; 20+ messages in thread From: Tobias Bading @ 2021-02-14 17:30 UTC (permalink / raw) To: Godmar Back; +Cc: libc-help Hi Godmar, thanks a lot for your feedback. > Bringing up POSIX may not be the most fruitful course of action, and > asking for compensation in the C library (as in doing an automatic > restart in the C library system call wrapper) is generally not a good > idea to my knowledge. Sorry if my mail sounded like I was trying to point fingers or anything. I simply chose the GNU libc mailing list because I couldn't find the definitive answer to the seemingly innocent question "Can stat() or faccessat() return -1 with errno == EINTR?", neither in their man pages, some standard spec like POSIX, nor the depth of the internet. I've been writing software for different UNIXes for a few decades now and so far I primarily followed the(/my own?) rule "if a man page doesn't mention EINTR, you don't need to worry about it". If that rule is incorrect, I have quite a few lines of code to review and wouldn't even know where to start. XD EINTR is definitely a very special errno that needs to be handled in userland, *if* a function is known to return with errno EINTR in (more or less) specific cases. But there's a ton of ancient userland code that calls stat() without handling EINTR, probably because the function never returned EINTR before and no documentation ever claimed that it could. If that function is able to return with EINTR now, it has the potential to break a lot of existing code. > Your choices are, in my opinion: > - to handle `EINTR` in user mode Yes, wrapping TEMP_FAILURE_RETRY() macros around the stat() and faccessat() calls in Emacs' source code did work as a band-aid. > - to file a bug (and propose a fix) against the kernel. If the GNU libc devs agree that neither stat() nor faccessat() should ever return with errno EINTR and this error code is coming directly from a syscall, then that's the way to go I guess. > Specifically, inside the kernel, most system calls are restartable > (and should be restarted if `SA_RESTART` is given), but occasionally > you find a place where the kernel implementor decided that they can't > restart the system call, in which case `EINTR` is propagated to user > land. I've encountered this personally in the past with > `tcsetattr()`. > Also, a link to when I brought up a similar issue in 2011 about tcsetattr(): > http://lkml.iu.edu/hypermail/linux/kernel/1110.1/02954.html A kernel implementor decides to let EINTR propagate into userland? o.O In case of tcsetattr() or stat(), wouldn't that be in clear violation of Linus' "WE DO NOT BREAK USERSPACE!" first rule of kernel maintenance? (https://lkml.org/lkml/2012/12/23/75) What happened in the tcsetattr() case? The man page of tcsetattr() doesn't mention EINTR, did they revert the kernel change? > You'd need to > look in https://github.com/torvalds/linux/tree/master/fs/cifs for > places where they set something to `-EINTR` and then fail to turn it > into `-ERESTARTSYS` and do not have good reason to do so. Thanks for the pointers. Do you know whether -ERESTARTSYS is used to request an unconditional restart of the system call, or is SA_RESTART specified (or not) in userland also a factor? If we agree that stat() should never return with EINTR, than SA_RESTART shouldn't be a factor. (BTW, Emacs is intentionally not using SA_RESTART in emacs_sigaction_flags() in src/sysdep.c.) > PS: This discussion may be helpful > https://www.spinics.net/lists/stable/msg440182.html - maybe it's your > bug? Thanks!!! I'm running a 5.4 kernel, but since that mail is barely a month old I have to check whether my current kernel contains that patch. Tobias ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? 2021-02-14 17:30 ` Tobias Bading @ 2021-02-14 22:04 ` Godmar Back 2021-02-15 9:15 ` Tobias Bading 2021-02-15 8:33 ` Tobias Bading 1 sibling, 1 reply; 20+ messages in thread From: Godmar Back @ 2021-02-14 22:04 UTC (permalink / raw) To: Tobias Bading; +Cc: William Tambe via Libc-help I don't know if this qualifies as "breaking userland" or "this has never worked." You could use systemtap to patch the kernel to see if you can trap the path where they're returning `EINTR`. I recently came across a blog post: https://engineering.skroutz.gr/blog/uncovering-a-24-year-old-bug-in-the-linux-kernel/ where some people did that and found a bug in the TCP code. Very impressive. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? 2021-02-14 22:04 ` Godmar Back @ 2021-02-15 9:15 ` Tobias Bading 2021-02-15 9:24 ` Florian Weimer 0 siblings, 1 reply; 20+ messages in thread From: Tobias Bading @ 2021-02-15 9:15 UTC (permalink / raw) To: Godmar Back; +Cc: William Tambe via Libc-help > You could use systemtap to patch the kernel to see if you can trap the > path where they're returning `EINTR`. > I recently came across a blog post: > https://engineering.skroutz.gr/blog/uncovering-a-24-year-old-bug-in-the-linux-kernel/ > where some people did that and found a bug in the TCP code. Very > impressive. Thanks, very interesting read indeed. I haven't heard of systemtap before, sounds like a very cool tool for kernel hackers. Since I'm only doing userland stuff, I usually trust Valgrind to help me out when things get fishy. > I don't know if this qualifies as "breaking userland" [...] I'd say breaking stat() qualifies big time, but to be honest I don't really have the time/nerve to read https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh_chap02.html in full. The "Signal Effects on Other Functions" paragraph seems interesting, but my head explodes when I read sentences like "Therefore, implementations should generally make such functions (including ones defined as extensions) interruptible." "Functions not mentioned explicitly as interruptible may be so on some implementations, possibly as an extension where the function gives an [EINTR] error." English isn't my first language, so I must be reading this wrong or interpreting it in the wrong context. If a POSIX implementation was allowed to add EINTR to the list of possible error codes returned by a function defined in the standard, what kind of standard would that be? How about an implementation of malloc() which returns NULL and sets errno to EINTR when a signal is raised while the calling thread was asleep because it was waiting for a disk read from swapspace? XD Tobias --- On 14.02.21 23:04, Godmar Back wrote: > I don't know if this qualifies as "breaking userland" or "this has > never worked." > > You could use systemtap to patch the kernel to see if you can trap the > path where they're returning `EINTR`. > I recently came across a blog post: > https://engineering.skroutz.gr/blog/uncovering-a-24-year-old-bug-in-the-linux-kernel/ > where some people did that and found a bug in the TCP code. Very > impressive. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? 2021-02-15 9:15 ` Tobias Bading @ 2021-02-15 9:24 ` Florian Weimer 2021-02-15 9:43 ` Tobias Bading 0 siblings, 1 reply; 20+ messages in thread From: Florian Weimer @ 2021-02-15 9:24 UTC (permalink / raw) To: Tobias Bading via Libc-help; +Cc: Godmar Back, Tobias Bading * Tobias Bading via Libc-help: > English isn't my first language, so I must be reading this wrong or > interpreting it in the wrong context. If a POSIX implementation was > allowed to add EINTR to the list of possible error codes returned by a > function defined in the standard, what kind of standard would that be? > How about an implementation of malloc() which returns NULL and sets > errno to EINTR when a signal is raised while the calling thread was > asleep because it was waiting for a disk read from swapspace? XD Returning EINTR in stat would allow relatively straightforward implementation of a timeout, in case the path resides on a network file system and the server is unreachable. So it's not a completely unreasonable thing to do. On the other hand, the cost in lost backwards compatibility with applications that do not know about this behavior appears to be pretty high, as this thread shows. Thanks, Florian -- Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn, Commercial register: Amtsgericht Muenchen, HRB 153243, Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? 2021-02-15 9:24 ` Florian Weimer @ 2021-02-15 9:43 ` Tobias Bading 2021-02-15 9:45 ` Florian Weimer 0 siblings, 1 reply; 20+ messages in thread From: Tobias Bading @ 2021-02-15 9:43 UTC (permalink / raw) To: Florian Weimer; +Cc: libc-help, Godmar Back > Returning EINTR in stat would allow relatively straightforward > implementation of a timeout, in case the path resides on a network file > system and the server is unreachable. So it's not a completely > unreasonable thing to do. Good point. > On the other hand, the cost in lost backwards > compatibility with applications that do not know about this behavior > appears to be pretty high, as this thread shows. What's your interpretation of the POSIX standard? Does it permit such a backwards compatibility breaking change? Tobias ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? 2021-02-15 9:43 ` Tobias Bading @ 2021-02-15 9:45 ` Florian Weimer 2021-02-15 10:02 ` Tobias Bading 0 siblings, 1 reply; 20+ messages in thread From: Florian Weimer @ 2021-02-15 9:45 UTC (permalink / raw) To: Tobias Bading; +Cc: libc-help, Godmar Back * Tobias Bading: >> Returning EINTR in stat would allow relatively straightforward >> implementation of a timeout, in case the path resides on a network file >> system and the server is unreachable. So it's not a completely >> unreasonable thing to do. > > Good point. > >> On the other hand, the cost in lost backwards >> compatibility with applications that do not know about this behavior >> appears to be pretty high, as this thread shows. > > What's your interpretation of the POSIX standard? Does it permit such a > backwards compatibility breaking change? Yes, this is not a POSIX conformance issue. POSIX also does not make any requirements regarding backwards compatibility or bug-for-bug compatibility. Thanks, Florian -- Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn, Commercial register: Amtsgericht Muenchen, HRB 153243, Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? 2021-02-15 9:45 ` Florian Weimer @ 2021-02-15 10:02 ` Tobias Bading 2021-02-15 10:20 ` Florian Weimer 0 siblings, 1 reply; 20+ messages in thread From: Tobias Bading @ 2021-02-15 10:02 UTC (permalink / raw) To: Florian Weimer; +Cc: libc-help, Godmar Back > Yes, this is not a POSIX conformance issue. o.O I don't get it. How is a developer supposed to decide in which cases EINTR handling is required? Check the man page on every platform the code is supposed to work on and hope that at least one platform mentions EINTR if the function is indeed able to fail that way? Tobias --- On 15.02.21 10:45, Florian Weimer wrote: > * Tobias Bading: > >>> Returning EINTR in stat would allow relatively straightforward >>> implementation of a timeout, in case the path resides on a network file >>> system and the server is unreachable. So it's not a completely >>> unreasonable thing to do. >> Good point. >> >>> On the other hand, the cost in lost backwards >>> compatibility with applications that do not know about this behavior >>> appears to be pretty high, as this thread shows. >> What's your interpretation of the POSIX standard? Does it permit such a >> backwards compatibility breaking change? > Yes, this is not a POSIX conformance issue. POSIX also does not make > any requirements regarding backwards compatibility or bug-for-bug > compatibility. > > Thanks, > Florian ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? 2021-02-15 10:02 ` Tobias Bading @ 2021-02-15 10:20 ` Florian Weimer 2021-02-15 10:36 ` Tobias Bading 0 siblings, 1 reply; 20+ messages in thread From: Florian Weimer @ 2021-02-15 10:20 UTC (permalink / raw) To: Tobias Bading; +Cc: libc-help, Godmar Back * Tobias Bading: >> Yes, this is not a POSIX conformance issue. > > o.O > > I don't get it. How is a developer supposed to decide in which cases > EINTR handling is required? Check the man page on every platform the > code is supposed to work on and hope that at least one platform mentions > EINTR if the function is indeed able to fail that way? The issue here is that one Linux file system (CIFS) behaves differently from most other file systems. That is not something that can be addressed with documentation at the level of the manual pages. Practically speaking, I would say this is a file system bug. Thanks, Florian -- Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn, Commercial register: Amtsgericht Muenchen, HRB 153243, Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? 2021-02-15 10:20 ` Florian Weimer @ 2021-02-15 10:36 ` Tobias Bading 0 siblings, 0 replies; 20+ messages in thread From: Tobias Bading @ 2021-02-15 10:36 UTC (permalink / raw) To: Florian Weimer; +Cc: libc-help, Godmar Back On 15.02.21 11:20, Florian Weimer wrote: > * Tobias Bading: >>> Yes, this is not a POSIX conformance issue. >> >> o.O >> >> I don't get it. How is a developer supposed to decide in which cases >> EINTR handling is required? Check the man page on every platform the >> code is supposed to work on and hope that at least one platform mentions >> EINTR if the function is indeed able to fail that way? > > The issue here is that one Linux file system (CIFS) behaves differently > from most other file systems. That is not something that can be > addressed with documentation at the level of the manual pages. > Practically speaking, I would say this is a file system bug. Yes, looks like a file system bug in this case. But I'm wondering how this is supposed to work in general: I'm writing some piece of code and are about to call a POSIX function. How the heck am I supposed to figure out whether EINTR handling is required if I can't rely on the man page and the POSIX spec? Tobias ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? 2021-02-14 17:30 ` Tobias Bading 2021-02-14 22:04 ` Godmar Back @ 2021-02-15 8:33 ` Tobias Bading 2021-02-15 8:45 ` Konstantin Kharlamov 1 sibling, 1 reply; 20+ messages in thread From: Tobias Bading @ 2021-02-15 8:33 UTC (permalink / raw) To: libc-help [-- Attachment #1: Type: text/plain, Size: 1710 bytes --] Update: I've pimped the test program a bit and ran it against a local CIFS share like Konstantin did. Output of some of my test runs: $ ./its-raining-signals /mnt/shared 100 100 signals during 747470 loop iterations 100 signals during 752470 loop iterations 100 signals during 733729 loop iterations 100 signals during 754084 loop iterations 100 signals during 746603 loop iterations stat() failed: Interrupted system call $ ./its-raining-signals /mnt/shared 1000 1000 signals during 746063 loop iterations stat() failed: Interrupted system call $ ./its-raining-signals /mnt/shared 10000 stat() failed: Interrupted system call $ ./its-raining-signals /mnt/shared 1000 r 1000 signals during 724637 loop iterations 1000 signals during 749337 loop iterations 1000 signals during 747301 loop iterations [... keeps on going, no errors ...] $ ./its-raining-signals /mnt/shared 10000 r [no output, hangs] These are the two CIFS-related kernel patches I've found so far: - cifs: handle -EINTR in cifs_setattr https://github.com/torvalds/linux/commit/c6cc4c5a72505a0ecefc9b413f16bec512f38078 (earliest linux-stable tag including the patch: v5.10-rc1) - cifs: do not fail __smb_send_rqst if non-fatal signals are pending (already mentioned by Godmar) https://github.com/torvalds/linux/commit/214a5ea081e77346e4963dd6d20c5539ff8b6ae6 (earliest linux-stable tag including the patch: v5.11-rc5) If I checked the Ubuntu focal git repo correctly, both commits aren't in there, so apparently no backports yet in Ubuntu. I'll check how much effort it is to install a newer mainline kernel (including the kernel modules like cifs.ko) in Ubuntu focal. Tobias [-- Attachment #2: its-raining-signals.c --] [-- Type: text/x-csrc, Size: 1906 bytes --] #define _GNU_SOURCE #include <errno.h> #include <fcntl.h> #include <signal.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/stat.h> #include <sys/time.h> #include <sys/types.h> #include <unistd.h> static long handler_calls, loop_iterations; #define TEST_STAT stat (path, &st) #define TEST_FACCESSAT faccessat (AT_FDCWD, path, R_OK, 0) void handler (int signum) { ++handler_calls; } int main (int argc, char *argv[]) { const char *path; long signals_per_second; int retry = 0; struct stat st; struct sigaction action; struct itimerval timer; if (argc < 3) { printf ("usage: %s PATH-TO-SHARE SIGNALS-PER-SECOND [r]\n" " SIGNALS-PER-SECOND between 2 and 1 million\n" " r repeats a call when errno is EINTR\n", argv[0]); return 1; } path = argv[1]; signals_per_second = strtol (argv[2], NULL, 0); if (argc >= 4 && !strcmp (argv[3], "r")) retry = 1; memset (&action, 0, sizeof action); action.sa_handler = handler; action.sa_flags = SA_RESTART; if (sigaction (SIGALRM, &action, NULL)) { perror ("sigaction() failed"); return 1; } timer.it_value.tv_sec = timer.it_interval.tv_sec = 0; timer.it_value.tv_usec = timer.it_interval.tv_usec = 1000000 / signals_per_second; if (setitimer (ITIMER_REAL, &timer, NULL)) { perror ("setitimer() failed"); return 1; } for (;;) { if (retry ? TEMP_FAILURE_RETRY (TEST_STAT) : TEST_STAT) { perror ("stat() failed"); return 1; } if (retry ? TEMP_FAILURE_RETRY (TEST_FACCESSAT) : TEST_FACCESSAT) { perror ("faccessat() failed"); return 1; } ++loop_iterations; if (handler_calls >= signals_per_second) { printf ("%ld signals during %ld loop iterations\n", handler_calls, loop_iterations); handler_calls = loop_iterations = 0; } } return 0; } ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? 2021-02-15 8:33 ` Tobias Bading @ 2021-02-15 8:45 ` Konstantin Kharlamov 2021-02-15 10:25 ` Tobias Bading 0 siblings, 1 reply; 20+ messages in thread From: Konstantin Kharlamov @ 2021-02-15 8:45 UTC (permalink / raw) To: Tobias Bading, libc-help On Mon, 2021-02-15 at 09:33 +0100, Tobias Bading wrote: > I'll check how much effort it is to install a newer mainline kernel > (including the kernel modules like cifs.ko) in Ubuntu focal. > > Tobias > Oh, it should be pretty easy, Canonical provides kernel packages for testing for all kernel versions here https://kernel.ubuntu.com/~kernel-ppa/mainline/ Idk though if `cifs.ko` is inside one of them, but I'd assume yes. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? 2021-02-15 8:45 ` Konstantin Kharlamov @ 2021-02-15 10:25 ` Tobias Bading 0 siblings, 0 replies; 20+ messages in thread From: Tobias Bading @ 2021-02-15 10:25 UTC (permalink / raw) To: Konstantin Kharlamov, libc-help Just installed the two packages linux-image-unsigned-5.4.98-050498-generic_5.4.98-050498.202102131336_amd64.deb linux-modules-5.4.98-050498-generic_5.4.98-050498.202102131336_amd64.deb from https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.4.98/ and my test program can't reproduce the bug any longer. :) Tobias --- On 15.02.21 09:45, Konstantin Kharlamov wrote: > On Mon, 2021-02-15 at 09:33 +0100, Tobias Bading wrote: >> I'll check how much effort it is to install a newer mainline kernel >> (including the kernel modules like cifs.ko) in Ubuntu focal. >> >> Tobias >> > Oh, it should be pretty easy, Canonical provides kernel packages for testing for all kernel versions here https://kernel.ubuntu.com/~kernel-ppa/mainline/ > > Idk though if `cifs.ko` is inside one of them, but I'd assume yes. > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? 2021-02-14 12:18 ` Tobias Bading 2021-02-14 13:17 ` Tobias Bading @ 2021-02-14 17:56 ` Konstantin Kharlamov 2021-02-14 18:58 ` Tobias Bading 1 sibling, 1 reply; 20+ messages in thread From: Konstantin Kharlamov @ 2021-02-14 17:56 UTC (permalink / raw) To: Tobias Bading, libc-help On Sun, 2021-02-14 at 13:18 +0100, Tobias Bading via Libc-help wrote: > Hello again. > > I've been able to reproduce the problem with the attached program. With > SIGALRMs firing 10 times per second, I get maybe a dozen "handler > called" lines before stat() or faccessat() fails with errno EINTR. When > I increase the rate to 50 SIGALRMs per second, the very first stat() > fails in every test run. > > Specifying SA_RESTART in sigaction() has no effect. > > Unfortunately, I don't have any other network shares available at the > moment to test whether only CIFS through VPN is affected, or the problem > would occur with e.g. NFS or CIFS without a VPN as well. Hello! So, I tried to reproduce this by creating a samba share, then mounting it with at `/tmp/mnt` by using a command: sudo mount -t cifs -o username=guest,rw,exec,auto //127.0.0.1/export_bds mnt then I took your example, and modified the `path` variable to point to `/tmp/mnt`. Afterwards, it's been running for a minute I think, and I only ever seen `handler called` lines. So given other emails mentioned you may have stumbled upon a kernel bug, it apparently was fixed later. My kernel is 5.10.15, on Archlinux. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? 2021-02-14 17:56 ` Konstantin Kharlamov @ 2021-02-14 18:58 ` Tobias Bading 2021-02-14 20:46 ` Konstantin Kharlamov 0 siblings, 1 reply; 20+ messages in thread From: Tobias Bading @ 2021-02-14 18:58 UTC (permalink / raw) To: Konstantin Kharlamov, libc-help Hello Konstantin, thanks for trying to reproduce the problem. Could you please try void handler (int signum) { } and timer.it_value.tv_usec = timer.it_interval.tv_usec = 1000000 / 10000; to increase the rate of SIGALRMs to 10000 (or even higher) per second? For me this reproduces the error also on shares that looked like they weren't affected when only 50 signals per second were created. There seems to be only a very short time window in which some piece of (probably CIFS-related) kernel code can be thrown of the rails by a signal, and the length of that time window probably depends on the performance of client and server, the network connection and whatnot. Since you're using a local share, your time window to produce the error is probably much smaller than mine. Tobias --- On 14.02.21 18:56, Konstantin Kharlamov wrote: > On Sun, 2021-02-14 at 13:18 +0100, Tobias Bading via Libc-help wrote: >> Hello again. >> >> I've been able to reproduce the problem with the attached program. With >> SIGALRMs firing 10 times per second, I get maybe a dozen "handler >> called" lines before stat() or faccessat() fails with errno EINTR. When >> I increase the rate to 50 SIGALRMs per second, the very first stat() >> fails in every test run. >> >> Specifying SA_RESTART in sigaction() has no effect. >> >> Unfortunately, I don't have any other network shares available at the >> moment to test whether only CIFS through VPN is affected, or the problem >> would occur with e.g. NFS or CIFS without a VPN as well. > Hello! So, I tried to reproduce this by creating a samba share, then mounting it with at `/tmp/mnt` by using a command: > > sudo mount -t cifs -o username=guest,rw,exec,auto //127.0.0.1/export_bds mnt > > then I took your example, and modified the `path` variable to point to `/tmp/mnt`. > > Afterwards, it's been running for a minute I think, and I only ever seen `handler called` lines. > > So given other emails mentioned you may have stumbled upon a kernel bug, it apparently was fixed later. My kernel is 5.10.15, on Archlinux. > > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? 2021-02-14 18:58 ` Tobias Bading @ 2021-02-14 20:46 ` Konstantin Kharlamov 0 siblings, 0 replies; 20+ messages in thread From: Konstantin Kharlamov @ 2021-02-14 20:46 UTC (permalink / raw) To: Tobias Bading, libc-help On Sun, 2021-02-14 at 19:58 +0100, Tobias Bading wrote: > Hello Konstantin, > > thanks for trying to reproduce the problem. Could you please try > > void handler (int signum) > { > } > > and > > timer.it_value.tv_usec = timer.it_interval.tv_usec = 1000000 / 10000; > > to increase the rate of SIGALRMs to 10000 (or even higher) per second? > For me this reproduces the error also on shares that looked like they > weren't affected when only 50 signals per second were created. There > seems to be only a very short time window in which some piece of > (probably CIFS-related) kernel code can be thrown of the rails by a > signal, and the length of that time window probably depends on the > performance of client and server, the network connection and whatnot. > Since you're using a local share, your time window to produce the error > is probably much smaller than mine. Thanks, did that, I ran the testcase under `time` utility: λ time ./a ^C ./a 18.98s user 49.15s system 99% cpu 1:08.69 total So, it ran for a minute before I interrupted it with ^C, no fails observed. For the record, here's the code I used (the smb share is mounted under /tmp/mnt, and it's a localhost share). #include <fcntl.h> #include <signal.h> #include <stdio.h> #include <string.h> #include <sys/stat.h> #include <sys/time.h> #include <sys/types.h> #include <unistd.h> static const char path[] = "/tmp/mnt/"; void handler (int signum) { } int main () { struct stat st; struct sigaction action; struct itimerval timer; memset (&action, 0, sizeof action); action.sa_handler = handler; action.sa_flags = SA_RESTART; if (sigaction (SIGALRM, &action, NULL)) { perror ("sigaction() failed"); return 1; } timer.it_value.tv_sec = timer.it_interval.tv_sec = 0; timer.it_value.tv_usec = timer.it_interval.tv_usec = 1000000 / 10000; if (setitimer (ITIMER_REAL, &timer, NULL)) { perror ("setitimer() failed"); return 1; } for (;;) { int r = stat (path, &st); if (r) { perror ("stat() failed"); return 1; } r = faccessat (AT_FDCWD, path, R_OK, 0); if (r) { perror ("faccessat() failed"); return 1; } } return 0; } I built it with `gcc dont-interrupt-me.c -o a -g3 -O0` ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2021-02-15 10:36 UTC | newest] Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-02-14 11:13 (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? Tobias Bading 2021-02-14 12:18 ` Tobias Bading 2021-02-14 13:17 ` Tobias Bading 2021-02-14 15:14 ` Godmar Back 2021-02-14 15:20 ` Godmar Back 2021-02-14 17:30 ` Tobias Bading 2021-02-14 22:04 ` Godmar Back 2021-02-15 9:15 ` Tobias Bading 2021-02-15 9:24 ` Florian Weimer 2021-02-15 9:43 ` Tobias Bading 2021-02-15 9:45 ` Florian Weimer 2021-02-15 10:02 ` Tobias Bading 2021-02-15 10:20 ` Florian Weimer 2021-02-15 10:36 ` Tobias Bading 2021-02-15 8:33 ` Tobias Bading 2021-02-15 8:45 ` Konstantin Kharlamov 2021-02-15 10:25 ` Tobias Bading 2021-02-14 17:56 ` Konstantin Kharlamov 2021-02-14 18:58 ` Tobias Bading 2021-02-14 20:46 ` Konstantin Kharlamov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).