(stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!??

public inbox for libc-help@sourceware.org
 help / color / mirror / Atom feed

* (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!??
@ 2021-02-14 11:13 Tobias Bading
  2021-02-14 12:18 ` Tobias Bading
  0 siblings, 1 reply; 20+ messages in thread
From: Tobias Bading @ 2021-02-14 11:13 UTC (permalink / raw)
  To: libc-help

Hi.

A few days ago I encountered strange problems at home while using GNU
Emacs on GNU/Linux with Windows SMB shares in the office automounted
through the company's VPN.

Details and my current findings on the Emacs devel mailing list:
https://lists.gnu.org/archive/html/emacs-devel/2021-02/msg00835.html.

TL;DR:
Emacs' stat() and faccessat() calls sometimes return -1 and set errno to
EINTR (according to strace). Most of the time the interruption seems to
come from a SIGIO, but at least once a SIGCHLD was in the mix as well.

The behavior of returning with errno == EINTR seems to be new and/or
some kind of bug, because otherwise Emacs would handle that case with
TEMP_FAILURE_RETRY() or something to that effect, right? I've used
pretty much the same Emacs in the office (on an older GNU/Linux release)
for ages with the same Windows SMB shares and never had this problem.

The GNU/Linux man pages of stat() and faccessat() don't mention EINTR at
all, and neither does POSIX
(https://pubs.opengroup.org/onlinepubs/9699919799/).

So I guess the one million dollar question is this:
Are stat() and faccessat() permitted to return -1 with errno == EINTR,
or does that violate the POSIX (or some other) spec?

I'll probably write a small test program next to check if I can
reproduce this problem outside of Emacs, maybe with a simple SIGALRM and
an endless loop of stat()/faccessat() calls. If that reproduces the
problem it would be nice to know whether using SA_RESTART in sigaction()
has any effect.

Any ideas are very welcome... :)

Tobias

PS: I'm running Ubuntu focal (20.04.2 LTS, amd64), which currently uses
libc-bin 2.31-0ubuntu9.2, kernel linux-image-5.4.0-64-generic and
openconnect 8.05-1 to access the VPN.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!??
  2021-02-14 11:13 (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? Tobias Bading
@ 2021-02-14 12:18 ` Tobias Bading
  2021-02-14 13:17   ` Tobias Bading
  2021-02-14 17:56   ` Konstantin Kharlamov
  0 siblings, 2 replies; 20+ messages in thread
From: Tobias Bading @ 2021-02-14 12:18 UTC (permalink / raw)
  To: libc-help

[-- Attachment #1: Type: text/plain, Size: 2496 bytes --]

Hello again.

I've been able to reproduce the problem with the attached program. With
SIGALRMs firing 10 times per second, I get maybe a dozen "handler
called" lines before stat() or faccessat() fails with errno EINTR. When
I increase the rate to 50 SIGALRMs per second, the very first stat()
fails in every test run.

Specifying SA_RESTART in sigaction() has no effect.

Unfortunately, I don't have any other network shares available at the
moment to test whether only CIFS through VPN is affected, or the problem
would occur with e.g. NFS or CIFS without a VPN as well.

Tobias

---

On 14.02.21 12:13, Tobias Bading wrote:
> Hi.
>
> A few days ago I encountered strange problems at home while using GNU
> Emacs on GNU/Linux with Windows SMB shares in the office automounted
> through the company's VPN.
>
> Details and my current findings on the Emacs devel mailing list:
> https://lists.gnu.org/archive/html/emacs-devel/2021-02/msg00835.html.
>
> TL;DR:
> Emacs' stat() and faccessat() calls sometimes return -1 and set errno
> to EINTR (according to strace). Most of the time the interruption
> seems to come from a SIGIO, but at least once a SIGCHLD was in the mix
> as well.
>
> The behavior of returning with errno == EINTR seems to be new and/or
> some kind of bug, because otherwise Emacs would handle that case with
> TEMP_FAILURE_RETRY() or something to that effect, right? I've used
> pretty much the same Emacs in the office (on an older GNU/Linux
> release) for ages with the same Windows SMB shares and never had this
> problem.
>
> The GNU/Linux man pages of stat() and faccessat() don't mention EINTR
> at all, and neither does POSIX
> (https://pubs.opengroup.org/onlinepubs/9699919799/).
>
> So I guess the one million dollar question is this:
> Are stat() and faccessat() permitted to return -1 with errno == EINTR,
> or does that violate the POSIX (or some other) spec?
>
> I'll probably write a small test program next to check if I can
> reproduce this problem outside of Emacs, maybe with a simple SIGALRM
> and an endless loop of stat()/faccessat() calls. If that reproduces
> the problem it would be nice to know whether using SA_RESTART in
> sigaction() has any effect.
>
> Any ideas are very welcome... :)
>
> Tobias
>
> PS: I'm running Ubuntu focal (20.04.2 LTS, amd64), which currently
> uses libc-bin 2.31-0ubuntu9.2, kernel linux-image-5.4.0-64-generic and
> openconnect 8.05-1 to access the VPN.
>


[-- Attachment #2: dont-interrupt-me.c --]
[-- Type: text/x-csrc, Size: 1104 bytes --]

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

static const char path[] = "/smb/server/share/dir/subdir";

void handler (int signum)
{
  static const char t[] = "handler called\n";
  write (1, t, sizeof t);
}

int main ()
{
  struct stat st;
  struct sigaction action;
  struct itimerval timer;

  memset (&action, 0, sizeof action);
  action.sa_handler = handler;
  action.sa_flags = SA_RESTART;
  if (sigaction (SIGALRM, &action, NULL))
  {
    perror ("sigaction() failed");
    return 1;
  }

  timer.it_value.tv_sec  = timer.it_interval.tv_sec  = 0;
  timer.it_value.tv_usec = timer.it_interval.tv_usec = 1000000 / 10;
  if (setitimer (ITIMER_REAL, &timer, NULL))
  {
    perror ("setitimer() failed");
    return 1;
  }

  for (;;)
  {
    int r = stat (path, &st);
    if (r)
    {
      perror ("stat() failed");
      return 1;
    }

    r = faccessat (AT_FDCWD, path, R_OK, 0);
    if (r)
    {
      perror ("faccessat() failed");
      return 1;
    }
  }

  return 0;
}


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!??
  2021-02-14 12:18 ` Tobias Bading
@ 2021-02-14 13:17   ` Tobias Bading
  2021-02-14 15:14     ` Godmar Back
  2021-02-14 17:56   ` Konstantin Kharlamov
  1 sibling, 1 reply; 20+ messages in thread
From: Tobias Bading @ 2021-02-14 13:17 UTC (permalink / raw)
  To: libc-help

o/

I ran the test program against different servers and shares in the
office. The share on which I encountered the problem first seems to be
affected the worst: 50 SIGALRMs per second and the first stat() has no
chance. Other shares, some on Windows servers and one on an older
GNU/Linux with Samba, seemed not to have this problem at first. But when
I removed the teminal output from the SIGALRM handler and increased the
SIGALRM rate to 10000 per second, all those other shares produced the
error as well sooner or later.

Hmm...

Tobias

On 14.02.21 13:18, Tobias Bading wrote:
> Hello again.
>
> I've been able to reproduce the problem with the attached program.
> With SIGALRMs firing 10 times per second, I get maybe a dozen "handler
> called" lines before stat() or faccessat() fails with errno EINTR.
> When I increase the rate to 50 SIGALRMs per second, the very first
> stat() fails in every test run.
>
> Specifying SA_RESTART in sigaction() has no effect.
>
> Unfortunately, I don't have any other network shares available at the
> moment to test whether only CIFS through VPN is affected, or the
> problem would occur with e.g. NFS or CIFS without a VPN as well.
>
> Tobias
>
> ---
>
> On 14.02.21 12:13, Tobias Bading wrote:
>> Hi.
>>
>> A few days ago I encountered strange problems at home while using GNU
>> Emacs on GNU/Linux with Windows SMB shares in the office automounted
>> through the company's VPN.
>>
>> Details and my current findings on the Emacs devel mailing list:
>> https://lists.gnu.org/archive/html/emacs-devel/2021-02/msg00835.html.
>>
>> TL;DR:
>> Emacs' stat() and faccessat() calls sometimes return -1 and set errno
>> to EINTR (according to strace). Most of the time the interruption
>> seems to come from a SIGIO, but at least once a SIGCHLD was in the
>> mix as well.
>>
>> The behavior of returning with errno == EINTR seems to be new and/or
>> some kind of bug, because otherwise Emacs would handle that case with
>> TEMP_FAILURE_RETRY() or something to that effect, right? I've used
>> pretty much the same Emacs in the office (on an older GNU/Linux
>> release) for ages with the same Windows SMB shares and never had this
>> problem.
>>
>> The GNU/Linux man pages of stat() and faccessat() don't mention EINTR
>> at all, and neither does POSIX
>> (https://pubs.opengroup.org/onlinepubs/9699919799/).
>>
>> So I guess the one million dollar question is this:
>> Are stat() and faccessat() permitted to return -1 with errno ==
>> EINTR, or does that violate the POSIX (or some other) spec?
>>
>> I'll probably write a small test program next to check if I can
>> reproduce this problem outside of Emacs, maybe with a simple SIGALRM
>> and an endless loop of stat()/faccessat() calls. If that reproduces
>> the problem it would be nice to know whether using SA_RESTART in
>> sigaction() has any effect.
>>
>> Any ideas are very welcome... :)
>>
>> Tobias
>>
>> PS: I'm running Ubuntu focal (20.04.2 LTS, amd64), which currently
>> uses libc-bin 2.31-0ubuntu9.2, kernel linux-image-5.4.0-64-generic
>> and openconnect 8.05-1 to access the VPN.
>>
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!??
  2021-02-14 13:17   ` Tobias Bading
@ 2021-02-14 15:14     ` Godmar Back
  2021-02-14 15:20       ` Godmar Back
  2021-02-14 17:30       ` Tobias Bading
  0 siblings, 2 replies; 20+ messages in thread
From: Godmar Back @ 2021-02-14 15:14 UTC (permalink / raw)
  To: Tobias Bading; +Cc: William Tambe via Libc-help

Tobias,

my assessment would be that this is the wrong email list - what you
describe, if correct, would be a kernel bug.

Specifically, inside the kernel, most system calls are restartable
(and should be restarted if `SA_RESTART` is given), but occasionally
you find a place where the kernel implementor decided that they can't
restart the system call, in which case `EINTR` is propagated to user
land.  I've encountered this personally in the past with
`tcsetattr()`.

Now whether it is indeed impossible for the implementor to safely
restart the call or not depends on the specific implementation details
(and path taken through the kernel).
Your choices are, in my opinion:

- to handle `EINTR` in user mode
- to file a bug (and propose a fix) against the kernel. You'd need to
look in https://github.com/torvalds/linux/tree/master/fs/cifs for
places where they set something to `-EINTR` and then fail to turn it
into `-ERESTARTSYS` and do not have good reason to do so. I suspect
this would require in-depth knowledge of the cifs file system code.

Bringing up POSIX may not be the most fruitful course of action, and
asking for compensation in the C library (as in doing an automatic
restart in the C library system call wrapper) is generally not a good
idea to my knowledge.

My 2c.

 - Godmar

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!??
  2021-02-14 15:14     ` Godmar Back
@ 2021-02-14 15:20       ` Godmar Back
  2021-02-14 17:30       ` Tobias Bading
  1 sibling, 0 replies; 20+ messages in thread
From: Godmar Back @ 2021-02-14 15:20 UTC (permalink / raw)
  To: Tobias Bading; +Cc: William Tambe via Libc-help

PS: This discussion may be helpful
https://www.spinics.net/lists/stable/msg440182.html - maybe it's your
bug?

Also, a link to when I brought up a similar issue in 2011 about tcsetattr():
http://lkml.iu.edu/hypermail/linux/kernel/1110.1/02954.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!??
  2021-02-14 15:14     ` Godmar Back
  2021-02-14 15:20       ` Godmar Back
@ 2021-02-14 17:30       ` Tobias Bading
  2021-02-14 22:04         ` Godmar Back
  2021-02-15  8:33         ` Tobias Bading
  1 sibling, 2 replies; 20+ messages in thread
From: Tobias Bading @ 2021-02-14 17:30 UTC (permalink / raw)
  To: Godmar Back; +Cc: libc-help

Hi Godmar,

thanks a lot for your feedback.

 > Bringing up POSIX may not be the most fruitful course of action, and
 > asking for compensation in the C library (as in doing an automatic
 > restart in the C library system call wrapper) is generally not a good
 > idea to my knowledge.

Sorry if my mail sounded like I was trying to point fingers or anything.
I simply chose the GNU libc mailing list because I couldn't find the
definitive answer to the seemingly innocent question "Can stat() or
faccessat() return -1 with errno == EINTR?", neither in their man pages,
some standard spec like POSIX, nor the depth of the internet.

I've been writing software for different UNIXes for a few decades now
and so far I primarily followed the(/my own?) rule "if a man page
doesn't mention EINTR, you don't need to worry about it". If that rule
is incorrect, I have quite a few lines of code to review and wouldn't
even know where to start. XD

EINTR is definitely a very special errno that needs to be handled in
userland, *if* a function is known to return with errno EINTR in (more
or less) specific cases. But there's a ton of ancient userland code that
calls stat() without handling EINTR, probably because the function never
returned EINTR before and no documentation ever claimed that it could.
If that function is able to return with EINTR now, it has the potential
to break a lot of existing code.

 > Your choices are, in my opinion:
 > - to handle `EINTR` in user mode

Yes, wrapping TEMP_FAILURE_RETRY() macros around the stat() and
faccessat() calls in Emacs' source code did work as a band-aid.

 > - to file a bug (and propose a fix) against the kernel.

If the GNU libc devs agree that neither stat() nor faccessat() should
ever return with errno EINTR and this error code is coming directly from
a syscall, then that's the way to go I guess.

 > Specifically, inside the kernel, most system calls are restartable
 > (and should be restarted if `SA_RESTART` is given), but occasionally
 > you find a place where the kernel implementor decided that they can't
 > restart the system call, in which case `EINTR` is propagated to user
 > land.  I've encountered this personally in the past with
 > `tcsetattr()`.

 > Also, a link to when I brought up a similar issue in 2011 about
tcsetattr():
 > http://lkml.iu.edu/hypermail/linux/kernel/1110.1/02954.html

A kernel implementor decides to let EINTR propagate into userland? o.O
In case of tcsetattr() or stat(), wouldn't that be in clear violation of
Linus' "WE DO NOT BREAK USERSPACE!" first rule of kernel maintenance?
(https://lkml.org/lkml/2012/12/23/75)
What happened in the tcsetattr() case? The man page of tcsetattr()
doesn't mention EINTR, did they revert the kernel change?

 > You'd need to
 > look in https://github.com/torvalds/linux/tree/master/fs/cifs for
 > places where they set something to `-EINTR` and then fail to turn it
 > into `-ERESTARTSYS` and do not have good reason to do so.

Thanks for the pointers.
Do you know whether -ERESTARTSYS is used to request an unconditional
restart of the system call, or is SA_RESTART specified (or not) in
userland also a factor? If we agree that stat() should never return with
EINTR, than SA_RESTART shouldn't be a factor. (BTW, Emacs is
intentionally not using SA_RESTART in emacs_sigaction_flags() in
src/sysdep.c.)

 > PS: This discussion may be helpful
 > https://www.spinics.net/lists/stable/msg440182.html - maybe it's your
 > bug?

Thanks!!!
I'm running a 5.4 kernel, but since that mail is barely a month old I
have to check whether my current kernel contains that patch.

Tobias

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!??
  2021-02-14 12:18 ` Tobias Bading
  2021-02-14 13:17   ` Tobias Bading
@ 2021-02-14 17:56   ` Konstantin Kharlamov
  2021-02-14 18:58     ` Tobias Bading
  1 sibling, 1 reply; 20+ messages in thread
From: Konstantin Kharlamov @ 2021-02-14 17:56 UTC (permalink / raw)
  To: Tobias Bading, libc-help

On Sun, 2021-02-14 at 13:18 +0100, Tobias Bading via Libc-help wrote:
> Hello again.
>
> I've been able to reproduce the problem with the attached program. With
> SIGALRMs firing 10 times per second, I get maybe a dozen "handler
> called" lines before stat() or faccessat() fails with errno EINTR. When
> I increase the rate to 50 SIGALRMs per second, the very first stat()
> fails in every test run.
>
> Specifying SA_RESTART in sigaction() has no effect.
>
> Unfortunately, I don't have any other network shares available at the
> moment to test whether only CIFS through VPN is affected, or the problem
> would occur with e.g. NFS or CIFS without a VPN as well.

Hello! So, I tried to reproduce this by creating a samba share, then mounting it with at `/tmp/mnt` by using a command:

    sudo mount -t cifs -o username=guest,rw,exec,auto //127.0.0.1/export_bds mnt

then I took your example, and modified the `path` variable to point to `/tmp/mnt`.

Afterwards, it's been running for a minute I think, and I only ever seen `handler called` lines.

So given other emails mentioned you may have stumbled upon a kernel bug, it apparently was fixed later. My kernel is 5.10.15, on Archlinux.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!??
  2021-02-14 17:56   ` Konstantin Kharlamov
@ 2021-02-14 18:58     ` Tobias Bading
  2021-02-14 20:46       ` Konstantin Kharlamov
  0 siblings, 1 reply; 20+ messages in thread
From: Tobias Bading @ 2021-02-14 18:58 UTC (permalink / raw)
  To: Konstantin Kharlamov, libc-help

Hello Konstantin,

thanks for trying to reproduce the problem. Could you please try

void handler (int signum)
{
}

and

timer.it_value.tv_usec = timer.it_interval.tv_usec = 1000000 / 10000;

to increase the rate of SIGALRMs to 10000 (or even higher) per second?
For me this reproduces the error also on shares that looked like they
weren't affected when only 50 signals per second were created. There
seems to be only a very short time window in which some piece of
(probably CIFS-related) kernel code can be thrown of the rails by a
signal, and the length of that time window probably depends on the
performance of client and server, the network connection and whatnot.
Since you're using a local share, your time window to produce the error
is probably much smaller than mine.

Tobias

---

On 14.02.21 18:56, Konstantin Kharlamov wrote:
> On Sun, 2021-02-14 at 13:18 +0100, Tobias Bading via Libc-help wrote:
>> Hello again.
>>
>> I've been able to reproduce the problem with the attached program. With
>> SIGALRMs firing 10 times per second, I get maybe a dozen "handler
>> called" lines before stat() or faccessat() fails with errno EINTR. When
>> I increase the rate to 50 SIGALRMs per second, the very first stat()
>> fails in every test run.
>>
>> Specifying SA_RESTART in sigaction() has no effect.
>>
>> Unfortunately, I don't have any other network shares available at the
>> moment to test whether only CIFS through VPN is affected, or the problem
>> would occur with e.g. NFS or CIFS without a VPN as well.
> Hello! So, I tried to reproduce this by creating a samba share, then mounting it with at `/tmp/mnt` by using a command:
>
>      sudo mount -t cifs -o username=guest,rw,exec,auto //127.0.0.1/export_bds mnt
>
> then I took your example, and modified the `path` variable to point to `/tmp/mnt`.
>
> Afterwards, it's been running for a minute I think, and I only ever seen `handler called` lines.
>
> So given other emails mentioned you may have stumbled upon a kernel bug, it apparently was fixed later. My kernel is 5.10.15, on Archlinux.
>
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!??
  2021-02-14 18:58     ` Tobias Bading
@ 2021-02-14 20:46       ` Konstantin Kharlamov
  0 siblings, 0 replies; 20+ messages in thread
From: Konstantin Kharlamov @ 2021-02-14 20:46 UTC (permalink / raw)
  To: Tobias Bading, libc-help

On Sun, 2021-02-14 at 19:58 +0100, Tobias Bading wrote:
> Hello Konstantin,
>
> thanks for trying to reproduce the problem. Could you please try
>
> void handler (int signum)
> {
> }
>
> and
>
> timer.it_value.tv_usec = timer.it_interval.tv_usec = 1000000 / 10000;
>
> to increase the rate of SIGALRMs to 10000 (or even higher) per second?
> For me this reproduces the error also on shares that looked like they
> weren't affected when only 50 signals per second were created. There
> seems to be only a very short time window in which some piece of
> (probably CIFS-related) kernel code can be thrown of the rails by a
> signal, and the length of that time window probably depends on the
> performance of client and server, the network connection and whatnot.
> Since you're using a local share, your time window to produce the error
> is probably much smaller than mine.

Thanks, did that, I ran the testcase under `time` utility:

     λ time ./a
    ^C
    ./a  18.98s user 49.15s system 99% cpu 1:08.69 total

So, it ran for a minute before I interrupted it with ^C, no fails observed.

For the record, here's the code I used (the smb share is mounted under /tmp/mnt, and
it's a localhost share).

    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <unistd.h>

    static const char path[] = "/tmp/mnt/";

    void handler (int signum)
    {
    }

    int main ()
    {
      struct stat st;
      struct sigaction action;
      struct itimerval timer;

      memset (&action, 0, sizeof action);
      action.sa_handler = handler;
      action.sa_flags = SA_RESTART;
      if (sigaction (SIGALRM, &action, NULL))
      {
        perror ("sigaction() failed");
        return 1;
      }

      timer.it_value.tv_sec  = timer.it_interval.tv_sec  = 0;
      timer.it_value.tv_usec = timer.it_interval.tv_usec = 1000000 / 10000;
      if (setitimer (ITIMER_REAL, &timer, NULL))
      {
        perror ("setitimer() failed");
        return 1;
      }

      for (;;)
      {
        int r = stat (path, &st);
        if (r)
        {
          perror ("stat() failed");
          return 1;
        }

        r = faccessat (AT_FDCWD, path, R_OK, 0);
        if (r)
        {
          perror ("faccessat() failed");
          return 1;
        }
      }

      return 0;
    }

I built it with `gcc dont-interrupt-me.c -o a -g3 -O0`


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!??
  2021-02-14 17:30       ` Tobias Bading
@ 2021-02-14 22:04         ` Godmar Back
  2021-02-15  9:15           ` Tobias Bading
  2021-02-15  8:33         ` Tobias Bading
  1 sibling, 1 reply; 20+ messages in thread
From: Godmar Back @ 2021-02-14 22:04 UTC (permalink / raw)
  To: Tobias Bading; +Cc: William Tambe via Libc-help

I don't know if this qualifies as "breaking userland" or "this has
never worked."

You could use systemtap to patch the kernel to see if you can trap the
path where they're returning `EINTR`.
I recently came across a blog post:
https://engineering.skroutz.gr/blog/uncovering-a-24-year-old-bug-in-the-linux-kernel/
where some people did that and found a bug in the TCP code. Very
impressive.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!??
  2021-02-14 17:30       ` Tobias Bading
  2021-02-14 22:04         ` Godmar Back
@ 2021-02-15  8:33         ` Tobias Bading
  2021-02-15  8:45           ` Konstantin Kharlamov
  1 sibling, 1 reply; 20+ messages in thread
From: Tobias Bading @ 2021-02-15  8:33 UTC (permalink / raw)
  To: libc-help

[-- Attachment #1: Type: text/plain, Size: 1710 bytes --]

Update:
I've pimped the test program a bit and ran it against a local CIFS share
like Konstantin did.

Output of some of my test runs:

$ ./its-raining-signals /mnt/shared 100
100 signals during 747470 loop iterations
100 signals during 752470 loop iterations
100 signals during 733729 loop iterations
100 signals during 754084 loop iterations
100 signals during 746603 loop iterations
stat() failed: Interrupted system call

$ ./its-raining-signals /mnt/shared 1000
1000 signals during 746063 loop iterations
stat() failed: Interrupted system call

$ ./its-raining-signals /mnt/shared 10000
stat() failed: Interrupted system call

$ ./its-raining-signals /mnt/shared 1000 r
1000 signals during 724637 loop iterations
1000 signals during 749337 loop iterations
1000 signals during 747301 loop iterations
[... keeps on going, no errors ...]

$ ./its-raining-signals /mnt/shared 10000 r
[no output, hangs]

These are the two CIFS-related kernel patches I've found so far:
- cifs: handle -EINTR in cifs_setattr
https://github.com/torvalds/linux/commit/c6cc4c5a72505a0ecefc9b413f16bec512f38078
   (earliest linux-stable tag including the patch: v5.10-rc1)
- cifs: do not fail __smb_send_rqst if non-fatal signals are pending
   (already mentioned by Godmar)
https://github.com/torvalds/linux/commit/214a5ea081e77346e4963dd6d20c5539ff8b6ae6
   (earliest linux-stable tag including the patch: v5.11-rc5)

If I checked the Ubuntu focal git repo correctly, both commits aren't in
there, so apparently no backports yet in Ubuntu.

I'll check how much effort it is to install a newer mainline kernel
(including the kernel modules like cifs.ko) in Ubuntu focal.

Tobias


[-- Attachment #2: its-raining-signals.c --]
[-- Type: text/x-csrc, Size: 1906 bytes --]

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

static long handler_calls, loop_iterations;

#define TEST_STAT stat (path, &st)
#define TEST_FACCESSAT faccessat (AT_FDCWD, path, R_OK, 0)

void handler (int signum)
{
  ++handler_calls;
}

int main (int argc, char *argv[])
{
  const char *path;
  long signals_per_second;
  int retry = 0;
  struct stat st;
  struct sigaction action;
  struct itimerval timer;

  if (argc < 3)
  {
    printf ("usage: %s PATH-TO-SHARE SIGNALS-PER-SECOND [r]\n"
            "       SIGNALS-PER-SECOND between 2 and 1 million\n"
            "       r repeats a call when errno is EINTR\n", argv[0]);
    return 1;
  }

  path = argv[1];
  signals_per_second = strtol (argv[2], NULL, 0);
  if (argc >= 4 && !strcmp (argv[3], "r"))
    retry = 1;

  memset (&action, 0, sizeof action);
  action.sa_handler = handler;
  action.sa_flags = SA_RESTART;
  if (sigaction (SIGALRM, &action, NULL))
  {
    perror ("sigaction() failed");
    return 1;
  }

  timer.it_value.tv_sec  = timer.it_interval.tv_sec  = 0;
  timer.it_value.tv_usec = timer.it_interval.tv_usec = 1000000 / signals_per_second;
  if (setitimer (ITIMER_REAL, &timer, NULL))
  {
    perror ("setitimer() failed");
    return 1;
  }

  for (;;)
  {
    if (retry ? TEMP_FAILURE_RETRY (TEST_STAT) : TEST_STAT)
    {
      perror ("stat() failed");
      return 1;
    }

    if (retry ? TEMP_FAILURE_RETRY (TEST_FACCESSAT) : TEST_FACCESSAT)
    {
      perror ("faccessat() failed");
      return 1;
    }

    ++loop_iterations;

    if (handler_calls >= signals_per_second)
    {
      printf ("%ld signals during %ld loop iterations\n", handler_calls, loop_iterations);
      handler_calls = loop_iterations = 0;
    }
  }

  return 0;
}

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!??
  2021-02-15  8:33         ` Tobias Bading
@ 2021-02-15  8:45           ` Konstantin Kharlamov
  2021-02-15 10:25             ` Tobias Bading
  0 siblings, 1 reply; 20+ messages in thread
From: Konstantin Kharlamov @ 2021-02-15  8:45 UTC (permalink / raw)
  To: Tobias Bading, libc-help

On Mon, 2021-02-15 at 09:33 +0100, Tobias Bading wrote:
> I'll check how much effort it is to install a newer mainline kernel
> (including the kernel modules like cifs.ko) in Ubuntu focal.
> 
> Tobias
> 

Oh, it should be pretty easy, Canonical provides kernel packages for testing for all kernel versions here https://kernel.ubuntu.com/~kernel-ppa/mainline/ 

Idk though if `cifs.ko` is inside one of them, but I'd assume yes.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!??
  2021-02-14 22:04         ` Godmar Back
@ 2021-02-15  9:15           ` Tobias Bading
  2021-02-15  9:24             ` Florian Weimer
  0 siblings, 1 reply; 20+ messages in thread
From: Tobias Bading @ 2021-02-15  9:15 UTC (permalink / raw)
  To: Godmar Back; +Cc: William Tambe via Libc-help

 > You could use systemtap to patch the kernel to see if you can trap the
 > path where they're returning `EINTR`.
 > I recently came across a blog post:
 >
https://engineering.skroutz.gr/blog/uncovering-a-24-year-old-bug-in-the-linux-kernel/
 > where some people did that and found a bug in the TCP code. Very
 > impressive.

Thanks, very interesting read indeed. I haven't heard of systemtap
before, sounds like a very cool tool for kernel hackers. Since I'm only
doing userland stuff, I usually trust Valgrind to help me out when
things get fishy.

 > I don't know if this qualifies as "breaking userland" [...]

I'd say breaking stat() qualifies big time, but to be honest I don't
really have the time/nerve to read
https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh_chap02.html
in full. The "Signal Effects on Other Functions" paragraph seems
interesting, but my head explodes when I read sentences like

   "Therefore, implementations should generally make such functions
(including ones defined as extensions) interruptible."

   "Functions not mentioned explicitly as interruptible may be so on
some implementations, possibly as an extension where the function gives
an [EINTR] error."

English isn't my first language, so I must be reading this wrong or
interpreting it in the wrong context. If a POSIX implementation was
allowed to add EINTR to the list of possible error codes returned by a
function defined in the standard, what kind of standard would that be?
How about an implementation of malloc() which returns NULL and sets
errno to EINTR when a signal is raised while the calling thread was
asleep because it was waiting for a disk read from swapspace? XD

Tobias

---

On 14.02.21 23:04, Godmar Back wrote:
> I don't know if this qualifies as "breaking userland" or "this has
> never worked."
>
> You could use systemtap to patch the kernel to see if you can trap the
> path where they're returning `EINTR`.
> I recently came across a blog post:
> https://engineering.skroutz.gr/blog/uncovering-a-24-year-old-bug-in-the-linux-kernel/
> where some people did that and found a bug in the TCP code. Very
> impressive.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!??
  2021-02-15  9:15           ` Tobias Bading
@ 2021-02-15  9:24             ` Florian Weimer
  2021-02-15  9:43               ` Tobias Bading
  0 siblings, 1 reply; 20+ messages in thread
From: Florian Weimer @ 2021-02-15  9:24 UTC (permalink / raw)
  To: Tobias Bading via Libc-help; +Cc: Godmar Back, Tobias Bading

* Tobias Bading via Libc-help:

> English isn't my first language, so I must be reading this wrong or
> interpreting it in the wrong context. If a POSIX implementation was
> allowed to add EINTR to the list of possible error codes returned by a
> function defined in the standard, what kind of standard would that be?
> How about an implementation of malloc() which returns NULL and sets
> errno to EINTR when a signal is raised while the calling thread was
> asleep because it was waiting for a disk read from swapspace? XD

Returning EINTR in stat would allow relatively straightforward
implementation of a timeout, in case the path resides on a network file
system and the server is unreachable.  So it's not a completely
unreasonable thing to do.  On the other hand, the cost in lost backwards
compatibility with applications that do not know about this behavior
appears to be pretty high, as this thread shows.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!??
  2021-02-15  9:24             ` Florian Weimer
@ 2021-02-15  9:43               ` Tobias Bading
  2021-02-15  9:45                 ` Florian Weimer
  0 siblings, 1 reply; 20+ messages in thread
From: Tobias Bading @ 2021-02-15  9:43 UTC (permalink / raw)
  To: Florian Weimer; +Cc: libc-help, Godmar Back

 > Returning EINTR in stat would allow relatively straightforward
 > implementation of a timeout, in case the path resides on a network file
 > system and the server is unreachable.  So it's not a completely
 > unreasonable thing to do.

Good point.

 > On the other hand, the cost in lost backwards
 > compatibility with applications that do not know about this behavior
 > appears to be pretty high, as this thread shows.

What's your interpretation of the POSIX standard? Does it permit such a
backwards compatibility breaking change?

Tobias


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!??
  2021-02-15  9:43               ` Tobias Bading
@ 2021-02-15  9:45                 ` Florian Weimer
  2021-02-15 10:02                   ` Tobias Bading
  0 siblings, 1 reply; 20+ messages in thread
From: Florian Weimer @ 2021-02-15  9:45 UTC (permalink / raw)
  To: Tobias Bading; +Cc: libc-help, Godmar Back

* Tobias Bading:

>> Returning EINTR in stat would allow relatively straightforward
>> implementation of a timeout, in case the path resides on a network file
>> system and the server is unreachable.  So it's not a completely
>> unreasonable thing to do.
>
> Good point.
>
>> On the other hand, the cost in lost backwards
>> compatibility with applications that do not know about this behavior
>> appears to be pretty high, as this thread shows.
>
> What's your interpretation of the POSIX standard? Does it permit such a
> backwards compatibility breaking change?

Yes, this is not a POSIX conformance issue.  POSIX also does not make
any requirements regarding backwards compatibility or bug-for-bug
compatibility.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!??
  2021-02-15  9:45                 ` Florian Weimer
@ 2021-02-15 10:02                   ` Tobias Bading
  2021-02-15 10:20                     ` Florian Weimer
  0 siblings, 1 reply; 20+ messages in thread
From: Tobias Bading @ 2021-02-15 10:02 UTC (permalink / raw)
  To: Florian Weimer; +Cc: libc-help, Godmar Back

 > Yes, this is not a POSIX conformance issue.

o.O

I don't get it. How is a developer supposed to decide in which cases
EINTR handling is required? Check the man page on every platform the
code is supposed to work on and hope that at least one platform mentions
EINTR if the function is indeed able to fail that way?

Tobias

---

On 15.02.21 10:45, Florian Weimer wrote:
> * Tobias Bading:
>
>>> Returning EINTR in stat would allow relatively straightforward
>>> implementation of a timeout, in case the path resides on a network file
>>> system and the server is unreachable.  So it's not a completely
>>> unreasonable thing to do.
>> Good point.
>>
>>> On the other hand, the cost in lost backwards
>>> compatibility with applications that do not know about this behavior
>>> appears to be pretty high, as this thread shows.
>> What's your interpretation of the POSIX standard? Does it permit such a
>> backwards compatibility breaking change?
> Yes, this is not a POSIX conformance issue.  POSIX also does not make
> any requirements regarding backwards compatibility or bug-for-bug
> compatibility.
>
> Thanks,
> Florian


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!??
  2021-02-15 10:02                   ` Tobias Bading
@ 2021-02-15 10:20                     ` Florian Weimer
  2021-02-15 10:36                       ` Tobias Bading
  0 siblings, 1 reply; 20+ messages in thread
From: Florian Weimer @ 2021-02-15 10:20 UTC (permalink / raw)
  To: Tobias Bading; +Cc: libc-help, Godmar Back

* Tobias Bading:

>> Yes, this is not a POSIX conformance issue.
>
> o.O
>
> I don't get it. How is a developer supposed to decide in which cases
> EINTR handling is required? Check the man page on every platform the
> code is supposed to work on and hope that at least one platform mentions
> EINTR if the function is indeed able to fail that way?

The issue here is that one Linux file system (CIFS) behaves differently
from most other file systems.  That is not something that can be
addressed with documentation at the level of the manual pages.
Practically speaking, I would say this is a file system bug.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!??
  2021-02-15  8:45           ` Konstantin Kharlamov
@ 2021-02-15 10:25             ` Tobias Bading
  0 siblings, 0 replies; 20+ messages in thread
From: Tobias Bading @ 2021-02-15 10:25 UTC (permalink / raw)
  To: Konstantin Kharlamov, libc-help

Just installed the two packages

linux-image-unsigned-5.4.98-050498-generic_5.4.98-050498.202102131336_amd64.deb
linux-modules-5.4.98-050498-generic_5.4.98-050498.202102131336_amd64.deb

from https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.4.98/ and my test
program can't reproduce the bug any longer. :)

Tobias

---

On 15.02.21 09:45, Konstantin Kharlamov wrote:
> On Mon, 2021-02-15 at 09:33 +0100, Tobias Bading wrote:
>> I'll check how much effort it is to install a newer mainline kernel
>> (including the kernel modules like cifs.ko) in Ubuntu focal.
>>
>> Tobias
>>
> Oh, it should be pretty easy, Canonical provides kernel packages for testing for all kernel versions here https://kernel.ubuntu.com/~kernel-ppa/mainline/
>
> Idk though if `cifs.ko` is inside one of them, but I'd assume yes.
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!??
  2021-02-15 10:20                     ` Florian Weimer
@ 2021-02-15 10:36                       ` Tobias Bading
  0 siblings, 0 replies; 20+ messages in thread
From: Tobias Bading @ 2021-02-15 10:36 UTC (permalink / raw)
  To: Florian Weimer; +Cc: libc-help, Godmar Back

On 15.02.21 11:20, Florian Weimer wrote:
 > * Tobias Bading:
 >>> Yes, this is not a POSIX conformance issue.
 >>
 >> o.O
 >>
 >> I don't get it. How is a developer supposed to decide in which cases
 >> EINTR handling is required? Check the man page on every platform the
 >> code is supposed to work on and hope that at least one platform mentions
 >> EINTR if the function is indeed able to fail that way?
 >
 > The issue here is that one Linux file system (CIFS) behaves differently
 > from most other file systems.  That is not something that can be
 > addressed with documentation at the level of the manual pages.
 > Practically speaking, I would say this is a file system bug.

Yes, looks like a file system bug in this case.

But I'm wondering how this is supposed to work in general:
I'm writing some piece of code and are about to call a POSIX function.
How the heck am I supposed to figure out whether EINTR handling is
required if I can't rely on the man page and the POSIX spec?

Tobias


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2021-02-15 10:36 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-14 11:13 (stat(...) == -1 || faccessat(...) == -1) && errno == EINTR ?!?? Tobias Bading
2021-02-14 12:18 ` Tobias Bading
2021-02-14 13:17   ` Tobias Bading
2021-02-14 15:14     ` Godmar Back
2021-02-14 15:20       ` Godmar Back
2021-02-14 17:30       ` Tobias Bading
2021-02-14 22:04         ` Godmar Back
2021-02-15  9:15           ` Tobias Bading
2021-02-15  9:24             ` Florian Weimer
2021-02-15  9:43               ` Tobias Bading
2021-02-15  9:45                 ` Florian Weimer
2021-02-15 10:02                   ` Tobias Bading
2021-02-15 10:20                     ` Florian Weimer
2021-02-15 10:36                       ` Tobias Bading
2021-02-15  8:33         ` Tobias Bading
2021-02-15  8:45           ` Konstantin Kharlamov
2021-02-15 10:25             ` Tobias Bading
2021-02-14 17:56   ` Konstantin Kharlamov
2021-02-14 18:58     ` Tobias Bading
2021-02-14 20:46       ` Konstantin Kharlamov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).