public inbox for libc-help@sourceware.org
* getaddrinfo() fails to use latest DNS address - v2.27
@ 2020-01-07  6:01 Tarun Tej K
  2020-01-07  8:52 ` Florian Weimer
  0 siblings, 1 reply; 7+ messages in thread
From: Tarun Tej K @ 2020-01-07  6:01 UTC (permalink / raw)
  To: libc-help

Hi,

Environment:
glibc version - v2.27
platform - NXP's iMX6
cross-compiler - arm-poky-linux-gnueabi-gcc
Built using the Yocto recipes

As a part of long-term testing of our system, I have a setup that
automatically switches the network between different interfaces
(Ethernet, WLAN and PPP). During this automation, the DNS addresses in
/etc/resolv.conf keep changing because the active network interface
keeps changing.

Issue Description:
The issue might be related to
https://sourceware.org/bugzilla/show_bug.cgi?id=984
It is observed that once in a while, after about 5 hours or so,
getaddrinfo() fails to resolve addresses and keeps returning
EAI_AGAIN ('Temporary failure in name resolution').
'strace' output of the failing process shows that getaddrinfo() does
neither stat64 nor openat() on /etc/resolv.conf (to check for the
latest DNS change) at all while the process is in this state, which
may be why it is not updating the global config (resolv_conf_global)
with the correct DNS values.

I have not yet found steps to reproduce this issue easily.
I have tried a simple application which just calls getaddrinfo() based
on user input, and that application always does 'stat64' on
/etc/resolv.conf, followed by openat() when the time, size or inode of
/etc/resolv.conf has changed.
But I am not sure what is causing my actual application to get into a
state where it no longer even does 'stat64' on /etc/resolv.conf after
running for some time.
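
For reference, a test program of the kind described above can be
sketched as follows. This is only a minimal sketch and not the
application from the report; the names resolve_once and resolve_lines
are hypothetical. It resolves one host name per input line, so that the
file accesses made by each lookup can be watched from outside.

```c
#include <assert.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Resolve HOST once and copy the first address, in numeric form, into
   OUT.  Returns 0 on success, otherwise the getaddrinfo()/getnameinfo()
   error code (for example EAI_AGAIN).  */
int
resolve_once (const char *host, char *out, size_t outlen)
{
  struct addrinfo hints;
  struct addrinfo *res = NULL;

  assert (host != NULL && out != NULL);
  memset (&hints, 0, sizeof hints);
  hints.ai_family = AF_UNSPEC;
  hints.ai_socktype = SOCK_STREAM;

  int rc = getaddrinfo (host, NULL, &hints, &res);
  if (rc != 0)
    return rc;

  rc = getnameinfo (res->ai_addr, res->ai_addrlen, out, outlen,
                    NULL, 0, NI_NUMERICHOST);
  freeaddrinfo (res);
  return rc;
}

/* Read one host name per line from IN and report each lookup result.  */
void
resolve_lines (FILE *in)
{
  char line[256];
  char addr[64];

  while (fgets (line, sizeof line, in) != NULL)
    {
      line[strcspn (line, "\r\n")] = '\0';
      if (line[0] == '\0')
        continue;
      int rc = resolve_once (line, addr, sizeof addr);
      if (rc == 0)
        printf ("%s -> %s\n", line, addr);
      else
        printf ("%s: %s\n", line, gai_strerror (rc));
    }
}
```

Calling resolve_lines (stdin) from main and running the program under
'strace -e trace=file' should show whether each lookup is preceded by a
stat call on /etc/resolv.conf (stat64 on 32-bit ARM; the exact system
call name differs between architectures).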

I have gone through glibc code and have a query regarding below part
from the function maybe_init() in file resolv/resolv_context.c

if (ctx->conf != NULL && replicated_configuration_matches (ctx))
        {
          struct resolv_conf *current = __resolv_conf_get_current ();
          if (current == NULL)
            return false;
          /* Check if the configuration changed.  */
          if (current != ctx->conf)
            {
              /* This call will detach the extended resolver state.  */
              if (resp->nscount > 0)
                __res_iclose (resp, true);
              /* Reattach the current configuration.  */
              if (__resolv_conf_attach (ctx->resp, current))
                {
                  __resolv_conf_put (ctx->conf);
                  /* ctx takes ownership, so we do not release current.  */
                  ctx->conf = current;
                }
            }
          else
            /* No change.  Drop the reference count for current.  */
            __resolv_conf_put (current);
        }
      return true;

Here the return value will be 'true' even when the condition
(ctx->conf != NULL && replicated_configuration_matches (ctx)) fails. I
think this is one case where __resolv_conf_get_current() or
__resolv_conf_load() would not be called, and so 'stat64' or openat()
would not be done on /etc/resolv.conf. Why does maybe_init return
'true' when the condition (ctx->conf != NULL &&
replicated_configuration_matches (ctx)) fails?

Note:
One thing about /etc/resolv.conf, in case it helps: depending on the
type of the active network interface, /etc/resolv.conf is sometimes a
regular file and sometimes a symlink to /var/run/resolv.conf. Could
/etc/resolv.conf being a symlink cause a problem like this?

Thanks
Tarun


* Re: getaddrinfo() fails to use latest DNS address - v2.27
  2020-01-07  6:01 getaddrinfo() fails to use latest DNS address - v2.27 Tarun Tej K
@ 2020-01-07  8:52 ` Florian Weimer
  2020-01-07  9:45   ` Tarun Tej K
  0 siblings, 1 reply; 7+ messages in thread
From: Florian Weimer @ 2020-01-07  8:52 UTC (permalink / raw)
  To: Tarun Tej K; +Cc: libc-help

* Tarun Tej K.:

> I have gone through glibc code and have a query regarding below part
> from the function maybe_init() in file resolv/resolv_context.c
>
> if (ctx->conf != NULL && replicated_configuration_matches (ctx))
>         {
>           struct resolv_conf *current = __resolv_conf_get_current ();
>           if (current == NULL)
>             return false;
>           /* Check if the configuration changed.  */
>           if (current != ctx->conf)
>             {
>               /* This call will detach the extended resolver state.  */
>               if (resp->nscount > 0)
>                 __res_iclose (resp, true);
>               /* Reattach the current configuration.  */
>               if (__resolv_conf_attach (ctx->resp, current))
>                 {
>                   __resolv_conf_put (ctx->conf);
>                   /* ctx takes ownership, so we do not release current.  */
>                   ctx->conf = current;
>                 }
>             }
>           else
>             /* No change.  Drop the reference count for current.  */
>             __resolv_conf_put (current);
>         }
>       return true;
>
> Here the return value will be 'true' even when the condition
> (ctx->conf != NULL && replicated_configuration_matches (ctx)) fails. I
> think this is one case where __resolv_conf_get_current() or
> __resolv_conf_load() would not be called, and so 'stat64' or openat()
> would not be done on /etc/resolv.conf. Why does maybe_init return
> 'true' when the condition (ctx->conf != NULL &&
> replicated_configuration_matches (ctx)) fails?

The expectation here is that __resolv_conf_load performed the stat64
call and reloaded the configuration if necessary.  maybe_init returns
false only in case of resource exhaustion (e.g., memory allocation
failure).

Does this answer your question?  I'm afraid the bug must be elsewhere.

We use stat64 in __resolv_conf_get_current, so it should not matter
whether /etc/resolv.conf is a symbolic link or not.  I believe the
function deals correctly with missing files or symbolic links that
cannot be resolved.  It produces an invalid, all-zero stat result and
caches that.  If the file becomes available again, the cached contents
no longer matches, and a reload is triggered.

How do you replace /etc/resolv.conf?  If the file is overwritten
in-place and the file system does not have nanosecond resolution for its
timestamps, then glibc might read an empty (or partial) /etc/resolv.conf
file, but might not realize that the file changed again later because
the subsequent writes do not modify the file timestamps (due to the
timestamp resolution).  The fix is to write the new version of
/etc/resolv.conf to a temporary file (in the same directory) and rename
it into place, atomically.
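
The write-then-rename approach can be sketched as follows. This is a
minimal sketch under stated assumptions, not code taken from ConnMan or
pppd; replace_file_atomically is a hypothetical name, and the 0644 mode
is an assumption about what readers of the file need.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Replace PATH with CONTENTS by writing a temporary file in the same
   directory and renaming it over PATH.  Returns 0 on success, -1 on
   error.  Readers see either the old or the new file, never a partial
   one.  */
int
replace_file_atomically (const char *path, const char *contents)
{
  assert (path != NULL && contents != NULL);

  /* Same directory as PATH, so rename() stays on one file system.  */
  char tmp[4096];
  int n = snprintf (tmp, sizeof tmp, "%s.XXXXXX", path);
  if (n < 0 || (size_t) n >= sizeof tmp)
    return -1;

  int fd = mkstemp (tmp);
  if (fd < 0)
    return -1;

  size_t left = strlen (contents);
  while (left > 0)
    {
      ssize_t w = write (fd, contents, left);
      if (w < 0 && errno == EINTR)
        continue;
      if (w <= 0)
        break;
      contents += w;
      left -= (size_t) w;
    }

  /* mkstemp() creates the file with mode 0600; widen it (assumed 0644)
     and flush the data before the file becomes visible under PATH.  */
  bool ok = left == 0 && fchmod (fd, 0644) == 0 && fsync (fd) == 0;
  if (close (fd) != 0)
    ok = false;
  if (ok && rename (tmp, path) == 0)
    return 0;

  unlink (tmp);
  return -1;
}
```

Because the new file has a different inode, a stat-based change check
notices the replacement even when the timestamps are coarse. Note that
if PATH is a symbolic link, rename() replaces the link itself rather
than the file it points to.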

Thanks,
Florian


* Re: getaddrinfo() fails to use latest DNS address - v2.27
  2020-01-07  8:52 ` Florian Weimer
@ 2020-01-07  9:45   ` Tarun Tej K
  2020-01-07 12:27     ` Florian Weimer
  0 siblings, 1 reply; 7+ messages in thread
From: Tarun Tej K @ 2020-01-07  9:45 UTC (permalink / raw)
  To: Florian Weimer; +Cc: libc-help

Thanks for your reply, Florian

> How do you replace /etc/resolv.conf?  If the file is overwritten
> in-place and the file system does not have nanosecond resolution for its
> timestamps, then glibc might read an empty (or partial) /etc/resolv.conf
> file, but might not realize that the file changed again later because
> the subsequent writes do not modify the file timestamps (due to the
> timestamp resolution).  The fix is to write the new version of
> /etc/resolv.conf to a temporary file (in the same directory) and rename
> it into place, atomically.
So, there are two daemons which are writing into /etc/resolv.conf.
1- ConnMan - for Ethernet and Wifi interfaces - creates symbolic link
to /var/run/resolv.conf
2- pppd - for PPP interfaces  - creates regular file (only when
internet is not reachable via ethernet and wifi)
(I know it is not recommended to have two different network daemons in
same system - work is in progress to merge pppd also into the ConnMan)

The filesystem is UBIFS on raw NAND. I think you're right about it
not having nanosecond timestamp resolution.
As per the strace output, when the problem occurs not even the stat64
call takes place. (Does that mean __resolv_conf_get_current is not
being called?)
In any case, shouldn't stat64() happen every time getaddrinfo() is
called?
Can you please suggest a way to investigate why not even stat64 was
happening and why getaddrinfo keeps returning EAI_AGAIN until the
process is restarted?
Also, could you suggest any tool other than strace to see the call
trace of the libc runtime?

Thanks
Tarun


* Re: getaddrinfo() fails to use latest DNS address - v2.27
  2020-01-07  9:45   ` Tarun Tej K
@ 2020-01-07 12:27     ` Florian Weimer
  2020-01-08  7:25       ` Tarun Tej K
  0 siblings, 1 reply; 7+ messages in thread
From: Florian Weimer @ 2020-01-07 12:27 UTC (permalink / raw)
  To: Tarun Tej K; +Cc: libc-help

* Tarun Tej K.:

> As per the strace output, when the problem occurs not even the stat64
> call takes place. (Does that mean __resolv_conf_get_current is not
> being called?)

Yes.

> In any case, shouldn't stat64() happen every time getaddrinfo() is
> called?

Indeed.

> Can you please suggest a way to investigate why not even stat64 was
> happening and why getaddrinfo keeps returning EAI_AGAIN until the
> process is restarted?

It could mean that reloading is disabled.  There are a few possible
causes:

* The no-reload option is specified in /etc/resolv.conf.

* The application sets RES_NORELOAD in _res.options directly.

* The application updates other data in _res, so that it no longer
  matches the previously read configuration.

The last part could also happen due to completely different bugs,
e.g. after heap corruption.
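
The second cause in the list above can be checked from inside the
application with a small helper such as the one below. This is only a
sketch; reload_disabled and reload_disabled_for_this_thread are
hypothetical names, and the code assumes glibc 2.26 or later, where
<resolv.h> defines RES_NORELOAD.

```c
#include <assert.h>
#include <netinet/in.h>
#include <arpa/nameser.h>
#include <resolv.h>
#include <stdbool.h>
#include <string.h>

/* Return true if automatic reloading of /etc/resolv.conf is disabled
   in the resolver state ST.  */
bool
reload_disabled (const struct __res_state *st)
{
  assert (st != NULL);
  return (st->options & RES_NORELOAD) != 0;
}

/* Same check for the calling thread's resolver state.  In glibc, _res
   refers to a per-thread object, so this has to run in the thread
   that performs the lookups.  */
bool
reload_disabled_for_this_thread (void)
{
  return reload_disabled (&_res);
}
```

Logging the result of reload_disabled_for_this_thread () right after a
failing getaddrinfo() call, in the same thread, would show whether the
flag got set at some point during the run.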

Regarding bugs, I'm not aware of any.  In general, reloading of the
configuration is inhibited for a thread if it has an attached context.
Maybe there is a bug in this area.  Do you have a coredump?  It should
be visible in the current variable in resolv/resolv_context.c.

> Also, could you suggest any tool other than strace to see the call
> trace of the libc runtime?

Systemtap might be an option.

Thanks,
Florian


* Re: getaddrinfo() fails to use latest DNS address - v2.27
  2020-01-07 12:27     ` Florian Weimer
@ 2020-01-08  7:25       ` Tarun Tej K
  2020-01-08  7:41         ` Florian Weimer
  0 siblings, 1 reply; 7+ messages in thread
From: Tarun Tej K @ 2020-01-08  7:25 UTC (permalink / raw)
  To: Florian Weimer; +Cc: libc-help

> It could mean that reloading is disabled.  There are a few possible
> causes:
>
> * The no-reload option is specified in /etc/resolv.conf.
>
> * The application sets RES_NORELOAD in _res.options directly.
>
> * The application updates other data in _res, so that it no longer
>   matches the previously read configuration.
>
> The last part could also happen due to completely different bugs,
> e.g. after heap corruption.
>
> Regarding bugs, I'm not aware of any.  In general, reloading of the
> configuration is inhibited for a thread if it has an attached context.
> Maybe there is a bug in this area.  Do you have a coredump?  It should
> be visible in the current variable in resolv/resolv_context.c.
>
> > Also, could you suggest any tool other than strace to see the call
> > trace of the libc runtime?
>
> Systemtap might be an option.
>

Thanks. I'll try those.
Is there any function apart from res_init() and res_ninit() that
clears the global configuration and forces it to be loaded afresh on
the next getaddrinfo() call?

Thanks
Tarun


* Re: getaddrinfo() fails to use latest DNS address - v2.27
  2020-01-08  7:25       ` Tarun Tej K
@ 2020-01-08  7:41         ` Florian Weimer
  2020-01-20 14:50           ` Filip Ochnik
  0 siblings, 1 reply; 7+ messages in thread
From: Florian Weimer @ 2020-01-08  7:41 UTC (permalink / raw)
  To: Tarun Tej K; +Cc: libc-help

* Tarun Tej K.:

> Is there any function apart from res_init() and res_ninit() that
> clears the global configuration and forces it to be loaded afresh on
> the next getaddrinfo() call?

You mean, a function that you can call to work around the bug?  No,
there is not, sorry.

Florian


* Re: getaddrinfo() fails to use latest DNS address - v2.27
  2020-01-08  7:41         ` Florian Weimer
@ 2020-01-20 14:50           ` Filip Ochnik
  0 siblings, 0 replies; 7+ messages in thread
From: Filip Ochnik @ 2020-01-20 14:50 UTC (permalink / raw)
  To: libc-help

Hi,

I believe I hit the same issue and found the root cause. See this bug report: https://sourceware.org/bugzilla/show_bug.cgi?id=25420

Filip
