Issue with stale resolv.conf state

public inbox for libc-alpha@sourceware.org

* Issue with stale resolv.conf state
@ 2024-03-11  9:08 John Levon
  2024-03-11 10:51 ` Florian Weimer
  0 siblings, 1 reply; 8+ messages in thread
From: John Levon @ 2024-03-11  9:08 UTC
  To: libc-alpha; +Cc: fweimer


I have an intermittent issue where a getaddrinfo()-using application uses stale
nameservers. That is, /etc/resolv.conf has been updated, the original
nameservers are not reachable at all, but the application doesn't ever notice.
Note that this only reproduces very occasionally, so it is difficult for me to
distill into a simple test case.

This is with glibc 2.35 but from a quick look I didn't see any changes in master
that would help.

I confirmed that glibc never stat()s the file, and this is because we are here:

  /* Initialize *RESP if RES_INIT is not yet set in RESP->options, or if
     res_init in some other thread requested re-initializing.  */
  static __attribute__ ((warn_unused_result)) bool
  maybe_init (struct resolv_context *ctx, bool preinit)
  {
    struct __res_state *resp = ctx->resp;
    if (resp->options & RES_INIT)
      {
        if (resp->options & RES_NORELOAD)
          /* Configuration reloading was explicitly disabled.  */
          return true;

        /* If there is no associated resolv_conf object despite the
           initialization, something modified *ctx->resp.  Do not
           override those changes.  */
        if (ctx->conf != NULL && replicated_configuration_matches (ctx))

And "replicated_configuration_matches()" is false. Thus we never examine the
file for any changes and continue using the old version indefinitely.

I don't understand the first part of the comment, but indeed, ->resp doesn't
match. In particular:

  return ctx->resp->options == ctx->conf->options

and ctx->resp (aka _resp) has 0x47002c1 whereas ctx->conf has 0x41002c1.

I'm not sure but I suspect the additional RES_SNGLKUP|RES_SNGLKUPREOP may be due
to this code:

  /* There are quite a few broken name servers out
     there which don't handle two outstanding
     requests from the same source.  There are also
     broken firewall settings.  If we time out after
     having received one answer switch to the mode
     where we send the second request only once we
     have received the first answer.  */
  if (!single_request)
    {
      statp->options |= RES_SNGLKUP;
      single_request = true;
      *gotsomewhere = save_gotsomewhere;
      goto retry;
    }
  else if (!single_request_reopen)
    {
      statp->options |= RES_SNGLKUPREOP;
      single_request_reopen = true;
      *gotsomewhere = save_gotsomewhere;
      __res_iclose (statp, false);
      goto retry_reopen;
    }

I'm guessing these got set when the VPN dropped routing to the old nameservers,
but before the next getaddrinfo() came in, thus leading to the match failing.

I can't see where the application code itself can be at fault here, but I'm not
100% confident about the above analysis either. Any thoughts?

thanks
john


* Re: Issue with stale resolv.conf state
  2024-03-11  9:08 Issue with stale resolv.conf state John Levon
@ 2024-03-11 10:51 ` Florian Weimer
  2024-03-12  0:51   ` Cristian Rodríguez
  0 siblings, 1 reply; 8+ messages in thread
From: Florian Weimer @ 2024-03-11 10:51 UTC
  To: John Levon; +Cc: libc-alpha

* John Levon:

> I don't understand the first part of the comment, but indeed, ->resp doesn't
> match. In particular:
>
>   return ctx->resp->options == ctx->conf->options
>
> and ctx->resp (aka _resp) has 0x47002c1 whereas ctx->conf has 0x41002c1.
>
> I'm not sure but I suspect the additional RES_SNGLKUP|RES_SNGLKUPREOP
> may be due to this code:
>
>   /* There are quite a few broken name servers out
>      there which don't handle two outstanding
>      requests from the same source.  There are also
>      broken firewall settings.  If we time out after
>      having received one answer switch to the mode
>      where we send the second request only once we
>      have received the first answer.  */
>   if (!single_request)
>     {
>       statp->options |= RES_SNGLKUP;
>       single_request = true;
>       *gotsomewhere = save_gotsomewhere;
>       goto retry;
>     }
>   else if (!single_request_reopen)
>     {
>       statp->options |= RES_SNGLKUPREOP;
>       single_request_reopen = true;
>       *gotsomewhere = save_gotsomewhere;
>       __res_iclose (statp, false);
>       goto retry_reopen;
>     }


That's a very good point.  Yes, the current reloading code does not take
into account that we change _res.options dynamically based on network
behavior.

That automatic configuration change based on temporary network glitches
is problematic in other contexts as well (it may further trigger bugs in
dual query processing).

Maybe we should just remove the automatic downgrade, basically not
persist this across queries anymore.

Thanks,
Florian



* Re: Issue with stale resolv.conf state
  2024-03-11 10:51 ` Florian Weimer
@ 2024-03-12  0:51   ` Cristian Rodríguez
  2024-03-12  6:45     ` Florian Weimer
  0 siblings, 1 reply; 8+ messages in thread
From: Cristian Rodríguez @ 2024-03-12  0:51 UTC
  To: Florian Weimer; +Cc: John Levon, libc-alpha

On Mon, Mar 11, 2024 at 7:51 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * John Levon:
>
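> That automatic configuration change based on temporary network glitches
> is problematic in other contexts as well (it may further trigger bugs in
> dual query processing).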
>
> Maybe we should just remove the automatic downgrade, basically not
> persist this across queries anymore.

Yeah. +1. Users of those broken nameservers deserve at least noticing
they are wrong if such systems are really still around ..


* Re: Issue with stale resolv.conf state
  2024-03-12  0:51   ` Cristian Rodríguez
@ 2024-03-12  6:45     ` Florian Weimer
  2024-03-12  9:09       ` Philip Sanetra
                         ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Florian Weimer @ 2024-03-12  6:45 UTC
  To: Cristian Rodríguez; +Cc: John Levon, libc-alpha

* Cristian Rodríguez:

> On Mon, Mar 11, 2024 at 7:51 AM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * John Levon:
>>
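>> That automatic configuration change based on temporary network glitches
>> is problematic in other contexts as well (it may further trigger bugs in
>> dual query processing).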
>>
>> Maybe we should just remove the automatic downgrade, basically not
>> persist this across queries anymore.
>
> Yeah. +1. Users of those broken nameservers deserve at least noticing
> they are wrong if such systems are really still around ..

I filed:

  Automatic activation of single-request options break resolv.conf reloading
  <https://sourceware.org/bugzilla/show_bug.cgi?id=31476>

On the other hand, we have this request:

| Change resolv.conf default to single-request
| […]
| We have the year 2022 and these issues still occur, so it was not some
| kind of issue that went away by time as it was possibly expected when
| glibc 2.10 was released.

<https://sourceware.org/bugzilla/show_bug.cgi?id=29017>

So the solution might not be so straightforward.

Thanks,
Florian



* Re: Issue with stale resolv.conf state
  2024-03-12  6:45     ` Florian Weimer
@ 2024-03-12  9:09       ` Philip Sanetra
  2024-03-12 10:04       ` John Levon
  2024-03-12 14:25       ` Cristian Rodríguez
  2 siblings, 0 replies; 8+ messages in thread
From: Philip Sanetra @ 2024-03-12  9:09 UTC
  To: Florian Weimer; +Cc: Cristian Rodríguez, John Levon, libc-alpha


Hi,

I think removing the automatic downgrade without also making the single-request option the default behavior would break a lot of systems.

I know of at least two environments in different companies where the default behavior results in 5-second timeouts and only the automatic downgrade improves performance in subsequent DNS lookups.

I would appreciate using single-request option as default, like mentioned in https://sourceware.org/bugzilla/show_bug.cgi?id=29017

Regards,
Philip Sanetra


On Tuesday, 12 March 2024 at 7:45 AM, Florian Weimer <fweimer@redhat.com> wrote:

> * Cristian Rodríguez:
>
> > On Mon, Mar 11, 2024 at 7:51 AM Florian Weimer fweimer@redhat.com wrote:
> >
> > > * John Levon:
> > >
> > > That automatic configuration change based on temporary network glitches
> > > is problematic in other contexts as well (it may further trigger bugs in
> > > dual query processing).
> > >
> > > Maybe we should just remove the automatic downgrade, basically not
> > > persist this across queries anymore.
> >
> > Yeah. +1. Users of those broken nameservers deserve at least noticing
> > they are wrong if such systems are really still around ..
>
> I filed:
>
> Automatic activation of single-request options break resolv.conf reloading
> https://sourceware.org/bugzilla/show_bug.cgi?id=31476
>
> On the other hand, we have this request:
>
> | Change resolv.conf default to single-request
> | […]
> | We have the year 2022 and these issues still occur, so it was not some
> | kind of issue that went away by time as it was possibly expected when
> | glibc 2.10 was released.
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=29017
>
> So the solution might not be so straightforward.
>
> Thanks,
> Florian



* Re: Issue with stale resolv.conf state
  2024-03-12  6:45     ` Florian Weimer
  2024-03-12  9:09       ` Philip Sanetra
@ 2024-03-12 10:04       ` John Levon
  2024-03-12 14:25       ` Cristian Rodríguez
  2 siblings, 0 replies; 8+ messages in thread
From: John Levon @ 2024-03-12 10:04 UTC
  To: Florian Weimer; +Cc: Cristian Rodríguez, libc-alpha

On Tue, Mar 12, 2024 at 07:45:23AM +0100, Florian Weimer wrote:

> * Cristian Rodríguez:
> 
> > On Mon, Mar 11, 2024 at 7:51 AM Florian Weimer <fweimer@redhat.com> wrote:
> >>
> >> * John Levon:
> >>
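> >> That automatic configuration change based on temporary network glitches
> >> is problematic in other contexts as well (it may further trigger bugs in
> >> dual query processing).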
> >>
> >> Maybe we should just remove the automatic downgrade, basically not
> >> persist this across queries anymore.
> >
> > Yeah. +1. Users of those broken nameservers deserve at least noticing
> > they are wrong if such systems are really still around ..
> 
> I filed:
> 
>   Automatic activation of single-request options break resolv.conf reloading
>   <https://sourceware.org/bugzilla/show_bug.cgi?id=31476>

Probably a stupid suggestion, but would filtering out these specific flags in
the replicated_configuration_matches() comparison help? Then we'd still go down
the "stat() /etc/resolv.conf" path.

regards
john


* Re: Issue with stale resolv.conf state
  2024-03-12  6:45     ` Florian Weimer
  2024-03-12  9:09       ` Philip Sanetra
  2024-03-12 10:04       ` John Levon
@ 2024-03-12 14:25       ` Cristian Rodríguez
  2024-03-12 14:30         ` Florian Weimer
  2 siblings, 1 reply; 8+ messages in thread
From: Cristian Rodríguez @ 2024-03-12 14:25 UTC
  To: Florian Weimer; +Cc: John Levon, libc-alpha

On Tue, Mar 12, 2024 at 3:45 AM Florian Weimer <fweimer@redhat.com> wrote:

> So the solution might not be so straightforward.

I'm not being sarcastic or anything, but if the standards do not
recommend an approach for this, I strongly suggest doing it like
Windows does.


* Re: Issue with stale resolv.conf state
  2024-03-12 14:25       ` Cristian Rodríguez
@ 2024-03-12 14:30         ` Florian Weimer
  0 siblings, 0 replies; 8+ messages in thread
From: Florian Weimer @ 2024-03-12 14:30 UTC
  To: Cristian Rodríguez; +Cc: John Levon, libc-alpha

* Cristian Rodríguez:

> On Tue, Mar 12, 2024 at 3:45 AM Florian Weimer <fweimer@redhat.com> wrote:
>
>> So the solution might not be so straightforward.
>
> I'm not being sarcastic or anything, but if the standards do not
> recommend an approach for this, I strongly suggest doing it like
> Windows does.

I think Windows has a totally different architecture: their DNS stub
resolver isn't in-process, but system-wide.  So it's easier for them to
share state across multiple requests and processes.

Thanks,
Florian

