public inbox for elfutils@sourceware.org
* caching failed lookups of debuginfo?
@ 2022-04-08 19:58 Milian Wolff
  2022-04-08 20:05 ` Frank Ch. Eigler
  0 siblings, 1 reply; 11+ messages in thread
From: Milian Wolff @ 2022-04-08 19:58 UTC (permalink / raw)
  To: elfutils-devel

[-- Attachment #1: Type: text/plain, Size: 3425 bytes --]

Hey all,

another debuginfod-related question, but unrelated to the other thread I 
started earlier today. In a work branch I have ported my heaptrack profiler 
over to elfutils. I then ran the analyzer that uses elfutils (and thus 
debuginfod internally via dwfl) on a recorded data file to have it download 
all the debug info files it can find.

Now, on consecutive runs, nothing should be downloaded anymore. But I notice 
that there is still a huge slowdown incurred when `DEBUGINFOD_URLS` is set:

With DEBUGINFOD_URLS set:
```
perf stat ./heaptrack_interpret ...
          5,765.36 msec task-clock                #    0.707 CPUs utilized          
...
       8.154873275 seconds time elapsed

       5.477158000 seconds user
       0.271662000 seconds sys
```

Without DEBUGINFOD_URLS set:
```
perf stat ./heaptrack_interpret ...

          2,460.09 msec task-clock                #    0.999 CPUs utilized          
...
       2.461661784 seconds time elapsed

       2.334429000 seconds user
       0.119698000 seconds sys
```

Note the significantly increased wall-time in the first case. The second case 
(without debuginfod) has basically zero off-CPU time and significantly reduced 
on-CPU time too.

Looking at the off-CPU time flame graph and enabling verbose debuginfod 
output leads me to the conclusion that the issue above is due to some 
libraries simply not having any debug information uploaded. I.e. I see dozens 
of reports such as:

```
debuginfod_find_debuginfo 2e245c2bf12f95fd8ab79b3a4be99524677cbd70
server urls "https://debuginfod.archlinux.org/"
checking build-id
checking cache dir /home/milian/.debuginfod_client_cache
using timeout 90
init server 0 https://debuginfod.archlinux.org/buildid
url 0 https://debuginfod.archlinux.org/buildid/2e245c2bf12f95fd8ab79b3a4be99524677cbd70/debuginfo
query 1 urls in parallel
Downloading |
server response HTTP response code said error
url 0 The requested URL returned error: 404
not found No such file or directory (err=-2)
```

These negative lookups are not cached, meaning that rerunning the same process 
using dwfl and debuginfod on the same data will always incur a significant 
slowdown, as we would try again and again to look for something that's not 
there. The lookups take roughly 200ms each for me to realize the data is not 
on the server.

What's worse, I'm seeing multiple lookups for the same buildid *within the 
same process*. I.e.:

```
export DEBUGINFOD_VERBOSE=1
./heaptrack_interpret ... |& egrep "^url 0 https" | sort | uniq -c | sort
...
      6 url 0 https://debuginfod.archlinux.org/buildid/7f4b16b4b407cbae2d7118d6f99610e29a18a56a/debuginfo
      8 url 0 https://debuginfod.archlinux.org/buildid/c09c6f50f6bcec73c64a0b4be77eadb8f7202410/debuginfo
     14 url 0 https://debuginfod.archlinux.org/buildid/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo
```

Here, we are paying roughly `14 * 0.2s = 2.8s` just for a single library.

Can we find a way to improve this situation generically? I would personally 
even be OK with caching the 404 error case locally for some time (say, one 
hour, or one day, or ...). Then at least we would pay this cost at most once 
per library, and not multiple times. And rerunning the analysis a second time 
would become much faster again.

Was there a deliberate decision against caching negative server side lookups?

Thanks
-- 
Milian Wolff
mail@milianw.de
http://milianw.de

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: caching failed lookups of debuginfo?
  2022-04-08 19:58 caching failed lookups of debuginfo? Milian Wolff
@ 2022-04-08 20:05 ` Frank Ch. Eigler
  2022-04-08 20:45   ` Milian Wolff
  0 siblings, 1 reply; 11+ messages in thread
From: Frank Ch. Eigler @ 2022-04-08 20:05 UTC (permalink / raw)
  To: Milian Wolff; +Cc: elfutils-devel

Hi -

> another debuginfod related question, but unrelated to the other thread I 
> started earlier today. In a work branch I have ported my heaptrack profiler 
> over to elfutils. I have then run the analyzer that uses elfutils (and thus 
> debuginfod internally via dwfl) on a recorded data file to have it download 
> all the debug info files it can find.

Nice.


> These negative lookups are not cached. Meaning rerunning the same process 
> using dwfl and debuginfod on the same data would always incur a significant 
> slowdown, as we would again and again try to look for something that's not 
> there. The lookups take roughly ~200ms for me to realize the data is not on 
> the server.

That's not correct, as of elfutils 0.184 (commit 5f72c51a7e5c0),
with some more recent tweaks in (commit 7d64173fb11c6).

- FChE



> What's worse, I'm seeing multiple lookups for the same buildid *within the 
> same process*. I.e.:
> 
> ```
> export DEBUGINFOD_VERBOSE=1
> ./heaptrack_interpret ... |& egrep "^url 0 https" | sort | uniq -c | sort
> ...
>       6 url 0 https://debuginfod.archlinux.org/buildid/
> 7f4b16b4b407cbae2d7118d6f99610e29a18a56a/debuginfo
>       8 url 0 https://debuginfod.archlinux.org/buildid/
> c09c6f50f6bcec73c64a0b4be77eadb8f7202410/debuginfo
>      14 url 0 https://debuginfod.archlinux.org/buildid/
> 85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo
> ```
> 
> Here, we are paying roughly `14 * 0.2s = 2.8s` just for a single library.
> 
> Can we find a way to improve this situation somehow generically? I would 
> personally even be OK with caching the 404 error case locally for some time 
> (say, one hour or one day or ...). Then at least we would at most pay this 
> cost once per library, and not multiple times. And rerunning the analysis a 
> second time would become much faster again.
> 
> Was there a deliberate decision against caching negative server side lookups?
> 
> Thanks
> -- 
> Milian Wolff
> mail@milianw.de
> http://milianw.de




* Re: caching failed lookups of debuginfo?
  2022-04-08 20:05 ` Frank Ch. Eigler
@ 2022-04-08 20:45   ` Milian Wolff
  2022-04-08 20:59     ` Mark Wielaard
  0 siblings, 1 reply; 11+ messages in thread
From: Milian Wolff @ 2022-04-08 20:45 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: elfutils-devel

[-- Attachment #1: Type: text/plain, Size: 4723 bytes --]

On Freitag, 8. April 2022 22:05:27 CEST Frank Ch. Eigler wrote:
> Hi -
> 
> > another debuginfod related question, but unrelated to the other thread I
> > started earlier today. In a work branch I have ported my heaptrack
> > profiler
> > over to elfutils. I have then run the analyzer that uses elfutils (and
> > thus
> > debuginfod internally via dwfl) on a recorded data file to have it
> > download
> > all the debug info files it can find.
> 
> Nice.
> 
> > These negative lookups are not cached. Meaning rerunning the same process
> > using dwfl and debuginfod on the same data would always incur a
> > significant
> > slowdown, as we would again and again try to look for something that's not
> > there. The lookups take roughly ~200ms for me to realize the data is not
> > on
> > the server.
> 
> That's not correct, as of elfutils 0.184 (commit 5f72c51a7e5c0),
> with some more recent tweaks in (commit 7d64173fb11c6).

I am running elfutils 0.186 here, and I definitely saw this happening (that's 
where the output from my previous mail came from, after all). I've now wiped my 
local debuginfod cache and tried to replicate it, but magically the repeated 
lookups are gone. This is very mysterious...

Nevertheless, I clearly still see a significant overhead from having 
DEBUGINFOD_URLS set, even after the first run, which should have populated the 
cache, no? I.e. check this out:

```
$ export DEBUGINFOD_URLS=https://debuginfod.archlinux.org/
$ perf stat -r 5 ./heaptrack_interpret ...
          5,993.67 msec task-clock                #    0.696 CPUs utilized            ( +-  0.22% )
               301      context-switches          #   50.567 /sec                     ( +-  7.47% )
                 3      cpu-migrations            #    0.504 /sec                     ( +- 19.44% )
           145,574      page-faults               #   24.456 K/sec                    ( +-  0.01% )
    25,026,712,955      cycles                    #    4.204 GHz                      ( +-  0.14% )  (83.27%)
       357,786,453      stalled-cycles-frontend   #    1.43% frontend cycles idle     ( +-  1.77% )  (83.39%)
     4,851,145,553      stalled-cycles-backend    #   19.44% backend cycles idle      ( +-  0.99% )  (83.35%)
    64,893,827,473      instructions              #    2.60  insn per cycle
                                                  #    0.07  stalled cycles per insn  ( +-  0.05% )  (83.29%)
    14,234,732,114      branches                  #    2.391 G/sec                    ( +-  0.04% )  (83.30%)
       111,326,964      branch-misses             #    0.78% of all branches          ( +-  0.17% )  (83.44%)

            8.6057 +- 0.0667 seconds time elapsed  ( +-  0.77% )
```

versus:

```
$ unset DEBUGINFOD_URLS
$ perf stat -r 5 ./heaptrack_interpret ...
          2,505.36 msec task-clock                #    1.009 CPUs utilized            ( +-  0.48% )
               542      context-switches          #  218.486 /sec                     ( +- 17.87% )
                 1      cpu-migrations            #    0.403 /sec                     ( +- 24.49% )
            50,712      page-faults               #   20.443 K/sec                    ( +-  0.02% )
    10,821,713,676      cycles                    #    4.362 GHz                      ( +-  0.15% )  (83.24%)
       129,975,157      stalled-cycles-frontend   #    1.20% frontend cycles idle     ( +-  4.86% )  (83.37%)
     1,719,256,978      stalled-cycles-backend    #   15.93% backend cycles idle      ( +-  1.33% )  (83.38%)
    28,947,652,758      instructions              #    2.68  insn per cycle
                                                  #    0.06  stalled cycles per insn  ( +-  0.03% )  (83.40%)
     6,118,310,875      branches                  #    2.466 G/sec                    ( +-  0.03% )  (83.38%)
        46,907,166      branch-misses             #    0.77% of all branches          ( +-  0.16% )  (83.30%)

            2.4833 +- 0.0124 seconds time elapsed  ( +-  0.50% )
```

There is clearly still a significant slowdown imposed by having debuginfod 
enabled, even after the first run. Is this expected? My assumption was: only 
the first run would potentially be slow, then everything is cached locally.

In the off-CPU flame graphs I clearly see this callchain when DEBUGINFOD_URLS 
is set:

```
dwfl_standard_find_debuginfo > debuginfod_query_server > curl_multi_wait
```

This accounts for ~2.5s of off-CPU time in my case. If you are saying that 
negative lookups are cached, then what is this? Why would a second run of the 
same dwfl-using application with the same input data keep querying the server?

Thanks

-- 
Milian Wolff
mail@milianw.de
http://milianw.de

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: caching failed lookups of debuginfo?
  2022-04-08 20:45   ` Milian Wolff
@ 2022-04-08 20:59     ` Mark Wielaard
  2022-04-08 21:08       ` Milian Wolff
  0 siblings, 1 reply; 11+ messages in thread
From: Mark Wielaard @ 2022-04-08 20:59 UTC (permalink / raw)
  To: Milian Wolff; +Cc: Frank Ch. Eigler, elfutils-devel

Hi Milian,

On Fri, Apr 08, 2022 at 10:45:10PM +0200, Milian Wolff wrote:
> In the off-CPU flame graphs I clearly see this callchain when DEBUGINFOD_URLS 
> is set:
> 
> ```
> dwfl_standard_find_debuginfo > debuginfod_query_server > curl_multi_wait
> ```
> 
> This accounts for ~2.5s of off-CPU time in my case. If you are saying that 
> negative lookups are cached, then what is this? Why would a second run of the 
> same dwfl-using application with the same input data keep querying the server?

That is certainly not what should happen once the cache is filled.
Could you run with DEBUGINFOD_VERBOSE=1 to see what is being fetched and why?

Thanks,

Mark



* Re: caching failed lookups of debuginfo?
  2022-04-08 20:59     ` Mark Wielaard
@ 2022-04-08 21:08       ` Milian Wolff
  2022-04-08 21:34         ` Aaron Merey
  0 siblings, 1 reply; 11+ messages in thread
From: Milian Wolff @ 2022-04-08 21:08 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: Frank Ch. Eigler, elfutils-devel

[-- Attachment #1: Type: text/plain, Size: 2977 bytes --]

On Freitag, 8. April 2022 22:59:55 CEST Mark Wielaard wrote:
> Hi Milian,
> 
> On Fri, Apr 08, 2022 at 10:45:10PM +0200, Milian Wolff wrote:
> > In the off-CPU flame graphs I clearly see this callchain when
> > DEBUGINFOD_URLS is set:
> > 
> > ```
> > dwfl_standard_find_debuginfo > debuginfod_query_server > curl_multi_wait
> > ```
> > 
> > This accounts for ~2.5s of off-CPU time in my case. If you are saying that
> > negative lookups are cached, then what is this? Why would a second run of
> > the same dwfl-using application with the same input data keep querying
> > the server?
>
> That is certainly not what should happen once the cache is filled.
> Could you run with DEBUGINFOD_VERBOSE=1 to see what is being fetched and
> why?

My first mail showed this already. But here is one such block:

```
debuginfod_find_debuginfo 85766e9d8458b16e9c7ce6e07c712c02b8471dbc
server urls "https://debuginfod.archlinux.org/"
checking build-id
checking cache dir /home/milian/.cache/debuginfod_client
using timeout 90
init server 0 https://debuginfod.archlinux.org/buildid
url 0 https://debuginfod.archlinux.org/buildid/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo
query 1 urls in parallel
server response HTTP response code said error
url 0 The requested URL returned error: 404
not found No such file or directory (err=-2)
```

I see this repeated multiple times - both for different files and for the same 
file. I.e.:

```
$ /home/milian/projects/compiled/other/lib/heaptrack/libexec/heaptrack_interpret < heaptrack.kate.52068.raw |& egrep "^url 0 https" | sort | uniq -c | sort -n
```

Produces this output here locally for me: https://invent.kde.org/-/snippets/2153

I can reproduce it now suddenly with debuginfod-find too:

```
$ debuginfod-find debuginfo 85766e9d8458b16e9c7ce6e07c712c02b8471dbc
debuginfod_find_debuginfo 85766e9d8458b16e9c7ce6e07c712c02b8471dbc
server urls "https://debuginfod.archlinux.org/"
checking build-id
checking cache dir /home/milian/.cache/debuginfod_client
using timeout 90
init server 0 https://debuginfod.archlinux.org/buildid
url 0 https://debuginfod.archlinux.org/buildid/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo
query 1 urls in parallel
server response HTTP response code said error
url 0 The requested URL returned error: 404
not found No such file or directory (err=-2)
Server query failed: No such file or directory
```

I do see an empty `/home/milian/.cache/debuginfod_client/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo` 
file. But the server is still queried (i.e. rerunning the above command always 
produces the same output for me). The lookup is costly too at ~70ms overall:

```
$ perf stat -r 5 debuginfod-find debuginfo 85766e9d8458b16e9c7ce6e07c712c02b8471dbc
              7.83 msec task-clock                #    0.114 CPUs utilized            ( +-  4.75% )
...
           0.06857 +- 0.00187 seconds time elapsed  ( +-  2.73% )

```

Thanks
-- 
Milian Wolff
mail@milianw.de
http://milianw.de

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: caching failed lookups of debuginfo?
  2022-04-08 21:08       ` Milian Wolff
@ 2022-04-08 21:34         ` Aaron Merey
  2022-04-08 21:56           ` Milian Wolff
  0 siblings, 1 reply; 11+ messages in thread
From: Aaron Merey @ 2022-04-08 21:34 UTC (permalink / raw)
  To: Milian Wolff; +Cc: Mark Wielaard, elfutils-devel

Hi Milian,

On Fri, Apr 8, 2022 at 5:08 PM Milian Wolff <mail@milianw.de> wrote:
> I can reproduce it now suddenly with debuginfod-find too:
>
> ```
> $
> debuginfod-find debuginfo 85766e9d8458b16e9c7ce6e07c712c02b8471dbc
> debuginfod_find_debuginfo 85766e9d8458b16e9c7ce6e07c712c02b8471dbc
> server urls "https://debuginfod.archlinux.org/"
> checking build-id
> checking cache dir /home/milian/.cache/debuginfod_client
> using timeout 90
> init server 0 https://debuginfod.archlinux.org/buildid
> url 0 https://debuginfod.archlinux.org/buildid/
> 85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo
> query 1 urls in parallel
> server response HTTP response code said error
> url 0 The requested URL returned error: 404
> not found No such file or directory (err=-2)
> Server query failed: No such file or directory
> ```
>
> I do see an empty `/home/milian/.cache/debuginfod_client/
> 85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo` file. But the server is
> still queried (i.e. rerunning the above command always produces the same
> output for me). The lookup is costly too at ~70ms overall:

I am not able to reproduce this on Fedora 35 using commit 8db849976f070.

The first time I run debuginfod-find I get similar output to you:
```
debuginfod_find_debuginfo 85766e9d8458b16e9c7ce6e07c712c02b8471dbc
server urls "https://debuginfod.archlinux.org/"
checking build-id
checking cache dir /home/amerey/.cache/debuginfod_client
using timeout 90
init server 0 https://debuginfod.archlinux.org/buildid
url 0 https://debuginfod.archlinux.org/buildid/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo
query 1 urls in parallel
server response HTTP response code said error
url 0 The requested URL returned error: 404
not found No such file or directory (err=-2)
Server query failed: No such file or directory
```

But when I rerun debuginfod-find I get output indicating that the cache was
checked:
```
debuginfod_find_debuginfo 85766e9d8458b16e9c7ce6e07c712c02b8471dbc
server urls "https://debuginfod.archlinux.org/"
checking build-id
checking cache dir /home/amerey/.cache/debuginfod_client
not found No such file or directory (err=-2)
Server query failed: No such file or directory
```

No calls to curl_multi_wait occur. Is there a "cache_miss_s" file in the top
level of your debuginfod cache? It should contain the number of seconds
that must elapse after a failed attempt before another query would be
attempted (default is 600 seconds).

Aaron



* Re: caching failed lookups of debuginfo?
  2022-04-08 21:34         ` Aaron Merey
@ 2022-04-08 21:56           ` Milian Wolff
  2022-04-08 22:21             ` Mark Wielaard
  2022-04-08 22:23             ` Milian Wolff
  0 siblings, 2 replies; 11+ messages in thread
From: Milian Wolff @ 2022-04-08 21:56 UTC (permalink / raw)
  To: Aaron Merey; +Cc: Mark Wielaard, elfutils-devel

[-- Attachment #1: Type: text/plain, Size: 6088 bytes --]

On Freitag, 8. April 2022 23:34:06 CEST Aaron Merey wrote:
> Hi Milian,
> 
> On Fri, Apr 8, 2022 at 5:08 PM Milian Wolff <mail@milianw.de> wrote:
> > I can reproduce it now suddenly with debuginfod-find too:
> > 
> > ```
> > $
> > debuginfod-find debuginfo 85766e9d8458b16e9c7ce6e07c712c02b8471dbc
> > debuginfod_find_debuginfo 85766e9d8458b16e9c7ce6e07c712c02b8471dbc
> > server urls "https://debuginfod.archlinux.org/"
> > checking build-id
> > checking cache dir /home/milian/.cache/debuginfod_client
> > using timeout 90
> > init server 0 https://debuginfod.archlinux.org/buildid
> > url 0 https://debuginfod.archlinux.org/buildid/
> > 85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo
> > query 1 urls in parallel
> > server response HTTP response code said error
> > url 0 The requested URL returned error: 404
> > not found No such file or directory (err=-2)
> > Server query failed: No such file or directory
> > ```
> > 
> > I do see an empty `/home/milian/.cache/debuginfod_client/
> > 85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo` file. But the server is
> > still queried (i.e. rerunning the above command always produces the same
> > output for me). The lookup is costly too at ~70ms overall:
> I am not able to reproduce this on Fedora 35 using commit 8db849976f070.
> 
> The first time I run debuginfod-find I get similar output to you:
> ```
> debuginfod_find_debuginfo 85766e9d8458b16e9c7ce6e07c712c02b8471dbc
> server urls "https://debuginfod.archlinux.org/"
> checking build-id
> checking cache dir /home/amerey/.cache/debuginfod_client
> using timeout 90
> init server 0 https://debuginfod.archlinux.org/buildid
> url 0 https://debuginfod.archlinux.org/buildid/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo
> query 1 urls in parallel
> server response HTTP response code said error
> url 0 The requested URL returned error: 404
> not found No such file or directory (err=-2)
> Server query failed: No such file or directory
> ```
> 
> But when I rerun debuginfod-find I get output indicating that the cache was
> checked:
> ```
> debuginfod_find_debuginfo 85766e9d8458b16e9c7ce6e07c712c02b8471dbc
> server urls "https://debuginfod.archlinux.org/"
> checking build-id
> checking cache dir /home/amerey/.cache/debuginfod_client
> not found No such file or directory (err=-2)
> Server query failed: No such file or directory
> ```
> 
> No calls to curl_multi_wait occur. Is there a "cache_miss_s" file in the top
> level of your debuginfod cache? It should contain the number of seconds
> that must elapse after a failed attempt before another query would be
> attempted (default is 600 seconds).

Yes that file exists and it contains 600.

Running through strace, I see that it actually removes the empty file first. 
This looks totally bogus:

```
strace -e file -f debuginfod-find debuginfo 85766e9d8458b16e9c7ce6e07c712c02b8471dbc |& grep "cache/debuginfod_client"
checking cache dir /home/milian/.cache/debuginfod_client
newfstatat(AT_FDCWD, "/home/milian/.cache/debuginfod_client", {st_mode=S_IFDIR|0755, st_size=12288, ...}, 0) = 0
newfstatat(AT_FDCWD, "/home/milian/.cache/debuginfod_client/cache_clean_interval_s", {st_mode=S_IFREG|0644, st_size=5, ...}, 0) = 0
newfstatat(AT_FDCWD, "/home/milian/.cache/debuginfod_client/cache_clean_interval_s", {st_mode=S_IFREG|0644, st_size=5, ...}, 0) = 0
openat(AT_FDCWD, "/home/milian/.cache/debuginfod_client/cache_clean_interval_s", O_RDONLY) = 5
newfstatat(AT_FDCWD, "/home/milian/.cache/debuginfod_client/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo", {st_mode=S_IFREG|000, st_size=0, ...}, 0) = 0
newfstatat(AT_FDCWD, "/home/milian/.cache/debuginfod_client/cache_miss_s", {st_mode=S_IFREG|0644, st_size=3, ...}, 0) = 0
openat(AT_FDCWD, "/home/milian/.cache/debuginfod_client/cache_miss_s", O_RDONLY) = 5
unlink("/home/milian/.cache/debuginfod_client/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo") = 0
openat(AT_FDCWD, "/home/milian/.cache/debuginfod_client/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo", O_RDONLY) = -1 ENOENT (No such file or directory)
mkdir("/home/milian/.cache/debuginfod_client/85766e9d8458b16e9c7ce6e07c712c02b8471dbc", 0700) = -1 EEXIST (File exists)
openat(AT_FDCWD, "/home/milian/.cache/debuginfod_client/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo.KdSnXl", O_RDWR|O_CREAT|O_EXCL, 0600) = 5
openat(AT_FDCWD, "/home/milian/.cache/debuginfod_client/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo", O_RDONLY|O_CREAT|O_EXCL, 000) = 7
unlink("/home/milian/.cache/debuginfod_client/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo.KdSnXl") = 0
rmdir("/home/milian/.cache/debuginfod_client/85766e9d8458b16e9c7ce6e07c712c02b8471dbc") = -1 ENOTEMPTY (Directory not empty)
```

Note how there's an unlink first, followed by an openat of the file that just 
got removed? Then a mkdir for a folder that still exists. Then a tmpfile for 
the download. Then follows the creation of the empty file. And then it tries 
to rmdir the whole folder, which fails, but that still sounds wrong too, no?

Backtraces for the unlink:

```
unlink("/home/milian/.cache/debuginfod_client/
85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo") = 0
 > /usr/lib/libc.so.6(unlink+0xb) [0x103aab]
 > /usr/lib/libdebuginfod-0.186.so(debuginfod_add_http_header+0x1821) [0x3de1]
 > /usr/bin/debuginfod-find() [0x122f]
 > /usr/lib/libc.so.6(__libc_start_call_main+0x80) [0x2d310]
 > /usr/lib/libc.so.6(__libc_start_main+0x81) [0x2d3c1]
 > /usr/bin/debuginfod-find() [0x1495]
```

That is:

```
eu-addr2line -fe /usr/lib/libdebuginfod-0.186.so -a 0x3de1
0x0000000000003de1
debuginfod_query_server
/usr/src/debug/elfutils-0.186/debuginfod/debuginfod-client.c:791:9
```

Which in turn points at the code that does cache cleanup in 
`debuginfod_query_server`. I now used `rr` to record such a bogus run and I 
clearly see that `(time(NULL) - st.st_mtime <= cache_miss)` is false and it 
goes into the unlink case.

I'll try to debug this further now - I definitely do not wait 600s in between 
these runs here...

-- 
Milian Wolff
mail@milianw.de
http://milianw.de

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: caching failed lookups of debuginfo?
  2022-04-08 21:56           ` Milian Wolff
@ 2022-04-08 22:21             ` Mark Wielaard
  2022-04-08 22:23             ` Milian Wolff
  1 sibling, 0 replies; 11+ messages in thread
From: Mark Wielaard @ 2022-04-08 22:21 UTC (permalink / raw)
  To: Milian Wolff; +Cc: Aaron Merey, elfutils-devel

Hi Milian,

On Fri, Apr 08, 2022 at 11:56:15PM +0200, Milian Wolff wrote:
> Which in turn points at the code that does cache cleanup in 
> `debuginfod_query_server`. I now used `rr` to record such a bogus run and I 
> clearly see that `(time(NULL) - st.st_mtime <= cache_miss)` is false and it 
> goes into the unlink case.
> 
> I'll try to debug this further now - I definitely do not wait 600s inbetween 
> these runs here...

I can replicate this on a Debian system with /home on ext4. st_mtime
doesn't ever seem to change. When the file is created, the st_mtime
seems to be in the past, so time(NULL) - st.st_mtime <= cache_miss is
always false. I am not sure why yet.

Cheers,

Mark



* Re: caching failed lookups of debuginfo?
  2022-04-08 21:56           ` Milian Wolff
  2022-04-08 22:21             ` Mark Wielaard
@ 2022-04-08 22:23             ` Milian Wolff
  2022-04-08 22:40               ` Mark Wielaard
  1 sibling, 1 reply; 11+ messages in thread
From: Milian Wolff @ 2022-04-08 22:23 UTC (permalink / raw)
  To: Aaron Merey; +Cc: Mark Wielaard, elfutils-devel

[-- Attachment #1: Type: text/plain, Size: 1615 bytes --]

On Freitag, 8. April 2022 23:56:15 CEST Milian Wolff wrote:

<snip>

> Which in turn points at the code that does cache cleanup in
> `debuginfod_query_server`. I now used `rr` to record such a bogus run and I
> clearly see that `(time(NULL) - st.st_mtime <= cache_miss)` is false and it
> goes into the unlink case.
> 
> I'll try to debug this further now - I definitely do not wait 600s inbetween
> these runs here...

I compiled elfutils with debug output, and here's what I can see when I run 
`debuginfod-find`:

time(NULL) = 1649455510
st.st_mtime = 1649448650
delta = 6860
cache_miss = 600

The longer I wait, the bigger the delta becomes - i.e. for every second I 
wait, the `delta` increases by one.

And I think I know where this comes from:

```
# first we stat the target cache path
if (stat(target_cache_path, &st) == 0)
  {

  # then we pass _the same st_ to
  debuginfod_config_cache(cache_miss_path, cache_miss_default_s, &st)

  # which internally will do 
  stat(config_path, st)

  # then we check the time delta
  time(NULL) - st.st_mtime <= cache_miss
```

I.e. when we check the time delta, we only take the time stamp of the 
`config_path` into account - the time stamp of the `target_cache_path` is 
ignored.

I mount my filesystems with relatime (old advice for SSDs, probably not 
relevant anymore?). I guess that's the issue then?

Can we change the above code to store the `st.st_mtime` for 
`target_cache_path` and use that for comparison purposes? That fixes the issue 
for my case. If this is acceptable, I'll provide a patch.

Thanks
-- 
Milian Wolff
mail@milianw.de
http://milianw.de

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: caching failed lookups of debuginfo?
  2022-04-08 22:23             ` Milian Wolff
@ 2022-04-08 22:40               ` Mark Wielaard
  2022-04-08 22:54                 ` Aaron Merey
  0 siblings, 1 reply; 11+ messages in thread
From: Mark Wielaard @ 2022-04-08 22:40 UTC (permalink / raw)
  To: Milian Wolff; +Cc: Aaron Merey, elfutils-devel

Hi Milian,

On Sat, Apr 09, 2022 at 12:23:32AM +0200, Milian Wolff wrote:
> ```
> # first we stat the target cache path
> if (stat(target_cache_path, &st) == 0
>   {
> 
>   # then we pass _the same st_ to
>   debuginfod_config_cache(cache_miss_path, cache_miss_default_s, &st)
> 
>   # which internally will do 
>   stat(config_path, st)
> 
>   # then we check the time delta
>   time(NULL) - st.st_mtime <= cache_miss
> ```
> 
> I.e. when we check the time delta, we only take the time stamp of the 
> `config_path` into account - the time stamp of the `target_cache_path` is 
> ignored.

O, wow. Yes, reusing the struct st is bad.

> I mount my filesystems with relatime (old advise for ssd's, probably not 
> relevant anymore?). I guess that's the issue then?

No, I think this is unrelated.

> Can we change the above code to store the `st.st_mtime` for 
> `target_cache_path` and use that for comparison purposes? That fixes the issue 
> for my case. If this is acceptable, I'll provide a patch.

Yes please. Thanks so much for finding this.

I think we as developers keep clearing the cache to test stuff.  But
that means we recreate the cache_miss file every time, so that gets a
new mtime. And if you are just testing for 10 minutes everything looks
fine...

Cheers,

Mark



* Re: caching failed lookups of debuginfo?
  2022-04-08 22:40               ` Mark Wielaard
@ 2022-04-08 22:54                 ` Aaron Merey
  0 siblings, 0 replies; 11+ messages in thread
From: Aaron Merey @ 2022-04-08 22:54 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: Milian Wolff, elfutils-devel

Thanks for spotting this Milian.

On Fri, Apr 8, 2022 at 6:40 PM Mark Wielaard <mark@klomp.org> wrote:
> I think we as developers keep clearing the cache to test stuff.  But
> that means we recreate the cache_miss file every time, so that gets a
> new mtime. And if you are just testing for 10 minutes everything looks
> fine...

This explains why I couldn't reproduce it at first; 10 minutes hadn't elapsed.
I tried it again some time later and noticed that every debuginfod-find
had started resulting in a query, but Milian got to the bottom of it first.

Aaron



end of thread, other threads:[~2022-04-08 22:54 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-08 19:58 caching failed lookups of debuginfo? Milian Wolff
2022-04-08 20:05 ` Frank Ch. Eigler
2022-04-08 20:45   ` Milian Wolff
2022-04-08 20:59     ` Mark Wielaard
2022-04-08 21:08       ` Milian Wolff
2022-04-08 21:34         ` Aaron Merey
2022-04-08 21:56           ` Milian Wolff
2022-04-08 22:21             ` Mark Wielaard
2022-04-08 22:23             ` Milian Wolff
2022-04-08 22:40               ` Mark Wielaard
2022-04-08 22:54                 ` Aaron Merey
