* caching failed lookups of debuginfo?
@ 2022-04-08 19:58 UTC, Milian Wolff
To: elfutils-devel
(0 siblings, 1 reply; 11+ messages in thread)

Hey all,

another debuginfod-related question, but unrelated to the other thread I
started earlier today.

In a work branch I have ported my heaptrack profiler over to elfutils. I
then ran the analyzer that uses elfutils (and thus debuginfod internally
via dwfl) on a recorded data file to have it download all the debug info
files it can find. On consecutive runs, nothing should be downloaded
anymore. But I notice that there is still a huge slowdown when
`DEBUGINFOD_URLS` is set:

With DEBUGINFOD_URLS set:

```
perf stat ./heaptrack_interpret ...
    5,765.36 msec task-clock    # 0.707 CPUs utilized
...
    8.154873275 seconds time elapsed
    5.477158000 seconds user
    0.271662000 seconds sys
```

Without DEBUGINFOD_URLS set:

```
perf stat ./heaptrack_interpret ...
    2,460.09 msec task-clock    # 0.999 CPUs utilized
...
    2.461661784 seconds time elapsed
    2.334429000 seconds user
    0.119698000 seconds sys
```

Note the significantly increased wall time in the first case. The second
case (without debuginfod) has essentially zero off-CPU time and
significantly reduced on-CPU time too.

Looking at the off-CPU time flame graph and enabling verbose debuginfod
output leads me to conclude that the slowdown is caused by some
libraries simply not having any debug information uploaded.
I see dozens of reports such as:

```
debuginfod_find_debuginfo 2e245c2bf12f95fd8ab79b3a4be99524677cbd70
server urls "https://debuginfod.archlinux.org/"
checking build-id
checking cache dir /home/milian/.debuginfod_client_cache
using timeout 90
init server 0 https://debuginfod.archlinux.org/buildid
url 0 https://debuginfod.archlinux.org/buildid/2e245c2bf12f95fd8ab79b3a4be99524677cbd70/debuginfo
query 1 urls in parallel
Downloading |
server response HTTP response code said error
url 0 The requested URL returned error: 404
not found No such file or directory (err=-2)
```

These negative lookups are not cached. That means rerunning the same
process using dwfl and debuginfod on the same data always incurs a
significant slowdown, as we try again and again to look for something
that is not there. Each lookup takes roughly 200ms before the client
realizes the data is not on the server.

What's worse, I'm seeing multiple lookups for the same build-id *within
the same process*. I.e.:

```
export DEBUGINFOD_VERBOSE=1
./heaptrack_interpret ... |& egrep "^url 0 https" | sort | uniq -c | sort
...
  6 url 0 https://debuginfod.archlinux.org/buildid/7f4b16b4b407cbae2d7118d6f99610e29a18a56a/debuginfo
  8 url 0 https://debuginfod.archlinux.org/buildid/c09c6f50f6bcec73c64a0b4be77eadb8f7202410/debuginfo
 14 url 0 https://debuginfod.archlinux.org/buildid/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo
```

Here, we are paying roughly `14 * 0.2s = 2.8s` just for a single
library.

Can we find a way to improve this situation somehow, generically? I
would personally even be OK with caching the 404 error case locally for
some time (say, one hour or one day). Then at least we would pay this
cost at most once per library, and not multiple times. And rerunning the
analysis a second time would become much faster again.

Was there a deliberate decision against caching negative server-side
lookups?
Thanks

--
Milian Wolff
mail@milianw.de
http://milianw.de
* Re: caching failed lookups of debuginfo?
@ 2022-04-08 20:05 UTC, Frank Ch. Eigler
To: Milian Wolff; +Cc: elfutils-devel

Hi -

> another debuginfod related question, but unrelated to the other thread
> I started earlier today. In a work branch I have ported my heaptrack
> profiler over to elfutils. I have then run the analyzer that uses
> elfutils (and thus debuginfod internally via dwfl) on a recorded data
> file to have it download all the debug info files it can find.

Nice.

> These negative lookups are not cached. Meaning rerunning the same
> process using dwfl and debuginfod on the same data would always incur
> a significant slowdown, as we would again and again try to look for
> something that's not there. The lookups take roughly ~200ms for me to
> realize the data is not on the server.

That's not correct as of elfutils 0.184 (commit 5f72c51a7e5c0), with
some more recent tweaks in commit 7d64173fb11c6.

- FChE
* Re: caching failed lookups of debuginfo?
@ 2022-04-08 20:45 UTC, Milian Wolff
To: Frank Ch. Eigler; +Cc: elfutils-devel

On Freitag, 8. April 2022 22:05:27 CEST Frank Ch. Eigler wrote:
<snip>
> > These negative lookups are not cached. Meaning rerunning the same
> > process using dwfl and debuginfod on the same data would always
> > incur a significant slowdown, as we would again and again try to
> > look for something that's not there. The lookups take roughly ~200ms
> > for me to realize the data is not on the server.
>
> That's not correct, as of elfutils 0.184 (commit 5f72c51a7e5c0),
> with some more recent tweaks in (commit 7d64173fb11c6).

I am running elfutils 0.186 here, and I definitely saw this happening
(that's where the output from my previous mail came from, after all).
I've now wiped my local debuginfod cache and tried to replicate it, but
magically the repeated lookups are gone. This is very mysterious...

Nevertheless, I still see a significant overhead of having
DEBUGINFOD_URLS set, even after the first run should have populated the
cache. I.e. check this out:

```
$ export DEBUGINFOD_URLS=https://debuginfod.archlinux.org/
$ perf stat -r 5 ./heaptrack_interpret ...
      5,993.67 msec task-clock               #   0.696 CPUs utilized            ( +-  0.22% )
           301      context-switches         #  50.567 /sec                     ( +-  7.47% )
             3      cpu-migrations           #   0.504 /sec                     ( +- 19.44% )
       145,574      page-faults              #  24.456 K/sec                    ( +-  0.01% )
25,026,712,955      cycles                   #   4.204 GHz                      ( +-  0.14% )  (83.27%)
   357,786,453      stalled-cycles-frontend  #   1.43% frontend cycles idle     ( +-  1.77% )  (83.39%)
 4,851,145,553      stalled-cycles-backend   #  19.44% backend cycles idle      ( +-  0.99% )  (83.35%)
64,893,827,473      instructions             #   2.60  insn per cycle
                                             #   0.07  stalled cycles per insn  ( +-  0.05% )  (83.29%)
14,234,732,114      branches                 #   2.391 G/sec                    ( +-  0.04% )  (83.30%)
   111,326,964      branch-misses            #   0.78% of all branches          ( +-  0.17% )  (83.44%)

        8.6057 +- 0.0667 seconds time elapsed  ( +-  0.77% )
```

versus:

```
$ unset DEBUGINFOD_URLS
$ perf stat -r 5 ./heaptrack_interpret ...
      2,505.36 msec task-clock               #   1.009 CPUs utilized            ( +-  0.48% )
           542      context-switches         # 218.486 /sec                     ( +- 17.87% )
             1      cpu-migrations           #   0.403 /sec                     ( +- 24.49% )
        50,712      page-faults              #  20.443 K/sec                    ( +-  0.02% )
10,821,713,676      cycles                   #   4.362 GHz                      ( +-  0.15% )  (83.24%)
   129,975,157      stalled-cycles-frontend  #   1.20% frontend cycles idle     ( +-  4.86% )  (83.37%)
 1,719,256,978      stalled-cycles-backend   #  15.93% backend cycles idle      ( +-  1.33% )  (83.38%)
28,947,652,758      instructions             #   2.68  insn per cycle
                                             #   0.06  stalled cycles per insn  ( +-  0.03% )  (83.40%)
 6,118,310,875      branches                 #   2.466 G/sec                    ( +-  0.03% )  (83.38%)
    46,907,166      branch-misses            #   0.77% of all branches          ( +-  0.16% )  (83.30%)

        2.4833 +- 0.0124 seconds time elapsed  ( +-  0.50% )
```

There is clearly still a significant slowdown imposed by having
debuginfod enabled - even after the first run. Is this expected? My
assumption was: only the first run would potentially be slow, then
everything is cached locally.

In the off-CPU flame graphs I clearly see this callchain when
DEBUGINFOD_URLS is set:

```
dwfl_standard_find_debuginfo > debuginfod_query_server > curl_multi_wait
```

This accounts for ~2.5s of off-CPU time in my case. If you are saying
that negative lookups are cached, then what is this? Why would a second
run of the same dwfl-using application with the same input data keep
querying the server?

Thanks

--
Milian Wolff
mail@milianw.de
http://milianw.de
* Re: caching failed lookups of debuginfo?
@ 2022-04-08 20:59 UTC, Mark Wielaard
To: Milian Wolff; +Cc: Frank Ch. Eigler, elfutils-devel

Hi Milian,

On Fri, Apr 08, 2022 at 10:45:10PM +0200, Milian Wolff wrote:
> In the off-CPU flame graphs I clearly see this callchain when
> DEBUGINFOD_URLS is set:
>
> ```
> dwfl_standard_find_debuginfo > debuginfod_query_server > curl_multi_wait
> ```
>
> This accounts for ~2.5s of off-CPU time in my case. If you are saying
> that negative lookups are cached, then what is this? Why would a
> second run of the same dwfl-using application with the same input data
> keep querying the server?

That is certainly not what should happen once the cache is filled.
Could you run with DEBUGINFOD_VERBOSE=1 to see what is being fetched
and why?

Thanks,

Mark
* Re: caching failed lookups of debuginfo?
@ 2022-04-08 21:08 UTC, Milian Wolff
To: Mark Wielaard; +Cc: Frank Ch. Eigler, elfutils-devel

On Freitag, 8. April 2022 22:59:55 CEST Mark Wielaard wrote:
> That is certainly not what should happen once the cache is filled.
> Could you run with DEBUGINFOD_VERBOSE=1 to see what is being fetched
> and why?

My first mail showed this already. But here is one such block:

```
debuginfod_find_debuginfo 85766e9d8458b16e9c7ce6e07c712c02b8471dbc
server urls "https://debuginfod.archlinux.org/"
checking build-id
checking cache dir /home/milian/.cache/debuginfod_client
using timeout 90
init server 0 https://debuginfod.archlinux.org/buildid
url 0 https://debuginfod.archlinux.org/buildid/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo
query 1 urls in parallel
server response HTTP response code said error
url 0 The requested URL returned error: 404
not found No such file or directory (err=-2)
```

I see this repeated multiple times - both for different files and for
the same file. I.e.:

```
$ /home/milian/projects/compiled/other/lib/heaptrack/libexec/heaptrack_interpret < heaptrack.kate.52068.raw |& egrep "^url 0 https" | sort | uniq -c | sort -n
```

produces this output here locally for me:
https://invent.kde.org/-/snippets/2153

I can reproduce it now suddenly with debuginfod-find too:

```
$ debuginfod-find debuginfo 85766e9d8458b16e9c7ce6e07c712c02b8471dbc
debuginfod_find_debuginfo 85766e9d8458b16e9c7ce6e07c712c02b8471dbc
server urls "https://debuginfod.archlinux.org/"
checking build-id
checking cache dir /home/milian/.cache/debuginfod_client
using timeout 90
init server 0 https://debuginfod.archlinux.org/buildid
url 0 https://debuginfod.archlinux.org/buildid/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo
query 1 urls in parallel
server response HTTP response code said error
url 0 The requested URL returned error: 404
not found No such file or directory (err=-2)
Server query failed: No such file or directory
```

I do see an empty
`/home/milian/.cache/debuginfod_client/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo`
file. But the server is still queried, i.e. rerunning the above command
always produces the same output for me. The lookup is costly too, at
~70ms overall:

```
$ perf stat -r 5 debuginfod-find debuginfo 85766e9d8458b16e9c7ce6e07c712c02b8471dbc
    7.83 msec task-clock    # 0.114 CPUs utilized  ( +- 4.75% )
...
    0.06857 +- 0.00187 seconds time elapsed  ( +- 2.73% )
```

Thanks

--
Milian Wolff
mail@milianw.de
http://milianw.de
* Re: caching failed lookups of debuginfo?
@ 2022-04-08 21:34 UTC, Aaron Merey
To: Milian Wolff; +Cc: Mark Wielaard, elfutils-devel

Hi Milian,

On Fri, Apr 8, 2022 at 5:08 PM Milian Wolff <mail@milianw.de> wrote:
> I do see an empty
> `/home/milian/.cache/debuginfod_client/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo`
> file. But the server is still queried (i.e. rerunning the above
> command always produces the same output for me). The lookup is costly
> too at ~70ms overall:

I am not able to reproduce this on Fedora 35 using commit
8db849976f070. The first time I run debuginfod-find I get similar
output to you:

```
debuginfod_find_debuginfo 85766e9d8458b16e9c7ce6e07c712c02b8471dbc
server urls "https://debuginfod.archlinux.org/"
checking build-id
checking cache dir /home/amerey/.cache/debuginfod_client
using timeout 90
init server 0 https://debuginfod.archlinux.org/buildid
url 0 https://debuginfod.archlinux.org/buildid/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo
query 1 urls in parallel
server response HTTP response code said error
url 0 The requested URL returned error: 404
not found No such file or directory (err=-2)
Server query failed: No such file or directory
```

But when I rerun debuginfod-find, I get output indicating that the
cache was checked:

```
debuginfod_find_debuginfo 85766e9d8458b16e9c7ce6e07c712c02b8471dbc
server urls "https://debuginfod.archlinux.org/"
checking build-id
checking cache dir /home/amerey/.cache/debuginfod_client
not found No such file or directory (err=-2)
Server query failed: No such file or directory
```

No calls to curl_multi_wait occur. Is there a "cache_miss_s" file in
the top level of your debuginfod cache? It should contain the number of
seconds that must elapse after a failed attempt before another query
would be attempted (the default is 600 seconds).

Aaron
* Re: caching failed lookups of debuginfo?
@ 2022-04-08 21:56 UTC, Milian Wolff
To: Aaron Merey; +Cc: Mark Wielaard, elfutils-devel

On Freitag, 8. April 2022 23:34:06 CEST Aaron Merey wrote:
<snip>
> No calls to curl_multi_wait occur. Is there a "cache_miss_s" file in
> the top level of your debuginfod cache? It should contain the number
> of seconds that must elapse after a failed attempt before another
> query would be attempted (default is 600 seconds).

Yes, that file exists and it contains 600.

Running through strace, I see that it actually removes the empty file
first. This looks totally bogus:

```
strace -e file -f debuginfod-find debuginfo 85766e9d8458b16e9c7ce6e07c712c02b8471dbc |& grep "cache/debuginfod_client"
checking cache dir /home/milian/.cache/debuginfod_client
newfstatat(AT_FDCWD, "/home/milian/.cache/debuginfod_client", {st_mode=S_IFDIR|0755, st_size=12288, ...}, 0) = 0
newfstatat(AT_FDCWD, "/home/milian/.cache/debuginfod_client/cache_clean_interval_s", {st_mode=S_IFREG|0644, st_size=5, ...}, 0) = 0
newfstatat(AT_FDCWD, "/home/milian/.cache/debuginfod_client/cache_clean_interval_s", {st_mode=S_IFREG|0644, st_size=5, ...}, 0) = 0
openat(AT_FDCWD, "/home/milian/.cache/debuginfod_client/cache_clean_interval_s", O_RDONLY) = 5
newfstatat(AT_FDCWD, "/home/milian/.cache/debuginfod_client/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo", {st_mode=S_IFREG|000, st_size=0, ...}, 0) = 0
newfstatat(AT_FDCWD, "/home/milian/.cache/debuginfod_client/cache_miss_s", {st_mode=S_IFREG|0644, st_size=3, ...}, 0) = 0
openat(AT_FDCWD, "/home/milian/.cache/debuginfod_client/cache_miss_s", O_RDONLY) = 5
unlink("/home/milian/.cache/debuginfod_client/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo") = 0
openat(AT_FDCWD, "/home/milian/.cache/debuginfod_client/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo", O_RDONLY) = -1 ENOENT (No such file or directory)
mkdir("/home/milian/.cache/debuginfod_client/85766e9d8458b16e9c7ce6e07c712c02b8471dbc", 0700) = -1 EEXIST (File exists)
openat(AT_FDCWD, "/home/milian/.cache/debuginfod_client/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo.KdSnXl", O_RDWR|O_CREAT|O_EXCL, 0600) = 5
openat(AT_FDCWD, "/home/milian/.cache/debuginfod_client/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo", O_RDONLY|O_CREAT|O_EXCL, 000) = 7
unlink("/home/milian/.cache/debuginfod_client/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo.KdSnXl") = 0
rmdir("/home/milian/.cache/debuginfod_client/85766e9d8458b16e9c7ce6e07c712c02b8471dbc") = -1 ENOTEMPTY (Directory not empty)
```

Note how there is an unlink first, followed by an openat of the file
that just got removed? Then a mkdir for a folder that still exists.
Then a tmpfile for the download. Then follows the creation of the empty
file. And then it tries to rmdir the whole folder, which fails - but
that still sounds wrong too, no?

Backtrace for the unlink:

```
unlink("/home/milian/.cache/debuginfod_client/85766e9d8458b16e9c7ce6e07c712c02b8471dbc/debuginfo") = 0
 > /usr/lib/libc.so.6(unlink+0xb) [0x103aab]
 > /usr/lib/libdebuginfod-0.186.so(debuginfod_add_http_header+0x1821) [0x3de1]
 > /usr/bin/debuginfod-find() [0x122f]
 > /usr/lib/libc.so.6(__libc_start_call_main+0x80) [0x2d310]
 > /usr/lib/libc.so.6(__libc_start_main+0x81) [0x2d3c1]
 > /usr/bin/debuginfod-find() [0x1495]
```

That is:

```
eu-addr2line -fe /usr/lib/libdebuginfod-0.186.so -a 0x3de1
0x0000000000003de1
debuginfod_query_server
/usr/src/debug/elfutils-0.186/debuginfod/debuginfod-client.c:791:9
```

Which in turn points at the code that does the cache cleanup in
`debuginfod_query_server`. I have now used `rr` to record such a bogus
run, and I clearly see that `(time(NULL) - st.st_mtime <= cache_miss)`
is false, so it goes into the unlink case.

I'll try to debug this further now - I definitely do not wait 600s
between these runs here...

--
Milian Wolff
mail@milianw.de
http://milianw.de
* Re: caching failed lookups of debuginfo?
@ 2022-04-08 22:21 UTC, Mark Wielaard
To: Milian Wolff; +Cc: Aaron Merey, elfutils-devel

Hi Milian,

On Fri, Apr 08, 2022 at 11:56:15PM +0200, Milian Wolff wrote:
> Which in turn points at the code that does the cache cleanup in
> `debuginfod_query_server`. I now used `rr` to record such a bogus run
> and I clearly see that `(time(NULL) - st.st_mtime <= cache_miss)` is
> false and it goes into the unlink case.
>
> I'll try to debug this further now - I definitely do not wait 600s
> in between these runs here...

I can replicate this on a Debian system with /home on ext4. st_mtime
never seems to change: when the file is created, its st_mtime already
seems to be in the past, so time(NULL) - st.st_mtime <= cache_miss is
always false. I am not sure why yet.

Cheers,

Mark
* Re: caching failed lookups of debuginfo?
@ 2022-04-08 22:23 UTC, Milian Wolff
To: Aaron Merey; +Cc: Mark Wielaard, elfutils-devel

On Freitag, 8. April 2022 23:56:15 CEST Milian Wolff wrote:
<snip>
> Which in turn points at the code that does the cache cleanup in
> `debuginfod_query_server`. I now used `rr` to record such a bogus run
> and I clearly see that `(time(NULL) - st.st_mtime <= cache_miss)` is
> false and it goes into the unlink case.
>
> I'll try to debug this further now - I definitely do not wait 600s
> in between these runs here...

I compiled elfutils with debug output, and here is what I see when I
run `debuginfod-find`:

```
time(NULL)  = 1649455510
st.st_mtime = 1649448650
delta       = 6860
cache_miss  = 600
```

The longer I wait, the bigger the delta becomes - i.e. for every second
I wait, the delta increases by one. And I think I know where this comes
from:

```
# first we stat the target cache path
if (stat(target_cache_path, &st) == 0)
  {
    # then we pass _the same st_ to
    debuginfod_config_cache(cache_miss_path, cache_miss_default_s, &st)

    # which internally will do
    stat(config_path, st)

    # then we check the time delta
    time(NULL) - st.st_mtime <= cache_miss
```

I.e. when we check the time delta, we only take the timestamp of the
`config_path` into account - the timestamp of the `target_cache_path`
is ignored.

I mount my filesystems with relatime (old advice for SSDs, probably not
relevant anymore?). I guess that's the issue then?

Can we change the above code to store the `st.st_mtime` for
`target_cache_path` and use that for comparison purposes? That fixes
the issue in my case. If this is acceptable, I'll provide a patch.
Thanks

--
Milian Wolff
mail@milianw.de
http://milianw.de
* Re: caching failed lookups of debuginfo?
@ 2022-04-08 22:40 UTC, Mark Wielaard
To: Milian Wolff; +Cc: Aaron Merey, elfutils-devel

Hi Milian,

On Sat, Apr 09, 2022 at 12:23:32AM +0200, Milian Wolff wrote:
> ```
> # first we stat the target cache path
> if (stat(target_cache_path, &st) == 0)
>   {
>     # then we pass _the same st_ to
>     debuginfod_config_cache(cache_miss_path, cache_miss_default_s, &st)
>
>     # which internally will do
>     stat(config_path, st)
>
>     # then we check the time delta
>     time(NULL) - st.st_mtime <= cache_miss
> ```
>
> I.e. when we check the time delta, we only take the timestamp of the
> `config_path` into account - the timestamp of the `target_cache_path`
> is ignored.

O, wow. Yes, reusing the struct st is bad.

> I mount my filesystems with relatime (old advice for SSDs, probably
> not relevant anymore?). I guess that's the issue then?

No, I think this is unrelated.

> Can we change the above code to store the `st.st_mtime` for
> `target_cache_path` and use that for comparison purposes? That fixes
> the issue in my case. If this is acceptable, I'll provide a patch.

Yes please. Thanks so much for finding this.

I think we as developers keep clearing the cache to test stuff. But
that means we recreate the cache_miss file every time, so that gets a
new mtime. And if you are just testing for 10 minutes, everything looks
fine...

Cheers,

Mark
* Re: caching failed lookups of debuginfo?
@ 2022-04-08 22:54 UTC, Aaron Merey
To: Mark Wielaard; +Cc: Milian Wolff, elfutils-devel

Thanks for spotting this, Milian.

On Fri, Apr 8, 2022 at 6:40 PM Mark Wielaard <mark@klomp.org> wrote:
> I think we as developers keep clearing the cache to test stuff. But
> that means we recreate the cache_miss file every time, so that gets a
> new mtime. And if you are just testing for 10 minutes everything looks
> fine...

This explains why I couldn't reproduce it at first: 10 minutes hadn't
elapsed. I tried again some time later and noticed that every
debuginfod-find had started resulting in a query, but Milian got to the
bottom of it first.

Aaron