From: Noah Sanci <nsanci@redhat.com>
To: elfutils-devel@sourceware.org
Subject: Re: [Bug debuginfod/27983] ignore duplicate urls
Date: Thu, 22 Jul 2021 12:25:04 -0400	[thread overview]
Message-ID: <CAJXA7qg9zUbk0SF40Ct_WuLZ7tFVe5Jr1r_rcDqnDMPirgzeQw@mail.gmail.com> (raw)
In-Reply-To: <YPbg0sq/rFGoNqrJ@wildebeest.org>

[-- Attachment #1: Type: text/plain, Size: 5814 bytes --]

Hello

Please find the updated solution to PR27983 attached.
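
The deduplication now happens up front, while the server URLs are being
counted: each server_url is normalized to its "<url>/buildid" form and
compared against the list built so far, so only unique servers ever get
curl handles. In rough outline (a condensed paraphrase of the attached
patch, with the error handling trimmed):

  while (server_url != NULL)
    {
      /* Tolerate both  http://foo:999  and  http://foo:999/  forms.  */
      const char *slashbuildid =
        (strlen(server_url) > 1 && server_url[strlen(server_url)-1] == '/')
        ? "buildid" : "/buildid";
      char *tmp_url = calloc(PATH_MAX, sizeof(char));
      snprintf(tmp_url, PATH_MAX, "%s%s", server_url, slashbuildid);

      /* Skip this url if it is already in the list.  */
      int url_index;
      for (url_index = 0; url_index < num_urls; ++url_index)
        if (strcmp(tmp_url, server_url_list[url_index]) == 0)
          break;

      if (url_index < num_urls)
        free(tmp_url);                     /* duplicate, drop it */
      else
        {
          server_url_list = reallocarray(server_url_list, ++num_urls,
                                         sizeof(char *));
          server_url_list[num_urls-1] = tmp_url;
        }
      server_url = strtok_r(NULL, url_delim, &strtok_saveptr);
    }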

Noah

On Tue, Jul 20, 2021 at 10:42 AM Mark Wielaard <mark@klomp.org> wrote:
>
> Hi Noah,
>
> On Mon, Jul 19, 2021 at 09:31:17AM -0400, Noah Sanci wrote:
> > On Wed, Jul 14, 2021 at 12:36 PM Mark Wielaard <mark@klomp.org> wrote:
> > > You deduplicate the full URLs after they are fully constructed.  Would
> > > it make sense to do the deduplication on server_url, maybe even as
> > > part of the Count number of URLs code? That might make the code
> > > simpler. And you can change num_urls upfront.
> > Deduplication before fully building the URL would work well; however, I
> > was concerned about the slashbuildid situation. I would need to alter
> > all urls to either have a trailing '/' or no trailing '/' to ensure
> > comparison between 'http://127.0.0.1:8000/' and 'http://127.0.0.1:8000'
> > is considered equal. This is possible, but I ultimately decided to
> > wait until full construction as those issues would have been handled.
> > I would be glad to make the change if you want.
>
> I see what you mean, we have:
>
>       /* Build handle url. Tolerate both  http://foo:999  and
>          http://foo:999/  forms */
>       char *slashbuildid;
>       if (strlen(server_url) > 1 && server_url[strlen(server_url)-1] == '/')
>         slashbuildid = "buildid";
>       else
>         slashbuildid = "/buildid";
>
> I think the code could become simpler if we did the deduplication of
> the server_url list up-front. That also gives you the opportunity to
> make sure all server_urls end with a slash. But if you think that is
> too much work for this patch we can do it as a followup patch. You
> already included a test, which makes checking such a refactoring
> easier (sadly the run-debuginfod-find.sh test is somewhat fragile).
>
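
Right: with the deduplication done while counting, the trailing-slash
question goes away too, because both spellings of a server are
flattened to the same "<url>/buildid" prefix before the comparison.
With the patch's tmp_url buffer, for example:

  snprintf(tmp_url, PATH_MAX, "%s%s", "http://127.0.0.1:8000",  "/buildid");
  snprintf(tmp_url, PATH_MAX, "%s%s", "http://127.0.0.1:8000/", "buildid");

both produce "http://127.0.0.1:8000/buildid", so the strcmp in the
dedup loop sees identical strings.
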
> > From be4e07a8732dd688595b99f92ba275ef5060a34d Mon Sep 17 00:00:00 2001
> > From: Noah Sanci <nsanci@redhat.com>
> > Date: Fri, 9 Jul 2021 14:53:10 -0400
> > Subject: [PATCH] debuginfod: PR27983 - ignore duplicate urls
> >
> > Gazing at server logs, one sees a minority of clients who appear to have
> > duplicate query traffic coming in: the same URL, milliseconds apart.
> > Chances are the user accidentally doubled her $DEBUGINFOD_URLS somehow,
> > and the client library is dutifully asking the servers TWICE. Bug #27863
> > reduces the pain on the servers' CPU, but dupe network traffic is still
> > being paid.  We should reject sending outright duplicate concurrent
> > traffic.
> >
> > https://sourceware.org/bugzilla/show_bug.cgi?id=27983
>
>
> > diff --git a/debuginfod/debuginfod-client.c b/debuginfod/debuginfod-client.c
> > index f417b40a..a9447cae 100644
> > --- a/debuginfod/debuginfod-client.c
> > +++ b/debuginfod/debuginfod-client.c
> > @@ -795,6 +795,7 @@ debuginfod_query_server (debuginfod_client *c,
> >
> >    char *strtok_saveptr;
> >    char *server_url = strtok_r(server_urls, url_delim, &strtok_saveptr);
> > +  int unduplicated_urls = 0;
> >
> >    /* Initialize each handle.  */
> >    for (int i = 0; i < num_urls && server_url != NULL; i++)
>
> Maybe this should be:
>
> /* Initialize each handle.  */
>    int i = 0;
>    while (server_url != NULL)
>
> With an i++ at the end after the data[i] handle has been completely assigned.
> See below (*) for why.
>
> > +      char *tmp_url = calloc(PATH_MAX, sizeof(char));
>
> This can fail. So you'll have to handle tmp_url == NULL. Otherwise
> snprintf will crash.
>
> >        if (filename) /* must start with / */
> > -        snprintf(data[i].url, PATH_MAX, "%s%s/%s/%s%s", server_url,
> > +        snprintf(tmp_url, PATH_MAX, "%s%s/%s/%s%s", server_url,
> >                   slashbuildid, build_id_bytes, type, filename);
> >        else
> > -        snprintf(data[i].url, PATH_MAX, "%s%s/%s/%s", server_url,
> > +        snprintf(tmp_url, PATH_MAX, "%s%s/%s/%s", server_url,
> >                   slashbuildid, build_id_bytes, type);
> >
> > +      /* PR 27983: If the url is already set to be used, skip it */
> > +      int url_index = -1;
> > +      for (url_index = 0; url_index < i; ++url_index)
>
> No need to initialize url_index twice. Just declare int url_index; it
> will be assigned 0 at the next line.
>
> > +        {
> > +          if(strcmp(tmp_url, data[url_index].url) == 0)
> > +            {
> > +              url_index = -1;
> > +              break;
> > +            }
> > +        }
> > +      if (url_index == -1)
> > +        {
> > +          if (vfd >= 0)
> > +            dprintf(vfd, "duplicate url: %s, skipping\n", tmp_url);
> > +          free(tmp_url);
> > +          // Ensure that next iteration doesn't skip over an index mid-array
> > +          i--;
> > +          server_url = strtok_r(NULL, url_delim, &strtok_saveptr);
> > +          continue;
>
> (*) this i--; confused me a little. Which is why I suggest turning
> that for loop into a while loop.
>
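
With the list deduplicated up front the i-- is gone entirely: the
strtok_r walk moved into the counting code as a while loop, and the
handle-initialization loop now just walks the deduplicated list, along
the lines of:

  for (int i = 0; i < num_urls; i++)
    {
      if ((server_url = server_url_list[i]) == NULL)
        break;
      /* ... set up data[i] and its curl handle for server_url ... */
    }
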
> > +        }
> > +      else
> > +        {
> > +          unduplicated_urls++;
> > +          strncpy(data[i].url, tmp_url, PATH_MAX);
>
> Both strings are max PATH_MAX chars, and tmp_url is zero terminated,
> so you can use strcpy.
>
> > +          rc = -ENETUNREACH;
> > +           goto out1;
> > +        }
> > +      data[i].client = c;
> >        curl_easy_setopt(data[i].handle, CURLOPT_URL, data[i].url);
> >        if (vfd >= 0)
> >       curl_easy_setopt(data[i].handle, CURLOPT_ERRORBUFFER, data[i].errbuf);
> > @@ -863,6 +889,7 @@ debuginfod_query_server (debuginfod_client *c,
> >        curl_multi_add_handle(curlm, data[i].handle);
> >        server_url = strtok_r(NULL, url_delim, &strtok_saveptr);
>
> (*) So here a i++ (so you don't need the i-- above).
>
> >      }
> > +  num_urls = unduplicated_urls;
> >
> >    /* Query servers in parallel.  */
> >    if (vfd >= 0)
>
> Cheers,
>
> Mark
>

[-- Attachment #2: 0001-debuginfod-PR27983-ignore-duplicate-urls.patch --]
[-- Type: text/x-patch, Size: 8906 bytes --]

From 84118c56954deef811e94d3a188ea9894430f92d Mon Sep 17 00:00:00 2001
From: Noah Sanci <nsanci@redhat.com>
Date: Fri, 9 Jul 2021 14:53:10 -0400
Subject: [PATCH] debuginfod: PR27983 - ignore duplicate urls

Gazing at server logs, one sees a minority of clients who appear to have
duplicate query traffic coming in: the same URL, milliseconds apart.
Chances are the user accidentally doubled her $DEBUGINFOD_URLS somehow,
and the client library is dutifully asking the servers TWICE. Bug #27863
reduces the pain on the servers' CPU, but dupe network traffic is still
being paid.  We should reject sending outright duplicate concurrent
traffic.
Duplicate urls are now dropped while the list of servers is counted,
after each url has been normalized to its '<url>/buildid' form.

https://sourceware.org/bugzilla/show_bug.cgi?id=27983

Signed-off-by: Noah Sanci <nsanci@redhat.com>
---
 debuginfod/ChangeLog           |  8 +++
 debuginfod/debuginfod-client.c | 97 ++++++++++++++++++++++++----------
 tests/ChangeLog                |  5 ++
 tests/run-debuginfod-find.sh   | 13 +++++
 4 files changed, 95 insertions(+), 28 deletions(-)

diff --git a/debuginfod/ChangeLog b/debuginfod/ChangeLog
index d9d11737..23544d3f 100644
--- a/debuginfod/ChangeLog
+++ b/debuginfod/ChangeLog
@@ -1,3 +1,11 @@
+2021-07-09  Noah Sanci  <nsanci@redhat.com>
+
+	* debuginfod-client.c (debuginfod_query_server): Ignore duplicate
+	urls, now that full-length urls are generated in a standardized
+	format. Renamed the existing out1 error goto to out2 and added a
+	new out1. Updated url creation print statements.
+	(globals): Removed url_delim_char, as it was no longer used.
+
 2021-06-18  Mark Wielaard  <mark@klomp.org>
 
 	* debuginfod-client.c (debuginfod_begin): Don't use client if
diff --git a/debuginfod/debuginfod-client.c b/debuginfod/debuginfod-client.c
index f417b40a..1a687831 100644
--- a/debuginfod/debuginfod-client.c
+++ b/debuginfod/debuginfod-client.c
@@ -152,7 +152,6 @@ static const char *cache_xdg_name = "debuginfod_client";
 
 /* URLs of debuginfods, separated by url_delim. */
 static const char *url_delim =  " ";
-static const char url_delim_char = ' ';
 
 /* Timeout for debuginfods, in seconds (to get at least 100K). */
 static const long default_timeout = 90;
@@ -763,13 +762,61 @@ debuginfod_query_server (debuginfod_client *c,
       goto out0;
     }
 
+
+  /* Prepare to split and deduplicate the server url list.  */
+  char *strtok_saveptr;
+  char **server_url_list = NULL;
+  char *server_url = strtok_r(server_urls, url_delim, &strtok_saveptr);
   /* Count number of URLs.  */
   int num_urls = 0;
-  for (int i = 0; server_urls[i] != '\0'; i++)
-    if (server_urls[i] != url_delim_char
-        && (i == 0 || server_urls[i - 1] == url_delim_char))
-      num_urls++;
-  
+
+  while (server_url != NULL)
+    {
+      /* PR 27983: If the url is already set to be used, skip it */
+      char *slashbuildid;
+      if (strlen(server_url) > 1 && server_url[strlen(server_url)-1] == '/')
+        slashbuildid = "buildid";
+      else
+        slashbuildid = "/buildid";
+
+      char *tmp_url = calloc(PATH_MAX, sizeof(char));
+      snprintf(tmp_url, PATH_MAX, "%s%s", server_url,
+               slashbuildid);
+      int url_index;
+      for (url_index = 0; url_index < num_urls; ++url_index)
+        {
+          if(strcmp(tmp_url, server_url_list[url_index]) == 0)
+            {
+              url_index = -1;
+              break;
+            }
+        }
+      if (url_index == -1)
+        {
+          if (vfd >= 0)
+            dprintf(vfd, "duplicate url: %s, skipping\n", tmp_url);
+          free(tmp_url);
+        }
+      else
+        {
+          num_urls++;
+          server_url_list = reallocarray(server_url_list, num_urls,
+                                         sizeof(char*));
+          if (server_url_list == NULL)
+            {
+              rc = -ENOMEM;
+              goto out1;
+            }
+          server_url_list[num_urls-1] = strdup(tmp_url);
+          if (server_url_list[num_urls-1] == NULL)
+            {
+              rc = -ENOMEM;
+              goto out1;
+            }
+          free (tmp_url);
+        }
+      server_url = strtok_r(NULL, url_delim, &strtok_saveptr);
+    }
   CURLM *curlm = c->server_mhandle;
   assert (curlm != NULL);
   
@@ -793,12 +840,11 @@ debuginfod_query_server (debuginfod_client *c,
       data[i].errbuf[0] = '\0';
     }
 
-  char *strtok_saveptr;
-  char *server_url = strtok_r(server_urls, url_delim, &strtok_saveptr);
-
   /* Initialize each handle.  */
-  for (int i = 0; i < num_urls && server_url != NULL; i++)
+  for (int i = 0; i < num_urls; i++)
     {
+      if ((server_url = server_url_list[i]) == NULL)
+        break;
       if (vfd >= 0)
 	dprintf (vfd, "init server %d %s\n", i, server_url);
 
@@ -808,25 +854,16 @@ debuginfod_query_server (debuginfod_client *c,
       if (data[i].handle == NULL)
         {
           rc = -ENETUNREACH;
-          goto out1;
+          goto out2;
         }
       data[i].client = c;
 
       /* Build handle url. Tolerate both  http://foo:999  and
          http://foo:999/  forms */
-      char *slashbuildid;
-      if (strlen(server_url) > 1 && server_url[strlen(server_url)-1] == '/')
-        slashbuildid = "buildid";
-      else
-        slashbuildid = "/buildid";
-
-      if (filename) /* must start with / */
-        snprintf(data[i].url, PATH_MAX, "%s%s/%s/%s%s", server_url,
-                 slashbuildid, build_id_bytes, type, filename);
+      if (filename)
+        snprintf(data[i].url, PATH_MAX, "%s/%s/%s%s", server_url, build_id_bytes, type, filename);
       else
-        snprintf(data[i].url, PATH_MAX, "%s%s/%s/%s", server_url,
-                 slashbuildid, build_id_bytes, type);
-
+        snprintf(data[i].url, PATH_MAX, "%s/%s/%s", server_url, build_id_bytes, type);
       if (vfd >= 0)
 	dprintf (vfd, "url %d %s\n", i, data[i].url);
 
@@ -861,7 +898,6 @@ debuginfod_query_server (debuginfod_client *c,
       curl_easy_setopt(data[i].handle, CURLOPT_HTTPHEADER, c->headers);
 
       curl_multi_add_handle(curlm, data[i].handle);
-      server_url = strtok_r(NULL, url_delim, &strtok_saveptr);
     }
 
   /* Query servers in parallel.  */
@@ -905,7 +941,7 @@ debuginfod_query_server (debuginfod_client *c,
             case CURLM_OUT_OF_MEMORY: rc = -ENOMEM; break;
             default: rc = -ENETUNREACH; break;
             }
-          goto out1;
+          goto out2;
         }
 
       if (c->progressfn) /* inform/check progress callback */
@@ -1075,7 +1111,7 @@ debuginfod_query_server (debuginfod_client *c,
     }
 
   if (verified_handle == NULL)
-    goto out1;
+    goto out2;
 
   if (vfd >= 0)
     {
@@ -1104,7 +1140,7 @@ debuginfod_query_server (debuginfod_client *c,
   if (rc < 0)
     {
       rc = -errno;
-      goto out1;
+      goto out2;
       /* Perhaps we need not give up right away; could retry or something ... */
     }
 
@@ -1127,7 +1163,7 @@ debuginfod_query_server (debuginfod_client *c,
   goto out;
 
 /* error exits */
- out1:
+ out2:
   /* remove all handles from multi */
   for (int i = 0; i < num_urls; i++)
     {
@@ -1140,6 +1176,11 @@ debuginfod_query_server (debuginfod_client *c,
   (void) rmdir (target_cache_dir); /* nop if not empty */
   free(data);
 
+ out1:
+  for (int i = 0; i < num_urls; ++i)
+    free(server_url_list[i]);
+  free(server_url_list);
+
  out0:
   free (server_urls);
 
diff --git a/tests/ChangeLog b/tests/ChangeLog
index b65cbeb7..5747d658 100644
--- a/tests/ChangeLog
+++ b/tests/ChangeLog
@@ -1,3 +1,8 @@
+2021-07-09  Noah Sanci  <nsanci@redhat.com>
+
+	* run-debuginfod-find.sh: Added a test to ensure duplicate urls
+	are not queried.
+
 2021-07-02  Mark Wielaard  <mark@klomp.org>
 
        * run-debuginfo-find.sh: unset VALGRIND_CMD before testing debuginfod
diff --git a/tests/run-debuginfod-find.sh b/tests/run-debuginfod-find.sh
index 74a5ceff..6faaf70b 100755
--- a/tests/run-debuginfod-find.sh
+++ b/tests/run-debuginfod-find.sh
@@ -359,6 +359,19 @@ rm -rf extracted
 
 wait_ready $PORT1 'found_sourcerefs_total{source=".rpm archive"}' $sourcefiles
 
+########################################################################
+# PR27983 ensure no duplicate urls are used when querying servers for files
+TMP_DEBUG_URLS=$DEBUGINFOD_URLS
+rm -rf $DEBUGINFOD_CACHE_PATH # clean it from previous tests
+export DEBUGINFOD_URLS="http://127.0.0.1:$PORT1 http://127.0.0.1:$PORT1 http://127.0.0.1:$PORT1 http:127.0.0.1:7999"
+env LD_LIBRARY_PATH=$ldpath ${abs_top_builddir}/debuginfod/debuginfod-find -v executable $BUILDID2 > vlog4 2>&1 || true
+tempfiles vlog4
+if [ $( grep -c 'duplicate url: http://127.0.0.1:'$PORT1'.*' vlog4 ) -ne 2 ]; then
+  echo "Duplicated servers undetected";
+  err
+fi
+export DEBUGINFOD_URLS=$TMP_DEBUG_URLS
+########################################################################
 # Run a bank of queries against the debuginfod-rpms / debuginfod-debs test cases
 
 archive_test() {
-- 
2.31.1



Thread overview: 11+ messages
2021-07-09 19:12 Noah Sanci
2021-07-14 16:36 ` Mark Wielaard
2021-07-19 13:31   ` Noah Sanci
2021-07-20 14:42     ` Mark Wielaard
2021-07-22 16:25       ` Noah Sanci [this message]
2021-07-22 19:22         ` Noah Sanci
2021-07-23 18:34           ` Mark Wielaard
2021-07-26 16:29             ` Noah Sanci
2021-07-28 14:55               ` Mark Wielaard
2021-07-28 16:23                 ` Noah Sanci
2021-07-29 12:20                   ` Mark Wielaard
