Date: Tue, 20 Jul 2021 16:42:26 +0200
From: Mark Wielaard
To: Noah Sanci
Cc: elfutils-devel@sourceware.org
Subject: Re: [Bug debuginfod/27983] ignore duplicate urls

Hi Noah,

On Mon, Jul 19, 2021 at 09:31:17AM -0400, Noah Sanci wrote:
> On Wed, Jul 14, 2021 at 12:36 PM Mark Wielaard wrote:
> > You deduplicate the full URLs after they are fully constructed.
> > Would it make sense to do the deduplication on server_url, maybe
> > even as part of the Count number of URLs code? That might make the
> > code simpler. And you can change num_urls upfront.
> Deduplication before fully building the URL would work well, however
> I was concerned about the slashbuildid situation. I would need to
> alter all urls to either have a trailing '/' or no trailing '/' to
> ensure comparison between 'http://127.0.0.1:8000/' and
> 'http://127.0.0.1:8000' is considered equal. This is possible, but I
> ultimately decided to wait until full construction as those issues
> would have been handled. I would be glad to make the change if you
> want.

I see what you mean, we have:

  /* Build handle url. Tolerate both  http://foo:999  and
     http://foo:999/  forms */
  char *slashbuildid;
  if (strlen(server_url) > 1 && server_url[strlen(server_url)-1] == '/')
    slashbuildid = "buildid";
  else
    slashbuildid = "/buildid";

I think the code could become simpler if we did the deduplication of
the server_url list up-front. That also gives you the opportunity to
make sure all server_urls end with a slash. But if you think that is
too much work for this patch we can do it as a followup patch. You
already included a test, which makes checking such a refactoring
easier (sadly the run-debuginfod-find.sh test is somewhat fragile).
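To make that concrete, here is a completely untested sketch of what I
mean. The dedup_server_urls helper and the MAX_URLS bound are made up
for illustration (they aren't in the current code), and cleanup on
allocation failure is elided:

  #include <errno.h>
  #include <stdbool.h>
  #include <stdlib.h>
  #include <string.h>

  #define MAX_URLS 64

  /* Split server_urls on url_delim, normalize every url to end with
     '/', and drop exact duplicates.  Returns the number of entries
     stored in unique_urls[], or a negative errno value.  The caller
     owns (and eventually frees) each entry.  */
  static int
  dedup_server_urls (char *server_urls, const char *url_delim,
                     char **unique_urls)
  {
    int num_urls = 0;
    char *strtok_saveptr;
    char *server_url = strtok_r (server_urls, url_delim, &strtok_saveptr);

    while (server_url != NULL && num_urls < MAX_URLS)
      {
        size_t len = strlen (server_url);
        bool has_slash = len > 0 && server_url[len - 1] == '/';

        /* Store with a trailing slash, so http://127.0.0.1:8000 and
           http://127.0.0.1:8000/ compare equal.  */
        char *url = malloc (len + 2);
        if (url == NULL)
          return -ENOMEM; /* Cleanup of earlier entries elided.  */
        strcpy (url, server_url);
        if (! has_slash)
          strcat (url, "/");

        /* Drop exact duplicates of urls we already collected.  */
        bool dup = false;
        for (int i = 0; i < num_urls; i++)
          if (strcmp (url, unique_urls[i]) == 0)
            {
              dup = true;
              break;
            }

        if (dup)
          free (url);
        else
          unique_urls[num_urls++] = url;

        server_url = strtok_r (NULL, url_delim, &strtok_saveptr);
      }

    return num_urls;
  }

With every stored url known to end in '/', the slashbuildid selection
above can go away, and the snprintf calls can always use
"%sbuildid/...".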
> From be4e07a8732dd688595b99f92ba275ef5060a34d Mon Sep 17 00:00:00 2001
> From: Noah Sanci
> Date: Fri, 9 Jul 2021 14:53:10 -0400
> Subject: [PATCH] debuginfod: PR27983 - ignore duplicate urls
>
> Gazing at server logs, one sees a minority of clients who appear to
> have duplicate query traffic coming in: the same URL, milliseconds
> apart. Chances are the user accidentally doubled her $DEBUGINFOD_URLS
> somehow, and the client library is dutifully asking the servers
> TWICE. Bug #27863 reduces the pain on the servers' CPU, but dupe
> network traffic is still being paid. We should reject sending
> outright duplicate concurrent traffic.
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=27983

> diff --git a/debuginfod/debuginfod-client.c b/debuginfod/debuginfod-client.c
> index f417b40a..a9447cae 100644
> --- a/debuginfod/debuginfod-client.c
> +++ b/debuginfod/debuginfod-client.c
> @@ -795,6 +795,7 @@ debuginfod_query_server (debuginfod_client *c,
>
>    char *strtok_saveptr;
>    char *server_url = strtok_r(server_urls, url_delim, &strtok_saveptr);
> +  int unduplicated_urls = 0;
>
>    /* Initialize each handle.  */
>    for (int i = 0; i < num_urls && server_url != NULL; i++)

Maybe this should be:

  /* Initialize each handle.  */
  int i = 0;
  while (server_url != NULL)

With an i++ at the end, after the data[i] handle has been completely
assigned. See below (*) for why.

> +      char *tmp_url = calloc(PATH_MAX, sizeof(char));

This can fail. So you'll have to handle tmp_url == NULL. Otherwise
snprintf will crash.

>        if (filename) /* must start with / */
> -        snprintf(data[i].url, PATH_MAX, "%s%s/%s/%s%s", server_url,
> +        snprintf(tmp_url, PATH_MAX, "%s%s/%s/%s%s", server_url,
>                   slashbuildid, build_id_bytes, type, filename);
>        else
> -        snprintf(data[i].url, PATH_MAX, "%s%s/%s/%s", server_url,
> +        snprintf(tmp_url, PATH_MAX, "%s%s/%s/%s", server_url,
>                   slashbuildid, build_id_bytes, type);
>
> +      /* PR 27983: If the url is already set to be used use, skip it */
> +      int url_index = -1;
> +      for (url_index = 0; url_index < i; ++url_index)

No need to initialize url_index twice. Just declare int url_index;
it will be assigned 0 on the next line.

> +        {
> +          if(strcmp(tmp_url, data[url_index].url) == 0)
> +            {
> +              url_index = -1;
> +              break;
> +            }
> +        }
> +      if (url_index == -1)
> +        {
> +          if (vfd >= 0)
> +            dprintf(vfd, "duplicate url: %s, skipping\n", tmp_url);
> +          free(tmp_url);
> +          // Ensure that next iteration doesn't skip over an index mid-array
> +          i--;
> +          server_url = strtok_r(NULL, url_delim, &strtok_saveptr);
> +          continue;

(*) This i--; confused me a little. Which is why I suggest turning
that for loop into a while loop.

> +        }
> +      else
> +        {
> +          unduplicated_urls++;
> +          strncpy(data[i].url, tmp_url, PATH_MAX);

Both strings are at most PATH_MAX chars, and tmp_url is zero
terminated, so you can use strcpy.

> +          rc = -ENETUNREACH;
> +          goto out1;
> +        }
> +      data[i].client = c;
>        curl_easy_setopt(data[i].handle, CURLOPT_URL, data[i].url);
>        if (vfd >= 0)
>          curl_easy_setopt(data[i].handle, CURLOPT_ERRORBUFFER, data[i].errbuf);
> @@ -863,6 +889,7 @@ debuginfod_query_server (debuginfod_client *c,
>        curl_multi_add_handle(curlm, data[i].handle);
>        server_url = strtok_r(NULL, url_delim, &strtok_saveptr);

(*) So here an i++ (so you don't need the i-- above).

>      }
> +  num_urls = unduplicated_urls;
>
>    /* Query servers in parallel.  */
>    if (vfd >= 0)

Cheers,

Mark
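P.S. To be concrete about the (*) remarks, this (untested) skeleton
is the loop shape I had in mind. The duplicate flag is just a
stand-in for your url_index check, and the elided parts stay as in
your patch:

  /* Initialize each handle.  */
  int i = 0;
  while (server_url != NULL)
    {
      /* ... build tmp_url and compare it against data[0..i-1].url,
         setting duplicate ... */
      if (duplicate)
        {
          free (tmp_url);
          server_url = strtok_r (NULL, url_delim, &strtok_saveptr);
          continue;  /* i is untouched, so no i-- is needed.  */
        }

      /* ... assign data[i].url, data[i].client and set up the curl
         handle ... */

      server_url = strtok_r (NULL, url_delim, &strtok_saveptr);
      i++;  /* Only advance past a handle actually used.  */
    }
  num_urls = i;  /* i has counted the urls actually used.  */

I think that shape also means the separate unduplicated_urls counter
isn't needed, since i ends up holding the same count.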