From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <elfutils-devel-return-8012-listarch-elfutils-devel=sourceware.org@sourceware.org>
Received: (qmail 91147 invoked by alias); 6 Dec 2019 23:32:35 -0000
Mailing-List: contact elfutils-devel-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <elfutils-devel.sourceware.org>
List-Post: <mailto:elfutils-devel@sourceware.org>
List-Help: <mailto:elfutils-devel-help@sourceware.org>
List-Subscribe: <mailto:elfutils-devel-subscribe@sourceware.org>
Sender: elfutils-devel-owner@sourceware.org
Received: (qmail 91128 invoked by uid 89); 6 Dec 2019 23:32:34 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Checked: by ClamAV 0.100.3 on sourceware.org
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-19.1 required=5.0 tests=AWL,BAYES_00,GIT_PATCH_0,GIT_PATCH_1,GIT_PATCH_2,GIT_PATCH_3,SPF_PASS autolearn=ham version=3.3.1 spammy=0.5, ratio, knew, extractor
X-Spam-Status: No, score=-19.1 required=5.0 tests=AWL,BAYES_00,GIT_PATCH_0,GIT_PATCH_1,GIT_PATCH_2,GIT_PATCH_3,SPF_PASS autolearn=ham version=3.3.1
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on sourceware.org
X-Spam-Level:
X-HELO: gnu.wildebeest.org
Received: from wildebeest.demon.nl (HELO gnu.wildebeest.org) (212.238.236.112) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Fri, 06 Dec 2019 23:32:29 +0000
Received: from tarox.wildebeest.org (tarox.wildebeest.org [172.31.17.39])	(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))	(No client certificate requested)	by gnu.wildebeest.org (Postfix) with ESMTPSA id 0422D300073F;	Sat,  7 Dec 2019 00:32:26 +0100 (CET)
Received: by tarox.wildebeest.org (Postfix, from userid 1000)	id E73AB401ED86; Sat,  7 Dec 2019 00:32:25 +0100 (CET)
Message-ID: <bc26c5729b4a96ddda6cdd4a3a7cbd76b8c7aa32.camel@klomp.org>
Subject: Re: RFCv2: debuginfod debian archive support
From: Mark Wielaard <mark@klomp.org>
To: "Frank Ch. Eigler" <fche@redhat.com>, elfutils-devel@sourceware.org
Date: Fri, 06 Dec 2019 23:32:00 -0000
In-Reply-To: <20191202225401.GF4136@redhat.com>
References: <20191202225401.GF4136@redhat.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Mailer: Evolution 3.28.5 (3.28.5-5.el7)
Mime-Version: 1.0
X-Spam-Flag: NO
X-IsSubscribed: yes
X-SW-Source: 2019-q4/txt/msg00242.txt.bz2

Hi Frank,

On Mon, 2019-12-02 at 17:54 -0500, Frank Ch. Eigler wrote:
> If anyone knows of a distro or ISV who is using standard .zip or .tar
> or somesuch archive formats as their distribution mechanism, it would
> be a tiny effort more to add another option, say "-A" (any archive),
> because libarchive can recognize and unpack a large variety of
> formats.  WDYT?

I looked at some distros, but the only ones that provide consistent
debug[info] packages do so in rpm or deb format.

> Author: Frank Ch. Eigler <fche@redhat.com>
> Date:   Sat Nov 30 10:46:44 2019 -0500
>=20
>     debuginfod server: support .deb/.ddeb archives
>=20=20=20=20=20
>     Add support for scanning .deb / .ddeb files, enabled with a new
>     command line option "-U".

Too bad -D was already taken. sniff.

What is the difference between a .deb and a .ddeb file/archive?

> +2019-12-02  Frank Ch. Eigler  <fche@redhat.com>
> +
> +	* debuginfod.cxx (*_rpm_*): Rename to *_archive_* throughout.
> +	(scan_archives): New read-mostly global to identify archive
> +	file extensions and corresponding extractor commands.
> +	(parse_opt): Handle new -U flag.
> +
>  2019-11-26  Mark Wielaard  <mark@klomp.org>
>=20=20
>  	* Makefile.am (BUILD_STATIC): Add needed libraries for libdw and
> diff --git a/debuginfod/debuginfod.cxx b/debuginfod/debuginfod.cxx
> index aa7ffcf..e96a4dd 100644
> --- a/debuginfod/debuginfod.cxx
> +++ b/debuginfod/debuginfod.cxx
> @@ -106,6 +106,15 @@ using namespace std;
>  #endif
>=20=20
>=20=20
> +inline bool
> +string_endswith(const string& haystack, const string& needle)
> +{
> +  return (haystack.size() >=3D needle.size() &&
> +	  equal(haystack.end()-needle.size(), haystack.end(),
> +                needle.begin()));
> +}
> +
> +
>  // Roll this identifier for every sqlite schema incompatiblity.
>  #define BUILDIDS "buildids9"
>=20=20
> @@ -231,9 +240,9 @@ static const char DEBUGINFOD_SQLITE_DDL[] =3D
>    "create view if not exists " BUILDIDS "_stats as\n"
>    "          select 'file d/e' as label,count(*) as quantity from " BUIL=
DIDS "_f_de\n"
>    "union all select 'file s',count(*) from " BUILDIDS "_f_s\n"
> -  "union all select 'rpm d/e',count(*) from " BUILDIDS "_r_de\n"
> -  "union all select 'rpm sref',count(*) from " BUILDIDS "_r_sref\n"
> -  "union all select 'rpm sdef',count(*) from " BUILDIDS "_r_sdef\n"
> +  "union all select 'archive d/e',count(*) from " BUILDIDS "_r_de\n"
> +  "union all select 'archive sref',count(*) from " BUILDIDS "_r_sref\n"
> +  "union all select 'archive sdef',count(*) from " BUILDIDS "_r_sdef\n"
>    "union all select 'buildids',count(*) from " BUILDIDS "_buildids\n"
>    "union all select 'filenames',count(*) from " BUILDIDS "_files\n"
>    "union all select 'files scanned (#)',count(*) from " BUILDIDS "_file_=
mtime_scanned\n"
> @@ -324,6 +333,7 @@ static const struct argp_option options[] =3D
>     { NULL, 0, NULL, 0, "Scanners:", 1 },
>     { "scan-file-dir", 'F', NULL, 0, "Enable ELF/DWARF file scanning thre=
ads.", 0 },
>     { "scan-rpm-dir", 'R', NULL, 0, "Enable RPM scanning threads.", 0 },
> +   { "scan-deb-dir", 'U', NULL, 0, "Enable DEB scanning threads.", 0 },=
=20=20=20
>     // "source-oci-imageregistry"  ...=20
>=20=20
>     { NULL, 0, NULL, 0, "Options:", 2 },
> @@ -371,7 +381,7 @@ static unsigned maxigroom =3D false;
>  static unsigned concurrency =3D std::thread::hardware_concurrency() ?: 1;
>  static set<string> source_paths;
>  static bool scan_files =3D false;
> -static bool scan_rpms =3D false;
> +static map<string,string> scan_archives;
>  static vector<string> extra_ddl;
>  static regex_t file_include_regex;
>  static regex_t file_exclude_regex;
> @@ -402,7 +412,13 @@ parse_opt (int key, char *arg,
>        if (http_port > 65535) argp_failure(state, 1, EINVAL, "port number=
");
>        break;
>      case 'F': scan_files =3D true; break;
> -    case 'R': scan_rpms =3D true; break;
> +    case 'R':
> +      scan_archives[".rpm"]=3D"rpm2cpio";
> +      break;
> +    case 'U':
> +      scan_archives[".deb"]=3D"dpkg-deb --fsys-tarfile";
> +      scan_archives[".ddeb"]=3D"dpkg-deb --fsys-tarfile";
> +      break;
>      case 'L':
>        traverse_logical =3D true;
>        break;
> @@ -851,7 +867,11 @@ handle_buildid_r_match (int64_t b_mtime,
>        return 0;
>      }
>=20=20
> -  string popen_cmd =3D string("rpm2cpio " + shell_escape(b_source0));
> +  string archive_decoder =3D "/dev/null";
> +  for (auto&& arch : scan_archives)
> +    if (string_endswith(b_source0, arch.first))
> +      archive_decoder =3D arch.second;
> +  string popen_cmd =3D archive_decoder + " " + shell_escape(b_source0);
>    FILE* fp =3D popen (popen_cmd.c_str(), "r"); // "e" O_CLOEXEC?
>    if (fp =3D=3D NULL)
>      throw libc_exception (errno, string("popen ") + popen_cmd);

This seems a lot of work for dealing with non-archives. If the file
doesn't end with any known extension do we really have to try to create
a pipe, fork, execute a shell and then see that /dev/null couldn't be
executed just to throw an exception?

> @@ -863,9 +883,9 @@ handle_buildid_r_match (int64_t b_mtime,
>      throw archive_exception("cannot create archive reader");
>    defer_dtor<struct archive*,int> archive_closer (a, archive_read_free);
>=20=20
> -  rc =3D archive_read_support_format_cpio(a);
> +  rc =3D archive_read_support_format_all(a);
>    if (rc !=3D ARCHIVE_OK)
> -    throw archive_exception(a, "cannot select cpio format");
> +    throw archive_exception(a, "cannot select all format");
>    rc =3D archive_read_support_filter_all(a);
>    if (rc !=3D ARCHIVE_OK)
>      throw archive_exception(a, "cannot select all filters");
> @@ -902,7 +922,7 @@ handle_buildid_r_match (int64_t b_mtime,
>            throw archive_exception(a, "cannot extract file");
>          }
>=20=20
> -      inc_metric ("http_responses_total","result","rpm");
> +      inc_metric ("http_responses_total","result","archive");
>        struct MHD_Response* r =3D MHD_create_response_from_fd (archive_en=
try_size(e), fd);
>        if (r =3D=3D 0)
>          {
> @@ -916,7 +936,7 @@ handle_buildid_r_match (int64_t b_mtime,
>            MHD_add_response_header (r, "Content-Type", "application/octet=
-stream");
>            add_mhd_last_modified (r, archive_entry_mtime(e));
>            if (verbose > 1)
> -            obatched(clog) << "serving rpm " << b_source0 << " file " <<=
 b_source1 << endl;
> +            obatched(clog) << "serving archive " << b_source0 << " file =
" << b_source1 << endl;
>            /* libmicrohttpd will close it. */
>            if (result_fd)
>              *result_fd =3D fd;
> @@ -1858,16 +1878,20 @@ thread_main_scan_source_file_path (void* arg)
>=20=20
>=20=20
>=20=20
> -// Analyze given *.rpm file of given age; record buildids / exec/debugin=
fo-ness of its
> +// Analyze given archive file of given age; record buildids / exec/debug=
info-ness of its
>  // constituent files with given upsert statements.
>  static void
> -rpm_classify (const string& rps, sqlite_ps& ps_upsert_buildids, sqlite_p=
s& ps_upsert_files,
> +archive_classify (const string& rps, sqlite_ps& ps_upsert_buildids, sqli=
te_ps& ps_upsert_files,
>                sqlite_ps& ps_upsert_de, sqlite_ps& ps_upsert_sref, sqlite=
_ps& ps_upsert_sdef,
>                time_t mtime,
>                unsigned& fts_executable, unsigned& fts_debuginfo, unsigne=
d& fts_sref, unsigned& fts_sdef,
>                bool& fts_sref_complete_p)
>  {
> -  string popen_cmd =3D string("rpm2cpio " + shell_escape(rps));
> +  string archive_decoder =3D "/dev/null";
> +  for (auto&& arch : scan_archives)
> +    if (string_endswith(rps, arch.first))
> +      archive_decoder =3D arch.second;
> +  string popen_cmd =3D archive_decoder + " " + shell_escape(rps);
>    FILE* fp =3D popen (popen_cmd.c_str(), "r"); // "e" O_CLOEXEC?
>    if (fp =3D=3D NULL)
>      throw libc_exception (errno, string("popen ") + popen_cmd);

Likewise as above. Can we skip the whole popen dance if nothing
matches?

> @@ -1879,9 +1903,9 @@ rpm_classify (const string& rps, sqlite_ps& ps_upse=
rt_buildids, sqlite_ps& ps_up
>      throw archive_exception("cannot create archive reader");
>    defer_dtor<struct archive*,int> archive_closer (a, archive_read_free);
>=20=20
> -  int rc =3D archive_read_support_format_cpio(a);
> +  int rc =3D archive_read_support_format_all(a);
>    if (rc !=3D ARCHIVE_OK)
> -    throw archive_exception(a, "cannot select cpio format");
> +    throw archive_exception(a, "cannot select all formats");
>=20
>    rc =3D archive_read_support_filter_all(a);
>    if (rc !=3D ARCHIVE_OK)
>      throw archive_exception(a, "cannot select all filters");

In theory you could know the format you are expecting, just like you
know the decoder above. Might that give slight more accurate error
messages and maybe be slightly more efficient since libarchive doesn't
need to try to see if it correctly detect the format itself.

> @@ -2027,11 +2051,11 @@ rpm_classify (const string& rps, sqlite_ps& ps_up=
sert_buildids, sqlite_ps& ps_up
>=20=20
>=20=20
>=20=20
> -// scan for *.rpm files
> +// scan for archive files such as .rpm
>  static void
> -scan_source_rpm_path (const string& dir)
> +scan_source_archive_path (const string& dir)
>  {
> -  obatched(clog) << "fts/rpm traversing " << dir << endl;
> +  obatched(clog) << "fts/archive traversing " << dir << endl;
>=20=20
>    sqlite_ps ps_upsert_buildids (db, "rpm-buildid-intern", "insert or ign=
ore into " BUILDIDS "_buildids VALUES (NULL, ?);");
>    sqlite_ps ps_upsert_files (db, "rpm-file-intern", "insert or ignore in=
to " BUILDIDS "_files VALUES (NULL, ?);");
> @@ -2060,7 +2084,7 @@ scan_source_rpm_path (const string& dir)
>    struct timeval tv_start, tv_end;
>    gettimeofday (&tv_start, NULL);
>    unsigned fts_scanned=3D0, fts_regex=3D0, fts_cached=3D0, fts_debuginfo=
=3D0;
> -  unsigned fts_executable=3D0, fts_rpm =3D 0, fts_sref=3D0, fts_sdef=3D0;
> +  unsigned fts_executable=3D0, fts_archive =3D 0, fts_sref=3D0, fts_sdef=
=3D0;
>=20=20
>    FTS *fts =3D fts_open (dirs,
>                         (traverse_logical ? FTS_LOGICAL : FTS_PHYSICAL|FT=
S_XDEV)
> @@ -2082,7 +2106,7 @@ scan_source_rpm_path (const string& dir)
>          break;
>=20=20
>        if (verbose > 2)
> -        obatched(clog) << "fts/rpm traversing " << f->fts_path << endl;
> +        obatched(clog) << "fts/archive traversing " << f->fts_path << en=
dl;
>=20=20
>        try
>          {
> @@ -2101,7 +2125,7 @@ scan_source_rpm_path (const string& dir)
>            if (!ri || rx)
>              {
>                if (verbose > 3)
> -                obatched(clog) << "fts/rpm skipped by regex " << (!ri ? =
"I" : "") << (rx ? "X" : "") << endl;
> +                obatched(clog) << "fts/archive skipped by regex " << (!r=
i ? "I" : "") << (rx ? "X" : "") << endl;
>                fts_regex ++;
>                continue;
>              }
> @@ -2116,13 +2140,13 @@ scan_source_rpm_path (const string& dir)
>=20=20
>              case FTS_F:
>                {
> -                // heuristic: reject if file name does not end with ".rp=
m"
> -                // (alternative: try opening with librpm etc., caching)
> -                string suffix =3D ".rpm";
> -                if (rps.size() < suffix.size() ||
> -                    rps.substr(rps.size()-suffix.size()) !=3D suffix)
> +		bool any =3D false;
> +                for (auto&& arch : scan_archives)
> +                  if (string_endswith(rps, arch.first))
> +		    any =3D true;
> +		if (! any)
>                    continue;
> -                fts_rpm ++;
> +                fts_archive ++;
>=20=20
>                  /* See if we know of it already. */
>                  int rc =3D ps_query
> @@ -2133,7 +2157,7 @@ scan_source_rpm_path (const string& dir)
>                  ps_query.reset();
>                  if (rc =3D=3D SQLITE_ROW) // i.e., a result, as opposed =
to DONE (no results)
>                    // no need to recheck a file/version we already know
> -                  // specifically, no need to parse this rpm again, sinc=
e we already have
> +                  // specifically, no need to parse this archive again, =
since we already have
>                    // it as a D or E or S record,
>                    // (so is stored with buildid=3DNULL)
>                    {
> @@ -2141,29 +2165,29 @@ scan_source_rpm_path (const string& dir)
>                      continue;
>                    }
>=20=20
> -                // intern the rpm file name
> +                // intern the archive file name
>                  ps_upsert_files
>                    .reset()
>                    .bind(1, rps)
>                    .step_ok_done();
>=20=20
> -                // extract the rpm contents via popen("rpm2cpio") | liba=
rchive | loop-of-elf_classify()
> +                // extract the archive contents
>                  unsigned my_fts_executable =3D 0, my_fts_debuginfo =3D 0=
, my_fts_sref =3D 0, my_fts_sdef =3D 0;
>                  bool my_fts_sref_complete_p =3D true;
>                  try
>                    {
> -                    rpm_classify (rps,
> +                    archive_classify (rps,
>                                    ps_upsert_buildids, ps_upsert_files,
>                                    ps_upsert_de, ps_upsert_sref, ps_upser=
t_sdef, // dalt
>                                    f->fts_statp->st_mtime,
>                                    my_fts_executable, my_fts_debuginfo, m=
y_fts_sref, my_fts_sdef,
>                                    my_fts_sref_complete_p);
> -                    inc_metric ("scanned_total","source","rpm");
> -                    add_metric("found_debuginfo_total","source","rpm",
> +                    inc_metric ("scanned_total","source","archive");
> +                    add_metric("found_debuginfo_total","source","archive=
",
>                                 my_fts_debuginfo);
> -                    add_metric("found_executable_total","source","rpm",
> +                    add_metric("found_executable_total","source","archiv=
e",
>                                 my_fts_executable);
> -                    add_metric("found_sourcerefs_total","source","rpm",
> +                    add_metric("found_sourcerefs_total","source","archiv=
e",
>                                 my_fts_sref);
>                    }
>                  catch (const reportable_exception& e)
> @@ -2172,7 +2196,7 @@ scan_source_rpm_path (const string& dir)
>                    }
>=20=20
>                  if (verbose > 2)
> -                  obatched(clog) << "scanned rpm=3D" << rps
> +                  obatched(clog) << "scanned archive=3D" << rps
>                                   << " mtime=3D" << f->fts_statp->st_mtime
>                                   << " executables=3D" << my_fts_executab=
le
>                                   << " debuginfos=3D" << my_fts_debuginfo
> @@ -2197,7 +2221,7 @@ scan_source_rpm_path (const string& dir)
>=20=20
>              case FTS_ERR:
>              case FTS_NS:
> -              throw libc_exception(f->fts_errno, string("fts/rpm travers=
al ") + string(f->fts_path));
> +              throw libc_exception(f->fts_errno, string("fts/archive tra=
versal ") + string(f->fts_path));
>=20=20
>              default:
>              case FTS_SL: /* ignore symlinks; seen in non-L mode only */
> @@ -2206,9 +2230,9 @@ scan_source_rpm_path (const string& dir)
>=20=20
>            if ((verbose && f->fts_info =3D=3D FTS_DP) ||
>                (verbose > 1 && f->fts_info =3D=3D FTS_F))
> -            obatched(clog) << "fts/rpm traversing " << rps << ", scanned=
=3D" << fts_scanned
> +            obatched(clog) << "fts/archive traversing " << rps << ", sca=
nned=3D" << fts_scanned
>                             << ", regex-skipped=3D" << fts_regex
> -                           << ", rpm=3D" << fts_rpm << ", cached=3D" << =
fts_cached << ", debuginfo=3D" << fts_debuginfo
> +                           << ", archive=3D" << fts_archive << ", cached=
=3D" << fts_cached << ", debuginfo=3D" << fts_debuginfo
>                             << ", executable=3D" << fts_executable
>                             << ", sourcerefs=3D" << fts_sref << ", source=
defs=3D" << fts_sdef << endl;
>          }
> @@ -2222,9 +2246,9 @@ scan_source_rpm_path (const string& dir)
>    gettimeofday (&tv_end, NULL);
>    double deltas =3D (tv_end.tv_sec - tv_start.tv_sec) + (tv_end.tv_usec =
- tv_start.tv_usec)*0.000001;
>=20=20
> -  obatched(clog) << "fts/rpm traversed " << dir << " in " << deltas << "=
s, scanned=3D" << fts_scanned
> +  obatched(clog) << "fts/archive traversed " << dir << " in " << deltas =
<< "s, scanned=3D" << fts_scanned
>                   << ", regex-skipped=3D" << fts_regex
> -                 << ", rpm=3D" << fts_rpm << ", cached=3D" << fts_cached=
 << ", debuginfo=3D" << fts_debuginfo
> +                 << ", archive=3D" << fts_archive << ", cached=3D" << ft=
s_cached << ", debuginfo=3D" << fts_debuginfo
>                   << ", executable=3D" << fts_executable
>                   << ", sourcerefs=3D" << fts_sref << ", sourcedefs=3D" <=
< fts_sdef << endl;
>  }
> @@ -2232,18 +2256,18 @@ scan_source_rpm_path (const string& dir)
>=20=20
>=20=20
>  static void*
> -thread_main_scan_source_rpm_path (void* arg)
> +thread_main_scan_source_archive_path (void* arg)
>  {
>    string dir =3D string((const char*) arg);
>=20=20
>    unsigned rescan_timer =3D 0;
>    sig_atomic_t forced_rescan_count =3D 0;
> -  set_metric("thread_timer_max", "rpm", dir, rescan_s);
> -  set_metric("thread_tid", "rpm", dir, tid());
> +  set_metric("thread_timer_max", "archive", dir, rescan_s);
> +  set_metric("thread_tid", "archive", dir, tid());
>    while (! interrupted)
>      {
> -      set_metric("thread_timer", "rpm", dir, rescan_timer);
> -      set_metric("thread_forced_total", "rpm", dir, forced_rescan_count);
> +      set_metric("thread_timer", "archive", dir, rescan_timer);
> +      set_metric("thread_forced_total", "archive", dir, forced_rescan_co=
unt);
>        if (rescan_s && rescan_timer > rescan_s)
>          rescan_timer =3D 0;
>        if (sigusr1 !=3D forced_rescan_count)
> @@ -2254,10 +2278,10 @@ thread_main_scan_source_rpm_path (void* arg)
>        if (rescan_timer =3D=3D 0)
>          try
>            {
> -            set_metric("thread_working", "rpm", dir, time(NULL));
> -            inc_metric("thread_work_total", "rpm", dir);
> -            scan_source_rpm_path (dir);
> -            set_metric("thread_working", "rpm", dir, 0);
> +            set_metric("thread_working", "archive", dir, time(NULL));
> +            inc_metric("thread_work_total", "archive", dir);
> +            scan_source_archive_path (dir);
> +            set_metric("thread_working", "archive", dir, 0);
>            }
>          catch (const sqlite_exception& e)
>            {

So in general it is nice that libarchive can just open any given
archive and that we don't need to update the sql schema to keep track
of different archive types. But it does feel like the errors, logs and
metrics are a little generic (e.g. "cannot select all format"). I think
in a couple of places it would be nice if we just knew whether we were
dealing with rpms or debs. It also seems nice to have some separate
metrics. Might it be an idea to not just make scan_archives a simple
string map, but include the actual archive type in it? Or pass the
expected type to scanner thread or to the archive_classify or the
scan_source_archive_path so you can keep referring to rpms or debs in
the logs, error messages and metrics but still keep them as generic
"archives" in the sql tables?

> @@ -2490,8 +2514,8 @@ main (int argc, char *argv[])
>        error (EXIT_FAILURE, 0,
>               "unexpected argument: %s", argv[remaining]);
>=20=20
> -  if (!scan_rpms && !scan_files && source_paths.size()>0)
> -    obatched(clog) << "warning: without -F and/or -R, ignoring PATHs" <<=
 endl;
> +  if (scan_archives.size()=3D=3D0 && !scan_files && source_paths.size()>=
0)
> +    obatched(clog) << "warning: without -F -R -U, ignoring PATHs" << end=
l;
>=20=20
>    (void) signal (SIGPIPE, SIG_IGN); // microhttpd can generate it incide=
ntally, ignore
>    (void) signal (SIGINT, signal_handler); // ^C
> @@ -2611,16 +2635,22 @@ main (int argc, char *argv[])
>    if (maxigroom)
>      obatched(clog) << "maxigroomed database" << endl;
>=20=20
> -
>    obatched(clog) << "search concurrency " << concurrency << endl;
>    obatched(clog) << "rescan time " << rescan_s << endl;
>    obatched(clog) << "groom time " << groom_s << endl;
> +  if (scan_archives.size()>0)
> +    {
> +      obatched ob(clog);
> +      auto& o =3D ob << "scanning archive types ";
> +      for (auto&& arch : scan_archives)
> +	o << arch.first << " ";
> +      o << endl;
> +    }

The indentation is correct in the source, but email seems to have
mangled it. Just note that tabs and spaces aren't used consistently in
the source file.

>    const char* du =3D getenv(DEBUGINFOD_URLS_ENV_VAR);
>    if (du && du[0] !=3D '\0') // set to non-empty string?
>      obatched(clog) << "upstream debuginfod servers: " << du << endl;
>=20=20
> -  vector<pthread_t> source_file_scanner_threads;
> -  vector<pthread_t> source_rpm_scanner_threads;
> +  vector<pthread_t> scanner_threads;
>    pthread_t groom_thread;
>=20=20
>    rc =3D pthread_create (& groom_thread, NULL, thread_main_groom, NULL);
> @@ -2632,20 +2662,21 @@ main (int argc, char *argv[])
>        pthread_t pt;
>        rc =3D pthread_create (& pt, NULL, thread_main_scan_source_file_pa=
th, (void*) it.c_str());
>        if (rc < 0)
> -        error (0, 0, "warning: cannot spawn thread (%d) to scan source f=
iles %s\n", rc, it.c_str());
> +        error (0, 0, "warning: cannot spawn thread (%d) to scan files %s=
\n", rc, it.c_str());
>        else
> -        source_file_scanner_threads.push_back(pt);
> +        scanner_threads.push_back(pt);
>      }
>=20=20
> -  if (scan_rpms) for (auto&& it : source_paths)
> -    {
> -      pthread_t pt;
> -      rc =3D pthread_create (& pt, NULL, thread_main_scan_source_rpm_pat=
h, (void*) it.c_str());
> -      if (rc < 0)
> -        error (0, 0, "warning: cannot spawn thread (%d) to scan source r=
pms %s\n", rc, it.c_str());
> -      else
> -        source_rpm_scanner_threads.push_back(pt);
> -    }
> +  if (scan_archives.size() > 0)
> +    for (auto&& it : source_paths)
> +      {
> +        pthread_t pt;
> +        rc =3D pthread_create (& pt, NULL, thread_main_scan_source_archi=
ve_path, (void*) it.c_str());
> +        if (rc < 0)
> +          error (0, 0, "warning: cannot spawn thread (%d) to scan archiv=
es %s\n", rc, it.c_str());
> +        else
> +          scanner_threads.push_back(pt);
> +      }
>=20=20
>    /* Trivial main loop! */
>    set_metric("ready", 1);
> @@ -2656,10 +2687,8 @@ main (int argc, char *argv[])
>    if (verbose)
>      obatched(clog) << "stopping" << endl;
>=20=20
> -  /* Join any source scanning threads. */
> -  for (auto&& it : source_file_scanner_threads)
> -    pthread_join (it, NULL);
> -  for (auto&& it : source_rpm_scanner_threads)
> +  /* Join all our threads. */
> +  for (auto&& it : scanner_threads)
>      pthread_join (it, NULL);
>    pthread_join (groom_thread, NULL);

Note that in some of the unpatches code there are still references to
rpms in some logs and comments that should also be "archive".

> diff --git a/doc/ChangeLog b/doc/ChangeLog
> index 00a61ac..398048f 100644
> --- a/doc/ChangeLog
> +++ b/doc/ChangeLog
> @@ -1,3 +1,12 @@
> +2019-12-02  Frank Ch. Eigler  <fche@redhat.com
> +
> +	* debuginfod.8: Add -U (DEB) flag, generalize RPM to "archive".
> +
> +2019-11-26  Frank Ch. Eigler  <fche@redhat.com>
> +	    Aaron Merey  <amerey@redhat.com>
> +
> +	* debuginfod.8, find-debuginfo.1, debuginfod_*.3: New files.
> +

That second ChangeLog entry looks incorrect.

>  .TP
>  .B "\-t SECONDS"  "\-\-rescan\-time=3DSECONDS"
> -Set the rescan time for the file and RPM directories.  This is the
> +Set the rescan time for the file and archive directories.  This is the
>  amount of time the scanning threads will wait after finishing a scan,
>  before doing it again.  A rescan for unchanged files is fast (because
>  the index also stores the file mtimes).  A time of zero is acceptable,
> @@ -143,8 +152,8 @@ independent of the groom time (including if it was ze=
ro).
>  .B "\-G"
>  Run an extraordinary maximal-grooming pass at debuginfod startup.
>  This pass can take considerable time, because it tries to remove any
> -debuginfo-unrelated content from the RPM-related parts of the index.
> -It should not be run if any recent RPM-related indexing operations
> +debuginfo-unrelated content from the archive-related parts of the index.
> +It should not be run if any recent archive-related indexing operations
>  were aborted early.  It can take considerable space, because it
>  finishes up with an sqlite "vacuum" operation, which repacks the
>  database file by triplicating it temporarily.  The default is not to
> @@ -155,7 +164,7 @@ do maximal-grooming.  See also the \fIDATA MANAGEMENT=
\fP section.
>  Set the concurrency limit for all the scanning threads.  While many
>  threads may be spawned to cover all the given PATHs, only NUM may
>  concurrently do CPU-intensive operations like parsing an ELF file
> -or an RPM.  The default is the number of processors on the system;
> +or an archive.  The default is the number of processors on the system;
>  the minimum is 1.
>=20=20
>  .TP
> @@ -257,10 +266,10 @@ many files.  This section offers some advice about =
the implications.
>=20=20
>  As a general explanation for size, consider that debuginfod indexes
>  ELF/DWARF files, it stores their names and referenced source file
> -names, and buildids will be stored.  When indexing RPMs, it stores
> -every file name \fIof or in\fP an RPM, every buildid, plus every
> -source file name referenced from a DWARF file.  (Indexing RPMs takes
> -more space because the source files often reside in separate
> +names, and buildids will be stored.  When indexing archives, it stores
> +every file name \fIof or in\fP an archive, every buildid, plus every
> +source file name referenced from a DWARF file.  (Indexing archives
> +takes more space because the source files often reside in separate
>  subpackages that may not be indexed at the same pass, so extra
>  metadata has to be kept.)

Is my understanding that debug debs don't contain the actual source
correct? If that is the case can't we take advantage of that by never
indexing source files from debs?
=20
> @@ -283,14 +292,14 @@ This means that the sqlite files grow fast during i=
nitial indexing,
>  slowly during index rescans, and periodically shrink during grooming.
>  There is also an optional one-shot \fImaximal grooming\fP pass is
>  available.  It removes information debuginfo-unrelated data from the
> -RPM content index such as file names found in RPMs ("rpm sdef"
> -records) that are not referred to as source files from any binaries
> -find in RPMs ("rpm sref" records).  This can save considerable disk
> -space.  However, it is slow and temporarily requires up to twice the
> -database size as free space.  Worse: it may result in missing
> -source-code info if the RPM traversals were interrupted, so the not
> -all source file references were known.  Use it rarely to polish a
> -complete index.
> +archive content index such as file names found in archives ("archive
> +sdef" records) that are not referred to as source files from any
> +binaries find in archives ("archive sref" records).  This can save
> +considerable disk space.  However, it is slow and temporarily requires
> +up to twice the database size as free space.  Worse: it may result in
> +missing source-code info if the archive traversals were interrupted,
> +so the not all source file references were known.  Use it rarely to
> +polish a complete index.

Already in the original, but should that be "so that not all
source..."?

>  You should ensure that ample disk space remains available.  (The flood
>  of error messages on -ENOSPC is ugly and nagging.  But, like for most
> @@ -317,7 +326,7 @@ happens, new versions of debuginfod will issue SQL st=
atements to
>  \fIdrop\fP all prior schema & data, and start over.  So, disk space
>  will not be wasted for retaining a no-longer-useable dataset.
>=20=20
> -In summary, if your system can bear a 0.5%-3% index-to-RPM-dataset
> +In summary, if your system can bear a 0.5%-3% index-to-archive-dataset
>  size ratio, and slow growth afterwards, you should not need to
>  worry about disk space.  If a system crash corrupts the database,
>  or you want to force debuginfod to reset and start over, simply

Here I think you should leave it at RPM since that is what you
originally measured. It isn't clear to me that things are the same for
debs since those don't include the sources?

Cheers,

Mark