From: fche@redhat.com (Frank Ch. Eigler)
To: "Serhei Makarov" <me@serhei.io>
Cc: Bunsen <bunsen@sourceware.org>
Subject: Re: bunsen (re)design discussion #2: Repository Layout, SQLite design idea + Django-esque API(?)
Date: Mon, 21 Mar 2022 15:45:21 -0400
Message-ID: <87r16vdr8u.fsf@redhat.com>
In-Reply-To: <c7d926d8-1da9-4379-bf01-67ec8a478ae2@www.fastmail.com> (Serhei Makarov's message of "Sat, 19 Mar 2022 22:59:13 -0400")


Hi -

(We seem to be converging on accepting any loosely structured git repo
for storage of the raw testresult data, which should make it easy for
varied tooling to submit data.)

OK, shifting discussion to how analyzed results are stored.

> Git+JSON (the current format)
> - Good for cloning
>    (git clone + git pull, options for shallow cloning,
>     cloning subsets of branches)
>
> SQLite
> - Awful for cloning
>    (essentially, requires redoing the parse on the receiving end,
>     or returning data to the Git+JSON format)

Awful for *partial* cloning/merging, yes, without formal tooling.  With
just a bit of tooling, it's fine.  That JSON-to-SQL loader Python script
you saw was basically all it takes to merge in a batch of data.
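
For concreteness, here is a minimal sketch of that kind of merge step
(the schema and field names are hypothetical, not the actual loader's):

    import json, sqlite3

    def merge_batch(db_path, json_path):
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS testrun"
                    " (id TEXT PRIMARY KEY, fields TEXT)")
        with open(json_path) as f:
            for run in json.load(f):
                # INSERT OR REPLACE keeps the merge idempotent, so
                # re-importing the same batch of runs is harmless.
                con.execute("INSERT OR REPLACE INTO testrun VALUES (?,?)",
                            (run["id"], json.dumps(run)))
        con.commit()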

That said, I'm not sure that partial cloning of *analysis results* is
even a worthwhile feature.  It's the raw testsuite data that is precious
- reanalysis is just a matter of background CPU.


> - Likely to be more efficient at certain queries
>   - The 'sliding window' analyses that compare nearby points in a history
>     would be possible but tedious to express

Not just that.  Delegating filtering & selection to a database is
undoubtedly faster than parsing/instantiating Python objects en masse.
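
For instance (hypothetical testrun columns), the difference is between
an indexed scan inside SQLite and deserializing every run into Python
objects first:

    import sqlite3
    con = sqlite3.connect("bunsen.db")

    # Filtering pushed into the database: only matching rows ever
    # reach Python.
    failures = con.execute("SELECT id FROM testrun"
                           " WHERE arch = ? AND outcome = ?",
                           ("x86_64", "FAIL")).fetchall()

    # versus the Git+JSON way: parse every run, then filter in Python:
    # failures = [r for r in all_parsed_runs()
    #             if r.arch == "x86_64" and r.outcome == "FAIL"]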

>   - Designing a *stable* schema for analyses to query against would be
>   tricky, requiring space-optimized tables to be munged into views
>   following an unchanging schema.  Otherwise any change to
>   optimizations will require embedded SQL in analysis scripts to be
>   rewritten.

Note that this is an API design problem.  It has its dual in the
Python/JSON world too: SQL schema stability <-> JSON object schema
stability <-> Python API stability.  (If anything, SQL schema changes
can be mechanized more easily.)
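
A sketch of what such view-based stability can look like (invented
table names; the point is that reworking the optimized tables means
rewriting only the view, not the analysis scripts):

    import sqlite3
    con = sqlite3.connect(":memory:")
    con.executescript("""
      -- Space-optimized storage: test names and outcomes interned
      -- in a string table rather than repeated per result row.
      CREATE TABLE strtab (sid INTEGER PRIMARY KEY, s TEXT UNIQUE);
      CREATE TABLE result_raw (run_id INTEGER, name_sid INTEGER,
                               outcome_sid INTEGER);

      -- The unchanging schema that analysis scripts query against.
      CREATE VIEW result AS
        SELECT r.run_id, n.s AS name, o.s AS outcome
          FROM result_raw r
          JOIN strtab n ON n.sid = r.name_sid
          JOIN strtab o ON o.sid = r.outcome_sid;
    """)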


> Given the toss-up and potential complementary strengths, it would be
> best to have a way to support both formats and experiment extensively.

What could make sense is a dual API: a secondary storage format that
tools (not just ours) can consume, and a programmatic Python API bound
to it for the kinds of operations that are best written procedurally.
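
Roughly (class and method names invented for the sketch): the file on
disk stays a plain SQLite database any tool can open, while procedural
analyses go through a thin Python layer on top of it:

    import sqlite3

    class Testruns:
        """Thin procedural API over the shared SQLite storage."""
        def __init__(self, path):
            self.con = sqlite3.connect(path)
            self.con.row_factory = sqlite3.Row

        def runs(self, **filters):
            # Push simple equality filters down into SQL; anything
            # fancier stays procedural in the caller.
            where = " AND ".join(f"{k} = ?" for k in filters) or "1"
            return self.con.execute("SELECT * FROM testrun WHERE "
                                    + where, tuple(filters.values()))

    # e.g. for run in Testruns("bunsen.db").runs(arch="x86_64"): ...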


> The configuration file could allow options as follows:
>
> [core]
> git_repo = ... // stores both testlogs and testruns

I'd strongly suggest NOT mixing the two.

> testlogs_repo = ... // testlogs in Git format with properly named
> branches

If we indeed keep them separate, then the raw logs do not need to have a
structured naming convention imposed on them.  We can consume them by
just traversing all refs/commits.
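
e.g. (assuming nothing about branch names), plain git plumbing already
enumerates every log commit:

    import subprocess

    def all_commits(repo):
        # Every commit reachable from any ref, no naming convention.
        out = subprocess.run(["git", "-C", repo, "rev-list", "--all"],
                             capture_output=True, text=True, check=True)
        return out.stdout.split()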

> testruns_repo = ... // testruns in JSON format
> testruns_db = ... // enables data to be stored in SQLite DB

Well, I hope through discussion and experiment, we settle on one or the
other, before exposing this decision to our users.

> cache_db = ... // can be the same as testruns_db

What needs to be cached?  I'm thinking of the derived outputs of
analysis jobs as being first-class things stored in the testruns
database (whatever the form).  The simplest case is a set of result
tables to store the parsed dejagnu object records, and then a second set
of result tables to store the aggregate pass/fail/etc. COUNTS to enable
quick overviews.
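
Concretely, something like this (a sketch schema, not a proposal of the
exact columns):

    import sqlite3
    con = sqlite3.connect("bunsen.db")
    con.executescript("""
      -- First-class analysis output #1: parsed dejagnu records.
      CREATE TABLE IF NOT EXISTS dejagnu_result
        (run_id INTEGER, exp TEXT, name TEXT, outcome TEXT);

      -- First-class analysis output #2: per-run aggregate counts,
      -- derived from the table above, for quick overview pages.
      CREATE TABLE IF NOT EXISTS dejagnu_counts
        (run_id INTEGER, outcome TEXT, n INTEGER,
         PRIMARY KEY (run_id, outcome));

      INSERT OR REPLACE INTO dejagnu_counts
        SELECT run_id, outcome, COUNT(*) FROM dejagnu_result
        GROUP BY run_id, outcome;
    """)
    con.commit()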

So these can be not just a disposable cache but a long-lived standard
place to look things up for reporting.  (Explicit deletion and
recomputation may of course be needed, but that's not a caching issue
per se; it's a matter of wanting new parameters or a new analysis
software version.)

> [project "systemtap"]
> raw_testlogs_repo = ... // testlogs in Git format with any branches,
> for importing
> [...]
> I'll think a bit more about the configuration format in light of the 'Makefile' model.

The key idea could be representing how one analysis feeds another, and
how to express any required parametrization.  Some derived analysis
(e.g., enumerating regressions) is rich terrain for rerunning analysis
against *varying sets* of test results, which could lead to several
analysis results coexisting nearby.
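
To make the 'Makefile' analogy concrete, a registry of passes might be
as little as this (all names invented; the scheduler part is the usual
topological sort over declared inputs):

    ANALYSES = {
        "parse_dejagnu": {"inputs": ["raw_testlogs"]},
        "counts":        {"inputs": ["parse_dejagnu"]},
        # One logical analysis, several coexisting parametrized
        # results, one per baseline set:
        "regressions":   {"inputs": ["counts"],
                          "params": {"baseline": ["release", "prev"]}},
    }

    def rerun_order(target):
        # Depth-first postorder over inputs = the order to recompute.
        order, seen = [], set()
        def visit(name):
            if name in seen or name not in ANALYSES:
                return
            seen.add(name)
            for dep in ANALYSES[name]["inputs"]:
                visit(dep)
            order.append(name)
        visit(target)
        return order

    # rerun_order("regressions")
    #   -> ['parse_dejagnu', 'counts', 'regressions']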

Gotta sit down and come up with some cray-zee analysis pass ideas to
expand the horizons.


- FChE

