public inbox for bunsen@sourceware.org
* bunsen (re)design discussion #2: Repository Layout, SQLite design idea + Django-esque API(?)
@ 2022-03-17 21:23 Serhei Makarov
  2022-03-20  2:59 ` Serhei Makarov
  0 siblings, 1 reply; 4+ messages in thread
From: Serhei Makarov @ 2022-03-17 21:23 UTC (permalink / raw)
  To: Bunsen

* #2a Repository Layout -- where do you put a bunsen repository?

Plausible locations of the Bunsen repo on disk:
- bunsen/.bunsen: inside a Git checkout of the Bunsen codebase.
- bunsen/bunsen-data: inside a Git checkout of the Bunsen codebase, not hidden.
- project/.bunsen: inside a Git checkout of the project source code.
- project/bunsen-data: inside a Git checkout of the project source code, not hidden.
- project.bunsen: standalone Bunsen repo.

When we install Bunsen as an RPM, the 'bunsen' command should
autodetect which repo it's being run against. Commands such as the
following make sense:
- cd bunsen-systemtap && bunsen ls # -> uses ./.bunsen, ./bunsen-data
- cd systemtap && bunsen ls # -> uses ./.bunsen, ./bunsen-data
- cd project.bunsen && bunsen ls # -> uses .
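
The autodetection described above could be sketched as follows. This is purely illustrative: the function name and the search order are assumptions, not part of any agreed 'bunsen' CLI.

```python
# Hypothetical sketch of 'bunsen' repo autodetection: walk up from the
# starting directory looking for a .bunsen/ or bunsen-data/ subdirectory,
# or a directory that is itself a standalone *.bunsen repo.
import os

def find_bunsen_repo(start="."):
    path = os.path.abspath(start)
    while True:
        # Case: we are inside a standalone project.bunsen repo.
        if path.endswith(".bunsen"):
            return path
        # Case: a checkout containing the data directory.
        for name in (".bunsen", "bunsen-data"):
            candidate = os.path.join(path, name)
            if os.path.isdir(candidate):
                return candidate
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root; nothing found
            return None
        path = parent
```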

For pushing and pulling, the Bunsen Git repo could be placed in a
separate location, e.g.
- /public/bunsen-systemtap.git
- /notnfs/staplogs/bunsen-systemtap/.bunsen/config specifies git_repo=/public/bunsen-systemtap.git

* #2b Repository Layout -- how is the content of the repo laid out?

This is a subject of ongoing discussion with fche, because we have
quite different opinions about what is convenient to keep in Git vs
SQLite, how flexible the Git layout should be etc. In principle, the
analysis and data representation will work more or less the same
regardless of what solution we settle on. Therefore, the repository
layout can be made configurable (and potentially the configurability
can be reduced down the line as we settle on what makes sense and what
doesn't).

There are *several* types of artefacts we're contemplating between the
two of us.

(1) testlogs Git repo. Requested by fche. *Every* commit of this repo
is a directory tree of log files. The branch naming scheme is
completely free-form. This is meant to allow third parties to start
creating git repos for us to pull down, as quickly as possible, with
minimal tooling or explanation.

(2) index Git repo. Preferred by serhei, described in a prior email.
  - (<source>/)?index - contains JSON files <project>-<year>-<month>-<extra>
  - (<source>/)?<project>/testlogs-<suffix>
  - (<source>/)?<project>/testruns-<suffix>

An addendum re: branch naming: Git already prefixes cloned branches as
'remotes/foo/branch'. So this would most likely replace any <source>
prefix we would be adding on our own.

This means we will have to work directly with 'remotes/foo/branch' and
be careful with any Git operations (e.g. checkout) that default to
creating a local mirror of the branch.
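
For example, enumerating the 'remotes/foo/...' tracking branches without checking any of them out can be done with 'git for-each-ref', which never creates local branches. A minimal sketch (plain subprocess, no GitPython assumed):

```python
# List remote-tracking branches for a given source without a checkout,
# avoiding operations like 'git checkout' that would create local branches.
import subprocess

def remote_branches(repo_dir, source):
    out = subprocess.run(
        ["git", "-C", repo_dir, "for-each-ref",
         "--format=%(refname:short)", "refs/remotes/%s/" % source],
        capture_output=True, text=True, check=True).stdout
    return out.splitlines()
```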

(3) index SQLite DB. Discussed in a later email.

(4) JSON analysis cache. This would just be a directory of JSON files,
with format defined by the analysis script/library that generates it.
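
A minimal sketch of the directory-of-JSON-files idea: each analysis stores its artefact under a name it chooses, and the content format is entirely up to the generating script. The class and method names here are invented for illustration, not an agreed API.

```python
# Directory-of-JSON-files cache: one file per named analysis artefact.
import json, os

class JsonCache:
    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def put(self, name, obj):
        # The analysis script decides both the name and the JSON layout.
        with open(os.path.join(self.cache_dir, name + ".json"), "w") as f:
            json.dump(obj, f)

    def get(self, name):
        path = os.path.join(self.cache_dir, name + ".json")
        if not os.path.exists(path):
            return None  # artefact not generated yet
        with open(path) as f:
            return json.load(f)
```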

* #2c Repository Layout -- examples

Now for some examples of how this could be arranged:

(a) The current Bunsen setup for SystemTap uses all Git. We have:
- An index Git repo at .bunsen/bunsen.git

Cloning/archiving this repo is super simple.
Cloning a subset of this repo involves cloning a subset of the branches,
and is also fairly simple if the branch naming scheme allows us to use
sensible wildcards.

(a') Suppose we evolve this format to allow several separate Git repos.
- foo.bunsen/index.git - the main index+testlogs git repo
- foo.bunsen/index-myproject.git - an index git repo for a separate project
- foo.bunsen/extra-input1.git - a testlogs Git repo cloned from one source
- foo.bunsen/extra-input2.git - a testlogs Git repo cloned from another source
- path/to/public/export.git - extracted testruns from index git repo,
  with any sensitive data omitted
- cache.sqlite - an SQLite cache for analysis artefacts

Cloning this repo is a bit more complicated.

We have to specify which branches from which Git repo on the remote
will be cloned to which Git repo on the local side.

So the following command API might need to become somewhat more complicated:

$ bunsen clone user@ssh.server:foo.bunsen/ source_name
$ bunsen clone https://server/path/to/public/export.git source_name

Although there are sensible defaults:
- if making a fresh clone, clone all the same git repos and all the branches
- if adding data to a new repo

(b) fche's preference for an all-SQLite setup.
- foo.bunsen/bunsen.git - the main testlogs git repo
- index.sqlite - an SQLite DB with parsed testrun data 
- ??? cache.sqlite - an SQLite cache for transient analysis artefacts

Whether index or cache should be separate or merged is arguable. If
cloning the SQLite DB is a possibility, separation may be useful
because there *is* a distinction between a permanent index which is
worth keeping around vs. a transient analysis which functions as an
annotation to this index. I don't see cloning the SQLite DB as being a
standard operation in this design, so merging the DBs would be less
annoying.

But, on the other hand....

The main disadvantage to cloning this type of repo is that the
receiving side has to repeat the parsing work that was done on the
sending side. Since this could take hour(s) when we clone a repo for
the first time, for me this is a deal-breaker. The only workaround is
keeping tight control on the size of the SQLite DB that has the parsed
testrun data, so that we *can* clone it together with the testlogs as
a special case. This requires the possibility of offloading
large/transient analysis tables to a separate SQLite DB from
small/persistent/necessary analysis tables. I don't know yet if this
will significantly complicate the code or not.
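
SQLite's ATTACH DATABASE makes the offloading at least mechanically simple: small/persistent tables live in the main index file, large/transient tables in a separate file, and one connection can query across both. The file and table names below are made up for illustration.

```python
# Keep the clonable index small by offloading rebuildable analysis
# tables to a second SQLite file, joined in via ATTACH DATABASE.
import os, sqlite3, tempfile

tmp = tempfile.mkdtemp()
conn = sqlite3.connect(os.path.join(tmp, "index.sqlite"))   # small, worth cloning
conn.execute("ATTACH DATABASE ? AS cache",
             (os.path.join(tmp, "cache.sqlite"),))          # large, rebuildable

conn.execute("CREATE TABLE testruns (id INTEGER PRIMARY KEY, summary TEXT)")
conn.execute("CREATE TABLE cache.diffs (testrun_id INTEGER, detail TEXT)")
conn.execute("INSERT INTO testruns VALUES (1, '100 pass / 2 fail')")
conn.execute("INSERT INTO cache.diffs VALUES (1, 'regression: proc.exp')")

# Queries join across both files transparently:
rows = conn.execute("""SELECT t.summary, d.detail
                         FROM testruns t
                         JOIN cache.diffs d ON d.testrun_id = t.id""").fetchall()
```

Cloning would then only need to ship index.sqlite; cache.sqlite can be regenerated on the receiving side.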

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: bunsen (re)design discussion #2: Repository Layout, SQLite design idea + Django-esque API(?)
  2022-03-17 21:23 bunsen (re)design discussion #2: Repository Layout, SQLite design idea + Django-esque API(?) Serhei Makarov
@ 2022-03-20  2:59 ` Serhei Makarov
  2022-03-21 19:45   ` Frank Ch. Eigler
  0 siblings, 1 reply; 4+ messages in thread
From: Serhei Makarov @ 2022-03-20  2:59 UTC (permalink / raw)
  To: Bunsen

> This is a subject of ongoing discussion with fche, because we have
> quite different opinions about what is convenient to keep in Git vs
> SQLite, how flexible the Git layout should be etc. In principle, the
> analysis and data representation will work more or less the same
> regardless of what solution we settle on. Therefore, the repository
> layout can be made configurable (and potentially the configurability
> can be reduced down the line as we settle on what makes sense and what
> doesn't).

Git+JSON (the current format)
- Good for cloning
   (git clone + git pull, options for shallow cloning,
    cloning subsets of branches)

SQLite
- Awful for cloning
   (essentially, requires redoing the parse on the receiving end,
    or returning data to the Git+JSON format)
- Likely to be more efficient at certain queries
  - The 'sliding window' analyses that compare nearby points in a history
    would be possible but tedious to express
  - Designing a *stable* schema for analyses to query against would be tricky,
    requiring space-optimized tables to be munged into views
    following an unchanging schema.
    Otherwise any change to optimizations will require embedded SQL
    in analysis scripts to be rewritten.
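
One way to sketch the 'stable schema via views' point: the on-disk tables can be space-optimized (e.g. testcase names interned in a side table), while analysis scripts query a view whose columns never change. All table and column names here are invented for illustration.

```python
# Space-optimized storage behind a stable view for analysis scripts.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- space-optimized storage: testcase names interned once
    CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE results (run_id INTEGER, name_id INTEGER, outcome TEXT);

    -- stable interface; the storage underneath can change freely
    CREATE VIEW testcase_results AS
        SELECT r.run_id, n.name AS testcase, r.outcome
          FROM results r JOIN names n ON n.id = r.name_id;
""")
conn.execute("INSERT INTO names VALUES (1, 'systemtap.base/proc.exp')")
conn.execute("INSERT INTO results VALUES (42, 1, 'PASS')")
rows = conn.execute("""SELECT testcase, outcome FROM testcase_results
                        WHERE run_id = 42""").fetchall()
```

Analysis scripts embed SQL only against testcase_results; reworking the names/results tables for space would not require rewriting them.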

Given the toss-up and potential complementary strengths,
it would be best to have a way to support both formats and experiment
extensively.

The configuration file could allow options as follows:

[core]
git_repo = ... // stores both testlogs and testruns
testlogs_repo = ... // testlogs in Git format with properly named branches
testruns_repo = ... // testruns in JSON format
testruns_db = ... // enables data to be stored in SQLite DB
cache_db = ... // can be the same as testruns_db

(Obviously, not all at once. This is meant to capture the possible use cases.)

Then we could also add options per-project:

[project "systemtap"]
raw_testlogs_repo = ... // testlogs in Git format with any branches, for importing
parse_module = systemtap.parse_dejagnu

[project "systemtap-contrib"]
testlogs_repo = ... // archive testruns in a separate repo (or db) from the main repo
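
The git-config-style format above can be read with the stdlib configparser, treating '// ...' as inline comments and '[project "name"]' as a plain section name. The sample values are invented; the option names are the proposed ones, and which of them survive is an open question.

```python
# Reading the proposed config format with configparser.
import configparser

SAMPLE = '''
[core]
testlogs_repo = /srv/bunsen/testlogs.git  // raw logs, free-form branches
testruns_db = /srv/bunsen/index.sqlite

[project "systemtap"]
parse_module = systemtap.parse_dejagnu
'''

config = configparser.ConfigParser(inline_comment_prefixes=("//",))
config.read_string(SAMPLE)
```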

I'll think a bit more about the configuration format in light of the 'Makefile' model.


* Re: bunsen (re)design discussion #2: Repository Layout, SQLite design idea + Django-esque API(?)
  2022-03-20  2:59 ` Serhei Makarov
@ 2022-03-21 19:45   ` Frank Ch. Eigler
  2022-03-21 20:23     ` Serhei Makarov
  0 siblings, 1 reply; 4+ messages in thread
From: Frank Ch. Eigler @ 2022-03-21 19:45 UTC (permalink / raw)
  To: Serhei Makarov; +Cc: Bunsen


Hi -

(We seem to be converging on accepting any loosely structured git repo
for storage of the raw testresult data, which should make it easy for
varied tooling to submit data.)

OK, shifting discussion to how analyzed results are stored.

> Git+JSON (the current format)
> - Good for cloning
>    (git clone + git pull, options for shallow cloning,
>     cloning subsets of branches)
>
> SQLite
> - Awful for cloning
>    (essentially, requires redoing the parse on the receiving end,
>     or returning data to the Git+JSON format)

Awful for -partial- cloning/merging, yes, without formal tooling.  With
just a bit of tooling, it's fine.  That json-to-sql loader python script
you saw was basically all it takes to merge in a batch of data.
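
The loader script referred to here is not shown in the thread; a minimal sketch of the idea -- flattening a directory of testrun JSON files into a SQLite table -- might look like the following. Field names (e.g. bunsen_commit_id) are assumptions.

```python
# Merge a batch of testrun JSON files into a SQLite DB.
import json, os, sqlite3

def load_json_testruns(json_dir, db_path):
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS testruns
                    (bunsen_commit_id TEXT PRIMARY KEY, data TEXT)""")
    for fname in os.listdir(json_dir):
        if not fname.endswith(".json"):
            continue
        with open(os.path.join(json_dir, fname)) as f:
            run = json.load(f)
        # INSERT OR REPLACE makes the merge idempotent: re-loading the
        # same batch of data is harmless.
        conn.execute("INSERT OR REPLACE INTO testruns VALUES (?, ?)",
                     (run["bunsen_commit_id"], json.dumps(run)))
    conn.commit()
    return conn
```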

That said, I'm not sure that partial cloning of *analysis results* is
even a worthwhile feature.  It's the raw testsuite data that is precious
- reanalysis is just a matter of background CPU.


> - Likely to be more efficient at certain queries
>   - The 'sliding window' analyses that compare nearby points in a history
>     would be possible but tedious to express

Not just that.  Delegating filtering & selection to a database is
undoubtedly faster than parsing/instantiating python objects en masse.

>   - Designing a *stable* schema for analyses to query against would be
>   tricky, requiring space-optimized tables to be munged into views
>   following an unchanging schema.  Otherwise any change to
>   optimizations will require embedded SQL in analysis scripts to be
>   rewritten.

Note that this is an API design problem.  It has its dual in the
python/json world too.  sql schema stability <-> json object schema <->
python api stability.  (If anything, sql schema changes can be
mechanized easier.)


> Given the toss-up and potential complementary strengths, it would be
> best to have a way to support both formats and experiment extensively.

What could make sense is to have a dual API.  A secondary storage format
that tools can consume (not just ours), and a programmatic python API
bound to that stuff for those types of operations that are best written
procedurally.


> The configuration file could allow options as follows:
>
> [core]
> git_repo = ... // stores both testlogs and testruns

Strongly suggesting NOT mixing the two.

> testlogs_repo = ... // testlogs in Git format with properly named
> branches

If we indeed keep them separate, then the raw logs do not need to have a
structured naming convention imposed on them.  We can consume it by just
traversing all refs/commits.

> testruns_repo = ... // testruns in JSON format
> testruns_db = ... // enables data to be stored in SQLite DB

Well, I hope through discussion and experiment, we settle on one or the
other, before exposing this decision to our users.

> cache_db = ... // can be the same as testruns_db

What needs to be cached?  I'm thinking of derived analysis of analysis
jobs as being first class things stored in the testruns database
(whatever the form).  The simplest case is a set of result tables to
store the parsed dejagnu object records, and then a second set of result
tables to store the aggregate pass/fail/etc. COUNTS to enable quick
overviews.
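
The two tiers described here could be sketched as one table of parsed dejagnu records plus a derived counts table for quick overviews. Schema and sample data are illustrative only.

```python
# First-class derived results: raw dejagnu records plus aggregate counts.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dejagnu_results (
        run_id INTEGER, expfile TEXT, subtest TEXT, outcome TEXT);
    CREATE TABLE dejagnu_counts (
        run_id INTEGER, outcome TEXT, n INTEGER);
""")
conn.executemany("INSERT INTO dejagnu_results VALUES (?,?,?,?)", [
    (1, "proc.exp", "basic probe",  "PASS"),
    (1, "proc.exp", "nested probe", "FAIL"),
    (1, "vars.exp", "globals",      "PASS"),
])
# The second set of tables is derived entirely from the first,
# so it can be deleted and recomputed when the analysis changes:
conn.execute("""INSERT INTO dejagnu_counts
                SELECT run_id, outcome, COUNT(*) FROM dejagnu_results
                GROUP BY run_id, outcome""")
counts = dict(conn.execute(
    "SELECT outcome, n FROM dejagnu_counts WHERE run_id = 1").fetchall())
```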

So these can be not just a disposable cache but a long-lived standard
place to look things up for reporting.  (Explicit deletion and
recomputation may of course be needed, but that's not a caching issue
per se but wanting new parameters or new analysis software version.)

> [project "systemtap"]
> raw_testlogs_repo = ... // testlogs in Git format with any branches,
> for importing
> [...]
> I'll think a bit more about the configuration format in light of the 'Makefile' model.

The key idea could be representing how analysis feeds other analysis,
and how to represent any required parametrization.  Some derived
analysis (e.g., enumerating regressions) is a rich terrain for rerunning
analysis against *varying sets* of test results, which could lead to
having several analysis results coexist nearby.

Gotta sit down and come up with some cray-zee analysis pass ideas to
expand the horizons.


- FChE



* Re: bunsen (re)design discussion #2: Repository Layout, SQLite design idea + Django-esque API(?)
  2022-03-21 19:45   ` Frank Ch. Eigler
@ 2022-03-21 20:23     ` Serhei Makarov
  0 siblings, 0 replies; 4+ messages in thread
From: Serhei Makarov @ 2022-03-21 20:23 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Bunsen



On Mon, Mar 21, 2022, at 3:45 PM, Frank Ch. Eigler wrote:
> (We seem to be converging on accepting any loosely structured git repo
> for storage of the raw testresult data, which should make it easy for
> varied tooling to submit data.)
Yes, I have no problem with coding support for that into the design,
accepting test logs from a Git repo with free-form format and
putting parsed testrun data elsewhere.
For my concern re: cloning (full or partial) it does not complicate things.

>> Given the toss-up and potential complementary strengths, it would be
>> best to have a way to support both formats and experiment extensively.
>
> What could make sense is to have a dual API.  A secondary storage format
> that tools can consume (not just ours), and a programmatic python API
> bound to that stuff for those types of operations that are best written
> procedurally.
> ...
> Well, I hope through discussion and experiment, we settle on one or the
> other, before exposing this decision to our users.
The current model API is not "bound to" JSON the same way an
embedded-SQL analysis script would be bound to SQL.
Thus far I have completely avoided handling JSON format
directly in the analysis, doing everything through the model classes,
which could be coded against either JSON or SQL.
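
To illustrate the insulation being described (class and method names here are invented, not the actual Bunsen model API): analysis code sees only model objects, and either storage backend can produce them.

```python
# Model classes insulating analysis code from the storage format.
import json

class Testrun:
    """What analysis scripts work with; independent of storage format."""
    def __init__(self, fields):
        self.fields = fields
    def outcome_of(self, testcase):
        return self.fields["testcases"].get(testcase)

class JsonBackend:
    """Produces Testrun objects from JSON text. A SqliteBackend exposing
    the same testruns() generator could be swapped in without touching
    any analysis code written against Testrun."""
    def __init__(self, json_text):
        self.json_text = json_text
    def testruns(self):
        for obj in json.loads(self.json_text):
            yield Testrun(obj)
```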

