public inbox for bunsen@sourceware.org
* bunsen (re)design discussion #2: Repository Layout, SQLite design idea + Django-esque API(?)
@ 2022-03-17 21:23 Serhei Makarov
  2022-03-20  2:59 ` Serhei Makarov
  0 siblings, 1 reply; 4+ messages in thread
From: Serhei Makarov @ 2022-03-17 21:23 UTC (permalink / raw)
  To: Bunsen

* #2a Repository Layout -- where do you put a bunsen repository?

Plausible locations of the Bunsen repo on disk:
- bunsen/.bunsen: inside a Git checkout of the Bunsen codebase.
- bunsen/bunsen-data: inside a Git checkout of the Bunsen codebase, not hidden.
- project/.bunsen: inside a Git checkout of the project source code.
- project/bunsen-data: inside a Git checkout of the project source code, not hidden.
- project.bunsen: standalone Bunsen repo.

When we install Bunsen as an RPM, the 'bunsen' command should
autodetect which repo it's being run against. Commands such as the
following make sense:
- cd bunsen-systemtap && bunsen ls # -> uses ./.bunsen, ./bunsen-data
- cd systemtap && bunsen ls # -> uses ./.bunsen, ./bunsen-data
- cd project.bunsen && bunsen ls # -> uses .
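To make the autodetection concrete, here is a minimal sketch in Python. The function name, the lookup order (.bunsen before bunsen-data), and the rule that a directory named like project.bunsen is treated as a standalone repo are all assumptions for illustration, not the actual Bunsen implementation.

```python
from pathlib import Path
from typing import Optional

# Candidate sub-directories, tried in this (assumed) order.
REPO_SUBDIRS = (".bunsen", "bunsen-data")
STANDALONE_SUFFIX = ".bunsen"


def find_bunsen_repo(start: Path) -> Optional[Path]:
    """Return the Bunsen repo to use when 'bunsen' is run from `start`.

    Hypothetical helper: returns None when nothing is detected.
    """
    start = start.resolve()
    name = start.name
    # Standalone repo, e.g. project.bunsen: use the directory itself.
    if name.endswith(STANDALONE_SUFFIX) and len(name) > len(STANDALONE_SUFFIX):
        return start
    # Otherwise look for a repo nested inside a checkout.
    for sub in REPO_SUBDIRS:
        candidate = start / sub
        if candidate.is_dir():
            return candidate
    return None
```

Called as find_bunsen_repo(Path.cwd()), this would cover the three 'bunsen ls' cases listed above. Whether the search should also walk up into parent directories, the way Git does, is left open here.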

For pushing and pulling, the Bunsen Git repo could be placed in a
separate location, e.g.
- /public/bunsen-systemtap.git
- /notnfs/staplogs/bunsen-systemtap/.bunsen/config specifies git_repo=/public/bunsen-systemtap.git
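A rough sketch of how the git_repo setting could be resolved. The config file format is not specified in this email, so the plain key=value parsing below is an assumption, as are the function names and the fallback to a bunsen.git directory inside the Bunsen repo.

```python
from pathlib import Path
from typing import Dict


def parse_config(text: str) -> Dict[str, str]:
    """Parse key=value lines (assumed format); skip blanks, comments, section headers."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("["):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def git_repo_location(bunsen_dir: Path, config_text: str) -> Path:
    """Prefer git_repo from the config; otherwise fall back to a repo
    inside bunsen_dir (the fallback name is an assumption)."""
    settings = parse_config(config_text)
    if "git_repo" in settings:
        return Path(settings["git_repo"])
    return bunsen_dir / "bunsen.git"
```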

* #2b Repository Layout -- how is the content of the repo laid out?

This is a subject of ongoing discussion with fche, because we have
quite different opinions about what is convenient to keep in Git vs
SQLite, how flexible the Git layout should be, etc. In principle, the
analysis and data representation will work more or less the same
regardless of which solution we settle on. Therefore, the repository
layout can be made configurable (and the configurability can
potentially be reduced down the line as we settle on what makes sense
and what doesn't).

There are *several* types of artefacts we're contemplating between the
two of us.

(1) testlogs Git repo. Requested by fche. *Every* commit of this repo
is a directory tree of log files. The branch naming scheme is
completely free-form. This is meant to allow third parties to start
creating git repos for us to pull down, as quickly as possible, with
minimal tooling or explanation.

(2) index Git repo. Preferred by serhei, described in a prior email.
  - (<source>/)?index - contains JSON files <project>-<year>-<month>-<extra>
  - (<source>/)?<project>/testlogs-<suffix>
  - (<source>/)?<project>/testruns-<suffix>
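For illustration, the branch naming scheme in (2) can be parsed with a pair of regular expressions. This is only a sketch: the BranchName type and parse_branch function are hypothetical names, and the suffix is treated as an opaque string because its exact format is not spelled out here.

```python
import re
from typing import NamedTuple, Optional


class BranchName(NamedTuple):
    source: Optional[str]   # optional <source> prefix
    project: Optional[str]  # None for the index branch
    kind: str               # "index", "testlogs" or "testruns"
    suffix: Optional[str]   # opaque <suffix>, None for the index branch


# (<source>/)?index
_INDEX_RE = re.compile(r"^(?:(?P<source>[^/]+)/)?index$")
# (<source>/)?<project>/(testlogs|testruns)-<suffix>
_DATA_RE = re.compile(
    r"^(?:(?P<source>[^/]+)/)?(?P<project>[^/]+)/"
    r"(?P<kind>testlogs|testruns)-(?P<suffix>.+)$"
)


def parse_branch(name: str) -> Optional[BranchName]:
    """Split a branch name into its parts, or return None if it does not fit."""
    m = _INDEX_RE.match(name)
    if m:
        return BranchName(m.group("source"), None, "index", None)
    m = _DATA_RE.match(name)
    if m:
        return BranchName(
            m.group("source"), m.group("project"),
            m.group("kind"), m.group("suffix"),
        )
    return None
```

For example, parse_branch("systemtap/testruns-2022-03") yields project "systemtap", kind "testruns" and suffix "2022-03" (the suffix value is made up for this example).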

An addendum re: branch naming: Git already prefixes cloned branches as
'remotes/foo/branch'. So this would most likely replace any <source>
prefix we would be adding on our own.

Keep in mind that we would then have to work directly with
'remotes/foo/branch' and be careful with any Git operations
(e.g. checkout) that default to creating a local mirror of the branch.
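One way to stay on the remote-tracking refs is to read from them with 'git show' and 'git ls-tree' rather than checking anything out. The sketch below only builds and runs those command lines; the helper names are hypothetical, and nothing here is claimed to be how Bunsen currently does it.

```python
import subprocess
from typing import List


def remote_ref(source: str, branch: str) -> str:
    """Name of the remote-tracking ref for `branch` fetched from `source`."""
    return f"remotes/{source}/{branch}"


def show_file_cmd(source: str, branch: str, path: str) -> List[str]:
    """Command that prints one file from the branch without a checkout."""
    return ["git", "show", f"{remote_ref(source, branch)}:{path}"]


def list_files_cmd(source: str, branch: str) -> List[str]:
    """Command that lists all file paths in the branch's top commit."""
    return ["git", "ls-tree", "-r", "--name-only", remote_ref(source, branch)]


def read_file(git_dir: str, source: str, branch: str, path: str) -> bytes:
    """Run the 'git show' command against the given repo and return the file content."""
    cmd = show_file_cmd(source, branch, path)
    cmd.insert(1, f"--git-dir={git_dir}")
    return subprocess.run(cmd, check=True, capture_output=True).stdout
```

Neither command touches the working tree or creates a local branch, which avoids the checkout pitfall mentioned above.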

(3) index SQLite DB. Discussed in a later email.

(4) JSON analysis cache. This would just be a directory of JSON files,
with format defined by the analysis script/library that generates it.

* #2c Repository Layout -- examples

Now for some examples of how this could be arranged:

(a) The current Bunsen setup for SystemTap uses all Git. We have:
- An index Git repo at .bunsen/bunsen.git

Cloning/archiving this repo is super simple.
Cloning a subset of this repo involves cloning a subset of the branches,
and is also fairly simple if the branch naming scheme allows us to use
sensible wildcards.
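As a small illustration of the wildcard idea, branch names can be filtered with shell-style patterns. The branch names and the select_branches helper below are invented for the example; note that in Python's fnmatch a '*' also matches '/' characters.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def select_branches(branches: Iterable[str], patterns: Iterable[str]) -> List[str]:
    """Keep the branches matching at least one shell-style pattern."""
    patterns = list(patterns)
    return [b for b in branches if any(fnmatchcase(b, p) for p in patterns)]


if __name__ == "__main__":
    all_branches = [
        "index",
        "systemtap/testlogs-2022-02",
        "systemtap/testruns-2022-02",
        "gdb/testruns-2022-02",
    ]
    # Clone only the index plus everything belonging to one project.
    print(select_branches(all_branches, ["index", "systemtap/*"]))
```

A per-project prefix like this is what makes the subset expressible as one pattern per project.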

(a') Suppose we evolve this format to allow several separate Git repos.
- foo.bunsen/index.git - the main index+testlogs git repo
- foo.bunsen/index-myproject.git - an index git repo for a separate project
- foo.bunsen/extra-input1.git - a testlogs Git repo cloned from one source
- foo.bunsen/extra-input2.git - a testlogs Git repo cloned from another source
- path/to/public/export.git - extracted testruns from index git repo,
  with any sensitive data omitted
- cache.sqlite - an SQLite cache for analysis artefacts

Cloning this repo is a bit more complicated.

We have to specify which branches from which git repo on the remote
will be cloned to which git repo on the local.

So the following command API might need to become somewhat more complicated:

$ bunsen clone user@ssh.server:foo.bunsen/ source_name
$ bunsen clone https://server/path/to/public/export.git source_name

Although there are sensible defaults:
- if making a fresh clone, clone all the same git repos and all the branches
- if adding data to a new repo
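The mapping described above (which branches from which remote git repo go to which local git repo) could be written down as a list of rules. This is a sketch of one possible data structure; CloneRule and fresh_clone_plan are hypothetical names, and only the fresh-clone default is modelled.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class CloneRule:
    remote_repo: str  # git repo inside the remote Bunsen repo, e.g. "index.git"
    branches: str     # shell-style pattern selecting branches to fetch
    local_repo: str   # git repo inside the local Bunsen repo


def fresh_clone_plan(remote_repos: Iterable[str]) -> List[CloneRule]:
    """Default for a fresh clone: same git repos, all branches."""
    return [CloneRule(repo, "*", repo) for repo in remote_repos]


if __name__ == "__main__":
    for rule in fresh_clone_plan(["index.git", "extra-input1.git"]):
        print(f"{rule.remote_repo} [{rule.branches}] -> {rule.local_repo}")
```

A non-default clone would then just be a hand-written list of CloneRule entries, e.g. one rule pulling only "systemtap/*" from a remote index.git into a local extra-input1.git.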

(b) fche's preference for an all-SQLite setup.
- foo.bunsen/bunsen.git - the main testlogs git repo
- index.sqlite - an SQLite DB with parsed testrun data 
- ??? cache.sqlite - an SQLite cache for transient analysis artefacts

Whether the index and cache should be separate or merged is arguable.
If cloning the SQLite DB is a possibility, separation may be useful
because there *is* a distinction between a permanent index which is
worth keeping around vs. a transient analysis which functions as an
annotation to this index. I don't see cloning the SQLite DB as being a
standard operation in this design, so merging the DBs would be less
annoying.

But, on the other hand....

The main disadvantage to cloning this type of repo is that the
receiving side has to repeat the parsing work that was done on the
sending side. Since this could take hour(s) when we clone a repo for
the first time, for me this is a deal-breaker. The only workaround is
keeping tight control on the size of the SQLite DB that has the parsed
testrun data, so that we *can* clone it together with the testlogs as
a special case. This requires the possibility of offloading
large/transient analysis tables to a separate SQLite DB from
small/persistent/necessary analysis tables. I don't know yet if this
will significantly complicate the code or not.
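To get a feel for how much complication the split adds, here is a minimal sketch using SQLite's ATTACH DATABASE from Python's standard sqlite3 module. The table and column names (testruns, analysis) are invented for the example; only the split into a small persistent DB and a separate transient DB comes from the discussion above.

```python
import sqlite3


def open_repo_db(index_path: str, cache_path: str) -> sqlite3.Connection:
    """Open the persistent index DB and attach the transient cache DB.

    Both files are reachable through one connection, as schemas
    'main' (index) and 'cache'.
    """
    conn = sqlite3.connect(index_path)
    conn.execute("ATTACH DATABASE ? AS cache", (cache_path,))
    # Small, persistent, worth cloning together with the testlogs.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS main.testruns ("
        "id INTEGER PRIMARY KEY, branch TEXT, summary TEXT)"
    )
    # Large or transient, can be regenerated on the receiving side.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache.analysis ("
        "testrun_id INTEGER, name TEXT, result TEXT)"
    )
    return conn


def analyses_for_branch(conn: sqlite3.Connection, branch: str):
    """Join across the two DBs as if they were one."""
    return conn.execute(
        "SELECT t.id, a.name, a.result FROM main.testruns AS t "
        "JOIN cache.analysis AS a ON a.testrun_id = t.id "
        "WHERE t.branch = ? ORDER BY a.name",
        (branch,),
    ).fetchall()
```

With this arrangement queries can still join across both files, and cloning would only need to copy the index file; the cache file can be dropped and rebuilt.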


end of thread, other threads:[~2022-03-21 20:24 UTC | newest]

Thread overview: 4+ messages
2022-03-17 21:23 bunsen (re)design discussion #2: Repository Layout, SQLite design idea + Django-esque API(?) Serhei Makarov
2022-03-20  2:59 ` Serhei Makarov
2022-03-21 19:45   ` Frank Ch. Eigler
2022-03-21 20:23     ` Serhei Makarov
