From mboxrd@z Thu Jan  1 00:00:00 1970
Mime-Version: 1.0
Message-Id: <1796a6e2-2b2a-49a7-b350-9d58700d3e30@www.fastmail.com>
Date: Thu, 17 Mar 2022 17:23:04 -0400
From: "Serhei Makarov" <me@serhei.io>
To: Bunsen <bunsen@sourceware.org>
Subject: bunsen (re)design discussion #2: Repository Layout,
 SQLite design idea + Django-esque API(?)
Content-Type: text/plain

* #2a Repository Layout -- where do you put a bunsen repository?

Plausible locations of the Bunsen repo on disk:
- bunsen/.bunsen: inside a Git checkout of the Bunsen codebase.
- bunsen/bunsen-data: inside a Git checkout of the Bunsen codebase, not hidden.
- project/.bunsen: inside a Git checkout of the project source code.
- project/bunsen-data: inside a Git checkout of the project source code, not hidden.
- project.bunsen: standalone Bunsen repo.

When we install Bunsen as an RPM, the 'bunsen' command should
autodetect which repo it's being run against. Commands such as the
following make sense:
- cd bunsen-systemtap && bunsen ls # -> uses ./.bunsen, ./bunsen-data
- cd systemtap && bunsen ls # -> uses ./.bunsen, ./bunsen-data
- cd project.bunsen && bunsen ls # -> uses .
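
A minimal sketch of the autodetection, assuming the lookup walks upward
from the current directory and prefers '.bunsen' over 'bunsen-data'.
Neither the upward walk nor the ordering is settled; find_bunsen_repo
is a hypothetical name.

```python
from pathlib import Path
from typing import Optional

# Subdirectory names taken from the list in #2a. The lookup order is an
# assumption, not a decided behaviour.
SUBDIR_CANDIDATES = (".bunsen", "bunsen-data")


def find_bunsen_repo(start: Path) -> Optional[Path]:
    """Return the first Bunsen repo found at 'start' or in one of its parents."""
    start = start.resolve()
    for directory in (start, *start.parents):
        # Standalone repo such as project.bunsen: the directory itself is the repo.
        if directory.name.endswith(".bunsen") and directory.name != ".bunsen":
            return directory
        # Repo nested inside a checkout of Bunsen or of the project.
        for name in SUBDIR_CANDIDATES:
            candidate = directory / name
            if candidate.is_dir():
                return candidate
    return None
```

With this, 'bunsen ls' would call find_bunsen_repo(Path.cwd()) and
report an error if the result is None.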

For pushing and pulling, the Bunsen Git repo could be placed in a
separate location, e.g.
- /public/bunsen-systemtap.git
- /notnfs/staplogs/bunsen-systemtap/.bunsen/config specifies git_repo=/public/bunsen-systemtap.git
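
As a rough illustration of reading that setting, here is a sketch that
assumes the config file is plain 'key=value' lines with '#' comments.
The actual config format is not fixed by this email, and parse_config
is a hypothetical helper.

```python
from typing import Dict


def parse_config(text: str) -> Dict[str, str]:
    """Parse 'key=value' lines; skip blanks, '#' comments and '[section]' lines."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("["):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without '='
            settings[key.strip()] = value.strip()
    return settings
```

The push/pull code would then look up settings.get("git_repo") and fall
back to a Git repo inside the Bunsen directory when the key is absent.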

* #2b Repository Layout -- how is the content of the repo laid out?

This is a subject of ongoing discussion with fche, because we have
quite different opinions about what is convenient to keep in Git vs
SQLite, how flexible the Git layout should be, etc. In principle, the
analysis and data representation will work more or less the same
regardless of what solution we settle on. Therefore, the repository
layout can be made configurable (and potentially the configurability
can be reduced down the line as we settle on what makes sense and what
doesn't).

There are *several* types of artefacts we're contemplating between the
two of us.

(1) testlogs Git repo. Requested by fche. *Every* commit of this repo
is a directory tree of log files. The branch naming scheme is
completely free-form. This is meant to allow third parties to start
creating git repos for us to pull down, as quickly as possible, with
minimal tooling or explanation.

(2) index Git repo. Preferred by serhei, described in a prior email.
  - (<source>/)?index - contains JSON files <project>-<year>-<month>-<extra>
  - (<source>/)?<project>/testlogs-<suffix>
  - (<source>/)?<project>/testruns-<suffix>

An addendum re: branch naming: Git already prefixes cloned branches as
'remotes/foo/branch'. So this would most likely replace any <source>
prefix we would be adding on our own.

Keep in mind that we would then have to work directly with
'remotes/foo/branch' and be careful with any Git operations
(e.g. checkout) that default to creating a local mirror of the branch.
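
To make the naming scheme in (2) concrete, here is a sketch of a parser
for such branch names, assuming the Git 'remotes/<name>/' prefix stands
in for <source> as suggested above. BranchName and parse_branch are
hypothetical names, and the suffix is treated as an opaque string.

```python
import re
from typing import NamedTuple, Optional


class BranchName(NamedTuple):
    source: Optional[str]   # Git remote name, None for a local branch
    project: Optional[str]  # None for the 'index' branch
    kind: str               # 'index', 'testlogs' or 'testruns'
    suffix: Optional[str]   # opaque text after 'testlogs-' / 'testruns-'


_BRANCH_RE = re.compile(
    r"^(?:remotes/(?P<source>[^/]+)/)?"
    r"(?:index|(?P<project>[^/]+)/(?P<kind>testlogs|testruns)-(?P<suffix>.+))$"
)


def parse_branch(ref: str) -> Optional[BranchName]:
    """Split a branch name into its parts; return None if it does not fit the scheme."""
    m = _BRANCH_RE.match(ref)
    if m is None:
        return None
    return BranchName(
        source=m.group("source"),
        project=m.group("project"),
        kind=m.group("kind") or "index",
        suffix=m.group("suffix"),
    )
```

Because the parser works on the name alone, it can be applied to
'remotes/foo/...' refs without checking anything out.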

(3) index SQLite DB. Discussed in a later email.

(4) JSON analysis cache. This would just be a directory of JSON files,
with format defined by the analysis script/library that generates it.

* #2c Repository Layout -- examples

Now for some examples of how this could be arranged:

(a) The current Bunsen setup for SystemTap uses all Git. We have:
- An index Git repo at .bunsen/bunsen.git

Cloning/archiving this repo is super simple.
Cloning a subset of this repo involves cloning a subset of the branches,
and is also fairly simple if the branch naming scheme allows us to use
sensible wildcards.
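
A small sketch of what wildcard-based subset selection could look like,
using shell-style patterns from the Python standard library. The
function name select_branches is hypothetical, and whether Bunsen would
use fnmatch-style patterns or Git refspecs is left open.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def select_branches(branches: Iterable[str], patterns: Iterable[str]) -> List[str]:
    """Keep the branches that match at least one shell-style pattern."""
    wanted = list(patterns)
    # Note: fnmatch's '*' also matches '/', so 'systemtap/*' covers every
    # branch below the 'systemtap/' prefix.
    return [b for b in branches if any(fnmatchcase(b, p) for p in wanted)]
```

For example, the patterns 'index' and 'systemtap/*' would select the
index branch plus all testlogs and testruns branches of one project.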

(a') Suppose we evolve this format to allow several separate Git repos.
- foo.bunsen/index.git - the main index+testlogs git repo
- foo.bunsen/index-myproject.git - an index git repo for a separate project
- foo.bunsen/extra-input1.git - a testlogs Git repo cloned from one source
- foo.bunsen/extra-input2.git - a testlogs Git repo cloned from another source
- path/to/public/export.git - extracted testruns from index git repo,
  with any sensitive data omitted
- cache.sqlite - an SQLite cache for analysis artefacts

Cloning this repo is a bit more complicated.

We have to specify which branches from which git repo on the remote
will be cloned to which git repo on the local.

So the following command API might need to become somewhat more complicated:

$ bunsen clone user@ssh.server:foo.bunsen/ source_name
$ bunsen clone https://server/path/to/public/export.git source_name

Although there are sensible defaults:
- if making a fresh clone, clone all the same git repos and all the branches
- if adding data to a new repo

(b) fche's preference for an all-SQLite setup.
- foo.bunsen/bunsen.git - the main testlogs git repo
- index.sqlite - an SQLite DB with parsed testrun data 
- ??? cache.sqlite - an SQLite cache for transient analysis artefacts

Whether index or cache should be separate or merged is arguable. If
cloning the SQLite DB is a possibility, separation may be useful
because there *is* a distinction between a permanent index which is
worth keeping around vs. a transient analysis which functions as an
annotation to this index. I don't see cloning the SQLite DB as being a
standard operation in this design, so merging the DBs would be less
annoying.

But, on the other hand....

The main disadvantage to cloning this type of repo is that the
receiving side has to repeat the parsing work that was done on the
sending side. Since this could take hour(s) when we clone a repo for
the first time, for me this is a deal-breaker. The only workaround is
keeping tight control on the size of the SQLite DB that has the parsed
testrun data, so that we *can* clone it together with the testlogs as
a special case. This requires the possibility of offloading
large/transient analysis tables to a separate SQLite DB from
small/persistent/necessary analysis tables. I don't know yet if this
will significantly complicate the code or not.
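
One way the split could work is SQLite's ATTACH DATABASE, which lets a
single connection query both files. The sketch below assumes two files,
index.sqlite and cache.sqlite, with made-up table names (testruns,
analysis); none of the schema is decided, and open_index is a
hypothetical helper.

```python
import sqlite3


def open_index(index_path: str, cache_path: str) -> sqlite3.Connection:
    """Open the persistent index and attach the transient cache as schema 'cache'."""
    conn = sqlite3.connect(index_path)
    # After ATTACH, tables in the second file are addressed as cache.<table>.
    conn.execute("ATTACH DATABASE ? AS cache", (cache_path,))
    # Small, persistent data lives in the main file (hypothetical schema).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS main.testruns "
        "(id INTEGER PRIMARY KEY, project TEXT, branch TEXT)"
    )
    # Large or transient analysis results live in the attached file.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache.analysis "
        "(testrun_id INTEGER, key TEXT, value TEXT)"
    )
    conn.commit()
    return conn
```

Queries can join main.testruns against cache.analysis as if they were in
one DB, while only index.sqlite needs to be shipped along with the
testlogs when cloning. The cache file can be deleted and regenerated.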