public inbox for bunsen@sourceware.org
From: "Serhei Makarov" <serhei@serhei.io>
To: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Bunsen <bunsen@sourceware.org>
Subject: Re: bunsen (re)design discussion #1: testrun & branch identifiers, testrun representation
Date: Mon, 14 Mar 2022 13:24:36 -0400	[thread overview]
Message-ID: <813163ee-3ab5-462c-b2b2-475f77bc3ab1@www.fastmail.com> (raw)
In-Reply-To: <87ee39fm51.fsf@redhat.com>

My main concern with the SQLite storage schema is how well SQLite
handles string deduplication. The main source of redundancy in parsed
testrun data is the subtest strings repeated across different
testruns, so we might end up needing to do an additional join as
follows:

    (name, outcome, subtest_id) >< (subtest_id, subtest_text)

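For concreteness, the dedup join above could be exercised like this. This is a throwaway sketch using the stdlib sqlite3 module; intern_subtest and the exact table layout are my assumptions, not settled bunsen code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subtest_strs (subtest_id INTEGER PRIMARY KEY,
                               subtest_text TEXT UNIQUE);
    CREATE TABLE testcases (testrun_id INTEGER, name TEXT,
                            outcome TEXT, subtest_id INTEGER);
""")

def intern_subtest(text):
    # Deduplicate: reuse the existing row if this string is already stored.
    row = conn.execute("SELECT subtest_id FROM subtest_strs"
                       " WHERE subtest_text = ?", (text,)).fetchone()
    if row is not None:
        return row[0]
    return conn.execute("INSERT INTO subtest_strs (subtest_text) VALUES (?)",
                        (text,)).lastrowid

# The same subtest string appearing in two testruns is stored only once:
for testrun_id in (1, 2):
    sid = intern_subtest("print the frobnicator (hypothetical subtest)")
    conn.execute("INSERT INTO testcases VALUES (?, ?, ?, ?)",
                 (testrun_id, "foo.exp", "PASS", sid))

# The extra join needed to recover the full testcase rows:
rows = conn.execute("""
    SELECT tc.testrun_id, tc.name, tc.outcome, s.subtest_text
      FROM testcases tc JOIN subtest_strs s ON tc.subtest_id = s.subtest_id
""").fetchall()
```

The UNIQUE constraint plus the lookup-before-insert gives the deduplication; the cost we pay for it is the extra join on every read.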
Anyway, rather than going back and forth on guesswork over whether
SQLite-only is a good decision, I'm thinking through how to rework the
bunsen/model.py class outlines so we can experiment with it.

Given the weak correspondence between Testrun/Testcase objects and the
SQLite tables, we end up needing a set of explicit ser/de
(serialization/deserialization) methods under the hood, just as with
the JSON representation.

My current understanding of the schema would be as follows:

- testruns (testrun_id, project, testlogs_commit_id)
- ANALYSIS_testrun_kvs (testrun_id, key, value)
  - For analysis results that annotate the original testrun.
- testrun_kv_types (project, key, schema)
  - We will probably need to store some 'schema' information like this.
- testcases (testrun_id, name, outcome, subtest_id)
- subtest_strs (subtest_id, subtest_text)
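
In SQL terms, the list above might come out roughly as follows. This is a sketch only: the column types and constraints are my guesses, and ANALYSIS_ is a placeholder for a per-analysis table prefix:

```python
import sqlite3

# Tentative DDL matching the table list above; nothing here is final.
SCHEMA = """
CREATE TABLE testruns (testrun_id INTEGER PRIMARY KEY,
                       project TEXT NOT NULL,
                       testlogs_commit_id TEXT);
-- One such table per analysis; 'ANALYSIS' is a placeholder prefix.
CREATE TABLE ANALYSIS_testrun_kvs (testrun_id INTEGER REFERENCES testruns,
                                   key TEXT, value TEXT,
                                   PRIMARY KEY (testrun_id, key));
CREATE TABLE testrun_kv_types (project TEXT, key TEXT, schema TEXT,
                               PRIMARY KEY (project, key));
CREATE TABLE subtest_strs (subtest_id INTEGER PRIMARY KEY,
                           subtest_text TEXT UNIQUE);
CREATE TABLE testcases (testrun_id INTEGER REFERENCES testruns,
                        name TEXT, outcome TEXT,
                        subtest_id INTEGER REFERENCES subtest_strs);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```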

According to your vision, there would also need to be additional
tables for analysis results which don't follow the format of testrun
key-value annotations. That's not relevant just yet, and we may need
to settle on a different schema for each category of analysis (diff,
grid view, regression report, etc.).

You are hoping to track fine-grained provenance, e.g. which testruns
are present in each regression report. Note that this is not the same
as the dependency info used to decide when an analysis should be
regenerated: the regression report would need to be re-run in response
to changes anywhere in the set of testruns it was run on, including
testruns that do not appear in the report and therefore wouldn't be
marked in the provenance info. If the queries producing the input set
of testruns are complex, caching them as a set of testruns would be a
loss in terms of storage, and you would need to re-run the analysis
whenever the project changes anyway.

First attempt at class outlines:

    class Testrun(dict):
        def __init__(self, repo=None, from_json=None, from_sqlite=None, schema=None):
            self._schema = schema
            self._repo = repo
            ...
        
        def _load_json(self, from_json=None) # from_json is a JSON object
        def _load_sqlite(self, from_sqlite=None) # from_sqlite is a testrun_id
        
        def add_testcase(self, tc) # tc is a Testcase obj
        def add_testcase(self, name, outcome, subtest=None) # create a Testcase obj
        # (actual Python would merge these two signatures into one method,
        # since defining both would shadow the first)
        # use like:
        #   tc = trun.add_testcase("foo.exp", "PASS", subtest="subtest text")
        #   tc.origin_log = ...

        # ... methods for updating key/values with analysis provenance, e.g. ...
        def set_provenance(self, field_name, analysis_name)
        def set_field(self, field_name, field_value, analysis_name)
        
        def validate(self, ...) # checks required fields for saving to DB
        def save(self) # update tables in SQLite DB
        
        def to_json(self, summary=False, pretty=False, as_dict=False) # create JSON object

    class Testcase(dict):
        def __init__(self, from_json=None, from_sqlite=None, schema=None,
                           parent_testrun=None, repo=None):
            self._schema = schema
            if parent_testrun is None:
                self._repo = repo
            self.parent_testrun = parent_testrun
            ...
        
        def _load_json(self, from_json=None) # from_json is a JSON object
        def _load_sqlite(self, from_sqlite=None)
        
        def set_parent(self, parent_testrun) # if created with parent_testrun=None
        
        def validate(self, ...) # checks required fields for saving to DB
        def save(self) # update tables in SQLite DB
        def delete(self) # delete from tables in SQLite DB
        
        def to_json(self, summary=False, pretty=False, as_dict=False) # create JSON object

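To check that the interface feels right in use, here is a minimal in-memory stub of the outline above. No SQLite, no schema validation, no provenance; just the dict-subclass shape and the add_testcase/to_json flow, with every detail beyond the outline being an assumption:

```python
import json

class Testcase(dict):
    def __init__(self, name, outcome, subtest=None, parent_testrun=None):
        super().__init__(name=name, outcome=outcome)
        if subtest is not None:
            self["subtest"] = subtest
        self.parent_testrun = parent_testrun  # attribute, not serialized

class Testrun(dict):
    def __init__(self):
        super().__init__(testcases=[])

    def add_testcase(self, name, outcome, subtest=None):
        # Create, attach, and return a Testcase so the caller can keep
        # annotating it (e.g. tc.origin_log = ...).
        tc = Testcase(name, outcome, subtest=subtest, parent_testrun=self)
        self["testcases"].append(tc)
        return tc

    def to_json(self, pretty=False):
        # dict subclasses serialize directly with the stdlib json module.
        return json.dumps(self, indent=4 if pretty else None)

trun = Testrun()
tc = trun.add_testcase("foo.exp", "PASS", subtest="subtest text")
```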
The 'repo' field in each object refers to a bunsen.Repo object,
which specifies such things as where testrun data is stored
(in Git repo or in an SQLite cache, or in a mix of both).

If storing analysis-derived key-values in separate tables per
analysis, an analysis could be implemented along the lines of:

    # in my_analysis.py
    for testrun in input_testruns:
        testrun.key = analyze_stuff() # by default, sets provenance to my_analysis
        testrun.save() # saves the updated key in my_analysis_testrun_kvs,
        # without touching the tables that don't belong to my_analysis

The idea being that you could re-run this code whenever input_testruns
changes, and the my_analysis_testrun_kvs table would be updated
accordingly, without disturbing any other analysis's tables?
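
Concretely, the save() path for an analysis-owned table might boil down to an idempotent upsert, along these lines. This is a sketch under the assumed my_analysis_testrun_kvs layout above; the ON CONFLICT upsert form needs SQLite 3.24 or later:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE my_analysis_testrun_kvs
                (testrun_id INTEGER, key TEXT, value TEXT,
                 PRIMARY KEY (testrun_id, key))""")

def save_analysis_kv(testrun_id, key, value):
    # Re-running the analysis overwrites its own rows in place and
    # touches no table belonging to another analysis.
    conn.execute("""INSERT INTO my_analysis_testrun_kvs VALUES (?, ?, ?)
                    ON CONFLICT (testrun_id, key)
                    DO UPDATE SET value = excluded.value""",
                 (testrun_id, key, value))

save_analysis_kv(1, "pass_ratio", "0.95")
save_analysis_kv(1, "pass_ratio", "0.97")  # second run updates in place
```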

A minimal ORMish template for a cacheable Analysis object *not* tied
to testrun key-values might be:

    class Analysis(dict):
        def __init__(self, repo=None, from_json=None, from_sqlite=None, ...):
            self._repo = repo
            ...

        def _load_json(self, from_json=None) # from_json is a JSON object
        def _load_sqlite(self, from_sqlite=None) # from_sqlite identifies the cached analysis row

        def validate(self, ...) # checks required fields for saving to DB
        def save(self) # update tables in SQLite DB
        
        def to_json(self, summary=False, pretty=False, as_dict=False) # create JSON object

