From: "Serhei Makarov" <serhei@serhei.io>
To: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Bunsen <bunsen@sourceware.org>
Subject: Re: bunsen (re)design discussion #1: testrun & branch identifiers, testrun representation
Date: Thu, 10 Mar 2022 15:00:13 -0500
Message-ID: <34fdccc3-5756-492a-89b9-3a2cab65b16a@www.fastmail.com>
In-Reply-To: <20220310180414.GC28310@redhat.com>

Thanks for those comments. I believe the current point of disagreement is
whether we can abandon the JSON testrun storage format, while maintaining neat
properties for easy cloning / archival of entire Bunsen repos, subsets of Bunsen
repos, and so forth. I'm not convinced yet, especially if we go the route of
every Bunsen instance parsing and generating its own local testrun data.

However, your comments on how to specify the SQLite schema and split the
'dejagnu', 'autoconf', ... parser bits have convinced me to run my design
exercise in the other direction: try my hand at specifying an SQLite schema,
then check whether the same features can be replicated in the JSON version of
the format.

(When thinking in SQLite first, I can immediately think of a number of
'metaknowledge' bits that are currently hardcoded in the corresponding parsing
scripts, but might conveniently be maintained and extended in tables.
Things like knowing that "Red Hat Enterprise Linux \d+ (.*)" maps to
"rhel-\1". But I'll see if a more convincing list of examples comes to mind or not.)
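
A minimal sketch of what a table-driven version of that rule could look like.
The regex and replacement are copied from the example above; the table name
osver_rules, its columns, and the function normalize_osver are hypothetical
names made up for illustration, not anything that exists in Bunsen today.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per normalization rule.
conn.execute("CREATE TABLE osver_rules (pattern TEXT, replacement TEXT)")
conn.execute(
    "INSERT INTO osver_rules VALUES (?, ?)",
    (r"Red Hat Enterprise Linux \d+ (.*)", r"rhel-\1"),
)

def normalize_osver(conn, raw):
    """Return the expansion of the first matching rule, or raw unchanged."""
    rules = conn.execute(
        "SELECT pattern, replacement FROM osver_rules ORDER BY rowid"
    )
    for pattern, replacement in rules:
        m = re.fullmatch(pattern, raw)
        if m:
            # Expand backreferences such as \1 in the stored replacement.
            return m.expand(replacement)
    return raw
```

Adding a rule for another distribution would then be an INSERT into the table
rather than a change to a parsing script.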

On Thu, Mar 10, 2022, at 1:04 PM, Frank Ch. Eigler wrote:
>> [Q] Should the bunsen_commit_id be deprecated entirely as a unique
>> identifier? We could conceivably allow a single set of testlogs to
>> define several testruns for the *same* project, which would break the
>> uniqueness property for <project>+<bunsen_commit_id>
>
> Given that we're reconsidering use of git for purposes of storing this
> analysis data, there is no obvious hash for it.
Disagree. The bunsen_commit_id hash is based on the commit storing the testlogs,
not the commit storing the testruns JSON files. So moving the testruns to SQLite
wouldn't eliminate the possibility of using this hash in a testrun
identifier.

> OTOH, given that it
> is the result of analysis, we have a lot of rich data to query by.  I
> suspect we don't need much of a standard identifier scheme at all.  An
> arbitrary nickname for query shortcutting could be a field associated
> with the "testrun" object itself.
I don't get it.

The testrun ID could be stored in a field associated with the "testrun" object.
It is indeed an 'arbitrary nickname' that we are free to generate when the
testrun is created. But we do need such a nickname to exist in order to store
tuples in the (testrun_id, keyword, value) schema, as well as to issue
unambiguous commands like 'delete this testrun'.
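
To make the point concrete, here is a rough sketch of the tuple table and a
'delete this testrun' operation keyed on the ID. The table name testrun_kv,
the primary key choice, and the function delete_testrun are assumptions for
illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed layout: one row per (testrun_id, keyword) pair.
conn.execute(
    "CREATE TABLE testrun_kv ("
    " testrun_id TEXT NOT NULL,"
    " keyword TEXT NOT NULL,"
    " value TEXT,"
    " PRIMARY KEY (testrun_id, keyword))"
)
conn.executemany(
    "INSERT INTO testrun_kv VALUES (?, ?, ?)",
    [
        ("id1", "arch", "x86_64"),
        ("id1", "osver", "fedora-35"),
        ("id2", "arch", "aarch64"),
    ],
)

def delete_testrun(conn, testrun_id):
    """Remove every tuple belonging to one testrun; return the rows deleted."""
    cur = conn.execute(
        "DELETE FROM testrun_kv WHERE testrun_id = ?", (testrun_id,)
    )
    return cur.rowcount

deleted = delete_testrun(conn, "id1")
remaining = conn.execute("SELECT testrun_id FROM testrun_kv").fetchall()
```

Without some testrun_id column there is nothing for the WHERE clause to name,
which is the reason the identifier has to exist in some form.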

In my prior email I was trying to improve on 'arbitrary hexsha' and design a
'nickname' that identifies the testrun somewhat and can be handled by human
minds when necessary. The bunsen_commit_id 'hexsha' was sufficiently unique
but it was also very confusing from a UI perspective since users will be
thinking of commit IDs in the project Git repo whenever they see a bare hexsha.

>> - To accommodate fche's fine-grained branching idea, the
>>   branch names in the Bunsen repo can be based on the unique ID.
>
> (I was talking about fine-grained branching *for the test logs*, not
> for the derived analysis data.)

If the testrun data is stored in SQLite, obviously the Git branching scheme
would only apply to the testlogs.

If the testrun data is stored in Git, the branching scheme would be
based on the same concerns about easily pulling a subset of the data.

I'm not convinced we need individual-testrun granularity for the branching, but
the branching scheme I outlined easily accommodates this option.

>> Overall, though, the branch name format can be fairly free-form and
>> different commit_logs scripts could even use different formats. The
>> only design goals are to adequately split the testlogs into a fairly
>> small number of fairly small branches (i.e. divide O(n*m) testlogs
>> into O(n) branches of O(m) testlogs), and to allow 'bunsen pull'
>> invocations to identify a meaningful subset of branches via wildcard
>> or regex.
>
> Noting we switched to talking about testlogs - the raw test result log
> files that are archived in git.  These (rather than analysis data
> stored somewhere else) are indeed worth pushing & pulling &
> transporting between bunsen installations.  I suspect that imposing a
> naming standard here is not necessary either.  As long as we
> standardize what bunsen consumes from these repos (every tag?  every
> commit?  every branch?  every ref matching a given regex?), a human or
> other tooling could decide their favorite naming convention.
I'll need to think about this carefully. My intuition screams a strong disagreement.

If we just treat the transported data as a set of unlabeled testlogs,
and say that each Bunsen instance is going to be naming them separately
& differently....

>> #1: Testrun Schema, very quick take1
> [...] the list of
> key names can be left open in the tooling core, so queries can use any
> set of them.  i.e., the schema for "testrun" analysis objects could be:
>
>    (id, keyword, value)
>
> and then a user can find testruns by combination of keywords having
> particular values (or range or pattern).  That can map
> straightforwardly to a relational filter (select query) if that's how
> we end up storing these things.

As I understand it, this is a neat solution to my earlier question of needing to
define metadata fields in an SQLite schema and end up having to deal with
migrations, etc. during the lifecycle of a Bunsen repo. Of course, it means we
definitely don't get Django-like ORM functionality for free: we need to
implement our own mapping between this (id, keyword, value) soup and individual
Testrun objects that combine all keywords for a particular id. This mapping
would take the place of the current code which translates Testrun objects to and
from JSON.
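
A minimal sketch of such a mapping, using plain dicts to stand in for Testrun
objects. The table name testrun_kv and the functions load_testruns and
store_testrun are hypothetical; the real Testrun class would need more than
this.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE testrun_kv (testrun_id TEXT, keyword TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO testrun_kv VALUES (?, ?, ?)",
    [
        ("id1", "arch", "x86_64"),
        ("id1", "osver", "fedora-35"),
        ("id2", "arch", "aarch64"),
        ("id2", "osver", "rhel-8"),
    ],
)

def load_testruns(conn):
    """Group (testrun_id, keyword, value) rows into one dict per testrun."""
    testruns = defaultdict(dict)
    rows = conn.execute("SELECT testrun_id, keyword, value FROM testrun_kv")
    for testrun_id, keyword, value in rows:
        testruns[testrun_id][keyword] = value
    return dict(testruns)

def store_testrun(conn, testrun_id, fields):
    """Write a dict of fields back as one tuple per keyword."""
    conn.executemany(
        "INSERT INTO testrun_kv VALUES (?, ?, ?)",
        [(testrun_id, k, v) for k, v in fields.items()],
    )
```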

That said, there are properties to this scheme I like. The SQLite schema should
be the same across any Bunsen instance for any project, and this satisfies the
goal.

To test the 'straightforward' part of your claim about querying, what might a
'select' query look like for 'all configuration fields of {set of testruns}'?

We would want it to produce a table along the lines of:

    testrun arch    osver     kernel_ver
    id1     x86_64  fedora-35 4.6-whatever
    id2     aarch64 rhel-8    4.7-whatever
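
One possible shape for such a query is a pivot via conditional aggregation,
sketched below. The table name testrun_kv and the function config_table are
assumptions; note that the caller has to supply the list of keywords, since
the columns of the result are not known to the schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE testrun_kv (testrun_id TEXT, keyword TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO testrun_kv VALUES (?, ?, ?)",
    [
        ("id1", "arch", "x86_64"),
        ("id1", "osver", "fedora-35"),
        ("id1", "kernel_ver", "4.6-whatever"),
        ("id2", "arch", "aarch64"),
        ("id2", "osver", "rhel-8"),
        ("id2", "kernel_ver", "4.7-whatever"),
    ],
)

def config_table(conn, keywords):
    """Pivot key/value rows into one row per testrun, one column per keyword."""
    # One MAX(CASE ...) column per requested keyword; MAX skips the NULLs
    # produced by rows whose keyword does not match.
    columns = ", ".join(
        "MAX(CASE WHEN keyword = ? THEN value END)" for _ in keywords
    )
    sql = (
        f"SELECT testrun_id, {columns} FROM testrun_kv "
        "GROUP BY testrun_id ORDER BY testrun_id"
    )
    return conn.execute(sql, list(keywords)).fetchall()

table = config_table(conn, ["arch", "osver", "kernel_ver"])
```

A testrun that lacks one of the requested keywords gets NULL (None) in that
column rather than being dropped from the result.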

(Another functionality [Q] came up for distribution & kernel version: we may
want to store the most exact version of each component, but then specify a
granularity for analysis which combines some 'similar-enough' versions into the
same configuration, e.g. treat all 5.x kernels as one configuration, then treat
all 4.x kernels as another configuration.

If we see a problem arise on 5.x but not 4.x, *then* we would want to look at
the detailed history of changes within 5.x.)
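
As a rough illustration of the granularity idea, the sketch below keeps the
exact version string in storage and coarsens it only at analysis time. The
functions coarsen_version and group_by_kernel, the 'parts' parameter, and the
sample version strings are all hypothetical.

```python
import re
from collections import defaultdict

def coarsen_version(version, parts=1):
    """Keep the first `parts` numeric components and replace the rest by 'x'."""
    m = re.match(r"\d+(?:\.\d+)*", version)
    if not m:
        return version  # leave unparseable versions as they are
    kept = m.group(0).split(".")[:parts]
    return ".".join(kept) + ".x"

def group_by_kernel(testruns, parts=1):
    """Group (testrun_id, kernel_ver) pairs by coarsened kernel version."""
    groups = defaultdict(list)
    for testrun_id, kernel_ver in testruns:
        groups[coarsen_version(kernel_ver, parts)].append(testrun_id)
    return dict(groups)

groups = group_by_kernel(
    [("id1", "5.16.8"), ("id2", "5.10.3"), ("id3", "4.18.0")]
)
```

Raising 'parts' to 2 would split the 5.x group again, which corresponds to
looking at the detailed history within 5.x once a problem shows up there.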

>> testcases: an array of objects as follows 
>> - name - DejaGNU .exp name
>> - outcome - DejaGNU outcome code e.g. PASS, FAIL, KPASS, KFAIL, ...
>> - subtest - DejaGNU subtest, very very very freeform
>> - origin_sum - cursor (subset of lines) in the dejagnu .sum file
>> - origin_log - cursor (subset of lines) in the dejagnu .log file
>
> OK, so that could be a separate relational table of testrun analysis
> data, derived from the testlog and related to the above set.
Agreed.

>> [Q] (!!) The 'store PASS subtests separately' vs 'consolidate PASS
>> subtests' format tradeoff is a complex one [...]
>
> To the extent this represents an optimization, I'd tend to think of it
> as a policy decision that the dejagnu parser configuration should make.
Agreed, I would look at it that way too.

We might want to flag what format a stored testrun is using, in a way that can
be easily found out later, or changed by changing the parser configuration and
rebuilding.

>> bookkeeping information: where is the relevant data located in the Bunsen Git repo?
>> - project: the project being tested
>> - testrun_id: unique identifier of the testrun within the project
>> - bunsen_testruns_branch: branch where the parsed testrun JSON is stored
>> - bunsen_testlogs_branch: branch where the testlogs are stored
>> - bunsen_commit_id: commit ID of the commit where the testlogs are stored
>
> (Not sure how much of this needs to be formally separated, vs. just
> basic key/value tuples associated with the testrun vs. already
> represented otherwise.)
The testrun_id appears to be basic to this scheme, since it's used to label
which key/value tuples are associated with which testruns.

> These could also go as additional key/value tuples into the testrun.
> (Prefix their names with "dejagnu-" to identify the analysis/parse
> tool that created them.)
Do you mean something like 'dejagnu-PASS' or 'dejagnu-PASS-count'?
That could work, with an iterator like:

    for outcome in dejagnu_all_outcomes: # 'PASS', 'FAIL', ...
        key = 'dejagnu-{}-count'.format(outcome) # e.g. 'dejagnu-PASS-count'
        val = testrun[key]
        yield outcome, key, val
        ...
