From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <fche@redhat.com>
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
 by sourceware.org (Postfix) with ESMTPS id 314AB3858D39
 for <bunsen@sourceware.org>; Thu, 10 Mar 2022 18:04:21 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 314AB3858D39
Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com
 [209.132.183.4]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
 us-mta-630-u_7r0R-DNlSHhHJmAWhxyQ-1; Thu, 10 Mar 2022 13:04:17 -0500
X-MC-Unique: u_7r0R-DNlSHhHJmAWhxyQ-1
Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com
 [10.5.11.13])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mimecast-mx01.redhat.com (Postfix) with ESMTPS id A3972520F;
 Thu, 10 Mar 2022 18:04:16 +0000 (UTC)
Received: from redhat.com (unknown [10.2.16.64])
 by smtp.corp.redhat.com (Postfix) with ESMTPS id 5812984034;
 Thu, 10 Mar 2022 18:04:16 +0000 (UTC)
Received: from fche by redhat.com with local (Exim 4.94.2)
 (envelope-from <fche@redhat.com>)
 id 1nSN94-0000K5-Ra; Thu, 10 Mar 2022 13:04:14 -0500
Date: Thu, 10 Mar 2022 13:04:14 -0500
From: "Frank Ch. Eigler" <fche@redhat.com>
To: Serhei Makarov <serhei@serhei.io>
Cc: Bunsen <bunsen@sourceware.org>
Subject: Re: bunsen (re)design discussion #1: testrun & branch identifiers,
 testrun representation
Message-ID: <20220310180414.GC28310@redhat.com>
References: <320ed3c9-2612-4f64-bb1a-6a791bef4168@www.fastmail.com>
MIME-Version: 1.0
In-Reply-To: <320ed3c9-2612-4f64-bb1a-6a791bef4168@www.fastmail.com>
User-Agent: Mutt/1.12.0 (2019-05-25)
X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
X-Spam-Status: No, score=-5.6 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, RCVD_IN_DNSWL_NONE,
 RCVD_IN_MSPIKE_H5, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_NONE, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: bunsen@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Bunsen mailing list <bunsen.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/bunsen>,
 <mailto:bunsen-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/bunsen/>
List-Help: <mailto:bunsen-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/bunsen>,
 <mailto:bunsen-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Thu, 10 Mar 2022 18:04:23 -0000

Hi -


> Every testrun in the Bunsen repo should have a unique user-visible
> identifier. [...]

Reminder: "testrun" = "analysis data derived from a set of log files
stored in a git branch nearby".

> [Q] Should the bunsen_commit_id be deprecated entirely as a unique
> identifier? We could conceivably allow a single set of testlogs to
> define several testruns for the *same* project, which would break the
> uniqueness property for <project>+<bunsen_commit_id>

Given that we're reconsidering use of git for purposes of storing this
analysis data, there is no obvious hash for it.  OTOH, given that it
is the result of analysis, we have a lot of rich data to query by.  I
suspect we don't need much of a standard identifier scheme at all.  An
arbitrary nickname for query shortcutting could be a field associated
with the "testrun" object itself.

> - To accommodate fche's fine-grained branching idea, the
>   branch names in the Bunsen repo can be based on the unique ID.

(I was talking about fine-grained branching *for the test logs*, not
for the derived analysis data.)


> All of this machinery lets us specify natural-looking commands like:
> $ bunsen show systemtap/2022-02-13/12
>   -> show testrun
> $ bunsen show systemtap/2022-02-13/12:systemtap.log*
>   -> show testlog
> $ bunsen show systemtap/2022-02-13/12:systemtap.log:400-500
>   -> show subset of lines in a testlog
> $ bunsen ls systemtap/2022-02
>   -> list testruns matching a partial ID
> [...]


> Overall, though, the branch name format can be fairly free-form and
> different commit_logs scripts could even use different formats. The
> only design goals are to adequately split the testlogs into a fairly
> small number of fairly small branches (i.e. divide O(n*m) testlogs
> into O(n) branches of O(m) testlogs), and to allow 'bunsen pull'
> invocations to identify a meaningful subset of branches via wildcard
> or regex.

Note that we have switched to talking about testlogs: the raw test
result log files that are archived in git.  These (rather than analysis
data stored somewhere else) are indeed worth pushing & pulling &
transporting between bunsen installations.  I suspect that imposing a
naming standard here is not necessary either.  As long as we
standardize what bunsen consumes from these repos (every tag?  every
commit?  every branch?  every ref matching a given regex?), a human or
other tooling could decide their favorite naming convention.
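
To illustrate the "every ref matching a given regex" option, here is a
minimal sketch.  The function name select_refs and the ref names are
hypothetical, invented for the example; nothing here is an existing
bunsen interface.  It only shows that the consumer side needs one
pattern, not a naming standard.

```python
import re


def select_refs(ref_names, pattern):
    """Return the refs whose full name matches the given regex.

    Hypothetical helper: bunsen would be told one pattern per repo,
    and any naming convention that the pattern accepts would work.
    """
    rx = re.compile(pattern)
    return [name for name in ref_names if rx.fullmatch(name)]


# Made-up ref names, as a human or other tooling might have chosen them.
refs = [
    "refs/heads/systemtap/2022-02",
    "refs/heads/systemtap/2022-03",
    "refs/heads/gdb/2022-03",
    "refs/tags/nightly-1",
]

selected = select_refs(refs, r"refs/heads/systemtap/.*")
```

With this, the naming convention stays a local decision; only the
pattern would need to be configured per installation.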


> [...]
> #1: Testrun Schema, very quick take1
> 
> version fields: these identify the version of the project being tested
> - source_commit: commit id in the project source repo
> - [Q] source_branch: branch being tested in the project source repo
>   - [...]
> - package_nvr: version id of a downstream package
>   - [Q] Should this be the full nvr (e.g. 'systemtap-4.4-6.el7') or
>     just the version number ('4.4-6.el7')?
> - version: fallback field if the project uses some other type of versioning scheme
>   - [Q] By default, we don’t know how to sort this. I could extend the
>     analysis libraries to load the original parsing module for the
>     project (e.g. systemtap.parse_dejagnu) and check for a method to
>     extract the version sort key.
> 
> configuration fields: identify the system configuration on which the project was tested
> - arch: hardware architecture
> - [...]
>   - 'target_board' is the only one that isn't covered by existing fields?

This list, and especially the "... and other fields ..." suggest to me
that these don't need to be tightly specified, but function as a set
of key/value tuples that the parsers would produce as results.  Yeah,
a conventional set is great for user convenience and for building
other analysis on top of the initial parse values.  But the list of
key names can be left open in the tooling core, so queries can use any
set of them.  That is, the schema for "testrun" analysis objects could be:

   (id, keyword, value)

and then a user can find testruns by combination of keywords having
particular values (or range or pattern).  That can map
straightforwardly to a relational filter (select query) if that's how
we end up storing these things.
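
A minimal sketch of that mapping, assuming SQLite as the store (not a
decision; just convenient for illustration).  The table name
testrun_kv, the helper find_testruns, and the sample rows are all
hypothetical.  Each criterion becomes one select over the
(id, keyword, value) table, and the selects are intersected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per (testrun id, keyword, value) tuple; the key names are open.
conn.execute("CREATE TABLE testrun_kv (id INTEGER, keyword TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO testrun_kv VALUES (?, ?, ?)",
    [
        (1, "arch", "x86_64"),
        (1, "source_branch", "master"),
        (2, "arch", "aarch64"),
        (2, "source_branch", "master"),
        (3, "arch", "x86_64"),
        (3, "source_branch", "release-4.4"),
    ],
)


def find_testruns(conn, **criteria):
    """Return sorted ids of testruns matching every keyword=pattern pair.

    Patterns use SQLite GLOB syntax, so exact values and wildcards
    such as 'x86*' both work.
    """
    if not criteria:
        rows = conn.execute("SELECT DISTINCT id FROM testrun_kv")
        return sorted(r[0] for r in rows)
    clauses = []
    params = []
    for keyword, pattern in criteria.items():
        clauses.append(
            "SELECT id FROM testrun_kv WHERE keyword = ? AND value GLOB ?"
        )
        params.extend([keyword, pattern])
    query = " INTERSECT ".join(clauses)
    return sorted(r[0] for r in conn.execute(query, params))


matching = find_testruns(conn, arch="x86_64", source_branch="master")
```

Range queries would need typed or sortable values, which this sketch
does not attempt.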


> testcases: an array of objects as follows 
> - name - DejaGNU .exp name
> - outcome - DejaGNU outcome code e.g. PASS, FAIL, KPASS, KFAIL, ...
> - subtest - DejaGNU subtest, very very very freeform
> - origin_sum - cursor (subset of lines) in the dejagnu .sum file
> - origin_log - cursor (subset of lines) in the dejagnu .log file

OK, so that could be a separate relational table of testrun analysis
data, derived from the testlog and related to the above set.
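
As a rough sketch of how such a table could relate to the key/value
tuples, again assuming SQLite purely for illustration.  The table
names, the textual cursor format ("file:first-last"), and the sample
rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE testrun_kv (id INTEGER, keyword TEXT, value TEXT);
    CREATE TABLE testcase (
        testrun_id INTEGER,  -- refers to testrun_kv.id
        name TEXT,           -- DejaGNU .exp name
        outcome TEXT,        -- PASS, FAIL, KPASS, KFAIL, ...
        subtest TEXT,        -- freeform
        origin_sum TEXT,     -- cursor into the .sum file (made-up format)
        origin_log TEXT      -- cursor into the .log file (made-up format)
    );
    """
)
conn.executemany(
    "INSERT INTO testrun_kv VALUES (?, ?, ?)",
    [(1, "arch", "x86_64"), (2, "arch", "aarch64")],
)
conn.executemany(
    "INSERT INTO testcase VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "a.exp", "PASS", "compile", "systemtap.sum:10-10", "systemtap.log:100-120"),
        (1, "b.exp", "FAIL", "run", "systemtap.sum:11-11", "systemtap.log:121-180"),
        (2, "b.exp", "PASS", "run", "systemtap.sum:11-11", "systemtap.log:130-150"),
    ],
)

# Failing testcases among testruns whose arch is x86_64.
failures = conn.execute(
    """
    SELECT t.name, t.subtest
    FROM testcase AS t
    JOIN testrun_kv AS k ON k.id = t.testrun_id
    WHERE k.keyword = 'arch' AND k.value = 'x86_64' AND t.outcome = 'FAIL'
    ORDER BY t.name
    """
).fetchall()
```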


> [Q] (!!) The 'store PASS subtests separately' vs 'consolidate PASS
> subtests' format tradeoff is a complex one [...]

To the extent this represents an optimization, I'd tend to think of it
as a policy decision that the dejagnu parser configuration should make.
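
One way to picture that, as a sketch only: the parser takes a
configuration switch and applies it to its own output.  The function
apply_pass_policy, the consolidate_pass option, and the dict layout
are hypothetical and say nothing about which side of the tradeoff is
better.

```python
def apply_pass_policy(testcases, consolidate_pass=True):
    """Apply a (hypothetical) parser policy for PASS subtests.

    With consolidate_pass=False every record is kept as parsed.
    With consolidate_pass=True, PASS records are merged into a single
    record per .exp name; all other outcomes are kept in full.
    """
    if not consolidate_pass:
        return list(testcases)
    result = []
    seen_pass = set()
    for tc in testcases:
        if tc["outcome"] == "PASS":
            if tc["name"] in seen_pass:
                continue  # already represented by one consolidated record
            seen_pass.add(tc["name"])
            result.append({"name": tc["name"], "outcome": "PASS"})
        else:
            result.append(tc)
    return result


parsed = [
    {"name": "a.exp", "outcome": "PASS", "subtest": "compile"},
    {"name": "a.exp", "outcome": "PASS", "subtest": "run"},
    {"name": "a.exp", "outcome": "FAIL", "subtest": "cleanup"},
]
consolidated = apply_pass_policy(parsed, consolidate_pass=True)
```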


> bookkeeping information: where is the relevant data located in the Bunsen Git repo?
> - project: the project being tested
> - testrun_id: unique identifier of the testrun within the project
> - bunsen_testruns_branch: branch where the parsed testrun JSON is stored
> - bunsen_testlogs_branch: branch where the testlogs are stored
> - bunsen_commit_id: commit ID of the commit where the testlogs are stored

(Not sure how much of this needs to be formally separated, vs. just
basic key/value tuples associated with the testrun vs. already
represented otherwise.)


> - pass_count
> - fail_count
>    - [Q] These are self-explanatory, but not as full-featured as
>      DejaGNU’s summary of all outcome codes. I could remove this
>      entirely, or add an outcome_counts field containing a map
>      {‘PASS’: …, ‘FAIL’: …, ‘KFAIL’: …, …}

These could also go as additional key/value tuples into the testrun.
(Prefix their names with "dejagnu-" to identify the analysis/parse
tool that created them.)
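
A small sketch of what a parser might emit, assuming the same
(id, keyword, value) tuple shape.  The exact key spelling
("dejagnu-<outcome>_count") and the helper name outcome_tuples are
made up for the example.

```python
from collections import Counter


def outcome_tuples(testrun_id, outcomes, prefix="dejagnu-"):
    """Turn a list of outcome codes into prefixed (id, keyword, value) tuples.

    The prefix names the analysis/parse tool that produced the counts.
    Values are stored as strings, like any other tuple value.
    """
    counts = Counter(outcomes)
    return [
        (testrun_id, f"{prefix}{code.lower()}_count", str(n))
        for code, n in sorted(counts.items())
    ]


tuples = outcome_tuples(1, ["PASS", "PASS", "FAIL", "KFAIL"])
```

This covers every outcome code that actually occurs, without fixing
the list of codes in the tooling core.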


- FChE