Mime-Version: 1.0
Message-Id: <320ed3c9-2612-4f64-bb1a-6a791bef4168@www.fastmail.com>
Date: Wed, 09 Mar 2022 10:07:25 -0500
From: "Serhei Makarov" <serhei@serhei.io>
To: Bunsen <bunsen@sourceware.org>
Subject: bunsen (re)design discussion #1: testrun & branch identifiers, testrun
 representation
Content-Type: text/plain;charset=utf-8
Content-Transfer-Encoding: quoted-printable

bunsen (re)design discussion #1: testrun & branch identifiers,
testrun representation

#1a Testrun Identifiers & Branching

Every testrun in the Bunsen repo should have a unique user-visible
identifier. Currently, this role is played by the testrun’s
bunsen_commit_id, which is the hash of the Git commit containing the
testlogs corresponding to the testrun.

[Q] Should the bunsen_commit_id be deprecated entirely as a unique
identifier? We could conceivably allow a single set of testlogs to
define several testruns for the *same* project, which would break the
uniqueness property for <project>+<bunsen_commit_id>.

Available fields for identifier:
- (optional) source -- where the testrun was submitted from
  - Very important to keep locally added testruns separate
    from testruns that were pulled from a remote Bunsen repo.
- (optional) project
- (subset of) timestamp -- not too granular is good
  -> could be date of testrun completion,
  or the date of submission as a fallback.
  Don't expect too much meaningful precision.
- (subset of) bunsen_commit_id -- not too long is good
- sequence_no -- only added as a last resort
  (this is tacked on in case of collisions for the canonical identifier,
  e.g. year-month-day + 3 letters from commit id)

When generating testruns in multiple projects from the same testlog
(as in the case of the gcc Jenkins tester), the identifiers should be
the same except for <project>.

Canonical format when displaying the identifier:
- <source>/<project>/<year-month-day>.<3 letters of bunsen_commit_id>.<sequence_no>
  e.g. mcermak/systemtap/2021-08-04.c54
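
A minimal sketch of how the canonical form could be assembled from the
fields listed above. The function name and signature are hypothetical
(nothing like this exists in Bunsen yet), and it assumes the timestamp
has already been reduced to a calendar date:

```python
from datetime import date
from typing import Optional


def canonical_testrun_id(source: Optional[str], project: Optional[str],
                         completed: date, bunsen_commit_id: str,
                         sequence_no: Optional[int] = None) -> str:
    """Build <source>/<project>/<year-month-day>.<3 letters>.<sequence_no>.

    Hypothetical helper, not part of the current Bunsen code.
    """
    # The optional leading fields are simply left out when unknown.
    prefix = "".join(f"{part}/" for part in (source, project) if part)
    ident = f"{completed.isoformat()}.{bunsen_commit_id[:3]}"
    # sequence_no is a last resort, tacked on only after a collision.
    if sequence_no is not None:
        ident += f".{sequence_no}"
    return prefix + ident
```

With source 'mcermak', project 'systemtap', date(2021, 8, 4) and a
commit id starting with 'c54', this yields the example ID shown above.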

When accepting an identifier as input, we should permit more
variation. Key properties:
- Not too complicated to type out by hand
- Can have reasonable elements in front for tab-completion
  - The top-level <source> and <project> fields can be dropped.
- Should *not* include query style fields (e.g. arch, osver).
  That's the job of our Bunsen repo querying solution.
- A subset of the full ID can be provided to identify the whole thing
  - Analysis scripts should fail gracefully if the ID is ambiguous,
    or treat ambiguous IDs as sets of testruns where appropriate.
- IDs pulled from different sources should usually not collide even if
  the
  source is omitted. [Q] Is the 3-letter commit id sufficient for this?
  - The 'canonical' identifier includes the source, so this property
    is purely for convenient user input. Collisions without the source
    are survivable.
- To accommodate fche's fine-grained branching idea, the
  branch names in the Bunsen repo can be based on the unique ID.
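
To make the 'subset of the full ID' and 'fail gracefully' properties
concrete, here is a rough sketch of partial-ID resolution. It assumes
the simplest possible rule: leading fields may be dropped and the last
field given may be a prefix. All names are hypothetical, and richer
forms (bare commit ids, shortened dates) would need more work:

```python
class AmbiguousTestrunID(Exception):
    """Raised when a partial ID matches more than one testrun."""


def match_testruns(partial, known_ids):
    """Return every canonical ID that the partial ID could refer to."""
    wanted = partial.split("/")
    matches = []
    for full in known_ids:
        parts = full.split("/")
        if len(wanted) > len(parts):
            continue
        # Leading <source>/<project> may be dropped, so compare tails.
        tail = parts[len(parts) - len(wanted):]
        # Earlier fields must match exactly; the last may be a prefix.
        if tail[:-1] == wanted[:-1] and tail[-1].startswith(wanted[-1]):
            matches.append(full)
    return matches


def resolve_one(partial, known_ids):
    """Resolve to exactly one testrun or fail with a clear error."""
    matches = match_testruns(partial, known_ids)
    if not matches:
        raise KeyError(f"no testrun matches {partial!r}")
    if len(matches) > 1:
        raise AmbiguousTestrunID(
            f"{partial!r} is ambiguous: {', '.join(matches)}")
    return matches[0]
```

Scripts that can work on sets of testruns would call match_testruns()
directly; scripts that need a single testrun would call resolve_one()
and report the candidates on failure.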

Possible formats when accepting an identifier as input:
- option 1: (<source>/)?(<project>/)?<bunsen_commit_id>
  - This corresponds to the current interface where we
    specify two options "project=... testrun=<commit_id>".
- option 2: (<source>/)?(<project>/)?<timestamp>.<bunsen_commit_id>(.<sequence_no>)?
  - Canonically, the timestamp is <year-month-day>
    but the user could specify a more granular timestamp.
- option 3: (<source>/)?(<project>/)?<full_timestamp>(.<sequence_no>)?
  - [Q] Needed for fche's suggestion to base the branch name on the
    identifier. Otherwise a nasty hack is needed to create the
    branch, commit to it, then rename the branch to include the
    bunsen_commit_id.
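
For illustration, a small parser covering options 1 and 2 (option 3,
with a full timestamp, is left out). Everything here is a sketch under
assumptions: the names are made up, a lone leading field is taken to
be the project, and the commit id is assumed to be lowercase hex:

```python
import re
from typing import NamedTuple, Optional


class ParsedID(NamedTuple):
    source: Optional[str]
    project: Optional[str]
    timestamp: Optional[str]
    commit_id: str
    sequence_no: Optional[int]


# Last path component: [<timestamp>.]<commit_id>[.<sequence_no>]
_LAST = re.compile(
    r"^(?:(?P<ts>\d{4}-\d{2}(?:-\d{2})?)\.)?"  # year-month[-day], optional
    r"(?P<commit>[0-9a-f]+)"                   # (prefix of) bunsen_commit_id
    r"(?:\.(?P<seq>\d+))?$"                    # optional sequence_no
)


def parse_testrun_id(text: str) -> ParsedID:
    *prefix, last = text.split("/")
    if len(prefix) > 2:
        raise ValueError(f"too many '/' fields in {text!r}")
    m = _LAST.match(last)
    if m is None:
        raise ValueError(f"unrecognized testrun ID {text!r}")
    # One leading field means <project>; two mean <source>/<project>.
    project = prefix[-1] if prefix else None
    source = prefix[0] if len(prefix) == 2 else None
    seq = m.group("seq")
    return ParsedID(source, project, m.group("ts"), m.group("commit"),
                    int(seq) if seq else None)
```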

[Q] Especially for CGI queries, slashes '/' should be replaceable with
either colon ':' or dot '.'.

Testrun ID examples
- 2021-08.a4bf2c
- systemtap/2021-08.a4b.1
- mcermak/systemtap/2021-08.a4b.1
- mcermak/systemtap/2021-08-12:15
- a4bf2c55
- systemtap/a4bf2c55

All of this machinery lets us specify natural-looking commands like:
$ bunsen show systemtap/2022-02-13/12
  -> show testrun
$ bunsen show systemtap/2022-02-13/12:systemtap.log*
  -> show testlog
$ bunsen show systemtap/2022-02-13/12:systemtap.log:400-500
  -> show subset of lines in a testlog
$ bunsen ls systemtap/2022-02
  -> list testruns matching a partial ID
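
A sketch of how the 'show' argument could be split into its testrun,
testlog and line-range parts. The names are hypothetical, and the
sketch assumes the testrun ID itself contains no ':' (which would
clash with IDs like 2021-08-12:15, so the real syntax needs more
thought):

```python
from typing import NamedTuple, Optional, Tuple


class ShowTarget(NamedTuple):
    testrun_id: str
    log_glob: Optional[str]           # e.g. 'systemtap.log*'
    lines: Optional[Tuple[int, int]]  # inclusive (start, end)


def parse_show_target(arg: str) -> ShowTarget:
    """Split '<id>[:<testlog>[:<start>-<end>]]' into its parts."""
    testrun_id, _, rest = arg.partition(":")
    if not rest:
        return ShowTarget(testrun_id, None, None)
    log_glob, _, span = rest.partition(":")
    if not span:
        return ShowTarget(testrun_id, log_glob, None)
    start, _, end = span.partition("-")
    # int() raises ValueError on a malformed range such as '400'.
    return ShowTarget(testrun_id, log_glob, (int(start), int(end)))
```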

And corresponding web queries, e.g.
- /bunsen-cgi.py?script=show_logs&id=systemtap.2022-02-13.12:systemtap.log:400-500
- [Q] Is it worth implementing per-command endpoints '/show_runs?id=...'?
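
For reference, a sketch of building such a query string with the
Python standard library. The helper name is hypothetical; it assumes
'/' in the ID is replaced by '.' and that ':' is left unescaped so the
URL stays readable:

```python
from urllib.parse import urlencode


def show_logs_query(testrun_id, log=None, lines=None):
    """Build a /bunsen-cgi.py query for the show_logs script."""
    # Slashes are awkward inside a query value, so use '.' instead.
    target = testrun_id.replace("/", ".")
    if log:
        target += f":{log}"
        if lines:
            target += f":{lines[0]}-{lines[1]}"
    # safe=":" keeps the colons readable instead of emitting %3A.
    return "/bunsen-cgi.py?" + urlencode(
        {"script": "show_logs", "id": target}, safe=":")
```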

The fields in the testrun ID are also used to generate the names of
the branches where the testruns and testlogs will be stored.

Note that the branch name may be nontrivially user-visible if we're
going by fche's suggestion of specifying subsets of branches to push
and pull. Which is why we're discussing it here rather than in the
'Repository Layout' portion.

(Testruns and testlogs branches are separate for two reasons -- (a)
testlogs branches replace the worktree with each commit while testruns
branches don't, and, (b) more relevantly to this discussion, we may
want to save some space by pulling some testruns branches but not the
corresponding testlogs branches.)

Possible formats for the branch name:
- mcermak/systemtap/testlogs-2021-08
  -> the current scheme, split into branches by year_month
- sergiodj/gdb/testlogs-2021-08-Ubuntu-Aarch64-m64
  -> a scheme I previously tried for the GDB buildbots,
  with additional splitting by buildbot name
- (mcermak/systemtap/testruns-2021-08-04-12:15.1 -> unlikely to scale well)
- mcermak/systemtap/testruns-2021-08/04-12:15.1
  -> fche’s suggestion to have 1 branch per testrun,
  probably requires the latter part of the branch
  identifier to be in a subdirectory like this

Overall, though, the branch name format can be fairly free-form and
different commit_logs scripts could even use different formats. The
only design goals are to adequately split the testlogs into a fairly
small number of fairly small branches (i.e. divide O(n*m) testlogs
into O(n) branches of O(m) testlogs), and to allow 'bunsen pull'
invocations to identify a meaningful subset of branches via wildcard
or regex. The content doesn't matter when accessing the repo, because
top-level index files specify the branch where testrun and testlog
data is stored. The only required elements IMO are in the prefix:
- (<source>/)?<project>/test{runs,logs}-...

The current Bunsen implementation also strongly insists on
<year-month> following immediately after test{runs,logs}, but I plan
to relax this requirement. Don't see any reason why not.

#1b Testrun Schema, very quick take 1

version fields: these identify the version of the project being tested
- source_commit: commit id in the project source repo
- [Q] source_branch: branch being tested in the project source repo
  - Suggested by keiths at one point.
  - Overspecifies the version, but useful for tracking buildbot testing?
- package_nvr: version id of a downstream package
  - [Q] Should this be the full nvr (e.g. 'systemtap-4.4-6.el7') or
    just the version number ('4.4-6.el7')?
- version: fallback field if the project uses some other type of
  versioning scheme
  - [Q] By default, we don’t know how to sort this. I could extend the
    analysis libraries to load the original parsing module for the
    project (e.g. systemtap.parse_dejagnu) and check for a method to
    extract the version sort key.

configuration fields: identify the system configuration on which the
project was tested
- arch: hardware architecture
- osver: identifier of the Linux distro e.g. fedora-36
- Other fields identify dependencies e.g. kernel_ver, gcc_ver
  - These are version fields similar to package_nvr. Analysis scripts
    should be configurable to sort on these instead of on / together
    with the project version.
- [Q] More complex configuration subtleties could be captured by
  parsing things like config.log and creating a related testrun with
  the yes/no results? Needs some thinking since an autoconf 'no' isn't
  a 'fail' for the purposes of reporting regressions.
- [Q] keiths had an alternate set of configuration fields for
  his GDB use case, should give some thought to this.
  - 'target_board' is the only one that isn't covered by existing fields?

testcases: an array of objects as follows
- name - DejaGNU .exp name
  - every testrun of a DejaGNU testsuite usually has the same .exps
- outcome - DejaGNU outcome code e.g. PASS, FAIL, KPASS, KFAIL, ...
- subtest - DejaGNU subtest, very very very freeform
  - a DejaGNU .exp doesn't always have the same subtests;
    subtest strings mix an identifying part with a diagnostic part;
    subtest strings aren't even necessarily unique within an .exp
- origin_sum - cursor (subset of lines) in the dejagnu .sum file
- origin_log - cursor (subset of lines) in the dejagnu .log file

[Q] (!!) The 'store PASS subtests separately' vs 'consolidate PASS
subtests' format tradeoff is a complex one & I'll need to write a
follow-up email to discuss it. Briefly, because the set of subtest
strings varies wildly from testrun to testrun, the 'state' of an .exp
is probably best described as 'the set of FAILing subtest strings in
the .exp'. If there are no FAILs we assume the .exp is working
correctly, which can be recorded as a single PASS; trying to keep
track of which PASS subtests are 'supposed' to be present and flagging
their absence as a failure would be a pretty daunting task. If we want
to do it, it's best left to some complex follow-up analysis script.
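
To illustrate the 'consolidate PASS subtests' side of the tradeoff,
here is a rough sketch that reduces a testcases array to the state
described above. The function names are hypothetical, the input is
assumed to be a list of dicts with the name/outcome/subtest fields,
and outcomes other than PASS and FAIL are ignored for brevity:

```python
from collections import defaultdict


def exp_states(testcases):
    """Map each .exp name to the set of its FAILing subtest strings."""
    states = defaultdict(set)
    for tc in testcases:
        fails = states[tc["name"]]  # every .exp gets an entry
        if tc["outcome"] == "FAIL":
            # Duplicate subtest strings collapse into one set element.
            fails.add(tc["subtest"])
    return {name: frozenset(fails) for name, fails in states.items()}


def consolidated(testcases):
    """One PASS per clean .exp, one FAIL entry per failing subtest."""
    out = []
    for name, fails in exp_states(testcases).items():
        if not fails:
            out.append({"name": name, "outcome": "PASS"})
        else:
            out.extend({"name": name, "outcome": "FAIL", "subtest": s}
                       for s in sorted(fails))
    return out
```

Comparing two testruns then amounts to comparing the exp_states()
dicts, without having to know which PASS subtests were 'supposed' to
be present.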

bookkeeping information: where is the relevant data located in the
Bunsen Git repo?
- project: the project being tested
- testrun_id: unique identifier of the testrun within the project
- bunsen_testruns_branch: branch where the parsed testrun JSON is stored
- bunsen_testlogs_branch: branch where the testlogs are stored
- bunsen_commit_id: commit ID of the commit where the testlogs are stored
- [Q] (_cursor_commit_ids - A hack to shorten the JSON representation
  of origin_{sum,log}, I will get rid of it as this field is typically
  redundant with bunsen_commit_id)

extras, purely for convenience
- pass_count
- fail_count
   - [Q] These are self-explanatory, but not as full-featured as
     DejaGNU’s summary of all outcome codes. I could remove this
     entirely, or add an outcome_counts field containing a map
     {‘PASS’: …, ‘FAIL’: …, ‘KFAIL’: …, …}
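
A sketch of computing such a map from the testcases array, assuming
each testcase is a dict with an 'outcome' field (the function name is
hypothetical):

```python
from collections import Counter


def outcome_counts(testcases):
    """Count testcases per DejaGNU outcome code."""
    return dict(Counter(tc["outcome"] for tc in testcases))
```

pass_count and fail_count then become counts.get('PASS', 0) and
counts.get('FAIL', 0). If PASS subtests are consolidated in the stored
JSON, the counts would have to be taken from the raw DejaGNU results
before consolidation.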

Next time on Bunsen:

- #2a: Repository Layout: including a more detailed discussion of
  pushing and pulling subsets of branches, in particular testruns-only
  or testruns+testlogs pull operations.

- #2b: SQLite Cache and Django-esque API:
  in particular, will discuss how the Git repo looks if
  parsed testrun data is stored in a database. I still expect having
  JSON in the repo to be canonical, however, because pulling a
  *subset* of testruns from an SQLite database would be much more
  complex than 'git pull <remote> <list of branches>'. ([fche] assures
  me, however, that cloning an SQLite repo in its entirety would be
  straightforward.)