public inbox for bunsen@sourceware.org
* bunsen (re)design discussion #1: testrun & branch identifiers, testrun representation
From: Serhei Makarov @ 2022-03-09 15:07 UTC
  To: Bunsen

bunsen (re)design discussion #1: testrun & branch identifiers, testrun representation

#1a Testrun Identifiers & Branching

Every testrun in the Bunsen repo should have a unique user-visible
identifier. Currently, this role is played by the testrun’s
bunsen_commit_id, which is the hash of the Git commit containing the
testlogs corresponding to the testrun.

[Q] Should the bunsen_commit_id be deprecated entirely as a unique
identifier? We could conceivably allow a single set of testlogs to
define several testruns for the *same* project, which would break the
uniqueness property for <project>+<bunsen_commit_id>

Available fields for identifier:
- (optional) source -- where the testrun was submitted from
  - Very important to keep locally added testruns separate
    from testruns that were pulled from a remote Bunsen repo.
- (optional) project
- (subset of) timestamp -- not too granular is good
  -> could be the date of testrun completion,
  or the date of submission as a fallback.
  Don't expect too much meaningful precision.
- (subset of) bunsen_commit_id -- not too long is good
- sequence_no -- only added as a last resort
  (this is tacked on in case of collisions for the canonical identifier,
  e.g. year-month-day + 3 letters from commit id)

When generating testruns in multiple projects from the same testlog,
(as in the case of the gcc Jenkins tester)
the identifiers should be the same except for <project>.

Canonical format when displaying the identifier:
- <source>/<project>/<year-month-day>.<3 letters of bunsen_commit_id>.<sequence_no>
  e.g. mcermak/systemtap/2021-08-04.c54
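
For concreteness, a minimal sketch of how a commit_logs script might
assemble the canonical form (function name and argument layout are
illustrative, not a settled API):

    def canonical_testrun_id(source, project, date, bunsen_commit_id, seq=None):
        # e.g. ('mcermak', 'systemtap', '2021-08-04', 'c54f...')
        #   -> 'mcermak/systemtap/2021-08-04.c54'
        ident = "{}/{}/{}.{}".format(source, project, date, bunsen_commit_id[:3])
        if seq is not None:
            ident += ".{}".format(seq)  # tacked on only to resolve a collision
        return ident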

When accepting an identifier as input, we should permit more
variation. Key properties:
- Not too complicated to type out by hand
- Can have reasonable elements in front for tab-completion
  - The top-level <source> and <project> fields can be dropped.
- Should *not* include query style fields (e.g. arch, osver).
  That's the job of our Bunsen repo querying solution.
- A subset of the full ID can be provided to identify the whole thing
  - Analysis scripts should fail gracefully if the ID is ambiguous,
    or treat ambiguous IDs as sets of testruns where appropriate.
- IDs pulled from different sources should usually not collide even if the
  source is omitted. [Q] Is the 3-letter commit id sufficient for this?
  - The 'canonical' identifier includes the source, so this property
    is purely for convenient user input. Collisions without the source
    are survivable.
- To accommodate fche's fine-grained branching idea, the
  branch names in the Bunsen repo can be based on the unique ID.

Possible formats when accepting an identifier as input:
- option 1: (<source>/)?(<project>/)?<bunsen_commit_id>
  - This corresponds to the current interface where we
    specify two options "project=... testrun=<commit_id>".
- option 2: (<source>/)?(<project>/)?<timestamp>.<bunsen_commit_id>(.<sequence_no>)?
  - Canonically, the timestamp is <year-month-day>
    but the user could specify a more granular timestamp.
- option 3: (<source>/)?(<project>/)?<full_timestamp>(.<sequence_no>)?
  - [Q] Needed for fche's suggestion to base the branch name on the
    identifier. Otherwise a nasty hack is needed to create the
    branch, commit to it, then rename the branch to include the
    bunsen_commit_id.

[Q] Especially for CGI queries, slashes '/' should be replaceable with
either colon ':' or dot '.'.
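
A rough sketch of an option-2-style parser (the regex is illustrative,
accepts ':' in place of '/', and leaves ambiguity handling to the caller):

    import re

    # (<source>/)?(<project>/)?<timestamp>(.<commit_prefix>)?(.<sequence_no>)?
    _ID_RE = re.compile(r'^(?:(?P<source>[\w-]+)[/:])??'
                        r'(?:(?P<project>[\w-]+)[/:])?'
                        r'(?P<timestamp>\d{4}-\d{2}(?:-\d{2})?)'
                        r'(?:\.(?P<commit>[0-9a-f]{3,40}))?'
                        r'(?:\.(?P<seq>\d+))?$')

    def parse_testrun_id(s):
        m = _ID_RE.match(s)
        return m.groupdict() if m else None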

Testrun ID examples
- 2021-08.a4bf2c
- systemtap/2021-08.a4b.1
- mcermak/systemtap/2021-08.a4b.1
- mcermak/systemtap/2021-08-12:15
- a4bf2c55
- systemtap/a4bf2c55

All of this machinery lets us specify natural-looking commands like:
$ bunsen show systemtap/2022-02-13/12
  -> show testrun
$ bunsen show systemtap/2022-02-13/12:systemtap.log*
  -> show testlog
$ bunsen show systemtap/2022-02-13/12:systemtap.log:400-500
  -> show subset of lines in a testlog
$ bunsen ls systemtap/2022-02
  -> list testruns matching a partial ID

And corresponding web queries, e.g.
- /bunsen-cgi.py?script=show_logs&id=systemtap.2022-02-13.12:systemtap.log:400-500
- [Q] Is it worth implementing per-command endpoints '/show_runs?id=...'?

The fields in the testrun ID are also used to generate the names of
the branches where the testruns and testlogs will be stored.

Note that the branch name may be nontrivially user-visible if we're
going by fche's suggestion of specifying subsets of branches to push
and pull. Which is why we're discussing it here rather than in the
'Repository Layout' portion.

(Testruns and testlogs branches are separate for two reasons -- (a)
testlogs branches replace the worktree with each commit while testruns
branches don't, and, (b) more relevantly to this discussion, we may
want to save some space by pulling some testruns branches but not the
corresponding testlogs branches.)

Possible formats for the branch name:
- mcermak/systemtap/testlogs-2021-08
  -> the current scheme, split into branches by year_month
- sergiodj/gdb/testlogs-2021-08-Ubuntu-Aarch64-m64
  -> a scheme I previously tried for the GDB buildbots,
  with additional splitting by buildbot name
- (mcermak/systemtap/testruns-2021-08-04-12:15.1 -> unlikely to scale well)
- mcermak/systemtap/testruns-2021-08/04-12:15.1
  -> fche’s suggestion to have 1 branch per testrun;
  this probably requires the latter part of the branch
  identifier to be in a subdirectory like this

Overall, though, the branch name format can be fairly free-form and
different commit_logs scripts could even use different formats. The
only design goals are to adequately split the testlogs into a fairly
small number of fairly small branches (i.e. divide O(n*m) testlogs
into O(n) branches of O(m) testlogs), and to allow 'bunsen pull'
invocations to identify a meaningful subset of branches via wildcard
or regex. The content doesn't matter when accessing the repo, because
top-level index files specify the branch where testrun and testlog
data is stored. The only required elements IMO are in the prefix:
- (<source>/)?<project>/test{runs,logs}-...
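
On the pulling side, selecting such a subset could be as simple as the
following sketch (with fnmatch standing in for whatever wildcard/regex
syntax we settle on):

    import fnmatch

    def select_branches(all_branches, pattern):
        # e.g. pattern = 'mcermak/systemtap/testlogs-2021-*'
        return [b for b in all_branches if fnmatch.fnmatch(b, pattern)]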

The current Bunsen implementation also strongly insists on
<year-month> following immediately after test{runs,logs}, but I plan
to relax this requirement; I don't see any reason to keep it.

#1b Testrun Schema, very quick take 1

version fields: these identify the version of the project being tested
- source_commit: commit id in the project source repo
- [Q] source_branch: branch being tested in the project source repo
  - Suggested by keiths at one point.
  - Overspecifies the version, but useful for tracking buildbot testing?
- package_nvr: version id of a downstream package
  - [Q] Should this be the full nvr (e.g. 'systemtap-4.4-6.el7') or
    just the version number ('4.4-6.el7')?
- version: fallback field if the project uses some other type of versioning scheme
  - [Q] By default, we don’t know how to sort this. I could extend the
    analysis libraries to load the original parsing module for the
    project (e.g. systemtap.parse_dejagnu) and check for a method to
    extract the version sort key.
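
That fallback might look something like the following sketch (the
module lookup mirrors the existing parsing-module layout, but the
method name 'version_sort_key' is hypothetical):

    import importlib

    def version_sort_key(project, version):
        # ask the project's parsing module for a sort key, if it has one
        mod = importlib.import_module(project + '.parse_dejagnu')
        if hasattr(mod, 'version_sort_key'):
            return mod.version_sort_key(version)
        return version  # fall back to plain string ordering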

configuration fields: identify the system configuration on which the project was tested
- arch: hardware architecture
- osver: identifier of the Linux distro e.g. fedora-36
- Other fields identify dependencies e.g. kernel_ver, gcc_ver
  - These are version fields similar to package_nvr. Analysis scripts
    should be configurable to sort on these instead of on / together
    with the project version.
- [Q] More complex configuration subtleties could be captured by
  parsing things like config.log and creating a related testrun with
  the yes/no results? Needs some thinking since an autoconf 'no' isn't
  a 'fail' for the purposes of reporting regressions.
- [Q] keiths had an alternate set of configuration fields for
  his GDB use case, should give some thought to this.
  - 'target_board' is the only one that isn't covered by existing fields?

testcases: an array of objects as follows 
- name - DejaGNU .exp name
  - every testrun of a DejaGNU testsuite usually has the same .exps
- outcome - DejaGNU outcome code e.g. PASS, FAIL, KPASS, KFAIL, ...
- subtest - DejaGNU subtest, very very very freeform
  - a DejaGNU .exp doesn't always have the same subtests;
    subtest strings mix an identifying part with a diagnostic part,
    and they aren't even necessarily unique within an .exp
- origin_sum - cursor (subset of lines) in the dejagnu .sum file
- origin_log - cursor (subset of lines) in the dejagnu .log file

[Q] (!!) The 'store PASS subtests separately' vs 'consolidate PASS
subtests' format tradeoff is a complex one & I'll need to write a
follow-up email to discuss it. Briefly, because the set of subtest
strings varies wildly from testrun to testrun, the 'state' of an .exp
is probably best described as 'the set of FAILing subtest strings in
the .exp'. If there are no FAILs we assume the .exp is working
correctly, which can be recorded as a single PASS; trying to keep
track of which PASS subtests are 'supposed' to be present and flagging
their absence as a failure would be a pretty daunting task. If we want
to do it, it's best left to some complex follow-up analysis script.

bookkeeping information: where is the relevant data located in the Bunsen Git repo?
- project: the project being tested
- testrun_id: unique identifier of the testrun within the project
- bunsen_testruns_branch: branch where the parsed testrun JSON is stored
- bunsen_testlogs_branch: branch where the testlogs are stored
- bunsen_commit_id: commit ID of the commit where the testlogs are stored
- [Q] (_cursor_commit_ids - A hack to shorten the JSON representation
  of origin_{sum,log}, I will get rid of it as this field is typically
  redundant with bunsen_commit_id)

extras, purely for convenience 
- pass_count
- fail_count
   - [Q] These are self-explanatory, but not as full-featured as
     DejaGNU’s summary of all outcome codes. I could remove this
     entirely, or add an outcome_counts field containing a map
     {‘PASS’: …, ‘FAIL’: …, ‘KFAIL’: …, …}
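
The outcome_counts variant would be cheap to compute at parse time,
e.g. (a sketch):

    from collections import Counter

    # summarize all DejaGNU outcome codes instead of just pass/fail
    outcome_counts = Counter(tc['outcome'] for tc in testrun['testcases'])
    # e.g. Counter({'PASS': 11243, 'FAIL': 5, 'KFAIL': 2})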

Next time on Bunsen:

- #2a: Repository Layout: including a more detailed discussion of
  pushing and pulling subsets of branches, in particular testruns-only
  or testruns+testlogs pull operations.

- #2b: SQLite Cache and Django-esque API:
  in particular, will discuss how the Git repo looks if
  parsed testrun data is stored in a database. I still expect having
  JSON in the repo to be canonical, however, because pulling a
  *subset* of testruns from an SQLite database would be much more
  complex than 'git pull <remote> <list of branches>'. ([fche] assures
  me, however, that cloning an SQLite repo in its entirety would be
straightforward.)


* Re: bunsen (re)design discussion #1: testrun & branch identifiers, testrun representation
From: Serhei Makarov @ 2022-03-09 22:25 UTC
  To: Serhei Makarov, Bunsen

* #1c Details of DejaGNU name/outcome/subtest

There are two options for the format of the 'testcases' array.

*Option 1:* PASS subtests are combined into a single PASS entry for each .exp.

Example: from a dejagnu sum file containing:

    01 Running foo.exp...
    02 PASS: rotate gears
    03 PASS: clean widgets (widget no.1,2,3)
    04 PASS: polish tea kettle
    05 Running bar.exp...
    06 PASS: greet guests
    07 FAIL: serve casserole (casserole was burnt)
    08 oven temperature was 1400 degrees F
    09 XFAIL: guests are upset 3/5 stars
    10 "You are a so-so host."
    11 PASS: clean house after guests depart

The Bunsen testcase entries corresponding to this sum file would be:
- {name:'foo.exp', outcome:'PASS',
  origin_sum:'project.sum:01-04'}
- {name:'bar.exp', outcome:'FAIL',
  subtest:'serve casserole (casserole was burnt)',
  origin_sum:'project.sum:07-08'}
- {name:'bar.exp', outcome:'XFAIL',
  subtest:'guests are upset 3/5 stars',
  origin_sum:'project.sum:09-10'}

The current testrun format, as extensively tested with the SystemTap buildbots,
combines PASS subtests into a single entry for each .exp (with 'subtest' field
omitted), but stores FAIL subtests as separate entries. When working with
portions of the testsuite that don't contain failures, this significantly
reduces the size of the JSON that needs to be processed.

The reasoning for why this format works is that the set of subtest messages
across different DejaGNU runs is extremely inconsistent. Therefore, we define
the 'state' of an entire .exp testcase as 'the set of FAIL subtests produced by
this testcase' and compare testruns at the .exp level accordingly.

[Q] We _assume_ that the 'absence' of a PASS subtest in a testrun is not a
problem, and consider it the testsuite's responsibility to explicitly signal a
FAIL when something doesn't work. Is this assumption accurate to how projects
use DejaGNU?

Note that the PASS subtests for bar.exp were dropped in the above example.
If an .exp's set of failures is empty (as with foo.exp), we mark the entire
.exp as a single 'PASS'; since bar.exp's set of failures is nonempty, we
record just its non-PASS subtests.
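
In code, the Option 1 rule sketches out roughly as follows (field names
as in the testcase schema from my first message):

    def consolidate_exp(exp_name, entries, origin_sum):
        # entries: parsed subtest entries for one .exp
        non_pass = [e for e in entries if e['outcome'] != 'PASS']
        if not non_pass:
            # no failures: collapse the whole .exp into a single PASS
            return [{'name': exp_name, 'outcome': 'PASS',
                     'origin_sum': origin_sum}]
        return non_pass  # keep FAIL/XFAIL/... subtests as separate entries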

*Option 2:* PASS subtests are stored separately just like FAIL subtests.
In this case, every subtest line in the testrun creates a corresponding
entry in the 'testcases' array.

keiths requested this mode for the GDB logs.

It takes up a lot more space (and although Bunsen's strong storage-level
de-duplication may make the size itself a moot issue, the extra entries still
slow down batch jobs working with bulk quantities of testcase data).

But it allows test results to be diffed to detect the 'absence of a PASS' as a problem.

In principle, I don't see a reason why Bunsen couldn't continue supporting
either mode, with some careful attention paid to the API in the various analysis
scripts (the analysis scripts contain a collection of helper functions for
working with passes, fails, diffs, etc. which has been gradually evolving into a
proper library).

* #1d Applying the testcase format scheme to config.log

The idea here is to include the yes/no answers from configure in the parsed
testcase data. This is a suggestion from fche I find very intriguing, especially
because changes in autoconf tests can correlate with regressions caused by the
environment.

Applying the scheme to config.log:
- name = "config.log"
- subtest = "checking for/whether VVV"
- outcome = yes (PASS) or no (FAIL)

This should probably be stored with both PASS and FAIL subtests 'separate'.
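
A sketch of that mapping (the regex is approximate; real config.log
output varies and interleaves configure: prefixes):

    import re

    _CHECK_RE = re.compile(r'(checking (?:for |whether )?.+?)\.\.\. (yes|no)$')

    def parse_config_log(lines):
        for line in lines:
            m = _CHECK_RE.search(line)
            if m:
                subtest, answer = m.groups()
                yield {'name': 'config.log', 'subtest': subtest,
                       'outcome': 'PASS' if answer == 'yes' else 'FAIL'}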

[Q] Where should the parsed config.log be stored?
- in the 'testcases' field of same testrun?
  -> analysis scripts must know how to treat 'config.log' FAIL
  entries and ignore them (as 'not real failures') if necessary.  
  In short, 'config.log' FAIL entries are relevant to diffing/bisection for a
  known problem, but not relevant when reporting regressions.
- in a different field of same testrun? e.g. 'config'?
  -> analysis scripts will ignore 'config' unless explicitly coded to look at it.
- in a testrun in a separate project? (e.g. 'systemtap' -> 'systemtap-config')
  -> this is similar to the gcc buildbot case, where one testsuite run
  will create testruns in several 'projects' (e.g. 'gcc','ld','gas',...)

[Q] In analysis scripts such as show_testcases, how to show changes in large
testcases more granularly (e.g. check.exp, bpf.exp)? A brainstorm:
- Add an option split_subtests which will try to show a single grid for every
  subtest
- Scan the history for subtest strings, possibly with common prefix. Try to
  reduce the set / combine subtest strings with identical history
- Generate a grid view for each subtest we've identified this way
- In the CGI HTML view, for each .exp testcase of the grid view without
  split_subtests, add a link to the split_subtests view of that particular .exp
  - This would be much better than the current mcermak-derived option to show
    subtests when a '+' table cell is clicked. The HTML is lighter-weight and
    the history of separate subtests is clearly visible.
- Possibly: identify the testcases which require more granularity
  (e.g. they are always failing, only the number of failures keeps changing)
  and expand them automatically in the top level grid view.

[Q] For someone testing projects on a company-internal setup, how do we
extract a 'safe' subset of data that can be shared with the public?

- Option 1: analysis results only (e.g. grid views without subtests are
  guaranteed-safe)

- Option 2: testrun data but not testlogs (includes subtest strings; these may
  or may not be safe)

- Option 3: testrun data with scrubbed subtest strings (replace several FAIL
  outcomes with one testcase entry whose subtest says 'N failures')

Note: Within the 'Makefile' scheme, the scrubbing could be handled by an analysis
script that produces project 'systemtap-sourceware' from 'systemtap'.
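
The Option 3 scrubbing itself could be as simple as this sketch:

    from collections import Counter

    def scrub_testcases(testcases):
        # replace per-subtest FAIL entries with one 'N failures' entry per .exp
        fail_counts = Counter(tc['name'] for tc in testcases
                              if tc['outcome'] == 'FAIL')
        scrubbed = [tc for tc in testcases if tc['outcome'] != 'FAIL']
        for name, n in fail_counts.items():
            scrubbed.append({'name': name, 'outcome': 'FAIL',
                             'subtest': '{} failures'.format(n)})
        return scrubbed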


* Re: bunsen (re)design discussion #1: testrun & branch identifiers, testrun representation
From: Frank Ch. Eigler @ 2022-03-10 18:04 UTC
  To: Serhei Makarov; +Cc: Bunsen

Hi -


> Every testrun in the Bunsen repo should have a unique user-visible
> identifier. [...]

Reminding: "testrun" = "analysis data derived from a set of log files
stored in a git branch nearby".

> [Q] Should the bunsen_commit_id be deprecated entirely as a unique
> identifier? We could conceivably allow a single set of testlogs to
> define several testruns for the *same* project, which would break the
> uniqueness property for <project>+<bunsen_commit_id>

Given that we're reconsidering use of git for purposes of storing this
analysis data, there is no obvious hash for it.  OTOH, given that it
is the result of analysis, we have a lot of rich data to query by.  I
suspect we don't need much of a standard identifier scheme at all.  An
arbitrary nickname for query shortcutting could be a field associated
with the "testrun" object itself.

> - To accommodate fche's fine-grained branching idea, the
>   branch names in the Bunsen repo can be based on the unique ID.

(I was talking about fine-grained branching *for the test logs*, not
for the derived analysis data.)


> All of this machinery lets us specify natural-looking commands like:
> $ bunsen show systemtap/2022-02-13/12
>   -> show testrun
> $ bunsen show systemtap/2022-02-13/12:systemtap.log*
>   -> show testlog
> $ bunsen show systemtap/2022-02-13/12:systemtap.log:400-500
>   -> show subset of lines in a testlog
> $ bunsen ls systemtap/2022-02
>   -> list testruns matching a partial ID
> [...]


> Overall, though, the branch name format can be fairly free-form and
> different commit_logs scripts could even use different formats. The
> only design goals are to adequately split the testlogs into a fairly
> small number of fairly small branches (i.e. divide O(n*m) testlogs
> into O(n) branches of O(m) testlogs), and to allow 'bunsen pull'
> invocations to identify a meaningful subset of branches via wildcard
> or regex.

Noting we switched to talking about testlogs - the raw test result log
files that are archived in git.  These (rather than analysis data
stored somewhere else) are indeed worth pushing & pulling &
transporting between bunsen installations.  I suspect that imposing a
naming standard here is not necessary either.  As long as we
standardize what bunsen consumes from these repos (every tag?  every
commit?  every branch?  every ref matching a given regex?), a human or
other tooling could decide their favorite naming convention.


> [...]
> #1b Testrun Schema, very quick take 1
> 
> version fields: these identify the version of the project being tested
> - source_commit: commit id in the project source repo
> - [Q] source_branch: branch being tested in the project source repo
>   - [...]
> - package_nvr: version id of a downstream package
>   - [Q] Should this be the full nvr (e.g. 'systemtap-4.4-6.el7') or
>     just the version number ('4.4-6.el7')?
> - version: fallback field if the project uses some other type of versioning scheme
>   - [Q] By default, we don’t know how to sort this. I could extend the
>     analysis libraries to load the original parsing module for the
>     project (e.g. systemtap.parse_dejagnu) and check for a method to
>     extract the version sort key.
> 
> configuration fields: identify the system configuration on which the project was tested
> - arch: hardware architecture
> - [...]
>   - 'target_board' is the only one that isn't covered by existing fields?

This list, and especially the "... and other fields ...", suggests to me
that these don't need to be tightly specified, but function as a set
of key/value tuples that the parsers would produce as results.  Yeah,
a conventional set is great for user convenience and for building
other analysis on top of the initial parse values.  But the list of
key names can be left open in the tooling core, so queries can use any
set of them.  i.e., the schema for "testrun" analysis objects could be:

   (id, keyword, value)

and then a user can find testruns by combination of keywords having
particular values (or range or pattern).  That can map
straightforwardly to a relational filter (select query) if that's how
we end up storing these things.
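
Concretely, with python's sqlite3, something like this sketch (table
and column names made up):

    import sqlite3

    conn = sqlite3.connect('bunsen.sqlite')
    conn.executescript('''
        create table if not exists testrun (id integer primary key);
        create table if not exists testrunkv (
            id integer references testrun(id),
            name text, value text);
        create index if not exists testrunkv_by_name on testrunkv (name, value);
    ''')
    # find testruns by a keyword predicate (value pattern):
    ids = [row[0] for row in conn.execute(
        "select id from testrunkv where name = ? and value like ?",
        ('osver', 'fedora-%'))]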


> testcases: an array of objects as follows 
> - name - DejaGNU .exp name
> - outcome - DejaGNU outcome code e.g. PASS, FAIL, KPASS, KFAIL, ...
> - subtest - DejaGNU subtest, very very very freeform
> - origin_sum - cursor (subset of lines) in the dejagnu .sum file
> - origin_log - cursor (subset of lines) in the dejagnu .log file

OK, so that could be a separate relational table of testrun analysis
data, derived from the testlog and related to the above set.


> [Q] (!!) The 'store PASS subtests separately' vs 'consolidate PASS
> subtests' format tradeoff is a complex one [...]

To the extent this represents an optimization, I'd tend to think of it
as a policy decision that the dejagnu parser configuration should make.


> bookkeeping information: where is the relevant data located in the Bunsen Git repo?
> - project: the project being tested
> - testrun_id: unique identifier of the testrun within the project
> - bunsen_testruns_branch: branch where the parsed testrun JSON is stored
> - bunsen_testlogs_branch: branch where the testlogs are stored
> - bunsen_commit_id: commit ID of the commit where the testlogs are stored

(Not sure how much of this needs to be formally separated, vs. just
basic key/value tuples associated with the testrun vs. already
represented otherwise.)


> - pass_count
> - fail_count
>    - [Q] These are self-explanatory, but not as full-featured as
>      DejaGNU’s summary of all outcome codes. I could remove this
>      entirely, or add an outcome_counts field containing a map
>      {‘PASS’: …, ‘FAIL’: …, ‘KFAIL’: …, …}

These could also go as additional key/value tuples into the testrun.
(Prefix their names with "dejagnu-" to identify the analysis/parse
tool that created them.)


- FChE



* Re: bunsen (re)design discussion #1: testrun & branch identifiers, testrun representation
From: Serhei Makarov @ 2022-03-10 20:00 UTC
  To: Frank Ch. Eigler; +Cc: Bunsen

Thanks for those comments. I believe the current point of disagreement is
whether we can abandon the JSON testrun storage format, while maintaining neat
properties for easy cloning / archival of entire Bunsen repos, subsets of Bunsen
repos, and so forth. I'm not convinced yet, especially if we go the route of
every Bunsen instance parsing and generating its own local testrun data.

However, your comments on how to specify the SQLite schema and split the
'dejagnu', 'autoconf', ... parser bits have convinced me to run my design
exercise in the other direction: try my hand at specifying an SQLite schema,
then check whether the same features can be replicated in the JSON version of
the format.

(When thinking in SQLite first, I can immediately think of a number of
'metaknowledge' bits that are currently hardcoded in the corresponding parsing
scripts, but might conveniently be maintained and extended in tables.
Things like knowing that "Red Hat Enterprise Linux \d+ (.*)" maps to
"rhel-\1". But I'll see if a more convincing list of examples comes to mind or not.)

On Thu, Mar 10, 2022, at 1:04 PM, Frank Ch. Eigler wrote:
>> [Q] Should the bunsen_commit_id be deprecated entirely as a unique
>> identifier? We could conceivably allow a single set of testlogs to
>> define several testruns for the *same* project, which would break the
>> uniqueness property for <project>+<bunsen_commit_id>
>
> Given that we're reconsidering use of git for purposes of storing this
> analysis data, there is no obvious hash for it.
Disagree. The bunsen_commit_id hash is based on the commit storing the testlogs,
not the commit storing the testruns JSON files. So moving the testruns to SQLite
wouldn't eliminate the possibility of using this hash in a testrun
identifier.

>OTOH, given that it
> is the result of analysis, we have a lot of rich data to query by.  I
> suspect we don't need much of a standard identifier scheme at all.  An
> arbitrary nickname for query shortcutting could be a field associated
> with the "testrun" object itself.
I don't get it.

The testrun ID could be stored in a field associated with the "testrun" object.
It is indeed an 'arbitrary nickname' that we are free to generate when the
testrun is created. But we do need such a nickname to exist in order to store
tuples in the (testrun_id, keyword, value) schema, as well as to issue
unambiguous commands like 'delete this testrun'.

In my prior email I was trying to improve on 'arbitrary hexsha' and design a
'nickname' that identifies the testrun somewhat and can be handled by human
minds when necessary. The bunsen_commit_id 'hexsha' was sufficiently unique
but it was also very confusing from a UI perspective since users will be
thinking of commit IDs in the project Git repo whenever they see a bare hexsha.

>> - To accommodate fche's fine-grained branching idea, the
>>   branch names in the Bunsen repo can be based on the unique ID.
>
> (I was talking about fine-grained branching *for the test logs*, not
> for the derived analysis data.)

If the testrun data is stored in SQLite, obviously the Git branching scheme
would only apply to the testlogs.

If the testrun data is stored in Git, the branching scheme would be
based on the same concerns about easily pulling a subset of the data.

I'm not convinced we need individual-testrun granularity for the branching, but
the branching scheme I outlined easily accommodates this option.

>> Overall, though, the branch name format can be fairly free-form and
>> different commit_logs scripts could even use different formats. The
>> only design goals are to adequately split the testlogs into a fairly
>> small number of fairly small branches (i.e. divide O(n*m) testlogs
>> into O(n) branches of O(m) testlogs), and to allow 'bunsen pull'
>> invocations to identify a meaningful subset of branches via wildcard
>> or regex.
>
> Noting we switched to talking about testlogs - the raw test result log
> files that are archived in git.  These (rather than analysis data
> stored somewhere else) are indeed worth pushing & pulling &
> transporting between bunsen installations.  I suspect that imposing a
> naming standard here is not necessary either.  As long as we
> standardize what bunsen consumes from these repos (every tag?  every
> commit?  every branch?  every ref matching a given regex?), a human or
> other tooling could decide their favorite naming convention.
I'll need to think about this carefully. My intuition screams a strong disagreement.

If we just treat the transported data as a set of unlabeled testlogs,
and say that each Bunsen instance is going to be naming them separately
& differently....

>> #1b Testrun Schema, very quick take 1
> [...] the list of
> key names can be left open in the tooling core, so queries can use any
> set of them.  i.e., the schema for "testrun" analysis objects could be:
>
>    (id, keyword, value)
>
> and then a user can find testruns by combination of keywords having
> particular values (or range or pattern).  That can map
> straightforwardly to a relational filter (select query) if that's how
> we end up storing these things.

As I understand it, this is a neat solution to my earlier question of needing to
define metadata fields in an SQLite schema and end up having to deal with
migrations, etc during the lifecycle of a Bunsen repo. Of course, it means we
definitely don't get Django-like ORM functionality for free: we need to
implement our own mapping between this (id, keyword, value) soup and individual
Testrun objects that combine all keywords for a particular id. This mapping
would take the place of the current code which translated Testrun objects to and
from JSON.

That said, there are properties to this scheme I like. The SQLite schema would
be the same across any Bunsen instance for any project, which satisfies the
goal of avoiding per-project schema definitions and migrations.

To test the 'straightforward' part of your claim about querying, what might a
'select' query look like for 'all configuration fields of {set of testruns}'?

We would want it to produce a table along the lines of:

    testrun arch    osver     kernel_ver
    id1     x86_64  fedora-35 4.6-whatever
    id2     aarch64 rhel-8    4.7-whatever

(Another functionality [Q] came up for distribution & kernel version:
we may want to store the most exact version of each component
but then specify a granularity for analysis
which combines some 'similar-enough' versions into
the same configuration
(e.g. treat all 5.x kernels as one configuration,
then treat all 4.x kernels as another configuration).

If we see a problem arise on 5.x but not 4.x,
*then* we would want to look at the detailed history of
changes within 5.x.)
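
In code, the coarsening might look like this sketch (the bucket scheme
is illustrative):

    import re

    def kernel_bucket(kernel_ver):
        # coarsen an exact version into a configuration bucket,
        # e.g. '5.14.0-70.el9' -> '5.x'
        m = re.match(r'(\d+)\.', kernel_ver)
        return m.group(1) + '.x' if m else kernel_ver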

>> testcases: an array of objects as follows 
>> - name - DejaGNU .exp name
>> - outcome - DejaGNU outcome code e.g. PASS, FAIL, KPASS, KFAIL, ...
>> - subtest - DejaGNU subtest, very very very freeform
>> - origin_sum - cursor (subset of lines) in the dejagnu .sum file
>> - origin_log - cursor (subset of lines) in the dejagnu .log file
>
> OK, so that could be a separate relational table of testrun analysis
> data, derived from the testlog and related to the above set.
Agreed.

>> [Q] (!!) The 'store PASS subtests separately' vs 'consolidate PASS
>> subtests' format tradeoff is a complex one [...]
>
> To the extent this represents an optimization, I'd tend to think of it
> as a policy decision that the dejagnu parser configuration should make.
Agreed, I would look at it that way too.

We might want to flag what format a stored testrun is using, in a way that can
be easily found out later, or changed by changing the parser configuration and
rebuilding.

>> bookkeeping information: where is the relevant data located in the Bunsen Git repo?
>> - project: the project being tested
>> - testrun_id: unique identifier of the testrun within the project
>> - bunsen_testruns_branch: branch where the parsed testrun JSON is stored
>> - bunsen_testlogs_branch: branch where the testlogs are stored
>> - bunsen_commit_id: commit ID of the commit where the testlogs are stored
>
> (Not sure how much of this needs to be formally separated, vs. just
> basic key/value tuples associated with the testrun vs. already
> represented otherwise.)
The testrun_id appears to be basic to this scheme, since it's used to label
which key/value tuples are associated with which testruns.

> These could also go as additional key/value tuples into the testrun.
> (Prefix their names with "dejagnu-" to identify the analysis/parse
> tool that created them.)
Do you mean something like 'dejagnu-PASS' or 'dejagnu-PASS-count'?
That could work, with an iterator like:

    for outcome in dejagnu_all_outcomes: # 'PASS', 'FAIL', ...
        key = 'dejagnu-{}-count'.format(outcome)
        val = testrun[key]
        yield outcome, key, val
        ...


* Re: bunsen (re)design discussion #1: testrun & branch identifiers, testrun representation
From: Frank Ch. Eigler @ 2022-03-10 23:00 UTC
  To: Serhei Makarov; +Cc: bunsen


> (When thinking in SQLite first, I can immediately think of a number of
> 'metaknowledge' bits that are currently hardcoded in the corresponding parsing
> scripts, but might conveniently be maintained and extended in tables.
> [...])

(Pure configuration could well just live outside too.)


>>> [Q] Should the bunsen_commit_id be deprecated entirely as a unique
>>> identifier? We could conceivably allow a single set of testlogs to
>>> define several testruns for the *same* project, which would break the
>>> uniqueness property for <project>+<bunsen_commit_id>
>>
>> Given that we're reconsidering use of git for purposes of storing this
>> analysis data, there is no obvious hash for it.
> Disagree. The bunsen_commit_id hash is based on the commit storing the testlogs,
> not the commit storing the testruns JSON files. So moving the testruns to SQLite
> wouldn't eliminate the possibility of using this hash in a testrun
> identifier.

I believe you were talking about identification (unique key) for
testruns rather than testlogs.

>>OTOH, given that it
>> is the result of analysis, we have a lot of rich data to query by.  I
>> suspect we don't need much of a standard identifier scheme at all.  An
>> arbitrary nickname for query shortcutting could be a field associated
>> with the "testrun" object itself.

> I don't get it.
>
> The testrun ID could be stored in a field associated with the "testrun" object.
> It is indeed an 'arbitrary nickname' that we are free to generate when the
> testrun is created. But we do need such a nickname to exist in order to store
> tuples in the (testrun_id, keyword, value) schema, as well as to issue
> unambiguous commands like 'delete this testrun'.

As far as the database is concerned, a testrun ID can just be some
random unique integer that the user never even sees.  A user may see the
nickname or some other synthetic description (kinda like git describe?)
when needed, which may be mappable back to the related rows.  The
testrun objects would relate to the testlog commit#.  (If that
relationship happens to be one-to-one, then that commit# could be a good
nickname.)


> In my prior email I was trying to improve on 'arbitrary hexsha' and
> design a 'nickname' that identifies the testrun somewhat and can be
> handled by human minds when necessary. The bunsen_commit_id 'hexsha'
> was sufficiently unique but it was also very confusing from a UI
> perspective since users will be thinking of commit IDs in the project
> Git repo whenever they see a bare hexsha.

Yeah.  I suspect an identification-by-query that happens to expose a
convenient nickname would work here.  And you're right, it'd be
unfortunate to have ambiguity between the testlog & testrun objects.
(OTOH it could be that the bunsen CLIs generally deal with testruns for
such things.)


>> Noting we switched to talking about testlogs - the raw test result log
>> files that are archived in git.  These (rather than analysis data
>> stored somewhere else) are indeed worth pushing & pulling &
>> transporting between bunsen installations.  I suspect that imposing a
>> naming standard here is not necessary either.  As long as we
>> standardize what bunsen consumes from these repos (every tag?  every
>> commit?  every branch?  every ref matching a given regex?), a human or
>> other tooling could decide their favorite naming convention.
>
> I'll need to think about this carefully. My intuition screams a strong
> disagreement.
>
> If we just treat the transported data as a set of unlabeled testlogs,
> and say that each Bunsen instance is going to be naming them separately
> & differently....

Well, if they are externally identified usually by common nickname, or
by predicates like time/tool/host or something, it need not vary from
installation to installation.


>>> #1b Testrun Schema, very quick take 1
>> [...] the list of
>> key names can be left open in the tooling core, so queries can use any
>> set of them.  i.e., the schema for "testrun" analysis objects could be:
>>
>>    (id, keyword, value)
>>
>> and then a user can find testruns by combination of keywords having
>> particular values (or range or pattern).  That can map
>> straightforwardly to a relational filter (select query) if that's how
>> we end up storing these things.
>
> As I understand it, this is a neat solution to my earlier question of needing to
> define metadata fields in an SQLite schema and end up having to deal with
> migrations, etc during the lifecycle of a Bunsen repo. Of course, it means we
> definitely don't get Django-like ORM functionality for free [...]

Yes, right.  If one doesn't reify the set of keys to describe a testrun,
they won't show up as ORM columns/fields.



> [...]
> To test the 'straightforward' part of your claim about querying, what might a
> 'select' query look like for 'all configuration fields of {set of testruns}'?
>
> We would want it to produce a table along the lines of:
>
>     testrun arch    osver     kernel_ver
>     id1     x86_64  fedora-35 4.6-whatever
>     id2     aarch64 rhel-8    4.7-whatever

In SQL, it'd be a join like this:

select tr.id, kv1.value, kv2.value, kv3.value
  from testrun tr, testrunkv kv1, testrunkv kv2, testrunkv kv3
 where kv1.id = tr.id and kv1.name = 'arch'
   and kv2.id = tr.id and kv2.name = 'osver'
   and kv3.id = tr.id and kv3.name = 'kernel_ver';

(modulo testrun nickname).


> (Another functionality [Q] came up for distribution & kernel version:
> we may want to store the most exact version of each component
> but then specify a granularity for analysis
> which combines some 'similar-enough' versions into
> the same configuration
> (e.g. treat all 5.x kernels as one configuration,
> then treat all 4.x kernels as another configuration).
>
> If we see a problem arise on 5.x but not 4.x,
> *then* we would want to look at the detailed history of
> changes within 5.x.)

That should be expressible in a variety of ways, even within sql ...

   where .... and kv3.value like '4.%';


>> These could also go as additional key/value tuples into the testrun.
>> (Prefix their names with "dejagnu-" to identify the analysis/parse
>> tool that created them.)
> Do you mean something like 'dejagnu-PASS' or 'dejagnu-PASS-count'?
> That could work, with an iterator like:
>
>     for outcome in dejagnu_all_outcomes: # 'PASS', 'FAIL', ...
>         key = 'dejagnu-{}-count'.format(outcome)
>         val = testrun[key]
>         yield outcome, key, val
>         ...

Yeah.  Just a place to stash summaries.

Or: why not, a whole separate derived analysis table with only
a handful of rows:
   dejagnu-TOOL-counts (testrun_id, outcome, count)


- FChE



* Re: bunsen (re)design discussion #1: testrun & branch identifiers, testrun representation
From: Serhei Makarov @ 2022-03-14 17:24 UTC
  To: Frank Ch. Eigler; +Cc: Bunsen

My main concern with the SQLite storage schema is how well SQLite
handles string deduplication. The main source of redundancy in parsed
testrun data is the subtest strings repeated across different
testruns, so we might end up needing to do an additional join as
follows:

    (name, outcome, subtest_id) >< (subtest_id, subtest_text)
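
The dedup side of that join would be an interning step at insert time,
roughly as follows (a sketch, assuming the subtest_strs table from the
schema below):

    def intern_subtest(conn, text):
        # return an existing subtest_id for this string, or insert a new row
        row = conn.execute("select subtest_id from subtest_strs"
                           " where subtest_text = ?", (text,)).fetchone()
        if row is not None:
            return row[0]
        cur = conn.execute("insert into subtest_strs (subtest_text)"
                           " values (?)", (text,))
        return cur.lastrowid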

Anyways, rather than guesswork back-and-forth over whether going
SQLite-only is a good decision, I'm thinking over how to rework the
bunsen/model.py class outlines to experiment with it.

Given the weak correspondence between Testrun/Testcase objects and the
SQLite tables, we end up needing a set of explicit ser/de methods under
the hood, just as with the JSON representation.

My current understanding of the schema would be as follows:

- testruns (testrun_id, project, testlogs_commit_id)
- ANALYSIS_testrun_kvs (testrun_id, key, value)
  - For analysis results that annotate the original testrun.
- testrun_kv_types (project, key, schema)
  - We will probably need to store some 'schema' information like this.
- testcases (testrun_id, name, outcome, subtest_id)
- subtest_strs (subtest_id, subtest_text)

According to your vision, there would also need to be additional
tables for analysis results which don't follow the format of testrun
key-value annotations. Not relevant just yet, and we may need to
decide a different schema for each category of analysis (diff, grid
view, regression report, etc.).

You are hoping to track fine-grained provenance, e.g. which testruns
are present in each regression report. This is not the same as the
dependency info used to decide when an analysis should be regenerated:
the regression report would be re-run in response to changes in the
entire set of testruns it's run on, including testruns that are not
present in the regression report and therefore wouldn't be marked in
the provenance info. If the queries producing the input set of
testruns are complex, caching them as a set of testruns would be a
loss in terms of storage; you would need to re-run the analysis
whenever the project changes.

First attempt at class outlines:

    class Testrun(dict):
        def __init__(self, repo=None, from_json=None, from_sqlite=None, schema=None):
           self._schema = schema
           self._repo = repo
           ...
        
        def _load_json(self, from_json=None) # from_json is a JSON object
        def _load_sqlite(self, from_sqlite=None) # from_sqlite is a testrun_id
        
        # two calling conventions (in practice one method with optional
        # args, since Python doesn't support overloading by signature):
        def add_testcase(self, tc) # tc is a Testcase obj
        def add_testcase(self, name, outcome, subtest=None) # create a Testcase obj
        # use like:
        #   tc = trun.add_testcase("foo.exp", "PASS", subtest="subtest text")
        #   tc.origin_log = ...

        # ... methods for updating key/values with analysis provenance, e.g. ...
        def set_provenance(self, field_name, analysis_name)
        def set_field(self, field_name, field_value, analysis_name)
        
        def validate(self, ...) # checks required fields for saving to DB
        def save(self) # update tables in SQLite DB
        
        def to_json(self, summary=False, pretty=False, as_dict=False) # create JSON object

    class Testcase(dict):
        def __init__(self, from_json=None, from_sqlite=None, schema=None,
                           parent_testrun=None, repo=None):
            self._schema = schema
            if parent_testrun is None:
                self._repo = repo
            self.parent_testrun = parent_testrun
            ...
        
        def _load_json(self, from_json=None) # from_json is a JSON object
        def _load_sqlite(self, from_sqlite=None)
        
        def set_parent(self, parent_testrun) # if created with parent_testrun=None
        
        def validate(self, ...) # checks required fields for saving to DB
        def save(self) # update tables in SQLite DB
        def delete(self) # delete from tables in SQLite DB
        
        def to_json(self, summary=False, pretty=False, as_dict=False) # create JSON object

The 'repo' field in each object refers to a bunsen.Repo object,
which specifies such things as where testrun data is stored
(in Git repo or in an SQLite cache, or in a mix of both).

If storing analysis-derived key-values in separate tables per
analysis, an analysis could be implemented along the lines of:

    # in my_analysis.py
    for testrun in input_testruns:
        testrun.key = analyze_stuff() # by default, sets provenance to my_analysis
        testrun.save() # saves the updated key in my_analysis_testrun_kvs,
        # without touching the tables that don't belong to my_analysis

You could re-run this code whenever input_testruns changes
and the my_analysis_testrun_kvs table would be updated downstream?

A minimal ORMish template for a cacheable Analysis object *not* tied
to testrun key-values might be:

    class Analysis(dict):
        def __init__(self, repo=None, from_json=None, from_sqlite=None, ...):
            self._repo = repo
            ...

        def _load_json(self, from_json=None) # from_json is a JSON object
        def _load_sqlite(self, from_sqlite=None) # from_sqlite is an analysis id

        def validate(self, ...) # checks required fields for saving to DB
        def save(self) # update tables in SQLite DB
        
        def to_json(self, summary=False, pretty=False, as_dict=False) # create JSON object


* Re: bunsen (re)design discussion #1: testrun & branch identifiers, testrun representation
From: Keith Seitz @ 2022-04-07 16:42 UTC
  To: Serhei Makarov, Frank Ch. Eigler; +Cc: Bunsen

On 3/14/22 10:24, Serhei Makarov wrote:
> Given the weak correspondence between Testrun/Testcase objects and the
> SQLite tables, we end up needing a set of explicit ser/de methods under
> the hood, just as with the JSON representation.

In Sept 2020, I played with this a bit, and I have old (probably stale)
patches lying around to let SQLite do the serialization. The patches
do nothing to replace the JSON or the presented data model.

I reported my initial findings here:

https://sourceware.org/pipermail/bunsen/2020q3/000034.html

If you'd like to peek at that work, I can send it along, but
it is probably quite bit-rotted by now.

> My current understanding of the schema would be as follows:
> 
> - testruns (testrun_id, project, testlogs_commit_id)
> - ANALYSIS_testrun_kvs (testrun_id, key, value)
>    - For analysis results that annotate the original testrun.
> - testrun_kv_types (project, key, schema)
>    - We will probably need to store some 'schema' information like this.
> - testcases (testrun_id, name, outcome, subtest_id)
> - subtest_strs (subtest_id, subtest_text)
> 
> According to your vision, there would also need to be additional
> tables for analysis results which don't follow the format of testrun
> key-value annotations. Not relevant just yet, and we may need to
> decide a different schema for each category of analysis (diff, grid
> view, regression report, etc.).

I have a prototype schema just to record test data. Bunsen's git repo
is still used to store the actual .sum/.log files and provide commit
IDs to identify a particular testrun, i.e., no bunsen metadata.

Keith


