bunsen (re)design discussion #1: testrun & branch identifiers, testrun representation

#1a Testrun Identifiers & Branching

Every testrun in the Bunsen repo should have a unique user-visible identifier.
Currently, this role is played by the testrun's bunsen_commit_id, which is the
hash of the Git commit containing the testlogs corresponding to the testrun.

[Q] Should the bunsen_commit_id be deprecated entirely as a unique identifier?
We could conceivably allow a single set of testlogs to define several testruns
for the *same* project, which would break the uniqueness property for
<project>+<bunsen_commit_id>.

Available fields for the identifier:
- (optional) source -- where the testrun was submitted from
  - Very important to keep locally added testruns separate from testruns that
    were pulled from a remote Bunsen repo.
- (optional) project
- (subset of) timestamp -- not too granular is good -> could be the date of
  testrun completion, or the date of submission as a fallback. Don't expect
  too much meaningful precision.
- (subset of) bunsen_commit_id -- not too long is good
- sequence_no -- only added as a last resort (this is tacked on in case of
  collisions for the canonical identifier, e.g. year-month-day + 3 letters
  from the commit id)

When generating testruns in multiple projects from the same testlog (as in
the case of the gcc Jenkins tester), the identifiers should be the same
except for <project>.

Canonical format when displaying the identifier:
- <source>/<project>/<year-month-day>.<3 letters of bunsen_commit_id>.<sequence_no>
  e.g. mcermak/systemtap/2021-08-04.c54

When accepting an identifier as input, we should permit more variation. Key
properties:
- Not too complicated to type out by hand
- Can have reasonable elements in front for tab-completion
- The top-level <source> and <project> fields can be dropped.
- Should *not* include query-style fields (e.g. arch, osver). That's the job
  of our Bunsen repo querying solution.
- A subset of the full ID can be provided to identify the whole thing
- Analysis scripts should fail gracefully if the ID is ambiguous, or treat
  ambiguous IDs as sets of testruns where appropriate.
- IDs pulled from different sources should usually not collide even if the
  source is omitted. [Q] Is the 3-letter commit id sufficient for this?
  - The 'canonical' identifier includes the source, so this property is
    purely for convenient user input. Collisions without the source are
    survivable.
- To accommodate fche's fine-grained branching idea, the branch names in the
  Bunsen repo can be based on the unique ID.

Possible formats when accepting an identifier as input:
- option 1: (<source>/)?(<project>/)?<bunsen_commit_id>
  - This corresponds to the current interface where we specify two options
    "project=... testrun=<commit_id>".
- option 2: (<source>/)?(<project>/)?<timestamp>.<bunsen_commit_id>(.<sequence_no>?)
  - Canonically, the timestamp is <year-month-day> but the user could specify
    a more granular timestamp.
- option 3: (<source>/)?(<project>/)?<full_timestamp>(.<sequence_no>?)
  - [Q] Needed for fche's suggestion to base the branch name on the
    identifier. Otherwise a nasty hack is needed to create the branch, commit
    to it, then rename the branch to include the bunsen_commit_id.

[Q] Especially for CGI queries, slashes '/' should be replaceable with either
colon ':' or dot '.'.

Testrun ID examples
- 2021-08.a4bf2c
- systemtap/2021-08.a4b.1
- mcermak/systemtap/2021-08.a4b.1
- mcermak/systemtap/2021-08-12:15
- a4bf2c55
- systemtap/a4bf2c55

All of this machinery lets us specify natural-looking commands like:

$ bunsen show systemtap/2022-02-13/12
  -> show testrun
$ bunsen show systemtap/2022-02-13/12:systemtap.log*
  -> show testlog
$ bunsen show systemtap/2022-02-13/12:systemtap.log:400-500
  -> show subset of lines in a testlog
$ bunsen ls systemtap/2022-02
  -> list testruns matching a partial ID

And corresponding web queries, e.g.
- /bunsen-cgi.py?script=show_logs&id=systemtap.2022-02-13.12:systemtap.log:400-500
- [Q] Is it worth implementing per-command endpoints '/show_runs?id=...'?

The fields in the testrun ID are also used to generate the names of the
branches where the testruns and testlogs will be stored. Note that the branch
name may be nontrivially user-visible if we're going by fche's suggestion of
specifying subsets of branches to push and pull, which is why we're
discussing it here rather than in the 'Repository Layout' portion.

(Testruns and testlogs branches are separate for two reasons -- (a) testlogs
branches replace the worktree with each commit while testruns branches don't,
and (b) more relevantly to this discussion, we may want to save some space by
pulling some testruns branches but not the corresponding testlogs branches.)

Possible formats for the branch name:
- mcermak/systemtap/testlogs-2021-08
  -> the current scheme, split into branches by year_month
- sergiodj/gdb/testlogs-2021-08-Ubuntu-Aarch64-m64
  -> a scheme I previously tried for the GDB buildbots, with additional
     splitting by buildbot name
- (mcermak/systemtap/testruns-2021-08-04-12:15.1 -> unlikely to scale well)
- mcermak/systemtap/testruns-2021-08/04-12:15.1
  -> fche's suggestion to have 1 branch per testrun, probably requires the
     latter part of the branch identifier to be in a subdirectory like this

Overall, though, the branch name format can be fairly free-form and different
commit_logs scripts could even use different formats. The only design goals
are to adequately split the testlogs into a fairly small number of fairly
small branches (i.e. divide O(n*m) testlogs into O(n) branches of O(m)
testlogs), and to allow 'bunsen pull' invocations to identify a meaningful
subset of branches via wildcard or regex. The exact name doesn't matter when
accessing the repo, because top-level index files specify the branch where
testrun and testlog data is stored.
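To make the identifier scheme above concrete, here is a minimal sketch in Python. The helper names (make_testrun_id, parse_testrun_id) are hypothetical, not the current Bunsen API; this only illustrates the canonical display form and the drop-leading-fields rule for input:

```python
# Sketch only: helper names are hypothetical, not the current Bunsen API.

def make_testrun_id(source, project, date, commit_id, seq_no=None):
    """Canonical display form:
    <source>/<project>/<year-month-day>.<3 letters of bunsen_commit_id>(.<seq>)"""
    ident = "{}/{}/{}.{}".format(source, project, date, commit_id[:3])
    if seq_no is not None:           # tacked on only in case of collisions
        ident += ".{}".format(seq_no)
    return ident

def parse_testrun_id(ident):
    """Accept looser input: the top-level <source> and <project> fields can
    be dropped, so assign the slash-separated parts from the right."""
    parts = ident.split('/')
    rest = parts[-1]
    project = parts[-2] if len(parts) >= 2 else None
    source = parts[-3] if len(parts) >= 3 else None
    return source, project, rest
```

Disambiguating a bare `<timestamp>.<commit>` remainder (and the ':'/'.' separator variants for CGI) would sit on top of this and is left out here.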
The only required elements IMO are in the prefix:
- (<source>/)?<project>/test{runs,logs}-...

The current Bunsen implementation also strongly insists on <year-month>
following immediately after test{runs,logs}, but I plan to relax this
requirement. Don't see any reason why not.

#1: Testrun Schema, very quick take1

version fields: these identify the version of the project being tested
- source_commit: commit id in the project source repo
- [Q] source_branch: branch being tested in the project source repo
  - Suggested by keiths at one point.
  - Overspecifies the version, but useful for tracking buildbot testing?
- package_nvr: version id of a downstream package
  - [Q] Should this be the full nvr (e.g. 'systemtap-4.4-6.el7') or just the
    version number ('4.4-6.el7')?
- version: fallback field if the project uses some other type of versioning scheme
  - [Q] By default, we don't know how to sort this. I could extend the
    analysis libraries to load the original parsing module for the project
    (e.g. systemtap.parse_dejagnu) and check for a method to extract the
    version sort key.

configuration fields: identify the system configuration on which the project was tested
- arch: hardware architecture
- osver: identifier of the Linux distro e.g. fedora-36
- Other fields identify dependencies e.g. kernel_ver, gcc_ver
  - These are version fields similar to package_nvr. Analysis scripts should
    be configurable to sort on these instead of on / together with the
    project version.
- [Q] More complex configuration subtleties could be captured by parsing
  things like config.log and creating a related testrun with the yes/no
  results? Needs some thinking since an autoconf 'no' isn't a 'fail' for the
  purposes of reporting regressions.
- [Q] keiths had an alternate set of configuration fields for his GDB use
  case, should give some thought to this.
  - 'target_board' is the only one that isn't covered by existing fields?
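As a rough illustration of the fields above (all values made up), the version and configuration fields could travel together as flat key/value metadata on a testrun, with analysis scripts extracting the configuration subset for grouping:

```python
# Illustrative values only; field names follow the proposed schema above.
testrun_meta = {
    # version fields
    'source_commit': 'a4bf2c55',
    'package_nvr': 'systemtap-4.4-6.el7',
    # configuration fields
    'arch': 'x86_64',
    'osver': 'fedora-36',
    'kernel_ver': '5.17.0',
    'gcc_ver': '12.0.1',
}

CONFIG_KEYS = ('arch', 'osver', 'kernel_ver', 'gcc_ver')

def config_of(meta, config_keys=CONFIG_KEYS):
    """Extract just the configuration fields, e.g. for grouping testruns
    that ran on the same kind of system."""
    return {k: meta[k] for k in config_keys if k in meta}
```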
testcases: an array of objects as follows
- name - DejaGNU .exp name
  - every testrun of a DejaGNU testsuite usually has the same .exps
- outcome - DejaGNU outcome code e.g. PASS, FAIL, KPASS, KFAIL, ...
- subtest - DejaGNU subtest, very very very freeform
  - a DejaGNU .exp doesn't always have the same subtests; subtest strings mix
    an identifying part with a diagnostic part; subtest strings aren't even
    necessarily unique within an .exp
- origin_sum - cursor (subset of lines) in the dejagnu .sum file
- origin_log - cursor (subset of lines) in the dejagnu .log file

[Q] (!!) The 'store PASS subtests separately' vs 'consolidate PASS subtests'
format tradeoff is a complex one & I'll need to write a follow-up email to
discuss it. Briefly, because the set of subtest strings varies wildly from
testrun to testrun, the 'state' of an .exp is probably best described as 'the
set of FAILing subtest strings in the .exp'. If there are no FAILs we assume
the .exp is working correctly, which can be recorded as a single PASS; trying
to keep track of which PASS subtests are 'supposed' to be present and
flagging their absence as a failure would be a pretty daunting task. If we
want to do it, it's best left to some complex follow-up analysis script.

bookkeeping information: where is the relevant data located in the Bunsen Git repo?
- project: the project being tested
- testrun_id: unique identifier of the testrun within the project
- bunsen_testruns_branch: branch where the parsed testrun JSON is stored
- bunsen_testlogs_branch: branch where the testlogs are stored
- bunsen_commit_id: commit ID of the commit where the testlogs are stored
- [Q] (_cursor_commit_ids - a hack to shorten the JSON representation of
  origin_{sum,log}; I will get rid of it as this field is typically redundant
  with bunsen_commit_id)

extras, purely for convenience
- pass_count
- fail_count
- [Q] These are self-explanatory, but not as full-featured as DejaGNU's
  summary of all outcome codes.
I could remove this entirely, or add an outcome_counts field containing a map
{'PASS': ..., 'FAIL': ..., 'KFAIL': ..., ...}

Next time on Bunsen:
- #2a: Repository Layout: including a more detailed discussion of pushing and
  pulling subsets of branches, in particular testruns-only or
  testruns+testlogs pull operations.
- #2b: SQLite Cache and Django-esque API: in particular, will discuss how the
  Git repo looks if parsed testrun data is stored in a database. I still
  expect having JSON in the repo to be canonical, however, because pulling a
  *subset* of testruns from an SQLite database would be much more complex
  than 'git pull <remote> <list of branches>'. ([fche] assures me, however,
  that cloning an SQLite repo in its entirety would be straightforward.)
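The outcome_counts idea could be a one-liner over the testcases array. A sketch, assuming testcases are dicts with an 'outcome' key as in the schema above:

```python
from collections import Counter

def outcome_counts(testcases):
    """Count every DejaGNU outcome code, generalizing pass_count/fail_count."""
    return Counter(tc['outcome'] for tc in testcases)

# Illustrative testcase entries:
example = [
    {'name': 'foo.exp', 'outcome': 'PASS'},
    {'name': 'bar.exp', 'outcome': 'FAIL', 'subtest': 'serve casserole'},
    {'name': 'bar.exp', 'outcome': 'XFAIL', 'subtest': 'guests are upset'},
]
```

pass_count and fail_count then fall out as outcome_counts(...)['PASS'] and ['FAIL'], so the convenience fields need not be stored separately.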
* #1c Details of DejaGNU name/outcome/subtest

There are two options for the format of the 'testcases' array.

*Option 1:* PASS subtests are combined into a single PASS entry for each .exp.

Example: from a dejagnu sum file containing:

01 Running foo.exp...
02 PASS: rotate gears
03 PASS: clean widgets (widget no.1,2,3)
04 PASS: polish tea kettle
05 Running bar.exp...
06 PASS: greet guests
07 FAIL: serve casserole (casserole was burnt)
08 oven temperature was 1400 degrees F
09 XFAIL: guests are upset 3/5 stars
10 "You are a so-so host."
11 PASS: clean house after guests depart

The Bunsen testcase entries corresponding to this file would be:
- {name:'foo.exp', outcome:'PASS', origin_sum:'project.sum:01-04'}
- {name:'bar.exp', outcome:'FAIL', subtest:'serve casserole (casserole was burnt)', origin_sum:'project.sum:07-08'}
- {name:'bar.exp', outcome:'XFAIL', subtest:'guests are upset 3/5 stars', origin_sum:'project.sum:09-10'}

The current testrun format, as extensively tested with the SystemTap
buildbots, combines PASS subtests into a single entry for each .exp (with the
'subtest' field omitted), but stores FAIL subtests as separate entries. When
working with portions of the testsuite that don't contain failures, this
significantly reduces the size of the JSON that needs to be processed.

The reasoning for why this format works is that the set of subtest messages
across different DejaGNU runs is extremely inconsistent. Therefore, we define
the 'state' of an entire .exp testcase as 'the set of FAIL subtests produced
by this testcase' and compare testruns at the .exp level accordingly.

[Q] We _assume_ that the 'absence' of a PASS subtest in a testrun is not a
problem, and consider it the testsuite's responsibility to explicitly signal
a FAIL when something doesn't work. Is this assumption accurate to how
projects use DejaGNU?

Note that the PASS subtests for bar.exp were dropped in the above example. If
the set of failures is empty, we mark the entirety of bar.exp as a 'PASS'.
Since the set of failures is nonempty, we just record the 'FAIL' subtests in
the set.

*Option 2:* PASS subtests are stored separately just like FAIL subtests. In
this case, every subtest line in the testrun creates a corresponding entry in
the 'testcases' array.

keiths requested this mode for the GDB logs. It takes up a lot more space
(and although the strong de-duplication of Bunsen's storage format may make
the disk usage a moot issue, the extra entries still slow down batch jobs
working with bulk quantities of testcase data). But it allows test results to
be diffed to detect the 'absence of a PASS' as a problem.

In principle, I don't see a reason why Bunsen couldn't continue supporting
either mode, with some careful attention paid to the API in the various
analysis scripts (the analysis scripts contain a collection of helper
functions for working with passes, fails, diffs, etc. which has been
gradually evolving into a proper library).

* #1d Applying the testcase format scheme to config.log

Including the yes/no answers from configure in the parsed testcase data. This
is an idea from fche I find very intriguing, especially because changes in
autoconf tests can correlate with regressions caused by the environment.

Applying the scheme to config.log:
- name = "config.log"
- subtest = "checking for/whether VVV"
- outcome = yes (PASS) or no (FAIL)

This should probably be stored with both PASS and FAIL subtests 'separate'.

[Q] Where should the parsed config.log be stored?
- in the 'testcases' field of the same testrun?
  -> analysis scripts must know how to treat 'config.log' FAIL entries and
     ignore them (as 'not real failures') if necessary. In short,
     'config.log' FAIL entries are relevant to diffing/bisection for a known
     problem, but not relevant when reporting regressions.
- in a different field of the same testrun? e.g. 'config'?
  -> analysis scripts will ignore 'config' unless explicitly coded to look at it.
- in a testrun in a separate project? (e.g. 'systemtap' -> 'systemtap-config')
  -> this is similar to the gcc buildbot case, where one testsuite run will
     create testruns in several 'projects' (e.g. 'gcc','ld','gas',...)

[Q] In analysis scripts such as show_testcases, how to show changes in large
testcases more granularly (e.g. check.exp, bpf.exp)? A brainstorm:
- Add an option split_subtests which will try to show a single grid for every
  subtest
  - Scan the history for subtest strings, possibly with common prefix. Try to
    reduce the set / combine subtest strings with identical history
  - Generate a grid view for each subtest we've identified this way
- In the CGI HTML view, for each .exp testcase of the grid view without
  split_subtests, add a link to the split_subtests view of that particular .exp
  - This would be much better than the current mcermak-derived option to show
    subtests when a '+' table cell is clicked. The HTML is lighter-weight and
    the history of separate subtests is clearly visible.
- Possibly: identify the testcases which require more granularity (e.g. they
  are always failing, only the number of failures keeps changing) and expand
  them automatically in the top-level grid view.

[Q] For someone testing projects on a company-internal setup, how do we
extract a 'safe' subset of data that can be shared with the public?
- Option 1: analysis results only (e.g. grid views without subtests are
  guaranteed-safe)
- Option 2: testrun data but not testlogs (includes subtest strings; these
  may or may not be safe)
- Option 3: testrun data with scrubbed subtest strings (replace several FAIL
  outcomes with one testcase entry whose subtest says 'N failures')

Note: Within the 'Makefile' scheme, the scrubbing could be handled by an
analysis script that produces project 'systemtap-sourceware' from 'systemtap'.
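The Option 1 consolidation described in #1c above can be sketched as follows. This is a toy parser for illustration only; real .sum parsing (and filling in the origin_sum/origin_log cursors) is more involved:

```python
import re
from collections import OrderedDict

_OUTCOME = re.compile(r'^(PASS|FAIL|XPASS|XFAIL|KPASS|KFAIL|'
                      r'UNTESTED|UNRESOLVED|UNSUPPORTED|ERROR): (.*)$')
_RUNNING = re.compile(r'^Running (\S+?)\.\.\.')

def consolidate_option1(sum_lines):
    """Option 1: keep non-PASS subtests as separate entries; if an .exp has
    only PASSes, emit a single consolidated PASS entry with no subtest."""
    by_exp = OrderedDict()
    current = None
    for line in sum_lines:
        m = _RUNNING.match(line)
        if m:
            current = m.group(1)
            by_exp.setdefault(current, [])
            continue
        m = _OUTCOME.match(line)
        if m and current is not None:
            by_exp[current].append((m.group(1), m.group(2)))
    testcases = []
    for exp, results in by_exp.items():
        non_pass = [(o, s) for o, s in results if o != 'PASS']
        if not non_pass:
            testcases.append({'name': exp, 'outcome': 'PASS'})
        for outcome, subtest in non_pass:
            testcases.append({'name': exp, 'outcome': outcome,
                              'subtest': subtest})
    return testcases
```

Run on the foo.exp/bar.exp example from #1c, this reproduces the three entries shown there (minus the origin_sum cursors, which the sketch omits).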
Hi -

> Every testrun in the Bunsen repo should have a unique user-visible
> identifier. [...]

Reminding: "testrun" = "analysis data derived from a set of log files stored
in a git branch nearby".

> [Q] Should the bunsen_commit_id be deprecated entirely as a unique
> identifier? We could conceivably allow a single set of testlogs to
> define several testruns for the *same* project, which would break the
> uniqueness property for <project>+<bunsen_commit_id>

Given that we're reconsidering use of git for purposes of storing this
analysis data, there is no obvious hash for it. OTOH, given that it is the
result of analysis, we have a lot of rich data to query by. I suspect we
don't need much of a standard identifier scheme at all. An arbitrary nickname
for query shortcutting could be a field associated with the "testrun" object
itself.

> - To accommodate fche's fine-grained branching idea, the
>   branch names in the Bunsen repo can be based on the unique ID.

(I was talking about fine-grained branching *for the test logs*, not for the
derived analysis data.)

> All of this machinery lets us specify natural-looking commands like:
> $ bunsen show systemtap/2022-02-13/12
>   -> show testrun
> $ bunsen show systemtap/2022-02-13/12:systemtap.log*
>   -> show testlog
> $ bunsen show systemtap/2022-02-13/12:systemtap.log:400-500
>   -> show subset of lines in a testlog
> $ bunsen ls systemtap/2022-02
>   -> list testruns matching a partial ID
> [...]
> Overall, though, the branch name format can be fairly free-form and
> different commit_logs scripts could even use different formats. The
> only design goals are to adequately split the testlogs into a fairly
> small number of fairly small branches (i.e. divide O(n*m) testlogs
> into O(n) branches of O(m) testlogs), and to allow 'bunsen pull'
> invocations to identify a meaningful subset of branches via wildcard
> or regex.

Noting we switched to talking about testlogs - the raw test result log files
that are archived in git.
These (rather than analysis data stored somewhere else) are indeed worth
pushing & pulling & transporting between bunsen installations. I suspect that
imposing a naming standard here is not necessary either. As long as we
standardize what bunsen consumes from these repos (every tag? every commit?
every branch? every ref matching a given regex?), a human or other tooling
could decide their favorite naming convention.

> [...]
> #1: Testrun Schema, very quick take1
>
> version fields: these identify the version of the project being tested
> - source_commit: commit id in the project source repo
> - [Q] source_branch: branch being tested in the project source repo
>   - [...]
> - package_nvr: version id of a downstream package
>   - [Q] Should this be the full nvr (e.g. 'systemtap-4.4-6.el7') or
>     just the version number ('4.4-6.el7')?
> - version: fallback field if the project uses some other type of versioning scheme
>   - [Q] By default, we don't know how to sort this. I could extend the
>     analysis libraries to load the original parsing module for the
>     project (e.g. systemtap.parse_dejagnu) and check for a method to
>     extract the version sort key.
>
> configuration fields: identify the system configuration on which the project was tested
> - arch: hardware architecture
> - [...]
> - 'target_board' is the only one that isn't covered by existing fields?

This list, and especially the "... and other fields ...", suggests to me that
these don't need to be tightly specified, but function as a set of key/value
tuples that the parsers would produce as results. Yeah, a conventional set is
great for user convenience and for building other analysis on top of the
initial parse values. But the list of key names can be left open in the
tooling core, so queries can use any set of them. i.e., the schema for
"testrun" analysis objects could be:

(id, keyword, value)

and then a user can find testruns by combination of keywords having
particular values (or range or pattern).
That can map straightforwardly to a relational filter (select query) if
that's how we end up storing these things.

> testcases: an array of objects as follows
> - name - DejaGNU .exp name
> - outcome - DejaGNU outcome code e.g. PASS, FAIL, KPASS, KFAIL, ...
> - subtest - DejaGNU subtest, very very very freeform
> - origin_sum - cursor (subset of lines) in the dejagnu .sum file
> - origin_log - cursor (subset of lines) in the dejagnu .log file

OK, so that could be a separate relational table of testrun analysis data,
derived from the testlog and related to the above set.

> [Q] (!!) The 'store PASS subtests separately' vs 'consolidate PASS
> subtests' format tradeoff is a complex one [...]

To the extent this represents an optimization, I'd tend to think of it as a
policy decision that the dejagnu parser configuration should make.

> bookkeeping information: where is the relevant data located in the Bunsen Git repo?
> - project: the project being tested
> - testrun_id: unique identifier of the testrun within the project
> - bunsen_testruns_branch: branch where the parsed testrun JSON is stored
> - bunsen_testlogs_branch: branch where the testlogs are stored
> - bunsen_commit_id: commit ID of the commit where the testlogs are stored

(Not sure how much of this needs to be formally separated, vs. just basic
key/value tuples associated with the testrun vs. already represented
otherwise.)

> - pass_count
> - fail_count
> - [Q] These are self-explanatory, but not as full-featured as
>   DejaGNU's summary of all outcome codes. I could remove this
>   entirely, or add an outcome_counts field containing a map
>   {'PASS': ..., 'FAIL': ..., 'KFAIL': ..., ...}

These could also go as additional key/value tuples into the testrun. (Prefix
their names with "dejagnu-" to identify the analysis/parse tool that created
them.)

- FChE
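The (id, keyword, value) scheme proposed above can be sketched directly in sqlite3. The table name testrun_kv and the helper functions are hypothetical, purely to show how open-ended keys and AND-combined queries would work:

```python
import sqlite3

# In-memory database for the sketch; a real store would be a file.
db = sqlite3.connect(':memory:')
db.execute('create table testrun_kv (id integer, keyword text, value text)')

def add_testrun(db, testrun_id, **fields):
    """Store one testrun's metadata as (id, keyword, value) tuples."""
    db.executemany('insert into testrun_kv values (?, ?, ?)',
                   [(testrun_id, k, v) for k, v in fields.items()])

def find_testruns(db, **criteria):
    """Find testruns matching every keyword=value predicate (an AND)."""
    ids = None
    for keyword, value in criteria.items():
        rows = {r[0] for r in db.execute(
            'select id from testrun_kv where keyword = ? and value = ?',
            (keyword, value))}
        ids = rows if ids is None else ids & rows
    return sorted(ids or [])

add_testrun(db, 1, arch='x86_64', osver='fedora-35', kernel_ver='4.6')
add_testrun(db, 2, arch='aarch64', osver='rhel-8', kernel_ver='4.7')
```

Range and pattern predicates would extend find_testruns with `<`/`like` clauses in the same style.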
Thanks for those comments.

I believe the current point of disagreement is whether we can abandon the
JSON testrun storage format, while maintaining neat properties for easy
cloning / archival of entire Bunsen repos, subsets of Bunsen repos, and so
forth. I'm not convinced yet, especially if we go the route of every Bunsen
instance parsing and generating its own local testrun data.

However, your comments on how to specify the SQLite schema and split the
'dejagnu', 'autoconf', ... parser bits have convinced me to run my design
exercise in the other direction: try my hand at specifying an SQLite schema,
then check whether the same features can be replicated in the JSON version of
the format.

(When thinking in SQLite first, I can immediately think of a number of
'metaknowledge' bits that are currently hardcoded in the corresponding
parsing scripts, but might conveniently be maintained and extended in tables.
Things like knowing that "Red Hat Enterprise Linux \d+ (.*)" maps to
"rhel-\1". But I'll see if a more convincing list of examples comes to mind
or not.)

On Thu, Mar 10, 2022, at 1:04 PM, Frank Ch. Eigler wrote:
>> [Q] Should the bunsen_commit_id be deprecated entirely as a unique
>> identifier? We could conceivably allow a single set of testlogs to
>> define several testruns for the *same* project, which would break the
>> uniqueness property for <project>+<bunsen_commit_id>
>
> Given that we're reconsidering use of git for purposes of storing this
> analysis data, there is no obvious hash for it.

Disagree. The bunsen_commit_id hash is based on the commit storing the
testlogs, not the commit storing the testruns JSON files. So moving the
testruns to SQLite wouldn't eliminate the possibility of using this hash in
a testrun identifier.

> OTOH, given that it
> is the result of analysis, we have a lot of rich data to query by. I
> suspect we don't need much of a standard identifier scheme at all. An
> arbitrary nickname for query shortcutting could be a field associated
> with the "testrun" object itself.

I don't get it.

The testrun ID could be stored in a field associated with the "testrun"
object. It is indeed an 'arbitrary nickname' that we are free to generate
when the testrun is created. But we do need such a nickname to exist in order
to store tuples in the (testrun_id, keyword, value) schema, as well as to
issue unambiguous commands like 'delete this testrun'.

In my prior email I was trying to improve on 'arbitrary hexsha' and design a
'nickname' that identifies the testrun somewhat and can be handled by human
minds when necessary. The bunsen_commit_id 'hexsha' was sufficiently unique
but it was also very confusing from a UI perspective since users will be
thinking of commit IDs in the project Git repo whenever they see a bare
hexsha.

>> - To accommodate fche's fine-grained branching idea, the
>>   branch names in the Bunsen repo can be based on the unique ID.
>
> (I was talking about fine-grained branching *for the test logs*, not
> for the derived analysis data.)

If the testrun data is stored in SQLite, obviously the Git branching scheme
would only apply to the testlogs. If the testrun data is stored in Git, the
branching scheme would be based on the same concerns about easily pulling a
subset of the data. I'm not convinced we need individual-testrun granularity
for the branching, but the branching scheme I outlined easily accommodates
this option.

>> Overall, though, the branch name format can be fairly free-form and
>> different commit_logs scripts could even use different formats. The
>> only design goals are to adequately split the testlogs into a fairly
>> small number of fairly small branches (i.e. divide O(n*m) testlogs
>> into O(n) branches of O(m) testlogs), and to allow 'bunsen pull'
>> invocations to identify a meaningful subset of branches via wildcard
>> or regex.
> Noting we switched to talking about testlogs - the raw test result log
> files that are archived in git. These (rather than analysis data
> stored somewhere else) are indeed worth pushing & pulling &
> transporting between bunsen installations. I suspect that imposing a
> naming standard here is not necessary either. As long as we
> standardize what bunsen consumes from these repos (every tag? every
> commit? every branch? every ref matching a given regex?), a human or
> other tooling could decide their favorite naming convention.

I'll need to think about this carefully. My intuition screams a strong
disagreement. If we just treat the transported data as a set of unlabeled
testlogs, and say that each Bunsen instance is going to be naming them
separately & differently....

>> #1: Testrun Schema, very quick take1
> [...] the list of
> key names can be left open in the tooling core, so queries can use any
> set of them. i.e., the schema for "testrun" analysis objects could be:
>
> (id, keyword, value)
>
> and then a user can find testruns by combination of keywords having
> particular values (or range or pattern). That can map
> straightforwardly to a relational filter (select query) if that's how
> we end up storing these things.

As I understand it, this is a neat solution to my earlier question of needing
to define metadata fields in an SQLite schema and ending up having to deal
with migrations, etc. during the lifecycle of a Bunsen repo. Of course, it
means we definitely don't get Django-like ORM functionality for free: we need
to implement our own mapping between this (id, keyword, value) soup and
individual Testrun objects that combine all keywords for a particular id.
This mapping would take the place of the current code which translates
Testrun objects to and from JSON.

That said, there are properties to this scheme I like. The SQLite schema
should be the same across any Bunsen instance for any project, which
satisfies the goal of avoiding schema migrations.
To test the 'straightforward' part of your claim about querying, what might a
'select' query look like for 'all configuration fields of {set of testruns}'?
We would want it to produce a table along the lines of:

testrun  arch     osver      kernel_ver
id1      x86_64   fedora-35  4.6-whatever
id2      aarch64  rhel-8     4.7-whatever

(Another functionality [Q] came up for distribution & kernel version: we may
want to store the most exact version of each component but then specify a
granularity for analysis which combines some 'similar-enough' versions into
the same configuration (e.g. treat all 5.x kernels as one configuration, then
treat all 4.x kernels as another configuration). If we see a problem arise on
5.x but not 4.x, *then* we would want to look at the detailed history of
changes within 5.x.)

>> testcases: an array of objects as follows
>> - name - DejaGNU .exp name
>> - outcome - DejaGNU outcome code e.g. PASS, FAIL, KPASS, KFAIL, ...
>> - subtest - DejaGNU subtest, very very very freeform
>> - origin_sum - cursor (subset of lines) in the dejagnu .sum file
>> - origin_log - cursor (subset of lines) in the dejagnu .log file
>
> OK, so that could be a separate relational table of testrun analysis
> data, derived from the testlog and related to the above set.

Agreed.

>> [Q] (!!) The 'store PASS subtests separately' vs 'consolidate PASS
>> subtests' format tradeoff is a complex one [...]
>
> To the extent this represents an optimization, I'd tend to think of it
> as a policy decision that the dejagnu parser configuration should make.

Agreed, I would look at it that way too. We might want to flag what format a
stored testrun is using, in a way that can be easily found out later, or
changed by changing the parser configuration and rebuilding.

>> bookkeeping information: where is the relevant data located in the Bunsen Git repo?
>> - project: the project being tested
>> - testrun_id: unique identifier of the testrun within the project
>> - bunsen_testruns_branch: branch where the parsed testrun JSON is stored
>> - bunsen_testlogs_branch: branch where the testlogs are stored
>> - bunsen_commit_id: commit ID of the commit where the testlogs are stored
>
> (Not sure how much of this needs to be formally separated, vs. just
> basic key/value tuples associated with the testrun vs. already
> represented otherwise.)

The testrun_id appears to be basic to this scheme, since it's used to label
which key/value tuples are associated with which testruns.

> These could also go as additional key/value tuples into the testrun.
> (Prefix their names with "dejagnu-" to identify the analysis/parse
> tool that created them.)

Do you mean something like 'dejagnu-PASS' or 'dejagnu-PASS-count'? That could
work, with an iterator like:

for outcome in dejagnu_all_outcomes: # 'PASS', 'FAIL', ...
    key = 'dejagnu-{}-count'.format(outcome)
    val = testrun[key]
    yield outcome, key, val
...
> (When thinking in SQLite first, I can immediately think of a number of
> 'metaknowledge' bits that are currently hardcoded in the corresponding parsing
> scripts, but might conveniently be maintained and extended in tables.
> [...])

(Pure configuration could well just live outside too.)

>>> [Q] Should the bunsen_commit_id be deprecated entirely as a unique
>>> identifier? We could conceivably allow a single set of testlogs to
>>> define several testruns for the *same* project, which would break the
>>> uniqueness property for <project>+<bunsen_commit_id>
>>
>> Given that we're reconsidering use of git for purposes of storing this
>> analysis data, there is no obvious hash for it.
>
> Disagree. The bunsen_commit_id hash is based on the commit storing the testlogs,
> not the commit storing the testruns JSON files. So moving the testruns to SQLite
> wouldn't eliminate the possibility of using this hash in a testrun
> identifier.

I believe you were talking about identification (unique key) for testruns
rather than testlogs.

>> OTOH, given that it
>> is the result of analysis, we have a lot of rich data to query by. I
>> suspect we don't need much of a standard identifier scheme at all. An
>> arbitrary nickname for query shortcutting could be a field associated
>> with the "testrun" object itself.
>
> I don't get it.
>
> The testrun ID could be stored in a field associated with the "testrun" object.
> It is indeed an 'arbitrary nickname' that we are free to generate when the
> testrun is created. But we do need such a nickname to exist in order to store
> tuples in the (testrun_id, keyword, value) schema, as well as to issue
> unambiguous commands like 'delete this testrun'.

As far as the database is concerned, a testrun ID can just be some random
unique integer that the user never even sees. A user may see the nickname or
some other synthetic description (kinda like git describe?) when needed,
which may be mappable back to the related rows.
The testrun objects would relate to the testlog commit#. (If that
relationship happens to be one-to-one, then that commit# could be a
good nickname.)

> In my prior email I was trying to improve on 'arbitrary hexsha' and
> design a 'nickname' that identifies the testrun somewhat and can be
> handled by human minds when necessary. The bunsen_commit_id 'hexsha'
> was sufficiently unique but it was also very confusing from a UI
> perspective since users will be thinking of commit IDs in the project
> Git repo whenever they see a bare hexsha.

Yeah. I suspect an identification-by-query that happens to expose a
convenient nickname would work here. And you're right, it'd be
unfortunate to have ambiguity between the testlog & testrun objects.
(OTOH it could be that the bunsen CLIs generally deal with testruns for
such things.)

>> Noting we switched to talking about testlogs - the raw test result log
>> files that are archived in git. These (rather than analysis data
>> stored somewhere else) are indeed worth pushing & pulling &
>> transporting between bunsen installations. I suspect that imposing a
>> naming standard here is not necessary either. As long as we
>> standardize what bunsen consumes from these repos (every tag? every
>> commit? every branch? every ref matching a given regex?), a human or
>> other tooling could decide their favorite naming convention.
>
> I'll need to think about this carefully. My intuition screams a strong
> disagreement.
>
> If we just treat the transported data as a set of unlabeled testlogs,
> and say that each Bunsen instance is going to be naming them separately
> & differently....

Well, if they are externally identified usually by common nickname, or
by predicates like time/tool/host or something, it need not vary from
installation to installation.

>>> #1: Testrun Schema, very quick take1
>> [...] the list of
>> key names can be left open in the tooling core, so queries can use any
>> set of them.
>> i.e., the schema for "testrun" analysis objects could be:
>>
>>     (id, keyword, value)
>>
>> and then a user can find testruns by combination of keywords having
>> particular values (or range or pattern). That can map
>> straightforwardly to a relational filter (select query) if that's how
>> we end up storing these things.

> As I understand it, this is a neat solution to my earlier question of
> needing to define metadata fields in an SQLite schema and end up having
> to deal with migrations, etc. during the lifecycle of a Bunsen repo. Of
> course, it means we definitely don't get Django-like ORM functionality
> for free [...]

Yes, right. If one doesn't reify the set of keys to describe a testrun,
they won't show up as ORM columns/fields.

> [...]
> To test the 'straightforward' part of your claim about querying, what
> might a 'select' query look like for 'all configuration fields of {set
> of testruns}'?
>
> We would want it to produce a table along the lines of:
>
>     testrun  arch     osver      kernel_ver
>     id1      x86_64   fedora-35  4.6-whatever
>     id2      aarch64  rhel-8     4.7-whatever

In SQL, it'd be a join like this:

    select tr.id, kv1.value, kv2.value, kv3.value
    from testrun tr, testrunkv kv1, testrunkv kv2, testrunkv kv3
    where kv1.id = tr.id and kv1.name = 'arch'
      and kv2.id = tr.id and kv2.name = 'osver'
      and kv3.id = tr.id and kv3.name = 'kernel_ver';

(modulo testrun nickname).

> (Another functionality [Q] came up for distribution & kernel version:
> we may want to store the most exact version of each component
> but then specify a granularity for analysis
> which combines some 'similar-enough' versions into
> the same configuration
> (e.g. treat all 5.x kernels as one configuration,
> then treat all 4.x kernels as another configuration).
>
> If we see a problem arise on 5.x but not 4.x,
> *then* we would want to look at the detailed history of
> changes within 5.x.)

That should be expressible in a variety of ways, even within sql:

    ... where ....
      and kv3.value like '4.%';

>> These could also go as additional key/value tuples into the testrun.
>> (Prefix their names with "dejagnu-" to identify the analysis/parse
>> tool that created them.)

> Do you mean something like 'dejagnu-PASS' or 'dejagnu-PASS-count'?
> That could work, with an iterator like:
>
>     for outcome in dejagnu_all_outcomes: # 'PASS', 'FAIL', ...
>         key = 'dejagnu-{}-count'.format(outcome)
>         val = testrun[key]
>         yield outcome, key, val
>     ...

Yeah. Just a place to stash summaries. Or: why not, a whole separate
derived analysis table with only a handful of rows:

    dejagnu-TOOL-counts (testrun_id, outcome, count)

- FChE
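For the record, the (id, keyword, value) join sketched in the previous
mail does run against SQLite as written. A self-contained sketch with
illustrative data (table and column names are the ones used in the
example, not a committed schema), including the 4.x granularity filter:

```python
import sqlite3

# Sketch: pivot the (id, keyword, value) tuples into one row per testrun
# by self-joining testrunkv once per requested keyword.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testrun (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE testrunkv (id INTEGER, name TEXT, value TEXT)")
conn.execute("INSERT INTO testrun (id) VALUES (1), (2)")
conn.executemany("INSERT INTO testrunkv VALUES (?, ?, ?)",
                 [(1, "arch", "x86_64"),  (1, "osver", "fedora-35"),
                  (1, "kernel_ver", "4.6-whatever"),
                  (2, "arch", "aarch64"), (2, "osver", "rhel-8"),
                  (2, "kernel_ver", "4.7-whatever")])
rows = conn.execute("""
    SELECT tr.id, kv1.value, kv2.value, kv3.value
      FROM testrun tr, testrunkv kv1, testrunkv kv2, testrunkv kv3
     WHERE kv1.id = tr.id AND kv1.name = 'arch'
       AND kv2.id = tr.id AND kv2.name = 'osver'
       AND kv3.id = tr.id AND kv3.name = 'kernel_ver'
       AND kv3.value LIKE '4.%'
     ORDER BY tr.id""").fetchall()
print(rows)  # one (id, arch, osver, kernel_ver) row per matching testrun
```

Both sample testruns have 4.x kernels, so both survive the LIKE filter;
changing it to '5.%' would drop them both.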
My main concern with the SQLite storage schema is how well SQLite
handles string deduplication. The main source of redundancy in parsed
testrun data is the subtest strings repeated across different testruns,
so we might end up needing to do an additional join as follows:

    (name, outcome, subtest_id) >< (subtest_id, subtest_text)

Anyways, rather than guesswork back-and-forth over whether going
SQLite-only is a good decision, I'm thinking over how to rework the
bunsen/model.py class outlines to experiment with it. Given the weak
correspondence between Testrun/Testcase objects, we end up needing a
set of explicit ser/de methods under the hood, just as with the JSON
representation. My current understanding of the schema would be as
follows:

- testruns (testrun_id, project, testlogs_commit_id)
- ANALYSIS_testrun_kvs (testrun_id, key, value)
  - For analysis results that annotate the original testrun.
- testrun_kv_types (project, key, schema)
  - We will probably need to store some 'schema' information like this.
- testcases (testrun_id, name, outcome, subtest_id)
- subtest_strs (subtest_id, subtest_text)

According to your vision, there would also need to be additional tables
for analysis results which don't follow the format of testrun key-value
annotations. Not relevant just yet, and we may need to decide a
different schema for each category of analysis (diff, grid view,
regression report, etc.).

You are hoping to track fine-grained provenance, e.g. which testruns
are present in each regression report. This is not the same as the
dependency info used to decide when an analysis should be regenerated:
e.g. the regression report would be re-run in response to changes in
the entire set of testruns it's run on, including testruns that are not
present in the regression report and therefore wouldn't be marked in
the provenance info. If the queries producing the input set of testruns
are complex, caching them as a set of testruns would be a loss in terms
of storage.
You would need to re-run the analysis whenever the project changes.

First attempt at class outlines:

    class Testrun(dict):
        def __init__(self, repo=None, from_json=None, from_sqlite=None,
                     schema=None):
            self._schema = schema
            self._repo = repo
            ...
        def _load_json(self, from_json=None)     # from_json is a JSON object
        def _load_sqlite(self, from_sqlite=None) # from_sqlite is a testrun_id
        def add_testcase(self, tc)               # tc is a Testcase obj
        def new_testcase(self, name, outcome, subtest=None)
            # create & add a Testcase obj
            # (renamed from a second add_testcase, since Python doesn't
            # overload by signature)
            # use like:
            #   tc = trun.new_testcase("foo.exp", "PASS", subtest="subtest text")
            #   tc.origin_log = ...
        # ... methods for updating key/values with analysis provenance, e.g. ...
        def set_provenance(self, field_name, analysis_name)
        def set_field(self, field_name, field_value, analysis_name)
        def validate(self, ...) # checks required fields for saving to DB
        def save(self)          # update tables in SQLite DB
        def to_json(self, summary=False, pretty=False, as_dict=False)
            # create JSON object

    class Testcase(dict):
        def __init__(self, from_json=None, from_sqlite=None, schema=None,
                     parent_testrun=None, repo=None):
            self._schema = schema
            if parent_testrun is None:
                self._repo = repo
            self.parent_testrun = parent_testrun
            ...
        def _load_json(self, from_json=None) # from_json is a JSON object
        def _load_sqlite(self, from_sqlite=None)
        def set_parent(self, parent_testrun) # if created with parent_testrun=None
        def validate(self, ...) # checks required fields for saving to DB
        def save(self)          # update tables in SQLite DB
        def delete(self)        # delete from tables in SQLite DB
        def to_json(self, summary=False, pretty=False, as_dict=False)
            # create JSON object

The 'repo' field in each object refers to a bunsen.Repo object, which
specifies such things as where testrun data is stored (in a Git repo or
in an SQLite cache, or in a mix of both).
If storing analysis-derived key-values in separate tables per analysis,
an analysis could be implemented along the lines of:

    # in my_analysis.py
    for testrun in input_testruns:
        testrun.key = analyze_stuff() # by default, sets provenance to my_analysis
        testrun.save() # saves the updated key in my_analysis_testrun_kvs,
                       # without touching the tables that don't belong
                       # to my_analysis

You could re-run this code whenever input_testruns changes and the
my_analysis_testrun_kvs table would be updated downstream?

A minimal ORMish template for a cacheable Analysis object *not* tied to
testrun key-values might be:

    class Analysis(dict):
        def __init__(self, repo=None, from_json=None, from_sqlite=None, ...):
            self._repo = repo
            ...
        def _load_json(self, from_json=None)     # from_json is a JSON object
        def _load_sqlite(self, from_sqlite=None) # from_sqlite is a testrun_id
        def validate(self, ...) # checks required fields for saving to DB
        def save(self)          # update tables in SQLite DB
        def to_json(self, summary=False, pretty=False, as_dict=False)
            # create JSON object
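Under the hood, one hypothetical way such a per-analysis save() could
work is an UPSERT into the analysis's own key/value table, so re-running
the analysis updates rows in place without touching other analyses'
tables. A sketch with sqlite3 (table name follows the
my_analysis_testrun_kvs convention above; the UPSERT syntax needs
SQLite >= 3.24):

```python
import sqlite3

# Sketch: each analysis owns one <analysis>_testrun_kvs table; an UPSERT
# keyed on (testrun_id, key) makes re-runs idempotent.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE my_analysis_testrun_kvs (
                  testrun_id INTEGER,
                  key TEXT,
                  value TEXT,
                  PRIMARY KEY (testrun_id, key))""")

def save_field(testrun_id, key, value):
    # UPSERT: a re-run overwrites the old value instead of duplicating rows
    conn.execute("""INSERT INTO my_analysis_testrun_kvs (testrun_id, key, value)
                    VALUES (?, ?, ?)
                    ON CONFLICT (testrun_id, key)
                    DO UPDATE SET value = excluded.value""",
                 (testrun_id, key, value))

save_field(1, "pass_count", "100")
save_field(1, "pass_count", "101")  # re-run of the analysis: overwrites
rows = conn.execute("SELECT value FROM my_analysis_testrun_kvs").fetchall()
print(rows)  # a single row holding the latest value
```

This keeps the "re-run whenever input_testruns changes" model cheap: the
table always reflects the most recent run.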
On 3/14/22 10:24, Serhei Makarov wrote:
> Given the weak correspondence between Testrun/Testcase objects, we end
> up needing a set of explicit ser/de methods under the hood, just as
> with the JSON representation.

In Sept 2020, I played with this a bit, and I have old (probably stale)
patches lying around to let SQLite do the serialization. The patches do
nothing to replace the JSON or the presented data model. I reported my
initial findings here:

https://sourceware.org/pipermail/bunsen/2020q3/000034.html

If you'd like to peek at that work, I can send it along, but it is
probably quite bit-rotted by now.

> My current understanding of the schema would be as follows:
>
> - testruns (testrun_id, project, testlogs_commit_id)
> - ANALYSIS_testrun_kvs (testrun_id, key, value)
>   - For analysis results that annotate the original testrun.
> - testrun_kv_types (project, key, schema)
>   - We will probably need to store some 'schema' information like this.
> - testcases (testrun_id, name, outcome, subtest_id)
> - subtest_strs (subtest_id, subtest_text)
>
> According to your vision, there would also need to be additional
> tables for analysis results which don't follow the format of testrun
> key-value annotations. Not relevant just yet, and we may need to
> decide a different schema for each category of analysis (diff, grid
> view, regression report, etc.).

I have a prototype schema just to record test data. Bunsen's git repo
is still used to store the actual .sum/.log files and provide commit
IDs to identify a particular testrun, i.e., no bunsen metadata.

Keith
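[Editorial sketch] The subtest_strs deduplication in the quoted schema
amounts to an intern-style insert: look the string up, insert it only if
new, and store just its id in testcases. A sketch with sqlite3 and
illustrative data:

```python
import sqlite3

# Sketch of the testcases >< subtest_strs join from the quoted schema:
# each distinct subtest string is stored once and referenced by id.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE subtest_strs (
                  subtest_id INTEGER PRIMARY KEY,
                  subtest_text TEXT UNIQUE)""")
conn.execute("""CREATE TABLE testcases (
                  testrun_id INTEGER, name TEXT,
                  outcome TEXT, subtest_id INTEGER)""")

def intern_subtest(text):
    # insert-or-lookup: the UNIQUE constraint deduplicates repeated strings
    conn.execute("INSERT OR IGNORE INTO subtest_strs (subtest_text) VALUES (?)",
                 (text,))
    return conn.execute("SELECT subtest_id FROM subtest_strs "
                        "WHERE subtest_text = ?", (text,)).fetchone()[0]

for run in (1, 2):  # the same subtest text repeats across testruns
    sid = intern_subtest("subtest foo bar")
    conn.execute("INSERT INTO testcases VALUES (?, ?, ?, ?)",
                 (run, "foo.exp", "PASS", sid))

n = conn.execute("SELECT COUNT(*) FROM subtest_strs").fetchone()[0]
print(n)  # stored once despite appearing in two testruns
```

How well this performs at the scale of real testsuite logs is exactly
the open question raised earlier in the thread.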