Message-Id: <320ed3c9-2612-4f64-bb1a-6a791bef4168@www.fastmail.com>
Date: Wed, 09 Mar 2022 10:07:25 -0500
From: "Serhei Makarov"
To: Bunsen
Subject: bunsen (re)design discussion #1: testrun & branch identifiers, testrun representation

bunsen (re)design discussion #1: testrun & branch identifiers, testrun representation

#1a Testrun Identifiers & Branching

Every testrun in the Bunsen repo should have a unique
user-visible identifier. Currently, this role is played by the
testrun’s bunsen_commit_id, which is the hash of the Git commit
containing the testlogs corresponding to the testrun.

[Q] Should the bunsen_commit_id be deprecated entirely as a unique
identifier? We could conceivably allow a single set of testlogs to
define several testruns for the *same* project, which would break the
uniqueness property for +

Available fields for identifier:
- (optional) source -- where the testrun was submitted from
  - Very important to keep locally added testruns separate from
    testruns that were pulled from a remote Bunsen repo.
- (optional) project
- (subset of) timestamp -- not too granular is good
  -> could be date of testrun completion, or the date of submission
     as a fallback. Don't expect too much meaningful precision.
- (subset of) bunsen_commit_id -- not too long is good
- sequence_no -- only added as a last resort
  (this is tacked on in case of collisions for the canonical
  identifier, e.g. year-month-day + 3 letters from commit id)

When generating testruns in multiple projects from the same testlog
(as in the case of the gcc Jenkins tester), the identifiers should be
the same except for .

Canonical format when displaying the identifier:
- //.<3 letters of bunsen_commit_id>.
  e.g. mcermak/systemtap/2021-08-04.c54

When accepting an identifier as input, we should permit more
variation. Key properties:
- Not too complicated to type out by hand
- Can have reasonable elements in front for tab-completion
- The top-level and fields can be dropped.
- Should *not* include query style fields (e.g. arch, osver).
  That's the job of our Bunsen repo querying solution.
- A subset of the full ID can be provided to identify the whole thing
  - Analysis scripts should fail gracefully if the ID is ambiguous,
    or treat ambiguous IDs as sets of testruns where appropriate.
- IDs pulled from different sources should usually not collide even
  if the source is omitted.
  - [Q] Is the 3-letter commit id sufficient for this?
  - The 'canonical' identifier includes the source, so this property
    is purely for convenient user input. Collisions without the
    source are survivable.
- To accommodate fche's fine-grained branching idea, the branch names
  in the Bunsen repo can be based on the unique ID.

Possible formats when accepting an identifier as input:
- option 1: (/)?(/)?
  - This corresponds to the current interface where we specify two
    options "project=... testrun=".
- option 2: (/)?(/)?.(.?)
  - Canonically, the timestamp is but the user could specify a more
    granular timestamp.
- option 3: (/)?(/)?(.?)
  - [Q] Needed for fche's suggestion to base the branch name on the
    identifier. Otherwise a nasty hack is needed to create the
    branch, commit to it, then rename the branch to include the
    bunsen_commit_id.

[Q] Especially for CGI queries, slashes '/' should be replaceable
with either colon ':' or dot '.'.

Testrun ID examples
- 2021-08.a4bf2c
- systemtap/2021-08.a4b.1
- mcermak/systemtap/2021-08.a4b.1
- mcermak/systemtap/2021-08-12:15
- a4bf2c55
- systemtap/a4bf2c55

All of this machinery lets us specify natural-looking commands like:

$ bunsen show systemtap/2022-02-13/12
  -> show testrun
$ bunsen show systemtap/2022-02-13/12:systemtap.log*
  -> show testlog
$ bunsen show systemtap/2022-02-13/12:systemtap.log:400-500
  -> show subset of lines in a testlog
$ bunsen ls systemtap/2022-02
  -> list testruns matching a partial ID

And corresponding web queries, e.g.
- /bunsen-cgi.py?script=show_logs&id=systemtap.2022-02-13.12:systemtap.log:400-500
- [Q] Is it worth implementing per-command endpoints
  '/show_runs?id=...'?

The fields in the testrun ID are also used to generate the names of
the branches where the testruns and testlogs will be stored. Note
that the branch name may be nontrivially user-visible if we're going
by fche's suggestion of specifying subsets of branches to push and
pull.
Which is why we're discussing it here rather than in the 'Repository
Layout' portion.

(Testruns and testlogs branches are separate for two reasons -- (a)
testlogs branches replace the worktree with each commit while
testruns branches don't, and, (b) more relevantly to this discussion,
we may want to save some space by pulling some testruns branches but
not the corresponding testlogs branches.)

Possible formats for the branch name:
- mcermak/systemtap/testlogs-2021-08
  -> the current scheme, split into branches by year_month
- sergiodj/gdb/testlogs-2021-08-Ubuntu-Aarch64-m64
  -> a scheme I previously tried for the GDB buildbots, with
     additional splitting by buildbot name
- (mcermak/systemtap/testruns-2021-08-04-12:15.1
  -> unlikely to scale well)
- mcermak/systemtap/testruns-2021-08/04-12:15.1
  -> fche’s suggestion to have 1 branch per testrun, probably
     requires the latter part of the branch identifier to be in a
     subdirectory like this

Overall, though, the branch name format can be fairly free-form and
different commit_logs scripts could even use different formats. The
only design goals are to adequately split the testlogs into a fairly
small number of fairly small branches (i.e. divide O(n*m) testlogs
into O(n) branches of O(m) testlogs), and to allow 'bunsen pull'
invocations to identify a meaningful subset of branches via wildcard
or regex. The content doesn't matter when accessing the repo, because
top-level index files specify the branch where testrun and testlog
data is stored.

The only required elements IMO are in the prefix:
- (/)?/test{runs,logs}-...

The current Bunsen implementation also strongly insists on following
immediately after test{runs,logs}, but I plan to relax this
requirement. Don't see any reason why not.
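
As a rough illustration of the identifier and branch naming ideas in
#1a, here is a minimal Python sketch. It assumes the displayed form is
source/project/date followed by a dot and the first 3 characters of
bunsen_commit_id (as in the example mcermak/systemtap/2021-08-04.c54),
with a sequence number appended only on collision, and month-split
branches as in the current scheme. All function names here are
hypothetical and not part of Bunsen.

```python
import datetime
import fnmatch


def canonical_id(source, project, date, commit_id, existing=()):
    """Build source/project/YYYY-MM-DD.abc; append .N only on collision."""
    base = f"{source}/{project}/{date.isoformat()}.{commit_id[:3]}"
    if base not in existing:
        return base
    seq = 1
    while f"{base}.{seq}" in existing:
        seq += 1
    return f"{base}.{seq}"


def branch_name(source, project, kind, date):
    """Month-split branch name, e.g. mcermak/systemtap/testlogs-2021-08."""
    if kind not in ("testruns", "testlogs"):
        raise ValueError(f"unknown branch kind: {kind}")
    return f"{source}/{project}/{kind}-{date:%Y-%m}"


def match_partial(partial, known_ids):
    """Return every known ID matched by a partial ID.

    Leading fields (source, project) may be omitted and the tail may be
    abbreviated. More than one result means the partial ID is ambiguous,
    and the caller decides whether to fail or treat it as a set.
    """
    matches = []
    for full in known_ids:
        parts = full.split("/")
        # Candidates: the full ID with 0, 1, or 2 leading fields dropped.
        candidates = ["/".join(parts[i:]) for i in range(len(parts))]
        if any(c.startswith(partial) for c in candidates):
            matches.append(full)
    return matches


def select_branches(pattern, branches):
    """Pick a subset of branches by shell-style wildcard."""
    return fnmatch.filter(branches, pattern)
```

For instance, match_partial("2021-08", ids) can return several
testruns, which is the 'ambiguous ID as a set of testruns' case, while
select_branches("mcermak/systemtap/testruns-*", branches) picks out
testruns branches without the corresponding testlogs branches.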

#1: Testrun Schema, very quick take1

version fields: these identify the version of the project being tested
- source_commit: commit id in the project source repo
- [Q] source_branch: branch being tested in the project source repo
  - Suggested by keiths at one point.
  - Overspecifies the version, but useful for tracking buildbot
    testing?
- package_nvr: version id of a downstream package
  - [Q] Should this be the full nvr (e.g. 'systemtap-4.4-6.el7') or
    just the version number ('4.4-6.el7')?
- version: fallback field if the project uses some other type of
  versioning scheme
  - [Q] By default, we don’t know how to sort this. I could extend
    the analysis libraries to load the original parsing module for
    the project (e.g. systemtap.parse_dejagnu) and check for a method
    to extract the version sort key.

configuration fields: identify the system configuration on which the
project was tested
- arch: hardware architecture
- osver: identifier of the Linux distro e.g. fedora-36
- Other fields identify dependencies e.g. kernel_ver, gcc_ver
  - These are version fields similar to package_nvr. Analysis scripts
    should be configurable to sort on these instead of on / together
    with the project version.
- [Q] More complex configuration subtleties could be captured by
  parsing things like config.log and creating a related testrun with
  the yes/no results? Needs some thinking since an autoconf 'no'
  isn't a 'fail' for the purposes of reporting regressions.
- [Q] keiths had an alternate set of configuration fields for his GDB
  use case, should give some thought to this.
  - 'target_board' is the only one that isn't covered by existing
    fields?

testcases: an array of objects as follows
- name - DejaGNU .exp name
  - every testrun of a DejaGNU testsuite usually has the same .exps
- outcome - DejaGNU outcome code e.g. PASS, FAIL, KPASS, KFAIL, ...
- subtest - DejaGNU subtest, very very very freeform
  - a DejaGNU .exp doesn't always have the same subtests; subtest
    strings mix an identifying part with a diagnostic part, and they
    aren't even necessarily unique within an .exp
- origin_sum - cursor (subset of lines) in the dejagnu .sum file
- origin_log - cursor (subset of lines) in the dejagnu .log file

[Q] (!!) The 'store PASS subtests separately' vs 'consolidate PASS
subtests' format tradeoff is a complex one & I'll need to write a
follow-up email to discuss it. Briefly, because the set of subtest
strings varies wildly from testrun to testrun, the 'state' of an .exp
is probably best described as 'the set of FAILing subtest strings in
the .exp'. If there are no FAILs we assume the .exp is working
correctly, which can be recorded as a single PASS; trying to keep
track of which PASS subtests are 'supposed' to be present and
flagging their absence as a failure would be a pretty daunting task.
If we want to do it, it's best left to some complex follow-up
analysis script.

bookkeeping information: where is the relevant data located in the
Bunsen Git repo?
- project: the project being tested
- testrun_id: unique identifier of the testrun within the project
- bunsen_testruns_branch: branch where the parsed testrun JSON is
  stored
- bunsen_testlogs_branch: branch where the testlogs are stored
- bunsen_commit_id: commit ID of the commit where the testlogs are
  stored
- [Q] (_cursor_commit_ids - A hack to shorten the JSON representation
  of origin_{sum,log}, I will get rid of it as this field is
  typically redundant with bunsen_commit_id)

extras, purely for convenience
- pass_count
- fail_count
  - [Q] These are self-explanatory, but not as full-featured as
    DejaGNU’s summary of all outcome codes.
    I could remove this entirely, or add an outcome_counts field
    containing a map {‘PASS’: …, ‘FAIL’: …, ‘KFAIL’: …, …}

Next time on Bunsen:
- #2a: Repository Layout: including a more detailed discussion of
  pushing and pulling subsets of branches, in particular
  testruns-only or testruns+testlogs pull operations.
- #2b: SQLite Cache and Django-esque API: in particular, will discuss
  how the Git repo looks if parsed testrun data is stored in a
  database. I still expect having JSON in the repo to be canonical,
  however, because pulling a *subset* of testruns from an SQLite
  database would be much more complex than 'git pull '. ([fche]
  assures me, however, that cloning an SQLite repo in its entirety
  would be straightforward.)
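
To make the 'consolidate PASS subtests' option and the proposed
outcome_counts field a bit more concrete, here is a minimal Python
sketch. The testcase dicts use the name/outcome/subtest fields from
the testcases schema; the helper names are hypothetical, and keeping
every non-PASS entry (not only FAIL) is an assumption made for
illustration rather than a description of Bunsen's actual behaviour.

```python
from collections import Counter


def outcome_counts(testcases):
    """Map each DejaGNU outcome code to the number of testcase entries."""
    return dict(Counter(tc["outcome"] for tc in testcases))


def consolidate_passes(testcases):
    """Collapse PASS subtests per .exp.

    Every non-PASS entry is kept unchanged. An .exp with no non-PASS
    entries is recorded as a single PASS without a subtest string.
    (Assumption: non-PASS outcomes other than FAIL are worth keeping.)
    """
    by_exp = {}
    for tc in testcases:
        # dicts preserve insertion order, so .exp order is stable
        by_exp.setdefault(tc["name"], []).append(tc)

    result = []
    for name, entries in by_exp.items():
        non_pass = [tc for tc in entries if tc["outcome"] != "PASS"]
        if non_pass:
            result.extend(non_pass)
        else:
            result.append({"name": name, "outcome": "PASS"})
    return result
```

If both are wanted, outcome_counts would need to run on the parsed
testcases before consolidate_passes, since the consolidated form no
longer knows how many PASS subtests there were.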