Date: Thu, 10 Mar 2022 15:00:13 -0500
From: "Serhei Makarov"
To: "Frank Ch. Eigler"
Cc: Bunsen
Subject: Re: bunsen (re)design discussion #1: testrun & branch identifiers, testrun representation
Message-Id: <34fdccc3-5756-492a-89b9-3a2cab65b16a@www.fastmail.com>
In-Reply-To: <20220310180414.GC28310@redhat.com>
References: <320ed3c9-2612-4f64-bb1a-6a791bef4168@www.fastmail.com> <20220310180414.GC28310@redhat.com>

Thanks for those comments.

I believe the current point of disagreement is whether we can abandon
the JSON testrun storage format, while maintaining neat properties for
easy cloning / archival of entire Bunsen repos, subsets of Bunsen repos,
and so forth. I'm not convinced yet, especially if we go the route of
every Bunsen instance parsing and generating its own local testrun data.

However, your comments on how to specify the SQLite schema and split the
'dejagnu', 'autoconf', ... parser bits have convinced me to run my
design exercise in the other direction: try my hand at specifying an
SQLite schema, then check whether the same features can be replicated in
the JSON version of the format.

(When thinking in SQLite first, I can immediately think of a number of
'metaknowledge' bits that are currently hardcoded in the corresponding
parsing scripts, but might conveniently be maintained and extended in
tables. Things like knowing that "Red Hat Enterprise Linux \d+ (.*)"
maps to "rhel-\1". But I'll see if a more convincing list of examples
comes to mind or not.)

On Thu, Mar 10, 2022, at 1:04 PM, Frank Ch. Eigler wrote:
>> [Q] Should the bunsen_commit_id be deprecated entirely as a unique
>> identifier? We could conceivably allow a single set of testlogs to
>> define several testruns for the *same* project, which would break the
>> uniqueness property for +
>
> Given that we're reconsidering use of git for purposes of storing this
> analysis data, there is no obvious hash for it.

Disagree. The bunsen_commit_id hash is based on the commit storing the
testlogs, not the commit storing the testruns JSON files. So moving the
testruns to SQLite wouldn't eliminate the possibility of using this hash
in a testrun identifier.

> OTOH, given that it
> is the result of analysis, we have a lot of rich data to query by. I
> suspect we don't need much of a standard identifier scheme at all. An
> arbitrary nickname for query shortcutting could be a field associated
> with the "testrun" object itself.

I don't get it.
The testrun ID could be stored in a field associated with the "testrun"
object. It is indeed an 'arbitrary nickname' that we are free to
generate when the testrun is created. But we do need such a nickname to
exist in order to store tuples in the (testrun_id, keyword, value)
schema, as well as to issue unambiguous commands like 'delete this
testrun'.

In my prior email I was trying to improve on 'arbitrary hexsha' and
design a 'nickname' that identifies the testrun somewhat and can be
handled by human minds when necessary. The bunsen_commit_id 'hexsha' was
sufficiently unique but it was also very confusing from a UI perspective
since users will be thinking of commit IDs in the project Git repo
whenever they see a bare hexsha.

>> - To accommodate fche's fine-grained branching idea, the
>> branch names in the Bunsen repo can be based on the unique ID.
>
> (I was talking about fine-grained branching *for the test logs*, not
> for the derived analysis data.)

If the testrun data is stored in SQLite, obviously the Git branching
scheme would only apply to the testlogs. If the testrun data is stored
in Git, the branching scheme would be based on the same concerns about
easily pulling a subset of the data. I'm not convinced we need
individual-testrun granularity for the branching, but the branching
scheme I outlined easily accommodates this option.

>> Overall, though, the branch name format can be fairly free-form and
>> different commit_logs scripts could even use different formats. The
>> only design goals are to adequately split the testlogs into a fairly
>> small number of fairly small branches (i.e. divide O(n*m) testlogs
>> into O(n) branches of O(m) testlogs), and to allow 'bunsen pull'
>> invocations to identify a meaningful subset of branches via wildcard
>> or regex.
>
> Noting we switched to talking about testlogs - the raw test result log
> files that are archived in git.
> These (rather than analysis data
> stored somewhere else) are indeed worth pushing & pulling &
> transporting between bunsen installations. I suspect that imposing a
> naming standard here is not necessary either. As long as we
> standardize what bunsen consumes from these repos (every tag? every
> commit? every branch? every ref matching a given regex?), a human or
> other tooling could decide their favorite naming convention.

I'll need to think about this carefully. My intuition screams a strong
disagreement. If we just treat the transported data as a set of
unlabeled testlogs, and say that each Bunsen instance is going to be
naming them separately & differently....

>> #1: Testrun Schema, very quick take1
> [...] the list of
> key names can be left open in the tooling core, so queries can use any
> set of them. i.e., the schema for "testrun" analysis objects could be:
>
> (id, keyword, value)
>
> and then a user can find testruns by combination of keywords having
> particular values (or range or pattern). That can map
> straightforwardly to a relational filter (select query) if that's how
> we end up storing these things.

As I understand it, this is a neat solution to my earlier question of
needing to define metadata fields in an SQLite schema and ending up
having to deal with migrations, etc. during the lifecycle of a Bunsen
repo.

Of course, it means we definitely don't get Django-like ORM
functionality for free: we need to implement our own mapping between
this (id, keyword, value) soup and individual Testrun objects that
combine all keywords for a particular id. This mapping would take the
place of the current code which translates Testrun objects to and from
JSON.

That said, there are properties of this scheme I like. The SQLite schema
should be the same across any Bunsen instance for any project, and this
scheme satisfies that goal.
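
A minimal sketch of the (id, keyword, value) storage and the hand-written
mapping back to per-testrun objects discussed above. The table name
testrun_kv, the helper names store_testrun and find_testruns, and the
use of plain dicts in place of Testrun objects are all assumptions made
for illustration; none of this is existing Bunsen code.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per (testrun, keyword) pair.
conn.execute("""
    CREATE TABLE testrun_kv (
        testrun_id TEXT NOT NULL,
        keyword    TEXT NOT NULL,
        value      TEXT,
        PRIMARY KEY (testrun_id, keyword)
    )
""")

def store_testrun(conn, testrun_id, fields):
    """Flatten a dict of testrun metadata into key/value rows."""
    conn.executemany(
        "INSERT OR REPLACE INTO testrun_kv VALUES (?, ?, ?)",
        [(testrun_id, k, str(v)) for k, v in fields.items()],
    )

def find_testruns(conn, keyword, value):
    """Return {testrun_id: {keyword: value}} for testruns matching one keyword."""
    rows = conn.execute(
        """
        SELECT testrun_id, keyword, value FROM testrun_kv
        WHERE testrun_id IN (
            SELECT testrun_id FROM testrun_kv
            WHERE keyword = ? AND value = ?
        )
        ORDER BY testrun_id, keyword
        """,
        (keyword, value),
    )
    # Regroup the key/value rows into one dict per testrun.
    testruns = defaultdict(dict)
    for testrun_id, k, v in rows:
        testruns[testrun_id][k] = v
    return dict(testruns)

store_testrun(conn, "id1", {"arch": "x86_64", "osver": "fedora-35"})
store_testrun(conn, "id2", {"arch": "aarch64", "osver": "rhel-8"})
found = find_testruns(conn, "arch", "x86_64")
```

Adding a new metadata field in this layout is just another row, so no
schema migration is involved; the cost is that every value is stored as
text and the regrouping has to be done by hand.
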

To test the 'straightforward' part of your claim about querying, what
might a 'select' query look like for 'all configuration fields of {set
of testruns}'? We would want it to produce a table along the lines of:

    testrun  arch     osver      kernel_ver
    id1      x86_64   fedora-35  4.6-whatever
    id2      aarch64  rhel-8     4.7-whatever

(Another functionality [Q] came up for distribution & kernel version: we
may want to store the most exact version of each component but then
specify a granularity for analysis which combines some 'similar-enough'
versions into the same configuration (e.g. treat all 5.x kernels as one
configuration, then treat all 4.x kernels as another configuration). If
we see a problem arise on 5.x but not 4.x, *then* we would want to look
at the detailed history of changes within 5.x.)

>> testcases: an array of objects as follows
>> - name - DejaGNU .exp name
>> - outcome - DejaGNU outcome code e.g. PASS, FAIL, KPASS, KFAIL, ...
>> - subtest - DejaGNU subtest, very very very freeform
>> - origin_sum - cursor (subset of lines) in the dejagnu .sum file
>> - origin_log - cursor (subset of lines) in the dejagnu .log file
>
> OK, so that could be a separate relational table of testrun analysis
> data, derived from the testlog and related to the above set.

Agreed.

>> [Q] (!!) The 'store PASS subtests separately' vs 'consolidate PASS
>> subtests' format tradeoff is a complex one [...]
>
> To the extent this represents an optimization, I'd tend to think of it
> as a policy decision that the dejagnu parser configuration should make.

Agreed, I would look at it that way too. We might want to flag what
format a stored testrun is using, in a way that can be easily found out
later, or changed by changing the parser configuration and rebuilding.

>> bookkeeping information: where is the relevant data located in the Bunsen Git repo?
>> - project: the project being tested
>> - testrun_id: unique identifier of the testrun within the project
>> - bunsen_testruns_branch: branch where the parsed testrun JSON is stored
>> - bunsen_testlogs_branch: branch where the testlogs are stored
>> - bunsen_commit_id: commit ID of the commit where the testlogs are stored
>
> (Not sure how much of this needs to be formally separated, vs. just
> basic key/value tuples associated with the testrun vs. already
> represented otherwise.)

The testrun_id appears to be basic to this scheme, since it's used to
label which key/value tuples are associated with which testruns.

> These could also go as additional key/value tuples into the testrun.
> (Prefix their names with "dejagnu-" to identify the analysis/parse
> tool that created them.)

Do you mean something like 'dejagnu-PASS' or 'dejagnu-PASS-count'? That
could work, with an iterator like:

    def dejagnu_outcome_counts(testrun):
        for outcome in dejagnu_all_outcomes:  # 'PASS', 'FAIL', ...
            key = 'dejagnu-{}-count'.format(outcome)
            val = testrun[key]
            yield outcome, key, val

...
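
Returning to the earlier question of what a 'select' for 'all
configuration fields of {set of testruns}' might look like: one possible
shape is a conditional-aggregation pivot over the key/value rows. This
is only a sketch; the table name testrun_kv, the helper config_table,
and the sample rows are assumptions for illustration, not an existing
Bunsen schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical key/value table, one row per (testrun, keyword) pair.
conn.execute(
    "CREATE TABLE testrun_kv (testrun_id TEXT, keyword TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO testrun_kv VALUES (?, ?, ?)",
    [
        ("id1", "arch", "x86_64"),
        ("id1", "osver", "fedora-35"),
        ("id1", "kernel_ver", "4.6-whatever"),
        ("id1", "dejagnu-PASS-count", "1200"),
        ("id2", "arch", "aarch64"),
        ("id2", "osver", "rhel-8"),
        ("id2", "kernel_ver", "4.7-whatever"),
    ],
)

def config_table(conn, testrun_ids, fields):
    """Pivot key/value rows into one row per testrun, one column per field."""
    # One MAX(CASE ...) column per requested keyword; a keyword that a
    # testrun lacks comes back as NULL (None in Python).
    cols = ", ".join(
        "MAX(CASE WHEN keyword = ? THEN value END)" for _ in fields
    )
    marks = ", ".join("?" for _ in testrun_ids)
    sql = (
        f"SELECT testrun_id, {cols} FROM testrun_kv "
        f"WHERE testrun_id IN ({marks}) "
        "GROUP BY testrun_id ORDER BY testrun_id"
    )
    # Positional parameters are bound in the order they appear in the SQL.
    return conn.execute(sql, [*fields, *testrun_ids]).fetchall()

table = config_table(conn, ["id1", "id2"], ["arch", "osver", "kernel_ver"])
for row in table:
    print(*row)
```

The query text has to be generated per set of requested fields, which is
the main way this is less 'straightforward' than selecting named columns
from a fixed-schema table.
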