From: "Serhei Makarov"
To: Bunsen
Date: Thu, 17 Mar 2022 17:23:04 -0400
Subject: bunsen (re)design discussion #2: Repository Layout, SQLite design idea + Django-esque API(?)
Message-Id: <1796a6e2-2b2a-49a7-b350-9d58700d3e30@www.fastmail.com>

* #2a Repository Layout -- where do you put a bunsen repository?

Plausible locations of the Bunsen repo on disk:
- bunsen/.bunsen: inside a Git checkout of the Bunsen codebase.
- bunsen/bunsen-data: inside a Git checkout of the Bunsen codebase, not hidden.
- project/.bunsen: inside a Git checkout of the project source code.
- project/bunsen-data: inside a Git checkout of the project source code, not hidden.
- project.bunsen: standalone Bunsen repo.

When we install Bunsen as an RPM, the 'bunsen' command should
autodetect which repo it's being run against. Commands such as the
following make sense:
- cd bunsen-systemtap && bunsen ls # -> uses ./.bunsen, ./bunsen-data
- cd systemtap && bunsen ls # -> uses ./.bunsen, ./bunsen-data
- cd project.bunsen && bunsen ls # -> uses .

For pushing and pulling, the Bunsen Git repo could be placed in a
separate location, e.g.
- /public/bunsen-systemtap.git
- /notnfs/staplogs/bunsen-systemtap/.bunsen/config specifies git_repo=/public/bunsen-systemtap.git

* #2b Repository Layout -- how is the content of the repo laid out

This is a subject of ongoing discussion with fche, because we have
quite different opinions about what is convenient to keep in Git vs
SQLite, how flexible the Git layout should be etc.

In principle, the analysis and data representation will work more or
less the same regardless of what solution we settle on. Therefore, the
repository layout can be made configurable (and potentially the
configurability can be reduced down the line as we settle on what
makes sense and what doesn't).

There are *several* types of artefacts we're contemplating between the
two of us.

(1) testlogs Git repo. Requested by fche.
*Every* commit of this repo is a directory tree of log files.
The branch naming scheme is completely free-form.
This is meant to allow third parties to start creating git repos for
us to pull down, as quickly as possible, with minimal tooling or
explanation.

(2) index Git repo. Preferred by serhei, described in a prior email.
- (/)?index - contains JSON files ---
- (/)?/testlogs-
- (/)?/testruns-

An addendum re: branch naming: Git already prefixes cloned branches as
'remotes/foo/branch'. So this would most likely replace any prefix we
would be adding on our own.
Keep in mind that we have to work directly with 'remotes/foo/branch'
and be careful with any Git operations (e.g. checkout) that default to
creating a local mirror of the branch.

(3) index SQLite DB. Discussed in a later email.

(4) JSON analysis cache. This would just be a directory of JSON files,
with the format defined by the analysis script/library that generates
it.

* #2c Repository Layout -- examples

Now for some examples of how this could be arranged:

(a) The current Bunsen setup for SystemTap uses all Git. We have:
- An index Git repo at .bunsen/bunsen.git

Cloning/archiving this repo is super simple. Cloning a subset of this
repo involves cloning a subset of the branches, and is also fairly
simple if the branch naming scheme allows us to use sensible
wildcards.

(a') Suppose we evolve this format to allow several separate Git repos.
- foo.bunsen/index.git - the main index+testlogs git repo
- foo.bunsen/index-myproject.git - an index git repo for a separate project
- foo.bunsen/extra-input1.git - a testlogs Git repo cloned from one source
- foo.bunsen/extra-input2.git - a testlogs Git repo cloned from another source
- path/to/public/export.git - extracted testruns from index git repo, with any sensitive data omitted
- cache.sqlite - an SQLite cache for analysis artefacts

Cloning this repo is a bit more complicated. We have to specify which
branches from which git repo on the remote side will be cloned to
which git repo on the local side. So the following command API might
need to become somewhat more complicated:

$ bunsen clone user@ssh.server:foo.bunsen/ source_name
$ bunsen clone https://server/path/to/public/export.git source_name

Although there are sensible defaults:
- if making a fresh clone, clone all the same git repos and all the branches
- if adding data to a new repo

(b) fche's preference for an all-SQLite setup.
- foo.bunsen/bunsen.git - the main testlogs git repo
- index.sqlite - an SQLite DB with parsed testrun data
- ??? cache.sqlite - an SQLite cache for transient analysis artefacts

Whether index and cache should be separate or merged is arguable. If
cloning the SQLite DB is a possibility, separation may be useful
because there *is* a distinction between a permanent index which is
worth keeping around vs. a transient analysis which functions as an
annotation to this index. I don't see cloning the SQLite DB as being a
standard operation in this design, so merging the DBs would be less
annoying. But, on the other hand....

The main disadvantage to cloning this type of repo is that the
receiving side has to repeat the parsing work that was done on the
sending side. Since this could take hour(s) when we clone a repo for
the first time, for me this is a deal-breaker.

The only workaround is keeping tight control over the size of the
SQLite DB that has the parsed testrun data, so that we *can* clone it
together with the testlogs as a special case. This requires the
possibility of offloading large/transient analysis tables to a
separate SQLite DB from small/persistent/necessary analysis tables. I
don't know yet if this will significantly complicate the code or not.
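
To get a feel for how much the index/cache split might complicate the
code, here is a rough sketch using Python's sqlite3 module and SQLite's
ATTACH DATABASE. The file names index.sqlite and cache.sqlite come from
layout (b) above; the helper name open_repo_db and the table names and
columns (testruns, analysis) are made up for illustration and are not
part of any existing Bunsen schema.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_repo_db(repo_dir):
    """Open index.sqlite and attach cache.sqlite as schema 'cache'.

    Hypothetical helper: tables and columns are illustrative only.
    """
    repo_dir = Path(repo_dir)
    conn = sqlite3.connect(str(repo_dir / "index.sqlite"))
    # ATTACH makes a second database file reachable through the same
    # connection, so queries can still join across index and cache.
    conn.execute("ATTACH DATABASE ? AS cache",
                 (str(repo_dir / "cache.sqlite"),))
    # Small and persistent: candidate for cloning with the testlogs.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS main.testruns ("
        " id INTEGER PRIMARY KEY, branch TEXT, commit_id TEXT)"
    )
    # Large or transient: can be deleted and regenerated locally.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache.analysis ("
        " testrun_id INTEGER, name TEXT, result TEXT)"
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conn = open_repo_db(tmp)
        conn.execute(
            "INSERT INTO main.testruns (branch, commit_id) VALUES (?, ?)",
            ("remotes/foo/branch", "abc123"),
        )
        conn.execute(
            "INSERT INTO cache.analysis VALUES (1, 'regressions', '[]')"
        )
        conn.commit()
        # One query spanning both files.
        print(conn.execute(
            "SELECT t.branch, a.name FROM main.testruns t"
            " JOIN cache.analysis a ON a.testrun_id = t.id"
        ).fetchall())
        conn.close()
```

Under these assumptions the split costs one extra ATTACH call plus a
schema prefix on table names, and only index.sqlite would need to
travel with the testlogs when cloning. Whether that stays true once
real analysis tables are involved is still an open question.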