From: "Serhei Makarov"
To: "Keith Seitz", "goldgold098123 at gmail dot com via Bunsen"
Subject: Re: Initial findings of bunsen performance and questions
Date: Wed, 16 Sep 2020 19:09:18 -0400
List-Id: Bunsen mailing list
Hello Keith,

Thanks for the extensive work and experimentation. I'm not sure your show_test
script should be that slow, and I hope to get to the bottom of why that's
happening for you. At the same time, your SQLite database work should be very
useful and necessary for other reasons.

On Wed, Sep 16, 2020, at 6:18 PM, Keith Seitz via Bunsen wrote:
> The first question to ask is: Is this an oversimplified/naive implementation
> of this script?

It looks right to me. I briefly suspected the lambda might do some additional
object copying, but that doesn't seem to be the case.

> Continuing: This script takes over twenty seconds to complete on my computer.
> IMO that is a show-stopping problem.

Hmm. Very odd. A script would take very long to complete if it were a 'column
query' (e.g. looking at one testcase across *all* versions would require
parsing the JSON for every testrun -- essentially the entire repo -- and
throwing away most of that data). There are good reasons to run queries like
that, which makes your concerns legitimate.

This script doesn't look like one of those, though: it reads just one testrun.
My brief tests of your script yielded ~3s to run against a SystemTap testrun
and ~1.5s against a GDB buildbot run. Since it reads one particular testrun,
the scale of the repo is immaterial (my SystemTap repo has ~2000-3000 runs in
it).

I suspect my code for building the repo has some bug when using
consolidate_pass=False. Could you place the Git/JSON repo you built somewhere
I have access to?

(Also, you could try the +diff_runs script on your repo. If the JSON parsing
is the source of the slowdown and reading one run took 20s, logically reading
two runs should take about 40s.)

> There are two quantitative measures that I concentrated on: speed and size.
>
> First up: Speed. The "show_test.py" script above now completes in less
> than one second (down from 20 seconds). Even as an unoptimized proof-of-concept,
> that is, IMO, adequate performance. [Additional gains could be realized by
> bypassing serialization altogether and using only sql queries.]
>
> I took two measurements (with consolidate_pass=False to get all test results
> stored):

I'll need to look further into whether that option is appropriate.

> 2) With eight additional, different gdb.{log,sum} files imported on top of
> #1. Bunsen used 46340kB of storage. The database approach used 49400kB.
> That's a 6.6% difference in just ten runs.

IMO the comparison has to be done with 100s to 1000s of similar test runs,
since Git's de-duplication must be compared to whatever SQLite does at that
scale of data. I doubt it's important, though: for this use case we have disk
space to burn, and the query speedup alone would justify keeping both forms of
storage.

> Is this an approach that seems viable as a supplement (replacement?)
> to JSON output? Is this approach something worth pursuing?

Definitely worth pursuing, due to the aforementioned possibility of 'column
queries', which I don't see any way of handling well with the design I
currently have. I'm not sure whether SQLite is better used as a replacement
for the JSON/Git storage or as a supplemental cache built from it and used to
speed up queries. (Also, the original log files must be retained in any case.)

All the best,
      Serhei
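
P.S. To make the 'supplemental cache' idea above a bit more concrete, here is
a rough sketch of how a cache table could answer a 'column query' for one
testcase across all testruns without parsing the JSON for every run. The table
layout, column names, and file name are hypothetical illustrations, not
anything Bunsen implements today:

    import sqlite3

    # Hypothetical cache layout: one row per (testrun, testcase) outcome,
    # populated whenever a testrun's .sum/.log files are parsed into the repo.
    conn = sqlite3.connect("bunsen-cache.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS testcase_outcomes (
            testrun_id TEXT,   -- e.g. the commit id of the testrun in the Git repo
            exp_file   TEXT,   -- e.g. 'gdb.base/break.exp'
            subtest    TEXT,   -- subtest name from the .sum file
            outcome    TEXT    -- PASS/FAIL/KFAIL/...
        )
    """)

    # The 'column query': one testcase across *all* testruns, answered from
    # the cache instead of reading every testrun's JSON.
    rows = conn.execute(
        "SELECT testrun_id, outcome FROM testcase_outcomes"
        " WHERE exp_file = ? AND subtest = ?",
        ("gdb.base/break.exp", "break at main"),
    ).fetchall()
    for testrun_id, outcome in rows:
        print(testrun_id, outcome)

In a scheme like this the Git/JSON data would stay the source of truth, and
the cache could always be rebuilt from it.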