From: Keith Seitz
To: bunsen@sourceware.org
Subject: Initial findings of bunsen performance and questions
Date: Wed, 16 Sep 2020 15:18:44 -0700
Message-ID: <30950cc2-5d7f-eb93-42b1-d1c7a9138e81@redhat.com>

Hi,

I've spent the last week (or two) playing with Bunsen, reading the
sources, and otherwise attempting to comprehend how it all fits
together. I have some initial findings on which I would like to seek
comment before forging too much further ahead.

First off, a reminder: my use case is release analysis. That's not far
off from the Bunsen use model at all, but one consequence of it is
that I need access to /all/ the results in a test run.

To start playing around, I wrote a simple script, show_test.py, to
display the results of a given test in a given Bunsen commit
(uninteresting bits removed):

----- meat of show_test.py -----
import sys
import bunsen

# `info` and `cmdline_args` are defined in the elided setup code.
b = bunsen.Bunsen()
opts = b.cmdline_args(sys.argv, info=info, args=cmdline_args,
                      required_args=['test_name', 'commit'])
testrun = b.testrun(opts.commit)
all_tests = testrun.testcases
found_tests = list(filter(lambda t: t['name'] == opts.test_name,
                          all_tests))
if found_tests:
    print("found {} matching tests for \"{}\"".format(len(found_tests),
                                                      opts.test_name))
    for t in found_tests:
        print(t)
else:
    print("found no tests matching \"{}\"".format(opts.test_name))
----- end of show_test.py -----

The first question to ask: is this an oversimplified/naive
implementation of this script?

Continuing: this script takes over twenty seconds to complete on my
computer. IMO that is a show-stopping problem.
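For reference, here is roughly how the time can be pinned down -- a
minimal sketch that uses only the API calls already shown in
show_test.py above, with the argument parsing stripped down to taking
the commit id straight from the command line:

----- timing sketch -----
import sys
import time

import bunsen

b = bunsen.Bunsen()
commit = sys.argv[1]          # Bunsen commit id

start = time.perf_counter()
testrun = b.testrun(commit)   # this call accounts for nearly all the runtime
print("b.testrun() took {:.1f}s for {} testcases".format(
    time.perf_counter() - start, len(testrun.testcases)))
----- end timing sketch -----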
After investigating this, I've determined that the majority of the
slowness is in deserializing the Testrun from JSON. Maybe there's a
better way to read the data back from JSON? I don't know; certainly
nothing obvious jumps out at me. [Warning: I'm no expert in the area
of JSON parsing in Python.]

So I experimented with replacing the JSON Testrun data with an SQLite
database. Why a database? I chose this route for one simple reason:
database queries are a staple of the modern *-as-a-service designs
predominant on the web, and that is how I intend to implement a
front-end for my work.

Last year, I wrote a proof-of-concept which parsed gdb.sum (much like
Bunsen does today) and output an SQLite database. Two trivial database
queries could reproduce the summary lines of a given test run (# of
passed tests, # failed, etc.) in under 0.1 seconds. That shows the
potential of a database-backed data model.

This week, I completed a proof-of-concept of this idea in Bunsen, and
I'd like to present my findings here. There are two quantitative
measures that I concentrated on: speed and size.

First up: speed. The show_test.py script above now completes in less
than one second, down from over twenty. Even for an unoptimized
proof-of-concept, that is, IMO, adequate performance. [Additional
gains could be realized by bypassing deserialization altogether and
using only SQL queries.]

Second: size. Bunsen stores its data in a git repository located in
bunsen_upload, IIUC. I measured the size of this directory under the
two implementations, taking two measurements (with
consolidate_pass=False so that all test results are stored):

1) With two identical gdb.{log,sum} files (different commits):
   JSON-based Bunsen used 5.2MB of storage; the database-based
   approach used 5.0MB. IMO there is no meaningful difference.

2) With eight additional, different gdb.{log,sum} files imported on
   top of #1: Bunsen used 46340kB of storage; the database approach
   used 49400kB. That's a 6.6% difference after just ten runs.

I consider this 6.6% storage overhead an acceptable trade-off for the
massive increase in speed (and for the ability to avoid custom
serialization altogether).

Does this seem viable as a supplement to (replacement for?) the JSON
output? Is it something worth pursuing?

Keith
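P.S. For concreteness, here is the sort of query I mean. This is only
a sketch against a hypothetical single-table schema -- the table and
column names are illustrative, not the actual proof-of-concept schema:

----- query sketch -----
import sqlite3

conn = sqlite3.connect('testrun.db')  # hypothetical per-testrun database

# Reproduce the gdb.sum summary lines: one count per outcome
# (PASS, FAIL, KFAIL, UNTESTED, ...).
for outcome, count in conn.execute(
        "SELECT outcome, COUNT(*) FROM testcases GROUP BY outcome"):
    print("# of {}: {}".format(outcome, count))

# All results for one named test -- the 20-second filter in
# show_test.py becomes a single query (fast, given an index on name).
for row in conn.execute(
        "SELECT name, subtest, outcome FROM testcases WHERE name = ?",
        ('gdb.base/break.exp',)):
    print(row)

conn.close()
----- end query sketch -----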