From: Keith Seitz <keiths@redhat.com>
To: bunsen@sourceware.org
Subject: Initial findings of bunsen performance and questions
Date: Wed, 16 Sep 2020 15:18:44 -0700
Message-ID: <30950cc2-5d7f-eb93-42b1-d1c7a9138e81@redhat.com>
Hi,
I've spent the last week or two playing with Bunsen, reading sources, and
otherwise attempting to comprehend how it all fits together. I have some
initial findings on which I would like comment before forging too much
further ahead.
First off, a reminder. My use case is release analysis. It's not too far off
from the Bunsen use model at all. However, a consequence of my use case is
that I need access to /all/ the results in the test run.
To start playing around, I wrote a simple script to display the results of
a given test in a given Bunsen commit, show_test.py (uninteresting bits
removed):
----- meat of show_test.py -----
b = bunsen.Bunsen()
opts = b.cmdline_args(sys.argv, info=info, args=cmdline_args,
                      required_args=['test_name', 'commit'])
testrun = b.testrun(opts.commit)
all_tests = testrun.testcases
found_tests = list(filter(lambda t: t['name'] == opts.test_name, all_tests))

if found_tests:
    print("found {} matching tests for \"{}\"".format(len(found_tests),
                                                      opts.test_name))
    for t in found_tests:
        print(t)
else:
    print("found no tests matching \"{}\"".format(opts.test_name))
----- end of show_test.py -----
The first question to ask is: Is this an oversimplified/naive implementation
of this script?
Continuing: This script takes over twenty seconds to complete on my computer.
IMO that is a show-stopping problem.
After investigating this, I've determined that the majority of the slowness
is in the deserialization of Testrun (from JSON).
Maybe there's a better way to deserialize the data from JSON? I don't know.
Certainly nothing obvious jumps out at me. [Warning: I'm no expert in the
area of JSON parsing in Python.]
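
For anyone wanting to check where the time goes on their own data, a rough
sketch follows. It times json.loads separately from the construction of
per-test objects. The payload layout and the Testcase class are hypothetical
stand-ins, not Bunsen's actual format or classes, so the numbers only show
how to split the measurement.

```python
import json
import time

# Hypothetical stand-in for a serialized testrun; not Bunsen's real schema.
def make_payload(n):
    tests = [{"name": "gdb.base/test{}.exp".format(i),
              "subtest": "subtest {}".format(i),
              "outcome": "PASS"} for i in range(n)]
    return json.dumps({"testcases": tests})

class Testcase(dict):
    """Hypothetical wrapper that copies fields one at a time."""
    def __init__(self, fields):
        super().__init__()
        for key, value in fields.items():
            self[key] = value

def time_stages(payload):
    """Return (parse_seconds, build_seconds, testcases)."""
    t0 = time.perf_counter()
    raw = json.loads(payload)                        # pure JSON parsing
    t1 = time.perf_counter()
    testcases = [Testcase(t) for t in raw["testcases"]]  # object construction
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, testcases

parse_s, build_s, testcases = time_stages(make_payload(20000))
print("json.loads: {:.3f}s, object construction: {:.3f}s".format(parse_s, build_s))
```

If the second number dominates, the cost lies in building objects rather
than in the JSON parser itself.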
So I experimented with replacing the JSON Testrun data with an SQLite
database. Why a database? I chose this route for one simple reason:
database queries are a standard part of the *-as-a-service designs that
predominate on the web, and that is how I intend to implement a front-end
for my work.
Last year, I wrote a proof-of-concept which parsed gdb.sum (much like Bunsen
does today) and wrote the results to an SQLite database. Two trivial database
queries could reproduce the summary lines of the given test run (e.g., number
of passed tests, number of failed tests, etc.) in under 0.1 seconds. This shows
the potential of database-based data models.
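
As an illustration only, summary counts of this kind can come from a query
like the one sketched below. The table name, column names, and sample rows
are assumptions made for the example; they are not the proof-of-concept's
actual schema, and this sketch uses a single GROUP BY query rather than the
two queries mentioned above.

```python
import sqlite3

# Hypothetical schema: one row per test result from a gdb.sum file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (name TEXT, subtest TEXT, outcome TEXT)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        ("gdb.base/break.exp", "break main", "PASS"),
        ("gdb.base/break.exp", "run to main", "PASS"),
        ("gdb.base/step.exp", "step into", "FAIL"),
        ("gdb.base/step.exp", "step over", "PASS"),
    ],
)

def summary(conn):
    """Count results per outcome, like the totals at the end of gdb.sum."""
    rows = conn.execute(
        "SELECT outcome, COUNT(*) FROM results GROUP BY outcome"
    ).fetchall()
    return dict(rows)

for outcome, count in sorted(summary(conn).items()):
    print("# of {} results: {}".format(outcome, count))
```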
This week, I've completed a proof-of-concept of this idea in Bunsen, and I'd
like to present my findings here.
There are two quantitative measures that I concentrated on: speed and size.
First up: Speed. The "show_test.py" script above now completes in less
than one second (down from over 20 seconds). Even as an unoptimized
proof-of-concept, that is, IMO, adequate performance. [Additional gains could
be realized by bypassing deserialization altogether and using only SQL queries.]
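
A minimal sketch of what the SQL-only route might look like for the
show_test.py lookup, assuming a hypothetical "testcases" table keyed by
commit and test name (the table, columns, index, and the find_tests helper
are all invented for illustration and are not part of Bunsen):

```python
import sqlite3

def open_db():
    """Create an in-memory database with a hypothetical testcases table."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # rows can be converted to dicts
    conn.execute(
        "CREATE TABLE testcases "
        "(commit_id TEXT, name TEXT, subtest TEXT, outcome TEXT)"
    )
    # An index on the lookup columns avoids scanning every stored result.
    conn.execute("CREATE INDEX idx_commit_name ON testcases (commit_id, name)")
    return conn

def find_tests(conn, commit_id, test_name):
    """Return matching results without loading the whole test run."""
    cur = conn.execute(
        "SELECT name, subtest, outcome FROM testcases "
        "WHERE commit_id = ? AND name = ?",
        (commit_id, test_name),
    )
    return [dict(row) for row in cur]

conn = open_db()
conn.executemany(
    "INSERT INTO testcases VALUES (?, ?, ?, ?)",
    [
        ("abc123", "gdb.base/break.exp", "break main", "PASS"),
        ("abc123", "gdb.base/step.exp", "step into", "FAIL"),
        ("def456", "gdb.base/break.exp", "break main", "PASS"),
    ],
)
found = find_tests(conn, "abc123", "gdb.base/break.exp")
print("found {} matching tests".format(len(found)))
```

Only the matching rows are read, so the cost no longer depends on building
an object for every test in the run.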
Second: Size. Bunsen stores its data in a git repository located in
bunsen_upload, IIUC. I measured the size of this directory under each of the
two implementations. I took two measurements (with consolidate_pass=False to
get all test results stored):
1) With two identical gdb.{log,sum} files (different commits). In this case,
   JSON-based Bunsen used 5.2MB of storage. The database-based approach
   used 5.0MB. IMO there is no meaningful difference.
2) With eight additional, different gdb.{log,sum} files imported on top of
   #1. JSON-based Bunsen used 46340kB of storage. The database approach
   used 49400kB. That's a 6.6% increase across ten runs.
I consider this 6.6% storage trade-off acceptable for the massive increase
in speed (and the ability to avoid custom serialization altogether).
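
For reference, the 6.6% figure follows directly from the two sizes reported
in measurement 2:

```python
json_kb = 46340   # JSON-based Bunsen, ten runs
db_kb = 49400     # database-based approach, same ten runs

# Relative growth of the database approach over the JSON baseline.
overhead_pct = (db_kb - json_kb) / json_kb * 100
print("{:.1f}%".format(overhead_pct))
```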
Is this an approach that seems viable as a supplement (replacement?)
to JSON output? Is this approach something worth pursuing?
Keith