From: Dominic Letz <dominic.letz@berlin.de>
To: libc-help@sourceware.org
Subject: Remove string length check from sscanf()
Date: Thu, 4 Mar 2021 12:20:03 +0100
Message-ID: <6d2fa0f2-457d-db67-fde1-fd7b373a0f58@berlin.de>
You might or might not have seen this article on the bad performance of
sscanf() when used repeatedly in scanning a big file.
https://news.ycombinator.com/item?id=26296339
It boils down to sscanf(JSON, "%f", ...) needing to call strlen (or, in
glibc's case, __rawmemchr) on a 10MB string. Now, the article is
entertaining and all, but I was wondering why not fix the root cause in
sscanf() itself instead of having workarounds, especially as this
affects not only GTA but other apps too:
https://www.mattkeeter.com/blog/2021-03-01-happen/
So why check the length of the string at all?
I thought one could go char by char until the pattern (e.g. "%f") is
fulfilled and then return. That would solve the issue at its root --
and who knows how many programs would suddenly get faster...
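To make the char-by-char idea concrete, here is a minimal sketch (not
glibc code; the function names are made up for illustration). The first
function is the pattern from the article: on glibc, every sscanf() call
first finds the end of the remaining string. The second uses strtof(),
which stops at the first character that cannot belong to the number.

```c
#include <stdio.h>
#include <stdlib.h>

/* Sum whitespace-separated floats with repeated sscanf() calls.
   %n stores how many characters were consumed so we can advance.
   On glibc each call first locates the end of the remaining string,
   so the total work grows quadratically with the input size. */
static double sum_with_sscanf(const char *s)
{
    double sum = 0.0;
    float f;
    int used;

    while (sscanf(s, "%f%n", &f, &used) == 1) {
        sum += f;
        s += used;
    }
    return sum;
}

/* Same result with strtof(), which reads only as far as the number
   extends and never needs the length of the whole string. */
static double sum_with_strtof(const char *s)
{
    double sum = 0.0;

    for (;;) {
        char *end;
        float f = strtof(s, &end);

        if (end == s)   /* no conversion was performed */
            break;
        sum += f;
        s = end;
    }
    return sum;
}
```

Both functions return 11.5 for the input "1.5 2.25 3.0 4.75"; the
difference only shows up in running time once the input gets large.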
So, looking at the glibc sources, I've found the culprit in an
abstraction. It looks like a FILE*-like string-buffer object is created
around the C string:
https://sourceware.org/git/?p=glibc.git;a=blob;f=stdio-common/sscanf.c;h=75daedd2aebe392e7f0d9e5d8816c1524b28f6ec;hb=HEAD#l34
And that abstraction calls __rawmemchr when initializing, to know
its bounds, here:
https://sourceware.org/git/?p=glibc.git;a=blob;f=libio/strops.c;h=6208518cdd861b770787c3f64cbfa3b6b9fb9afd;hb=HEAD#l41
I'm reading this source for the first time, but I guess that, to not
break anything, I could introduce a new type of FILE* string buffer,
let's say in 'strops_incr.c', that works incrementally, reading one char
at a time from the underlying string and skipping the strlen()...
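As a rough user-space illustration of what I mean (not a patch, and not
how 'strops_incr.c' would actually look), the GNU extension
fopencookie() lets one build a FILE* whose read callback walks the
string on demand. The names str_cursor, cursor_read and
scan_float_incremental are hypothetical.

```c
#define _GNU_SOURCE   /* fopencookie() is a GNU extension */
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical cursor over a NUL-terminated string. */
struct str_cursor {
    const char *pos;
};

/* Read callback: hands out up to `size` bytes, stopping at the NUL.
   It never looks past the bytes it copies, so no strlen() is needed. */
static ssize_t cursor_read(void *cookie, char *buf, size_t size)
{
    struct str_cursor *c = cookie;
    size_t n = 0;

    while (n < size && c->pos[n] != '\0') {
        buf[n] = c->pos[n];
        n++;
    }
    c->pos += n;
    return (ssize_t)n;   /* 0 signals end of input */
}

/* Parse one float from `s` through the incremental stream.
   Returns what fscanf() returns: 1 on success, 0 or EOF otherwise. */
static int scan_float_incremental(const char *s, float *out)
{
    struct str_cursor cur = { s };
    cookie_io_functions_t io = { .read = cursor_read };
    FILE *fp = fopencookie(&cur, "r", io);
    int rc;

    if (fp == NULL)
        return EOF;
    rc = fscanf(fp, "%f", out);
    fclose(fp);
    return rc;
}
```

stdio asks the callback for up to one buffer's worth of bytes at a time,
so the work per call is bounded by the buffer size rather than by the
length of the whole string. Creating a stream per call has its own
overhead, so this only shows the idea, not a faster sscanf().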
So that brings me to my question: is the incremental approach something
that would be accepted if I prepared a patch? And how / where should I
submit it? (Sorry, GitHub generation speaking.)
Happy to hear any feedback!
Best
Dominic