public inbox for libc-help@sourceware.org
 help / color / mirror / Atom feed
* Remove string length check from sscanf()
@ 2021-03-04 11:20 Dominic Letz
  2021-03-04 12:01 ` Florian Weimer
  0 siblings, 1 reply; 2+ messages in thread
From: Dominic Letz @ 2021-03-04 11:20 UTC (permalink / raw)
  To: libc-help

You might or might not have seen this article on the bad performance of 
sscanf() when used repeatedly in scanning a big file. 
https://news.ycombinator.com/item?id=26296339

It boils down to sscanf("%f", JSON) needing to call strlen (or in glibc 
case __rawmemchr) on a 10MB string. Now the article is entertaining and 
all but I was wondering why not fix the root cause in sscanf() itself 
instead of having workarounds. Especially as this is not limited to GTA 
but also other apps: https://www.mattkeeter.com/blog/2021-03-01-happen/

So why do check the size at all?

I thought one could go char by char until the pattern (e.g. "%f") is 
fulfilled and then return. That would solve the issue on it's root -- 
and who knows how many programs would suddenly get faster...

So looking at glibc sources I've found the culprit in abstraction. Looks 
like a FILE* like string-buffer object is created around the c-string: 
https://sourceware.org/git/?p=glibc.git;a=blob;f=stdio-common/sscanf.c;h=75daedd2aebe392e7f0d9e5d8816c1524b28f6ec;hb=HEAD#l34

And that abstraction does call __rawmemchr when initializing to know 
it's bounds here: 
https://sourceware.org/git/?p=glibc.git;a=blob;f=libio/strops.c;h=6208518cdd861b770787c3f64cbfa3b6b9fb9afd;hb=HEAD#l41

I'm reading this source the first time, but I guess to not break 
anything I could introduce a new type of FILE* string-buffer let's say 
in 'strops_incr.c' that is working incrementally reading one char at the 
time from the underlying string skipping the strlen()...

So that brings me to my question. Is the incremental approach something 
that would get accepted when I prepare a patch? And how / where to 
submit that? (Sorry GitHub generation speaking).

Happy to hear any feedback!
Best
Dominic

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: Remove string length check from sscanf()
  2021-03-04 11:20 Remove string length check from sscanf() Dominic Letz
@ 2021-03-04 12:01 ` Florian Weimer
  0 siblings, 0 replies; 2+ messages in thread
From: Florian Weimer @ 2021-03-04 12:01 UTC (permalink / raw)
  To: Dominic Letz; +Cc: libc-help

* Dominic Letz:

> I'm reading this source the first time, but I guess to not break
> anything I could introduce a new type of FILE* string-buffer let's say 
> in 'strops_incr.c' that is working incrementally reading one char at
> the time from the underlying string skipping the strlen()...

And maybe special-case the char-by-char check for small buffer sizes
based on the results of a strnlen call.  I agree that this should work.

Even today, applications can use fmemopen and fscanf to avoid the
overhead.

> So that brings me to my question. Is the incremental approach
> something that would get accepted when I prepare a patch? And how /
> where to submit that? (Sorry GitHub generation speaking).

You could send a patch to libc-alpha (git send-email should work).  Note
that the change will exceed the size threshold for FSF copyright
assignment, so you'll have to follow that process as well:

  Contribution Checklist: FSF copyright assignment
  <https://sourceware.org/glibc/wiki/Contribution%20checklist#FSF_copyright_assignment>

Thanks,
Florian


^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2021-03-04 12:01 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-04 11:20 Remove string length check from sscanf() Dominic Letz
2021-03-04 12:01 ` Florian Weimer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).