From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dominic.letz@berlin.de>
Received: from cosmopolitan.snafu.de (cosmopolitan.snafu.de
 [IPv6:2001:1560:3:255::151])
 by sourceware.org (Postfix) with ESMTPS id 15CD43835414
 for <libc-help@sourceware.org>; Thu,  4 Mar 2021 11:20:06 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 15CD43835414
Authentication-Results: sourceware.org;
 dmarc=none (p=none dis=none) header.from=berlin.de
Authentication-Results: sourceware.org;
 spf=pass smtp.mailfrom=dominic.letz@berlin.de
X-Trace: 507c646f6d696e69632e6c65747a406265726c696e2e64657c3138352e3232332e
 3134372e3231307c316c486d31552d303030464c4a2d36677c3136313438353638
 3034
Received: from cosmopolitan.snafu.de ([10.151.10.49] helo=localhost)
 by cosmopolitan.snafu.de with esmtpsa (Exim 4.94) 
 id 1lHm1U-000FLJ-6g
 for libc-help@sourceware.org; Thu, 04 Mar 2021 12:20:04 +0100
To: libc-help@sourceware.org
From: Dominic Letz <dominic.letz@berlin.de>
Subject: Remove string length check from sscanf()
Message-ID: <6d2fa0f2-457d-db67-fde1-fd7b373a0f58@berlin.de>
Date: Thu, 4 Mar 2021 12:20:03 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101
 Thunderbird/68.10.0
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Content-Language: en-US
X-SA-Exim-Connect-IP: 185.223.147.210
X-SA-Exim-Mail-From: dominic.letz@berlin.de
X-SA-Exim-Scanned: No (on cosmopolitan.snafu.de);
 SAEximRunCond expanded to false
X-Spam-Status: No, score=0.7 required=5.0 tests=BAYES_00, FREEMAIL_FROM,
 KAM_DMARC_STATUS, RCVD_IN_DNSWL_NONE, SPF_HELO_PASS, TXREP,
 T_SPF_PERMERROR autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: libc-help@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-help mailing list <libc-help.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-help>,
 <mailto:libc-help-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-help/>
List-Help: <mailto:libc-help-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-help>,
 <mailto:libc-help-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Thu, 04 Mar 2021 11:20:07 -0000

You may have seen this article on the poor performance of sscanf() when 
it is called repeatedly to scan a big file: 
https://news.ycombinator.com/item?id=26296339

It boils down to sscanf(JSON, "%f", ...) needing to call strlen (or, in 
glibc's case, __rawmemchr) on a 10MB string on every call. The article is 
entertaining and all, but I was wondering why not fix the root cause in 
sscanf() itself instead of relying on workarounds, especially as this is 
not limited to GTA but affects other apps too: 
https://www.mattkeeter.com/blog/2021-03-01-happen/
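
To make the pattern concrete, here is a minimal sketch of the slow loop 
next to the usual strtof() workaround. The function names are made up 
for illustration and are not taken from either article; the comment on 
the sscanf() cost assumes the glibc behaviour described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Slow pattern: with an implementation that first looks for the end of
   its input, each sscanf() call walks the whole remaining string, so
   parsing n numbers costs roughly O(n * length) in total. */
size_t count_floats_sscanf(const char *text, float *sum)
{
    assert(text != NULL && sum != NULL);
    size_t count = 0;
    float value;
    int used = 0;

    *sum = 0.0f;
    /* %n stores how many characters this call consumed. */
    while (sscanf(text, "%f%n", &value, &used) == 1) {
        *sum += value;
        text += used;
        count++;
    }
    return count;
}

/* Workaround: strtof() only reads as far as the number it converts. */
size_t count_floats_strtof(const char *text, float *sum)
{
    assert(text != NULL && sum != NULL);
    size_t count = 0;
    char *end;

    *sum = 0.0f;
    for (;;) {
        float value = strtof(text, &end);
        if (end == text)        /* nothing converted: stop */
            break;
        *sum += value;
        text = end;
        count++;
    }
    return count;
}
```

Both functions return the same count and sum for whitespace-separated 
input; only the amount of work per call differs.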

So why check the length at all?

I thought one could go char by char until the pattern (e.g. "%f") is 
fulfilled and then return. That would solve the issue at its root -- 
and who knows how many programs would suddenly get faster...

So, looking at the glibc sources, I found the culprit in an abstraction. 
It looks like a FILE*-like string-buffer object is created around the 
C string: 
https://sourceware.org/git/?p=glibc.git;a=blob;f=stdio-common/sscanf.c;h=75daedd2aebe392e7f0d9e5d8816c1524b28f6ec;hb=HEAD#l34

And that abstraction calls __rawmemchr during initialization to find 
its bounds here: 
https://sourceware.org/git/?p=glibc.git;a=blob;f=libio/strops.c;h=6208518cdd861b770787c3f64cbfa3b6b9fb9afd;hb=HEAD#l41

I'm reading this source for the first time, but I guess that, to avoid 
breaking anything, I could introduce a new type of FILE* string buffer, 
let's say in 'strops_incr.c', that works incrementally, reading one char 
at a time from the underlying string and skipping the strlen()...
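
Roughly what I mean by "incremental", as a standalone sketch. This is 
not glibc's libio code: str_cursor and scan_long are hypothetical names, 
and overflow handling is left out to keep it short.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* A cursor over a NUL-terminated string. The length is never computed;
   the terminator is only noticed when the cursor reaches it. */
struct str_cursor {
    const char *pos;
};

/* Look at the next character, or EOF at the terminating NUL. */
static int cursor_peek(const struct str_cursor *c)
{
    unsigned char ch = (unsigned char)*c->pos;
    return ch == '\0' ? EOF : ch;
}

/* Step forward; only valid after cursor_peek() returned a character. */
static void cursor_advance(struct str_cursor *c)
{
    c->pos++;
}

/* Scan one decimal integer, touching only the characters it needs.
   Returns 1 on success and 0 if no digits were found. */
int scan_long(struct str_cursor *c, long *out)
{
    assert(c != NULL && c->pos != NULL && out != NULL);
    int ch;
    int negative = 0;
    long value = 0;

    while ((ch = cursor_peek(c)) != EOF && isspace(ch))
        cursor_advance(c);

    if (ch == '+' || ch == '-') {
        negative = (ch == '-');
        cursor_advance(c);
    }

    if ((ch = cursor_peek(c)) == EOF || !isdigit(ch))
        return 0;

    while ((ch = cursor_peek(c)) != EOF && isdigit(ch)) {
        value = value * 10 + (ch - '0');   /* no overflow check */
        cursor_advance(c);
    }

    *out = negative ? -value : value;
    return 1;
}
```

Each call stops right after the last digit, so the cost depends on the 
length of the number and not on the length of the whole string.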

So that brings me to my question: is the incremental approach something 
that would get accepted if I prepare a patch? And how / where would I 
submit it? (Sorry, GitHub generation speaking.)

Happy to hear any feedback!
Best
Dominic