From mboxrd@z Thu Jan 1 00:00:00 1970
To: libc-help@sourceware.org
From: Dominic Letz
Subject: Remove string length check from sscanf()
Message-ID: <6d2fa0f2-457d-db67-fde1-fd7b373a0f58@berlin.de>
Date: Thu, 4 Mar 2021 12:20:03 +0100
List-Id: Libc-help mailing list

You might or might not have seen this article on the bad performance of sscanf()
when used repeatedly in scanning a big file:
https://news.ycombinator.com/item?id=26296339

It boils down to sscanf(JSON, "%f", ...) needing to call strlen (or, in glibc's case, __rawmemchr) on a 10MB string on every call.

Now the article is entertaining and all, but I was wondering why not fix the root cause in sscanf() itself instead of relying on workarounds, especially as this is not limited to GTA but also affects other apps:
https://www.mattkeeter.com/blog/2021-03-01-happen/

So why check the size at all? I thought one could go char by char until the pattern (e.g. "%f") is fulfilled and then return. That would solve the issue at its root, and who knows how many programs would suddenly get faster...

Looking at the glibc sources, I found the culprit in an abstraction. It looks like a FILE*-like string-buffer object is created around the C string:
https://sourceware.org/git/?p=glibc.git;a=blob;f=stdio-common/sscanf.c;h=75daedd2aebe392e7f0d9e5d8816c1524b28f6ec;hb=HEAD#l34

And that abstraction calls __rawmemchr when initializing, to know its bounds, here:
https://sourceware.org/git/?p=glibc.git;a=blob;f=libio/strops.c;h=6208518cdd861b770787c3f64cbfa3b6b9fb9afd;hb=HEAD#l41

I'm reading this source for the first time, but I guess that, to not break anything, I could introduce a new type of FILE* string buffer, let's say in 'strops_incr.c', that works incrementally, reading one char at a time from the underlying string and skipping the strlen()...

So that brings me to my question: is the incremental approach something that would get accepted if I prepare a patch? And how / where would I submit that? (Sorry, GitHub generation speaking.)

Happy to hear any feedback!

Best
Dominic
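
P.S. For reference, here is a minimal sketch of the pattern in question next to the usual workaround. The function names (parse_with_sscanf, parse_with_strtof) and the comma-separated input format are made up for illustration; they are not taken from the articles or from glibc. The first function calls sscanf() once per number on the remaining tail of the buffer, which is where the repeated end-of-string search would hurt on a large input. The second uses strtof(), which reports where it stopped parsing and so should not need the total length.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: one sscanf() call per number, each on the remaining
   tail of the buffer. If the implementation first locates the end of the
   string (as described above for glibc), every call costs time proportional
   to the remaining length, so the whole loop is quadratic. */
size_t parse_with_sscanf(const char *text, float *out, size_t max)
{
    size_t count = 0;
    int used = 0;

    while (count < max && sscanf(text, "%f%n", &out[count], &used) == 1) {
        text += used;          /* skip the characters sscanf consumed */
        count++;
        if (*text == ',')      /* skip a single separator, if present */
            text++;
    }
    return count;
}

/* Hypothetical helper: the common workaround. strtof() stops at the first
   character that cannot be part of the number and returns that position. */
size_t parse_with_strtof(const char *text, float *out, size_t max)
{
    size_t count = 0;

    while (count < max) {
        char *end;
        float value = strtof(text, &end);
        if (end == text)       /* no conversion happened: stop */
            break;
        out[count++] = value;
        text = (*end == ',') ? end + 1 : end;
    }
    return count;
}
```

Both functions should return the same values for the same input; only the cost per call differs. An incremental string buffer inside sscanf() would, if it works as I imagine, make the first loop behave like the second without any change to the calling code.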