public inbox for glibc-bugs@sourceware.org
help / color / mirror / Atom feed
From: "shabbyx at gmail dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug stdio/17577] New: sscanf extremely slow on large strings
Date: Tue, 11 Nov 2014 13:49:00 -0000	[thread overview]
Message-ID: <bug-17577-131@http.sourceware.org/bugzilla/> (raw)

https://sourceware.org/bugzilla/show_bug.cgi?id=17577

            Bug ID: 17577
           Summary: sscanf extremely slow on large strings
           Product: glibc
           Version: 2.19
            Status: NEW
          Severity: normal
          Priority: P2
         Component: stdio
          Assignee: unassigned at sourceware dot org
          Reporter: shabbyx at gmail dot com

The bug is that `sscanf` (`vsscanf` actually) is very slow on large strings. It
_seems_ like `vsscanf` is first trying to find the end of the string (with
`rawmemchr`), which is expensive given that most of the string is not going to
be read.

Here's a test code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <errno.h>

#define N 50000

static int _debug_helper(const char *src, int *a, int *n)
{
#if 1
 return sscanf(src, "%d%n", a, n);
#else
 char *end;

 errno = 0;
 *a = strtol(src, &end, 10);
 *n = end - src;

 return errno == 0 && *n > 0;
#endif
}

int main()
{
 int i;
 int a;
 int n;
 int so_far = 0;

 char *big_string = malloc(N * 4 + 1);
 for (i = 0; i < N; ++i)
  strcpy(big_string + i * 4, "123 ");
 big_string[N * 4] = '\0';

 while (1)
 {
  if (_debug_helper(big_string + so_far, &a, &n) != 1)
   break;
  so_far += n;
 }

 return 0;
}

Compile with: gcc -Wall -g -O0 main.c

Running this code with `N = 50000` and using `sscanf`, I get:

$ time ./a.out

real 0m1.602s
user 0m1.596s
sys 0m0.000s

Running it with `N = 50000` and using the substitute code, I get:

$ time ./a.out

real 0m0.002s
user 0m0.000s
sys 0m0.000s

Which is considrably smaller. Note that this shows that the part with `malloc`
and initialization take very small time. Running callgrind shows that almost
all of the time when using `sscanf` is spent in `rawmemchr`. Indeed, using gdb
and randomly hitting CTRL+C, you always end up with a stack trace like this:

#0 __rawmemchr_ia32 () at ../sysdeps/i386/i686/multiarch/../../rawmemchr.S:167
#1 0xb7e78a06 in _IO_str_init_static_internal () at strops.c:44
#2 0xb7e5c857 in __GI___isoc99_vsscanf () at isoc99_vsscanf.c:41
#3 0xb7e5c7cf in __isoc99_sscanf () at isoc99_sscanf.c:31
#4 0x08048494 in _debug_helper () at main.c:11
#5 0x08048517 in main () at main.c:41

This means that `rawmemchr` is slowing down `sscanf` by an unnecessary degree.
To further prove this point (and to confirm my guess that `rawmemchr` is
reading the whole string), here are a couple more tests:

With `N = 25000` and using `sscanf`:

$ time ./a.out

real 0m0.407s
user 0m0.404s
sys 0m0.000s

With `N = 12500` and using `sscanf`:

$ time ./a.out

real 0m0.106s
user 0m0.104s
sys 0m0.000s

This clearly shows an `O(N^2)` behavior. The main loop of the program is
`O(N)`, which means `sscanf` is running at `O(N)`. For large `N`, this is
significant. On the other hand, the actual behavior of `sscanf` should be to
read from the string according to the format string and no more, which in this
case (using `%d` and "123" as values) is of constant time.

-- 
You are receiving this mail because:
You are on the CC list for the bug.


             reply	other threads:[~2014-11-11 13:49 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-11-11 13:49 shabbyx at gmail dot com [this message]
2014-11-11 13:50 ` [Bug stdio/17577] " shabbyx at gmail dot com
2021-03-01 12:56 ` marat at slonopotamus dot org
2021-03-02 12:46 ` amonakov at gmail dot com
2021-03-03  2:22 ` benfrantzdale at gmail dot com
2021-03-03  2:26 ` benfrantzdale at gmail dot com
2021-03-03  3:35 ` fweimer at redhat dot com
2021-03-03 19:31 ` bugdal at aerifal dot cx
2021-03-15 20:38 ` dkk089 at gmail dot com
2021-03-15 21:56 ` amonakov at gmail dot com

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bug-17577-131@http.sourceware.org/bugzilla/ \
    --to=sourceware-bugzilla@sourceware.org \
    --cc=glibc-bugs@sourceware.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).