From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 25402 invoked by alias); 21 Apr 2007 21:36:24 -0000 Received: (qmail 25345 invoked by uid 48); 21 Apr 2007 21:36:14 -0000 Date: Sat, 21 Apr 2007 21:36:00 -0000 From: "aurelien at aurel32 dot net" To: glibc-bugs@sources.redhat.com Message-ID: <20070421223613.4403.aurelien@aurel32.net> Reply-To: sourceware-bugzilla@sourceware.org Subject: [Bug libc/4403] New: strfry() gives skewed distributions X-Bugzilla-Reason: CC Mailing-List: contact glibc-bugs-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Subscribe: List-Post: List-Help: , Sender: glibc-bugs-owner@sourceware.org X-SW-Source: 2007-04/txt/msg00081.txt.bz2 strfry() tries to shuffle its string using random swaps, but it uses the wrong strategy, and thus not all permutations are equally likely. The code doing the shuffling itself looks like this: len = strlen (string); for (i = 0; i < len; ++i) { int32_t j; char c; __random_r (&rdata, &j); j %= len; c = string[i]; string[i] = string[j]; string[j] = c; } In other words, for the string 'abc' j will always be between 0 and 2 (inclusive), giving the following possibilities (all equally likely): j0 j1 j2 result 0 0 0 cab 0 0 1 bca 0 0 2 bac 0 1 0 cba 0 1 1 acb 0 1 2 abc 0 2 0 bca 0 2 1 abc 0 2 2 acb 1 0 0 cba 1 0 1 acb 1 0 2 abc 1 1 0 cab 1 1 1 bca 1 1 2 bac 1 2 0 acb 1 2 1 bac 1 2 2 bca 2 0 0 acb 2 0 1 bac 2 0 2 bca 2 1 0 abc 2 1 1 cab 2 1 2 cba 2 2 0 bac 2 2 1 cba 2 2 2 cab Sorting and counting gives us the following distribution: 4 abc 5 acb 5 bac 5 bca 4 cab 4 cba In other words, this is clearly skewed; some strings will appear 25% more often than others. Steinar H. Gunderson proposed the following patch: --- string/strfry.c 2007-04-21 23:12:47.000000000 +0200 +++ string/strfry.c 2007-04-21 23:22:46.000000000 +0200 @@ -38,17 +38,17 @@ } len = strlen (string); - for (i = 0; i < len; ++i) + for (i = 0; i < len - 1; ++i) { int32_t j; char c; __random_r (&rdata, &j); - j %= len; + j %= (len - i); c = string[i]; - string[i] = string[j]; - string[j] = c; + string[i] = string[j + i]; + string[j + i] = c; } return string; It turns strfry() into a proper Fisher-Yates shuffle. This gives exactly n! paths for a string with N characters, and since there are n! possible permutations, this means a one-to-one mapping, and all possibilities are equally likely. -- Summary: strfry() gives skewed distributions Product: glibc Version: unspecified Status: NEW Severity: normal Priority: P2 Component: libc AssignedTo: drepper at redhat dot com ReportedBy: aurelien at aurel32 dot net CC: glibc-bugs at sources dot redhat dot com GCC build triplet: x86_64-unknown-linux-gnu GCC host triplet: x86_64-unknown-linux-gnu GCC target triplet: x86_64-unknown-linux-gnu http://sourceware.org/bugzilla/show_bug.cgi?id=4403 ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.