From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from dragonfly.birch.relay.mailchannels.net (dragonfly.birch.relay.mailchannels.net [23.83.209.51]) by sourceware.org (Postfix) with ESMTPS id DF4B13857C7F for ; Sat, 12 Jun 2021 10:56:14 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org DF4B13857C7F Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=gotplt.org Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gotplt.org X-Sender-Id: dreamhost|x-authsender|siddhesh@gotplt.org Received: from relay.mailchannels.net (localhost [127.0.0.1]) by relay.mailchannels.net (Postfix) with ESMTP id 9CBBA680F05; Sat, 12 Jun 2021 10:56:13 +0000 (UTC) Received: from pdx1-sub0-mail-a12.g.dreamhost.com (100-96-11-18.trex.outbound.svc.cluster.local [100.96.11.18]) (Authenticated sender: dreamhost) by relay.mailchannels.net (Postfix) with ESMTPA id 4C3CF680F7E; Sat, 12 Jun 2021 10:56:12 +0000 (UTC) X-Sender-Id: dreamhost|x-authsender|siddhesh@gotplt.org Received: from pdx1-sub0-mail-a12.g.dreamhost.com (pop.dreamhost.com [64.90.62.162]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384) by 100.96.11.18 (trex/6.3.1); Sat, 12 Jun 2021 10:56:13 +0000 X-MC-Relay: Neutral X-MailChannels-SenderId: dreamhost|x-authsender|siddhesh@gotplt.org X-MailChannels-Auth-Id: dreamhost X-Fearful-Snatch: 4adf925129818b4c_1623495373432_2009623255 X-MC-Loop-Signature: 1623495373432:98704948 X-MC-Ingress-Time: 1623495373432 Received: from pdx1-sub0-mail-a12.g.dreamhost.com (localhost [127.0.0.1]) by pdx1-sub0-mail-a12.g.dreamhost.com (Postfix) with ESMTP id 0D6B287E70; Sat, 12 Jun 2021 03:56:12 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gotplt.org; h=subject:to :cc:references:from:message-id:date:mime-version:in-reply-to :content-type:content-transfer-encoding; s=gotplt.org; bh=qQ4zIv oyRaaEO9rmpQw+vlHn9OQ=; b=ZKig81Fk/KygOQgdZpe43LUqFq1dLnYxhelNfZ T/wcYF00lEze28SVU/V16B2qdeXIAdXI0RD3s661eDOUJQU9fwLMcWNlqhx/0yWX EW31pWmgvPom8AK5gE6w4NkDzfBlYfEfVhI0HqTCMKDbx6pQe8b3DD9ry+TyC3WH bmjHQ= Received: from [192.168.1.126] (unknown [1.186.101.110]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) (Authenticated sender: siddhesh@gotplt.org) by pdx1-sub0-mail-a12.g.dreamhost.com (Postfix) with ESMTPSA id 5259987E6F; Sat, 12 Jun 2021 03:56:08 -0700 (PDT) Subject: Re: [Final testing with benchmark tests] Fastest String Search Algorithm. To: Paul Zimmermann , Amit Choudhary Cc: libc-alpha@sourceware.org References: <39bd7d71-eebf-6b7c-b6e7-6816037a4492@cs.ucla.edu> X-DH-BACKEND: pdx1-sub0-mail-a12 From: Siddhesh Poyarekar Message-ID: Date: Sat, 12 Jun 2021 16:26:03 +0530 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.10.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-3029.9 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, NICE_REPLY_A, RCVD_IN_BARRACUDACENTRAL, RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=no autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 12 Jun 2021 10:56:16 -0000 On 6/12/21 1:31 PM, Paul Zimmermann wrote: > Dear Amit, > >> I ran the benchmark tests for my algorithm and for strstr(). >> >> Total number of tests = 332. >> >> My algorithm is faster 182 times. >> >> strstr() is faster 150 times. > > and your algorithm can be up to 3531 times slower than strstr(): > > s: 3.75006e+08 106175 > > The largest strstr() time is 1.5e6, whereas the largest time of your > algorithm is 9.8e8, i.e., 656 times larger. > > Can you improve your algorithm so that the worst case time is similar to that > of strstr()? This is going around in circles so I am going to summarize the several problems (and not just the worst case time) here that need to be addressed before this proposal can even be considered with any seriousness. You may consider this my objection to the proposal conditional these points being resolved. Starting with the technical issues: 1. In the benchmark results shared, the geomean performance of the proposed algorithm is worse by 22.62%. This is by no means (no pun intended) a better result. In general, worse cases are much worse than when they're better. 2. The worst case performances are thousands of times slower as PaulZ pointed out. But wait, it's worse because it's probably not an issue of scaling with length... 3. ... The benchmark results shared don't actually indicate which length pairs result in those worst case performances but if I had to assume that it's related to glibc benchmark inputs, (there's one input less than the glibc benchmarks) it appears that the performance starts going downhill from 224 byte haystacks, especially with misaligned strings. 4. Given that the benchmark program is outside of the benchtests infrastructure (because this is clearly not the benchtests program), there is no information about how the benchmark is run and what conditions affect its output. 5. The baseline for comparison is not clear. Is it the default strstr() in string/strstr.c? Is it the system strstr()? In the latter case, it could be anything from string/strstr.c to the various microarchitecture-specific asm ifuncs. 6. Amit has made handwavy claims about why a certain set of inputs matter over others without providing any references to prior art or reviewable experimental data that can back his claims. The proposal needs clearer commentary here with proper references. 7. Refusal to use the glibc-benchtests - either in its current form or by adding more inputs that the community agrees as relevant - is IMO a blocker to getting any changes accepted to strstr(). A better strstr() must show better results on the benchmarks, without which future maintenance of the function will be challenging. If Amit doesn't write those, someone else needs to volunteer to do it for him. Now on to the non-technical issues: 1. First and foremost, Amit has been insinuating racial discrimination or crying victim when someone makes suggestions for verification or points out flaws in the algorithm. He was warned once[1] but has refused to treat community members with respect. Let this not be the low we set for acceptable behaviour in the community. 2. There is a plain text algorithm and several C code snippets, none of which have proper copyright notices. There is no indication even of whether Amit owns the copyright for the shared content (as opposed to his employer since many employers claim ownership for everything the employee creates) or whether he has given us permission to use the content. 3. On a related point, it is unclear if Amit or his employer have signed a copyright assignment with the FSF. I personally would like to see glibc move away from the assignment like gcc did recently, but that's an orthogonal point at the moment. 4. Amit has on various occasions expressed disinterest in contributing to glibc and put conditions on what he will or will not do even before posting a single line patch. I have no confidence that if we do end up accepting his contribution at some point, he will make himself available to maintain the code he contributed, leaving the community to deal with the technical debt. Siddhesh [1] https://sourceware.org/pipermail/libc-alpha/2021-June/127380.html