From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 107892 invoked by alias); 13 May 2015 15:11:00 -0000 Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: libc-alpha-owner@sourceware.org Received: (qmail 107877 invoked by uid 89); 13 May 2015 15:10:59 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-0.5 required=5.0 tests=AWL,BAYES_05,RCVD_IN_DNSWL_NONE,SPF_PASS autolearn=ham version=3.3.2 X-HELO: sasl.smtp.pobox.com Message-ID: <555369A3.7010108@pobox.com> Date: Wed, 13 May 2015 15:11:00 -0000 From: Daniel Santos User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0 MIME-Version: 1.0 To: =?UTF-8?B?T25kxZllaiBCw61sa2E=?= CC: libc-alpha@sourceware.org, carlos@redhat.com Subject: Re: Propose C Metaprogramming-based qsort as GNU extension References: <54E25DDB.8060900@pobox.com> <20150401152657.GA31359@domone> In-Reply-To: <20150401152657.GA31359@domone> Content-Type: text/plain; charset=UTF-8; format=flowed X-Pobox-Relay-ID: 4050A7EC-F982-11E4-9032-E49F9F42C9D4-06139138!pb-smtp1.pobox.com Content-Transfer-Encoding: quoted-printable X-SW-Source: 2015-05/txt/msg00198.txt.bz2 On 04/01/2015 10:26 AM, Ond=C5=99ej B=C3=ADlka wrote: > On Mon, Feb 16, 2015 at 03:15:07PM -0600, Daniel Santos wrote: >> As part of my work on a paper on C Metaprogramming with gcc, I have >> implemented a sqort/insertion sort algorithm based off of the legacy >> glibc implementation (when msort is not used) who's performance far >> exceeds the current implementation. I have opened an enhancement bug >> here https://sourceware.org/bugzilla/show_bug.cgi?id=3D17941 with >> details about this, and was told I need to talk to you guys. >> >> In short, a basic benchmark shows an average performance increase >> 83% (16k of data, using element sizes from 1 to 8192 bytes). These >> are techniques that I appear to have invented in 2012; I haven't run >> into anybody else doing it just yet, but I hope it will catch on. >> Currently I only know this to work on GCC, but it may be possible >> that other compilers have enough support (in their extensions) to >> make it possible. >> >> The basic theory of operation is that by exploiting >> -findirect-inline, and attributes always_inline & flatten (in some >> cases), we force the compiler to inline a larger function than would >> normally be considered beneficial -- this is where instantiation of >> the "pseudo-template function" (as I call it) occurs. In this way a >> more efficient sort function is generated, because we know at >> compile-time the exact size and alignment of data elements and we >> can inline the compar function. We can even decide if we'll be doing >> a direct or indirect sort at compile-time, tune stack size, etc. The >> code its self needs more cleanup to be ready to integrate into >> glibc, although the algorithm appears to now be quite efficient and >> error-free. >> >> The main question is really where it belongs. The Boost project was >> started as a place for experimental libraries, many of which ended >> up in a later C++ standard. As I see it, a similar process must take >> place with C metaprograming, as it provides a powerful tool to >> improve performance in programs, libraries and system-level code. I >> thought that glibc might be a nice place for this because there are >> already so many extensions. This is slightly different however, >> because we aren't just providing new functions, but new >> metafunctions. >> >> Either way, I would very much like this work to fall under the GNU >> umbrella if at all possible. >> >> Daniel > Something like this is on my backlog. You can start with patch with > always_inline qsort template in header surrounded by > > # if __GNUC_PREREQ (4, 8) > > You do several basic optimizations. I wanted to do more, a hard part > would make gcc recognize different comparison functions. In most cases > radixsort is faster than quick/mergesort. I sincerely apologize for my delayed response, I've had some things come=20 up that have arrested my attention for the last month or so. Please elaborate on the difficulty of getting gcc to recognize different=20 comparison functions. At current, in none of my tests have I found a=20 comparison function not being fully inlined (via pointer indirection).=20 It is true that __attribute__((always_inline)) isn't really guaranteed=20 to always inline a function via pointer (at current), but if I=20 understand correctly, that is the reason that the flatten attribute was=20 added -- exactly to ensure that that is occurs. Obviously, the=20 comparison function needs to be in the same translation unit or an=20 inefficient function call is made -- and it's not the function call so=20 much that tends to be costly, but the inability to perform optimizations=20 across the function boundary. Aside from that, I would love to hear some suggestions on other=20 optimizations. I think I need to finish up my benchmarking scripts to=20 be able to more quickly discern performance changes due to differing=20 techniques. As far as radix sort vs quick/merge sort, your suggestion=20 causes me to reconsider even the naming of the template function, being=20 that there's no naming standard that needs being followed, perhaps=20 "qmsort_template" would be more descriptive and a separate=20 "radixsort_template" could be added. One final note on the quick/merge sort, one of it's major drawbacks is=20 memory consumption and one thing I haven't addressed yet is the static=20 stack size the function allocates. This will eventually be controllable=20 via the struct qsort_def fields max_count_bits and max_stack_space,=20 which don't do anything yet. Another optimization that isn't employed is=20 that if a program knows that it will always have less than 2^16 or 2^32=20 data elements (assuming 64 bit pointers on the later), stack space can=20 be further reduced by using offsets rather than pointers, potentially=20 decreasing data cache misses at the expense (perhaps nominal) of using=20 indexed addressing. Daniel