From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-alpha-return-58842-listarch-libc-alpha=sources.redhat.com@sourceware.org>
Received: (qmail 107892 invoked by alias); 13 May 2015 15:11:00 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Received: (qmail 107877 invoked by uid 89); 13 May 2015 15:10:59 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-0.5 required=5.0 tests=AWL,BAYES_05,RCVD_IN_DNSWL_NONE,SPF_PASS autolearn=ham version=3.3.2
X-HELO: sasl.smtp.pobox.com
Message-ID: <555369A3.7010108@pobox.com>
Date: Wed, 13 May 2015 15:11:00 -0000
From: Daniel Santos <daniel.santos@pobox.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0
MIME-Version: 1.0
To: =?UTF-8?B?T25kxZllaiBCw61sa2E=?= <neleai@seznam.cz>
CC: libc-alpha@sourceware.org, carlos@redhat.com
Subject: Re: Propose C Metaprogramming-based qsort as GNU extension
References: <54E25DDB.8060900@pobox.com> <20150401152657.GA31359@domone>
In-Reply-To: <20150401152657.GA31359@domone>
Content-Type: text/plain; charset=UTF-8; format=flowed
X-Pobox-Relay-ID:
 4050A7EC-F982-11E4-9032-E49F9F42C9D4-06139138!pb-smtp1.pobox.com
Content-Transfer-Encoding: quoted-printable
X-SW-Source: 2015-05/txt/msg00198.txt.bz2

On 04/01/2015 10:26 AM, Ond=C5=99ej B=C3=ADlka wrote:
> On Mon, Feb 16, 2015 at 03:15:07PM -0600, Daniel Santos wrote:
>> As part of my work on a paper on C Metaprogramming with gcc, I have
>> implemented a sqort/insertion sort algorithm based off of the legacy
>> glibc implementation (when msort is not used) who's performance far
>> exceeds the current implementation. I have opened an enhancement bug
>> here https://sourceware.org/bugzilla/show_bug.cgi?id=3D17941 with
>> details about this, and was told I need to talk to you guys.
>>
>> In short, a basic benchmark shows an average performance increase
>> 83% (16k of data, using element sizes from 1 to 8192 bytes). These
>> are techniques that I appear to have invented in 2012; I haven't run
>> into anybody else doing it just yet, but I hope it will catch on.
>> Currently I only know this to work on GCC, but it may be possible
>> that other compilers have enough support (in their extensions) to
>> make it possible.
>>
>> The basic theory of operation is that by exploiting
>> -findirect-inline, and attributes always_inline & flatten (in some
>> cases), we force the compiler to inline a larger function than would
>> normally be considered beneficial -- this is where instantiation of
>> the "pseudo-template function" (as I call it) occurs. In this way a
>> more efficient sort function is generated, because we know at
>> compile-time the exact size and alignment of data elements and we
>> can inline the compar function. We can even decide if we'll be doing
>> a direct or indirect sort at compile-time, tune stack size, etc. The
>> code its self needs more cleanup to be ready to integrate into
>> glibc, although the algorithm appears to now be quite efficient and
>> error-free.
>>
>> The main question is really where it belongs. The Boost project was
>> started as a place for experimental libraries, many of which ended
>> up in a later C++ standard. As I see it, a similar process must take
>> place with C metaprograming, as it provides a powerful tool to
>> improve performance in programs, libraries and system-level code. I
>> thought that glibc might be a nice place for this because there are
>> already so many extensions. This is slightly different however,
>> because we aren't just providing new functions, but new
>> metafunctions.
>>
>> Either way, I would very much like this work to fall under the GNU
>> umbrella if at all possible.
>>
>> Daniel
> Something like this is on my backlog. You can start with patch with
> always_inline qsort template in header surrounded by
>
> # if __GNUC_PREREQ (4, 8)
>
> You do several basic optimizations. I wanted to do more, a hard part
> would make gcc recognize different comparison functions. In most cases
> radixsort is faster than quick/mergesort.

I sincerely apologize for my delayed response, I've had some things come=20
up that have arrested my attention for the last month or so.

Please elaborate on the difficulty of getting gcc to recognize different=20
comparison functions. At current, in none of my tests have I found a=20
comparison function not being fully inlined (via pointer indirection).=20
It is true that __attribute__((always_inline)) isn't really guaranteed=20
to always inline a function via pointer (at current), but if I=20
understand correctly, that is the reason that the flatten attribute was=20
added -- exactly to ensure that that is occurs. Obviously, the=20
comparison function needs to be in the same translation unit or an=20
inefficient function call is made -- and it's not the function call so=20
much that tends to be costly, but the inability to perform optimizations=20
across the function boundary.

Aside from that, I would love to hear some suggestions on other=20
optimizations.  I think I need to finish up my benchmarking scripts to=20
be able to more quickly discern performance changes due to differing=20
techniques. As far as radix sort vs quick/merge sort, your suggestion=20
causes me to reconsider even the naming of the template function, being=20
that there's no naming standard that needs being followed, perhaps=20
"qmsort_template" would be more descriptive and a separate=20
"radixsort_template" could be added.

One final note on the quick/merge sort, one of it's major drawbacks is=20
memory consumption and one thing I haven't addressed yet is the static=20
stack size the function allocates. This will eventually be controllable=20
via the struct qsort_def fields max_count_bits and max_stack_space,=20
which don't do anything yet. Another optimization that isn't employed is=20
that if a program knows that it will always have less than 2^16 or 2^32=20
data elements (assuming 64 bit pointers on the later), stack space can=20
be further reduced by using offsets rather than pointers, potentially=20
decreasing data cache misses at the expense (perhaps nominal) of using=20
indexed addressing.

Daniel