From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 20935 invoked by alias); 27 Jun 2013 07:13:10 -0000 Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-bugs-owner@gcc.gnu.org Received: (qmail 20852 invoked by uid 48); 27 Jun 2013 07:13:07 -0000 From: "jakub at gcc dot gnu.org" To: gcc-bugs@gcc.gnu.org Subject: [Bug middle-end/36041] Speed up builtin_popcountll Date: Thu, 27 Jun 2013 07:13:00 -0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: gcc X-Bugzilla-Component: middle-end X-Bugzilla-Version: 4.2.0 X-Bugzilla-Keywords: X-Bugzilla-Severity: enhancement X-Bugzilla-Who: jakub at gcc dot gnu.org X-Bugzilla-Status: NEW X-Bugzilla-Priority: P3 X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org X-Bugzilla-Target-Milestone: --- X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: attachments.created Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 7bit X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-SW-Source: 2013-06/txt/msg01643.txt.bz2 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36041 --- Comment #21 from Jakub Jelinek --- Created attachment 30382 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30382&action=edit gcc49-pr36041.patch Untested libgcc2.c implementation (no hw support). HW support is IMHO best dealt on the compiler side doing something, already a PLT call is fairly expensive, but it depends if __builtin_popcount* is used in a hot loop or in cold code (in the latter case it really doesn't matter). I've looked at code generated for: int foo (unsigned long long i) { i = i - ((i >> 1) & 0x5555555555555555); i = (i & 0x3333333333333333) + ((i >> 2) & 0x3333333333333333); i = (i + (i >> 4)) & 0xF0F0F0F0F0F0F0F; return (i * 0x101010101010101) >> 56; } int bar (unsigned long long i) { unsigned int i1 = i, i2 = i >> 32; i1 = i1 - ((i1 >> 1) & 0x55555555); i2 = i2 - ((i2 >> 1) & 0x55555555); i1 = (i1 & 0x33333333) + ((i1 >> 2) & 0x33333333); i2 = (i2 & 0x33333333) + ((i2 >> 2) & 0x33333333); i1 = (i1 + (i1 >> 4)) & 0xF0F0F0F; i2 = (i2 + (i2 >> 4)) & 0xF0F0F0F; return ((i1 + i2) * 0x1010101) >> 24; } int baz (unsigned long long i) { i = i - ((i >> 1) & 0x5555555555555555); i = (i & 0x3333333333333333) + ((i >> 2) & 0x3333333333333333); i = (i + (i >> 4)) & 0xF0F0F0F0F0F0F0F; return ((((unsigned int) i) + (unsigned int) (i >> 32)) * 0x1010101) >> 24; } on gcc -O2 -m32 and picked the second variant as shortest for the UDWtype implementation, I guess that is likely the case on most targets. Note that the patch still doesn't attempt to figure out if UWtype multiplication is expensive or not, perhaps a useful test would be whether we emit __umul3 for that mode inside of libgcc - we'd need to define some macro and if multiplication is expensive use the shifts + additions alternative instead.