From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugs-return-425264-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 20935 invoked by alias); 27 Jun 2013 07:13:10 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
Received: (qmail 20852 invoked by uid 48); 27 Jun 2013 07:13:07 -0000
From: "jakub at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/36041] Speed up builtin_popcountll
Date: Thu, 27 Jun 2013 07:13:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: middle-end
X-Bugzilla-Version: 4.2.0
X-Bugzilla-Keywords:
X-Bugzilla-Severity: enhancement
X-Bugzilla-Who: jakub at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: attachments.created
Message-ID: <bug-36041-4-LH5ZCHksJy@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-36041-4@http.gcc.gnu.org/bugzilla/>
References: <bug-36041-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2013-06/txt/msg01643.txt.bz2

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36041
--- Comment #21 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 30382
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30382&action=edit
gcc49-pr36041.patch

Untested libgcc2.c implementation (no hw support).  HW support is IMHO best
dealt on the compiler side doing something, already a PLT call is fairly
expensive, but it depends if __builtin_popcount* is used in a hot loop or in
cold code (in the latter case it really doesn't matter).

I've looked at code generated for:
int
foo (unsigned long long i)
{
  i = i - ((i >> 1) & 0x5555555555555555);
  i = (i & 0x3333333333333333) + ((i >> 2) & 0x3333333333333333);
  i = (i + (i >> 4)) & 0xF0F0F0F0F0F0F0F;
  return (i * 0x101010101010101) >> 56;
}

int
bar (unsigned long long i)
{
  unsigned int i1 = i, i2 = i >> 32;
  i1 = i1 - ((i1 >> 1) & 0x55555555);
  i2 = i2 - ((i2 >> 1) & 0x55555555);
  i1 = (i1 & 0x33333333) + ((i1 >> 2) & 0x33333333);
  i2 = (i2 & 0x33333333) + ((i2 >> 2) & 0x33333333);
  i1 = (i1 + (i1 >> 4)) & 0xF0F0F0F;
  i2 = (i2 + (i2 >> 4)) & 0xF0F0F0F;
  return ((i1 + i2) * 0x1010101) >> 24;
}

int
baz (unsigned long long i)
{
  i = i - ((i >> 1) & 0x5555555555555555);
  i = (i & 0x3333333333333333) + ((i >> 2) & 0x3333333333333333);
  i = (i + (i >> 4)) & 0xF0F0F0F0F0F0F0F;
  return ((((unsigned int) i) + (unsigned int) (i >> 32)) * 0x1010101) >> 24;
}
on gcc -O2 -m32 and picked the second variant as shortest for the UDWtype
implementation, I guess that is likely the case on most targets.  Note that the
patch still doesn't attempt to figure out if UWtype multiplication is expensive
or not, perhaps a useful test would be whether we emit __umul<mode>3 for that
mode inside of libgcc - we'd need to define some macro and if multiplication is
expensive use the shifts + additions alternative instead.