From: "peter at cordes dot ca"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/67510] New: Faster code is possible for integer absolute value
Date: Wed, 09 Sep 2015 07:55:00 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67510

            Bug ID: 67510
           Summary: Faster code is possible for integer absolute value
           Product: gcc
           Version: 5.2.0
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

Consider integer absolute value in its own function:

int absval(int x) { return x >= 0 ? x : -x; }   // https://goo.gl/PBduHM

I think the ideal sequence would be this, which has only two instructions
on the critical path:

        xorl    %eax, %eax      # not on the critical path.
                                # 1 uop, no execution unit on Intel SnB and later.
        subl    %edi, %eax      # 1 uop, port 0/1/5/6
        cmovl   %edi, %eax      # 2 uops, p0156, latency=2
        # ret                   # not included in calculations

Intel Sandybridge microarchitecture: latency = 3, throughput = 1.
  4 uops (3 for p0156, 1 eliminated at register rename).
Intel P6 (Core2/Nehalem): same, but the xor takes an execution port.
Silvermont: 3 uops, latency=3, throughput=2 per 3c.
AMD Bulldozer-family: 3 m-ops, 2c latency, throughput=2 per 3c.  The xor
  is recognized as independent, but still uses one of the two execution
  ports.
AMD K10: 3 m-ops, 2c latency, throughput=1 per c.

clang already uses a variant of this:

absval(int):                    # clang 3.7
        movl    %edi, %eax      # on the critical path
        negl    %eax
        cmovll  %edi, %eax
        retq

On Sandybridge, xor self,self is eliminated at register-rename time (and
doesn't take an execution unit), but mov is only eliminated on IvyBridge.
So on all CPUs other than Intel IvyBridge and later, my version is better
than clang's.  (On IvB+, they're equivalent.)

On AMD Piledriver / Steamroller, mov r,r has higher throughput than other
integer instructions, but it still has latency=1c, unlike IvyBridge's
zero-latency move elimination.  So clang's version may be better on recent
AMD, if throughput trumps latency.

gcc and icc both use the xor/sub bithack formula:

        sign = x >> 31;               // register filled with the sign bit of x
        abs(x) = (x ^ sign) - sign;

absval(int):                    # gcc various versions, including 5.2.0 (godbolt)
        movl    %edi, %edx      # Why do this at all?
        movl    %edi, %eax      # no latency on IvyBridge and later
        sarl    $31, %edx
        xorl    %edx, %eax
        subl    %edx, %eax
        ret

Why does gcc copy edi to edx before shifting?
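It could movl %edi, %eax / sarl $31, %edi / etc. instead.  Spelled out,
that one-mov version would look like this (a hand-written sketch, not
actual gcc output; I've filled in the "etc." with the same xor/sub tail
gcc already emits):

        movl    %edi, %eax      # copy x (eliminated on IvB and later)
        sarl    $31, %edi       # shift in place: edi = sign mask (0 or -1)
        xorl    %edi, %eax      # x ^ sign
        subl    %edi, %eax      # (x ^ sign) - sign = abs(x)
        ret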
I guess that's a separate bug, and wouldn't come up most of the time when
inlining.  Let's pretend gcc is smart, and only uses one mov (as above).

Intel IvyBridge and later: 3(+1) uops (1 p0/5, 2 p015(6), 1 eliminated mov).
  Lat=3 (+1 on SnB and earlier, which have no move elimination).
  Throughput ~= 1 per cycle (3 per 4 cycles on Sandybridge and earlier,
  because the mov still needs one of the 3 execution ports).
Intel Silvermont: uops=4, Lat=4, Tput=1 per 2c.
Intel P6 (pre-SnB): this avoids any multi-uop instructions, which can
  bottleneck the decoders.  Performance is the same as Sandybridge:
  lat=4, tput=3 per 4c.
AMD Bulldozer family: mov reg,reg has higher throughput than other
  instructions, but still 1 cycle latency.  4 m-ops, one of them a reg-reg
  move.  Lat=4.  Tput=2 per 3c (Piledriver & Steamroller, where mov r,r
  can run on an AG port), or 1 per 2c (Bulldozer, where mov r,r is on
  ports EX01).
AMD K10: 4 m-ops.  Lat=4, Tput=3 per 4c.

icc13 arranges the same bithack differently:

absval(int):                    # icc13
        movl    %edi, %eax      # not needed if we can generate the input in eax
        cltd                    # replaces mov + sar 31.  CDQ in NASM syntax.
        xorl    %edx, %edi
        subl    %edx, %edi
        movl    %edi, %eax      # not needed if output in the input reg is ok
        # ret

This is an interesting optimization: it saves a mov if we can generate the
input in eax and clobber it with the output, bringing this down to 3
instructions with lat=3.  None of the instructions are mov, and
cltd/xor/sub are all 1-cycle-latency instructions that can run on any
port, on all Intel/AMD CPUs.

========================================================

If we look at something that uses the absolute value without caring what
register it's in (e.g. as a LUT index), that takes out the extra move in
gcc's abs() idiom:

int abslookup(int x, int *LUT) { return LUT[x >= 0 ? x : -x]; }

        # gcc 5.2
        movl    %edi, %eax
        sarl    $31, %eax
        xorl    %eax, %edi
        subl    %eax, %edi
        movslq  %edi, %rdi      # abs(INT_MIN) is still negative :/
        movl    (%rsi,%rdi,4), %eax
        ret

This would have more parallelism if we shifted edi and produced the result
in eax: that would shorten the critical path for CPUs without zero-latency
mov, since the mov and sar could happen in parallel.  Any chance the extra
mov in plain abs, and the badly chosen dependency chain here, are related?

It would also let us save another couple of instruction bytes by using
cltq (aka NASM cdqe) to sign-extend eax to rax, instead of movslq.
(Clang does this after generating abs(x) in eax with cmov.)
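For reference, the improved sequence I'm suggesting would look something
like this (a hand-written sketch, not output from any compiler; same
registers as gcc's output above, x in edi and LUT in rsi):

        movl    %edi, %eax      # mov and sar are independent, so they can
        sarl    $31, %edi       #   execute in parallel even without
                                #   move elimination
        xorl    %edi, %eax      # build (x ^ sign) in eax
        subl    %edi, %eax      # eax = abs(x)
        cltq                    # sign-extend eax to rax: shorter than movslq
        movl    (%rsi,%rax,4), %eax
        ret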