public inbox for gcc-bugs@sourceware.org
* [Bug target/67510] New: Faster code is possible for integer absolute value
@ 2015-09-09  7:55 peter at cordes dot ca
  2015-09-14 11:20 ` [Bug target/67510] x86: " rguenth at gcc dot gnu.org
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: peter at cordes dot ca @ 2015-09-09  7:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67510

            Bug ID: 67510
           Summary: Faster code is possible for integer absolute value
           Product: gcc
           Version: 5.2.0
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

Consider integer absolute value in its own function:

int absval(int x) { return x >= 0 ? x : -x; }  // https://goo.gl/PBduHM

I think the ideal sequence would be this, which only has two instructions on
the critical path:
        xorl      %eax, %eax    # not on the critical path.  
                                # 1 uop, no execution unit on Intel SnB and later.
        subl      %edi, %eax    # 1 uop, port 0/1/5/6
        cmovl     %edi, %eax    # 2 uops, p0156, latency=2
        # ret  # not included in calculations

Intel Sandybridge microarchitecture: latency = 3, throughput = 1.  4 uops (3
for p0156, 1 eliminated at register rename).

Intel P6 (Core2/Nehalem): Same, but the xor takes an execution port.

Silvermont: 3 uops, latency=3, throughput=2 per 3c.


AMD Bulldozer-family: 3 m-ops, 2c latency.  throughput=2 per 3c.  The xor is
recognized as independent, but still uses one of the two execution ports.

AMD K10: 3 m-ops, 2c latency.  throughput = 1 per c.


clang already uses a variant of this:

absval(int): clang 3.7
        movl    %edi, %eax  # on the critical path
        negl    %eax
        cmovll  %edi, %eax
        retq

 On Sandybridge, xor self,self is eliminated at register-rename time (and
doesn't take an execution unit), but mov is only eliminated on IvyBridge.  On
all CPUs other than Intel IvyBridge and later, my version is better than
clang's.  (On IvB+, they're equivalent.)

 On AMD Piledriver / Steamroller, mov r,r has higher throughput than other
integer instructions, but it still has latency=1c, unlike IvyBridge's
zero-latency move elimination.  So clang's version may be better on recent AMD,
if throughput trumps latency.


gcc and icc both use the xor/sub bithack formula:
  sign = x>>31; // (register filled with sign bit of x).
  Abs(x) = (x ^ sign) - sign;
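
As a self-contained C sketch of that formula (the helper name is just for
illustration; like the ternary version, abs(INT_MIN) still overflows):

  static inline int abs_bithack(int x) {
      int sign = x >> 31;        /* arithmetic shift: 0 for x >= 0, -1 for x < 0 */
      return (x ^ sign) - sign;  /* xor is a no-op or a NOT; subtracting -1 adds 1 */
  }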

absval(int): gcc various versions, including 5.2.0 (godbolt)
        movl    %edi, %edx   # Why do this at all?
        movl    %edi, %eax   # no latency on IvyBridge and later
        sarl    $31, %edx
        xorl    %edx, %eax
        subl    %edx, %eax
        ret

Why does gcc copy edi to edx before shifting?  It could movl %edi, %eax  / sarl
$31, %edi / etc.  I guess that's a separate bug, and wouldn't come up most of
the time when inlining.  Let's pretend gcc is smart, and only uses one mov.
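
Spelled out, that one-mov arrangement would be something like (a sketch, not
actual gcc output):

        movl    %edi, %eax   # eax = x  (the only mov)
        sarl    $31, %edi    # edi = sign mask (0 or -1); can run in parallel with the mov
        xorl    %edi, %eax
        subl    %edi, %eax   # eax = abs(x)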

Intel IvyBridge and later: 3(+1) uops (1 p0/5, 2 p015(6), 1 eliminated mov).
  Lat=3 (+1 on SnB and earlier, no move elimination)
  Throughput ~= 1 per cycle.  (3 per 4 cycles on Sandybridge and earlier,
because the move still needs one of the 3 execution ports.)

Intel Silvermont: uops=4, Lat=4, Tput = 1 per 2c.

Intel P6 (pre SnB): this avoids any multi-uop instructions which can bottleneck
the decoders.  Performance is the same as Sandybridge: lat=4, tput=3 per 4c.

AMD Bulldozer family: mov reg,reg has higher throughput than other
instructions, but still 1 cycle latency.  4 m-ops, one of them being a reg-reg
move.
  Lat=4, Tput=2 per 3c (Piledriver & Steamroller, where mov r,r can run on an
AG port), or 1 per 2c (Bulldozer, where mov r,r is on ports EX01).

AMD K10: 4 m-ops.  Lat=4, Tput=3 per 4c.



absval(int): icc13
        movl      %edi, %eax  # not needed if we can generate input in eax
        cltd               # replaces mov + sar 31.  CDQ in NASM syntax
        xorl      %edx, %edi
        subl      %edx, %edi
        movl      %edi, %eax  # not needed if output in the input reg is ok
        # ret

This is an interesting optimization which saves a mov if we can generate the
input in eax, and clobber it with the output, bringing this down to 3
instructions with lat=3.  None of the instructions are mov, and cltd/xor/sub
are all 1 cycle latency instructions that can run on any port, on all Intel/AMD
CPUs.
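
If the input really were produced in eax and could be clobbered, that would
leave just (a sketch):

        cltd                    # edx = copies of eax's sign bit
        xorl    %edx, %eax
        subl    %edx, %eax      # eax = abs(original eax)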


========================================================

If we look at something that uses the absolute value without caring what
register it's in (e.g. as a LUT index), that takes out the extra move in gcc's
abs() idiom:


int abslookup(int x, int *LUT) { return LUT[x >= 0 ? x : -x]; }
        # gcc 5.2
        movl    %edi, %eax
        sarl    $31, %eax
        xorl    %eax, %edi
        subl    %eax, %edi
        movslq  %edi, %rdi         # abs(INT_MIN) is still negative :/
        movl    (%rsi,%rdi,4), %eax
        ret

This would have more parallelism if we shifted edi and produced the result in
eax.  That would shorten the critical path for CPUs without zero-latency mov. 
(The mov and sar could happen in parallel.)  Any chance the extra mov in plain
abs, and the badly chosen dependency chain here, are related?

  It would also let us save another couple instruction bytes by using cltq (aka
NASM cdqe) to sign-extend eax to rax, instead of movslq.  (Clang does this
after generating abs(x) in eax with cmov.)
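
Putting both ideas together, the code-gen might look something like this (a
hand-written sketch, not actual compiler output):

        movl    %edi, %eax          # eax = x; can run in parallel with the sar
        sarl    $31, %edi           # edi = sign mask (0 or -1)
        xorl    %edi, %eax
        subl    %edi, %eax          # eax = abs(x)
        cltq                        # sign-extend eax to rax (NASM cdqe)
        movl    (%rsi,%rax,4), %eax
        ret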



* [Bug target/67510] x86: Faster code is possible for integer absolute value
  2015-09-09  7:55 [Bug target/67510] New: Faster code is possible for integer absolute value peter at cordes dot ca
@ 2015-09-14 11:20 ` rguenth at gcc dot gnu.org
  2021-07-26 23:44 ` pinskia at gcc dot gnu.org
  2024-03-07 22:06 ` pcordes at gmail dot com
  2 siblings, 0 replies; 4+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-09-14 11:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67510

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2015-09-14
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  x86 doesn't seem to have integer abs patterns, so we end up using
expand_abs_nojump():

  /* If this machine has expensive jumps, we can do integer absolute
     value of X as (((signed) x >> (W-1)) ^ x) - ((signed) x >> (W-1)),
     where W is the width of MODE.  */



* [Bug target/67510] x86: Faster code is possible for integer absolute value
  2015-09-09  7:55 [Bug target/67510] New: Faster code is possible for integer absolute value peter at cordes dot ca
  2015-09-14 11:20 ` [Bug target/67510] x86: " rguenth at gcc dot gnu.org
@ 2021-07-26 23:44 ` pinskia at gcc dot gnu.org
  2024-03-07 22:06 ` pcordes at gmail dot com
  2 siblings, 0 replies; 4+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-07-26 23:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67510

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=97873,
                   |                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=92651
   Target Milestone|---                         |11.0
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Fixed by r10-5498 and r11-5429.


* [Bug target/67510] x86: Faster code is possible for integer absolute value
  2015-09-09  7:55 [Bug target/67510] New: Faster code is possible for integer absolute value peter at cordes dot ca
  2015-09-14 11:20 ` [Bug target/67510] x86: " rguenth at gcc dot gnu.org
  2021-07-26 23:44 ` pinskia at gcc dot gnu.org
@ 2024-03-07 22:06 ` pcordes at gmail dot com
  2 siblings, 0 replies; 4+ messages in thread
From: pcordes at gmail dot com @ 2024-03-07 22:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67510

--- Comment #3 from Peter Cordes <pcordes at gmail dot com> ---
(In reply to Andrew Pinski from comment #2)
> Fixed by r10-5498 and r11-5429 .

Looks ok for -mtune=generic and non-Intel, but MOV/NEG isn't ideal on some
microarchitectures.

We still aren't using CMOV for -mtune=broadwell or later, where it's single-uop
with 1 cycle latency, same as AMD.  https://uops.info/  (CMOVBE/CMOVA are still
2 uops on Intel P-cores, since they read CF as well as ZF from the SPAZO group,
so 4 total inputs.  But absolute value only uses signed conditions, which don't
involve CF.)

Intel E-cores (Silvermont family) also have efficient CMOV.  Latency = 1 cycle
from one input, 2 cycles from the other, but a single uop for the front-end.
(It looks like it uses two execution pipes, like INC does, but often the
front-end is the bottleneck.)  So we should be using it there, too.  The
non-CMOV code-gen has even higher latency and worse throughput.

Earlier Intel SnB-family (Sandy / Ivy Bridge and Haswell) might be better with
CMOV, too, since the uop cache reduces the downside of a 2-uop instruction only
being able to decode in the first decoder (so maybe keep the current code-gen
on P6-family?).  Either way it's 4 uops, and the CMOV version has smaller
machine-code size and is better on later CPUs.

----

The current CMOV code-gen has MOV on the latency critical path for
microarchitectures without mov-elimination.  (bdver* and earlier AMD where we
currently use CMOV, and Ice Lake, first-gen Sandybridge, and P6 family where we
currently don't.  I haven't checked on Zhaoxin.)

https://gcc.godbolt.org/z/Mbsqzhbfs
        movl    %edi, %eax
        negl    %eax
        cmovs   %edi, %eax

This could be easily avoided, either with XOR/SUB like I suggested in my
original report, or by NEGating the original value instead of the copy, for
CMOVNS. CMOV would still be overwriting the MOV result so would still free up
mov-elimination resources promptly.
(https://stackoverflow.com/questions/75204302/how-do-move-elimination-slots-work-in-intel-cpu
).
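
A sketch of the NEG-the-original variant (not current gcc output):

        movl    %edi, %eax      # eax = x
        negl    %edi            # edi = -x, sets SF; destroys the input reg
        cmovns  %edi, %eax      # take -x when SF==0, i.e. when x <= 0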

But unlike XOR/SUB, that destroys the original value so isn't an option if the
input register is needed later.  And with APX available, NEG into a new
register is optimal.  So it might be easier to only consider patterns that
produce the negated value in a separate register, which should be easy to
peephole optimize with APX non-destructive operations.

XOR/SUB saves code-size for absolute value of 64-bit integers; the xor-zeroing
doesn't need a REX prefix if the destination is a low reg, e.g.:

# better version for most CPUs, especially for 64-bit operand-size
    xor    %eax, %eax   # off the critical path on all x86-64, eliminated on recent uarches
    sub    %rdi, %rax   # REX prefix
    cmovs  %rdi, %rax   # REX prefix

In terms of back-end execution-unit cost:

* Zen 3 and later eliminate both MOV and XOR-zeroing, so both should be equal
performance except for the code-size difference with 64-bit operands.
* Zen 1 and 2 eliminate MOV but not XOR-zeroing.  (XOR-zero is dep-breaking so
it's not on the critical path, but still needs an execution unit to write the
zero, so ~4 per clock throughput)
* Gracemont E-cores eliminate MOV but not XOR-zeroing, like Zen 1 and 2.  Their
front-end is probably more likely to be the bottleneck, so this is even less of
a big deal.
 https://chipsandcheese.com/2021/12/21/gracemont-revenge-of-the-atom-cores/ is
my source for this and Zen 1/2 vs. Zen 3.  

https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-frontend-and-execution-engine/
has Zen 4 numbers, although its Sunny Cove (Ice Lake) numbers don't seem to
reflect the microcode update that disabled mov-elim for GPRs.
* Tremont and earlier Silvermont family: MOV is not eliminated, so XOR-zeroing
is better.
* Ice Lake (and first-gen Sandy Bridge) eliminate XOR-zeroing but not integer
MOV, so the current tune=generic code-gen has worse latency than necessary.  
(Thanks to Intel's microcode workarounds for CPU errata, there will be CPUs
with non-zero latency for  mov reg,reg  in service for many more years than
previously expected.)

* Ivy Bridge and later, except the Ice Lake family (Ice / Tiger / Rocket),
eliminate both, so either way is fine; XOR/SUB is preferable because Ice Lake
exists.


* P6-family doesn't eliminate either MOV or XOR.  P6-family is obsolete enough
that it's probably not worth anyone's time to update the tuning stuff for them.
 But XOR/SUB is clearly better if we're using CMOV at all.  (2-uop instructions
like CMOVS can be a problem depending on surrounding code, given their lack of
a uop cache.)

  Early P6-family (up to Pentium III) has a false dependency with xor-zeroing:
the register gets specially marked as having its upper bits zeroed (EAX=AL) to
avoid partial-register stalls, but the xor isn't independent of the old value
of EAX.  -mtune=intel and -mtune=generic shouldn't care about this (especially
not with -m64, since the affected CPUs are 32-bit only).  Despite the false
dependency, XOR-zero is probably still a good bet for -mtune=pentiumpro /
pentium2 / pentium3 if we can pick a register that's likely not to have been
written recently.  (The reorder buffer isn't huge on those old CPUs.)

* Netburst (Pentium 4) doesn't eliminate either, so XOR-zeroing is clearly
better, like Pentium-M and later P6.

* Via: single-uop CMOV on Nano 3000, 2 uops on Nano 2000.  MOV is not
eliminated so XOR-zeroing / SUB should be used.
* Zhaoxin: I don't know, haven't seen data on these.  Presumably XOR/SUB/CMOV
is good.


So in summary, -mtune=generic should use this:

    xor    %eax, %eax
    sub    %edi, %eax
    cmovs  %edi, %eax

It's better for critical-path latency on Ice Lake, and old CPUs like Sandy
Bridge and earlier, Bulldozer-family, P6-family, and Tremont and earlier
Silvermont-family.

It's *slightly* worse (needing an extra back-end uop to execute) on Zen 1 and
2, and Gracemont.  But that back-end uop has no dependencies so can execute in
any spare cycle on that execution port.

And for int64_t, this saves a byte of code-size in the xor-zeroing if the
destination isn't R8-R15.


Also, -march=broadwell and later should be using CMOV.


Current code-gen from -march=icelake has 4 cycle critical-path latency vs. 2
for XOR/SUB/CMOV.

        movl    %edi, %eax      # not eliminated on Ice Lake
        cltd                    # 1 uop, 1 cycle latency to write EDX
        xorl    %edx, %eax
        subl    %edx, %eax
        ret

Even worse, -march=goldmont avoids CDQ (AT&T cltd), costing an extra uop.  It
still has 4-cycle latency:

        movl    %edi, %edx
        sarl    $31, %edx        # could have been cltd after the next MOV
        movl    %edi, %eax
        xorl    %edx, %eax
        subl    %edx, %eax
        ret

CPUs like Goldmont and Gracemont have CMOV with 1 cycle latency from the
destination operand, 2 cycle latency from the source and FLAGS.  So
unfortunately there's no way to get latency down to 2 cycles (we need a cycle
to generate the FLAGS input however we arrange the GPR operands).  Either
XOR/SUB/CMOV or MOV/NEG-original/CMOV has 3-cycle latency, vs. 4 for the
current tune=generic code-gen (except on Gracemont, where zero-latency MOV keeps
it down to 3 cycles).

https://uops.info/table.html?search=cmovs%20r32)&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SNB=on&cb_CFL=on&cb_ICL=on&cb_ADLP=on&cb_GLM=on&cb_ADLE=on&cb_ZENp=on&cb_ZEN4=on&cb_measurements=on&cb_base=on

https://uops.info/html-lat/GLM/CMOVS_R32_R32-Measurements.html

Same goes for Intel P-cores before Broadwell: XOR/SUB/CMOV has a total of 3
cycles of latency, vs. 3 or 4 for mov/cltd/xor/sub, depending on mov-elimination
for the initial mov.


end of thread

Thread overview: 4+ messages
2015-09-09  7:55 [Bug target/67510] New: Faster code is possible for integer absolute value peter at cordes dot ca
2015-09-14 11:20 ` [Bug target/67510] x86: " rguenth at gcc dot gnu.org
2021-07-26 23:44 ` pinskia at gcc dot gnu.org
2024-03-07 22:06 ` pcordes at gmail dot com
