* [__mulvti3] register allocator plays shell game
@ 2020-10-25 19:35 Stefan Kanthak
2020-10-26 10:19 ` Richard Biener
0 siblings, 1 reply; 5+ messages in thread
From: Stefan Kanthak @ 2020-10-25 19:35 UTC (permalink / raw)
To: gcc
Hi,
for the AMD64 alias x86_64 platform and the __int128_t [DW]type,
the first few lines of the __mulvDI3() function from libgcc2.c
| DWtype
| __mulvDI3 (DWtype u, DWtype v)
| {
| /* The unchecked multiplication needs 3 Wtype x Wtype multiplications,
| but the checked multiplication needs only two. */
| const DWunion uu = {.ll = u};
| const DWunion vv = {.ll = v};
|
| if (__builtin_expect (uu.s.high == uu.s.low >> (W_TYPE_SIZE - 1), 1))
| {
| /* u fits in a single Wtype. */
| if (__builtin_expect (vv.s.high == vv.s.low >> (W_TYPE_SIZE - 1), 1))
| {
| /* v fits in a single Wtype as well. */
| /* A single multiplication. No overflow risk. */
| return (DWtype) uu.s.low * (DWtype) vv.s.low;
| }
are compiled to this braindead code (obtained from libgcc.a of
GCC 10.2.0 installed on Debian):
0000000000000000 <__mulvti3>:
0: 41 55 push %r13
2: 49 89 cb mov %rcx,%r11
5: 48 89 d0 mov %rdx,%rax
8: 49 89 d2 mov %rdx,%r10
b: 41 54 push %r12
d: 49 89 fc mov %rdi,%r12
10: 48 89 d1 mov %rdx,%rcx
13: 49 89 f0 mov %rsi,%r8
16: 4c 89 e2 mov %r12,%rdx
19: 49 89 f5 mov %rsi,%r13
1c: 53 push %rbx
1d: 48 89 fe mov %rdi,%rsi
20: 48 c1 fa 3f sar $0x3f,%rdx
24: 48 c1 f8 3f sar $0x3f,%rax
28: 4c 89 df mov %r11,%rdi
2b: 4c 39 c2 cmp %r8,%rdx
2e: 75 18 jne 48 <__mulvti3+0x48>
30: 4c 39 d8 cmp %r11,%rax
33: 75 6b jne a0 <__mulvti3+0xa0>
35: 4c 89 e0 mov %r12,%rax
38: 49 f7 ea imul %r10
3b: 5b pop %rbx
3c: 41 5c pop %r12
3e: 41 5d pop %r13
40: c3 retq
...
There are EIGHT superfluous MOV instructions here, clobbering the
non-volatile registers RBX, R12 and R13, plus THREE superfluous
PUSH/POP pairs.
What stops GCC from generating the following straightforward code
(11 instructions in 31 bytes instead of 25 instructions in 65 bytes)?
.intel_syntax noprefix
__mulvti3:
mov r8, rdi
mov r9, rdx
sra r8, 63
sra r9, 63
cmp r8, rsi
jne __mulvti3+0x48+65-31
cmp r9, rcx
jne __mulvti3+0xa0+65-31
mov rax, rdi
imul rdx
ret
...
not amused
Stefan Kanthak
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [__mulvti3] register allocator plays shell game
2020-10-25 19:35 [__mulvti3] register allocator plays shell game Stefan Kanthak
@ 2020-10-26 10:19 ` Richard Biener
2020-10-26 22:56 ` Stefan Kanthak
0 siblings, 1 reply; 5+ messages in thread
From: Richard Biener @ 2020-10-26 10:19 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: GCC Development
On Sun, Oct 25, 2020 at 8:37 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>
> Hi,
>
> for the AMD64 alias x86_64 platform and the __int128_t [DW]type,
> the first few lines of the __mulvDI3() function from libgcc2.c
>
> | DWtype
> | __mulvDI3 (DWtype u, DWtype v)
> | {
> | /* The unchecked multiplication needs 3 Wtype x Wtype multiplications,
> | but the checked multiplication needs only two. */
> | const DWunion uu = {.ll = u};
> | const DWunion vv = {.ll = v};
> |
> | if (__builtin_expect (uu.s.high == uu.s.low >> (W_TYPE_SIZE - 1), 1))
> | {
> | /* u fits in a single Wtype. */
> | if (__builtin_expect (vv.s.high == vv.s.low >> (W_TYPE_SIZE - 1), 1))
> | {
> | /* v fits in a single Wtype as well. */
> | /* A single multiplication. No overflow risk. */
> | return (DWtype) uu.s.low * (DWtype) vv.s.low;
> | }
>
> are compiled to this braindead code (obtained from libgcc.a of
> GCC 10.2.0 installed on Debian):
>
> 0000000000000000 <__mulvti3>:
> 0: 41 55 push %r13
> 2: 49 89 cb mov %rcx,%r11
> 5: 48 89 d0 mov %rdx,%rax
> 8: 49 89 d2 mov %rdx,%r10
> b: 41 54 push %r12
> d: 49 89 fc mov %rdi,%r12
> 10: 48 89 d1 mov %rdx,%rcx
> 13: 49 89 f0 mov %rsi,%r8
> 16: 4c 89 e2 mov %r12,%rdx
> 19: 49 89 f5 mov %rsi,%r13
> 1c: 53 push %rbx
> 1d: 48 89 fe mov %rdi,%rsi
> 20: 48 c1 fa 3f sar $0x3f,%rdx
> 24: 48 c1 f8 3f sar $0x3f,%rax
> 28: 4c 89 df mov %r11,%rdi
> 2b: 4c 39 c2 cmp %r8,%rdx
> 2e: 75 18 jne 48 <__mulvti3+0x48>
> 30: 4c 39 d8 cmp %r11,%rax
> 33: 75 6b jne a0 <__mulvti3+0xa0>
> 35: 4c 89 e0 mov %r12,%rax
> 38: 49 f7 ea imul %r10
> 3b: 5b pop %rbx
> 3c: 41 5c pop %r12
> 3e: 41 5d pop %r13
> 40: c3 retq
> ...
>
> There are EIGHT superfluous MOV instructions here, clobbering the
> non-volatile registers RBX, R12 and R13, plus THREE superfluous
> PUSH/POP pairs.
>
> What stops GCC from generating the following straightforward code
> (11 instructions in 31 bytes instead of 25 instructions in 65 bytes)?
>
> .intel_syntax noprefix
> __mulvti3:
> mov r8, rdi
> mov r9, rdx
> sra r8, 63
> sra r9, 63
> cmp r8, rsi
> jne __mulvti3+0x48+65-31
> cmp r9, rcx
> jne __mulvti3+0xa0+65-31
> mov rax, rdi
> imul rdx
> ret
> ...
>
>
> not amused
can you open a bugreport please?
Richard.
> Stefan Kanthak
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [__mulvti3] register allocator plays shell game
2020-10-26 10:19 ` Richard Biener
@ 2020-10-26 22:56 ` Stefan Kanthak
2020-10-27 8:33 ` Richard Biener
0 siblings, 1 reply; 5+ messages in thread
From: Stefan Kanthak @ 2020-10-26 22:56 UTC (permalink / raw)
To: Richard Biener; +Cc: GCC Development
Richard Biener <richard.guenther@gmail.com> wrote:
> On Sun, Oct 25, 2020 at 8:37 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>>
>> Hi,
>>
>> for the AMD64 alias x86_64 platform and the __int128_t [DW]type,
>> the first few lines of the __mulvDI3() function from libgcc2.c
>>
>> | DWtype
>> | __mulvDI3 (DWtype u, DWtype v)
>> | {
>> | /* The unchecked multiplication needs 3 Wtype x Wtype multiplications,
>> | but the checked multiplication needs only two. */
>> | const DWunion uu = {.ll = u};
>> | const DWunion vv = {.ll = v};
>> |
>> | if (__builtin_expect (uu.s.high == uu.s.low >> (W_TYPE_SIZE - 1), 1))
>> | {
>> | /* u fits in a single Wtype. */
>> | if (__builtin_expect (vv.s.high == vv.s.low >> (W_TYPE_SIZE - 1), 1))
>> | {
>> | /* v fits in a single Wtype as well. */
>> | /* A single multiplication. No overflow risk. */
>> | return (DWtype) uu.s.low * (DWtype) vv.s.low;
>> | }
>>
>> are compiled to this braindead code (obtained from libgcc.a of
>> GCC 10.2.0 installed on Debian):
>>
>> 0000000000000000 <__mulvti3>:
>> 0: 41 55 push %r13
>> 2: 49 89 cb mov %rcx,%r11
>> 5: 48 89 d0 mov %rdx,%rax
>> 8: 49 89 d2 mov %rdx,%r10
>> b: 41 54 push %r12
>> d: 49 89 fc mov %rdi,%r12
>> 10: 48 89 d1 mov %rdx,%rcx
>> 13: 49 89 f0 mov %rsi,%r8
>> 16: 4c 89 e2 mov %r12,%rdx
>> 19: 49 89 f5 mov %rsi,%r13
>> 1c: 53 push %rbx
>> 1d: 48 89 fe mov %rdi,%rsi
>> 20: 48 c1 fa 3f sar $0x3f,%rdx
>> 24: 48 c1 f8 3f sar $0x3f,%rax
>> 28: 4c 89 df mov %r11,%rdi
>> 2b: 4c 39 c2 cmp %r8,%rdx
>> 2e: 75 18 jne 48 <__mulvti3+0x48>
>> 30: 4c 39 d8 cmp %r11,%rax
>> 33: 75 6b jne a0 <__mulvti3+0xa0>
>> 35: 4c 89 e0 mov %r12,%rax
>> 38: 49 f7 ea imul %r10
>> 3b: 5b pop %rbx
>> 3c: 41 5c pop %r12
>> 3e: 41 5d pop %r13
>> 40: c3 retq
>> ...
>>
>> There are EIGHT superfluous MOV instructions here, clobbering the
>> non-volatile registers RBX, R12 and R13, plus THREE superfluous
>> PUSH/POP pairs.
>>
>> What stops GCC from generating the following straightforward code
>> (11 instructions in 31 bytes instead of 25 instructions in 65 bytes)?
>>
>> .intel_syntax noprefix
>> __mulvti3:
>> mov r8, rdi
>> mov r9, rdx
>> sra r8, 63
>> sra r9, 63
>> cmp r8, rsi
>> jne __mulvti3+0x48+65-31
>> cmp r9, rcx
>> jne __mulvti3+0xa0+65-31
>> mov rax, rdi
>> imul rdx
>> ret
>> ...
>>
>>
>> not amused
>
> can you open a bugreport please?
I'd like to discuss alternative implementations of __mulvti3() first:
the very first lines of libgcc2.c reads
| /* More subroutines needed by GCC output code on some machines. */
| /* Compile this one with gcc. */
Why don't you take advantage of that and implement __mulvDI3() as
follows?
DWtype
__mulvDI3 (DWtype multiplicand, DWtype multiplier)
{
DWtype product;
if (__builtin_mul_overflow(multiplicand, multiplier, &product))
abort();
return product;
}
For the AMD64 platform, instead of the 131 instructions in 439 bytes
(partially cited above) GCC now generates 107 instructions in 324 bytes
... (un)fortunately showing this bug again, now clobbering RBP and RBX,
pushing/popping RBP, RBX, R12, R14 and R15, and still moving them around
without necessity:
__mulvti3:
movq %rdi, %r8
movq %rdx, %rdi
pushq %r15
movq %rcx, %r9
movq %r8, %rdx
movq %rdi, %rax
pushq %r14
sarq $63, %rdx
pushq %r12
sarq $63, %rax
pushq %rbp
xorl %ebp, %ebp
pushq %rbx
cmpq %rsi, %rdx
jne .L4
cmpq %rcx, %rax
jne .L5
movq %r8, %rax
imulq %rdi
movq %rax, %rbx
.L2:
testq %rbp, %rbp
jne __mulvti3.cold
movq %rbx, %rax
popq %rbx
popq %rbp
popq %r12
popq %r14
popq %r15
ret
...
.L4:
cmpq %rcx, %rax
jne .L7
...
jns .L8
xorl %r10d, %r10d
subq %r10, %rax
sbbq %r12, %rdx
.L8:
testq %r12, %r12
jns .L9
...
jne .L10
movq %rcx, %rbx
movq %r10, %rdx
jmp .L2
...
jmp .L6
.L7:
...
ja .L3
...
ja .L3
...
jne .L11
...
jl .L2
.L3:
movl $1, %ebp
jmp .L2
.L11:
testq %r10, %r10
js .L2
jmp .L3
.L10:
...
jmp .L3
__mulvti3.cold:
.L13:
call abort
Besides the poor register allocation, this code shows 12 conditional
plus 5 unconditional branches.
Let's try a second alternative implementation:
DWtype
__mulvDI3 (DWtype multiplicand, DWtype multiplier)
{
UDWtype product;
UDWtype sign = multiplicand >> (DW_TYPE_SIZE - 1);
UDWtype tmp = multiplier >> (DW_TYPE_SIZE - 1);
if (__builtin_mul_overflow((multiplicand ^ sign) - sign,
(multiplier ^ tmp) - tmp,
&product))
abort();
sign ^= tmp;
product = (product ^ sign) - sign;
if ((__int128_t) (product ^ sign) < 0)
abort();
return product;
}
This lets GCC generate 98 instructions in 293 bytes, with 6 conditional
plus 3 unconditional branches:
__mulvti3:
movq %rsi, %rax
pushq %r15
sarq $63, %rax
pushq %r14
movq %rdx, %r14
movq %rax, %r10
pushq %r13
movq %rax, %r11
movq %rcx, %rax
pushq %r12
sarq $63, %rax
pushq %rbp
movq %rdi, %rbp
movq %rsi, %rdi
movq %rax, %r8
xorq %r10, %rbp
xorq %r10, %rdi
movq %rax, %r9
pushq %rbx
movq %rbp, %rax
movq %rdi, %rdx
subq %r10, %rax
sbbq %r10, %rdx
xorq %r8, %r14
xorq %r8, %rcx
movq %rax, %rsi
movq %r14, %r12
movq %rcx, %r13
movq %rdx, %rdi
subq %r8, %r12
sbbq %r8, %r13
xorl %ebp, %ebp
movq %r13, %rcx
testq %rdx, %rdx
jne .L4
testq %r13, %r13
jne .L5
mulq %r12
movq %rax, %r14
movq %rdx, %rcx
.L2:
testq %rbp, %rbp
jne .L10
movq %r10, %rax
xorq %r8, %rax
movq %rax, %r12
movq %r11, %rax
xorq %r9, %rax
xorq %r12, %r14
movq %rax, %r13
movq %r14, %rax
xorq %r13, %rcx
subq %r12, %rax
movq %r13, %rbx
movq %rcx, %rdx
sbbq %r13, %rdx
xorq %rdx, %rbx
js .L10
popq %rbx
popq %rbp
popq %r12
popq %r13
popq %r14
popq %r15
ret
.p2align 4,,10
.p2align 3
.L4:
testq %r13, %r13
jne .L8
movq %rdx, %rcx
movq %r12, %rbx
.L6:
movq %rsi, %rax
mulq %r12
movq %rax, %r14
movq %rbx, %rax
movq %rdx, %r15
xorl %ebx, %ebx
mulq %rcx
addq %r15, %rax
adcq %rbx, %rdx
testq %rdx, %rdx
jne .L8
movq %rax, %rcx
jmp .L2
.p2align 4,,10
.p2align 3
.L5:
movq %rax, %rbx
jmp .L6
.p2align 4,,10
.p2align 3
.L8:
movq %rdi, %rcx
movq %r13, %rax
movl $1, %ebp
imulq %rsi, %rax
imulq %r12, %rcx
addq %rax, %rcx
movq %rsi, %rax
mulq %r12
addq %rdx, %rcx
movq %rax, %r14
jmp .L2
__mulvti3.cold:
.L10:
call abort
PROPER (assembly) code but uses only 44 instructions in just 121 bytes,
with ONE conditional branch!
--- mulvti3.s ---
# Copyright (C) 2014-2020, Stefan Kanthak
.arch generic64
.code64
.intel_syntax noprefix
.global abort
.global __mulvti3
.type __mulvti3, @function
.text
# rsi:rdi = multiplicand
# rcx:rdx = multiplier
__mulvti3:
mov r10, rdx
mov r11, rcx
mov rax, rcx
cqo # rdx = tmp
mov r9, rdx
xor r10, rdx
xor r11, rdx
sub r10, rdx
sbb r11, rdx # r11:r10 = |multiplier|
mov rax, rsi
cqo # rdx = sign
xor r9, rdx # r9 = sign ^ tmp
xor rdi, rdx
xor rsi, rdx
sub rdi, rdx
sbb rsi, rdx # rsi:rdi = |multiplicand|
xor ecx, ecx
cmp rcx, rsi
sbb edx, edx # edx = (|multiplicand| < 2**64) ? 0 : -1
cmp rcx, r11
sbb ecx, ecx # ecx = (|multiplier| < 2**64) ? 0 : -1
and ecx, edx # ecx = (|product| < 2**128) ? 0 : -1
mov rax, rsi
mul r10
adc ecx, ecx # ecx = (|product| < 2**128) ? 0 : *
mov rsi, rax
mov rax, rdi
mul r11
adc ecx, ecx # ecx = (|product| < 2**128) ? 0 : *
add rsi, rax
# adc ecx, ecx # ecx = (|product| < 2**128) ? 0 : *
mov rax, rdi
mul r10
add rdx, rsi
adc ecx, ecx # ecx = (|product| < 2**128) ? 0 : *
add rax, r9
adc rdx, r9 # rdx:rax = (product < 0) ? ~product % 2**128 : product % 2**128
mov rsi, rdx # rsi = (product % 2**128 < -2**127)
# | (product % 2**128 >= 2**127) ? negative : positive
xor rax, r9
xor rdx, r9 # rdx:rax = product % 2**128
add rsi, rsi
adc ecx, ecx # ecx = (product >= -2**127)
# & (product < 2**127) ? 0 : *
jnz .overflow
ret
.overflow:
call abort
.end
--- EOF ---
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [__mulvti3] register allocator plays shell game
2020-10-26 22:56 ` Stefan Kanthak
@ 2020-10-27 8:33 ` Richard Biener
2020-10-27 9:59 ` Stefan Kanthak
0 siblings, 1 reply; 5+ messages in thread
From: Richard Biener @ 2020-10-27 8:33 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: GCC Development
On Tue, Oct 27, 2020 at 12:01 AM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>
> Richard Biener <richard.guenther@gmail.com> wrote:
>
> > On Sun, Oct 25, 2020 at 8:37 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
> >>
> >> Hi,
> >>
> >> for the AMD64 alias x86_64 platform and the __int128_t [DW]type,
> >> the first few lines of the __mulvDI3() function from libgcc2.c
> >>
> >> | DWtype
> >> | __mulvDI3 (DWtype u, DWtype v)
> >> | {
> >> | /* The unchecked multiplication needs 3 Wtype x Wtype multiplications,
> >> | but the checked multiplication needs only two. */
> >> | const DWunion uu = {.ll = u};
> >> | const DWunion vv = {.ll = v};
> >> |
> >> | if (__builtin_expect (uu.s.high == uu.s.low >> (W_TYPE_SIZE - 1), 1))
> >> | {
> >> | /* u fits in a single Wtype. */
> >> | if (__builtin_expect (vv.s.high == vv.s.low >> (W_TYPE_SIZE - 1), 1))
> >> | {
> >> | /* v fits in a single Wtype as well. */
> >> | /* A single multiplication. No overflow risk. */
> >> | return (DWtype) uu.s.low * (DWtype) vv.s.low;
> >> | }
> >>
> >> are compiled to this braindead code (obtained from libgcc.a of
> >> GCC 10.2.0 installed on Debian):
> >>
> >> 0000000000000000 <__mulvti3>:
> >> 0: 41 55 push %r13
> >> 2: 49 89 cb mov %rcx,%r11
> >> 5: 48 89 d0 mov %rdx,%rax
> >> 8: 49 89 d2 mov %rdx,%r10
> >> b: 41 54 push %r12
> >> d: 49 89 fc mov %rdi,%r12
> >> 10: 48 89 d1 mov %rdx,%rcx
> >> 13: 49 89 f0 mov %rsi,%r8
> >> 16: 4c 89 e2 mov %r12,%rdx
> >> 19: 49 89 f5 mov %rsi,%r13
> >> 1c: 53 push %rbx
> >> 1d: 48 89 fe mov %rdi,%rsi
> >> 20: 48 c1 fa 3f sar $0x3f,%rdx
> >> 24: 48 c1 f8 3f sar $0x3f,%rax
> >> 28: 4c 89 df mov %r11,%rdi
> >> 2b: 4c 39 c2 cmp %r8,%rdx
> >> 2e: 75 18 jne 48 <__mulvti3+0x48>
> >> 30: 4c 39 d8 cmp %r11,%rax
> >> 33: 75 6b jne a0 <__mulvti3+0xa0>
> >> 35: 4c 89 e0 mov %r12,%rax
> >> 38: 49 f7 ea imul %r10
> >> 3b: 5b pop %rbx
> >> 3c: 41 5c pop %r12
> >> 3e: 41 5d pop %r13
> >> 40: c3 retq
> >> ...
> >>
> >> There are EIGHT superfluous MOV instructions here, clobbering the
> >> non-volatile registers RBX, R12 and R13, plus THREE superfluous
> >> PUSH/POP pairs.
> >>
> >> What stops GCC from generating the following straightforward code
> >> (11 instructions in 31 bytes instead of 25 instructions in 65 bytes)?
> >>
> >> .intel_syntax noprefix
> >> __mulvti3:
> >> mov r8, rdi
> >> mov r9, rdx
> >> sra r8, 63
> >> sra r9, 63
> >> cmp r8, rsi
> >> jne __mulvti3+0x48+65-31
> >> cmp r9, rcx
> >> jne __mulvti3+0xa0+65-31
> >> mov rax, rdi
> >> imul rdx
> >> ret
> >> ...
> >>
> >>
> >> not amused
> >
> > can you open a bugreport please?
>
>
> I'd like to discuss alternative implementations of __mulvti3() first:
> the very first lines of libgcc2.c reads
>
> | /* More subroutines needed by GCC output code on some machines. */
> | /* Compile this one with gcc. */
>
>
> Why don't you take advantage of that and implement __mulvDI3() as
> follows?
>
> DWtype
> __mulvDI3 (DWtype multiplicand, DWtype multiplier)
> {
> DWtype product;
>
> if (__builtin_mul_overflow(multiplicand, multiplier, &product))
> abort();
>
> return product;
> }
Sure, that's possible - but the main concern is that such fancy builtins
can end up calling libgcc. Which isn't really relevant for the
trap-on-overflow variants though.
I guess such change is desirable (also for the other arithmetic ops)
and code generation improvements should be done for the
__builtin_mul_overflow expansion, outside of libgcc.
Note __OPvMODE are considered legacy and -ftrapv dysfunctional
in general ...
>
> For the AMD64 platform, instead of the 131 instructions in 439 bytes
> (partially cited above) GCC now generates 107 instructions in 324 bytes
> ... (un)fortunately showing this bug again, now clobbering RBP and RBX,
> pushing/popping RBP, RBX, R12, R14 and R15, and still moving them around
> without necessity:
>
>
> __mulvti3:
> movq %rdi, %r8
> movq %rdx, %rdi
> pushq %r15
> movq %rcx, %r9
> movq %r8, %rdx
> movq %rdi, %rax
> pushq %r14
> sarq $63, %rdx
> pushq %r12
> sarq $63, %rax
> pushq %rbp
> xorl %ebp, %ebp
> pushq %rbx
> cmpq %rsi, %rdx
> jne .L4
> cmpq %rcx, %rax
> jne .L5
> movq %r8, %rax
> imulq %rdi
> movq %rax, %rbx
> .L2:
> testq %rbp, %rbp
> jne __mulvti3.cold
> movq %rbx, %rax
> popq %rbx
> popq %rbp
> popq %r12
> popq %r14
> popq %r15
> ret
> ...
> .L4:
> cmpq %rcx, %rax
> jne .L7
> ...
> jns .L8
> xorl %r10d, %r10d
> subq %r10, %rax
> sbbq %r12, %rdx
> .L8:
> testq %r12, %r12
> jns .L9
> ...
> jne .L10
> movq %rcx, %rbx
> movq %r10, %rdx
> jmp .L2
> ...
> jmp .L6
> .L7:
> ...
> ja .L3
> ...
> ja .L3
> ...
> jne .L11
> ...
> jl .L2
> .L3:
> movl $1, %ebp
> jmp .L2
> .L11:
> testq %r10, %r10
> js .L2
> jmp .L3
> .L10:
> ...
> jmp .L3
> __mulvti3.cold:
> .L13:
> call abort
>
> Besides the poor register allocation, this code shows 12 conditional
> plus 5 unconditional branches.
>
>
> Let's try a second alternative implementation:
>
> DWtype
> __mulvDI3 (DWtype multiplicand, DWtype multiplier)
> {
> UDWtype product;
> UDWtype sign = multiplicand >> (DW_TYPE_SIZE - 1);
> UDWtype tmp = multiplier >> (DW_TYPE_SIZE - 1);
>
> if (__builtin_mul_overflow((multiplicand ^ sign) - sign,
> (multiplier ^ tmp) - tmp,
> &product))
> abort();
>
> sign ^= tmp;
> product = (product ^ sign) - sign;
>
> if ((__int128_t) (product ^ sign) < 0)
> abort();
>
> return product;
> }
>
>
> This lets GCC generate 98 instructions in 293 bytes, with 6 conditional
> plus 3 unconditional branches:
>
> __mulvti3:
> movq %rsi, %rax
> pushq %r15
> sarq $63, %rax
> pushq %r14
> movq %rdx, %r14
> movq %rax, %r10
> pushq %r13
> movq %rax, %r11
> movq %rcx, %rax
> pushq %r12
> sarq $63, %rax
> pushq %rbp
> movq %rdi, %rbp
> movq %rsi, %rdi
> movq %rax, %r8
> xorq %r10, %rbp
> xorq %r10, %rdi
> movq %rax, %r9
> pushq %rbx
> movq %rbp, %rax
> movq %rdi, %rdx
> subq %r10, %rax
> sbbq %r10, %rdx
> xorq %r8, %r14
> xorq %r8, %rcx
> movq %rax, %rsi
> movq %r14, %r12
> movq %rcx, %r13
> movq %rdx, %rdi
> subq %r8, %r12
> sbbq %r8, %r13
> xorl %ebp, %ebp
> movq %r13, %rcx
> testq %rdx, %rdx
> jne .L4
> testq %r13, %r13
> jne .L5
> mulq %r12
> movq %rax, %r14
> movq %rdx, %rcx
> .L2:
> testq %rbp, %rbp
> jne .L10
> movq %r10, %rax
> xorq %r8, %rax
> movq %rax, %r12
> movq %r11, %rax
> xorq %r9, %rax
> xorq %r12, %r14
> movq %rax, %r13
> movq %r14, %rax
> xorq %r13, %rcx
> subq %r12, %rax
> movq %r13, %rbx
> movq %rcx, %rdx
> sbbq %r13, %rdx
> xorq %rdx, %rbx
> js .L10
> popq %rbx
> popq %rbp
> popq %r12
> popq %r13
> popq %r14
> popq %r15
> ret
> .p2align 4,,10
> .p2align 3
> .L4:
> testq %r13, %r13
> jne .L8
> movq %rdx, %rcx
> movq %r12, %rbx
> .L6:
> movq %rsi, %rax
> mulq %r12
> movq %rax, %r14
> movq %rbx, %rax
> movq %rdx, %r15
> xorl %ebx, %ebx
> mulq %rcx
> addq %r15, %rax
> adcq %rbx, %rdx
> testq %rdx, %rdx
> jne .L8
> movq %rax, %rcx
> jmp .L2
> .p2align 4,,10
> .p2align 3
> .L5:
> movq %rax, %rbx
> jmp .L6
> .p2align 4,,10
> .p2align 3
> .L8:
> movq %rdi, %rcx
> movq %r13, %rax
> movl $1, %ebp
> imulq %rsi, %rax
> imulq %r12, %rcx
> addq %rax, %rcx
> movq %rsi, %rax
> mulq %r12
> addq %rdx, %rcx
> movq %rax, %r14
> jmp .L2
> __mulvti3.cold:
> .L10:
> call abort
>
>
> PROPER (assembly) code but uses only 44 instructions in just 121 bytes,
> with ONE conditional branch!
>
> --- mulvti3.s ---
> # Copyright (C) 2014-2020, Stefan Kanthak
> .arch generic64
> .code64
> .intel_syntax noprefix
> .global abort
> .global __mulvti3
> .type __mulvti3, @function
> .text
> # rsi:rdi = multiplicand
> # rcx:rdx = multiplier
> __mulvti3:
> mov r10, rdx
> mov r11, rcx
> mov rax, rcx
> cqo # rdx = tmp
> mov r9, rdx
> xor r10, rdx
> xor r11, rdx
> sub r10, rdx
> sbb r11, rdx # r11:r10 = |multiplier|
> mov rax, rsi
> cqo # rdx = sign
> xor r9, rdx # r9 = sign ^ tmp
> xor rdi, rdx
> xor rsi, rdx
> sub rdi, rdx
> sbb rsi, rdx # rsi:rdi = |multiplicand|
> xor ecx, ecx
> cmp rcx, rsi
> sbb edx, edx # edx = (|multiplicand| < 2**64) ? 0 : -1
> cmp rcx, r11
> sbb ecx, ecx # ecx = (|multiplier| < 2**64) ? 0 : -1
> and ecx, edx # ecx = (|product| < 2**128) ? 0 : -1
> mov rax, rsi
> mul r10
> adc ecx, ecx # ecx = (|product| < 2**128) ? 0 : *
> mov rsi, rax
> mov rax, rdi
> mul r11
> adc ecx, ecx # ecx = (|product| < 2**128) ? 0 : *
> add rsi, rax
> # adc ecx, ecx # ecx = (|product| < 2**128) ? 0 : *
> mov rax, rdi
> mul r10
> add rdx, rsi
> adc ecx, ecx # ecx = (|product| < 2**128) ? 0 : *
> add rax, r9
> adc rdx, r9 # rdx:rax = (product < 0) ? ~product % 2**128 : product % 2**128
> mov rsi, rdx # rsi = (product % 2**128 < -2**127)
> # | (product % 2**128 >= 2**127) ? negative : positive
> xor rax, r9
> xor rdx, r9 # rdx:rax = product % 2**128
> add rsi, rsi
> adc ecx, ecx # ecx = (product >= -2**127)
> # & (product < 2**127) ? 0 : *
> jnz .overflow
> ret
> .overflow:
> call abort
> .end
> --- EOF ---
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [__mulvti3] register allocator plays shell game
2020-10-27 8:33 ` Richard Biener
@ 2020-10-27 9:59 ` Stefan Kanthak
0 siblings, 0 replies; 5+ messages in thread
From: Stefan Kanthak @ 2020-10-27 9:59 UTC (permalink / raw)
To: Richard Biener; +Cc: GCC Development
Richard Biener <richard.guenther@gmail.com> wrote:
> On Tue, Oct 27, 2020 at 12:01 AM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>>
>> Richard Biener <richard.guenther@gmail.com> wrote:
>>
>>> On Sun, Oct 25, 2020 at 8:37 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>>>>
>>>> Hi,
>>>>
>>>> for the AMD64 alias x86_64 platform and the __int128_t [DW]type,
>>>> the first few lines of the __mulvDI3() function from libgcc2.c
>>>>
>>>> | DWtype
>>>> | __mulvDI3 (DWtype u, DWtype v)
>>>> | {
>>>> | /* The unchecked multiplication needs 3 Wtype x Wtype multiplications,
>>>> | but the checked multiplication needs only two. */
>>>> | const DWunion uu = {.ll = u};
>>>> | const DWunion vv = {.ll = v};
>>>> |
>>>> | if (__builtin_expect (uu.s.high == uu.s.low >> (W_TYPE_SIZE - 1), 1))
>>>> | {
>>>> | /* u fits in a single Wtype. */
>>>> | if (__builtin_expect (vv.s.high == vv.s.low >> (W_TYPE_SIZE - 1), 1))
>>>> | {
>>>> | /* v fits in a single Wtype as well. */
>>>> | /* A single multiplication. No overflow risk. */
>>>> | return (DWtype) uu.s.low * (DWtype) vv.s.low;
>>>> | }
>>>>
>>>> are compiled to this braindead code (obtained from libgcc.a of
>>>> GCC 10.2.0 installed on Debian):
>>>>
>>>> 0000000000000000 <__mulvti3>:
>>>> 0: 41 55 push %r13
>>>> 2: 49 89 cb mov %rcx,%r11
>>>> 5: 48 89 d0 mov %rdx,%rax
>>>> 8: 49 89 d2 mov %rdx,%r10
>>>> b: 41 54 push %r12
>>>> d: 49 89 fc mov %rdi,%r12
>>>> 10: 48 89 d1 mov %rdx,%rcx
>>>> 13: 49 89 f0 mov %rsi,%r8
>>>> 16: 4c 89 e2 mov %r12,%rdx
>>>> 19: 49 89 f5 mov %rsi,%r13
>>>> 1c: 53 push %rbx
>>>> 1d: 48 89 fe mov %rdi,%rsi
>>>> 20: 48 c1 fa 3f sar $0x3f,%rdx
>>>> 24: 48 c1 f8 3f sar $0x3f,%rax
>>>> 28: 4c 89 df mov %r11,%rdi
>>>> 2b: 4c 39 c2 cmp %r8,%rdx
>>>> 2e: 75 18 jne 48 <__mulvti3+0x48>
>>>> 30: 4c 39 d8 cmp %r11,%rax
>>>> 33: 75 6b jne a0 <__mulvti3+0xa0>
>>>> 35: 4c 89 e0 mov %r12,%rax
>>>> 38: 49 f7 ea imul %r10
>>>> 3b: 5b pop %rbx
>>>> 3c: 41 5c pop %r12
>>>> 3e: 41 5d pop %r13
>>>> 40: c3 retq
>>>> ...
>>>>
>>>> There are EIGHT superfluous MOV instructions here, clobbering the
>>>> non-volatile registers RBX, R12 and R13, plus THREE superfluous
>>>> PUSH/POP pairs.
>>>>
>>>> What stops GCC from generating the following straightforward code
>>>> (11 instructions in 31 bytes instead of 25 instructions in 65 bytes)?
>>>>
>>>> .intel_syntax noprefix
>>>> __mulvti3:
>>>> mov r8, rdi
>>>> mov r9, rdx
>>>> sra r8, 63
>>>> sra r9, 63
>>>> cmp r8, rsi
>>>> jne __mulvti3+0x48+65-31
>>>> cmp r9, rcx
>>>> jne __mulvti3+0xa0+65-31
>>>> mov rax, rdi
>>>> imul rdx
>>>> ret
>>>> ...
>>>>
>>>>
>>>> not amused
>>>
>>> can you open a bugreport please?
>>
>>
>> I'd like to discuss alternative implementations of __mulvti3() first:
>> the very first lines of libgcc2.c reads
>>
>> | /* More subroutines needed by GCC output code on some machines. */
>> | /* Compile this one with gcc. */
>>
>>
>> Why don't you take advantage of that and implement __mulvDI3() as
>> follows?
>>
>> DWtype
>> __mulvDI3 (DWtype multiplicand, DWtype multiplier)
>> {
>> DWtype product;
>>
>> if (__builtin_mul_overflow(multiplicand, multiplier, &product))
>> abort();
>>
>> return product;
>> }
>
> Sure, that's possible - but the main concern is that such fancy builtins
> can end up calling libgcc.
1. these "fancy" builtins are provided by GCC itself.
2. "fancy" builtins like __builtin_clzll() or __builtin_expect() are already
used in the sources for libgcc.
> Which isn't really relevant for the trap-on-overflow variants though.
Correct.
JFTR: if __builtin_*_overflow() were calling libgcc I would not propose
to implement libgcc's __OPvMODE() functions using these builtins!
> I guess such change is desirable (also for the other arithmetic ops)
That's why I started this thread in the first place!
See <https://skanthak.homepage.t-online.de/gcc.html> for quite some
more bugs and deficiencies, plus properly optimised assembly code.
> and code generation improvements should be done for the
> __builtin_mul_overflow expansion, outside of libgcc.
Correct.
My second and third example show that the current implementation of
__builtin_mul_overflow() (at least for AMD64) is but RATHER b(raindea)d.
My initial example shows that the register allocator and the optimiser
need quite SOME optimisation, independent of any code generation
improvements for __builtin_*_overflow().
And my web page shows that the code for MANY of the libgcc functions is
FAR from optimised...
> Note __OPvMODE are considered legacy and -ftrapv dysfunctional
> in general ...
That's UNDOCUMENTED and therefore irrelevant for your users!
<https://gcc.gnu.org/onlinedocs/gccint/Integer-library-routines.html>
does NOT contain the word "legacy" at all.
It also misses the synopsis' for __absvti2(), __negvti2(), __addvti3(),
__subvti3() and __mulvti3()
The same holds for the word "dysfunctional", which is not contained in
any of the onlnedocs!
[ source and assembly of __mulvti3() ]
Stefan
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2020-10-27 10:03 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-25 19:35 [__mulvti3] register allocator plays shell game Stefan Kanthak
2020-10-26 10:19 ` Richard Biener
2020-10-26 22:56 ` Stefan Kanthak
2020-10-27 8:33 ` Richard Biener
2020-10-27 9:59 ` Stefan Kanthak
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).