public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug other/110946] New: 3x perf regression with -Os on M1 Pro
@ 2023-08-08 11:25 dave.rodgman at arm dot com
2023-08-08 11:37 ` [Bug other/110946] " dave.rodgman at arm dot com
` (10 more replies)
0 siblings, 11 replies; 12+ messages in thread
From: dave.rodgman at arm dot com @ 2023-08-08 11:25 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946
Bug ID: 110946
Summary: 3x perf regression with -Os on M1 Pro
Product: gcc
Version: 12.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: other
Assignee: unassigned at gcc dot gnu.org
Reporter: dave.rodgman at arm dot com
Target Milestone: ---
Please see
https://github.com/Mbed-TLS/mbedtls/pull/7784/commits/6cfd9b54ae0d06451c1a46a10e57fa099878bb03
for details.
On M1 Pro, under -Os, we see a 3.1x performance regression for AES-XTS, which
can be solved by forcing -O2 for two functions. For comparison, clang -Os gives
around 5% perf regression (which is more in the ballpark that I'd expect). So
it looks to me like gcc is getting something wrong when compiling these two
functions with -Os.
We measured a smaller but still significant difference (20-25%) on x86-64.
Affects all versions of gcc that I was able to test with (9 .. 12).
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Bug other/110946] 3x perf regression with -Os on M1 Pro
2023-08-08 11:25 [Bug other/110946] New: 3x perf regression with -Os on M1 Pro dave.rodgman at arm dot com
@ 2023-08-08 11:37 ` dave.rodgman at arm dot com
2023-08-08 12:22 ` amonakov at gcc dot gnu.org
` (9 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: dave.rodgman at arm dot com @ 2023-08-08 11:37 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946
--- Comment #1 from Dave Rodgman <dave.rodgman at arm dot com> ---
Disassembly under -Os:
000000000000139c <mbedtls_aes_crypt_xts>:
139c: a9b67bfd stp x29, x30, [sp, #-160]!
13a0: 910003fd mov x29, sp
13a4: a9046bf9 stp x25, x26, [sp, #64]
13a8: aa0003f9 mov x25, x0
13ac: 90000000 adrp x0, 0 <__stack_chk_guard>
13b0: a90153f3 stp x19, x20, [sp, #16]
13b4: f9400000 ldr x0, [x0]
13b8: a9025bf5 stp x21, x22, [sp, #32]
13bc: 2a0103f6 mov w22, w1
13c0: a90363f7 stp x23, x24, [sp, #48]
13c4: a90573fb stp x27, x28, [sp, #80]
13c8: f9400001 ldr x1, [x0]
13cc: f9004fe1 str x1, [sp, #152]
13d0: d2800001 mov x1, #0x0 // #0
13d4: 710006df cmp w22, #0x1
13d8: 54000c28 b.hi 155c <mbedtls_aes_crypt_xts+0x1c0> //
b.pmore
13dc: d1004041 sub x1, x2, #0x10
13e0: aa0203f3 mov x19, x2
13e4: b27c4fe0 mov x0, #0xfffff0 //
#16777200
13e8: eb00003f cmp x1, x0
13ec: 54000bc8 b.hi 1564 <mbedtls_aes_crypt_xts+0x1c8> //
b.pmore
13f0: 9101a3f5 add x21, sp, #0x68
13f4: aa0303e2 mov x2, x3
13f8: aa0403f8 mov x24, x4
13fc: aa0503f7 mov x23, x5
1400: aa1503e3 mov x3, x21
1404: 91048320 add x0, x25, #0x120
1408: 52800021 mov w1, #0x1 // #1
140c: 94000000 bl 1210 <mbedtls_aes_crypt_ecb>
1410: 2a0003f4 mov w20, w0
1414: 35000540 cbnz w0, 14bc <mbedtls_aes_crypt_xts+0x120>
1418: 520002db eor w27, w22, #0x1
141c: d344fe7a lsr x26, x19, #4
1420: 1200037b and w27, w27, #0x1
1424: 92400e73 and x19, x19, #0xf
1428: 910223fc add x28, sp, #0x88
142c: d100075a sub x26, x26, #0x1
1430: b100075f cmn x26, #0x1
1434: 54000541 b.ne 14dc <mbedtls_aes_crypt_xts+0x140> //
b.any
1438: b4000433 cbz x19, 14bc <mbedtls_aes_crypt_xts+0x120>
143c: 710002df cmp w22, #0x0
1440: d10042fb sub x27, x23, #0x10
1444: 9101e3fa add x26, sp, #0x78
1448: aa1303e2 mov x2, x19
144c: 9a95035a csel x26, x26, x21, eq // eq = none
1450: aa1b03e1 mov x1, x27
1454: 910223f5 add x21, sp, #0x88
1458: aa1703e0 mov x0, x23
145c: 94000000 bl 0 <memmove>
1460: d2800217 mov x23, #0x10 // #16
1464: aa1303e3 mov x3, x19
1468: aa1a03e2 mov x2, x26
146c: aa1803e1 mov x1, x24
1470: aa1503e0 mov x0, x21
1474: 94000000 bl 0 <mbedtls_xor>
1478: cb1302e3 sub x3, x23, x19
147c: 8b130342 add x2, x26, x19
1480: 8b130361 add x1, x27, x19
1484: 8b1302a0 add x0, x21, x19
1488: 94000000 bl 0 <mbedtls_xor>
148c: aa1503e3 mov x3, x21
1490: aa1503e2 mov x2, x21
1494: 2a1603e1 mov w1, w22
1498: aa1903e0 mov x0, x25
149c: 94000000 bl 1210 <mbedtls_aes_crypt_ecb>
14a0: 2a0003f4 mov w20, w0
14a4: 350000c0 cbnz w0, 14bc <mbedtls_aes_crypt_xts+0x120>
14a8: aa1703e3 mov x3, x23
14ac: aa1a03e2 mov x2, x26
14b0: aa1503e1 mov x1, x21
14b4: aa1b03e0 mov x0, x27
14b8: 94000000 bl 0 <mbedtls_xor>
14bc: 90000000 adrp x0, 0 <__stack_chk_guard>
14c0: f9400000 ldr x0, [x0]
14c4: f9404fe2 ldr x2, [sp, #152]
14c8: f9400001 ldr x1, [x0]
14cc: eb010042 subs x2, x2, x1
14d0: d2800001 mov x1, #0x0 // #0
14d4: 54000500 b.eq 1574 <mbedtls_aes_crypt_xts+0x1d8> //
b.none
14d8: 94000000 bl 0 <__stack_chk_fail>
14dc: f100027f cmp x19, #0x0
14e0: 1a9f07e0 cset w0, ne // ne = any
14e4: 6a1b001f tst w0, w27
14e8: 540000e0 b.eq 1504 <mbedtls_aes_crypt_xts+0x168> //
b.none
14ec: b50000da cbnz x26, 1504 <mbedtls_aes_crypt_xts+0x168>
14f0: a94687e0 ldp x0, x1, [sp, #104]
14f4: a90787e0 stp x0, x1, [sp, #120]
14f8: aa1503e1 mov x1, x21
14fc: aa1503e0 mov x0, x21
1500: 97fffb63 bl 28c <mbedtls_gf128mul_x_ble>
1504: aa1503e2 mov x2, x21
1508: aa1803e1 mov x1, x24
150c: aa1c03e0 mov x0, x28
1510: d2800203 mov x3, #0x10 // #16
1514: 94000000 bl 0 <mbedtls_xor>
1518: aa1c03e3 mov x3, x28
151c: aa1c03e2 mov x2, x28
1520: 2a1603e1 mov w1, w22
1524: aa1903e0 mov x0, x25
1528: 94000000 bl 1210 <mbedtls_aes_crypt_ecb>
152c: 35000200 cbnz w0, 156c <mbedtls_aes_crypt_xts+0x1d0>
1530: aa1503e2 mov x2, x21
1534: d2800203 mov x3, #0x10 // #16
1538: aa1703e0 mov x0, x23
153c: aa1c03e1 mov x1, x28
1540: 94000000 bl 0 <mbedtls_xor>
1544: 910042f7 add x23, x23, #0x10
1548: aa1503e1 mov x1, x21
154c: aa1503e0 mov x0, x21
1550: 91004318 add x24, x24, #0x10
1554: 97fffb4e bl 28c <mbedtls_gf128mul_x_ble>
1558: 17ffffb5 b 142c <mbedtls_aes_crypt_xts+0x90>
155c: 12800414 mov w20, #0xffffffdf // #-33
1560: 17ffffd7 b 14bc <mbedtls_aes_crypt_xts+0x120>
1564: 12800434 mov w20, #0xffffffde // #-34
1568: 17ffffd5 b 14bc <mbedtls_aes_crypt_xts+0x120>
156c: 2a0003f4 mov w20, w0
1570: 17ffffd3 b 14bc <mbedtls_aes_crypt_xts+0x120>
1574: 2a1403e0 mov w0, w20
1578: a94153f3 ldp x19, x20, [sp, #16]
157c: a9425bf5 ldp x21, x22, [sp, #32]
1580: a94363f7 ldp x23, x24, [sp, #48]
1584: a9446bf9 ldp x25, x26, [sp, #64]
1588: a94573fb ldp x27, x28, [sp, #80]
158c: a8ca7bfd ldp x29, x30, [sp], #160
1590: d65f03c0 ret
Disassembly for mbedtls_gf128mul_x_ble:
000000000000028c <mbedtls_gf128mul_x_ble>:
28c: a9be7bfd stp x29, x30, [sp, #-32]!
290: 910003fd mov x29, sp
294: a90153f3 stp x19, x20, [sp, #16]
298: aa0003f3 mov x19, x0
29c: a9400823 ldp x3, x2, [x1]
2a0: 52800101 mov w1, #0x8 // #8
2a4: 93c3fc54 extr x20, x2, x3, #63
2a8: d37ffc42 lsr x2, x2, #63
2ac: 4b020c22 sub w2, w1, w2, lsl #3
2b0: 528010e1 mov w1, #0x87 // #135
2b4: 1ac22821 asr w1, w1, w2
2b8: 93407c21 sxtw x1, w1
2bc: ca030421 eor x1, x1, x3, lsl #1
2c0: 94000000 bl 0 <mbedtls_put_unaligned_uint64>
2c4: aa1403e1 mov x1, x20
2c8: 91002260 add x0, x19, #0x8
2cc: a94153f3 ldp x19, x20, [sp, #16]
2d0: a8c27bfd ldp x29, x30, [sp], #32
2d4: 14000000 b 0 <mbedtls_put_unaligned_uint64>
and under -O2:
Disassembly for mbedtls_aes_crypt_xts:
0000000000001500 <mbedtls_aes_crypt_xts>:
1500: a9b57bfd stp x29, x30, [sp, #-176]!
1504: 90000006 adrp x6, 0 <__stack_chk_guard>
1508: 910003fd mov x29, sp
150c: f94000c6 ldr x6, [x6]
1510: a90153f3 stp x19, x20, [sp, #16]
1514: 2a0103f4 mov w20, w1
1518: f94000c1 ldr x1, [x6]
151c: f90057e1 str x1, [sp, #168]
1520: d2800001 mov x1, #0x0 // #0
1524: 7100069f cmp w20, #0x1
1528: 54001348 b.hi 1790 <mbedtls_aes_crypt_xts+0x290> //
b.pmore
152c: d1004041 sub x1, x2, #0x10
1530: a9025bf5 stp x21, x22, [sp, #32]
1534: aa0003f5 mov x21, x0
1538: aa0203f6 mov x22, x2
153c: b27c4fe0 mov x0, #0xfffff0 //
#16777200
1540: eb00003f cmp x1, x0
1544: 54001208 b.hi 1784 <mbedtls_aes_crypt_xts+0x284> //
b.pmore
1548: aa0503f3 mov x19, x5
154c: a90363f7 stp x23, x24, [sp, #48]
1550: aa0303f7 mov x23, x3
1554: a90573fb stp x27, x28, [sp, #80]
1558: aa0403fb mov x27, x4
155c: 94000000 bl 0 <mbedtls_aesce_has_support>
1560: 910482a3 add x3, x21, #0x120
1564: 9101e3fc add x28, sp, #0x78
1568: 35000ec0 cbnz w0, 1740 <mbedtls_aes_crypt_xts+0x240>
156c: aa1703e1 mov x1, x23
1570: aa0303e0 mov x0, x3
1574: aa1c03e2 mov x2, x28
1578: 94000000 bl 9c0 <mbedtls_internal_aes_encrypt>
157c: 35000760 cbnz w0, 1668 <mbedtls_aes_crypt_xts+0x168>
1580: f2400ec0 ands x0, x22, #0xf
1584: d344fec4 lsr x4, x22, #4
1588: 52000298 eor w24, w20, #0x1
158c: f90037e0 str x0, [sp, #104]
1590: 1a9f07e0 cset w0, ne // ne = any
1594: 52800117 mov w23, #0x8 // #8
1598: 0a000318 and w24, w24, w0
159c: 528010f6 mov w22, #0x87 // #135
15a0: a9046bf9 stp x25, x26, [sp, #64]
15a4: d1000499 sub x25, x4, #0x1
15a8: 910263fa add x26, sp, #0x98
15ac: 14000014 b 15fc <mbedtls_aes_crypt_xts+0xfc>
15b0: 3dc00341 ldr q1, [x26]
15b4: aa1303e3 mov x3, x19
15b8: 3dc00380 ldr q0, [x28]
15bc: d1000739 sub x25, x25, #0x1
15c0: 91004366 add x6, x27, #0x10
15c4: 6e211c00 eor v0.16b, v0.16b, v1.16b
15c8: 3c810460 str q0, [x3], #16
15cc: a94797e1 ldp x1, x5, [sp, #120]
15d0: d37ffca2 lsr x2, x5, #63
15d4: 93c1fca5 extr x5, x5, x1, #63
15d8: 4b020ee2 sub w2, w23, w2, lsl #3
15dc: 1ac22ac2 asr w2, w22, w2
15e0: 93407c42 sxtw x2, w2
15e4: ca010441 eor x1, x2, x1, lsl #1
15e8: a90797e1 stp x1, x5, [sp, #120]
15ec: b100073f cmn x25, #0x1
15f0: 54000440 b.eq 1678 <mbedtls_aes_crypt_xts+0x178> //
b.none
15f4: aa0303f3 mov x19, x3
15f8: aa0603fb mov x27, x6
15fc: f100033f cmp x25, #0x0
1600: 7a400b04 ccmp w24, #0x0, #0x4, eq // eq = none
1604: 54000aa1 b.ne 1758 <mbedtls_aes_crypt_xts+0x258> //
b.any
1608: 3dc00360 ldr q0, [x27]
160c: aa1a03e3 mov x3, x26
1610: 3dc00381 ldr q1, [x28]
1614: aa1a03e2 mov x2, x26
1618: 2a1403e1 mov w1, w20
161c: aa1503e0 mov x0, x21
1620: 6e211c00 eor v0.16b, v0.16b, v1.16b
1624: 3d800340 str q0, [x26]
1628: 94000000 bl 12d0 <mbedtls_aes_crypt_ecb>
162c: 34fffc20 cbz w0, 15b0 <mbedtls_aes_crypt_xts+0xb0>
1630: a9425bf5 ldp x21, x22, [sp, #32]
1634: a94363f7 ldp x23, x24, [sp, #48]
1638: a9446bf9 ldp x25, x26, [sp, #64]
163c: a94573fb ldp x27, x28, [sp, #80]
1640: 90000001 adrp x1, 0 <__stack_chk_guard>
1644: f9400021 ldr x1, [x1]
1648: f94057e3 ldr x3, [sp, #168]
164c: f9400022 ldr x2, [x1]
1650: eb020063 subs x3, x3, x2
1654: d2800002 mov x2, #0x0 // #0
1658: 54000a01 b.ne 1798 <mbedtls_aes_crypt_xts+0x298> //
b.any
165c: a94153f3 ldp x19, x20, [sp, #16]
1660: a8cb7bfd ldp x29, x30, [sp], #176
1664: d65f03c0 ret
1668: a9425bf5 ldp x21, x22, [sp, #32]
166c: a94363f7 ldp x23, x24, [sp, #48]
1670: a94573fb ldp x27, x28, [sp, #80]
1674: 17fffff3 b 1640 <mbedtls_aes_crypt_xts+0x140>
1678: f94037e2 ldr x2, [sp, #104]
167c: b4fffda2 cbz x2, 1630 <mbedtls_aes_crypt_xts+0x130>
1680: 7100029f cmp w20, #0x0
1684: 910223f6 add x22, sp, #0x88
1688: 9a9c02d6 csel x22, x22, x28, eq // eq = none
168c: aa0303e0 mov x0, x3
1690: aa1303e1 mov x1, x19
1694: 91003f7b add x27, x27, #0xf
1698: 94000000 bl 0 <memmove>
169c: d10006c5 sub x5, x22, #0x1
16a0: d2800020 mov x0, #0x1 // #1
16a4: d503201f nop
16a8: 38606b62 ldrb w2, [x27, x0]
16ac: 8b000343 add x3, x26, x0
16b0: 386068a4 ldrb w4, [x5, x0]
16b4: aa0003e1 mov x1, x0
16b8: 91000400 add x0, x0, #0x1
16bc: 4a040042 eor w2, w2, w4
16c0: 381ff062 sturb w2, [x3, #-1]
16c4: f94037e2 ldr x2, [sp, #104]
16c8: eb02003f cmp x1, x2
16cc: 54fffee1 b.ne 16a8 <mbedtls_aes_crypt_xts+0x1a8> //
b.any
16d0: d2800203 mov x3, #0x10 // #16
16d4: 8b020265 add x5, x19, x2
16d8: cb020063 sub x3, x3, x2
16dc: 8b0202c4 add x4, x22, x2
16e0: 8b020359 add x25, x26, x2
16e4: d2800000 mov x0, #0x0 // #0
16e8: 386068a1 ldrb w1, [x5, x0]
16ec: 38606882 ldrb w2, [x4, x0]
16f0: 4a020021 eor w1, w1, w2
16f4: 38206b21 strb w1, [x25, x0]
16f8: 91000400 add x0, x0, #0x1
16fc: eb00007f cmp x3, x0
1700: 54ffff41 b.ne 16e8 <mbedtls_aes_crypt_xts+0x1e8> //
b.any
1704: 2a1403e1 mov w1, w20
1708: aa1503e0 mov x0, x21
170c: aa1a03e3 mov x3, x26
1710: aa1a03e2 mov x2, x26
1714: 94000000 bl 12d0 <mbedtls_aes_crypt_ecb>
1718: 35fff8c0 cbnz w0, 1630 <mbedtls_aes_crypt_xts+0x130>
171c: 3dc002c0 ldr q0, [x22]
1720: 3dc00341 ldr q1, [x26]
1724: 6e211c00 eor v0.16b, v0.16b, v1.16b
1728: 3d800260 str q0, [x19]
172c: a9425bf5 ldp x21, x22, [sp, #32]
1730: a94363f7 ldp x23, x24, [sp, #48]
1734: a9446bf9 ldp x25, x26, [sp, #64]
1738: a94573fb ldp x27, x28, [sp, #80]
173c: 17ffffc1 b 1640 <mbedtls_aes_crypt_xts+0x140>
1740: aa1703e2 mov x2, x23
1744: aa0303e0 mov x0, x3
1748: 52800021 mov w1, #0x1 // #1
174c: aa1c03e3 mov x3, x28
1750: 94000000 bl 0 <mbedtls_aesce_crypt_ecb>
1754: 17ffff8a b 157c <mbedtls_aes_crypt_xts+0x7c>
1758: a94797e1 ldp x1, x5, [sp, #120]
175c: a9478fe2 ldp x2, x3, [sp, #120]
1760: a9088fe2 stp x2, x3, [sp, #136]
1764: d37ffca0 lsr x0, x5, #63
1768: 93c1fca5 extr x5, x5, x1, #63
176c: 4b000ee0 sub w0, w23, w0, lsl #3
1770: 1ac02ac0 asr w0, w22, w0
1774: 93407c00 sxtw x0, w0
1778: ca010401 eor x1, x0, x1, lsl #1
177c: a90797e1 stp x1, x5, [sp, #120]
1780: 17ffffa2 b 1608 <mbedtls_aes_crypt_xts+0x108>
1784: 12800420 mov w0, #0xffffffde // #-34
1788: a9425bf5 ldp x21, x22, [sp, #32]
178c: 17ffffad b 1640 <mbedtls_aes_crypt_xts+0x140>
1790: 12800400 mov w0, #0xffffffdf // #-33
1794: 17ffffab b 1640 <mbedtls_aes_crypt_xts+0x140>
1798: a9025bf5 stp x21, x22, [sp, #32]
179c: a90363f7 stp x23, x24, [sp, #48]
17a0: a9046bf9 stp x25, x26, [sp, #64]
17a4: a90573fb stp x27, x28, [sp, #80]
17a8: 94000000 bl 0 <__stack_chk_fail>
Disassembly for mbedtls_gf128mul_x_ble: (actually, this gets inlined, but I
removed the "static inline" to get this disassembly)
0000000000001500 <mbedtls_gf128mul_x_ble>:
1500: a9401023 ldp x3, x4, [x1]
1504: 52800105 mov w5, #0x8 // #8
1508: 528010e2 mov w2, #0x87 // #135
150c: d37ffc81 lsr x1, x4, #63
1510: 93c3fc84 extr x4, x4, x3, #63
1514: 4b010ca1 sub w1, w5, w1, lsl #3
1518: 1ac12841 asr w1, w2, w1
151c: 93407c21 sxtw x1, w1
1520: ca030423 eor x3, x1, x3, lsl #1
1524: a9001003 stp x3, x4, [x0]
1528: d65f03c0 ret
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Bug other/110946] 3x perf regression with -Os on M1 Pro
2023-08-08 11:25 [Bug other/110946] New: 3x perf regression with -Os on M1 Pro dave.rodgman at arm dot com
2023-08-08 11:37 ` [Bug other/110946] " dave.rodgman at arm dot com
@ 2023-08-08 12:22 ` amonakov at gcc dot gnu.org
2023-08-08 12:32 ` [Bug ipa/110946] " rguenth at gcc dot gnu.org
` (8 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: amonakov at gcc dot gnu.org @ 2023-08-08 12:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946
Alexander Monakov <amonakov at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |amonakov at gcc dot gnu.org
--- Comment #2 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
So basically missed inlining at -Os, even memcpy wrappers are not inlined.
Can you provide a reproducible testcase?
Note that inline functions in mbedtls/library/alignment.h all miss the 'static'
qualifier, which affects inlining decisions, and looks like a mistake anyway
(if they are really meant to be non-static inlines, shouldn't there be a
comment?)
Does making them 'static inline' rectify the problem?
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Bug ipa/110946] 3x perf regression with -Os on M1 Pro
2023-08-08 11:25 [Bug other/110946] New: 3x perf regression with -Os on M1 Pro dave.rodgman at arm dot com
2023-08-08 11:37 ` [Bug other/110946] " dave.rodgman at arm dot com
2023-08-08 12:22 ` amonakov at gcc dot gnu.org
@ 2023-08-08 12:32 ` rguenth at gcc dot gnu.org
2023-08-08 12:35 ` [Bug other/110946] " dave.rodgman at arm dot com
` (7 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-08-08 12:32 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Component|other |ipa
Target| |aarch64
Keywords| |missed-optimization
CC| |marxin at gcc dot gnu.org
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note you shouldn't use -Os if you care about performance. GCC is quite
reasonable with code size increases at -O2 (as compared to other compilers).
Instead I suggest you use -flto with -O2 to decrease the size of the final
executable/library and give GCC better knowledge on unit growth.
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Bug other/110946] 3x perf regression with -Os on M1 Pro
2023-08-08 11:25 [Bug other/110946] New: 3x perf regression with -Os on M1 Pro dave.rodgman at arm dot com
` (2 preceding siblings ...)
2023-08-08 12:32 ` [Bug ipa/110946] " rguenth at gcc dot gnu.org
@ 2023-08-08 12:35 ` dave.rodgman at arm dot com
2023-08-08 12:40 ` dave.rodgman at arm dot com
` (6 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: dave.rodgman at arm dot com @ 2023-08-08 12:35 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946
Dave Rodgman <dave.rodgman at arm dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Keywords|missed-optimization |
Component|ipa |other
Target|aarch64 |
--- Comment #4 from Dave Rodgman <dave.rodgman at arm dot com> ---
From a quick test, it doesn't look like the unaligned access inlining is the
issue:
Not static inline, -Os:
AES-XTS-128 : 853799 KiB/s, 0 cycles/byte
AES-XTS-256 : 749919 KiB/s, 0 cycles/byte
Static inline, -Os:
AES-XTS-128 : 885380 KiB/s, 0 cycles/byte
AES-XTS-256 : 752995 KiB/s, 0 cycles/byte
Not static inline, -O2:
AES-XTS-128 : 2822656 KiB/s, 0 cycles/byte
AES-XTS-256 : 2425721 KiB/s, 0 cycles/byte
Static inline, -O2:
AES-XTS-128 : 2692321 KiB/s, 0 cycles/byte
AES-XTS-256 : 2446391 KiB/s, 0 cycles/byte
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Bug other/110946] 3x perf regression with -Os on M1 Pro
2023-08-08 11:25 [Bug other/110946] New: 3x perf regression with -Os on M1 Pro dave.rodgman at arm dot com
` (3 preceding siblings ...)
2023-08-08 12:35 ` [Bug other/110946] " dave.rodgman at arm dot com
@ 2023-08-08 12:40 ` dave.rodgman at arm dot com
2023-08-08 13:27 ` dave.rodgman at arm dot com
` (5 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: dave.rodgman at arm dot com @ 2023-08-08 12:40 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946
--- Comment #5 from Dave Rodgman <dave.rodgman at arm dot com> ---
(In reply to Richard Biener from comment #3)
> Note you shouldn't use -Os if you care about performance. GCC is quite
> reasonable with code size increases at -O2 (as compared to other compilers).
> Instead I suggest you use -flto with -O2 to decrease the size of the final
> executable/library and give GCC better knowledge on unit growth.
Understood, but I think it depends on the magnitude of the perf difference. I'd
expect a smallish perf drop, say 10%, from -Os to be reasonable, but I'd
consider a 3x perf difference to be a compiler issue.(In reply to Alexander
Monakov from comment #2)
> So basically missed inlining at -Os, even memcpy wrappers are not inlined.
>
> Can you provide a reproducible testcase?
>
> Note that inline functions in mbedtls/library/alignment.h all miss the
> 'static' qualifier, which affects inlining decisions, and looks like a
> mistake anyway (if they are really meant to be non-static inlines, shouldn't
> there be a comment?)
>
> Does making them 'static inline' rectify the problem?
The easiest way to reproduce is to use the benchmark tool:
make programs/test/benchmark CC=gcc CFLAGS="-Os"
programs/test/benchmark aes_xts
I don't have a compact reproducer, sorry.
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Bug other/110946] 3x perf regression with -Os on M1 Pro
2023-08-08 11:25 [Bug other/110946] New: 3x perf regression with -Os on M1 Pro dave.rodgman at arm dot com
` (4 preceding siblings ...)
2023-08-08 12:40 ` dave.rodgman at arm dot com
@ 2023-08-08 13:27 ` dave.rodgman at arm dot com
2023-08-08 15:57 ` pinskia at gcc dot gnu.org
` (4 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: dave.rodgman at arm dot com @ 2023-08-08 13:27 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946
--- Comment #6 from Dave Rodgman <dave.rodgman at arm dot com> ---
Under clang, we see that mbedtls_xor being inlined, or not, causes an
equivalent perf difference. Note that mbedtls_xor is inline in the gcc O2
version and not in the gcc Os version.
Not inline mbedtls_xor, -Os clang:
AES-XTS-128 : 834549 KiB/s, 0 cycles/byte
AES-XTS-256 : 674383 KiB/s, 0 cycles/byte
Inline mbedtls_xor, -Os clang:
AES-XTS-128 : 2664799 KiB/s, 0 cycles/byte
AES-XTS-256 : 2278008 KiB/s, 0 cycles/byte
However, if I mark mbedtls_xor as static inline (actually, for testing
purposes, I created a static inline copy in aes.c), gcc still does not inline
it. I am not sure why. If I use "__attribute__((always_inline))" gcc will
inline it.
So it looks like gcc is overly averse to inlining this function, or is getting
the cost/benefit of inline-ing wrong here?
For 3/5 cases, we know at compile time that n == 16, so the function will
compile to four instructions:
139c: 3dc00021 ldr q1, [x1]
13a0: 3dc00040 ldr q0, [x2]
13a4: 6e211c00 eor v0.16b, v0.16b, v1.16b
13a8: 3d800000 str q0, [x0]
so it does seem surprising that gcc doesn't want to inline this.
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Bug other/110946] 3x perf regression with -Os on M1 Pro
2023-08-08 11:25 [Bug other/110946] New: 3x perf regression with -Os on M1 Pro dave.rodgman at arm dot com
` (5 preceding siblings ...)
2023-08-08 13:27 ` dave.rodgman at arm dot com
@ 2023-08-08 15:57 ` pinskia at gcc dot gnu.org
2023-08-08 16:48 ` [Bug ipa/110946] " amonakov at gcc dot gnu.org
` (3 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-08-08 15:57 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Depends on| |92716
--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I am 99% sure this is basically PR 92716.
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92716
[Bug 92716] -Os doesn't inline byteswap function even though it's a single
instruction
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Bug ipa/110946] 3x perf regression with -Os on M1 Pro
2023-08-08 11:25 [Bug other/110946] New: 3x perf regression with -Os on M1 Pro dave.rodgman at arm dot com
` (6 preceding siblings ...)
2023-08-08 15:57 ` pinskia at gcc dot gnu.org
@ 2023-08-08 16:48 ` amonakov at gcc dot gnu.org
2023-08-08 17:38 ` amonakov at gcc dot gnu.org
` (2 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: amonakov at gcc dot gnu.org @ 2023-08-08 16:48 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946
--- Comment #8 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Why? There's no bswap here, in particular mbedtls_put_unaligned_uint64 is a
straightforward wrapper for memcpy:
inline void mbedtls_put_unaligned_uint64(void *p, uint64_t x)
{
memcpy(p, &x, sizeof(x));
}
We deciding to not inline this, while inlining its get_unaligned counterpart?
Seems bizarre.
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Bug ipa/110946] 3x perf regression with -Os on M1 Pro
2023-08-08 11:25 [Bug other/110946] New: 3x perf regression with -Os on M1 Pro dave.rodgman at arm dot com
` (7 preceding siblings ...)
2023-08-08 16:48 ` [Bug ipa/110946] " amonakov at gcc dot gnu.org
@ 2023-08-08 17:38 ` amonakov at gcc dot gnu.org
2023-08-08 21:20 ` amonakov at gcc dot gnu.org
2023-08-08 21:47 ` amonakov at gcc dot gnu.org
10 siblings, 0 replies; 12+ messages in thread
From: amonakov at gcc dot gnu.org @ 2023-08-08 17:38 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946
--- Comment #9 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #2)
> Note that inline functions in mbedtls/library/alignment.h all miss the
> 'static' qualifier, which affects inlining decisions, and looks like a
> mistake anyway (if they are really meant to be non-static inlines, shouldn't
> there be a comment?)
Can you address this on the mbedtls side? Even if it doesn't help with the
observed slowdown, it will remain a problem for the future if left unfixed.
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Bug ipa/110946] 3x perf regression with -Os on M1 Pro
2023-08-08 11:25 [Bug other/110946] New: 3x perf regression with -Os on M1 Pro dave.rodgman at arm dot com
` (8 preceding siblings ...)
2023-08-08 17:38 ` amonakov at gcc dot gnu.org
@ 2023-08-08 21:20 ` amonakov at gcc dot gnu.org
2023-08-08 21:47 ` amonakov at gcc dot gnu.org
10 siblings, 0 replies; 12+ messages in thread
From: amonakov at gcc dot gnu.org @ 2023-08-08 21:20 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946
--- Comment #10 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Ah, the non-static inlines are intentional, the corresponding extern
declarations appear in library/platform_util.c. Sorry, I missed that file the
first time around.
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Bug ipa/110946] 3x perf regression with -Os on M1 Pro
2023-08-08 11:25 [Bug other/110946] New: 3x perf regression with -Os on M1 Pro dave.rodgman at arm dot com
` (9 preceding siblings ...)
2023-08-08 21:20 ` amonakov at gcc dot gnu.org
@ 2023-08-08 21:47 ` amonakov at gcc dot gnu.org
10 siblings, 0 replies; 12+ messages in thread
From: amonakov at gcc dot gnu.org @ 2023-08-08 21:47 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946
--- Comment #11 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #8)
> inline void mbedtls_put_unaligned_uint64(void *p, uint64_t x)
> {
> memcpy(p, &x, sizeof(x));
> }
>
>
> We deciding to not inline this, while inlining its get_unaligned
> counterpart? Seems bizarre.
I can reproduce this part, and on my side it's caused by _FORTIFY_SOURCE: with
fortification, put_unaligned indeed looks bigger during inlining:
mbedtls_put_unaligned_uint32 (void * p, uint32_t x)
{
long unsigned int _3;
<bb 2> [local count: 1073741824]:
_3 = __builtin_object_size (p_2(D), 0);
__builtin___memcpy_chk (p_2(D), &x, 4, _3);
return;
}
mbedtls_get_unaligned_uint64 (const void * p)
{
long unsigned int _3;
<bb 2> [local count: 1073741824]:
_3 = MEM <long unsigned int> [(char * {ref-all})p_2(D)];
return _3;
}
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2023-08-08 21:47 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-08 11:25 [Bug other/110946] New: 3x perf regression with -Os on M1 Pro dave.rodgman at arm dot com
2023-08-08 11:37 ` [Bug other/110946] " dave.rodgman at arm dot com
2023-08-08 12:22 ` amonakov at gcc dot gnu.org
2023-08-08 12:32 ` [Bug ipa/110946] " rguenth at gcc dot gnu.org
2023-08-08 12:35 ` [Bug other/110946] " dave.rodgman at arm dot com
2023-08-08 12:40 ` dave.rodgman at arm dot com
2023-08-08 13:27 ` dave.rodgman at arm dot com
2023-08-08 15:57 ` pinskia at gcc dot gnu.org
2023-08-08 16:48 ` [Bug ipa/110946] " amonakov at gcc dot gnu.org
2023-08-08 17:38 ` amonakov at gcc dot gnu.org
2023-08-08 21:20 ` amonakov at gcc dot gnu.org
2023-08-08 21:47 ` amonakov at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).