public inbox for gcc-bugs@sourceware.org
* [Bug other/110946] New: 3x perf regression with -Os on M1 Pro
@ 2023-08-08 11:25 dave.rodgman at arm dot com
  2023-08-08 11:37 ` [Bug other/110946] " dave.rodgman at arm dot com
                   ` (10 more replies)
  0 siblings, 11 replies; 12+ messages in thread
From: dave.rodgman at arm dot com @ 2023-08-08 11:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

            Bug ID: 110946
           Summary: 3x perf regression with -Os on M1 Pro
           Product: gcc
           Version: 12.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
          Assignee: unassigned at gcc dot gnu.org
          Reporter: dave.rodgman at arm dot com
  Target Milestone: ---

Please see
https://github.com/Mbed-TLS/mbedtls/pull/7784/commits/6cfd9b54ae0d06451c1a46a10e57fa099878bb03
for details.

On M1 Pro, under -Os, we see a 3.1x performance regression for AES-XTS, which
can be solved by forcing -O2 for two functions. For comparison, clang -Os gives
around 5% perf regression (which is more in the ballpark that I'd expect). So
it looks to me like gcc is getting something wrong when compiling these two
functions with -Os.

We measured a smaller but still significant difference (20-25%) on x86-64.

Affects all versions of gcc that I was able to test with (9 .. 12).
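
For readers who want to try the "force -O2 for two functions" workaround
mentioned above, here is a minimal sketch of one way to do it with GCC's
per-function optimize attribute. The macro name HOT_FUNCTION and the function
xor_block16 are hypothetical, made up for this illustration; they are not
taken from the Mbed TLS commit linked above. As far as I know, the GCC manual
describes this attribute as a debugging aid rather than a production feature,
so treat this as an experiment, not a recommendation.

```c
#include <stddef.h>

/* Hypothetical macro name, for illustration only. GCC accepts the
 * optimize attribute; clang does not implement it, so the macro
 * expands to nothing for other compilers. */
#if defined(__GNUC__) && !defined(__clang__)
#define HOT_FUNCTION __attribute__((optimize("O2")))
#else
#define HOT_FUNCTION
#endif

/* Stand-in for a hot routine. Under GCC this one function is optimized
 * as if -O2 were given, even when the rest of the file uses -Os. */
HOT_FUNCTION
void xor_block16(unsigned char *r, const unsigned char *a,
                 const unsigned char *b)
{
    for (size_t i = 0; i < 16; i++) {
        r[i] = (unsigned char) (a[i] ^ b[i]);
    }
}
```

The attribute is applied per function, so only the annotated routines pay the
-O2 code-size cost.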


* [Bug other/110946] 3x perf regression with -Os on M1 Pro
  2023-08-08 11:25 [Bug other/110946] New: 3x perf regression with -Os on M1 Pro dave.rodgman at arm dot com
@ 2023-08-08 11:37 ` dave.rodgman at arm dot com
  2023-08-08 12:22 ` amonakov at gcc dot gnu.org
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: dave.rodgman at arm dot com @ 2023-08-08 11:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

--- Comment #1 from Dave Rodgman <dave.rodgman at arm dot com> ---
Disassembly under -Os:

000000000000139c <mbedtls_aes_crypt_xts>:
    139c:       a9b67bfd        stp     x29, x30, [sp, #-160]!
    13a0:       910003fd        mov     x29, sp
    13a4:       a9046bf9        stp     x25, x26, [sp, #64]
    13a8:       aa0003f9        mov     x25, x0
    13ac:       90000000        adrp    x0, 0 <__stack_chk_guard>
    13b0:       a90153f3        stp     x19, x20, [sp, #16]
    13b4:       f9400000        ldr     x0, [x0]
    13b8:       a9025bf5        stp     x21, x22, [sp, #32]
    13bc:       2a0103f6        mov     w22, w1
    13c0:       a90363f7        stp     x23, x24, [sp, #48]
    13c4:       a90573fb        stp     x27, x28, [sp, #80]
    13c8:       f9400001        ldr     x1, [x0]
    13cc:       f9004fe1        str     x1, [sp, #152]
    13d0:       d2800001        mov     x1, #0x0                        // #0
    13d4:       710006df        cmp     w22, #0x1
    13d8:       54000c28        b.hi    155c <mbedtls_aes_crypt_xts+0x1c0>  // b.pmore
    13dc:       d1004041        sub     x1, x2, #0x10
    13e0:       aa0203f3        mov     x19, x2
    13e4:       b27c4fe0        mov     x0, #0xfffff0                   // #16777200
    13e8:       eb00003f        cmp     x1, x0
    13ec:       54000bc8        b.hi    1564 <mbedtls_aes_crypt_xts+0x1c8>  // b.pmore
    13f0:       9101a3f5        add     x21, sp, #0x68
    13f4:       aa0303e2        mov     x2, x3
    13f8:       aa0403f8        mov     x24, x4
    13fc:       aa0503f7        mov     x23, x5
    1400:       aa1503e3        mov     x3, x21
    1404:       91048320        add     x0, x25, #0x120
    1408:       52800021        mov     w1, #0x1                        // #1
    140c:       94000000        bl      1210 <mbedtls_aes_crypt_ecb>
    1410:       2a0003f4        mov     w20, w0
    1414:       35000540        cbnz    w0, 14bc <mbedtls_aes_crypt_xts+0x120>
    1418:       520002db        eor     w27, w22, #0x1
    141c:       d344fe7a        lsr     x26, x19, #4
    1420:       1200037b        and     w27, w27, #0x1
    1424:       92400e73        and     x19, x19, #0xf
    1428:       910223fc        add     x28, sp, #0x88
    142c:       d100075a        sub     x26, x26, #0x1
    1430:       b100075f        cmn     x26, #0x1
    1434:       54000541        b.ne    14dc <mbedtls_aes_crypt_xts+0x140>  // b.any
    1438:       b4000433        cbz     x19, 14bc <mbedtls_aes_crypt_xts+0x120>
    143c:       710002df        cmp     w22, #0x0
    1440:       d10042fb        sub     x27, x23, #0x10
    1444:       9101e3fa        add     x26, sp, #0x78
    1448:       aa1303e2        mov     x2, x19
    144c:       9a95035a        csel    x26, x26, x21, eq  // eq = none
    1450:       aa1b03e1        mov     x1, x27
    1454:       910223f5        add     x21, sp, #0x88
    1458:       aa1703e0        mov     x0, x23
    145c:       94000000        bl      0 <memmove>
    1460:       d2800217        mov     x23, #0x10                      // #16
    1464:       aa1303e3        mov     x3, x19
    1468:       aa1a03e2        mov     x2, x26
    146c:       aa1803e1        mov     x1, x24
    1470:       aa1503e0        mov     x0, x21
    1474:       94000000        bl      0 <mbedtls_xor>
    1478:       cb1302e3        sub     x3, x23, x19
    147c:       8b130342        add     x2, x26, x19
    1480:       8b130361        add     x1, x27, x19
    1484:       8b1302a0        add     x0, x21, x19
    1488:       94000000        bl      0 <mbedtls_xor>
    148c:       aa1503e3        mov     x3, x21
    1490:       aa1503e2        mov     x2, x21
    1494:       2a1603e1        mov     w1, w22
    1498:       aa1903e0        mov     x0, x25
    149c:       94000000        bl      1210 <mbedtls_aes_crypt_ecb>
    14a0:       2a0003f4        mov     w20, w0
    14a4:       350000c0        cbnz    w0, 14bc <mbedtls_aes_crypt_xts+0x120>
    14a8:       aa1703e3        mov     x3, x23
    14ac:       aa1a03e2        mov     x2, x26
    14b0:       aa1503e1        mov     x1, x21
    14b4:       aa1b03e0        mov     x0, x27
    14b8:       94000000        bl      0 <mbedtls_xor>
    14bc:       90000000        adrp    x0, 0 <__stack_chk_guard>
    14c0:       f9400000        ldr     x0, [x0]
    14c4:       f9404fe2        ldr     x2, [sp, #152]
    14c8:       f9400001        ldr     x1, [x0]
    14cc:       eb010042        subs    x2, x2, x1
    14d0:       d2800001        mov     x1, #0x0                        // #0
    14d4:       54000500        b.eq    1574 <mbedtls_aes_crypt_xts+0x1d8>  // b.none
    14d8:       94000000        bl      0 <__stack_chk_fail>
    14dc:       f100027f        cmp     x19, #0x0
    14e0:       1a9f07e0        cset    w0, ne  // ne = any
    14e4:       6a1b001f        tst     w0, w27
    14e8:       540000e0        b.eq    1504 <mbedtls_aes_crypt_xts+0x168>  // b.none
    14ec:       b50000da        cbnz    x26, 1504 <mbedtls_aes_crypt_xts+0x168>
    14f0:       a94687e0        ldp     x0, x1, [sp, #104]
    14f4:       a90787e0        stp     x0, x1, [sp, #120]
    14f8:       aa1503e1        mov     x1, x21
    14fc:       aa1503e0        mov     x0, x21
    1500:       97fffb63        bl      28c <mbedtls_gf128mul_x_ble>
    1504:       aa1503e2        mov     x2, x21
    1508:       aa1803e1        mov     x1, x24
    150c:       aa1c03e0        mov     x0, x28
    1510:       d2800203        mov     x3, #0x10                       // #16
    1514:       94000000        bl      0 <mbedtls_xor>
    1518:       aa1c03e3        mov     x3, x28
    151c:       aa1c03e2        mov     x2, x28
    1520:       2a1603e1        mov     w1, w22
    1524:       aa1903e0        mov     x0, x25
    1528:       94000000        bl      1210 <mbedtls_aes_crypt_ecb>
    152c:       35000200        cbnz    w0, 156c <mbedtls_aes_crypt_xts+0x1d0>
    1530:       aa1503e2        mov     x2, x21
    1534:       d2800203        mov     x3, #0x10                       // #16
    1538:       aa1703e0        mov     x0, x23
    153c:       aa1c03e1        mov     x1, x28
    1540:       94000000        bl      0 <mbedtls_xor>
    1544:       910042f7        add     x23, x23, #0x10
    1548:       aa1503e1        mov     x1, x21
    154c:       aa1503e0        mov     x0, x21
    1550:       91004318        add     x24, x24, #0x10
    1554:       97fffb4e        bl      28c <mbedtls_gf128mul_x_ble>
    1558:       17ffffb5        b       142c <mbedtls_aes_crypt_xts+0x90>
    155c:       12800414        mov     w20, #0xffffffdf                // #-33
    1560:       17ffffd7        b       14bc <mbedtls_aes_crypt_xts+0x120>
    1564:       12800434        mov     w20, #0xffffffde                // #-34
    1568:       17ffffd5        b       14bc <mbedtls_aes_crypt_xts+0x120>
    156c:       2a0003f4        mov     w20, w0
    1570:       17ffffd3        b       14bc <mbedtls_aes_crypt_xts+0x120>
    1574:       2a1403e0        mov     w0, w20
    1578:       a94153f3        ldp     x19, x20, [sp, #16]
    157c:       a9425bf5        ldp     x21, x22, [sp, #32]
    1580:       a94363f7        ldp     x23, x24, [sp, #48]
    1584:       a9446bf9        ldp     x25, x26, [sp, #64]
    1588:       a94573fb        ldp     x27, x28, [sp, #80]
    158c:       a8ca7bfd        ldp     x29, x30, [sp], #160
    1590:       d65f03c0        ret

    Disassembly for mbedtls_gf128mul_x_ble:
000000000000028c <mbedtls_gf128mul_x_ble>:
     28c:       a9be7bfd        stp     x29, x30, [sp, #-32]!
     290:       910003fd        mov     x29, sp
     294:       a90153f3        stp     x19, x20, [sp, #16]
     298:       aa0003f3        mov     x19, x0
     29c:       a9400823        ldp     x3, x2, [x1]
     2a0:       52800101        mov     w1, #0x8                        // #8
     2a4:       93c3fc54        extr    x20, x2, x3, #63
     2a8:       d37ffc42        lsr     x2, x2, #63
     2ac:       4b020c22        sub     w2, w1, w2, lsl #3
     2b0:       528010e1        mov     w1, #0x87                       // #135
     2b4:       1ac22821        asr     w1, w1, w2
     2b8:       93407c21        sxtw    x1, w1
     2bc:       ca030421        eor     x1, x1, x3, lsl #1
     2c0:       94000000        bl      0 <mbedtls_put_unaligned_uint64>
     2c4:       aa1403e1        mov     x1, x20
     2c8:       91002260        add     x0, x19, #0x8
     2cc:       a94153f3        ldp     x19, x20, [sp, #16]
     2d0:       a8c27bfd        ldp     x29, x30, [sp], #32
     2d4:       14000000        b       0 <mbedtls_put_unaligned_uint64>


and under -O2:

    Disassembly for mbedtls_aes_crypt_xts:
0000000000001500 <mbedtls_aes_crypt_xts>:
    1500:       a9b57bfd        stp     x29, x30, [sp, #-176]!
    1504:       90000006        adrp    x6, 0 <__stack_chk_guard>
    1508:       910003fd        mov     x29, sp
    150c:       f94000c6        ldr     x6, [x6]
    1510:       a90153f3        stp     x19, x20, [sp, #16]
    1514:       2a0103f4        mov     w20, w1
    1518:       f94000c1        ldr     x1, [x6]
    151c:       f90057e1        str     x1, [sp, #168]
    1520:       d2800001        mov     x1, #0x0                        // #0
    1524:       7100069f        cmp     w20, #0x1
    1528:       54001348        b.hi    1790 <mbedtls_aes_crypt_xts+0x290>  // b.pmore
    152c:       d1004041        sub     x1, x2, #0x10
    1530:       a9025bf5        stp     x21, x22, [sp, #32]
    1534:       aa0003f5        mov     x21, x0
    1538:       aa0203f6        mov     x22, x2
    153c:       b27c4fe0        mov     x0, #0xfffff0                   // #16777200
    1540:       eb00003f        cmp     x1, x0
    1544:       54001208        b.hi    1784 <mbedtls_aes_crypt_xts+0x284>  // b.pmore
    1548:       aa0503f3        mov     x19, x5
    154c:       a90363f7        stp     x23, x24, [sp, #48]
    1550:       aa0303f7        mov     x23, x3
    1554:       a90573fb        stp     x27, x28, [sp, #80]
    1558:       aa0403fb        mov     x27, x4
    155c:       94000000        bl      0 <mbedtls_aesce_has_support>
    1560:       910482a3        add     x3, x21, #0x120
    1564:       9101e3fc        add     x28, sp, #0x78
    1568:       35000ec0        cbnz    w0, 1740 <mbedtls_aes_crypt_xts+0x240>
    156c:       aa1703e1        mov     x1, x23
    1570:       aa0303e0        mov     x0, x3
    1574:       aa1c03e2        mov     x2, x28
    1578:       94000000        bl      9c0 <mbedtls_internal_aes_encrypt>
    157c:       35000760        cbnz    w0, 1668 <mbedtls_aes_crypt_xts+0x168>
    1580:       f2400ec0        ands    x0, x22, #0xf
    1584:       d344fec4        lsr     x4, x22, #4
    1588:       52000298        eor     w24, w20, #0x1
    158c:       f90037e0        str     x0, [sp, #104]
    1590:       1a9f07e0        cset    w0, ne  // ne = any
    1594:       52800117        mov     w23, #0x8                       // #8
    1598:       0a000318        and     w24, w24, w0
    159c:       528010f6        mov     w22, #0x87                      // #135
    15a0:       a9046bf9        stp     x25, x26, [sp, #64]
    15a4:       d1000499        sub     x25, x4, #0x1
    15a8:       910263fa        add     x26, sp, #0x98
    15ac:       14000014        b       15fc <mbedtls_aes_crypt_xts+0xfc>
    15b0:       3dc00341        ldr     q1, [x26]
    15b4:       aa1303e3        mov     x3, x19
    15b8:       3dc00380        ldr     q0, [x28]
    15bc:       d1000739        sub     x25, x25, #0x1
    15c0:       91004366        add     x6, x27, #0x10
    15c4:       6e211c00        eor     v0.16b, v0.16b, v1.16b
    15c8:       3c810460        str     q0, [x3], #16
    15cc:       a94797e1        ldp     x1, x5, [sp, #120]
    15d0:       d37ffca2        lsr     x2, x5, #63
    15d4:       93c1fca5        extr    x5, x5, x1, #63
    15d8:       4b020ee2        sub     w2, w23, w2, lsl #3
    15dc:       1ac22ac2        asr     w2, w22, w2
    15e0:       93407c42        sxtw    x2, w2
    15e4:       ca010441        eor     x1, x2, x1, lsl #1
    15e8:       a90797e1        stp     x1, x5, [sp, #120]
    15ec:       b100073f        cmn     x25, #0x1
    15f0:       54000440        b.eq    1678 <mbedtls_aes_crypt_xts+0x178>  // b.none
    15f4:       aa0303f3        mov     x19, x3
    15f8:       aa0603fb        mov     x27, x6
    15fc:       f100033f        cmp     x25, #0x0
    1600:       7a400b04        ccmp    w24, #0x0, #0x4, eq  // eq = none
    1604:       54000aa1        b.ne    1758 <mbedtls_aes_crypt_xts+0x258>  // b.any
    1608:       3dc00360        ldr     q0, [x27]
    160c:       aa1a03e3        mov     x3, x26
    1610:       3dc00381        ldr     q1, [x28]
    1614:       aa1a03e2        mov     x2, x26
    1618:       2a1403e1        mov     w1, w20
    161c:       aa1503e0        mov     x0, x21
    1620:       6e211c00        eor     v0.16b, v0.16b, v1.16b
    1624:       3d800340        str     q0, [x26]
    1628:       94000000        bl      12d0 <mbedtls_aes_crypt_ecb>
    162c:       34fffc20        cbz     w0, 15b0 <mbedtls_aes_crypt_xts+0xb0>
    1630:       a9425bf5        ldp     x21, x22, [sp, #32]
    1634:       a94363f7        ldp     x23, x24, [sp, #48]
    1638:       a9446bf9        ldp     x25, x26, [sp, #64]
    163c:       a94573fb        ldp     x27, x28, [sp, #80]
    1640:       90000001        adrp    x1, 0 <__stack_chk_guard>
    1644:       f9400021        ldr     x1, [x1]
    1648:       f94057e3        ldr     x3, [sp, #168]
    164c:       f9400022        ldr     x2, [x1]
    1650:       eb020063        subs    x3, x3, x2
    1654:       d2800002        mov     x2, #0x0                        // #0
    1658:       54000a01        b.ne    1798 <mbedtls_aes_crypt_xts+0x298>  // b.any
    165c:       a94153f3        ldp     x19, x20, [sp, #16]
    1660:       a8cb7bfd        ldp     x29, x30, [sp], #176
    1664:       d65f03c0        ret
    1668:       a9425bf5        ldp     x21, x22, [sp, #32]
    166c:       a94363f7        ldp     x23, x24, [sp, #48]
    1670:       a94573fb        ldp     x27, x28, [sp, #80]
    1674:       17fffff3        b       1640 <mbedtls_aes_crypt_xts+0x140>
    1678:       f94037e2        ldr     x2, [sp, #104]
    167c:       b4fffda2        cbz     x2, 1630 <mbedtls_aes_crypt_xts+0x130>
    1680:       7100029f        cmp     w20, #0x0
    1684:       910223f6        add     x22, sp, #0x88
    1688:       9a9c02d6        csel    x22, x22, x28, eq  // eq = none
    168c:       aa0303e0        mov     x0, x3
    1690:       aa1303e1        mov     x1, x19
    1694:       91003f7b        add     x27, x27, #0xf
    1698:       94000000        bl      0 <memmove>
    169c:       d10006c5        sub     x5, x22, #0x1
    16a0:       d2800020        mov     x0, #0x1                        // #1
    16a4:       d503201f        nop
    16a8:       38606b62        ldrb    w2, [x27, x0]
    16ac:       8b000343        add     x3, x26, x0
    16b0:       386068a4        ldrb    w4, [x5, x0]
    16b4:       aa0003e1        mov     x1, x0
    16b8:       91000400        add     x0, x0, #0x1
    16bc:       4a040042        eor     w2, w2, w4
    16c0:       381ff062        sturb   w2, [x3, #-1]
    16c4:       f94037e2        ldr     x2, [sp, #104]
    16c8:       eb02003f        cmp     x1, x2
    16cc:       54fffee1        b.ne    16a8 <mbedtls_aes_crypt_xts+0x1a8>  // b.any
    16d0:       d2800203        mov     x3, #0x10                       // #16
    16d4:       8b020265        add     x5, x19, x2
    16d8:       cb020063        sub     x3, x3, x2
    16dc:       8b0202c4        add     x4, x22, x2
    16e0:       8b020359        add     x25, x26, x2
    16e4:       d2800000        mov     x0, #0x0                        // #0
    16e8:       386068a1        ldrb    w1, [x5, x0]
    16ec:       38606882        ldrb    w2, [x4, x0]
    16f0:       4a020021        eor     w1, w1, w2
    16f4:       38206b21        strb    w1, [x25, x0]
    16f8:       91000400        add     x0, x0, #0x1
    16fc:       eb00007f        cmp     x3, x0
    1700:       54ffff41        b.ne    16e8 <mbedtls_aes_crypt_xts+0x1e8>  // b.any
    1704:       2a1403e1        mov     w1, w20
    1708:       aa1503e0        mov     x0, x21
    170c:       aa1a03e3        mov     x3, x26
    1710:       aa1a03e2        mov     x2, x26
    1714:       94000000        bl      12d0 <mbedtls_aes_crypt_ecb>
    1718:       35fff8c0        cbnz    w0, 1630 <mbedtls_aes_crypt_xts+0x130>
    171c:       3dc002c0        ldr     q0, [x22]
    1720:       3dc00341        ldr     q1, [x26]
    1724:       6e211c00        eor     v0.16b, v0.16b, v1.16b
    1728:       3d800260        str     q0, [x19]
    172c:       a9425bf5        ldp     x21, x22, [sp, #32]
    1730:       a94363f7        ldp     x23, x24, [sp, #48]
    1734:       a9446bf9        ldp     x25, x26, [sp, #64]
    1738:       a94573fb        ldp     x27, x28, [sp, #80]
    173c:       17ffffc1        b       1640 <mbedtls_aes_crypt_xts+0x140>
    1740:       aa1703e2        mov     x2, x23
    1744:       aa0303e0        mov     x0, x3
    1748:       52800021        mov     w1, #0x1                        // #1
    174c:       aa1c03e3        mov     x3, x28
    1750:       94000000        bl      0 <mbedtls_aesce_crypt_ecb>
    1754:       17ffff8a        b       157c <mbedtls_aes_crypt_xts+0x7c>
    1758:       a94797e1        ldp     x1, x5, [sp, #120]
    175c:       a9478fe2        ldp     x2, x3, [sp, #120]
    1760:       a9088fe2        stp     x2, x3, [sp, #136]
    1764:       d37ffca0        lsr     x0, x5, #63
    1768:       93c1fca5        extr    x5, x5, x1, #63
    176c:       4b000ee0        sub     w0, w23, w0, lsl #3
    1770:       1ac02ac0        asr     w0, w22, w0
    1774:       93407c00        sxtw    x0, w0
    1778:       ca010401        eor     x1, x0, x1, lsl #1
    177c:       a90797e1        stp     x1, x5, [sp, #120]
    1780:       17ffffa2        b       1608 <mbedtls_aes_crypt_xts+0x108>
    1784:       12800420        mov     w0, #0xffffffde                 // #-34
    1788:       a9425bf5        ldp     x21, x22, [sp, #32]
    178c:       17ffffad        b       1640 <mbedtls_aes_crypt_xts+0x140>
    1790:       12800400        mov     w0, #0xffffffdf                 // #-33
    1794:       17ffffab        b       1640 <mbedtls_aes_crypt_xts+0x140>
    1798:       a9025bf5        stp     x21, x22, [sp, #32]
    179c:       a90363f7        stp     x23, x24, [sp, #48]
    17a0:       a9046bf9        stp     x25, x26, [sp, #64]
    17a4:       a90573fb        stp     x27, x28, [sp, #80]
    17a8:       94000000        bl      0 <__stack_chk_fail>

    Disassembly for mbedtls_gf128mul_x_ble: (actually, this gets inlined, but I
removed the "static inline" to get this disassembly)
0000000000001500 <mbedtls_gf128mul_x_ble>:
    1500:       a9401023        ldp     x3, x4, [x1]
    1504:       52800105        mov     w5, #0x8                        // #8
    1508:       528010e2        mov     w2, #0x87                       // #135
    150c:       d37ffc81        lsr     x1, x4, #63
    1510:       93c3fc84        extr    x4, x4, x3, #63
    1514:       4b010ca1        sub     w1, w5, w1, lsl #3
    1518:       1ac12841        asr     w1, w2, w1
    151c:       93407c21        sxtw    x1, w1
    1520:       ca030423        eor     x3, x1, x3, lsl #1
    1524:       a9001003        stp     x3, x4, [x0]
    1528:       d65f03c0        ret
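
To make the listings easier to follow, here is a rough C sketch of what the
mbedtls_gf128mul_x_ble disassembly above appears to compute (doubling in
GF(2^128) on a little-endian 16-byte block). It is reconstructed from the
instructions, not copied from the Mbed TLS source: the name
gf128mul_x_ble_sketch is hypothetical, and the memcpy-based loads and stores
assume a little-endian host.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical reconstruction from the disassembly; assumes a
 * little-endian host so memcpy yields the little-endian 64-bit halves. */
static inline void gf128mul_x_ble_sketch(unsigned char r[16],
                                         const unsigned char x[16])
{
    uint64_t a, b, ra, rb;

    memcpy(&a, x, 8);      /* low half  (ldp x3, x2, [x1]) */
    memcpy(&b, x + 8, 8);  /* high half */

    /* Shift count is 0 when the top bit of b is set, otherwise 8, so the
     * reduction constant 0x87 is folded in only on carry-out
     * (0x87 >> 8 == 0). */
    ra = (a << 1) ^ (uint64_t) (0x87 >> (8 - (int) ((b >> 63) << 3)));
    /* The extr instruction: carry bit 63 of a into bit 0 of the high half. */
    rb = (a >> 63) | (b << 1);

    memcpy(r, &ra, 8);     /* the two stores that -Os leaves as calls */
    memcpy(r + 8, &rb, 8);
}
```

In the -O2 listing the two final stores collapse into a single stp, whereas
the -Os listing reaches mbedtls_put_unaligned_uint64 through a bl and a tail
call.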


* [Bug other/110946] 3x perf regression with -Os on M1 Pro
  2023-08-08 11:25 [Bug other/110946] New: 3x perf regression with -Os on M1 Pro dave.rodgman at arm dot com
  2023-08-08 11:37 ` [Bug other/110946] " dave.rodgman at arm dot com
@ 2023-08-08 12:22 ` amonakov at gcc dot gnu.org
  2023-08-08 12:32 ` [Bug ipa/110946] " rguenth at gcc dot gnu.org
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: amonakov at gcc dot gnu.org @ 2023-08-08 12:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
So basically missed inlining at -Os, even memcpy wrappers are not inlined.

Can you provide a reproducible testcase?

Note that inline functions in mbedtls/library/alignment.h all miss the 'static'
qualifier, which affects inlining decisions, and looks like a mistake anyway
(if they are really meant to be non-static inlines, shouldn't there be a
comment?)

Does making them 'static inline' rectify the problem?


* [Bug ipa/110946] 3x perf regression with -Os on M1 Pro
  2023-08-08 11:25 [Bug other/110946] New: 3x perf regression with -Os on M1 Pro dave.rodgman at arm dot com
  2023-08-08 11:37 ` [Bug other/110946] " dave.rodgman at arm dot com
  2023-08-08 12:22 ` amonakov at gcc dot gnu.org
@ 2023-08-08 12:32 ` rguenth at gcc dot gnu.org
  2023-08-08 12:35 ` [Bug other/110946] " dave.rodgman at arm dot com
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-08-08 12:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|other                       |ipa
             Target|                            |aarch64
           Keywords|                            |missed-optimization
                 CC|                            |marxin at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note you shouldn't use -Os if you care about performance.  GCC is quite
reasonable with code size increases at -O2 (as compared to other compilers). 
Instead I suggest you use -flto with -O2 to decrease the size of the final
executable/library and give GCC better knowledge on unit growth.


* [Bug other/110946] 3x perf regression with -Os on M1 Pro
  2023-08-08 11:25 [Bug other/110946] New: 3x perf regression with -Os on M1 Pro dave.rodgman at arm dot com
                   ` (2 preceding siblings ...)
  2023-08-08 12:32 ` [Bug ipa/110946] " rguenth at gcc dot gnu.org
@ 2023-08-08 12:35 ` dave.rodgman at arm dot com
  2023-08-08 12:40 ` dave.rodgman at arm dot com
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: dave.rodgman at arm dot com @ 2023-08-08 12:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

Dave Rodgman <dave.rodgman at arm dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|missed-optimization         |
          Component|ipa                         |other
             Target|aarch64                     |

--- Comment #4 from Dave Rodgman <dave.rodgman at arm dot com> ---
From a quick test, it doesn't look like the unaligned access inlining is the
issue:

Not static inline, -Os:
  AES-XTS-128              :     853799 KiB/s,          0 cycles/byte
  AES-XTS-256              :     749919 KiB/s,          0 cycles/byte

Static inline, -Os:

  AES-XTS-128              :     885380 KiB/s,          0 cycles/byte
  AES-XTS-256              :     752995 KiB/s,          0 cycles/byte

Not static inline, -O2:
  AES-XTS-128              :    2822656 KiB/s,          0 cycles/byte
  AES-XTS-256              :    2425721 KiB/s,          0 cycles/byte

Static inline, -O2:
  AES-XTS-128              :    2692321 KiB/s,          0 cycles/byte
  AES-XTS-256              :    2446391 KiB/s,          0 cycles/byte


* [Bug other/110946] 3x perf regression with -Os on M1 Pro
  2023-08-08 11:25 [Bug other/110946] New: 3x perf regression with -Os on M1 Pro dave.rodgman at arm dot com
                   ` (3 preceding siblings ...)
  2023-08-08 12:35 ` [Bug other/110946] " dave.rodgman at arm dot com
@ 2023-08-08 12:40 ` dave.rodgman at arm dot com
  2023-08-08 13:27 ` dave.rodgman at arm dot com
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: dave.rodgman at arm dot com @ 2023-08-08 12:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

--- Comment #5 from Dave Rodgman <dave.rodgman at arm dot com> ---
(In reply to Richard Biener from comment #3)
> Note you shouldn't use -Os if you care about performance.  GCC is quite
> reasonable with code size increases at -O2 (as compared to other compilers).
> Instead I suggest you use -flto with -O2 to decrease the size of the final
> executable/library and give GCC better knowledge on unit growth.

Understood, but I think it depends on the magnitude of the perf difference. I'd
expect a smallish perf drop, say 10%, from -Os to be reasonable, but I'd
consider a 3x perf difference to be a compiler issue.

(In reply to Alexander Monakov from comment #2)
> So basically missed inlining at -Os, even memcpy wrappers are not inlined.
> 
> Can you provide a reproducible testcase?
> 
> Note that inline functions in mbedtls/library/alignment.h all miss the
> 'static' qualifier, which affects inlining decisions, and looks like a
> mistake anyway (if they are really meant to be non-static inlines, shouldn't
> there be a comment?)
> 
> Does making them 'static inline' rectify the problem?

The easiest way to reproduce is to use the benchmark tool:

make programs/test/benchmark CC=gcc CFLAGS="-Os"
programs/test/benchmark aes_xts

I don't have a compact reproducer, sorry.


* [Bug other/110946] 3x perf regression with -Os on M1 Pro
  2023-08-08 11:25 [Bug other/110946] New: 3x perf regression with -Os on M1 Pro dave.rodgman at arm dot com
                   ` (4 preceding siblings ...)
  2023-08-08 12:40 ` dave.rodgman at arm dot com
@ 2023-08-08 13:27 ` dave.rodgman at arm dot com
  2023-08-08 15:57 ` pinskia at gcc dot gnu.org
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: dave.rodgman at arm dot com @ 2023-08-08 13:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

--- Comment #6 from Dave Rodgman <dave.rodgman at arm dot com> ---
Under clang, whether or not mbedtls_xor is inlined causes an equivalent perf
difference. Note that mbedtls_xor is inlined in the gcc -O2 version and not in
the gcc -Os version.

Not inline mbedtls_xor, -Os clang:
  AES-XTS-128              :     834549 KiB/s,          0 cycles/byte
  AES-XTS-256              :     674383 KiB/s,          0 cycles/byte

Inline mbedtls_xor, -Os clang:
  AES-XTS-128              :    2664799 KiB/s,          0 cycles/byte
  AES-XTS-256              :    2278008 KiB/s,          0 cycles/byte


However, if I mark mbedtls_xor as static inline (actually, for testing
purposes, I created a static inline copy in aes.c), gcc still does not inline
it. I am not sure why. If I use "__attribute__((always_inline))" gcc will
inline it.

So it looks like gcc is overly averse to inlining this function, or is getting
the cost/benefit of inlining wrong here?

For three of the five calls, we know at compile time that n == 16, so the
function will compile to four instructions:

    139c:       3dc00021        ldr     q1, [x1]
    13a0:       3dc00040        ldr     q0, [x2]
    13a4:       6e211c00        eor     v0.16b, v0.16b, v1.16b
    13a8:       3d800000        str     q0, [x0]

so it does seem surprising that gcc doesn't want to inline this.
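
As an illustration of the always_inline experiment described above, here is a
minimal sketch. xor_bytes is a hypothetical stand-in written for this example,
not the Mbed TLS implementation of mbedtls_xor, and FORCE_INLINE is a made-up
macro name. The point is only that once the helper is forced inline, a caller
passing a constant n == 16 gives the compiler a fixed-size operation to work
with.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical macro name. always_inline is a GCC/clang attribute;
 * other compilers fall back to plain static inline. */
#if defined(__GNUC__)
#define FORCE_INLINE static inline __attribute__((always_inline))
#else
#define FORCE_INLINE static inline
#endif

/* Hypothetical stand-in for mbedtls_xor: r = a ^ b over n bytes. */
FORCE_INLINE void xor_bytes(unsigned char *r, const unsigned char *a,
                            const unsigned char *b, size_t n)
{
    size_t i = 0;

    /* Process 8 bytes at a time; memcpy sidesteps alignment and
     * aliasing concerns. */
    for (; i + 8 <= n; i += 8) {
        uint64_t x, y;
        memcpy(&x, a + i, 8);
        memcpy(&y, b + i, 8);
        x ^= y;
        memcpy(r + i, &x, 8);
    }
    /* Byte-wise tail for lengths that are not a multiple of 8. */
    for (; i < n; i++) {
        r[i] = (unsigned char) (a[i] ^ b[i]);
    }
}

/* Caller with a compile-time constant length, as in the n == 16 calls. */
void xor_block(unsigned char out[16], const unsigned char in[16],
               const unsigned char tweak[16])
{
    xor_bytes(out, in, tweak, 16);
}
```

Whether a given compiler reduces xor_block to the four-instruction sequence
shown above depends on the target and options; comparing the output of
objdump -d at -Os with and without the attribute is the way to check.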


* [Bug other/110946] 3x perf regression with -Os on M1 Pro
  2023-08-08 11:25 [Bug other/110946] New: 3x perf regression with -Os on M1 Pro dave.rodgman at arm dot com
                   ` (5 preceding siblings ...)
  2023-08-08 13:27 ` dave.rodgman at arm dot com
@ 2023-08-08 15:57 ` pinskia at gcc dot gnu.org
  2023-08-08 16:48 ` [Bug ipa/110946] " amonakov at gcc dot gnu.org
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-08-08 15:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |92716

--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I am 99% sure this is basically PR 92716.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92716
[Bug 92716] -Os doesn't inline byteswap function even though it's a single
instruction


* [Bug ipa/110946] 3x perf regression with -Os on M1 Pro
  2023-08-08 11:25 [Bug other/110946] New: 3x perf regression with -Os on M1 Pro dave.rodgman at arm dot com
                   ` (6 preceding siblings ...)
  2023-08-08 15:57 ` pinskia at gcc dot gnu.org
@ 2023-08-08 16:48 ` amonakov at gcc dot gnu.org
  2023-08-08 17:38 ` amonakov at gcc dot gnu.org
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: amonakov at gcc dot gnu.org @ 2023-08-08 16:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

--- Comment #8 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Why? There's no bswap here; in particular, mbedtls_put_unaligned_uint64 is a
straightforward wrapper for memcpy:

inline void mbedtls_put_unaligned_uint64(void *p, uint64_t x)
{
    memcpy(p, &x, sizeof(x));
}


We are deciding not to inline this, while inlining its get_unaligned
counterpart? Seems bizarre.
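
For reference, a minimal sketch of what such a put/get pair looks like side by
side. The names demo_put_u64, demo_get_u64 and demo_roundtrip_check are
hypothetical, and 'static' is used here only to keep the sketch
self-contained in one file; the mbedtls functions are non-static inlines.
Both directions are a single fixed-size memcpy, so one would expect them to
get the same inlining treatment.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names; same shape as the wrapper quoted above. */
static inline void demo_put_u64(void *p, uint64_t x)
{
    memcpy(p, &x, sizeof(x));
}

/* The "get" direction: copy out of memory into a local and return it. */
static inline uint64_t demo_get_u64(const void *p)
{
    uint64_t x;
    memcpy(&x, p, sizeof(x));
    return x;
}

/* Round trip through a deliberately misaligned address. */
void demo_roundtrip_check(void)
{
    unsigned char buf[9] = { 0 };
    demo_put_u64(buf + 1, UINT64_C(0x0123456789abcdef));
    assert(demo_get_u64(buf + 1) == UINT64_C(0x0123456789abcdef));
}
```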


* [Bug ipa/110946] 3x perf regression with -Os on M1 Pro
  2023-08-08 11:25 [Bug other/110946] New: 3x perf regression with -Os on M1 Pro dave.rodgman at arm dot com
                   ` (7 preceding siblings ...)
  2023-08-08 16:48 ` [Bug ipa/110946] " amonakov at gcc dot gnu.org
@ 2023-08-08 17:38 ` amonakov at gcc dot gnu.org
  2023-08-08 21:20 ` amonakov at gcc dot gnu.org
  2023-08-08 21:47 ` amonakov at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: amonakov at gcc dot gnu.org @ 2023-08-08 17:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

--- Comment #9 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #2)
> Note that inline functions in mbedtls/library/alignment.h all miss the
> 'static' qualifier, which affects inlining decisions, and looks like a
> mistake anyway (if they are really meant to be non-static inlines, shouldn't
> there be a comment?)

Can you address this on the mbedtls side? Even if it doesn't help with the
observed slowdown, it will remain a problem for the future if left unfixed.


* [Bug ipa/110946] 3x perf regression with -Os on M1 Pro
  2023-08-08 11:25 [Bug other/110946] New: 3x perf regression with -Os on M1 Pro dave.rodgman at arm dot com
                   ` (8 preceding siblings ...)
  2023-08-08 17:38 ` amonakov at gcc dot gnu.org
@ 2023-08-08 21:20 ` amonakov at gcc dot gnu.org
  2023-08-08 21:47 ` amonakov at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: amonakov at gcc dot gnu.org @ 2023-08-08 21:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

--- Comment #10 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Ah, the non-static inlines are intentional: the corresponding extern
declarations appear in library/platform_util.c. Sorry, I missed that file the
first time around.
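
A minimal sketch of that pattern, assuming C99 (or later) inline semantics.
The names demo_get_u32 and demo_get_u32_check are hypothetical, and the two
parts are shown in one file only for brevity; in a real project the first
part would sit in a header and the second in exactly one .c file.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Header part: an inline definition without 'static'. Under C99 rules this
 * alone does not provide an out-of-line (external) definition. */
inline uint32_t demo_get_u32(const void *p)
{
    uint32_t x;
    memcpy(&x, p, sizeof(x));
    return x;
}

/* Single .c file part: the extern declaration makes this translation unit
 * provide the external definition, used wherever a call is not inlined. */
extern inline uint32_t demo_get_u32(const void *p);

void demo_get_u32_check(void)
{
    uint32_t v = 0xdeadbeef;
    assert(demo_get_u32(&v) == v);

    /* Calling through a pointer needs the out-of-line copy to exist. */
    uint32_t (*fp)(const void *) = demo_get_u32;
    assert(fp(&v) == v);
}
```

Without the extern declaration, a call that the compiler chose not to inline
could fail at link time, since no translation unit would provide the symbol.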


* [Bug ipa/110946] 3x perf regression with -Os on M1 Pro
  2023-08-08 11:25 [Bug other/110946] New: 3x perf regression with -Os on M1 Pro dave.rodgman at arm dot com
                   ` (9 preceding siblings ...)
  2023-08-08 21:20 ` amonakov at gcc dot gnu.org
@ 2023-08-08 21:47 ` amonakov at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: amonakov at gcc dot gnu.org @ 2023-08-08 21:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

--- Comment #11 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #8)
> inline void mbedtls_put_unaligned_uint64(void *p, uint64_t x)
> {
>     memcpy(p, &x, sizeof(x));
> }
> 
> 
> We are deciding not to inline this, while inlining its get_unaligned
> counterpart? Seems bizarre.

I can reproduce this part, and on my side it's caused by _FORTIFY_SOURCE: with
fortification, put_unaligned indeed looks bigger during inlining:

mbedtls_put_unaligned_uint32 (void * p, uint32_t x)
{
  long unsigned int _3;

  <bb 2> [local count: 1073741824]:
  _3 = __builtin_object_size (p_2(D), 0);
  __builtin___memcpy_chk (p_2(D), &x, 4, _3);
  return;

}

mbedtls_get_unaligned_uint64 (const void * p)
{
  long unsigned int _3;

  <bb 2> [local count: 1073741824]:
  _3 = MEM <long unsigned int> [(char * {ref-all})p_2(D)];
  return _3;

}
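
To make the size difference concrete, here is a hand-written sketch of roughly
what the fortified put function amounts to, spelled out with the GCC builtins
that appear in the dump above. The names demo_put_u32_chk and
demo_put_u32_chk_check are hypothetical, and this assumes a GCC-compatible
compiler and a libc that provides __memcpy_chk (glibc does); it is not what
the fortify headers literally contain.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hand-expanded equivalent of a fortified memcpy wrapper: one extra
 * object-size query plus a call to the checking variant of memcpy. */
static inline void demo_put_u32_chk(void *p, uint32_t x)
{
    /* Type 0 asks for the maximum size of the object p may point into;
     * the result is (size_t) -1 when the compiler cannot tell. */
    size_t dst_size = __builtin_object_size(p, 0);
    __builtin___memcpy_chk(p, &x, sizeof(x), dst_size);
}

void demo_put_u32_chk_check(void)
{
    unsigned char buf[4];
    uint32_t v = 0x11223344;
    demo_put_u32_chk(buf, v);
    /* The bytes written must match the in-memory representation of v. */
    assert(memcmp(buf, &v, sizeof(v)) == 0);
}
```

Before the object size is resolved, the body holds two builtin calls rather
than one plain memory access, which fits the observation that put_unaligned
looks bigger to the inliner than its get_unaligned counterpart.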


