public inbox for gcc-bugs@sourceware.org
* [Bug c/38453]  New: Output code optimisation excessive use of builtins
@ 2008-12-09 14:51 vince at simtec dot co dot uk
  2008-12-09 14:52 ` [Bug c/38453] " vince at simtec dot co dot uk
                   ` (5 more replies)
  0 siblings, 6 replies; 8+ messages in thread
From: vince at simtec dot co dot uk @ 2008-12-09 14:51 UTC
  To: gcc-bugs

While compiling LZMA compression code for an embedded ARM target, I have
discovered a regression from previous versions of GCC.

I have pared this down to a trivial example (attached) which boils down to an
application-specific modulus operation. Please note this is the *minimal* test
case; the original is obviously a bit more complex, buried in the middle of the
compression system. The behavior exhibited remains the same in both the large
and small systems.
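
The attachment is not reproduced in this message. For readers without it, the
following is only a guess at the shape of the test case, reconstructed from the
disassembly in this report: a byte is loaded through the second argument, 45 is
subtracted repeatedly while a word at offset 4 of the first argument is
incremented, and the final value is stored at offset 0. The struct layout and
all names other than foo are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: the listings only show word-sized accesses at
   offsets 0 and 4 of the first argument. Field names are invented. */
struct state {
    uint32_t rem;    /* offset 0: value left after the loop */
    uint32_t count;  /* offset 4: incremented once per subtraction */
};

/* Application-specific modulus by repeated subtraction of 45. */
void foo(struct state *s, const uint8_t *p)
{
    assert(s && p);

    uint8_t c = *p;          /* matches the ldrb in the listings */

    while (c >= 45) {        /* matches cmp #44 / bhi */
        c -= 45;
        s->count++;
    }
    s->rem = c;
}
```

With an input byte of 100, this sketch leaves rem at 10 and adds 2 to count.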

The simple test case is compiled with:
arm-unknown-linux-gnu-gcc -Os -o foo test.c

and the resulting objdump is:

000083fc <foo>:
    83fc:       e92d4010        push    {r4, lr}
    8400:       e5d11000        ldrb    r1, [r1]
    8404:       e1a04000        mov     r4, r0
    8408:       e1a02001        mov     r2, r1
    840c:       ea000002        b       841c <foo+0x20>
    8410:       e5943004        ldr     r3, [r4, #4]
    8414:       e2833001        add     r3, r3, #1      ; 0x1
    8418:       e5843004        str     r3, [r4, #4]
    841c:       e242302d        sub     r3, r2, #45     ; 0x2d
    8420:       e352002c        cmp     r2, #44 ; 0x2c
    8424:       e20320ff        and     r2, r3, #255    ; 0xff
    8428:       8afffff8        bhi     8410 <foo+0x14>
    842c:       e1a00001        mov     r0, r1
    8430:       e3a0102d        mov     r1, #45 ; 0x2d
    8434:       eb000003        bl      8448 <__umodsi3>
    8438:       e20000ff        and     r0, r0, #255    ; 0xff
    843c:       e5840000        str     r0, [r4]
    8440:       e8bd8010        pop     {r4, pc}

If a different optimisation level is used:

arm-unknown-linux-gnu-gcc -O2 -o foo test.c

000083fc <foo>:
    83fc:       e92d4070        push    {r4, r5, r6, lr}
    8400:       e5d14000        ldrb    r4, [r1]
    8404:       e354002c        cmp     r4, #44 ; 0x2c
    8408:       e1a06000        mov     r6, r0
    840c:       9a00000e        bls     844c <foo+0x50>
    8410:       e244402d        sub     r4, r4, #45     ; 0x2d
    8414:       e20440ff        and     r4, r4, #255    ; 0xff
    8418:       e5905004        ldr     r5, [r0, #4]
    841c:       e3a0102d        mov     r1, #45 ; 0x2d
    8420:       e1a00004        mov     r0, r4
    8424:       eb00004f        bl      8568 <__umodsi3>
    8428:       e3a0102d        mov     r1, #45 ; 0x2d
    842c:       e1a03000        mov     r3, r0
    8430:       e1a00004        mov     r0, r4
    8434:       e20340ff        and     r4, r3, #255    ; 0xff
    8438:       eb000006        bl      8458 <__aeabi_uidiv>
    843c:       e2855001        add     r5, r5, #1      ; 0x1
    8440:       e20000ff        and     r0, r0, #255    ; 0xff
    8444:       e0855000        add     r5, r5, r0
    8448:       e5865004        str     r5, [r6, #4]
    844c:       e5864000        str     r4, [r6]
    8450:       e8bd8070        pop     {r4, r5, r6, pc}

In fact, several optimization levels were tried and all produced similar output.

GCC 4.2.2 and 4.2.4 (which are our current compilers), invoked as
arm-unknown-linux-gnueabi-gcc -Os -o foo test.c
produce:

00008328 <foo>:
    8328:       e5d12000        ldrb    r2, [r1]
    832c:       ea000003        b       8340 <foo+0x18>
    8330:       e5903004        ldr     r3, [r0, #4]
    8334:       e20120ff        and     r2, r1, #255    ; 0xff
    8338:       e2833001        add     r3, r3, #1      ; 0x1
    833c:       e5803004        str     r3, [r0, #4]
    8340:       e352002c        cmp     r2, #44 ; 0x2c
    8344:       e242102d        sub     r1, r2, #45     ; 0x2d
    8348:       8afffff8        bhi     8330 <foo+0x8>
    834c:       e5802000        str     r2, [r0]
    8350:       e12fff1e        bx      lr



As can be seen, the trivial loop is performed and the quotient and remainder
are found, but then the __umodsi3 builtin is called to do the operation *again*,
and its return value is used to assign the result, which is already available
from the loop!
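
As a rough model of what the 4.3.2 -O2 listing does (read off the disassembly,
not taken from GCC itself), the loop appears to be replaced by one peeled
subtraction followed by a division and a modulus. The sketch below puts the two
forms side by side and checks that they agree for every 8-bit input. The
function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Loop form: repeated subtraction, counting iterations. */
static void by_loop(uint8_t c, uint32_t *count, uint32_t *rem)
{
    assert(count && rem);
    while (c >= 45) {
        c -= 45;
        (*count)++;
    }
    *rem = c;
}

/* Closed form modelled on the -O2 listing: subtract once, then use a
   division (__aeabi_uidiv) and a modulus (__umodsi3) for the rest. */
static void by_division(uint8_t c, uint32_t *count, uint32_t *rem)
{
    assert(count && rem);
    if (c >= 45) {
        c -= 45;
        *count += 1 + c / 45;
        c %= 45;
    }
    *rem = c;
}

/* Returns 1 if both forms agree for every possible input byte. */
int forms_agree(void)
{
    for (unsigned v = 0; v < 256; v++) {
        uint32_t c1 = 0, r1 = 0, c2 = 0, r2 = 0;
        by_loop((uint8_t)v, &c1, &r1);
        by_division((uint8_t)v, &c2, &r2);
        if (c1 != c2 || r1 != r2)
            return 0;
    }
    return 1;
}
```

The two forms are equivalent in result, but on a core without a hardware
divide instruction the closed form costs library calls, while the loop runs at
most five iterations for an 8-bit input.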

This odd behavior is seen in cross-built (and native) GCC 4.3.2 but not in
4.2.4. It seems to be present in current development builds; however, I have
issues building those reliably, so I cannot give definite results.

The behavior is especially obvious as a large performance and code size
degradation in compression code on small embedded systems. The additional
need to link in the __umodsi3 implementation also causes more space to be lost.

This has also been observed in some circumstances within ARM kernels when using
modulus on powers of two! The obvious optimisation using shifts is performed
and then the value is recomputed using __modsi3.
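
For reference, the power-of-two case needs no library call at all. The sketch
below is illustrative only and is not taken from the kernel code mentioned
above. It shows the usual mask form for unsigned operands and a fix-up for
signed operands under C99 truncating semantics. The function names are
invented, and the signed version assumes two's complement representation.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned: x % 2^k equals x & (2^k - 1). */
uint32_t umod_pow2(uint32_t x, unsigned k)
{
    assert(k < 32);
    return x & ((UINT32_C(1) << k) - 1);
}

/* Signed: C99 '%' truncates toward zero, so a negative x with a
   non-zero masked value needs 2^k subtracted from it. */
int32_t smod_pow2(int32_t x, unsigned k)
{
    assert(k < 31);
    uint32_t mask = (UINT32_C(1) << k) - 1;
    int32_t r = (int32_t)((uint32_t)x & mask);
    if (x < 0 && r != 0)
        r -= (int32_t)(mask + 1);
    return r;
}
```

Either form is a handful of instructions on ARM, which is why a following
call to __modsi3 for the same value is pure overhead.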

Just for completeness, here is the GCC 4.3.2 compiler used for the tests (the
4.3.4 compiler produces identical compiled output but has other undesirable
behaviors not relevant to this report):

arm-unknown-linux-gnu-gcc -v
Using built-in specs.
Target: arm-unknown-linux-gnu
Configured with: /opt/simtec/crosstool-ng/targets/src/gcc-4.3.2/configure
--build=x86_64-build_unknown-linux-gnu --host=x86_64-build_unknown-linux-gnu
--target=arm-unknown-linux-gnu --prefix=/opt/simtec/arm-unknown-linux-gnu
--with-sysroot=/opt/simtec/arm-unknown-linux-gnu/arm-unknown-linux-gnu/sys-root
--enable-languages=c,c++,fortran,java --disable-multilib --with-float=soft
--with-gmp=/opt/simtec/arm-unknown-linux-gnu
--with-mpfr=/opt/simtec/arm-unknown-linux-gnu
--with-pkgversion=crosstool-NG-1.3.0 --enable-__cxa_atexit
--with-local-prefix=/opt/simtec/arm-unknown-linux-gnu/arm-unknown-linux-gnu/sys-root
--disable-nls --enable-threads=posix --enable-symvers=gnu --enable-c99
--enable-long-long --enable-target-optspace
Thread model: posix
gcc version 4.3.2 (crosstool-NG-1.3.0)


-- 
           Summary: Output code optimisation excessive use of builtins
           Product: gcc
           Version: 4.3.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: vince at simtec dot co dot uk
 GCC build triplet: x86_64-build_unknown-linux-gnu
  GCC host triplet: x86_64-build_unknown-linux-gnu
GCC target triplet: arm-unknown-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38453

