From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 10953 invoked by alias); 14 Oct 2013 20:01:51 -0000 Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-bugs-owner@gcc.gnu.org Received: (qmail 10913 invoked by uid 48); 14 Oct 2013 20:01:48 -0000 From: "niels at penneman dot org" To: gcc-bugs@gcc.gnu.org Subject: [Bug tree-optimization/58727] New: Sub-optimal code for bit clear/set sequence Date: Mon, 14 Oct 2013 20:01:00 -0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: new X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: gcc X-Bugzilla-Component: tree-optimization X-Bugzilla-Version: 4.9.0 X-Bugzilla-Keywords: X-Bugzilla-Severity: minor X-Bugzilla-Who: niels at penneman dot org X-Bugzilla-Status: UNCONFIRMED X-Bugzilla-Priority: P3 X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org X-Bugzilla-Target-Milestone: --- X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status bug_severity priority component assigned_to reporter Message-ID: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 7bit X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-SW-Source: 2013-10/txt/msg00828.txt.bz2 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58727 Bug ID: 58727 Summary: Sub-optimal code for bit clear/set sequence Product: gcc Version: 4.9.0 Status: UNCONFIRMED Severity: minor Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: niels at penneman dot org GCC generates sub-optimal but functionally correct code when successively clearing a bit and setting another bit in an integral value, as shown in the "clear_set" and "set_clear" functions below. To be more specific, at all O-levels except "-O1", it emits code to clear the bit being set, prior to setting it. When splitting the clear and set operations with exactly the same bits, and calling them one after the other, the problem does not occur. It looks like this behaviour is caused by an optimization pass prior to inlining. On ARM this may cause a superfluous instruction to be emitted. In most instructions, the 32-bit ARM ISA can only encode 8-bit immediate values. Most of those instructions provide 4 extra bits in their encoding to specify a rotation. This implies that when the bit being set ("SET" in the test case below) is close enough to the bit being cleared ("CLEAR"), the code generation phase merges the superflous clear for "SET" with the one for "CLEAR". When the bits are too far apart, however, an extra clear instruction is generated, even when optimizing with "-Os". It looks like the root cause is a problem in the optimizer and not the code generator, since the same behaviour can be observed on x86_64. However, in the latter case this "bug" does not alter the number of instructions required. My knowledge of the x86 architecture is limited so I have no clue whether it causes any inefficiencies at all (instruction width, perhaps?). The test case below contains six functions: - "clear_set" and "set_clear" suffer from the problem described above: the compiler usually emits a superfluous clear for the bit being set; - "clear" and "set" perform the individual operations; - "clear_set_inline" and "set_clear_inline" are used to demonstrate that the problem does not occur when splitting and inlining the operations. The tests were performed on a Gentoo x86_64 Linux system; GCC versions for both native and cross-compiler are given below. ## testcase.c ################################################################# enum masks { CLEAR = 0x400000, SET = 0x02 }; unsigned clear_set(unsigned a) { return (a & ~CLEAR) | SET; } unsigned set_clear(unsigned a) { return (a | SET) & ~CLEAR; } unsigned clear(unsigned a) { return a & ~CLEAR; } unsigned set(unsigned a) { return a | SET; } __attribute__((flatten)) unsigned clear_set_inline(unsigned a) { return set(clear(a)); } __attribute__((flatten)) unsigned set_clear_inline(unsigned a) { return clear(set(a)); } ## GCC version (ARM) ########################################################## $ arm-none-eabi-gcc -### Using built-in specs. COLLECT_GCC=/path/to/toolchain/bin/arm-none-eabi-gcc COLLECT_LTO_WRAPPER=/path/to/toolchain/libexec/gcc/arm-none-eabi/4.8.1/lto-wrapper Target: arm-none-eabi Configured with: /path/to/builddir/src/gcc-4.8.1/configure --build=x86_64-build_unknown-linux-gnu --host=x86_64-build_unknown-linux-gnu --target=arm-none-eabi --prefix=/path/to/toolchain --with-local-prefix=/path/to/toolchain/arm-none-eabi/sysroot --disable-libmudflap --with-sysroot=/path/to/toolchain/arm-none-eabi/sysroot --with-newlib --enable-threads=no --disable-shared --with-pkgversion='crosstool-NG 1.19.0' --with-float=hard --disable-__cxa_atexit --with-gmp=/path/to/builddir/arm-none-eabi/buildtools --with-mpfr=/path/to/builddir/arm-none-eabi/buildtools --with-mpc=/path/to/builddir/arm-none-eabi/buildtools --with-ppl=no --with-isl=no --with-cloog=no --with-libelf=no --disable-lto --with-host-libstdcxx='-static-libgcc -Wl,-Bstatic,-lstdc++,-Bdynamic -lm' --enable-target-optspace --disable-libgomp --disable-libmudflap --disable-nls --disable-multilib --enable-languages=c Thread model: single gcc version 4.8.1 (crosstool-NG 1.19.0) ## GCC version (x86_64) ####################################################### $ gcc -### Using built-in specs. COLLECT_GCC=/usr/x86_64-pc-linux-gnu/gcc-bin/4.9.0-alpha20131013/gcc COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/4.9.0-alpha20131013/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: /var/tmp/portage/sys-devel/gcc-4.9.0_alpha20131013/work/gcc-4.9-20131013/configure --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/4.9.0-alpha20131013 --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.0-alpha20131013/include --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.9.0-alpha20131013 --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.9.0-alpha20131013/man --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.9.0-alpha20131013/info --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.0-alpha20131013/include/g++-v4 --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-altivec --disable-fixed-point --with-cloog --disable-isl-version-check --enable-lto --enable-nls --without-included-gettext --with-system-zlib --enable-obsolete --disable-werror --enable-secureplt --enable-multilib --with-multilib-list=m32,m64 --enable-libmudflap --disable-libssp --enable-libgomp --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/4.9.0-alpha20131013/python --enable-checking=release --disable-libgcj --enable-libstdcxx-time --enable-languages=c,c++,fortran --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --enable-targets=all --with-bugurl=http://bugs.gentoo.org/ --with-pkgversion='Gentoo 4.9.0_alpha20131013' Thread model: posix gcc version 4.9.0-alpha20131013 20131013 (experimental) (Gentoo 4.9.0_alpha20131013) ## Compiler invocation / steps to reproduce ################################### arm-none-eabi-gcc -O0 -S -fverbose-asm -o olevel0.s -c testcase.c arm-none-eabi-gcc -O1 -fno-aggressive-loop-optimizations -fno-auto-inc-dec -fno-branch-count-reg -fno-combine-stack-adjustments -fno-common -fno-compare-elim -fno-cprop-registers -fno-defer-pop -fno-delete-null-pointer-checks -fno-dwarf2-cfi-asm -fno-early-inlining -fno-eliminate-unused-debug-types -fno-forward-propagate -fno-function-cse -fno-gcse-lm -fno-guess-branch-probability -fno-ident -fno-if-conversion -fno-if-conversion2 -fno-inline-atomics -fno-ipa-profile -fno-ipa-pure-const -fno-ipa-reference -fno-ira-hoist-pressure -fno-ira-share-save-slots -fno-ira-share-spill-slots -fno-ivopts -fno-keep-static-consts -fno-leading-underscore -fno-math-errno -fno-merge-constants -fno-merge-debug-strings -fno-move-loop-invariants -fno-peephole -fno-prefetch-loop-arrays -fno-reg-struct-return -fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic -fno-sched-group-heuristic -fno-sched-interblock -fno-sched-last-insn-heuristic -fno-sched-pressure -fno-sched-rank-heuristic -fno-sched-spec -fno-sched-spec-insn-heuristic -fno-sched-stalled-insns-dep -fno-section-anchors -fno-show-column -fno-shrink-wrap -fno-signed-zeros -fno-split-ivs-in-unroller -fno-split-wide-types -fno-strict-volatile-bitfields -fno-sync-libcalls -fno-toplevel-reorder -fno-trapping-math -fno-tree-bit-ccp -fno-tree-ccp -fno-tree-ch -fno-tree-coalesce-vars -fno-tree-copy-prop -fno-tree-copyrename -fno-tree-cselim -fno-tree-dce -fno-tree-dominator-opts -fno-tree-dse -fno-tree-forwprop -fno-tree-fre -fno-tree-loop-if-convert -fno-tree-loop-im -fno-tree-loop-ivcanon -fno-tree-loop-optimize -fno-tree-phiprop -fno-tree-pta -fno-tree-reassoc -fno-tree-scev-cprop -fno-tree-sink -fno-tree-slp-vectorize -fno-tree-slsr -fno-tree-sra -fno-tree-ter -fno-tree-vect-loop-version -fno-unit-at-a-time -fno-zero-initialized-in-bss -S -fverbose-asm -o olevel1-all-disabled.s -c testcase.c arm-none-eabi-gcc -O1 -S -fverbose-asm -o olevel1.s -c testcase.c arm-none-eabi-gcc -O1 -fexpensive-optimizations -S -fverbose-asm -o olevel2.s -c testcase.c arm-none-eabi-gcc -Os -S -fverbose-asm -o olevels.s -c testcase.c gcc -O0 -S -fverbose-asm -o olevel0-x86.s -c testcase.c gcc -O1 -S -fverbose-asm -o olevel1-x86.s -c testcase.c gcc -O1 -fexpensive-optimizations -S -fverbose-asm -o olevel2-x86.s -c testcase.c ## Observations & expected results ############################################ When adding "-Wall -Wextra" the compiler does not emit any warnings for any of the commands above. The above commands generate 5 assembly files with different optimization levels: 1) olevel0.s at -O0 2) olevel1-all-disabled.s at -O1 with -fno-* for every -f* listed in olevel1.s 3) olevel1.s at -O1 4) olevel2.s at -O1 with -fexpensive-optimizations because it is the only -f* option additionally enabled with -O2 that causes the superflous clear to be emitted 5) olevels.s at -Os The first interesting observation is that the superfluous clear is generated at -O0 ("olevel0.s"). The generated code for "clear_set" and "set_clear" is exactly the same, excluding comments/annotations (stripped for clarity): 1: str fp, [sp, #-4]! 2: add fp, sp, #0 3: sub sp, sp, #12 4: str r0, [fp, #-8] 5: ldr r3, [fp, #-8] 6: bic r3, r3, #4194304 7: bic r3, r3, #2 8: orr r3, r3, #2 9: mov r0, r3 10: sub sp, fp, #0 11: ldr fp, [sp], #4 12: bx lr Just for the record: bic = clear bits in the specified mask, 0x400000 = 4194304. If we take a look at lines 6-8, the following happens: 6: A &= ~CLEAR 7: A &= ~SET 8: A |= SET The clear instruction at line 7 is superfluous. Looking at the "*_inline" versions at "-O0" doesn't make sense since the compiler does not perform inlining at "-O0". At "-O1", the code does not contain any superfluous instructions ("cat olevel1.s | grep -vE '^\s*[@\.]' | sed -E 's/\s*@.*$//'"): clear_set: bic r0, r0, #4194304 orr r0, r0, #2 bx lr set_clear: bic r0, r0, #4194304 orr r0, r0, #2 bx lr clear: bic r0, r0, #4194304 bx lr set: orr r0, r0, #2 bx lr clear_set_inline: bic r0, r0, #4194304 orr r0, r0, #2 bx lr set_clear_inline: bic r0, r0, #4194304 orr r0, r0, #2 bx lr Note that "clear_set", "set_clear", "clear_set_inline" and "set_clear_inline" are all identical. I have been unable to find out which particular optimization prevents the superfluous instruction from being emitted. I have tried to reverse all "-f*" optimization parameters enabled by "-O1" as found in "olevel1.s", but as should be visible from comparing "olevel1-all-disabled.s" and "olevel1.s" there is no difference in "clear_set" and "set_clear" other than in comments/annotations. At "-O2" the superfluous clear is emitted once again. I have gone through the optimizations enabled by -O2 which are not enabled by -O1 by comparing the (verbose) assembly output. I have tracked down the cause to "-fexpensive-optimizations". Enabling this optimization flag in combination with -O1 results in the following output ("cat olevel2.s | grep -vE '^\s*[@\.]' | sed -E 's/\s*@.*$//'"): clear_set: bic r0, r0, #4194304 bic r0, r0, #2 orr r0, r0, #2 bx lr set_clear: bic r0, r0, #4194304 bic r0, r0, #2 orr r0, r0, #2 bx lr clear: bic r0, r0, #4194304 bx lr set: orr r0, r0, #2 bx lr clear_set_inline: bic r0, r0, #4194304 orr r0, r0, #2 bx lr set_clear_inline: bic r0, r0, #4194304 orr r0, r0, #2 bx lr Note that the "*_inline" functions do not contain the superfluous clear instruction. The "-Os" case generates exactly the same instructions. Since "-Os" also enables "-fexpensive-optimizations" as can be seen from "olevels.s", the cause of the extra instruction at "-Os" is most likely the same as for "-O2". The above examples all use the ARM 32-bit ISA. By providing the arguments "-march=armv7-m -mthumb" to the cross-compiler the exact same behaviour can be observed for Thumb. On x86_64, the following happens with "-O0" ("cat olevel0-x86.s | grep -vE '^\s*[@\.]' | sed -E 's/\s*#.*$//'") in both "clear_set" and "set_clear": 1: pushq %rbp 2: movq %rsp, %rbp 3: movl %edi, -4(%rbp) 4: movl -4(%rbp), %eax 5: andl $-4194307, %eax 6: orl $2, %eax 7: popq %rbp 8: ret At line 5, the constant -4194307 is 0xFFBFFFFD in hex, so ~(CLEAR | SET). Once again, the bit that is set at line 6 with the "orl" instruction is first cleared in the "andl" instruction. Surprisingly, at "-O1" on x86_64, the superfluous clear remains in place. In the inlined version the mask is correct, however ("cat olevel1-x86.s | grep -vE '^\s*[@\.]' | sed -E 's/\s*#.*$//'"): clear_set: movl %edi, %eax andl $-4194307, %eax orl $2, %eax ret set_clear: movl %edi, %eax andl $-4194307, %eax orl $2, %eax ret clear: movl %edi, %eax andl $-4194305, %eax ret set: movl %edi, %eax orl $2, %eax ret clear_set_inline: movl %edi, %eax andl $-4194305, %eax orl $2, %eax ret set_clear_inline: movl %edi, %eax andl $-4194305, %eax orl $2, %eax ret The same behaviour is observed with "-Os", "-O2" and "-O3". I guess something internal in GCC is causing the bit to be cleared before it is set even though there is absolutely no valid reason to do so from the perspective of the program. While on x86_64 it probably does not affect code performance, on ARM it adversely affects code size and perhaps performance (untested).