[Bug tree-optimization/58727] New: Sub-optimal code for bit clear/set sequence

public inbox for gcc-bugs@sourceware.org

* [Bug tree-optimization/58727] New: Sub-optimal code for bit clear/set sequence
@ 2013-10-14 20:01 niels at penneman dot org
  2013-10-14 20:18 ` [Bug rtl-optimization/58727] " pinskia at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: niels at penneman dot org @ 2013-10-14 20:01 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58727

            Bug ID: 58727
           Summary: Sub-optimal code for bit clear/set sequence
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: niels at penneman dot org

GCC generates sub-optimal but functionally correct code when successively
clearing a bit and setting another bit in an integral value, as shown in the
"clear_set" and "set_clear" functions below. To be more specific, on ARM, at
all optimization levels except "-O1", it emits code to clear the bit being
set, prior to setting it. When the clear and set operations are split into
separate functions using exactly the same bits and called one after the other,
the problem does not occur. It looks like this behaviour is caused by an
optimization pass prior to inlining.

On ARM this may cause a superfluous instruction to be emitted. In most
instructions, the 32-bit ARM ISA can only encode 8-bit immediate values. Most
of those instructions provide 4 extra bits in their encoding to specify a
rotation. This implies that when the bit being set ("SET" in the test case
below) is close enough to the bit being cleared ("CLEAR"), the code generation
phase merges the superfluous clear for "SET" with the one for "CLEAR". When the
bits are too far apart, however, an extra clear instruction is generated, even
when optimizing with "-Os".
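
The encoding constraint can be checked with a few lines of C. This is only a
sketch of the classic 32-bit ARM data-processing immediate rule (an 8-bit
value rotated right by an even amount); the helper names are made up for
illustration and are not GCC functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value left by r bits (r in 0..31). */
static uint32_t rotl32(uint32_t v, unsigned r)
{
    r &= 31u;
    return r ? (v << r) | (v >> (32u - r)) : v;
}

/* Hypothetical helper: true if v is an 8-bit value rotated right by an
   even amount (0, 2, ..., 30), i.e. fits a single immediate operand. */
bool arm_imm_encodable(uint32_t v)
{
    for (unsigned r = 0; r < 32; r += 2)
        if (rotl32(v, r) <= 0xFFu)
            return true;
    return false;
}

/* Example with the masks from the test case. */
void arm_imm_example(void)
{
    assert(arm_imm_encodable(0x400000u));   /* CLEAR alone: one BIC */
    assert(arm_imm_encodable(0x02u));       /* SET alone: one BIC or ORR */
    assert(!arm_imm_encodable(0x400002u));  /* CLEAR | SET: needs two BICs */
    assert(arm_imm_encodable(0x12u));       /* bits close together: one BIC */
}
```

With CLEAR = 0x400000 and SET = 0x02 each mask is encodable on its own, but
the combined mask 0x400002 is not, which is why the second "bic" shows up.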

It looks like the root cause is a problem in the optimizer and not the code
generator, since the same behaviour can be observed on x86_64. However, in the
latter case this "bug" does not alter the number of instructions required. My
knowledge of the x86 architecture is limited so I have no clue whether it
causes any inefficiencies at all (instruction width, perhaps?).

The test case below contains six functions:
- "clear_set" and "set_clear" suffer from the problem described above: the
compiler usually emits a superfluous clear for the bit being set;
- "clear" and "set" perform the individual operations;
- "clear_set_inline" and "set_clear_inline" are used to demonstrate that the
problem does not occur when splitting and inlining the operations.

The tests were performed on a Gentoo x86_64 Linux system; GCC versions for both
native and cross-compiler are given below.

## testcase.c #################################################################

enum masks { CLEAR = 0x400000, SET = 0x02 };
unsigned clear_set(unsigned a) { return (a & ~CLEAR) | SET; }
unsigned set_clear(unsigned a) { return (a | SET) & ~CLEAR; }
unsigned clear(unsigned a) { return a & ~CLEAR; }
unsigned set(unsigned a) { return a | SET; }
__attribute__((flatten)) unsigned clear_set_inline(unsigned a) { return set(clear(a)); }
__attribute__((flatten)) unsigned set_clear_inline(unsigned a) { return clear(set(a)); }

## GCC version (ARM) ##########################################################

$ arm-none-eabi-gcc -###
Using built-in specs.
COLLECT_GCC=/path/to/toolchain/bin/arm-none-eabi-gcc
COLLECT_LTO_WRAPPER=/path/to/toolchain/libexec/gcc/arm-none-eabi/4.8.1/lto-wrapper
Target: arm-none-eabi
Configured with: /path/to/builddir/src/gcc-4.8.1/configure
--build=x86_64-build_unknown-linux-gnu --host=x86_64-build_unknown-linux-gnu
--target=arm-none-eabi --prefix=/path/to/toolchain
--with-local-prefix=/path/to/toolchain/arm-none-eabi/sysroot
--disable-libmudflap --with-sysroot=/path/to/toolchain/arm-none-eabi/sysroot
--with-newlib --enable-threads=no --disable-shared
--with-pkgversion='crosstool-NG 1.19.0' --with-float=hard
--disable-__cxa_atexit --with-gmp=/path/to/builddir/arm-none-eabi/buildtools
--with-mpfr=/path/to/builddir/arm-none-eabi/buildtools
--with-mpc=/path/to/builddir/arm-none-eabi/buildtools --with-ppl=no
--with-isl=no --with-cloog=no --with-libelf=no --disable-lto
--with-host-libstdcxx='-static-libgcc -Wl,-Bstatic,-lstdc++,-Bdynamic -lm'
--enable-target-optspace --disable-libgomp --disable-libmudflap --disable-nls
--disable-multilib --enable-languages=c
Thread model: single
gcc version 4.8.1 (crosstool-NG 1.19.0)

## GCC version (x86_64) #######################################################

$ gcc -###
Using built-in specs.
COLLECT_GCC=/usr/x86_64-pc-linux-gnu/gcc-bin/4.9.0-alpha20131013/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/4.9.0-alpha20131013/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with:
/var/tmp/portage/sys-devel/gcc-4.9.0_alpha20131013/work/gcc-4.9-20131013/configure
--prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/4.9.0-alpha20131013
--includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.0-alpha20131013/include
--datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.9.0-alpha20131013
--mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.9.0-alpha20131013/man
--infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.9.0-alpha20131013/info
--with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.0-alpha20131013/include/g++-v4
--host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-altivec
--disable-fixed-point --with-cloog --disable-isl-version-check --enable-lto
--enable-nls --without-included-gettext --with-system-zlib --enable-obsolete
--disable-werror --enable-secureplt --enable-multilib
--with-multilib-list=m32,m64 --enable-libmudflap --disable-libssp
--enable-libgomp
--with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/4.9.0-alpha20131013/python
--enable-checking=release --disable-libgcj --enable-libstdcxx-time
--enable-languages=c,c++,fortran --enable-shared --enable-threads=posix
--enable-__cxa_atexit --enable-clocale=gnu --enable-targets=all
--with-bugurl=http://bugs.gentoo.org/ --with-pkgversion='Gentoo
4.9.0_alpha20131013'
Thread model: posix
gcc version 4.9.0-alpha20131013 20131013 (experimental) (Gentoo
4.9.0_alpha20131013)

## Compiler invocation / steps to reproduce ###################################

arm-none-eabi-gcc -O0 -S -fverbose-asm -o olevel0.s -c testcase.c
arm-none-eabi-gcc -O1 -fno-aggressive-loop-optimizations -fno-auto-inc-dec \
-fno-branch-count-reg -fno-combine-stack-adjustments -fno-common \
-fno-compare-elim -fno-cprop-registers -fno-defer-pop \
-fno-delete-null-pointer-checks -fno-dwarf2-cfi-asm -fno-early-inlining \
-fno-eliminate-unused-debug-types -fno-forward-propagate -fno-function-cse \
-fno-gcse-lm -fno-guess-branch-probability -fno-ident -fno-if-conversion \
-fno-if-conversion2 -fno-inline-atomics -fno-ipa-profile -fno-ipa-pure-const \
-fno-ipa-reference -fno-ira-hoist-pressure -fno-ira-share-save-slots \
-fno-ira-share-spill-slots -fno-ivopts -fno-keep-static-consts \
-fno-leading-underscore -fno-math-errno -fno-merge-constants \
-fno-merge-debug-strings -fno-move-loop-invariants -fno-peephole \
-fno-prefetch-loop-arrays -fno-reg-struct-return \
-fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic \
-fno-sched-group-heuristic -fno-sched-interblock -fno-sched-last-insn-heuristic \
-fno-sched-pressure -fno-sched-rank-heuristic -fno-sched-spec \
-fno-sched-spec-insn-heuristic -fno-sched-stalled-insns-dep \
-fno-section-anchors -fno-show-column -fno-shrink-wrap -fno-signed-zeros \
-fno-split-ivs-in-unroller -fno-split-wide-types -fno-strict-volatile-bitfields \
-fno-sync-libcalls -fno-toplevel-reorder -fno-trapping-math -fno-tree-bit-ccp \
-fno-tree-ccp -fno-tree-ch -fno-tree-coalesce-vars -fno-tree-copy-prop \
-fno-tree-copyrename -fno-tree-cselim -fno-tree-dce -fno-tree-dominator-opts \
-fno-tree-dse -fno-tree-forwprop -fno-tree-fre -fno-tree-loop-if-convert \
-fno-tree-loop-im -fno-tree-loop-ivcanon -fno-tree-loop-optimize \
-fno-tree-phiprop -fno-tree-pta -fno-tree-reassoc -fno-tree-scev-cprop \
-fno-tree-sink -fno-tree-slp-vectorize -fno-tree-slsr -fno-tree-sra \
-fno-tree-ter -fno-tree-vect-loop-version -fno-unit-at-a-time \
-fno-zero-initialized-in-bss -S -fverbose-asm -o olevel1-all-disabled.s -c \
testcase.c
arm-none-eabi-gcc -O1 -S -fverbose-asm -o olevel1.s -c testcase.c
arm-none-eabi-gcc -O1 -fexpensive-optimizations -S -fverbose-asm -o olevel2.s \
-c testcase.c
arm-none-eabi-gcc -Os -S -fverbose-asm -o olevels.s -c testcase.c

gcc -O0 -S -fverbose-asm -o olevel0-x86.s -c testcase.c
gcc -O1 -S -fverbose-asm -o olevel1-x86.s -c testcase.c
gcc -O1 -fexpensive-optimizations -S -fverbose-asm -o olevel2-x86.s -c \
testcase.c

## Observations & expected results ############################################

When adding "-Wall -Wextra" the compiler does not emit any warnings for any of
the commands above.

The ARM commands above generate 5 assembly files with different optimization
levels:
1) olevel0.s at -O0
2) olevel1-all-disabled.s at -O1 with -fno-* for every -f* listed in olevel1.s
3) olevel1.s at -O1
4) olevel2.s at -O1 with -fexpensive-optimizations because it is the only -f*
option additionally enabled with -O2 that causes the superfluous clear to be
emitted
5) olevels.s at -Os

The first interesting observation is that the superfluous clear is generated at
-O0 ("olevel0.s"). The generated code for "clear_set" and "set_clear" is
exactly the same, excluding comments/annotations (stripped for clarity):

 1:     str     fp, [sp, #-4]!
 2:     add     fp, sp, #0
 3:     sub     sp, sp, #12
 4:     str     r0, [fp, #-8]
 5:     ldr     r3, [fp, #-8]
 6:     bic     r3, r3, #4194304
 7:     bic     r3, r3, #2
 8:     orr     r3, r3, #2
 9:     mov     r0, r3
10:     sub     sp, fp, #0
11:     ldr     fp, [sp], #4
12:     bx      lr

Just for the record: bic = clear bits in the specified mask, 0x400000 =
4194304. If we take a look at lines 6-8, the following happens:
6: A &= ~CLEAR
7: A &= ~SET
8: A |= SET
The clear instruction at line 7 is superfluous.
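
That line 7 is redundant follows from the identity (x & ~SET) | SET == x | SET.
The small C check below is written only as an illustration (the function names
are not part of the test case); it compares the three-instruction sequence with
the two-instruction one on a few sample inputs.

```c
#include <assert.h>
#include <stdint.h>

enum masks { CLEAR = 0x400000, SET = 0x02 };

/* Mirrors lines 6-8 of the -O0 listing: bic, bic, orr. */
static uint32_t with_extra_clear(uint32_t a)
{
    a &= ~(uint32_t)CLEAR;
    a &= ~(uint32_t)SET;
    a |= (uint32_t)SET;
    return a;
}

/* The same computation without the superfluous clear: bic, orr. */
static uint32_t without_extra_clear(uint32_t a)
{
    a &= ~(uint32_t)CLEAR;
    a |= (uint32_t)SET;
    return a;
}

/* Asserts that both sequences agree on a handful of sample inputs. */
void check_sequences_agree(void)
{
    static const uint32_t samples[] = {
        0u, 0x02u, 0x400000u, 0x400002u, 0x12345678u, 0xFFFFFFFFu
    };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(with_extra_clear(samples[i]) == without_extra_clear(samples[i]));
}
```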

Looking at the "*_inline" versions at "-O0" doesn't make sense since the
compiler does not perform inlining at "-O0".

At "-O1", the code does not contain any superfluous instructions ("cat
olevel1.s  | grep -vE '^\s*[@\.]' | sed -E 's/\s*@.*$//'"):

clear_set:
        bic     r0, r0, #4194304
        orr     r0, r0, #2
        bx      lr
set_clear:
        bic     r0, r0, #4194304
        orr     r0, r0, #2
        bx      lr
clear:
        bic     r0, r0, #4194304
        bx      lr
set:
        orr     r0, r0, #2
        bx      lr
clear_set_inline:
        bic     r0, r0, #4194304
        orr     r0, r0, #2
        bx      lr
set_clear_inline:
        bic     r0, r0, #4194304
        orr     r0, r0, #2
        bx      lr

Note that "clear_set", "set_clear", "clear_set_inline" and "set_clear_inline"
are all identical. I have been unable to find out which particular optimization
prevents the superfluous instruction from being emitted. I have tried to
reverse all "-f*" optimization parameters enabled by "-O1" as found in
"olevel1.s", but as should be visible from comparing "olevel1-all-disabled.s"
and "olevel1.s" there is no difference in "clear_set" and "set_clear" other
than in comments/annotations.

At "-O2" the superfluous clear is emitted once again. I have gone through the
optimizations enabled by -O2 which are not enabled by -O1 by comparing the
(verbose) assembly output. I have tracked down the cause to
"-fexpensive-optimizations". Enabling this optimization flag in combination
with -O1 results in the following output ("cat olevel2.s | grep -vE '^\s*[@\.]'
| sed -E 's/\s*@.*$//'"):

clear_set:
        bic     r0, r0, #4194304
        bic     r0, r0, #2
        orr     r0, r0, #2
        bx      lr
set_clear:
        bic     r0, r0, #4194304
        bic     r0, r0, #2
        orr     r0, r0, #2
        bx      lr
clear:
        bic     r0, r0, #4194304
        bx      lr
set:
        orr     r0, r0, #2
        bx      lr
clear_set_inline:
        bic     r0, r0, #4194304
        orr     r0, r0, #2
        bx      lr
set_clear_inline:
        bic     r0, r0, #4194304
        orr     r0, r0, #2
        bx      lr

Note that the "*_inline" functions do not contain the superfluous clear
instruction. The "-Os" case generates exactly the same instructions. Since
"-Os" also enables "-fexpensive-optimizations" as can be seen from "olevels.s",
the cause of the extra instruction at "-Os" is most likely the same as for
"-O2".

The above examples all use the ARM 32-bit ISA. By providing the arguments
"-march=armv7-m -mthumb" to the cross-compiler the exact same behaviour can be
observed for Thumb.

On x86_64, the following happens with "-O0" ("cat olevel0-x86.s | grep -vE
'^\s*[@\.]' | sed -E 's/\s*#.*$//'") in both "clear_set" and "set_clear":

1:      pushq   %rbp
2:      movq    %rsp, %rbp
3:      movl    %edi, -4(%rbp)
4:      movl    -4(%rbp), %eax
5:      andl    $-4194307, %eax
6:      orl     $2, %eax
7:      popq    %rbp
8:      ret

At line 5, the constant -4194307 is 0xFFBFFFFD in hex, so ~(CLEAR | SET). Once
again, the bit that is set at line 6 with the "orl" instruction is first
cleared in the "andl" instruction.
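
The "andl" immediate can be cross-checked against the masks from the test
case. The snippet below is only a sanity check; it also computes, for
comparison, the mask that clears nothing but CLEAR. It relies on the fact that
converting a negative int to uint32_t is well defined in C (modulo 2^32). The
function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

enum masks { CLEAR = 0x400000, SET = 0x02 };

/* Mask that clears both CLEAR and SET, as used by the andl above. */
uint32_t mask_clear_and_set(void) { return ~(uint32_t)(CLEAR | SET); }

/* Mask that clears only CLEAR, which is all the source code asks for. */
uint32_t mask_clear_only(void) { return ~(uint32_t)CLEAR; }

void check_x86_constants(void)
{
    assert(mask_clear_and_set() == 0xFFBFFFFDu);
    assert(mask_clear_and_set() == (uint32_t)(-4194307));
    assert(mask_clear_only() == 0xFFBFFFFFu);
    assert(mask_clear_only() == (uint32_t)(-4194305));
}
```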

Surprisingly, at "-O1" on x86_64, the superfluous clear remains in place. In
the inlined version the mask is correct, however ("cat olevel1-x86.s | grep -vE
'^\s*[@\.]' | sed -E 's/\s*#.*$//'"):

clear_set:
        movl    %edi, %eax
        andl    $-4194307, %eax
        orl     $2, %eax
        ret
set_clear:
        movl    %edi, %eax
        andl    $-4194307, %eax
        orl     $2, %eax
        ret
clear:
        movl    %edi, %eax
        andl    $-4194305, %eax
        ret
set:
        movl    %edi, %eax
        orl     $2, %eax
        ret
clear_set_inline:
        movl    %edi, %eax
        andl    $-4194305, %eax
        orl     $2, %eax
        ret
set_clear_inline:
        movl    %edi, %eax
        andl    $-4194305, %eax
        orl     $2, %eax
        ret

The same behaviour is observed with "-Os", "-O2" and "-O3".

I guess something internal in GCC is causing the bit to be cleared before it is
set even though there is absolutely no valid reason to do so from the
perspective of the program. While on x86_64 it probably does not affect code
performance, on ARM it adversely affects code size and perhaps performance
(untested).



* [Bug rtl-optimization/58727] Sub-optimal code for bit clear/set sequence
  2013-10-14 20:01 [Bug tree-optimization/58727] New: Sub-optimal code for bit clear/set sequence niels at penneman dot org
@ 2013-10-14 20:18 ` pinskia at gcc dot gnu.org
  2013-10-14 21:43 ` niels at penneman dot org
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2013-10-14 20:18 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58727

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
          Component|tree-optimization           |rtl-optimization

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I think the ARM and x86 issues are really two different issues and most likely
two target specific issues.



* [Bug rtl-optimization/58727] Sub-optimal code for bit clear/set sequence
  2013-10-14 20:01 [Bug tree-optimization/58727] New: Sub-optimal code for bit clear/set sequence niels at penneman dot org
  2013-10-14 20:18 ` [Bug rtl-optimization/58727] " pinskia at gcc dot gnu.org
@ 2013-10-14 21:43 ` niels at penneman dot org
  2013-10-14 21:49 ` [Bug tree-optimization/58727] " pinskia at gcc dot gnu.org
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: niels at penneman dot org @ 2013-10-14 21:43 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58727

--- Comment #2 from Niels Penneman <niels at penneman dot org> ---
You could be right about x86 being a different issue, since the superfluous
clear is there for every single optimization level that I have tested.

In that case, for the sake of completeness:
- ARM results have also been verified on 4.6.1 and 4.7.2;
- x86_64 results have also been verified on 4.8.1.

Looking at the list of GCC primary targets, I also see MIPS and PowerPC. Here
is what happens on MIPS.

## GCC version (MIPS) #########################################################

$ mips-none-elf-gcc -###
Using built-in specs.
COLLECT_GCC=mips-none-elf-gcc
COLLECT_LTO_WRAPPER=/path/to/toolchain/libexec/gcc/mips-none-elf/4.8.1/lto-wrapper
Target: mips-none-elf
Configured with: /path/to/builddir/src/gcc-4.8.1/configure
--build=x86_64-build_unknown-linux-gnu --host=x86_64-build_unknown-linux-gnu
--target=mips-none-elf --prefix=/path/to/toolchain
--with-local-prefix=/path/to/toolchain/mips-none-elf/sysroot
--disable-libmudflap --with-sysroot=/path/to/toolchain/mips-none-elf/sysroot
--with-newlib --enable-threads=no --disable-shared
--with-pkgversion='crosstool-NG 1.19.0' --with-abi=32 --with-float=soft
--disable-__cxa_atexit --with-gmp=/path/to/builddir/mips-none-elf/buildtools
--with-mpfr=/path/to/builddir/mips-none-elf/buildtools
--with-mpc=/path/to/builddir/mips-none-elf/buildtools --with-ppl=no
--with-isl=no --with-cloog=no --with-libelf=no --disable-lto
--with-host-libstdcxx='-static-libgcc -Wl,-Bstatic,-lstdc++,-Bdynamic -lm'
--enable-target-optspace --disable-libgomp --disable-libmudflap --disable-nls
--disable-multilib --enable-languages=c
Thread model: single
gcc version 4.8.1 (crosstool-NG 1.19.0) 

$ mips64-none-elf-gcc -###
Using built-in specs.
COLLECT_GCC=./mips64-none-elf-gcc
COLLECT_LTO_WRAPPER=/path/to/toolchain/libexec/gcc/mips64-none-elf/4.8.1/lto-wrapper
Target: mips64-none-elf
Configured with: /path/to/builddir/src/gcc-4.8.1/configure
--build=x86_64-build_unknown-linux-gnu --host=x86_64-build_unknown-linux-gnu
--target=mips64-none-elf --prefix=/path/to/toolchain
--with-local-prefix=/path/to/toolchain/mips64-none-elf/sysroot
--disable-libmudflap --with-sysroot=/path/to/toolchain/mips64-none-elf/sysroot
--with-newlib --enable-threads=no --disable-shared
--with-pkgversion='crosstool-NG 1.19.0' --with-abi=64 --with-float=soft
--disable-__cxa_atexit --with-gmp=/path/to/builddir/mips64-none-elf/buildtools
--with-mpfr=/path/to/builddir/mips64-none-elf/buildtools
--with-mpc=/path/to/builddir/mips64-none-elf/buildtools --with-ppl=no
--with-isl=no --with-cloog=no --with-libelf=no --disable-lto
--with-host-libstdcxx='-static-libgcc -Wl,-Bstatic,-lstdc++,-Bdynamic -lm'
--enable-target-optspace --disable-libgomp --disable-libmudflap --disable-nls
--disable-multilib --enable-languages=c
Thread model: single
gcc version 4.8.1 (crosstool-NG 1.19.0) 

## Compiler invocation / steps to reproduce ###################################

mips-none-elf-gcc -O1 -S -o - -c testcase.c | grep -vE '^\s*\.'
mips64-none-elf-gcc -O1 -S -o - -c testcase.c | grep -vE '^\s*\.'

## Observations & expected results ############################################

The instructions generated for the "clear_set" and "set_clear" functions are
identical:

MIPS32 (o32 ABI):
        li      $2,-4259840                     # 0xffffffffffbf0000
        ori     $2,$2,0xfffd
        and     $2,$4,$2
        j       $31
        ori     $2,$2,0x2

MIPS64 (n64 ABI):
        li      $2,-4259840                     # 0xffffffffffbf0000
        ori     $2,$2,0xfffd
        and     $4,$4,$2
        j       $31
        ori     $2,$4,0x2

Just to compare to the "*_inline" variants (also both identical):

MIPS32:
        li      $2,-4259840                     # 0xffffffffffbf0000
        ori     $2,$2,0xffff
        and     $2,$4,$2
        j       $31
        ori     $2,$2,0x2

MIPS64:
        li      $2,-4259840                     # 0xffffffffffbf0000
        ori     $2,$2,0xffff
        and     $4,$4,$2
        j       $31
        ori     $2,$4,0x2

Compare: in the non-inlined functions the mask used to clear bits is
0xffbffffd, while with the inlined split operations the mask is 0xffbfffff.
The SET bit also appears in the mask to be cleared when using "-O0", "-O2",
"-O3" and "-Os", much like on x86. Once again, judging from the "clear"
function I doubt any superfluous instruction is emitted here, so there are
probably no run-time effects. I only have limited knowledge of MIPS so correct
me if I am wrong.

For your reference, here is the "clear" function (same code at all optimization
levels other than "-O0"):

MIPS32/64:
        li      $2,-4259840                     # 0xffffffffffbf0000
        ori     $2,$2,0xffff
        j       $31
        and     $2,$4,$2

It apparently needs the extra "ori" instruction regardless of whether "SET"
appears in the mask or not.
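
Why the extra "ori" is needed either way can be sketched as follows. The
function below is a simplified model of how many instructions it takes to load
a 32-bit constant on MIPS (one for a 16-bit immediate or a value whose lower
half is zero, otherwise an upper-half load followed by "ori"). It is an
illustration only, not GCC's cost code, and the name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model: number of instructions needed to materialize a
   32-bit constant in a MIPS register. Not taken from GCC. */
int mips32_const_insns(uint32_t v)
{
    if (v <= 0xFFFFu)          /* fits a zero-extended 16-bit immediate */
        return 1;
    if (v >= 0xFFFF8000u)      /* fits a sign-extended 16-bit immediate */
        return 1;
    if ((v & 0xFFFFu) == 0)    /* only the upper half is non-zero */
        return 1;
    return 2;                  /* upper half first, then ori for the rest */
}

void check_mips_masks(void)
{
    assert(mips32_const_insns(0xFFBF0000u) == 1);  /* the "li" shown above */
    assert(mips32_const_insns(0xFFBFFFFFu) == 2);  /* ~CLEAR */
    assert(mips32_const_insns(0xFFBFFFFDu) == 2);  /* ~(CLEAR | SET) */
}
```

Under this model both masks cost the same two instructions, which matches the
listings above.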

In the end, the problem with ARM does look different since it does not occur
with -O1; perhaps some optimization is able to undo the superfluous clear, but
this optimization is in its turn undone by -fexpensive-optimizations.



* [Bug tree-optimization/58727] Sub-optimal code for bit clear/set sequence
  2013-10-14 20:01 [Bug tree-optimization/58727] New: Sub-optimal code for bit clear/set sequence niels at penneman dot org
  2013-10-14 20:18 ` [Bug rtl-optimization/58727] " pinskia at gcc dot gnu.org
  2013-10-14 21:43 ` niels at penneman dot org
@ 2013-10-14 21:49 ` pinskia at gcc dot gnu.org
  2013-10-14 22:08 ` pinskia at gcc dot gnu.org
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2013-10-14 21:49 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58727

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|rtl-optimization            |tree-optimization

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I see the issue:
clear_set produces:
  D.1381_2 = a_1(D) & 0xffbffffd;
  D.1380_3 = D.1381_2 | 2;

While clear_set_inline produces:
  D.1381_2 = a_1(D) & 0xffbfffff;
  D.1380_3 = D.1381_2 | 2;

The lower two bits make it easier for GCC to optimize.
I think the issue is in fold-const.c, as the .original dump already includes the
~0x2 in the AND part.



* [Bug tree-optimization/58727] Sub-optimal code for bit clear/set sequence
  2013-10-14 20:01 [Bug tree-optimization/58727] New: Sub-optimal code for bit clear/set sequence niels at penneman dot org
                   ` (2 preceding siblings ...)
  2013-10-14 21:49 ` [Bug tree-optimization/58727] " pinskia at gcc dot gnu.org
@ 2013-10-14 22:08 ` pinskia at gcc dot gnu.org
  2013-10-15  7:06 ` jakub at gcc dot gnu.org
  2013-10-15  8:35 ` niels at penneman dot org
  5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2013-10-14 22:08 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58727

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2013-10-14
     Ever confirmed|0                           |1

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
It comes from:

      /* Minimize the number of bits set in C1, i.e. C1 := C1 & ~C2,
         unless (C1 & ~C2) | (C2 & C3) for some C3 is a mask of some
         mode which allows further optimizations.  */
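
In terms of the test case, this fold rewrites (X & C1) | C2 with C1 = ~CLEAR
and C2 = SET. The sketch below models only the basic C1 := C1 & ~C2 step named
in the comment and leaves out the exception for mode masks. It is not the
fold-const.c code, and the function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of the canonicalization of (X & C1) | C2: bits of C1
   that are also set in C2 cannot affect the result, so they are dropped
   from C1. The "mask of some mode" exception is deliberately omitted. */
uint32_t minimize_and_mask(uint32_t c1, uint32_t c2)
{
    return c1 & ~c2;
}

void check_fold_example(void)
{
    const uint32_t c1 = ~(uint32_t)0x400000;  /* ~CLEAR */
    const uint32_t c2 = 0x02u;                /* SET */
    const uint32_t x = 0xDEADBEEFu;           /* arbitrary input value */

    /* The minimized mask is the constant seen for clear_set. */
    assert(minimize_and_mask(c1, c2) == 0xFFBFFFFDu);
    /* The value of the whole expression is unchanged by the rewrite. */
    assert(((x & c1) | c2) == ((x & minimize_and_mask(c1, c2)) | c2));
}
```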



* [Bug tree-optimization/58727] Sub-optimal code for bit clear/set sequence
  2013-10-14 20:01 [Bug tree-optimization/58727] New: Sub-optimal code for bit clear/set sequence niels at penneman dot org
                   ` (3 preceding siblings ...)
  2013-10-14 22:08 ` pinskia at gcc dot gnu.org
@ 2013-10-15  7:06 ` jakub at gcc dot gnu.org
  2013-10-15  8:35 ` niels at penneman dot org
  5 siblings, 0 replies; 7+ messages in thread
From: jakub at gcc dot gnu.org @ 2013-10-15  7:06 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58727

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
I guess once we move (or duplicate) this optimization to be performed at GIMPLE
in some pass (gimple_fold, forwprop, something else), perhaps after IPA, we
could add some target hook reporting how expensive a constant is, and then try
different variants and see which would be cheapest.  Without that, what GCC does
now is right: generally, for most targets, the fewer bits set in a constant the
better, if the target cares at all.
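
One way to picture this: since the bits of C2 are "don't care" in C1, both
C1 & ~C2 and C1 | C2 are valid AND masks for (X & C1) | C2, and a per-target
cost function could choose between them. The sketch below is purely
hypothetical; the hook type, both cost models and all names are assumptions
made for illustration and are not existing GCC interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical hook: relative cost of using `mask` as an AND operand. */
typedef int (*and_mask_cost_fn)(uint32_t mask);

/* Count set bits with a portable loop. */
static int popcount32(uint32_t v)
{
    int n = 0;
    for (; v; v &= v - 1)
        n++;
    return n;
}

/* Assumed model for AND-style targets: fewer set bits is cheaper. */
static int cost_and_style(uint32_t mask) { return popcount32(mask); }

/* Assumed model for BIC-style targets: fewer bits to clear is cheaper. */
static int cost_bic_style(uint32_t mask) { return popcount32(~mask); }

/* For (X & c1) | c2, choose between the two extreme masks. Ties keep
   the minimized mask, i.e. the current behaviour. */
uint32_t pick_and_mask(uint32_t c1, uint32_t c2, and_mask_cost_fn cost)
{
    uint32_t fewest_set = c1 & ~c2;
    uint32_t most_set = c1 | c2;
    return cost(most_set) < cost(fewest_set) ? most_set : fewest_set;
}

void check_pick_and_mask(void)
{
    const uint32_t c1 = ~(uint32_t)0x400000;  /* ~CLEAR */
    const uint32_t c2 = 0x02u;                /* SET */

    assert(pick_and_mask(c1, c2, cost_and_style) == 0xFFBFFFFDu);
    assert(pick_and_mask(c1, c2, cost_bic_style) == 0xFFBFFFFFu);
}
```

A real hook would need a much better cost model than a bit count; the point is
only that the choice of mask becomes a target decision.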



* [Bug tree-optimization/58727] Sub-optimal code for bit clear/set sequence
  2013-10-14 20:01 [Bug tree-optimization/58727] New: Sub-optimal code for bit clear/set sequence niels at penneman dot org
                   ` (4 preceding siblings ...)
  2013-10-15  7:06 ` jakub at gcc dot gnu.org
@ 2013-10-15  8:35 ` niels at penneman dot org
  5 siblings, 0 replies; 7+ messages in thread
From: niels at penneman dot org @ 2013-10-15  8:35 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58727

--- Comment #6 from Niels Penneman <niels at penneman dot org> ---
This may sound like a silly question but I'd like to fully understand what's
going on.

Are you saying that the problem is that on most targets the instruction emitted
for the clear is a bitwise AND, such that the parameter to that instruction is
a mask with all bits set except the ones to be cleared. To reduce the size of
the mask it is better to unset as many bits as possible. However, on ARM the
clear operation is mapped to "bitwise clear" (BIC) rather than to a bitwise
AND, such that the mask is inverted and the reasoning should be the opposite:
clear as few bits as possible. Is that it?
