From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 18314 invoked by alias); 21 Jun 2009 16:11:32 -0000
Received: (qmail 18260 invoked by uid 48); 21 Jun 2009 16:11:15 -0000
Date: Sun, 21 Jun 2009 16:11:00 -0000
Message-ID: <20090621161115.18259.qmail@sourceware.org>
X-Bugzilla-Reason: CC
References:
Subject: [Bug target/30354] -Os doesn't optimize a/CONST even if it saves size.
In-Reply-To:
Reply-To: gcc-bugzilla@gcc.gnu.org
To: gcc-bugs@gcc.gnu.org
From: "vda dot linux at googlemail dot com"
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id:
List-Archive:
List-Post:
List-Help:
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2009-06/txt/msg01506.txt.bz2

------- Comment #8 from vda dot linux at googlemail dot com 2009-06-21 16:11 -------
(In reply to comment #7)
> It seems to make sense to bump cost of idiv a bit, given the fact that there
> are register pressure implications.
>
> I would like to however understand what code sequences we produce that are
> estimated to be long but ends up being shorter in practice. Would be possible
> to try to give me some examples of constants where it is important to bump cost
> to 8? It is possible we can simply fix cost estimation in divmod expansion
> instead.

The attached t.c.bz2 is a good source file to experiment with. With last
month's svn snapshot of gcc, I did the following:

/usr/app/gcc-4.4.svn.20090528/bin/gcc -g0 -Os -fomit-frame-pointer -ffunction-sections -c t.c
objdump -dr t.o >t.asm

with and without the patch, and compared the results. (-ffunction-sections
is used merely because it makes "objdump -dr" output much more suitable
for diffing.)
Here is the diff between unpatched and patched gcc's code generated for
int_x / 16:

Disassembly of section .text.id_x_16:

0000000000000000 :
-   0:	89 f8                	mov    %edi,%eax
-   2:	ba 10 00 00 00       	mov    $0x10,%edx
-   7:	89 d1                	mov    %edx,%ecx
-   9:	99                   	cltd
-   a:	f7 f9                	idiv   %ecx
-   c:	c3                   	retq
+   0:	8d 47 0f             	lea    0xf(%rdi),%eax
+   3:	85 ff                	test   %edi,%edi
+   5:	0f 49 c7             	cmovns %edi,%eax
+   8:	c1 f8 04             	sar    $0x4,%eax
+   b:	c3                   	retq

int_x / 2:

Disassembly of section .text.id_x_2:

0000000000000000 :
    0:	89 f8                	mov    %edi,%eax
-   2:	ba 02 00 00 00       	mov    $0x2,%edx
-   7:	89 d1                	mov    %edx,%ecx
-   9:	99                   	cltd
-   a:	f7 f9                	idiv   %ecx
-   c:	c3                   	retq
+   2:	c1 e8 1f             	shr    $0x1f,%eax
+   5:	01 f8                	add    %edi,%eax
+   7:	d1 f8                	sar    %eax
+   9:	c3                   	retq

As you can see, the code becomes smaller and *much* faster (not even a mul
insn is needed now).

Here is an example of unsigned_x / 641. In this case, code size is the
same, but the code is faster:

Disassembly of section .text.ud_x_641:

0000000000000000 :
-   0:	ba 81 02 00 00       	mov    $0x281,%edx
-   5:	89 f8                	mov    %edi,%eax
-   7:	89 d1                	mov    %edx,%ecx
-   9:	31 d2                	xor    %edx,%edx
-   b:	f7 f1                	div    %ecx
+   0:	89 f8                	mov    %edi,%eax
+   2:	48 69 c0 81 3d 66 00 	imul   $0x663d81,%rax,%rax
+   9:	48 c1 e8 20          	shr    $0x20,%rax
    d:	c3                   	retq

There is not a single instance of code growth. Either newer gcc is better,
or maybe the code growth cases occur in 32-bit code only. I will attach
t64.asm.diff; take a look if you want to see all changes in the generated
code.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=30354