* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
@ 2009-04-28 13:42 ` pinskia at gcc dot gnu dot org
2009-04-28 17:05 ` vvv at ru dot ru
` (50 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2009-04-28 13:42 UTC (permalink / raw)
To: gcc-bugs
------- Comment #1 from pinskia at gcc dot gnu dot org 2009-04-28 13:42 -------
Can you provide the preprocessed source which contains set_blitting_type?
--
pinskia at gcc dot gnu dot org changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |WAITING
Component|c |target
GCC target triplet| |x86_64-linux-gnu
Keywords| |missed-optimization
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
2009-04-28 13:42 ` [Bug target/39942] " pinskia at gcc dot gnu dot org
@ 2009-04-28 17:05 ` vvv at ru dot ru
2009-04-28 17:10 ` vvv at ru dot ru
` (49 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-04-28 17:05 UTC (permalink / raw)
To: gcc-bugs
------- Comment #2 from vvv at ru dot ru 2009-04-28 17:04 -------
Created an attachment (id=17776)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17776&action=view)
Source file from Linux Kernel 2.6.29.1
See static void set_blitting_type
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
2009-04-28 13:42 ` [Bug target/39942] " pinskia at gcc dot gnu dot org
2009-04-28 17:05 ` vvv at ru dot ru
@ 2009-04-28 17:10 ` vvv at ru dot ru
2009-04-28 17:15 ` vvv at ru dot ru
` (48 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-04-28 17:10 UTC (permalink / raw)
To: gcc-bugs
------- Comment #3 from vvv at ru dot ru 2009-04-28 17:10 -------
Additional examples from Linux Kernel 2.6.29.1:
(Note: every function ends with a conditional statement!)
=================
linux/drivers/video/console/bitblit.c
void fbcon_set_bitops(struct fbcon_ops *ops)
{
ops->bmove = bit_bmove;
ops->clear = bit_clear;
ops->putcs = bit_putcs;
ops->clear_margins = bit_clear_margins;
ops->cursor = bit_cursor;
ops->update_start = bit_update_start;
ops->rotate_font = NULL;
if (ops->rotate)
fbcon_set_rotate(ops);
}
================
ffffffff8020a5e0 <disable_TSC>:
ffffffff8020a5e0: 55 push %rbp
ffffffff8020a5e1: bf 01 00 00 00 mov $0x1,%edi
ffffffff8020a5e6: 48 89 e5 mov %rsp,%rbp
ffffffff8020a5e9: e8 c2 fd 35 00 callq ffffffff8056a3b0
<add_preempt_count>
ffffffff8020a5ee: 65 48 8b 04 25 10 00 mov %gs:0x10,%rax
ffffffff8020a5f5: 00 00
ffffffff8020a5f7: 48 2d c8 1f 00 00 sub $0x1fc8,%rax
ffffffff8020a5fd: f0 0f ba 28 10 lock btsl $0x10,(%rax)
ffffffff8020a602: 19 d2 sbb %edx,%edx
ffffffff8020a604: 85 d2 test %edx,%edx
ffffffff8020a606: 75 0a jne ffffffff8020a612
<disable_TSC+0x32>
ffffffff8020a608: 0f 20 e0 mov %cr4,%rax
ffffffff8020a60b: 48 83 c8 04 or $0x4,%rax
ffffffff8020a60f: 0f 22 e0 mov %rax,%cr4
ffffffff8020a612: bf 01 00 00 00 mov $0x1,%edi
ffffffff8020a617: e8 e4 fc 35 00 callq ffffffff8056a300
<sub_preempt_count>
ffffffff8020a61c: 65 48 8b 04 25 10 00 mov %gs:0x10,%rax
ffffffff8020a623: 00 00
ffffffff8020a625: f6 80 38 e0 ff ff 08 testb $0x8,-0x1fc8(%rax)
ffffffff8020a62c: 75 02 jne ffffffff8020a630
<disable_TSC+0x50>
ffffffff8020a62e: c9 leaveq
ffffffff8020a62f: c3 retq
ffffffff8020a630: e8 2b 99 35 00 callq ffffffff80563f60
<preempt_schedule>
ffffffff8020a635: c9 leaveq
ffffffff8020a636: 66 90 xchg %ax,%ax
ffffffff8020a638: c3 retq
==================
/arch/x86/kernel/io_delay.c
void native_io_delay(void)
{
switch (io_delay_type) {
default:
case CONFIG_IO_DELAY_TYPE_0X80:
asm volatile ("outb %al, $0x80");
break;
case CONFIG_IO_DELAY_TYPE_0XED:
asm volatile ("outb %al, $0xed");
break;
case CONFIG_IO_DELAY_TYPE_UDELAY:
/*
* 2 usecs is an upper-bound for the outb delay but
* note that udelay doesn't have the bus-level
* side-effects that outb does, nor does udelay() have
* precise timings during very early bootup (the delays
* are shorter until calibrated):
*/
udelay(2);
case CONFIG_IO_DELAY_TYPE_NONE:
break;
}
}
EXPORT_SYMBOL(native_io_delay);
ffffffff802131e0 <native_io_delay>:
ffffffff802131e0: 55 push %rbp
ffffffff802131e1: 8b 05 3d b3 54 00 mov 0x54b33d(%rip),%eax
# ffffffff8075e524 <io_delay_type>
ffffffff802131e7: 48 89 e5 mov %rsp,%rbp
ffffffff802131ea: 83 f8 02 cmp $0x2,%eax
ffffffff802131ed: 74 29 je ffffffff80213218
<native_io_delay+0x38>
ffffffff802131ef: 83 f8 03 cmp $0x3,%eax
ffffffff802131f2: 74 06 je ffffffff802131fa
<native_io_delay+0x1a>
ffffffff802131f4: ff c8 dec %eax
ffffffff802131f6: 74 10 je ffffffff80213208
<native_io_delay+0x28>
ffffffff802131f8: e6 80 out %al,$0x80
ffffffff802131fa: c9 leaveq
ffffffff802131fb: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
ffffffff80213200: c3 retq
ffffffff80213201: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
ffffffff80213208: e6 ed out %al,$0xed
ffffffff8021320a: c9 leaveq
ffffffff8021320b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
ffffffff80213210: c3 retq
ffffffff80213211: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
ffffffff80213218: bf 8e 21 00 00 mov $0x218e,%edi
ffffffff8021321d: 0f 1f 00 nopl (%rax)
ffffffff80213220: e8 fb ac 1e 00 callq ffffffff803fdf20
<__const_udelay>
ffffffff80213225: c9 leaveq
ffffffff80213226: 66 90 xchg %ax,%ax
ffffffff80213228: c3 retq
===============
arch/x86/mm/ioremap.c
int ioremap_change_attr(unsigned long vaddr, unsigned long size,
unsigned long prot_val)
{
unsigned long nrpages = size >> PAGE_SHIFT;
int err;
switch (prot_val) {
case _PAGE_CACHE_UC:
default:
err = _set_memory_uc(vaddr, nrpages);
break;
case _PAGE_CACHE_WC:
err = _set_memory_wc(vaddr, nrpages);
break;
case _PAGE_CACHE_WB:
err = _set_memory_wb(vaddr, nrpages);
break;
}
return err;
}
ffffffff8022df60 <ioremap_change_attr>:
ffffffff8022df60: 55 push %rbp
ffffffff8022df61: 48 c1 ee 0c shr $0xc,%rsi
ffffffff8022df65: 48 89 e5 mov %rsp,%rbp
ffffffff8022df68: 48 85 d2 test %rdx,%rdx
ffffffff8022df6b: 75 0b jne ffffffff8022df78
<ioremap_change_attr+0x18>
ffffffff8022df6d: e8 2e 18 00 00 callq ffffffff8022f7a0
<_set_memory_wb>
ffffffff8022df72: c9 leaveq
ffffffff8022df73: c3 retq
ffffffff8022df74: 0f 1f 40 00 nopl 0x0(%rax)
ffffffff8022df78: 48 83 fa 08 cmp $0x8,%rdx
ffffffff8022df7c: 0f 1f 40 00 nopl 0x0(%rax)
ffffffff8022df80: 74 16 je ffffffff8022df98
<ioremap_change_attr+0x38>
ffffffff8022df82: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
ffffffff8022df88: e8 53 19 00 00 callq ffffffff8022f8e0
<_set_memory_uc>
ffffffff8022df8d: c9 leaveq
ffffffff8022df8e: 66 90 xchg %ax,%ax
ffffffff8022df90: c3 retq
ffffffff8022df91: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
ffffffff8022df98: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
ffffffff8022df9f: 00
ffffffff8022dfa0: e8 0b 19 00 00 callq ffffffff8022f8b0
<_set_memory_wc>
ffffffff8022dfa5: c9 leaveq
ffffffff8022dfa6: 66 90 xchg %ax,%ax
ffffffff8022dfa8: c3 retq
ffffffff8022dfa9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
==============
kernel/sched.c
int sched_group_set_rt_period(struct task_group *tg, long rt_period_us)
{
u64 rt_runtime, rt_period;
rt_period = (u64)rt_period_us * NSEC_PER_USEC;
rt_runtime = tg->rt_bandwidth.rt_runtime;
if (rt_period == 0)
return -EINVAL;
return tg_set_bandwidth(tg, rt_period, rt_runtime);
}
ffffffff8023f810 <sched_group_set_rt_period>:
ffffffff8023f810: 55 push %rbp
ffffffff8023f811: 48 69 f6 e8 03 00 00 imul $0x3e8,%rsi,%rsi
ffffffff8023f818: 48 89 e5 mov %rsp,%rbp
ffffffff8023f81b: 48 8b 57 50 mov 0x50(%rdi),%rdx
ffffffff8023f81f: b8 ea ff ff ff mov $0xffffffea,%eax
ffffffff8023f824: 48 85 f6 test %rsi,%rsi
ffffffff8023f827: 75 07 jne ffffffff8023f830
<sched_group_set_rt_period+0x20>
ffffffff8023f829: c9 leaveq
ffffffff8023f82a: c3 retq
ffffffff8023f82b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
ffffffff8023f830: e8 bb fe ff ff callq ffffffff8023f6f0
<tg_set_bandwidth>
ffffffff8023f835: c9 leaveq
ffffffff8023f836: 66 90 xchg %ax,%ax
ffffffff8023f838: c3 retq
ffffffff8023f839: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
==============
kernel/sched_rt.c
static void pre_schedule_rt(struct rq *rq, struct task_struct *prev)
{
/* Try to pull RT tasks here if we lower this rq's prio */
if (unlikely(rt_task(prev)) && rq->rt.highest_prio > prev->prio)
pull_rt_task(rq);
}
static void switched_from_rt(struct rq *rq, struct task_struct *p,
int running)
{
/*
* If there are other RT tasks then we will reschedule
* and the scheduling of the other RT tasks will handle
* the balancing. But if we are the last RT task
* we may need to handle the pulling of RT tasks
* now.
*/
if (!rq->rt.rt_nr_running)
pull_rt_task(rq);
}
ffffffff802452b0 <switched_from_rt>:
ffffffff802452b0: 55 push %rbp
ffffffff802452b1: 48 89 e5 mov %rsp,%rbp
ffffffff802452b4: 48 83 bf 70 07 00 00 cmpq $0x0,0x770(%rdi)
ffffffff802452bb: 00
ffffffff802452bc: 74 02 je ffffffff802452c0
<switched_from_rt+0x10>
ffffffff802452be: c9 leaveq
ffffffff802452bf: c3 retq
ffffffff802452c0: e8 6b fd ff ff callq ffffffff80245030
<pull_rt_task>
ffffffff802452c5: c9 leaveq
ffffffff802452c6: 66 90 xchg %ax,%ax
ffffffff802452c8: c3 retq
ffffffff802452c9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
ffffffff802452d0 <pre_schedule_rt>:
ffffffff802452d0: 55 push %rbp
ffffffff802452d1: 8b 46 20 mov 0x20(%rsi),%eax
ffffffff802452d4: 48 89 e5 mov %rsp,%rbp
ffffffff802452d7: 83 f8 63 cmp $0x63,%eax
ffffffff802452da: 7f 08 jg ffffffff802452e4
<pre_schedule_rt+0x14>
ffffffff802452dc: 39 87 78 07 00 00 cmp %eax,0x778(%rdi)
ffffffff802452e2: 7f 0c jg ffffffff802452f0
<pre_schedule_rt+0x20>
ffffffff802452e4: c9 leaveq
ffffffff802452e5: c3 retq
ffffffff802452e6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
ffffffff802452ed: 00 00 00
ffffffff802452f0: e8 3b fd ff ff callq ffffffff80245030
<pull_rt_task>
ffffffff802452f5: c9 leaveq
ffffffff802452f6: 66 90 xchg %ax,%ax
ffffffff802452f8: c3 retq
ffffffff802452f9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
=============
linux/kernel/timer.c
void msleep(unsigned int msecs)
{
unsigned long timeout = msecs_to_jiffies(msecs) + 1;
while (timeout)
timeout = schedule_timeout_uninterruptible(timeout);
}
EXPORT_SYMBOL(msleep);
ffffffff80256120 <msleep>:
ffffffff80256120: 55 push %rbp
ffffffff80256121: 48 89 e5 mov %rsp,%rbp
ffffffff80256124: e8 a7 94 ff ff callq ffffffff8024f5d0
<msecs_to_jiffies>
ffffffff80256129: 48 89 c7 mov %rax,%rdi
ffffffff8025612c: 48 ff c7 inc %rdi
ffffffff8025612f: 74 14 je ffffffff80256145
<msleep+0x25>
ffffffff80256131: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
ffffffff80256138: e8 c3 ed 30 00 callq ffffffff80564f00
<schedule_timeout_uninterruptible>
ffffffff8025613d: 48 89 c7 mov %rax,%rdi
ffffffff80256140: 48 85 c0 test %rax,%rax
ffffffff80256143: 75 f3 jne ffffffff80256138
<msleep+0x18>
ffffffff80256145: c9 leaveq
ffffffff80256146: 66 90 xchg %ax,%ax
ffffffff80256148: c3 retq
ffffffff80256149: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
================
mm/shmem.c
static int shmem_xattr_security_get(struct inode *inode, const char *name,
void *buffer, size_t size)
{
if (strcmp(name, "") == 0)
return -EINVAL;
return xattr_getsecurity(inode, name, buffer, size);
}
static int shmem_xattr_security_set(struct inode *inode, const char *name,
const void *value, size_t size, int flags)
{
if (strcmp(name, "") == 0)
return -EINVAL;
return security_inode_setsecurity(inode, name, value, size, flags);
}
ffffffff802b9ff0 <shmem_xattr_security_set>:
ffffffff802b9ff0: 55 push %rbp
ffffffff802b9ff1: b8 ea ff ff ff mov $0xffffffea,%eax
ffffffff802b9ff6: 48 89 e5 mov %rsp,%rbp
ffffffff802b9ff9: 80 3e 00 cmpb $0x0,(%rsi)
ffffffff802b9ffc: 75 02 jne ffffffff802ba000
<shmem_xattr_security_set+0x10>
ffffffff802b9ffe: c9 leaveq
ffffffff802b9fff: c3 retq
ffffffff802ba000: e8 ab b1 0f 00 callq ffffffff803b51b0
<security_inode_setsecurity>
ffffffff802ba005: c9 leaveq
ffffffff802ba006: 66 90 xchg %ax,%ax
ffffffff802ba008: c3 retq
ffffffff802ba009: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
ffffffff802ba010 <shmem_xattr_security_get>:
ffffffff802ba010: 55 push %rbp
ffffffff802ba011: b8 ea ff ff ff mov $0xffffffea,%eax
ffffffff802ba016: 48 89 e5 mov %rsp,%rbp
ffffffff802ba019: 80 3e 00 cmpb $0x0,(%rsi)
ffffffff802ba01c: 75 02 jne ffffffff802ba020
<shmem_xattr_security_get+0x10>
ffffffff802ba01e: c9 leaveq
ffffffff802ba01f: c3 retq
ffffffff802ba020: e8 2b b5 04 00 callq ffffffff80305550
<xattr_getsecurity>
ffffffff802ba025: c9 leaveq
ffffffff802ba026: 66 90 xchg %ax,%ax
ffffffff802ba028: c3 retq
ffffffff802ba029: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
==========
linux/fs/file_table.c
void fput(struct file *file)
{
if (atomic_long_dec_and_test(&file->f_count))
__fput(file);
}
EXPORT_SYMBOL(fput);
ffffffff802e8da0 <fput>:
ffffffff802e8da0: 55 push %rbp
ffffffff802e8da1: 48 8d 47 28 lea 0x28(%rdi),%rax
ffffffff802e8da5: 48 89 e5 mov %rsp,%rbp
ffffffff802e8da8: f0 48 ff 08 lock decq (%rax)
ffffffff802e8dac: 0f 94 c2 sete %dl
ffffffff802e8daf: 84 d2 test %dl,%dl
ffffffff802e8db1: 75 05 jne ffffffff802e8db8
<fput+0x18>
ffffffff802e8db3: c9 leaveq
ffffffff802e8db4: c3 retq
ffffffff802e8db5: 0f 1f 00 nopl (%rax)
ffffffff802e8db8: e8 03 fe ff ff callq ffffffff802e8bc0
<__fput>
ffffffff802e8dbd: c9 leaveq
ffffffff802e8dbe: 66 90 xchg %ax,%ax
ffffffff802e8dc0: c3 retq
ffffffff802e8dc1: 66 66 66 66 66 66 2e nopw %cs:0x0(%rax,%rax,1)
ffffffff802e8dc8: 0f 1f 84 00 00 00 00
ffffffff802e8dcf: 00
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (2 preceding siblings ...)
2009-04-28 17:10 ` vvv at ru dot ru
@ 2009-04-28 17:15 ` vvv at ru dot ru
2009-04-28 17:37 ` ubizjak at gmail dot com
` (47 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-04-28 17:15 UTC (permalink / raw)
To: gcc-bugs
------- Comment #4 from vvv at ru dot ru 2009-04-28 17:15 -------
Created an attachment (id=17777)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17777&action=view)
Simple example from Linux
See two functions:
static void pre_schedule_rt
static void switched_from_rt
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (3 preceding siblings ...)
2009-04-28 17:15 ` vvv at ru dot ru
@ 2009-04-28 17:37 ` ubizjak at gmail dot com
2009-04-28 21:19 ` vvv at ru dot ru
` (46 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: ubizjak at gmail dot com @ 2009-04-28 17:37 UTC (permalink / raw)
To: gcc-bugs
------- Comment #5 from ubizjak at gmail dot com 2009-04-28 17:37 -------
Unfortunately, all code snippets and dumps are of no use. Please see
http://gcc.gnu.org/bugs.html for the reason why.
As an exercise, please create a *standalone* _preprocessed_ source file, compile
it with -S added to your compile flags, and count the number of .p2align
directives in the code stream.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (4 preceding siblings ...)
2009-04-28 17:37 ` ubizjak at gmail dot com
@ 2009-04-28 21:19 ` vvv at ru dot ru
2009-04-28 21:23 ` pinskia at gcc dot gnu dot org
` (45 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-04-28 21:19 UTC (permalink / raw)
To: gcc-bugs
------- Comment #6 from vvv at ru dot ru 2009-04-28 21:18 -------
Let's compile file test.c
//#file test.c
extern int F(int m);
void func(int x)
{
int u = F(x);
while (u)
u = F(u)*3+1;
}
# gcc -o t.out test.c -c -O2
# objdump -d t.out
t.out: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <func>:
0: 48 83 ec 08 sub $0x8,%rsp
4: e8 00 00 00 00 callq 9 <func+0x9>
9: 85 c0 test %eax,%eax
b: 89 c7 mov %eax,%edi
d: 74 0e je 1d <func+0x1d>
f: 90 nop
10: e8 00 00 00 00 callq 15 <func+0x15>
15: 8d 7c 40 01 lea 0x1(%rax,%rax,2),%edi
19: 85 ff test %edi,%edi
1b: 75 f3 jne 10 <func+0x10>
1d: 48 83 c4 08 add $0x8,%rsp
21: 0f 1f 80 00 00 00 00 nopl 0x0(%rax) <---- nonoptimal
28: c3 retq
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (5 preceding siblings ...)
2009-04-28 21:19 ` vvv at ru dot ru
@ 2009-04-28 21:23 ` pinskia at gcc dot gnu dot org
2009-04-28 21:47 ` ubizjak at gmail dot com
` (44 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2009-04-28 21:23 UTC (permalink / raw)
To: gcc-bugs
------- Comment #7 from pinskia at gcc dot gnu dot org 2009-04-28 21:23 -------
4.1.2 produces:
.L4:
addq $8, %rsp
.p2align 4,,2
ret
While the trunk produces:
.L1:
addq $8, %rsp
.p2align 4,,2
.p2align 3
ret
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (6 preceding siblings ...)
2009-04-28 21:23 ` pinskia at gcc dot gnu dot org
@ 2009-04-28 21:47 ` ubizjak at gmail dot com
2009-04-28 21:53 ` pinskia at gcc dot gnu dot org
` (43 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: ubizjak at gmail dot com @ 2009-04-28 21:47 UTC (permalink / raw)
To: gcc-bugs
------- Comment #8 from ubizjak at gmail dot com 2009-04-28 21:47 -------
From config/i386/i386.c:
/* AMD Athlon works faster
when RET is not destination of conditional jump or directly preceded
by other jump instruction. We avoid the penalty by inserting NOP just
before the RET instructions in such cases. */
static void
ix86_pad_returns (void)
...
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (7 preceding siblings ...)
2009-04-28 21:47 ` ubizjak at gmail dot com
@ 2009-04-28 21:53 ` pinskia at gcc dot gnu dot org
2009-04-28 21:54 ` ubizjak at gmail dot com
` (42 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2009-04-28 21:53 UTC (permalink / raw)
To: gcc-bugs
------- Comment #9 from pinskia at gcc dot gnu dot org 2009-04-28 21:52 -------
So that explains it. Use -Os or attribute cold if you want the NOPs to be gone.
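A minimal sketch of the attribute-cold suggestion, assuming a GCC release that
supports the cold attribute. drain() is a hypothetical function, not taken from
the kernel sources quoted earlier; it only mimics the "loop, then return" shape
of the reported cases. Whether the padding before ret actually disappears
depends on the GCC version and flags, so compare the -S output with and without
the attribute.

```c
#include <assert.h>

/* Hypothetical rarely-executed helper (illustrative name and body).
   The cold attribute asks GCC to treat the function as unlikely to run,
   which normally means it is optimized for size rather than speed. */
__attribute__((cold))
int drain(int u)
{
    int steps = 0;

    assert(u >= 0);
    while (u) {      /* loop falls through straight into the return */
        u /= 2;
        steps++;
    }
    return steps;
}
```

For example, drain(8) halves 8, 4, 2, 1 and returns 4.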
--
pinskia at gcc dot gnu dot org changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|WAITING |RESOLVED
Resolution| |INVALID
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (8 preceding siblings ...)
2009-04-28 21:53 ` pinskia at gcc dot gnu dot org
@ 2009-04-28 21:54 ` ubizjak at gmail dot com
2009-04-29 7:46 ` vvv at ru dot ru
` (41 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: ubizjak at gmail dot com @ 2009-04-28 21:54 UTC (permalink / raw)
To: gcc-bugs
------- Comment #10 from ubizjak at gmail dot com 2009-04-28 21:53 -------
Actually, alignment is from ix86_avoid_jump_misspredicts, where:
/* Look for all minimal intervals of instructions containing 4 jumps.
The intervals are bounded by START and INSN. NBYTES is the total
size of instructions in the interval including INSN and not including
START. When the NBYTES is smaller than 16 bytes, it is possible
that the end of START and INSN ends up in the same 16byte page.
The smallest offset in the page INSN can start is the case where START
ends on the offset 0. Offset of INSN is then NBYTES - sizeof (INSN).
We add p2align to 16byte window with maxskip 17 - NBYTES + sizeof (INSN).
*/
So, this is by design. Use -Os if code size is important.
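The arithmetic in the quoted comment can be sketched as below. This only
transcribes the formula "17 - NBYTES + sizeof (INSN)" as written in the comment;
the function name is hypothetical, and returning 0 when NBYTES is 16 or more is
an assumption about the "no padding needed" case, not the actual i386.c code.

```c
#include <assert.h>

/* Max skip for the alignment directive placed before INSN, following the
   quoted comment.  nbytes is the size of the interval (including INSN,
   excluding START); insn_size is the size of INSN itself.
   Assumption: an interval of 16 bytes or more needs no padding. */
int jump_pad_max_skip(int nbytes, int insn_size)
{
    assert(insn_size > 0 && nbytes >= insn_size);
    if (nbytes >= 16)
        return 0;
    return 17 - nbytes + insn_size;
}
```

With a 14-byte interval that ends in a 1-byte ret, this gives a max skip of 4.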
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (9 preceding siblings ...)
2009-04-28 21:54 ` ubizjak at gmail dot com
@ 2009-04-29 7:46 ` vvv at ru dot ru
2009-04-29 7:55 ` vvv at ru dot ru
` (40 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-04-29 7:46 UTC (permalink / raw)
To: gcc-bugs
------- Comment #11 from vvv at ru dot ru 2009-04-29 07:46 -------
(In reply to comment #8)
> From config/i386/i386.c:
> /* AMD Athlon works faster
> when RET is not destination of conditional jump or directly preceded
> by other jump instruction. We avoid the penalty by inserting NOP just
> before the RET instructions in such cases. */
> static void
> ix86_pad_returns (void)
> ...
But I am using a Core 2 Duo.
Why do we see a multi-byte NOP rather than a single-byte NOP?
Why does the number of NOPs change if the line u = F(u)*3+1; is changed to
u = F(u)*4+1; or u = F(u);?
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (10 preceding siblings ...)
2009-04-29 7:46 ` vvv at ru dot ru
@ 2009-04-29 7:55 ` vvv at ru dot ru
2009-04-29 9:32 ` jakub at gcc dot gnu dot org
` (39 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-04-29 7:55 UTC (permalink / raw)
To: gcc-bugs
------- Comment #12 from vvv at ru dot ru 2009-04-29 07:55 -------
(In reply to comment #9)
> So that explains it, Use -Os or attribute cold if you want NOPs to be gone.
But my measurements on Core 2 Duo P8600 show that
push %ebp
mov %esp,%ebp
leave
ret
is _faster_ than
push %ebp
mov %esp,%ebp
leave
xchg %ax,%ax
ret
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (11 preceding siblings ...)
2009-04-29 7:55 ` vvv at ru dot ru
@ 2009-04-29 9:32 ` jakub at gcc dot gnu dot org
2009-04-29 10:13 ` jakub at gcc dot gnu dot org
` (38 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-04-29 9:32 UTC (permalink / raw)
To: gcc-bugs
------- Comment #13 from jakub at gcc dot gnu dot org 2009-04-29 09:32 -------
You are benchmarking something completely unrelated.
What really matters is how code that has 4 branches/calls in one 16-byte block
is able to predict all those branches. And Core2 similarly to various AMD CPUs
is not able to predict them well.
In the #c6 testcase it considers whether the je, call, jne and ret can end up
in one 16 byte block or not. They can't: je is 2 bytes, call 5 bytes, leal 4
bytes (but gcc uses min_insn_size, which is 2 in this case), testl 2, jne 2,
addq 4 (but again, min_insn_size is 2 in this case).
min_insn_size seems to be very conservative. I guess teaching it about a bunch
of prefixes couldn't hurt: for non-jump/call insns ATM it estimates just the
displacement size and doesn't consider any prefixes (even those that really can't
change after machine reorg), etc.
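To make the size mismatch concrete, here is a small sketch that adds up the
numbers given above for the #c6 testcase. The table and function names are
hypothetical; only the byte counts come from this comment. It does not model
the exact interval bookkeeping of the pass (which excludes START and includes
the ret), only the gap between real and estimated sizes.

```c
#include <stddef.h>

/* Sizes quoted in the comment: actual encoding vs. min_insn_size estimate. */
struct insn_size {
    const char *name;
    int actual;
    int estimate;
};

static const struct insn_size c6_insns[] = {
    { "je",    2, 2 },
    { "call",  5, 5 },
    { "leal",  4, 2 },
    { "testl", 2, 2 },
    { "jne",   2, 2 },
    { "addq",  4, 2 },
};

/* Sum of the real encodings: 19 bytes, more than one 16-byte block. */
int total_actual(void)
{
    int sum = 0;
    for (size_t i = 0; i < sizeof c6_insns / sizeof c6_insns[0]; i++)
        sum += c6_insns[i].actual;
    return sum;
}

/* Sum of the conservative estimates: 15 bytes, so the pass cannot rule
   out that everything lands in the same 16-byte block. */
int total_estimate(void)
{
    int sum = 0;
    for (size_t i = 0; i < sizeof c6_insns / sizeof c6_insns[0]; i++)
        sum += c6_insns[i].estimate;
    return sum;
}
```

The real code is 19 bytes long, but the estimate is only 15, which is why the
padding is emitted even though it is not needed here.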
--
jakub at gcc dot gnu dot org changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |hubicka at gcc dot gnu dot
| |org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (12 preceding siblings ...)
2009-04-29 9:32 ` jakub at gcc dot gnu dot org
@ 2009-04-29 10:13 ` jakub at gcc dot gnu dot org
2009-04-29 19:17 ` vvv at ru dot ru
` (37 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-04-29 10:13 UTC (permalink / raw)
To: gcc-bugs
------- Comment #14 from jakub at gcc dot gnu dot org 2009-04-29 10:12 -------
Also, couldn't we use the information computed by compute_alignments and
assume CODE_LABELs are aligned?
We would probably need to add a label_to_max_skip (rtx) function to final.c,
so that not just label_to_alignment, but also LABEL_TO_MAX_SKIP is available to
backends. Then, when we know that the label in the testcase is preceded by
.p2align 4,,10
.p2align 3
we know the 16 byte boundary is either at that label, or at most 5 bytes
before it, so all we need to do is consider any jumps/calls in the last 5 bytes
before the label.
For min_insn_size, is it possible to find out for which non-jump/call insns
get_attr_length might not be exact (i.e. be a maximum guess rather than
guaranteed size (though of course, this is also just an optimization, so 100%
guarantees aren't needed either))? If so, we could use get_attr_length for the
insns where it is known to be exact...
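The "at most 5 bytes" bound can be checked with a short sketch. It models only
the first directive (.p2align 4,,10) and assumes the usual GNU as behaviour
that the alignment is skipped entirely when it would need more than max-skip
bytes of padding. Both function names are hypothetical.

```c
#include <assert.h>

/* Closed form: with ".p2align LOG2,,MAX_SKIP" before a label, at most
   2^LOG2 - 1 - MAX_SKIP bytes of the current window can precede the label. */
int max_bytes_before_label(int log2_align, int max_skip)
{
    int window = 1 << log2_align;

    assert(max_skip >= 0 && max_skip < window);
    return window - 1 - max_skip;
}

/* Brute force over every offset inside the window, as a cross-check. */
int brute_force_bytes_before_label(int log2_align, int max_skip)
{
    int window = 1 << log2_align;
    int worst = 0;

    for (int off = 0; off < window; off++) {
        int pad = (window - off) % window;       /* bytes needed to align */
        int before = (pad <= max_skip) ? 0 : off; /* aligned: new window   */
        if (before > worst)
            worst = before;
    }
    return worst;
}
```

For .p2align 4,,10 both functions return 5, matching the figure above.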
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (13 preceding siblings ...)
2009-04-29 10:13 ` jakub at gcc dot gnu dot org
@ 2009-04-29 19:17 ` vvv at ru dot ru
2009-04-30 9:07 ` jakub at gcc dot gnu dot org
` (36 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-04-29 19:17 UTC (permalink / raw)
To: gcc-bugs
------- Comment #15 from vvv at ru dot ru 2009-04-29 19:16 -------
One more example: a 5-byte NOP between leaveq and retq.
# cat test.c
void wait_for_enter()
{
int u = getchar();
while (!u)
u = getchar()-13;
}
main()
{
wait_for_enter();
}
# gcc -o t.out test.c -O2 -march=core2 -fno-omit-frame-pointer
# objdump -d t.out
...
0000000000400540 <wait_for_enter>:
400540: 55 push %rbp
400541: 31 c0 xor %eax,%eax
400543: 48 89 e5 mov %rsp,%rbp
400546: e8 f5 fe ff ff callq 400440 <getchar@plt>
40054b: 85 c0 test %eax,%eax
40054d: 75 13 jne 400562 <wait_for_enter+0x22>
40054f: 90 nop
400550: 31 c0 xor %eax,%eax
400552: e8 e9 fe ff ff callq 400440 <getchar@plt>
400557: 83 f8 0d cmp $0xd,%eax
40055a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
400560: 74 ee je 400550 <wait_for_enter+0x10>
400562: c9 leaveq
400563: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) <--NONOPTIMAL!
400568: c3 retq
400569: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
0000000000400570 <main>:
400570: 55 push %rbp
400571: 31 c0 xor %eax,%eax
400573: 48 89 e5 mov %rsp,%rbp
400576: e8 c5 ff ff ff callq 400540 <wait_for_enter>
40057b: c9 leaveq
40057c: c3 retq
40057d: 90 nop
40057e: 90 nop
40057f: 90 nop
So the bug is unresolved.
--
vvv at ru dot ru changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|RESOLVED |UNCONFIRMED
Resolution|INVALID |
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (14 preceding siblings ...)
2009-04-29 19:17 ` vvv at ru dot ru
@ 2009-04-30 9:07 ` jakub at gcc dot gnu dot org
2009-05-12 16:41 ` vvv at ru dot ru
` (35 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-04-30 9:07 UTC (permalink / raw)
To: gcc-bugs
------- Comment #16 from jakub at gcc dot gnu dot org 2009-04-30 09:07 -------
Created an attachment (id=17783)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17783&action=view)
gcc45-pr39942.patch
Patch that attempts to take into account the .p2align directives that are emitted
for (some) CODE_LABELs and also the gen_align insns that the pass itself
inserts. For a CODE_LABEL, say, .p2align 4,,10 means either that the .p2align
directive starts a new 16 byte page (then insns before it are never
interesting), or that nothing was skipped because more than 10 bytes would need
to be skipped. But that means the current group could contain only 5 or fewer
bytes of instructions before the label, so again, we don't have to look at
instructions not in the last 5 bytes.
Another fix is that for MAX_SKIP < 7, ASM_OUTPUT_MAX_SKIP_ALIGN shouldn't emit
the second .p2align 3, which might (and often does) skip more than MAX_SKIP
bytes (up to 7).
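The second point can be illustrated with a small simulation of the two
directives. This is a sketch under the assumption that the assembler drops an
alignment completely when it would need more than max-skip bytes; the function
names, and the use of a negative max_skip to mean "no limit", are conventions
invented for this example.

```c
#include <assert.h>

/* Padding emitted by ".p2align log2_align,,max_skip" at byte offset
   `offset`.  A negative max_skip means no limit (plain ".p2align N"). */
int p2align_pad(int offset, int log2_align, int max_skip)
{
    int window = 1 << log2_align;
    int pad;

    assert(offset >= 0);
    pad = (window - offset % window) % window;
    if (max_skip >= 0 && pad > max_skip)
        return 0;               /* too far away: alignment is skipped */
    return pad;
}

/* Worst-case total padding of ".p2align 4,,max_skip" followed by
   ".p2align 3", over every starting offset in a 16-byte window. */
int worst_pad_with_second_align(int max_skip)
{
    int worst = 0;

    for (int off = 0; off < 16; off++) {
        int pad = p2align_pad(off, 4, max_skip);
        pad += p2align_pad(off + pad, 3, -1);
        if (pad > worst)
            worst = pad;
    }
    return worst;
}
```

With a max skip of 2 the pair can still emit 7 bytes of padding, which is
consistent with the 7-byte nopl at offset 0x21 in the comment #6 dump:
p2align_pad(0x21, 4, 2) is 0, and p2align_pad(0x21, 3, -1) is 7.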
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (15 preceding siblings ...)
2009-04-30 9:07 ` jakub at gcc dot gnu dot org
@ 2009-05-12 16:41 ` vvv at ru dot ru
2009-05-13 8:31 ` jakub at gcc dot gnu dot org
` (34 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-05-12 16:41 UTC (permalink / raw)
To: gcc-bugs
------- Comment #17 from vvv at ru dot ru 2009-05-12 16:40 -------
(In reply to comment #16)
> Created an attachment (id=17783)
> --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17783&action=view)
> gcc45-pr39942.patch
> Patch that attempts to take into account .p2align directives that are emitted
> for (some) CODE_LABELs and also the gen_align insns that the pass itself
> inserts. For a CODE_LABEL, say, .p2align 4,,10 means either that the .p2align
> directive starts a new 16 byte page (then insns before it are never
> interesting), or nothing was skipped because more than 10 bytes would need to
> be skipped. But that means the current group could contain only 5 or less
> bytes of instructions before the label, so again, we don't have to look at
> instructions not in the last 5 bytes.
> Another fix is that for MAX_SKIP < 7, ASM_OUTPUT_MAX_SKIP_ALIGN shouldn't emit
> the second .p2align 3, which might (and often does) skip more than MAX_SKIP
> bytes (up to 7).
Nice patch. The code looks better. I checked it on Linux kernel 2.6.29.2.
But two notes:
1. There is no guarantee that .p2align will be translated to NOPs. Example:
# cat test.c
void f(int i)
{
if (i == 1) F(1);
if (i == 2) F(2);
if (i == 3) F(3);
if (i == 4) F(4);
if (i == 5) F(5);
}
# gcc -o test.s test.c -O2 -S
# cat test.s
.file "test.c"
.text
.p2align 4,,15
.globl f
.type f, @function
f:
.LFB0:
.cfi_startproc
cmpl $1, %edi
je .L7
cmpl $2, %edi
je .L7
cmpl $3, %edi
je .L7
cmpl $4, %edi
.p2align 4,,5 <------- attempt of padding
je .L7
cmpl $5, %edi
je .L7
rep
ret
.p2align 4,,10
.p2align 3
.L7:
xorl %eax, %eax
jmp F
.cfi_endproc
.LFE0:
.size f, .-f
.ident "GCC: (GNU) 4.5.0 20090512 (experimental)"
.section .note.GNU-stack,"",@progbits
# gcc -o test.out test.s -O2 -c
# objdump -d test.out
0000000000000000 <f>:
   0:   83 ff 01        cmp    $0x1,%edi
   3:   74 1b           je     20 <f+0x20>
   5:   83 ff 02        cmp    $0x2,%edi
   8:   74 16           je     20 <f+0x20>
   a:   83 ff 03        cmp    $0x3,%edi
   d:   74 11           je     20 <f+0x20>
   f:   83 ff 04        cmp    $0x4,%edi
  12:   74 0c           je     20 <f+0x20>      <---- no NOP here
  14:   83 ff 05        cmp    $0x5,%edi
  17:   74 07           je     20 <f+0x20>
  19:   f3 c3           repz retq
IMHO, it is better to insert NOPs directly rather than .p2align. (I mean the
line: emit_insn_before (gen_align (GEN_INT (padsize)), insn); )
2. IMHO, it's a bad idea to insert something between CMP and a conditional jmp.
Quote from the Intel 64 and IA-32 Architectures Optimization Reference Manual:
>> 3.4.2.2 Optimizing for Macro-fusion
>> Macro-fusion merges two instructions to a single μop. Intel Core Microarchitecture
>> performs this hardware optimization under limited circumstances.
>> The first instruction of the macro-fused pair must be a CMP or TEST instruction. This
>> instruction can be REG-REG, REG-IMM, or a micro-fused REG-MEM comparison. The
>> second instruction (adjacent in the instruction stream) should be a conditional
>> branch.
So if we need to insert NOPs, it is better to do it _before_ the CMP.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (16 preceding siblings ...)
2009-05-12 16:41 ` vvv at ru dot ru
@ 2009-05-13 8:31 ` jakub at gcc dot gnu dot org
2009-05-13 11:43 ` vvv at ru dot ru
` (33 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-13 8:31 UTC (permalink / raw)
To: gcc-bugs
------- Comment #18 from jakub at gcc dot gnu dot org 2009-05-13 08:30 -------
No, .p2align is the right thing to do, given that GCC doesn't have 100%
accurate information about instruction sizes (e.g. for inline asms it can't
have it; for stuff where branch shortening can decrease the size it doesn't
have it until the branch shortening phase, which is too late for this machine
reorg; and in other cases the lengths are just upper bounds). Say .p2align 4,,5
says: insert a nop of up to 5 bytes if you can reach the 16-byte boundary with
it, otherwise don't insert anything. But that necessarily means that there were
fewer than 11 bytes in the same 16-byte page, and if the lower bound insn size
estimation determined that in 11 bytes you can't have 3 branch changing
instructions, you are fine. Breaking of fused compare and jump (32-bit code
only) is unfortunate, but inserting it before the cmp would often mean
unnecessarily large padding.
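The max-skip behaviour described above can be written down as a few lines of
C. This is only a sketch of how the GNU assembler treats the third operand of
.p2align; the function name p2align_padding is made up for illustration and is
not part of GCC or binutils.

```c
#include <assert.h>

/* Number of padding bytes emitted for ".p2align log2_align,,max_skip" when
   the location counter is at ADDR.  If reaching the boundary would need more
   than MAX_SKIP bytes, no padding is emitted at all.
   (Hypothetical helper, for illustration only.)  */
unsigned
p2align_padding (unsigned long addr, unsigned log2_align, unsigned max_skip)
{
  assert (log2_align < 8 * sizeof (unsigned long));

  unsigned long align = 1UL << log2_align;
  /* Bytes missing to the next boundary; 0 if ADDR is already aligned.  */
  unsigned long needed = (align - (addr & (align - 1))) & (align - 1);

  return needed <= max_skip ? (unsigned) needed : 0;
}
```

For the listing in comment #17, the .p2align 4,,5 sits at offset 0x12, where 14
bytes would be needed, so p2align_padding (0x12, 4, 5) gives 0 and no NOP
appears. The .p2align 4,,10 before .L7 sits at 0x1b, needs 5 bytes, and pads up
to 0x20.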
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (17 preceding siblings ...)
2009-05-13 8:31 ` jakub at gcc dot gnu dot org
@ 2009-05-13 11:43 ` vvv at ru dot ru
2009-05-13 13:32 ` rguenth at gcc dot gnu dot org
` (32 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-05-13 11:43 UTC (permalink / raw)
To: gcc-bugs
------- Comment #19 from vvv at ru dot ru 2009-05-13 11:42 -------
(In reply to comment #18)
> No, .p2align is the right thing to do, given that GCC doesn't have 100%
> accurate information about instruction sizes (for e.g. inline asms it can't
> have, for
> stuff where branch shortening can decrease the size doesn't have it until the
> shortening branch phase which is too late for this machine reorg, and in other
> cases the lengths are just upper bounds). Say .p2align 4,,5 says
> insert a nop up to 5 bytes if you can reach the 16-byte boundary with it,
> otherwise don't insert anything. But that necessarily means that there were
> less than 11 bytes in the same 16 byte page and if the lower bound insn size
> estimation determined that in 11 bytes you can't have 3 branch changing
> instructions, you are fine. Breaking of fused compare and jump (32-bit code
> only) is unfortunate, but inserting it before the cmp would mean often
> unnecessarily large padding.
You are right, if padding is required for every 16-byte page with 4 branches in
it. But Intel writes about a "16-byte chunk", not a "16-byte page".
Quote from Intel 64 and IA-32 Architectures Optimization Reference Manual:
Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put
more than four branches in a 16-byte chunk.
IMHO, a chunk here is the memory range from x to x+10h, where x is _any_
address.
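The two readings give different counts for the same code. The sketch below
counts branches both ways; the function names are invented for illustration,
and counting a branch by the address of its first byte is an assumption made
only to keep the example short.

```c
#include <stddef.h>

/* Largest number of branches whose start address falls into one aligned
   16-byte page (XXX0..XXXF).  ADDR must be sorted in ascending order.
   (Hypothetical helper, for illustration only.)  */
size_t
max_branches_per_aligned_page (const unsigned long *addr, size_t n)
{
  size_t best = 0, run = 0;

  for (size_t i = 0; i < n; i++)
    {
      if (i > 0 && (addr[i] >> 4) == (addr[i - 1] >> 4))
        run++;                  /* same page as the previous branch */
      else
        run = 1;                /* first branch seen in a new page */
      if (run > best)
        best = run;
    }
  return best;
}

/* Largest number of branches whose start address falls into any window
   [x, x+10h), where x may be any address.  ADDR must be sorted.
   (Hypothetical helper, for illustration only.)  */
size_t
max_branches_per_sliding_window (const unsigned long *addr, size_t n)
{
  size_t best = 0, lo = 0;

  for (size_t hi = 0; hi < n; hi++)
    {
      /* Drop branches that are 16 or more bytes behind the current one.  */
      while (addr[hi] - addr[lo] >= 16)
        lo++;
      if (hi - lo + 1 > best)
        best = hi - lo + 1;
    }
  return best;
}
```

For the objdump in comment #17 the branches start at 3, 8, 0xd, 0x12, 0x17 and
0x19. Per aligned page that is at most 3 branches, but the window starting at
address 3 already contains 4.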
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (18 preceding siblings ...)
2009-05-13 11:43 ` vvv at ru dot ru
@ 2009-05-13 13:32 ` rguenth at gcc dot gnu dot org
2009-05-13 17:13 ` vvv at ru dot ru
` (31 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-05-13 13:32 UTC (permalink / raw)
To: gcc-bugs
------- Comment #20 from rguenth at gcc dot gnu dot org 2009-05-13 13:31 -------
Instruction decoders generally operate on whole cache-lines, so 16-byte chunk
very very likely refers to a cache-line.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (19 preceding siblings ...)
2009-05-13 13:32 ` rguenth at gcc dot gnu dot org
@ 2009-05-13 17:13 ` vvv at ru dot ru
2009-05-13 18:22 ` ubizjak at gmail dot com
` (30 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-05-13 17:13 UTC (permalink / raw)
To: gcc-bugs
------- Comment #21 from vvv at ru dot ru 2009-05-13 17:13 -------
I guess! Your patch is absolutely correct for AMD Athlon™ 64 and AMD Opteron™
processors, but it is nonoptimal for Intel processors, because:
1. The AMD limitation is per 16-byte page (memory range XXXXXXX0 - XXXXXXXF),
but the Intel limitation is per 16-byte chunk (memory range XXXXXXXX -
XXXXXXXX+10h).
2. AMD: maximum of _THREE_ near branches (CALL, JMP, conditional branches, or
returns); Intel: maximum of _FOUR_ branches!
Quotation from the Software Optimization Guide for AMD64 Processors:
6.1 Density of Branches
When possible, align branches such that they do not cross a 16-byte boundary.
The AMD Athlon™ 64 and AMD Opteron™ processors have the capability to cache
branch-prediction history for a maximum of three near branches (CALL, JMP,
conditional branches, or returns) per 16-byte fetch window. A branch
instruction that crosses a 16-byte boundary is counted in the second 16-byte
window. Due to architectural restrictions, a branch that is split across a
16-byte boundary cannot dispatch with any other instructions when it is
predicted taken. Perform this alignment by rearranging code; it is not
beneficial to align branches using padding sequences.
The following branches are limited to three per 16-byte window:
jcc rel8
jcc rel32
jmp rel8
jmp rel32
jmp reg
jmp WORD PTR
jmp DWORD PTR
call rel16
call r/m16
call rel32
call r/m32
Coding more than three branches in the same 16-byte code window may lead to
conflicts in the branch target buffer. To avoid conflicts in the branch target
buffer, space out branches such that three or fewer exist in a given 16-byte
code window. For absolute optimal performance, try to limit branches to one per
16-byte code window. Avoid code sequences like the following:
ALIGN 16
label3:
call label1 ; 1st branch in 16-byte code window
jc label3 ; 2nd branch in 16-byte code window
call label2 ; 3rd branch in 16-byte code window
jnz label4 ; 4th branch in 16-byte code window
; Cannot be predicted.
If there is a jump table that contains many frequently executed branches, pad
the table entries to 8 bytes each to assure that there are never more than
three branches per 16-byte block of code.
Only branches that have been taken at least once are entered into the dynamic
branch prediction, and therefore only those branches count toward the
three-branch limit.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (20 preceding siblings ...)
2009-05-13 17:13 ` vvv at ru dot ru
@ 2009-05-13 18:22 ` ubizjak at gmail dot com
2009-05-13 18:34 ` rguenth at gcc dot gnu dot org
` (29 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: ubizjak at gmail dot com @ 2009-05-13 18:22 UTC (permalink / raw)
To: gcc-bugs
------- Comment #22 from ubizjak at gmail dot com 2009-05-13 18:22 -------
(In reply to comment #21)
> I guess! Your patch is absolutely correct for AMD AthlonTM 64 and AMD OpteronTM
> processors, but it is nonoptimal for Intel processors. Because:
...
CCing H.J for Intel optimization issues.
--
ubizjak at gmail dot com changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |hjl dot tools at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (21 preceding siblings ...)
2009-05-13 18:22 ` ubizjak at gmail dot com
@ 2009-05-13 18:34 ` rguenth at gcc dot gnu dot org
2009-05-13 18:45 ` hjl dot tools at gmail dot com
` (28 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-05-13 18:34 UTC (permalink / raw)
To: gcc-bugs
------- Comment #23 from rguenth at gcc dot gnu dot org 2009-05-13 18:34 -------
Note that we need something that works for the generic model as well, which in
this case looks like it is the same as for AMD models.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (22 preceding siblings ...)
2009-05-13 18:34 ` rguenth at gcc dot gnu dot org
@ 2009-05-13 18:45 ` hjl dot tools at gmail dot com
2009-05-13 18:57 ` vvv at ru dot ru
` (27 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: hjl dot tools at gmail dot com @ 2009-05-13 18:45 UTC (permalink / raw)
To: gcc-bugs
------- Comment #24 from hjl dot tools at gmail dot com 2009-05-13 18:45 -------
Using padding to avoid 4 branches in a 16-byte chunk may not be a good idea
since it will increase code size.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (23 preceding siblings ...)
2009-05-13 18:45 ` hjl dot tools at gmail dot com
@ 2009-05-13 18:57 ` vvv at ru dot ru
2009-05-13 19:06 ` vvv at ru dot ru
` (26 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-05-13 18:57 UTC (permalink / raw)
To: gcc-bugs
------- Comment #25 from vvv at ru dot ru 2009-05-13 18:56 -------
(In reply to comment #22)
> CCing H.J for Intel optimization issues.
VVV> 1. AMD limitation for 16-bytes page (memory range XXXXXXX0 - XXXXXXXF), but
VVV> Intel limitation for 16-bytes chunk (memory range XXXXXXXX - XXXXXXXX+10h)
I have doubts about this now, thanks to Richard Guenther (comment #20). So I am
going to make measurements to check it on Core2.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (24 preceding siblings ...)
2009-05-13 18:57 ` vvv at ru dot ru
@ 2009-05-13 19:06 ` vvv at ru dot ru
2009-05-13 19:09 ` jakub at gcc dot gnu dot org
` (25 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-05-13 19:06 UTC (permalink / raw)
To: gcc-bugs
------- Comment #26 from vvv at ru dot ru 2009-05-13 19:05 -------
(In reply to comment #23)
> Note that we need something that works for the generic model as well, which in
> this case looks like it is the same as for AMD models.
There is a processor property TARGET_FOUR_JUMP_LIMIT; maybe create a new one,
TARGET_FIVE_JUMP_LIMIT?
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (25 preceding siblings ...)
2009-05-13 19:06 ` vvv at ru dot ru
@ 2009-05-13 19:09 ` jakub at gcc dot gnu dot org
2009-05-13 19:19 ` vvv at ru dot ru
` (24 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-13 19:09 UTC (permalink / raw)
To: gcc-bugs
------- Comment #27 from jakub at gcc dot gnu dot org 2009-05-13 19:08 -------
If inserting the padding isn't worth it for say core2, m_CORE2 could be dropped
from X86_TUNE_FOUR_JUMP_LIMIT, but certainly it would be interesting to see
SPEC numbers backing that up. Similarly for AMD CPUs, and if on at least one
of these it is beneficial, probably m_GENERIC should keep it, though with far
improved min_insn_size so that it doesn't trigger unnecessarily so often.
Vlad/Honza, could one of you SPEC test say current 4.4 with one (or both of):
http://gcc.gnu.org/ml/gcc-patches/2009-05/msg00702.html
http://gcc.gnu.org/ml/gcc-patches/2009-05/msg00703.html
on top of it (no need to compare clearly unneeded paddings to no paddings,
better compare only somewhat needed paddings against no paddings) compared with
X86_TUNE_FOUR_JUMP_LIMIT cleared for the used CPU?
--
jakub at gcc dot gnu dot org changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |vmakarov at gcc dot gnu dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (26 preceding siblings ...)
2009-05-13 19:09 ` jakub at gcc dot gnu dot org
@ 2009-05-13 19:19 ` vvv at ru dot ru
2009-05-13 21:44 ` hjl dot tools at gmail dot com
` (23 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-05-13 19:19 UTC (permalink / raw)
To: gcc-bugs
------- Comment #28 from vvv at ru dot ru 2009-05-13 19:18 -------
(In reply to comment #24)
> Using padding to avoid 4 branches in 16byte chunk may not be a good idea since
> it will increase code size.
A single one-byte NOP per 16-byte chunk is enough for padding. But, IMHO, four
branches in a 16-byte chunk is very, very infrequent, especially in 64-bit
mode.
BTW, it's difficult to understand what Intel means by the term "branch". Is it
CALL, JMP, conditional branches, or returns (same as AMD), or only JMP and
conditional branches? I believe the latter is right.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (27 preceding siblings ...)
2009-05-13 19:19 ` vvv at ru dot ru
@ 2009-05-13 21:44 ` hjl dot tools at gmail dot com
2009-05-14 9:01 ` vvv at ru dot ru
` (22 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: hjl dot tools at gmail dot com @ 2009-05-13 21:44 UTC (permalink / raw)
To: gcc-bugs
------- Comment #29 from hjl dot tools at gmail dot com 2009-05-13 21:44 -------
Created an attachment (id=17858)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17858&action=view)
Impact of X86_TUNE_FOUR_JUMP_LIMIT on SPEC CPU 2K
This is my old data of X86_TUNE_FOUR_JUMP_LIMIT on Penryn and Nehalem.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (28 preceding siblings ...)
2009-05-13 21:44 ` hjl dot tools at gmail dot com
@ 2009-05-14 9:01 ` vvv at ru dot ru
2009-05-14 15:16 ` jakub at gcc dot gnu dot org
` (21 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-05-14 9:01 UTC (permalink / raw)
To: gcc-bugs
------- Comment #30 from vvv at ru dot ru 2009-05-14 09:01 -------
Created an attachment (id=17863)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17863&action=view)
Testing tool.
Here are the results of my testing.
Code:
        align 128
test_cikl:
        rept 14                 ; 14 if SH=0, 15 if SH=1, 16 if SH=2
        {
        nop
        }
        cmp al,0                ; 2 bytes
        jz $+10h+NOPS           ; 2 bytes offset=xxxx0
        cmp al,1                ; 2 bytes offset=xxxx2
        jz $+0Ch+NOPS           ; 2 bytes offset=xxxx4
        cmp al,2                ; 2 bytes offset=xxxx6
        jz $+08h+NOPS           ; 2 bytes offset=xxxx8
        cmp al,3                ; 2 bytes offset=xxxxA
        match =1, NOPS
        {
        nop
        }
        match =2, NOPS
        {
        xchg eax,eax            ; 2-bytes NOP
        }
        jz $+04h                ; 2 bytes offset=xxxxC
        ja $+02h                ; 2 bytes offset=xxxxE
        mov eax,ecx
        and eax,7h
        loop test_cikl
This code was tested on Core2, Xeon and P4 CPUs. Results are in RDTSC ticks.
; Core 2 Duo
; NOPS/tick/Max NOPS/tick/Max NOPS/tick/Max
; SH=0 0/571/729 1/306/594 2/315/630
; SH=1 0/338/612 1/338/648 2/339/648
; SH=2 0/339/666 1/339/675 2/333/693
; Xeon 3110
; NOPS/tick/Max NOPS/tick/Max NOPS/tick/Max
; SH=0 0/586/693 1/310/675 2/310/675
; SH=1 0/333/657 1/330/648 2/464/630
; SH=2 0/333/657 1/470/594 2/474/603
; P4
; NOPS/tick/Max NOPS/tick/Max NOPS/tick/Max
; SH=0 0/1027/1317 1/1094/1258 2/1028/1207
; SH=1 0/1151/1377 1/1068/1352 2/902/1275
; SH=2 0/1124/1275 1/1148/1335 2/979/1139
Conclusions:
1. Core2 and Xeon give similar results. P4 is something strange.
For Core2 and Xeon padding is very effective: code with padding is almost 2
times faster. It makes no sense for P4?
2. My previous statement
VVV> 1. AMD limitation for 16-bytes page (memory range XXX0 - XXXF),but
VVV> Intel limitation for 16-bytes chunk (memory range XXXX - XXXX+10h)
is wrong, at least for Core2 and Xeon. For these CPUs "16-byte chunk" means
memory range XXX0 - XXXF.
Unfortunately, I can't test AMD.
PS. My testing tool is in the attachment. It starts under MSDOS, switches to
32-bit mode, switches to 64-bit mode and measures rdtsc ticks for the test
code.
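To see where the five conditional jumps of the test loop land for each SH and
NOPS setting, the layout can be tabulated with a few lines of C. This is only
a sketch built from the byte sizes given in the listing comments; the function
names are invented, the trailing LOOP instruction is ignored, and counting a
jump in the page that holds its last byte is an assumption borrowed from the
AMD text quoted in comment #21.

```c
#include <assert.h>

enum { NUM_JCC = 5 };

/* Start offsets, relative to the 128-byte aligned loop head, of the five
   two-byte conditional jumps in the test loop.  SH is the number of extra
   leading NOPs (0..2), NOPS the padding before the fourth jump (0..2).
   (Hypothetical helper, for illustration only.)  */
void
test_loop_jcc_offsets (unsigned sh, unsigned nops, unsigned off[NUM_JCC])
{
  assert (sh <= 2 && nops <= 2);

  unsigned pos = 14 + sh;       /* leading one-byte NOPs */
  for (int i = 0; i < 4; i++)
    {
      pos += 2;                 /* cmp al,imm8 */
      if (i == 3)
        pos += nops;            /* padding between the last cmp and its jz */
      off[i] = pos;
      pos += 2;                 /* jz rel8 */
    }
  off[4] = pos;                 /* ja rel8 follows the last jz directly */
}

/* Number of those jumps whose last byte lies in aligned 16-byte page PAGE
   (page 1 covers offsets 0x10..0x1F).  */
unsigned
jcc_ending_in_page (unsigned sh, unsigned nops, unsigned page)
{
  unsigned off[NUM_JCC], count = 0;

  test_loop_jcc_offsets (sh, nops, off);
  for (int i = 0; i < NUM_JCC; i++)
    if ((off[i] + 1) / 16 == page)
      count++;
  return count;
}
```

Under these assumptions only SH=0 with NOPS=0 puts all five jumps into one
aligned page (offsets 0x10..0x1F); every other combination leaves at most four
there. That combination is also the one Core 2 row measured at 571 ticks.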
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (29 preceding siblings ...)
2009-05-14 9:01 ` vvv at ru dot ru
@ 2009-05-14 15:16 ` jakub at gcc dot gnu dot org
2009-05-14 15:58 ` hjl dot tools at gmail dot com
` (20 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-14 15:16 UTC (permalink / raw)
To: gcc-bugs
------- Comment #31 from jakub at gcc dot gnu dot org 2009-05-14 15:15 -------
Some -O2 code size data from today's trunk bootstraps. The first .text line
is always vanilla bootstrap, the second one with
http://gcc.gnu.org/ml/gcc-patches/2009-05/msg00702.html
only, the third one with that patch and
http://gcc.gnu.org/ml/gcc-patches/2009-05/msg00703.html
and the last with additional:
--- i386.c.jj3 2009-05-14 12:41:24.000000000 +0200
+++ i386.c 2009-05-14 14:48:24.000000000 +0200
@@ -27202,7 +27202,7 @@ x86_function_profiler (FILE *file, int l
 static int
 min_insn_size (rtx insn)
 {
-  int l = 0;
+  int l = 0, len;

   if (!INSN_P (insn) || !active_insn_p (insn))
     return 0;
@@ -27222,7 +27222,8 @@ min_insn_size (rtx insn)
       && symbolic_reference_mentioned_p (PATTERN (insn))
       && !SIBLING_CALL_P (insn))
     return 5;
-  if (get_attr_length (insn) <= 1)
+  len = get_attr_length (insn);
+  if (len <= 1)
     return 1;

   /* For normal instructions we may rely on the sizes of addresses
@@ -27230,6 +27231,9 @@ min_insn_size (rtx insn)
      This is not the case for jumps where references are PC relative. */
   if (!JUMP_P (insn))
     {
+      if (get_attr_type (insn) != TYPE_MULTI)
+        return len;
+
       l = get_attr_length_address (insn);
       if (l < 4 && symbolic_reference_mentioned_p (PATTERN (insn)))
         l = 4;
to see how the code size changes with much more accurate (though sometimes not
minimum but maximum bound) insn sizing for the algorithm.
64-bit cc1plus
[12] .text PROGBITS 000000000047f990 07f990 8c3ba8 00 AX 0 0 16
[12] .text PROGBITS 000000000047f990 07f990 89b1e8 00 AX 0 0 16
[12] .text PROGBITS 000000000047f9c0 07f9c0 899f78 00 AX 0 0 16
[12] .text PROGBITS 000000000047f9c0 07f9c0 88eaf8 00 AX 0 0 16
32-bit cc1plus
[12] .text PROGBITS 080b24e0 06a4e0 8f8cac 00 AX 0 0 16
[12] .text PROGBITS 080b24e0 06a4e0 8d516c 00 AX 0 0 16
[12] .text PROGBITS 080b2510 06a510 8d507c 00 AX 0 0 16
[12] .text PROGBITS 080b2510 06a510 8cbd7c 00 AX 0 0 16
For 64-bit cc1plus that's 1.8%, 1.86%, 2.36% smaller binary with the 1, 2 resp.
3 patches, for 32-bit cc1plus 1.55%, 1.56%, 1.96% smaller binary.
So the first patch is the most important and something like the third one,
perhaps with more exceptions, also makes a difference. I'll now try to update
my awk script to check for the AMD rules, namely that the last byte of the
branch insn counts rather than the first.
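The percentages above follow directly from the .text sizes in the section
headers. A small C helper reproduces the arithmetic; the function name is
made up for illustration.

```c
#include <assert.h>

/* Percentage by which a PATCHED .text size is smaller than the VANILLA one.
   A negative result would mean the patched binary grew.
   (Hypothetical helper, for illustration only.)  */
double
text_shrink_percent (unsigned long vanilla, unsigned long patched)
{
  assert (vanilla > 0);
  return 100.0 * ((double) vanilla - (double) patched) / (double) vanilla;
}
```

For example, text_shrink_percent (0x8c3ba8, 0x89b1e8) is about 1.81 for the
first patch on 64-bit cc1plus, and text_shrink_percent (0x8f8cac, 0x8cbd7c) is
about 1.96 for all three patches on 32-bit cc1plus.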
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (30 preceding siblings ...)
2009-05-14 15:16 ` jakub at gcc dot gnu dot org
@ 2009-05-14 15:58 ` hjl dot tools at gmail dot com
2009-05-14 18:37 ` hjl dot tools at gmail dot com
` (19 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: hjl dot tools at gmail dot com @ 2009-05-14 15:58 UTC (permalink / raw)
To: gcc-bugs
------- Comment #32 from hjl dot tools at gmail dot com 2009-05-14 15:58 -------
(In reply to comment #30)
> Created an attachment (id=17863)
> --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17863&action=view)
> Testing tool.
>
Please make sure that you only test nop paddings for branch insns,
not nop paddings for branch targets, which prefer 16byte alignment.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (31 preceding siblings ...)
2009-05-14 15:58 ` hjl dot tools at gmail dot com
@ 2009-05-14 18:37 ` hjl dot tools at gmail dot com
2009-05-14 19:44 ` vvv at ru dot ru
` (18 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: hjl dot tools at gmail dot com @ 2009-05-14 18:37 UTC (permalink / raw)
To: gcc-bugs
------- Comment #33 from hjl dot tools at gmail dot com 2009-05-14 18:37 -------
(In reply to comment #20)
> Instruction decoders generally operate on whole cache-lines, so 16-byte chunk
> very very likely refers to a cache-line.
>
That is true. For Intel CPUs, "16-bytes chunk" means memory range XXX0 - XXXF.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (32 preceding siblings ...)
2009-05-14 18:37 ` hjl dot tools at gmail dot com
@ 2009-05-14 19:44 ` vvv at ru dot ru
2009-05-15 2:23 ` hjl dot tools at gmail dot com
` (17 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-05-14 19:44 UTC (permalink / raw)
To: gcc-bugs
------- Comment #34 from vvv at ru dot ru 2009-05-14 19:43 -------
(In reply to comment #32)
> Please make sure that you only test nop paddings for branch insns,
> not nop paddings for branch targets, which prefer 16byte alignment.
Additional test results (for Core2):
1. Execution time doesn't depend on padding for branch targets.
2. Execution time doesn't depend on the position of the NOP within a 16-byte
chunk with 4 branches, even if the NOP is inserted between CMP and the
conditional jump.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (33 preceding siblings ...)
2009-05-14 19:44 ` vvv at ru dot ru
@ 2009-05-15 2:23 ` hjl dot tools at gmail dot com
2009-05-15 4:32 ` hjl dot tools at gmail dot com
` (16 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: hjl dot tools at gmail dot com @ 2009-05-15 2:23 UTC (permalink / raw)
To: gcc-bugs
------- Comment #35 from hjl dot tools at gmail dot com 2009-05-15 02:23 -------
Created an attachment (id=17870)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17870&action=view)
A patch
This patch limits branches to 3 per 16-byte page.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (34 preceding siblings ...)
2009-05-15 2:23 ` hjl dot tools at gmail dot com
@ 2009-05-15 4:32 ` hjl dot tools at gmail dot com
2009-05-15 7:56 ` jakub at gcc dot gnu dot org
` (15 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: hjl dot tools at gmail dot com @ 2009-05-15 4:32 UTC (permalink / raw)
To: gcc-bugs
------- Comment #36 from hjl dot tools at gmail dot com 2009-05-15 04:32 -------
Created an attachment (id=17871)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17871&action=view)
An updated patch
A few comments:
1. 3 branch limit is per 16byte page, not 16byte window.
2. We should allow 3 branches in a 16byte page.
3. When we have 4 branches in a 16byte page, we only need to
pad to the 16byte page boundary before the 4th branch, which
will start at the next 16byte page.
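The three points above can be sketched as a simple walk over an instruction
sequence. This is an illustration only, not the attached patch: the struct and
function names are invented, and it assumes exact instruction sizes and a
known start address, which is a strong assumption. A branch is counted in the
page that holds its last byte, following the AMD text quoted in comment #21.

```c
#include <stddef.h>

/* One instruction of the sequence (hypothetical type, for illustration).  */
struct insn
{
  unsigned size;        /* exact length in bytes, 1..15 */
  int is_branch;        /* nonzero for jumps, calls and returns */
};

/* Walk N instructions laid out from address START.  PAD[i] receives the
   number of padding bytes to place before instruction i so that no aligned
   16-byte page holds more than 3 branches.  Returns the total padding.  */
unsigned
pad_fourth_branch (const struct insn *insns, size_t n,
                   unsigned long start, unsigned *pad)
{
  unsigned long addr = start;
  unsigned long page = start >> 4;      /* page the counter refers to */
  unsigned in_page = 0, total = 0;

  for (size_t i = 0; i < n; i++)
    {
      pad[i] = 0;
      if (insns[i].is_branch)
        {
          unsigned long last = addr + insns[i].size - 1;

          if ((last >> 4) != page)
            {
              page = last >> 4;         /* first branch ending in this page */
              in_page = 0;
            }
          if (in_page == 3)
            {
              /* This would be the 4th branch: push it to the next page.  */
              pad[i] = 16 - (unsigned) (addr & 15);
              addr += pad[i];
              total += pad[i];
              page = (addr + insns[i].size - 1) >> 4;
              in_page = 0;
            }
          in_page++;
        }
      addr += insns[i].size;
    }
  return total;
}
```

With four back-to-back two-byte jumps starting at address 0, the fourth jump
would start at 6, so it gets 10 bytes of padding and starts at 0x10 instead.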
--
hjl dot tools at gmail dot com changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #17870|0 |1
is obsolete| |
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (35 preceding siblings ...)
2009-05-15 4:32 ` hjl dot tools at gmail dot com
@ 2009-05-15 7:56 ` jakub at gcc dot gnu dot org
2009-05-15 12:11 ` jakub at gcc dot gnu dot org
` (14 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-15 7:56 UTC (permalink / raw)
To: gcc-bugs
------- Comment #37 from jakub at gcc dot gnu dot org 2009-05-15 07:56 -------
This patch looks very wrong. It assumes that min_insn_size gives exact insn
sizes (current min_insn_size is very far from that, but even get_attr_length
isn't exact), doesn't take into account label alignments nor branch shortening
which can change the insn sizes afterwards and assumes that a p2align always
aligns to 16 bytes (it does not).
While the previous algorithm works with estimated 16 consecutive bytes rather
than 16 byte pages 0xXXX0 ... 0xXXXF, that's because during machine reorg
you simply can't know in most cases where exactly the 16 byte page will start,
so you assume it can start (almost) anywhere (and use .p2align behavior to
align when needed).
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (36 preceding siblings ...)
2009-05-15 7:56 ` jakub at gcc dot gnu dot org
@ 2009-05-15 12:11 ` jakub at gcc dot gnu dot org
2009-05-15 12:12 ` jakub at gcc dot gnu dot org
` (13 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-15 12:11 UTC (permalink / raw)
To: gcc-bugs
------- Comment #38 from jakub at gcc dot gnu dot org 2009-05-15 12:11 -------
To extend #c31, I've also built the same tree with another patch which made
sure ix86_avoid_jump_mispredicts is never called (change "&& optimize" into "&&
optimize > 4" in ix86_reorg). cc1plus sizes were then
0x88d6d8 bytes for 64-bit cc1plus and 0x8c980c bytes for 32-bit cc1plus.
That is 2.42% smaller than vanilla resp. 2.06%.
I've also changed my testing script, so that it (hopefully, point me to errors)
follows the AMD rules more closely (namely that the last byte in the branch
insn counts, not the first one), will attach.
With that script, I got following number of violations for 64-bit cc1plus
(vanilla, +1patch, +2patches, +3patches, +4patches):
6, 7, 6, 51, 138
and for 32-bit cc1plus:
1, 2, 3, 34, 159.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (37 preceding siblings ...)
2009-05-15 12:11 ` jakub at gcc dot gnu dot org
@ 2009-05-15 12:12 ` jakub at gcc dot gnu dot org
2009-05-15 14:35 ` hjl dot tools at gmail dot com
` (12 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-15 12:12 UTC (permalink / raw)
To: gcc-bugs
------- Comment #39 from jakub at gcc dot gnu dot org 2009-05-15 12:12 -------
Created an attachment (id=17874)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17874&action=view)
test4jmp.sh
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
^ permalink raw reply [flat|nested] 54+ messages in thread
* [Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
` (38 preceding siblings ...)
2009-05-15 12:12 ` jakub at gcc dot gnu dot org
@ 2009-05-15 14:35 ` hjl dot tools at gmail dot com
2009-05-15 16:25 ` jakub at gcc dot gnu dot org
` (11 subsequent siblings)
51 siblings, 0 replies; 54+ messages in thread
From: hjl dot tools at gmail dot com @ 2009-05-15 14:35 UTC (permalink / raw)
To: gcc-bugs
------- Comment #40 from hjl dot tools at gmail dot com 2009-05-15 14:35 -------
(In reply to comment #37)
> This patch looks very wrong. It assumes that min_insn_size gives exact insn
> sizes (current min_insn_size is very far from that, but even get_attr_length
> isn't exact), doesn't take into account label alignments nor branch shortening
> which can change the insn sizes afterwards and assumes that a p2align always
> aligns to 16 bytes (it does not).
> While the previous algorithm works with estimated 16 consecutive bytes rather
> than 16 byte pages 0xXXXX0 ... 0xXXXF, that's because during machine reorg
> you simply can't know in most cases where exactly the 16 byte page will start,
> so you assume it can start (almost) anywhere (and use .p2align behavior to
> align when needed).
>
There is no perfect solution here. Let's list pros/cons:
The current algorithm:
Pros:
1. Very conservative. Catches most cases of 4 branches in a 16-byte window.
Cons:
1. It works on a 16-byte window, not a 16-byte page.
2. When it gets it wrong, it increases code size by adding unnecessary
nops.
My proposal:
Pros:
1. Works on a 16-byte page.
2. Even if it gets it wrong, it doesn't increase code size.
Cons:
1. Relies on inaccurate instruction length data.
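To make the window/page distinction in the lists above concrete, here is a
minimal sketch in C. It is not the code from either patch: the struct, the
function names and the branch offsets are hypothetical, chosen only for
illustration, and a branch is assigned to a page by its first byte as a
simplification.

```c
#include <stddef.h>

/* Hypothetical description of one branch insn: start offset and byte length. */
struct branch { unsigned start; unsigned len; };

/* Illustrative offsets only (assumed data), sorted by start offset. */
static const struct branch example[] = {
    {0x03, 2}, {0x08, 2}, {0x0d, 2}, {0x12, 2}, {0x17, 2}, {0x19, 2}
};

/* "Page" view: the largest number of branches whose first byte falls in
   the same aligned 16-byte block (offset / 16). */
static int max_per_page(const struct branch *b, size_t n)
{
    int best = 0;
    for (size_t i = 0; i < n; i++) {
        int count = 0;
        for (size_t j = 0; j < n; j++)
            if (b[j].start / 16 == b[i].start / 16)
                count++;
        if (count > best)
            best = count;
    }
    return best;
}

/* "Window" view: the largest number of branches that fit entirely inside
   some 16 consecutive bytes, wherever those bytes start.
   Assumes b[] is sorted by start offset. */
static int max_per_window(const struct branch *b, size_t n)
{
    int best = 0;
    for (size_t i = 0; i < n; i++) {
        int count = 0;
        for (size_t j = i; j < n; j++)
            if (b[j].start + b[j].len - b[i].start <= 16)
                count++;
        if (count > best)
            best = count;
    }
    return best;
}
```

With the example offsets, no aligned 16-byte page holds more than 3 branches,
yet the 16 consecutive bytes starting at offset 0x0d hold 4, so the two views
can disagree about whether padding is needed.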
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
From: jakub at gcc dot gnu dot org @ 2009-05-15 16:25 UTC (permalink / raw)
To: gcc-bugs
------- Comment #41 from jakub at gcc dot gnu dot org 2009-05-15 16:24 -------
The 34 (resp. 51) cases of 4 branches in a 16-byte page with the 3 patches
together made me look at one of the cases which was wrong. The problem is that
get_attr_length (insn) returns too large a value for cmp $0x1d, %al, 3 instead
of 2, because GCC thinks it has a modrm byte when it does not.
Testing:
--- gcc/config/i386/i386.md.jj 2009-05-13 08:42:51.000000000 +0200
+++ gcc/config/i386/i386.md 2009-05-15 18:06:40.000000000 +0200
@@ -504,6 +504,9 @@
(and (eq_attr "type" "callv")
(match_operand 1 "constant_call_address_operand" ""))
(const_int 0)
+ (and (eq_attr "type" "alu,alu1,icmp,test")
+ (match_operand 0 "ax_reg_operand" ""))
+ (symbol_ref "(get_attr_length_immediate (insn) > (get_attr_mode (insn) != MODE_QI))")
]
(const_int 1)))
now on top of the 3 patches (without the 4th) to see what it does to code size
and number of 4+ branches in 16 byte page.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
From: jakub at gcc dot gnu dot org @ 2009-05-15 18:18 UTC (permalink / raw)
To: gcc-bugs
------- Comment #42 from jakub at gcc dot gnu dot org 2009-05-15 18:18 -------
Sizes with the #c41 patch together with the 3 patches mentioned in #c31 are:
0x890038 (64-bit) and 0x8ce08c (32-bit), 44 bad 16-byte pages in 64-bit, 35 in
32-bit.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
From: jakub at gcc dot gnu dot org @ 2009-05-15 18:23 UTC (permalink / raw)
To: gcc-bugs
------- Comment #43 from jakub at gcc dot gnu dot org 2009-05-15 18:23 -------
Some code size growth is from enlarged get_attr_modrm though, 292 bytes for
64-bit, 1338 bytes for 32-bit.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
From: hjl dot tools at gmail dot com @ 2009-05-15 23:06 UTC (permalink / raw)
To: gcc-bugs
------- Comment #44 from hjl dot tools at gmail dot com 2009-05-15 23:05 -------
(In reply to comment #41)
> The 34 (resp. 51) cases of 4 branches in a 16-byte page with the 3 patches
> together made me look at one of the cases which was wrong. The problem is that
> get_attr_length (insn) returns too large a value for cmp $0x1d, %al, 3 instead
> of 2, because GCC thinks it has a modrm byte when it does not.
> Testing:
> --- gcc/config/i386/i386.md.jj 2009-05-13 08:42:51.000000000 +0200
> +++ gcc/config/i386/i386.md 2009-05-15 18:06:40.000000000 +0200
> @@ -504,6 +504,9 @@
> (and (eq_attr "type" "callv")
> (match_operand 1 "constant_call_address_operand" ""))
> (const_int 0)
> + (and (eq_attr "type" "alu,alu1,icmp,test")
> + (match_operand 0 "ax_reg_operand" ""))
> + (symbol_ref "(get_attr_length_immediate (insn) > (get_attr_mode (insn) != MODE_QI))")
> ]
> (const_int 1)))
>
"cmp imm,%al/%ax/%eax/%rax" doesn't have the modrm byte. I think
this patch works better:
--- i386.md.branch 2009-05-15 11:30:42.000000000 -0700
+++ i386.md 2009-05-15 14:44:11.000000000 -0700
@@ -504,6 +504,10 @@
(and (eq_attr "type" "callv")
(match_operand 1 "constant_call_address_operand" ""))
(const_int 0)
+ (and (eq_attr "type" "alu,alu1,icmp,test")
+ (match_operand 0 "ax_reg_operand" "")
+ (match_operand 1 "immediate_operand" ""))
+ (const_int 0)
]
(const_int 1)))
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
From: jakub at gcc dot gnu dot org @ 2009-05-16 6:38 UTC (permalink / raw)
To: gcc-bugs
------- Comment #45 from jakub at gcc dot gnu dot org 2009-05-16 06:37 -------
cmpl $1, %eax does have the modrm byte:
83 f8 01 cmp $0x1,%eax
compared to cmpl $0xdeadbeef, %eax which doesn't have it:
3d ef be ad de cmp $0xdeadbeef,%eax
So I think what I wrote is more precise. The modrm byte is absent only if the
insn has an ax_reg_operand destination and an immediate source which hasn't
been shortened.
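The two encodings shown above can be reproduced with a small sketch in C. It
is only an illustration of the x86 encoding choice an assembler typically
makes for cmp with an immediate and %eax or %al; the function names are
hypothetical and this is not code from GCC or binutils.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode "cmp $imm, %eax" (32-bit operand size) into buf and return the
   instruction length.  *has_modrm is set to 1 if a modrm byte was emitted. */
static size_t encode_cmp_imm_eax(uint32_t imm, uint8_t buf[5], int *has_modrm)
{
    /* The immediate can be shortened if it is a sign-extended 8-bit value. */
    if (imm <= 0x7fu || imm >= 0xffffff80u) {
        buf[0] = 0x83;          /* group-1 opcode taking a sign-extended imm8 */
        buf[1] = 0xf8;          /* modrm: mod=11, reg=7 (cmp), rm=0 (%eax) */
        buf[2] = (uint8_t)imm;
        *has_modrm = 1;
        return 3;
    }
    buf[0] = 0x3d;              /* accumulator form with imm32: no modrm byte */
    for (int i = 0; i < 4; i++)
        buf[1 + i] = (uint8_t)(imm >> (8 * i));   /* little-endian immediate */
    *has_modrm = 0;
    return 5;
}

/* Encode "cmp $imm, %al": accumulator form with imm8, no modrm byte. */
static size_t encode_cmp_imm_al(uint8_t imm, uint8_t buf[2])
{
    buf[0] = 0x3c;
    buf[1] = imm;
    return 2;
}
```

For an immediate of 1 the first function produces 83 f8 01, and for
0xdeadbeef it produces 3d ef be ad de, matching the disassembly above. The
%al form is always 2 bytes, which is the length expected for cmp $0x1d, %al.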
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
From: jakub at gcc dot gnu dot org @ 2009-05-16 7:10 UTC (permalink / raw)
To: gcc-bugs
------- Comment #46 from jakub at gcc dot gnu dot org 2009-05-16 07:10 -------
Subject: Bug 39942
Author: jakub
Date: Sat May 16 07:09:52 2009
New Revision: 147606
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147606
Log:
PR target/39942
* config/i386/x86-64.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Don't emit second
.p2align 3 if MAX_SKIP is smaller than 7.
* config/i386/linux.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Likewise.
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/linux.h
trunk/gcc/config/i386/x86-64.h
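The rule in the log entry above can be sketched as follows. This is a
hypothetical C function, not the macro body from the commit (which is not
shown in this thread): only the "MAX_SKIP smaller than 7" condition comes from
the log; the function name, the buffer interface and the remaining conditions
are assumptions made for illustration.

```c
#include <stdio.h>

/* Write the alignment directives for a 2^log byte alignment that may skip
   at most max_skip bytes into out (size bytes).  Returns the number of
   characters written, not counting the terminating NUL. */
static int emit_align(char *out, size_t size, int log, int max_skip)
{
    if (size == 0)
        return 0;
    out[0] = '\0';
    if (log == 0)
        return 0;                       /* no alignment requested */
    if (max_skip == 0)                  /* assumed: 0 means "no skip limit" */
        return snprintf(out, size, "\t.p2align %d\n", log);

    int n = snprintf(out, size, "\t.p2align %d,,%d\n", log, max_skip);
    /* Fall back to 8-byte alignment only when the first directive may be
       skipped (max_skip cannot cover every case) and the fallback itself,
       which can need up to 7 bytes of padding, stays within max_skip. */
    if (n >= 0 && (size_t)n < size
        && log > 3 && (1 << log) > max_skip + 1 && max_skip >= 7)
        n += snprintf(out + n, size - (size_t)n, "\t.p2align 3\n");
    return n;
}
```

The reason for the threshold is arithmetic: .p2align 3 aligns to 8 bytes and
may therefore insert up to 7 bytes, so emitting it unconditionally could
exceed a max_skip smaller than 7.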
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
From: jakub at gcc dot gnu dot org @ 2009-05-16 7:12 UTC (permalink / raw)
To: gcc-bugs
------- Comment #47 from jakub at gcc dot gnu dot org 2009-05-16 07:12 -------
Subject: Bug 39942
Author: jakub
Date: Sat May 16 07:12:02 2009
New Revision: 147607
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147607
Log:
PR target/39942
* final.c (label_to_max_skip): New function.
(label_to_alignment): Only use LABEL_TO_ALIGNMENT if
CODE_LABEL_NUMBER <= max_labelno.
* output.h (label_to_max_skip): New prototype.
* config/i386/i386.c (ix86_avoid_jump_misspredicts): Renamed to...
(ix86_avoid_jump_mispredicts): ... this. Don't define if
ASM_OUTPUT_MAX_SKIP_ALIGN isn't defined. Update comment.
Handle CODE_LABELs with >= 16 byte alignment or with
max_skip == (1 << align) - 1.
(ix86_reorg): Don't call ix86_avoid_jump_mispredicts if
ASM_OUTPUT_MAX_SKIP_ALIGN isn't defined.
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/i386.c
trunk/gcc/final.c
trunk/gcc/output.h
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
From: hjl at gcc dot gnu dot org @ 2009-05-18 17:21 UTC (permalink / raw)
To: gcc-bugs
------- Comment #48 from hjl at gcc dot gnu dot org 2009-05-18 17:21 -------
Subject: Bug 39942
Author: hjl
Date: Mon May 18 17:21:13 2009
New Revision: 147671
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147671
Log:
2009-05-18 H.J. Lu <hongjiu.lu@intel.com>
PR target/39942
* config/i386/i386.c (ix86_avoid_jump_misspredicts): Replace
gen_align with gen_pad.
(ix86_reorg): Check ASM_OUTPUT_MAX_SKIP_PAD instead of
#ifdef ASM_OUTPUT_MAX_SKIP_ALIGN.
* config/i386/i386.h (ASM_OUTPUT_MAX_SKIP_PAD): New.
* config/i386/x86-64.h (ASM_OUTPUT_MAX_SKIP_PAD): Likewise.
* config/i386/i386.md (align): Renamed to ...
(pad): This. Replace ASM_OUTPUT_MAX_SKIP_ALIGN with
ASM_OUTPUT_MAX_SKIP_PAD.
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/i386.c
trunk/gcc/config/i386/i386.h
trunk/gcc/config/i386/i386.md
trunk/gcc/config/i386/x86-64.h
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
From: vvv at ru dot ru @ 2009-05-20 21:38 UTC (permalink / raw)
To: gcc-bugs
------- Comment #49 from vvv at ru dot ru 2009-05-20 21:38 -------
(In reply to comment #48)
How do these patches work? Are any special options required?
# /media/disk-1/B/bin/gcc --version
gcc (GCC) 4.5.0 20090520 (experimental)
# cat test.c
void f(int i)
{
if (i == 1) F(1);
if (i == 2) F(2);
if (i == 3) F(3);
if (i == 4) F(4);
if (i == 5) F(5);
}
extern int F(int m);
void func(int x)
{
int u = F(x);
while (u)
u = F(u)*3+1;
}
# /media/disk-1/B/bin/gcc -o t test.c -O2 -c -mtune=k8
# objdump -d t
0000000000000000 <f>:
0: 83 ff 01 cmp $0x1,%edi
3: 74 1b je 20 <f+0x20>
5: 83 ff 02 cmp $0x2,%edi
8: 74 16 je 20 <f+0x20>
a: 83 ff 03 cmp $0x3,%edi
d: 74 11 je 20 <f+0x20>
f: 83 ff 04 cmp $0x4,%edi
12: 74 0c je 20 <f+0x20>
14: 83 ff 05 cmp $0x5,%edi
17: 74 07 je 20 <f+0x20>
19: f3 c3 repz retq
1b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
20: 31 c0 xor %eax,%eax
22: e9 00 00 00 00 jmpq 27 <f+0x27>
27: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
2e: 00 00
0000000000000030 <func>:
30: 48 83 ec 08 sub $0x8,%rsp
34: e8 00 00 00 00 callq 39 <func+0x9>
39: 85 c0 test %eax,%eax
3b: 89 c7 mov %eax,%edi
3d: 74 0e je 4d <func+0x1d>
3f: 90 nop
40: e8 00 00 00 00 callq 45 <func+0x15>
45: 8d 7c 40 01 lea 0x1(%rax,%rax,2),%edi
49: 85 ff test %edi,%edi
4b: 75 f3 jne 40 <func+0x10>
4d: 48 83 c4 08 add $0x8,%rsp
51: c3 retq
I can't see any padding in function f :(
PS. In file config/i386/i386.c (ix86_avoid_jump_mispredicts)
/* Look for all minimal intervals of instructions containing 4 jumps.
...
This should say _branches_ (CALL, JMP, conditional branches, or returns), not jumps.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
From: jakub at gcc dot gnu dot org @ 2009-05-20 22:09 UTC (permalink / raw)
To: gcc-bugs
------- Comment #50 from jakub at gcc dot gnu dot org 2009-05-20 22:09 -------
Aren't nopl 0x0(%rax,%rax,1) and nopw 0x0(%rax,%rax,1) padding (though in this
case they have been added for label alignment or function entry alignment,
not to avoid 4+ jumps in one 16-byte page)?
Anyway, you want to look at both the -S output and objdump -d of the -c output;
there you'll see the needed .p2align added.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
From: jakub at gcc dot gnu dot org @ 2009-05-21 13:22 UTC (permalink / raw)
To: gcc-bugs
------- Comment #51 from jakub at gcc dot gnu dot org 2009-05-21 13:21 -------
Subject: Bug 39942
Author: jakub
Date: Thu May 21 13:21:30 2009
New Revision: 147765
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147765
Log:
PR target/39942
* config/i386/x86-64.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Don't emit second
.p2align 3 if MAX_SKIP is smaller than 7.
* config/i386/linux.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Likewise.
Modified:
branches/gcc-4_4-branch/gcc/ChangeLog
branches/gcc-4_4-branch/gcc/config/i386/linux.h
branches/gcc-4_4-branch/gcc/config/i386/x86-64.h
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
From: jakub at gcc dot gnu dot org @ 2009-05-21 13:26 UTC (permalink / raw)
To: gcc-bugs
------- Comment #52 from jakub at gcc dot gnu dot org 2009-05-21 13:26 -------
Subject: Bug 39942
Author: jakub
Date: Thu May 21 13:26:13 2009
New Revision: 147766
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147766
Log:
PR target/39942
* config/i386/x86-64.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Don't emit second
.p2align 3 if MAX_SKIP is smaller than 7.
* config/i386/linux.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Likewise.
Modified:
branches/gcc-4_3-branch/gcc/ChangeLog
branches/gcc-4_3-branch/gcc/config/i386/linux.h
branches/gcc-4_3-branch/gcc/config/i386/x86-64.h
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942