[Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug c/39942]  New: Nonoptimal code - leaveq; xchg   %ax,%ax; retq
@ 2009-04-28 12:20 vvv at ru dot ru
  2009-04-28 13:42 ` [Bug target/39942] " pinskia at gcc dot gnu dot org
                   ` (51 more replies)
  0 siblings, 52 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-04-28 12:20 UTC (permalink / raw)
  To: gcc-bugs

Sometimes we can see 2 bytes nop (xchg %ax,%ax) between leaveq and retq.
IMHO, better to remove xchg %ax,%ax

Examples from Kernel 2.6.29.1:

> gcc --version
gcc (SUSE Linux) 4.3.2 [gcc-4_3-branch revision 141291]
> objdump vmlinux
...
ffffffff804262e0 <set_blitting_type>:
ffffffff804262e0:       55                      push   %rbp
ffffffff804262e1:       0f b7 07                movzwl (%rdi),%eax
ffffffff804262e4:       4c 8b 86 d0 03 00 00    mov    0x3d0(%rsi),%r8
ffffffff804262eb:       48 c1 e0 07             shl    $0x7,%rax
ffffffff804262ef:       48 89 e5                mov    %rsp,%rbp
ffffffff804262f2:       48 05 40 1f 9c 80       add    $0xffffffff809c1f40,%rax
ffffffff804262f8:       49 89 80 90 01 00 00    mov    %rax,0x190(%r8)
ffffffff804262ff:       8b 46 04                mov    0x4(%rsi),%eax
ffffffff80426302:       89 c1                   mov    %eax,%ecx
ffffffff80426304:       81 e1 00 00 02 00       and    $0x20000,%ecx
ffffffff8042630a:       75 2c                   jne    ffffffff80426338
<set_blitting_type+0x58>
ffffffff8042630c:       48 8b 86 d0 03 00 00    mov    0x3d0(%rsi),%rax
ffffffff80426313:       4c 89 c7                mov    %r8,%rdi
ffffffff80426316:       48 8b 90 90 01 00 00    mov    0x190(%rax),%rdx
ffffffff8042631d:       8b 52 1c                mov    0x1c(%rdx),%edx
ffffffff80426320:       83 fa 03                cmp    $0x3,%edx
ffffffff80426323:       0f 4e ca                cmovle %edx,%ecx
ffffffff80426326:       89 88 b0 01 00 00       mov    %ecx,0x1b0(%rax)
ffffffff8042632c:       e8 2f 4d 00 00          callq  ffffffff8042b060
<fbcon_set_bitops>
ffffffff80426331:       c9                      leaveq 
ffffffff80426332:       c3                      retq   
ffffffff80426333:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffff80426338:       e8 23 61 00 00          callq  ffffffff8042c460
<fbcon_set_tileops>
ffffffff8042633d:       c9                      leaveq 
ffffffff8042633e:       66 90                   xchg   %ax,%ax
ffffffff80426340:       c3                      retq   

...
...

ffffffff8042b060 <fbcon_set_bitops>:
ffffffff8042b060:       55                      push   %rbp
ffffffff8042b061:       48 c7 07 d0 ad 42 80    movq  
$0xffffffff8042add0,(%rdi)
ffffffff8042b068:       8b 87 b0 01 00 00       mov    0x1b0(%rdi),%eax
ffffffff8042b06e:       48 89 e5                mov    %rsp,%rbp
ffffffff8042b071:       48 c7 47 08 30 ae 42    movq  
$0xffffffff8042ae30,0x8(%rdi)
ffffffff8042b078:       80 
ffffffff8042b079:       48 c7 47 10 c0 b7 42    movq  
$0xffffffff8042b7c0,0x10(%rdi)
ffffffff8042b080:       80 
ffffffff8042b081:       48 c7 47 18 10 af 42    movq  
$0xffffffff8042af10,0x18(%rdi)
ffffffff8042b088:       80 
ffffffff8042b089:       48 c7 47 20 10 b1 42    movq  
$0xffffffff8042b110,0x20(%rdi)
ffffffff8042b090:       80 
ffffffff8042b091:       48 c7 47 28 c0 b0 42    movq  
$0xffffffff8042b0c0,0x28(%rdi)
ffffffff8042b098:       80 
ffffffff8042b099:       48 c7 47 30 00 00 00    movq   $0x0,0x30(%rdi)
ffffffff8042b0a0:       00 
ffffffff8042b0a1:       85 c0                   test   %eax,%eax
ffffffff8042b0a3:       75 0b                   jne    ffffffff8042b0b0
<fbcon_set_bitops+0x50>
ffffffff8042b0a5:       c9                      leaveq 
ffffffff8042b0a6:       c3                      retq   
ffffffff8042b0a7:       66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
ffffffff8042b0ae:       00 00 
ffffffff8042b0b0:       e8 4b 15 00 00          callq  ffffffff8042c600
<fbcon_set_rotate>
ffffffff8042b0b5:       c9                      leaveq 
ffffffff8042b0b6:       66 90                   xchg   %ax,%ax
ffffffff8042b0b8:       c3                      retq


-- 
           Summary: Nonoptimal code - leaveq; xchg   %ax,%ax; retq
           Product: gcc
           Version: 4.3.2
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: vvv at ru dot ru


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
@ 2009-04-28 13:42 ` pinskia at gcc dot gnu dot org
  2009-04-28 17:05 ` vvv at ru dot ru
                   ` (50 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2009-04-28 13:42 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #1 from pinskia at gcc dot gnu dot org  2009-04-28 13:42 -------
Can you provide the preprocessed source which contains set_blitting_type?


-- 

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |WAITING
          Component|c                           |target
 GCC target triplet|                            |x86_64-linux-gnu
           Keywords|                            |missed-optimization


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
  2009-04-28 13:42 ` [Bug target/39942] " pinskia at gcc dot gnu dot org
@ 2009-04-28 17:05 ` vvv at ru dot ru
  2009-04-28 17:10 ` vvv at ru dot ru
                   ` (49 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-04-28 17:05 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #2 from vvv at ru dot ru  2009-04-28 17:04 -------
Created an attachment (id=17776)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17776&action=view)
Source file from Linx Kernel 2.6.29.1

See static void set_blitting_type


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
  2009-04-28 13:42 ` [Bug target/39942] " pinskia at gcc dot gnu dot org
  2009-04-28 17:05 ` vvv at ru dot ru
@ 2009-04-28 17:10 ` vvv at ru dot ru
  2009-04-28 17:15 ` vvv at ru dot ru
                   ` (48 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-04-28 17:10 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #3 from vvv at ru dot ru  2009-04-28 17:10 -------
Additional examples from Linux Kernel 2.6.29.1:
(Note: conditional statement at the end of all fuctions!) 
=================
linux/drivers/video/console/bitblit.c

void fbcon_set_bitops(struct fbcon_ops *ops)
{
        ops->bmove = bit_bmove;
        ops->clear = bit_clear;
        ops->putcs = bit_putcs;
        ops->clear_margins = bit_clear_margins;
        ops->cursor = bit_cursor;
        ops->update_start = bit_update_start;
        ops->rotate_font = NULL;

        if (ops->rotate)
                fbcon_set_rotate(ops);
}

================

ffffffff8020a5e0 <disable_TSC>:
ffffffff8020a5e0:       55                      push   %rbp
ffffffff8020a5e1:       bf 01 00 00 00          mov    $0x1,%edi
ffffffff8020a5e6:       48 89 e5                mov    %rsp,%rbp
ffffffff8020a5e9:       e8 c2 fd 35 00          callq  ffffffff8056a3b0
<add_preempt_count>
ffffffff8020a5ee:       65 48 8b 04 25 10 00    mov    %gs:0x10,%rax
ffffffff8020a5f5:       00 00 
ffffffff8020a5f7:       48 2d c8 1f 00 00       sub    $0x1fc8,%rax
ffffffff8020a5fd:       f0 0f ba 28 10          lock btsl $0x10,(%rax)
ffffffff8020a602:       19 d2                   sbb    %edx,%edx
ffffffff8020a604:       85 d2                   test   %edx,%edx
ffffffff8020a606:       75 0a                   jne    ffffffff8020a612
<disable_TSC+0x32>
ffffffff8020a608:       0f 20 e0                mov    %cr4,%rax
ffffffff8020a60b:       48 83 c8 04             or     $0x4,%rax
ffffffff8020a60f:       0f 22 e0                mov    %rax,%cr4
ffffffff8020a612:       bf 01 00 00 00          mov    $0x1,%edi
ffffffff8020a617:       e8 e4 fc 35 00          callq  ffffffff8056a300
<sub_preempt_count>
ffffffff8020a61c:       65 48 8b 04 25 10 00    mov    %gs:0x10,%rax
ffffffff8020a623:       00 00 
ffffffff8020a625:       f6 80 38 e0 ff ff 08    testb  $0x8,-0x1fc8(%rax)
ffffffff8020a62c:       75 02                   jne    ffffffff8020a630
<disable_TSC+0x50>
ffffffff8020a62e:       c9                      leaveq 
ffffffff8020a62f:       c3                      retq   
ffffffff8020a630:       e8 2b 99 35 00          callq  ffffffff80563f60
<preempt_schedule>
ffffffff8020a635:       c9                      leaveq 
ffffffff8020a636:       66 90                   xchg   %ax,%ax
ffffffff8020a638:       c3                      retq   

==================
/arch/x86/kernel/io_delay.c

void native_io_delay(void)
{
        switch (io_delay_type) {
        default:
        case CONFIG_IO_DELAY_TYPE_0X80:
                asm volatile ("outb %al, $0x80");
                break;
        case CONFIG_IO_DELAY_TYPE_0XED:
                asm volatile ("outb %al, $0xed");
                break;
        case CONFIG_IO_DELAY_TYPE_UDELAY:
                /*
                 * 2 usecs is an upper-bound for the outb delay but
                 * note that udelay doesn't have the bus-level
                 * side-effects that outb does, nor does udelay() have
                 * precise timings during very early bootup (the delays
                 * are shorter until calibrated):
                 */
                udelay(2);
        case CONFIG_IO_DELAY_TYPE_NONE:
                break;
        }
}
EXPORT_SYMBOL(native_io_delay);

ffffffff802131e0 <native_io_delay>:
ffffffff802131e0:       55                      push   %rbp
ffffffff802131e1:       8b 05 3d b3 54 00       mov    0x54b33d(%rip),%eax     
  # ffffffff8075e524 <io_delay_type>
ffffffff802131e7:       48 89 e5                mov    %rsp,%rbp
ffffffff802131ea:       83 f8 02                cmp    $0x2,%eax
ffffffff802131ed:       74 29                   je     ffffffff80213218
<native_io_delay+0x38>
ffffffff802131ef:       83 f8 03                cmp    $0x3,%eax
ffffffff802131f2:       74 06                   je     ffffffff802131fa
<native_io_delay+0x1a>
ffffffff802131f4:       ff c8                   dec    %eax
ffffffff802131f6:       74 10                   je     ffffffff80213208
<native_io_delay+0x28>
ffffffff802131f8:       e6 80                   out    %al,$0x80
ffffffff802131fa:       c9                      leaveq 
ffffffff802131fb:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffff80213200:       c3                      retq   
ffffffff80213201:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
ffffffff80213208:       e6 ed                   out    %al,$0xed
ffffffff8021320a:       c9                      leaveq 
ffffffff8021320b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffff80213210:       c3                      retq   
ffffffff80213211:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
ffffffff80213218:       bf 8e 21 00 00          mov    $0x218e,%edi
ffffffff8021321d:       0f 1f 00                nopl   (%rax)
ffffffff80213220:       e8 fb ac 1e 00          callq  ffffffff803fdf20
<__const_udelay>
ffffffff80213225:       c9                      leaveq 
ffffffff80213226:       66 90                   xchg   %ax,%ax
ffffffff80213228:       c3                      retq   

===============
arch/x86/mm/ioremap.c

int ioremap_change_attr(unsigned long vaddr, unsigned long size,
                               unsigned long prot_val)
{
        unsigned long nrpages = size >> PAGE_SHIFT;
        int err;

        switch (prot_val) {
        case _PAGE_CACHE_UC:
        default:
                err = _set_memory_uc(vaddr, nrpages);
                break;
        case _PAGE_CACHE_WC:
                err = _set_memory_wc(vaddr, nrpages);
                break;
        case _PAGE_CACHE_WB:
                err = _set_memory_wb(vaddr, nrpages);
                break;
        }

        return err;
}


ffffffff8022df60 <ioremap_change_attr>:
ffffffff8022df60:       55                      push   %rbp
ffffffff8022df61:       48 c1 ee 0c             shr    $0xc,%rsi
ffffffff8022df65:       48 89 e5                mov    %rsp,%rbp
ffffffff8022df68:       48 85 d2                test   %rdx,%rdx
ffffffff8022df6b:       75 0b                   jne    ffffffff8022df78
<ioremap_change_attr+0x18>
ffffffff8022df6d:       e8 2e 18 00 00          callq  ffffffff8022f7a0
<_set_memory_wb>
ffffffff8022df72:       c9                      leaveq 
ffffffff8022df73:       c3                      retq   
ffffffff8022df74:       0f 1f 40 00             nopl   0x0(%rax)
ffffffff8022df78:       48 83 fa 08             cmp    $0x8,%rdx
ffffffff8022df7c:       0f 1f 40 00             nopl   0x0(%rax)
ffffffff8022df80:       74 16                   je     ffffffff8022df98
<ioremap_change_attr+0x38>
ffffffff8022df82:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
ffffffff8022df88:       e8 53 19 00 00          callq  ffffffff8022f8e0
<_set_memory_uc>
ffffffff8022df8d:       c9                      leaveq 
ffffffff8022df8e:       66 90                   xchg   %ax,%ax
ffffffff8022df90:       c3                      retq   
ffffffff8022df91:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
ffffffff8022df98:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
ffffffff8022df9f:       00 
ffffffff8022dfa0:       e8 0b 19 00 00          callq  ffffffff8022f8b0
<_set_memory_wc>
ffffffff8022dfa5:       c9                      leaveq 
ffffffff8022dfa6:       66 90                   xchg   %ax,%ax
ffffffff8022dfa8:       c3                      retq   
ffffffff8022dfa9:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)

==============
kernel/sched.c

int sched_group_set_rt_period(struct task_group *tg, long rt_period_us)
{
        u64 rt_runtime, rt_period;

        rt_period = (u64)rt_period_us * NSEC_PER_USEC;
        rt_runtime = tg->rt_bandwidth.rt_runtime;

        if (rt_period == 0)
                return -EINVAL;

        return tg_set_bandwidth(tg, rt_period, rt_runtime);
}


ffffffff8023f810 <sched_group_set_rt_period>:
ffffffff8023f810:       55                      push   %rbp
ffffffff8023f811:       48 69 f6 e8 03 00 00    imul   $0x3e8,%rsi,%rsi
ffffffff8023f818:       48 89 e5                mov    %rsp,%rbp
ffffffff8023f81b:       48 8b 57 50             mov    0x50(%rdi),%rdx
ffffffff8023f81f:       b8 ea ff ff ff          mov    $0xffffffea,%eax
ffffffff8023f824:       48 85 f6                test   %rsi,%rsi
ffffffff8023f827:       75 07                   jne    ffffffff8023f830
<sched_group_set_rt_period+0x20>
ffffffff8023f829:       c9                      leaveq 
ffffffff8023f82a:       c3                      retq   
ffffffff8023f82b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffff8023f830:       e8 bb fe ff ff          callq  ffffffff8023f6f0
<tg_set_bandwidth>
ffffffff8023f835:       c9                      leaveq 
ffffffff8023f836:       66 90                   xchg   %ax,%ax
ffffffff8023f838:       c3                      retq   
ffffffff8023f839:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)

==============
kernel/sched_rt.c

static void pre_schedule_rt(struct rq *rq, struct task_struct *prev)
{
        /* Try to pull RT tasks here if we lower this rq's prio */
        if (unlikely(rt_task(prev)) && rq->rt.highest_prio > prev->prio)
                pull_rt_task(rq);
}

static void switched_from_rt(struct rq *rq, struct task_struct *p,
                           int running)
{
        /*
         * If there are other RT tasks then we will reschedule
         * and the scheduling of the other RT tasks will handle
         * the balancing. But if we are the last RT task
         * we may need to handle the pulling of RT tasks
         * now.
         */
        if (!rq->rt.rt_nr_running)
                pull_rt_task(rq);
}

ffffffff802452b0 <switched_from_rt>:
ffffffff802452b0:       55                      push   %rbp
ffffffff802452b1:       48 89 e5                mov    %rsp,%rbp
ffffffff802452b4:       48 83 bf 70 07 00 00    cmpq   $0x0,0x770(%rdi)
ffffffff802452bb:       00 
ffffffff802452bc:       74 02                   je     ffffffff802452c0
<switched_from_rt+0x10>
ffffffff802452be:       c9                      leaveq 
ffffffff802452bf:       c3                      retq   
ffffffff802452c0:       e8 6b fd ff ff          callq  ffffffff80245030
<pull_rt_task>
ffffffff802452c5:       c9                      leaveq 
ffffffff802452c6:       66 90                   xchg   %ax,%ax
ffffffff802452c8:       c3                      retq   
ffffffff802452c9:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)

ffffffff802452d0 <pre_schedule_rt>:
ffffffff802452d0:       55                      push   %rbp
ffffffff802452d1:       8b 46 20                mov    0x20(%rsi),%eax
ffffffff802452d4:       48 89 e5                mov    %rsp,%rbp
ffffffff802452d7:       83 f8 63                cmp    $0x63,%eax
ffffffff802452da:       7f 08                   jg     ffffffff802452e4
<pre_schedule_rt+0x14>
ffffffff802452dc:       39 87 78 07 00 00       cmp    %eax,0x778(%rdi)
ffffffff802452e2:       7f 0c                   jg     ffffffff802452f0
<pre_schedule_rt+0x20>
ffffffff802452e4:       c9                      leaveq 
ffffffff802452e5:       c3                      retq   
ffffffff802452e6:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
ffffffff802452ed:       00 00 00 
ffffffff802452f0:       e8 3b fd ff ff          callq  ffffffff80245030
<pull_rt_task>
ffffffff802452f5:       c9                      leaveq 
ffffffff802452f6:       66 90                   xchg   %ax,%ax
ffffffff802452f8:       c3                      retq   
ffffffff802452f9:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)

=============
linux/kernel/timer.c

void msleep(unsigned int msecs)
{
        unsigned long timeout = msecs_to_jiffies(msecs) + 1;

        while (timeout)
                timeout = schedule_timeout_uninterruptible(timeout);
}

EXPORT_SYMBOL(msleep);

ffffffff80256120 <msleep>:
ffffffff80256120:       55                      push   %rbp
ffffffff80256121:       48 89 e5                mov    %rsp,%rbp
ffffffff80256124:       e8 a7 94 ff ff          callq  ffffffff8024f5d0
<msecs_to_jiffies>
ffffffff80256129:       48 89 c7                mov    %rax,%rdi
ffffffff8025612c:       48 ff c7                inc    %rdi
ffffffff8025612f:       74 14                   je     ffffffff80256145
<msleep+0x25>
ffffffff80256131:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
ffffffff80256138:       e8 c3 ed 30 00          callq  ffffffff80564f00
<schedule_timeout_uninterruptible>
ffffffff8025613d:       48 89 c7                mov    %rax,%rdi
ffffffff80256140:       48 85 c0                test   %rax,%rax
ffffffff80256143:       75 f3                   jne    ffffffff80256138
<msleep+0x18>
ffffffff80256145:       c9                      leaveq 
ffffffff80256146:       66 90                   xchg   %ax,%ax
ffffffff80256148:       c3                      retq   
ffffffff80256149:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)

================
mm/shemem.c

static int shmem_xattr_security_get(struct inode *inode, const char *name,
                                    void *buffer, size_t size)
{
        if (strcmp(name, "") == 0)
                return -EINVAL;
        return xattr_getsecurity(inode, name, buffer, size);
}

static int shmem_xattr_security_set(struct inode *inode, const char *name,
                                    const void *value, size_t size, int flags)
{
        if (strcmp(name, "") == 0)
                return -EINVAL;
        return security_inode_setsecurity(inode, name, value, size, flags);
}

ffffffff802b9ff0 <shmem_xattr_security_set>:
ffffffff802b9ff0:       55                      push   %rbp
ffffffff802b9ff1:       b8 ea ff ff ff          mov    $0xffffffea,%eax
ffffffff802b9ff6:       48 89 e5                mov    %rsp,%rbp
ffffffff802b9ff9:       80 3e 00                cmpb   $0x0,(%rsi)
ffffffff802b9ffc:       75 02                   jne    ffffffff802ba000
<shmem_xattr_security_set+0x10>
ffffffff802b9ffe:       c9                      leaveq 
ffffffff802b9fff:       c3                      retq   
ffffffff802ba000:       e8 ab b1 0f 00          callq  ffffffff803b51b0
<security_inode_setsecurity>
ffffffff802ba005:       c9                      leaveq 
ffffffff802ba006:       66 90                   xchg   %ax,%ax
ffffffff802ba008:       c3                      retq   
ffffffff802ba009:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)

ffffffff802ba010 <shmem_xattr_security_get>:
ffffffff802ba010:       55                      push   %rbp
ffffffff802ba011:       b8 ea ff ff ff          mov    $0xffffffea,%eax
ffffffff802ba016:       48 89 e5                mov    %rsp,%rbp
ffffffff802ba019:       80 3e 00                cmpb   $0x0,(%rsi)
ffffffff802ba01c:       75 02                   jne    ffffffff802ba020
<shmem_xattr_security_get+0x10>
ffffffff802ba01e:       c9                      leaveq 
ffffffff802ba01f:       c3                      retq   
ffffffff802ba020:       e8 2b b5 04 00          callq  ffffffff80305550
<xattr_getsecurity>
ffffffff802ba025:       c9                      leaveq 
ffffffff802ba026:       66 90                   xchg   %ax,%ax
ffffffff802ba028:       c3                      retq   
ffffffff802ba029:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)

==========
linux/fs/file_table.c

void fput(struct file *file)
{
        if (atomic_long_dec_and_test(&file->f_count))
                __fput(file);
}

EXPORT_SYMBOL(fput);

ffffffff802e8da0 <fput>:
ffffffff802e8da0:       55                      push   %rbp
ffffffff802e8da1:       48 8d 47 28             lea    0x28(%rdi),%rax
ffffffff802e8da5:       48 89 e5                mov    %rsp,%rbp
ffffffff802e8da8:       f0 48 ff 08             lock decq (%rax)
ffffffff802e8dac:       0f 94 c2                sete   %dl
ffffffff802e8daf:       84 d2                   test   %dl,%dl
ffffffff802e8db1:       75 05                   jne    ffffffff802e8db8
<fput+0x18>
ffffffff802e8db3:       c9                      leaveq 
ffffffff802e8db4:       c3                      retq   
ffffffff802e8db5:       0f 1f 00                nopl   (%rax)
ffffffff802e8db8:       e8 03 fe ff ff          callq  ffffffff802e8bc0
<__fput>
ffffffff802e8dbd:       c9                      leaveq 
ffffffff802e8dbe:       66 90                   xchg   %ax,%ax
ffffffff802e8dc0:       c3                      retq   
ffffffff802e8dc1:       66 66 66 66 66 66 2e    nopw   %cs:0x0(%rax,%rax,1)
ffffffff802e8dc8:       0f 1f 84 00 00 00 00 
ffffffff802e8dcf:       00 


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (2 preceding siblings ...)
  2009-04-28 17:10 ` vvv at ru dot ru
@ 2009-04-28 17:15 ` vvv at ru dot ru
  2009-04-28 17:37 ` ubizjak at gmail dot com
                   ` (47 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-04-28 17:15 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #4 from vvv at ru dot ru  2009-04-28 17:15 -------
Created an attachment (id=17777)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17777&action=view)
Simple example from Linux

See two functons:
static void pre_schedule_rt
static void switched_from_rt


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (3 preceding siblings ...)
  2009-04-28 17:15 ` vvv at ru dot ru
@ 2009-04-28 17:37 ` ubizjak at gmail dot com
  2009-04-28 21:19 ` vvv at ru dot ru
                   ` (46 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: ubizjak at gmail dot com @ 2009-04-28 17:37 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #5 from ubizjak at gmail dot com  2009-04-28 17:37 -------
Unfortunately, all code snippets and dumps are of no use. Please see
http://gcc.gnu.org/bugs.html for the reason why.

As an exercise, please compile *standalone* _preprocessed_ source you will
create with -S added to your compile flags and count the number of .p2align
directives in the code stream.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (4 preceding siblings ...)
  2009-04-28 17:37 ` ubizjak at gmail dot com
@ 2009-04-28 21:19 ` vvv at ru dot ru
  2009-04-28 21:23 ` pinskia at gcc dot gnu dot org
                   ` (45 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-04-28 21:19 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #6 from vvv at ru dot ru  2009-04-28 21:18 -------
Let's compile file test.c
//#file test.c

extern int F(int m);

void func(int x)
{
        int u = F(x);
        while (u)
                u = F(u)*3+1;
}


# gcc -o t.out test.c -c -O2
# objdump -d t.out

t.out:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <func>:
   0:   48 83 ec 08             sub    $0x8,%rsp
   4:   e8 00 00 00 00          callq  9 <func+0x9>
   9:   85 c0                   test   %eax,%eax
   b:   89 c7                   mov    %eax,%edi
   d:   74 0e                   je     1d <func+0x1d>
   f:   90                      nop    
  10:   e8 00 00 00 00          callq  15 <func+0x15>
  15:   8d 7c 40 01             lea    0x1(%rax,%rax,2),%edi
  19:   85 ff                   test   %edi,%edi
  1b:   75 f3                   jne    10 <func+0x10>
  1d:   48 83 c4 08             add    $0x8,%rsp
  21:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)   <---- nonoptimal
  28:   c3                      retq   


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (5 preceding siblings ...)
  2009-04-28 21:19 ` vvv at ru dot ru
@ 2009-04-28 21:23 ` pinskia at gcc dot gnu dot org
  2009-04-28 21:47 ` ubizjak at gmail dot com
                   ` (44 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2009-04-28 21:23 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #7 from pinskia at gcc dot gnu dot org  2009-04-28 21:23 -------
4.1.2 produces:
.L4:
        addq    $8, %rsp
        .p2align 4,,2
        ret

While the trunk produces:
.L1:
        addq    $8, %rsp
        .p2align 4,,2
        .p2align 3
        ret


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (6 preceding siblings ...)
  2009-04-28 21:23 ` pinskia at gcc dot gnu dot org
@ 2009-04-28 21:47 ` ubizjak at gmail dot com
  2009-04-28 21:53 ` pinskia at gcc dot gnu dot org
                   ` (43 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: ubizjak at gmail dot com @ 2009-04-28 21:47 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #8 from ubizjak at gmail dot com  2009-04-28 21:47 -------
>From config/i386/i386.c:

/* AMD Athlon works faster
   when RET is not destination of conditional jump or directly preceded
   by other jump instruction.  We avoid the penalty by inserting NOP just
   before the RET instructions in such cases.  */
static void
ix86_pad_returns (void)
...


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (7 preceding siblings ...)
  2009-04-28 21:47 ` ubizjak at gmail dot com
@ 2009-04-28 21:53 ` pinskia at gcc dot gnu dot org
  2009-04-28 21:54 ` ubizjak at gmail dot com
                   ` (42 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2009-04-28 21:53 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #9 from pinskia at gcc dot gnu dot org  2009-04-28 21:52 -------
So that explains it, Use -Os or attribute cold if you want NOPs to be gone.


-- 

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |RESOLVED
         Resolution|                            |INVALID


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (8 preceding siblings ...)
  2009-04-28 21:53 ` pinskia at gcc dot gnu dot org
@ 2009-04-28 21:54 ` ubizjak at gmail dot com
  2009-04-29  7:46 ` vvv at ru dot ru
                   ` (41 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: ubizjak at gmail dot com @ 2009-04-28 21:54 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #10 from ubizjak at gmail dot com  2009-04-28 21:53 -------
Actually, alignment is from ix86_avoid_jump_misspredicts, where:

  /* Look for all minimal intervals of instructions containing 4 jumps.
     The intervals are bounded by START and INSN.  NBYTES is the total
     size of instructions in the interval including INSN and not including
     START.  When the NBYTES is smaller than 16 bytes, it is possible
     that the end of START and INSN ends up in the same 16byte page.

     The smallest offset in the page INSN can start is the case where START
     ends on the offset 0.  Offset of INSN is then NBYTES - sizeof (INSN).
     We add p2align to 16byte window with maxskip 17 - NBYTES + sizeof (INSN).
     */

So, this is by design. Use -Os if code size is important.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (9 preceding siblings ...)
  2009-04-28 21:54 ` ubizjak at gmail dot com
@ 2009-04-29  7:46 ` vvv at ru dot ru
  2009-04-29  7:55 ` vvv at ru dot ru
                   ` (40 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-04-29  7:46 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #11 from vvv at ru dot ru  2009-04-29 07:46 -------
(In reply to comment #8)
> From config/i386/i386.c:
> /* AMD Athlon works faster
>    when RET is not destination of conditional jump or directly preceded
>    by other jump instruction.  We avoid the penalty by inserting NOP just
>    before the RET instructions in such cases.  */
> static void
> ix86_pad_returns (void)
> ...

But I am using Core 2 Duo.
Why we see multibyte nop, not single byte nop?
Why if change line u = F(u)*3+1; to u = F(u)*4+1; or u = F(u); number of nops
changed?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (10 preceding siblings ...)
  2009-04-29  7:46 ` vvv at ru dot ru
@ 2009-04-29  7:55 ` vvv at ru dot ru
  2009-04-29  9:32 ` jakub at gcc dot gnu dot org
                   ` (39 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-04-29  7:55 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #12 from vvv at ru dot ru  2009-04-29 07:55 -------
(In reply to comment #9)
> So that explains it, Use -Os or attribute cold if you want NOPs to be gone.

But my measurements on Core 2 Duo P8600 show that

push %ebp
mov  %esp,%ebp
leave
ret

_faster_ then

push %ebp
mov  %esp,%ebp
leave
xchg %ax,%ax
ret


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (11 preceding siblings ...)
  2009-04-29  7:55 ` vvv at ru dot ru
@ 2009-04-29  9:32 ` jakub at gcc dot gnu dot org
  2009-04-29 10:13 ` jakub at gcc dot gnu dot org
                   ` (38 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-04-29  9:32 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #13 from jakub at gcc dot gnu dot org  2009-04-29 09:32 -------
You are benchmarking something completely unrelated.
What really matters is how code that has 4 branches/calls in one 16-byte block
is able to predict all those branches.  And Core2 similarly to various AMD CPUs
is not able to predict them well.

In the #c6 testcase it considers the je, call, jne and ret whether they can be
in a 16 byte block or not.  They can't, je is 2 bytes, call 5 bytes, leal 4
bytes (but gcc uses min_insn_size, which is 2 in this case), testl 2, jne 2,
addq 4 (but again, min_insn_size is 2 in this case).
min_insn_size seems to be very conservative, I guess teaching it about a bunch
of prefixes couldn't hurt, for non-jump/call insns ATM it estimates just the
displacement size, doesn't consider any prefixes (even those that really can't
change after machine reorg), etc.


-- 

jakub at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu dot
                   |                            |org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (12 preceding siblings ...)
  2009-04-29  9:32 ` jakub at gcc dot gnu dot org
@ 2009-04-29 10:13 ` jakub at gcc dot gnu dot org
  2009-04-29 19:17 ` vvv at ru dot ru
                   ` (37 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-04-29 10:13 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #14 from jakub at gcc dot gnu dot org  2009-04-29 10:12 -------
Also, couldn't we use the information computed by compute_alignments and
assume CODE_LABELs are aligned?
Probably would need to add label_to_max_skip (rtx) function to final.c,
so that not just label_to_alignment, but also LABEL_TO_MAX_SKIP is available to
backends.  Then when we know for the label in the testcase that
.p2align 4,,10
.p2align 3
then we know the 16 byte boundary is either at that label, or at most 5 bytes
before it, so all we need is consider any jumps/calls in the last 5 bytes
before the label.

For min_insn_size, is it possible to find out for which non-jump/call insns
get_attr_length might not be exact (i.e. be a maximum guess rather than
guaranteed size (though of course, this is also just an optimization, so 100%
guarantees aren't needed either))?  If so, we could use get_attr_length for the
insns where it is known to be exact...


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (13 preceding siblings ...)
  2009-04-29 10:13 ` jakub at gcc dot gnu dot org
@ 2009-04-29 19:17 ` vvv at ru dot ru
  2009-04-30  9:07 ` jakub at gcc dot gnu dot org
                   ` (36 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-04-29 19:17 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #15 from vvv at ru dot ru  2009-04-29 19:16 -------
One more example 5-bytes nop between leaveq and retq.

# cat test.c

void wait_for_enter()
{
        int u = getchar();
        while (!u)
                u = getchar()-13;
}
main()
{
wait_for_enter();
}

# gcc -o t.out test.c -O2 -march=core2 -fno-omit-frame-pointer
# objdump -d t.out
...
0000000000400540 <wait_for_enter>:
  400540:       55                      push   %rbp
  400541:       31 c0                   xor    %eax,%eax
  400543:       48 89 e5                mov    %rsp,%rbp
  400546:       e8 f5 fe ff ff          callq  400440 <getchar@plt>
  40054b:       85 c0                   test   %eax,%eax
  40054d:       75 13                   jne    400562 <wait_for_enter+0x22>
  40054f:       90                      nop    
  400550:       31 c0                   xor    %eax,%eax
  400552:       e8 e9 fe ff ff          callq  400440 <getchar@plt>
  400557:       83 f8 0d                cmp    $0xd,%eax
  40055a:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  400560:       74 ee                   je     400550 <wait_for_enter+0x10>
  400562:       c9                      leaveq 
  400563:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)  <--NONOPTIMAL!
  400568:       c3                      retq   
  400569:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)

0000000000400570 <main>:
  400570:       55                      push   %rbp
  400571:       31 c0                   xor    %eax,%eax
  400573:       48 89 e5                mov    %rsp,%rbp
  400576:       e8 c5 ff ff ff          callq  400540 <wait_for_enter>
  40057b:       c9                      leaveq 
  40057c:       c3                      retq   
  40057d:       90                      nop    
  40057e:       90                      nop    
  40057f:       90                      nop    

So bug unresolved.


-- 

vvv at ru dot ru changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |UNCONFIRMED
         Resolution|INVALID                     |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (14 preceding siblings ...)
  2009-04-29 19:17 ` vvv at ru dot ru
@ 2009-04-30  9:07 ` jakub at gcc dot gnu dot org
  2009-05-12 16:41 ` vvv at ru dot ru
                   ` (35 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-04-30  9:07 UTC (permalink / raw)
  To: gcc-bugs

------- Comment #16 from jakub at gcc dot gnu dot org  2009-04-30 09:07 -------
Created an attachment (id=17783)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17783&action=view)
gcc45-pr39942.patch

Patch that attempts to take into account .p2align directives that are emitted
for (some) CODE_LABELs and also the gen_align insns that the pass itself
inserts.  For a CODE_LABEL, say .p2align 16,,10 means either that the .p2align
directive starts a new 16 byte page (then insns before it are never
interesting), or nothing was skipped because more than 10 bytes would need to
be skipped.  But that means the current group could contain only 5 or less
bytes of instructions before the label, so again, we don't have to look at
instructions not in the last 5 bytes.

Another fix is that for MAX_SKIP < 7, ASM_OUTPUT_MAX_SKIP_ALIGN shouldn't emit
the second .p2align 3, which might (and often does) skip more than MAX_SKIP
bytes (up to 7).

-- 

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (15 preceding siblings ...)
  2009-04-30  9:07 ` jakub at gcc dot gnu dot org
@ 2009-05-12 16:41 ` vvv at ru dot ru
  2009-05-13  8:31 ` jakub at gcc dot gnu dot org
                   ` (34 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-05-12 16:41 UTC (permalink / raw)
  To: gcc-bugs

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 3551 bytes --]



------- Comment #17 from vvv at ru dot ru  2009-05-12 16:40 -------
(In reply to comment #16)
> Created an attachment (id=17783)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17783&action=view) [edit]
> gcc45-pr39942.patch
> Patch that attempts to take into account .p2align directives that are emitted
> for (some) CODE_LABELs and also the gen_align insns that the pass itself
> inserts.  For a CODE_LABEL, say .p2align 16,,10 means either that the .p2align
> directive starts a new 16 byte page (then insns before it are never
> interesting), or nothing was skipped because more than 10 bytes would need to
> be skipped.  But that means the current group could contain only 5 or less
> bytes of instructions before the label, so again, we don't have to look at
> instructions not in the last 5 bytes.
> Another fix is that for MAX_SKIP < 7, ASM_OUTPUT_MAX_SKIP_ALIGN shouldn't emit
> the second .p2align 3, which might (and often does) skip more than MAX_SKIP
> bytes (up to 7).

Nice path. Code looks better. It checked on Linux kernel 2.6.29.2.
But 2 notes:

1.There is no garanty that .p2align will be translated to NOPs. Example:

# cat test.c
void f(int i)
{
        if (i == 1) F(1);
        if (i == 2) F(2);
        if (i == 3) F(3);
        if (i == 4) F(4);
        if (i == 5) F(5);
}
# gcc -o test.s test.c -O2 -S
# cat test.s
        .file   "test.c"
        .text
        .p2align 4,,15
.globl f
        .type   f, @function
f:
.LFB0:
        .cfi_startproc
        cmpl    $1, %edi
        je      .L7
        cmpl    $2, %edi
        je      .L7
        cmpl    $3, %edi
        je      .L7
        cmpl    $4, %edi
        .p2align 4,,5    <------- attempt of padding
        je      .L7
        cmpl    $5, %edi
        je      .L7
        rep
        ret
        .p2align 4,,10
        .p2align 3
.L7:
        xorl    %eax, %eax
        jmp     F
        .cfi_endproc
.LFE0:
        .size   f, .-f
        .ident  "GCC: (GNU) 4.5.0 20090512 (experimental)"
        .section        .note.GNU-stack,"",@progbits

# gcc -o test.out test.s -O2 -c
# objdump -d test.out
0000000000000000 <f>:
   0:   83 ff 01                cmp    $0x1,%edi
   3:   74 1b                   je     20 <f+0x20>
   5:   83 ff 02                cmp    $0x2,%edi
   8:   74 16                   je     20 <f+0x20>
   a:   83 ff 03                cmp    $0x3,%edi
   d:   74 11                   je     20 <f+0x20>
   f:   83 ff 04                cmp    $0x4,%edi
  12:   74 0c                   je     20 <f+0x20>      <---- no NOP here 
  14:   83 ff 05                cmp    $0x5,%edi
  17:   74 07                   je     20 <f+0x20>
  19:   f3 c3                   repz retq 

IMHO, better to insert not .p2align, but NOPs directly. ( I mean line -
emit_insn_before (gen_align (GEN_INT (padsize)), insn); )

2. IMHO, it's bad idea to insert somthing between CMP and conditional jmp.
Quote from Intel 64 and IA-32 Architectures Optimization Reference Manual

>> 3.4.2.2       Optimizing for Macro-fusion
>> Macro-fusion merges two instructions to a single Î¼op. Intel Core Microarchitecture
>> performs this hardware optimization under limited circumstances.
>> The first instruction of the macro-fused pair must be a CMP or TEST instruction. This
>> instruction can be REG-REG, REG-IMM, or a micro-fused REG-MEM comparison. The
>> second instruction (adjacent in the instruction stream) should be a conditional
>> branch.

So if we need to insert NOPs, better to do it _before_ CMP.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (16 preceding siblings ...)
  2009-05-12 16:41 ` vvv at ru dot ru
@ 2009-05-13  8:31 ` jakub at gcc dot gnu dot org
  2009-05-13 11:43 ` vvv at ru dot ru
                   ` (33 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-13  8:31 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #18 from jakub at gcc dot gnu dot org  2009-05-13 08:30 -------
No, .p2align is the right thing to do, given that GCC doesn't have 100%
accurate information about instruction sizes (for e.g. inline asms it can't
have, for
stuff where branch shortening can decrease the size doesn't have it until the
shortening branch phase which is too late for this machine reorg, and in other
cases the lengths are just upper bounds).  Say .p2align 16,,5 says
insert a nop up to 5 bytes if you can reach the 16-byte boundary with it,
otherwise don't insert anything.  But that necessarily means that there were
less than 11 bytes in the same 16 byte page and if the lower bound insn size
estimation determined that in 11 bytes you can't have 3 branch changing
instructions, you are fine.  Breaking of fused compare and jump (32-bit code
only) is unfortunate, but inserting it before the cmp would mean often
unnecessarily large padding.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (17 preceding siblings ...)
  2009-05-13  8:31 ` jakub at gcc dot gnu dot org
@ 2009-05-13 11:43 ` vvv at ru dot ru
  2009-05-13 13:32 ` rguenth at gcc dot gnu dot org
                   ` (32 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-05-13 11:43 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #19 from vvv at ru dot ru  2009-05-13 11:42 -------
(In reply to comment #18)
> No, .p2align is the right thing to do, given that GCC doesn't have 100%
> accurate information about instruction sizes (for e.g. inline asms it can't
> have, for
> stuff where branch shortening can decrease the size doesn't have it until the
> shortening branch phase which is too late for this machine reorg, and in other
> cases the lengths are just upper bounds).  Say .p2align 16,,5 says
> insert a nop up to 5 bytes if you can reach the 16-byte boundary with it,
> otherwise don't insert anything.  But that necessarily means that there were
> less than 11 bytes in the same 16 byte page and if the lower bound insn size
> estimation determined that in 11 bytes you can't have 3 branch changing
> instructions, you are fine.  Breaking of fused compare and jump (32-bit code
> only) is unfortunate, but inserting it before the cmp would mean often
> unnecessarily large padding.

You are rigth, if padding required for every 16-byte page with 4 branches on
it. But Intel writes about "16-byte chunk", not "16-byte page".

Quote from Intel 64 and IA-32 Architectures Optimization Reference Manual:

Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put
more than four branches in a 16-byte chunk.

IMHO, here chunk - memory range from x to x+10h, where x - _any_ address. 


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (18 preceding siblings ...)
  2009-05-13 11:43 ` vvv at ru dot ru
@ 2009-05-13 13:32 ` rguenth at gcc dot gnu dot org
  2009-05-13 17:13 ` vvv at ru dot ru
                   ` (31 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-05-13 13:32 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #20 from rguenth at gcc dot gnu dot org  2009-05-13 13:31 -------
Instruction decoders generally operate on whole cache-lines, so 16-byte chunk
very very likely refers to a cache-line.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (19 preceding siblings ...)
  2009-05-13 13:32 ` rguenth at gcc dot gnu dot org
@ 2009-05-13 17:13 ` vvv at ru dot ru
  2009-05-13 18:22 ` ubizjak at gmail dot com
                   ` (30 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-05-13 17:13 UTC (permalink / raw)
  To: gcc-bugs

------- Comment #21 from vvv at ru dot ru  2009-05-13 17:13 -------
I guess! Your patch is absolutely correct for AMD AthlonTM 64 and AMD OpteronTM
processors, but it is nonoptimal for Intel processors. Because:

1. AMD limitation for 16-bytes page (memory range XXXXXXX0 - XXXXXXXF), but
Intel limitation for 16-bytes chunk  (memory range XXXXXXXX - XXXXXXXX+10h)
2. AMD - maximum of _THREE_ near branches (CALL, JMP, conditional branches, or
returns),
Intel -  maximum of _FOUR_ branches!

Quotation from Software Optimization Guide for AMD64 Processors

6.1          Density of Branches
When possible, align branches such that they do not cross a 16-byte boundary.

The AMD AthlonTM 64 and AMD OpteronTM processors have the capability to cache
branch-prediction history for a maximum of three near branches (CALL, JMP,
conditional branches, or returns) per 16-byte fetch window. A branch
instruction that crosses a 16-byte boundary is counted in the second 16-byte
window. Due to architectural restrictions, a branch that is split across a
16-byte
boundary cannot dispatch with any other instructions when it is predicted
taken. Perform this alignment by rearranging code; it is not beneficial to
align branches using padding sequences.

The following branches are limited to three per 16-byte window:

jcc   rel8
jcc   rel32
jmp   rel8
jmp   rel32
jmp   reg
jmp   WORD PTR
jmp   DWORD PTR
call  rel16
call  r/m16
call  rel32
call  r/m32

Coding more than three branches in the same 16-byte code window may lead to
conflicts in the branch target buffer. To avoid conflicts in the branch target
buffer, space out branches such that three or fewer exist in a given 16-byte
code window. For absolute optimal performance, try to limit branches to one per
16-byte code window. Avoid code sequences like the following:
ALIGN 16
label3:
     call   label1     ;  1st branch in 16-byte     code   window
     jc     label3     ;  2nd branch in 16-byte     code   window
     call   label2     ;  3rd branch in 16-byte     code   window
     jnz    label4     ;  4th branch in 16-byte     code   window
                       ;  Cannot be predicted.
If there is a jump table that contains many frequently executed branches, pad
the table entries to 8 bytes each to assure that there are never more than
three branches per 16-byte block of code.
Only branches that have been taken at least once are entered into the dynamic
branch prediction, and therefore only those branches count toward the
three-branch limit.

-- 

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (20 preceding siblings ...)
  2009-05-13 17:13 ` vvv at ru dot ru
@ 2009-05-13 18:22 ` ubizjak at gmail dot com
  2009-05-13 18:34 ` rguenth at gcc dot gnu dot org
                   ` (29 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: ubizjak at gmail dot com @ 2009-05-13 18:22 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #22 from ubizjak at gmail dot com  2009-05-13 18:22 -------
(In reply to comment #21)
> I guess! Your patch is absolutely correct for AMD AthlonTM 64 and AMD OpteronTM
> processors, but it is nonoptimal for Intel processors. Because:

...

CCing H.J for Intel optimization issues.


-- 

ubizjak at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hjl dot tools at gmail dot
                   |                            |com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (21 preceding siblings ...)
  2009-05-13 18:22 ` ubizjak at gmail dot com
@ 2009-05-13 18:34 ` rguenth at gcc dot gnu dot org
  2009-05-13 18:45 ` hjl dot tools at gmail dot com
                   ` (28 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-05-13 18:34 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #23 from rguenth at gcc dot gnu dot org  2009-05-13 18:34 -------
Note that we need something that works for the generic model as well, which in
this case looks like it is the same as for AMD models.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (22 preceding siblings ...)
  2009-05-13 18:34 ` rguenth at gcc dot gnu dot org
@ 2009-05-13 18:45 ` hjl dot tools at gmail dot com
  2009-05-13 18:57 ` vvv at ru dot ru
                   ` (27 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: hjl dot tools at gmail dot com @ 2009-05-13 18:45 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #24 from hjl dot tools at gmail dot com  2009-05-13 18:45 -------
Using padding to avoid 4 branches in 16byte chunk may not be a good idea since
it will increase code size.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (23 preceding siblings ...)
  2009-05-13 18:45 ` hjl dot tools at gmail dot com
@ 2009-05-13 18:57 ` vvv at ru dot ru
  2009-05-13 19:06 ` vvv at ru dot ru
                   ` (26 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-05-13 18:57 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #25 from vvv at ru dot ru  2009-05-13 18:56 -------
(In reply to comment #22)
> CCing H.J for Intel optimization issues.

VVV> 1. AMD limitation for 16-bytes page (memory range XXXXXXX0 - XXXXXXXF),
but
VVV> Intel limitation for 16-bytes chunk  (memory range XXXXXXXX -
XXXXXXXX+10h)

I have a doubt about this now. Sanks to Richard Guenther (Comment #20). So I am
going to make measurements for check it for Core2.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (24 preceding siblings ...)
  2009-05-13 18:57 ` vvv at ru dot ru
@ 2009-05-13 19:06 ` vvv at ru dot ru
  2009-05-13 19:09 ` jakub at gcc dot gnu dot org
                   ` (25 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-05-13 19:06 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #26 from vvv at ru dot ru  2009-05-13 19:05 -------
(In reply to comment #23)
> Note that we need something that works for the generic model as well, which in
> this case looks like it is the same as for AMD models.

There is processor property TARGET_FOUR_JUMP_LIMIT, may be create new one -
TARGET_FIVE_JUMP_LIMIT? 


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (25 preceding siblings ...)
  2009-05-13 19:06 ` vvv at ru dot ru
@ 2009-05-13 19:09 ` jakub at gcc dot gnu dot org
  2009-05-13 19:19 ` vvv at ru dot ru
                   ` (24 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-13 19:09 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #27 from jakub at gcc dot gnu dot org  2009-05-13 19:08 -------
If inserting the padding isn't worth it for say core2, m_CORE2 could be dropped
from X86_TUNE_FOUR_JUMP_LIMIT, but certainly it would be interesting to see
SPEC numbers backing that up.  Similarly for AMD CPUs, and if on at least one
of these it is beneficial, probably m_GENERIC should keep it, though with far
improved min_insn_size so that it doesn't trigger unnecessarily so often.

Vlad/Honza, could one of you SPEC test say current 4.4 with one (or both of):
http://gcc.gnu.org/ml/gcc-patches/2009-05/msg00702.html
http://gcc.gnu.org/ml/gcc-patches/2009-05/msg00703.html
on top of it (no need to compare clearly unneeded paddings to no paddings,
better compare only somewhat needed paddings against no paddings) compared with
X86_TUNE_FOUR_JUMP_LIMIT cleared for the used CPU?


-- 

jakub at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vmakarov at gcc dot gnu dot
                   |                            |org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (26 preceding siblings ...)
  2009-05-13 19:09 ` jakub at gcc dot gnu dot org
@ 2009-05-13 19:19 ` vvv at ru dot ru
  2009-05-13 21:44 ` hjl dot tools at gmail dot com
                   ` (23 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-05-13 19:19 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #28 from vvv at ru dot ru  2009-05-13 19:18 -------
(In reply to comment #24)
> Using padding to avoid 4 branches in 16byte chunk may not be a good idea since
> it will increase code size.
It's enough only one byte NOP per 16-byte chunk for padding. But, IMHO, four
branches in 16 byte chunk - is very-very infrequent. Especially for 64-bit
mode.

BTW, it's difficult to understand, what Intel mean ander term "branch". Is it
CALL, JMP, conditional branches, or returns (same as AMD), or only JMP and
conditional branches. I beleave last case right.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (27 preceding siblings ...)
  2009-05-13 19:19 ` vvv at ru dot ru
@ 2009-05-13 21:44 ` hjl dot tools at gmail dot com
  2009-05-14  9:01 ` vvv at ru dot ru
                   ` (22 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: hjl dot tools at gmail dot com @ 2009-05-13 21:44 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #29 from hjl dot tools at gmail dot com  2009-05-13 21:44 -------
Created an attachment (id=17858)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17858&action=view)
Impact of X86_TUNE_FOUR_JUMP_LIMIT on SPEC CPU 2K

This is my old data of X86_TUNE_FOUR_JUMP_LIMIT on Penryn and Nehalem.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (28 preceding siblings ...)
  2009-05-13 21:44 ` hjl dot tools at gmail dot com
@ 2009-05-14  9:01 ` vvv at ru dot ru
  2009-05-14 15:16 ` jakub at gcc dot gnu dot org
                   ` (21 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-05-14  9:01 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #30 from vvv at ru dot ru  2009-05-14 09:01 -------
Created an attachment (id=17863)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17863&action=view)
Testing tool.

Here is results of my testing.
Code:
align   128
test_cikl:
        rept 14         ; 14 if SH=0, 15 if SH=1, 16 if SH=2
        {
        nop
        }
         cmp  al,0           ; 2 bytes

         jz   $+10h+NOPS     ; 2 bytes offset=xxxx0
         cmp  al,1           ; 2 bytes offset=xxxx2
         jz   $+0Ch+NOPS     ; 2 bytes offset=xxxx4
         cmp  al,2           ; 2 bytes offset=xxxx6
         jz   $+08h+NOPS     ; 2 bytes offset=xxxx8
         cmp  al,3           ; 2 bytes offset=xxxxA
match =1, NOPS
{
   nop
}
match =2, NOPS
{
   xchg eax,eax         ; 2-bytes NOP
}
         jz   $+04h          ; 2 bytes offset=xxxxC
         ja   $+02h          ; 2 bytes offset=xxxxE

         mov  eax,ecx
         and  eax,7h
         loop test_cikl

This code tested on Core2,Xeon and P4 CPU. Results in RDTSC ticks.

; Core 2 Duo
;    NOPS/tick/Max  NOPS/tick/Max    NOPS/tick/Max
; SH=0  0/571/729      1/306/594       2/315/630
; SH=1  0/338/612      1/338/648       2/339/648
; SH=2  0/339/666      1/339/675       2/333/693

; Xeon 3110
;    NOPS/tick/Max  NOPS/tick/Max    NOPS/tick/Max
; SH=0  0/586/693      1/310/675       2/310/675
; SH=1  0/333/657      1/330/648       2/464/630
; SH=2  0/333/657      1/470/594       2/474/603

; P4
;    NOPS/tick/Max  NOPS/tick/Max    NOPS/tick/Max
; SH=0 0/1027/1317     1/1094/1258     2/1028/1207
; SH=1 0/1151/1377     1/1068/1352     2/902/1275
; SH=2 0/1124/1275     1/1148/1335     2/979/1139

Conclusion:
1. Core2 and Xeon - similar results. P4 - something strange.
For Core2 & Xeon padding very effective. Code with padding almoust 2 times
faster. No sence for P4?
2. My previous sentence

VVV> 1. AMD limitation for 16-bytes page (memory range XXX0 - XXXF),but
VVV> Intel limitation for 16-bytes chunk  (memory range XXXX - XXXX+10h)

is wrong. At leat for Core2 & Xeon. For this CPU "16-bytes chunk" means
memory range XXX0 - XXXF.

Unfortunately, I can't test AMD.

PS. My testing tool in attachmen. It start under MSDOS, switch to 32-bit mode,
switch to 64-bit mode and measure rdtsc ticks for test code.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (29 preceding siblings ...)
  2009-05-14  9:01 ` vvv at ru dot ru
@ 2009-05-14 15:16 ` jakub at gcc dot gnu dot org
  2009-05-14 15:58 ` hjl dot tools at gmail dot com
                   ` (20 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-14 15:16 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #31 from jakub at gcc dot gnu dot org  2009-05-14 15:15 -------
Some -O2 code size data from today's trunk bootstraps.  The first .text line
is always vanilla bootstrap, the second one with
http://gcc.gnu.org/ml/gcc-patches/2009-05/msg00702.html
only, the third one with that patch and
http://gcc.gnu.org/ml/gcc-patches/2009-05/msg00703.html
and the last with additional:
--- i386.c.jj3  2009-05-14 12:41:24.000000000 +0200
+++ i386.c      2009-05-14 14:48:24.000000000 +0200
@@ -27202,7 +27202,7 @@ x86_function_profiler (FILE *file, int l
 static int
 min_insn_size (rtx insn)
 {
-  int l = 0;
+  int l = 0, len;

   if (!INSN_P (insn) || !active_insn_p (insn))
     return 0;
@@ -27222,7 +27222,8 @@ min_insn_size (rtx insn)
       && symbolic_reference_mentioned_p (PATTERN (insn))
       && !SIBLING_CALL_P (insn))
     return 5;
-  if (get_attr_length (insn) <= 1)
+  len = get_attr_length (insn);
+  if (len <= 1)
     return 1;

   /* For normal instructions we may rely on the sizes of addresses
@@ -27230,6 +27231,9 @@ min_insn_size (rtx insn)
      This is not the case for jumps where references are PC relative.  */
   if (!JUMP_P (insn))
     {
+      if (get_attr_type (insn) != TYPE_MULTI)
+       return len;
+
       l = get_attr_length_address (insn);
       if (l < 4 && symbolic_reference_mentioned_p (PATTERN (insn)))
        l = 4;
to see how the code size changes with much more accurate (though sometimes not
minimum but maximum bound) insn sizing for the algorithm.
64-bit cc1plus
  [12] .text             PROGBITS        000000000047f990 07f990 8c3ba8 00  AX 
0   0 16
  [12] .text             PROGBITS        000000000047f990 07f990 89b1e8 00  AX 
0   0 16
  [12] .text             PROGBITS        000000000047f9c0 07f9c0 899f78 00  AX 
0   0 16
  [12] .text             PROGBITS        000000000047f9c0 07f9c0 88eaf8 00  AX 
0   0 16
32-bit cc1plus
  [12] .text             PROGBITS        080b24e0 06a4e0 8f8cac 00  AX  0   0
16
  [12] .text             PROGBITS        080b24e0 06a4e0 8d516c 00  AX  0   0
16
  [12] .text             PROGBITS        080b2510 06a510 8d507c 00  AX  0   0
16
  [12] .text             PROGBITS        080b2510 06a510 8cbd7c 00  AX  0   0
16
For 64-bit cc1plus that's 1.8%, 1.86%, 2.36% smaller binary with the 1, 2 resp.
3 patches, for 32-bit cc1plus 1.55%, 1.56%, 1.96% smaller binary.
So the first patch is the most important and something like the third one,
perhaps with more exceptions, also makes a difference.  I'll now try to update
my awk script to check for the AMD rules, namely that the last byte of the
branch insn counts rather than the first.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (30 preceding siblings ...)
  2009-05-14 15:16 ` jakub at gcc dot gnu dot org
@ 2009-05-14 15:58 ` hjl dot tools at gmail dot com
  2009-05-14 18:37 ` hjl dot tools at gmail dot com
                   ` (19 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: hjl dot tools at gmail dot com @ 2009-05-14 15:58 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #32 from hjl dot tools at gmail dot com  2009-05-14 15:58 -------
(In reply to comment #30)
> Created an attachment (id=17863)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17863&action=view) [edit]
> Testing tool.
> 

Please make sure that you only test nop paddings for branch insns,
not nop paddings for branch targets, which prefer 16byte alignment.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (31 preceding siblings ...)
  2009-05-14 15:58 ` hjl dot tools at gmail dot com
@ 2009-05-14 18:37 ` hjl dot tools at gmail dot com
  2009-05-14 19:44 ` vvv at ru dot ru
                   ` (18 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: hjl dot tools at gmail dot com @ 2009-05-14 18:37 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #33 from hjl dot tools at gmail dot com  2009-05-14 18:37 -------
(In reply to comment #20)
> Instruction decoders generally operate on whole cache-lines, so 16-byte chunk
> very very likely refers to a cache-line.
> 

That is true. For Intel CPUs, "16-bytes chunk" means memory range XXX0 - XXXF.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (32 preceding siblings ...)
  2009-05-14 18:37 ` hjl dot tools at gmail dot com
@ 2009-05-14 19:44 ` vvv at ru dot ru
  2009-05-15  2:23 ` hjl dot tools at gmail dot com
                   ` (17 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-05-14 19:44 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #34 from vvv at ru dot ru  2009-05-14 19:43 -------
(In reply to comment #32)
> Please make sure that you only test nop paddings for branch insns,
> not nop paddings for branch targets, which prefer 16byte alignment.

Additional tests (for Core2) results:
1. Execution time don't depend on paddings for branch target.
2. Execution time don't depend on position of NOP within 16-byte chunk with 4
branch. Even if NOP inserted between CMP and conditional jump.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (33 preceding siblings ...)
  2009-05-14 19:44 ` vvv at ru dot ru
@ 2009-05-15  2:23 ` hjl dot tools at gmail dot com
  2009-05-15  4:32 ` hjl dot tools at gmail dot com
                   ` (16 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: hjl dot tools at gmail dot com @ 2009-05-15  2:23 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #35 from hjl dot tools at gmail dot com  2009-05-15 02:23 -------
Created an attachment (id=17870)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17870&action=view)
A patch

This patch limits 3 branches per 16byte page.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (34 preceding siblings ...)
  2009-05-15  2:23 ` hjl dot tools at gmail dot com
@ 2009-05-15  4:32 ` hjl dot tools at gmail dot com
  2009-05-15  7:56 ` jakub at gcc dot gnu dot org
                   ` (15 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: hjl dot tools at gmail dot com @ 2009-05-15  4:32 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #36 from hjl dot tools at gmail dot com  2009-05-15 04:32 -------
Created an attachment (id=17871)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17871&action=view)
An updated patch

A few comments:

1. 3 branch limit is per 16byte page, not 16byte window.
2. We should allow 3 branches in a 16byte page.
3. When we have 4 branches in a 16byte page, we only need to
pad to the 16byte page boundary before the 4th branch, which
will start at the next 16byte page.


-- 

hjl dot tools at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #17870|0                           |1
        is obsolete|                            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (35 preceding siblings ...)
  2009-05-15  4:32 ` hjl dot tools at gmail dot com
@ 2009-05-15  7:56 ` jakub at gcc dot gnu dot org
  2009-05-15 12:11 ` jakub at gcc dot gnu dot org
                   ` (14 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-15  7:56 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #37 from jakub at gcc dot gnu dot org  2009-05-15 07:56 -------
This patch looks very wrong.  It assumes that min_insn_size gives exact insn
sizes (current min_insn_size is very far from that, but even get_attr_length
isn't exact), doesn't take into account label alignments nor branch shortening
which can change the insn sizes afterwards and assumes that a p2align always
aligns to 16 bytes (it does not).
While the previous algorithm works with estimated 16 consecutive bytes rather
than 16 byte pages 0xXXXX0 ... 0xXXXF, that's because during machine reorg
you simply can't know in most cases where exactly the 16 byte page will start,
so you assume it can start (almost) anywhere (and use .p2align behavior to
align when needed).


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (36 preceding siblings ...)
  2009-05-15  7:56 ` jakub at gcc dot gnu dot org
@ 2009-05-15 12:11 ` jakub at gcc dot gnu dot org
  2009-05-15 12:12 ` jakub at gcc dot gnu dot org
                   ` (13 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-15 12:11 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #38 from jakub at gcc dot gnu dot org  2009-05-15 12:11 -------
To extend #c31, I've also built the same tree with another patch which made
sure ix86_avoid_jump_mispredicts is never called (change "&& optimize" into "&&
optimize > 4" in ix86_reorg).  cc1plus sizes were then
0x88d6d8 bytes for 64-bit cc1plus and 0x8c980c bytes for 32-bit cc1plus.
That is 2.42% smaller than vanilla resp. 2.06%.

I've also changed my testing script, so that it (hopefully, point me to errors)
follows the AMD rules more closely (namely that the last byte in the branch
insn counts, not the first one), will attach.
With that script, I got following number of violations for 64-bit cc1plus
(vanilla, +1patch, +2patches, +3patches, +4patches):
6, 7, 6, 51, 138
and for 32-bit cc1plus:
1, 2, 3, 34, 159.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (37 preceding siblings ...)
  2009-05-15 12:11 ` jakub at gcc dot gnu dot org
@ 2009-05-15 12:12 ` jakub at gcc dot gnu dot org
  2009-05-15 14:35 ` hjl dot tools at gmail dot com
                   ` (12 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-15 12:12 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #39 from jakub at gcc dot gnu dot org  2009-05-15 12:12 -------
Created an attachment (id=17874)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17874&action=view)
test4jmp.sh


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (38 preceding siblings ...)
  2009-05-15 12:12 ` jakub at gcc dot gnu dot org
@ 2009-05-15 14:35 ` hjl dot tools at gmail dot com
  2009-05-15 16:25 ` jakub at gcc dot gnu dot org
                   ` (11 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: hjl dot tools at gmail dot com @ 2009-05-15 14:35 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #40 from hjl dot tools at gmail dot com  2009-05-15 14:35 -------
(In reply to comment #37)
> This patch looks very wrong.  It assumes that min_insn_size gives exact insn
> sizes (current min_insn_size is very far from that, but even get_attr_length
> isn't exact), doesn't take into account label alignments nor branch shortening
> which can change the insn sizes afterwards and assumes that a p2align always
> aligns to 16 bytes (it does not).
> While the previous algorithm works with estimated 16 consecutive bytes rather
> than 16 byte pages 0xXXXX0 ... 0xXXXF, that's because during machine reorg
> you simply can't know in most cases where exactly the 16 byte page will start,
> so you assume it can start (almost) anywhere (and use .p2align behavior to
> align when needed).
> 

There is no perfect solution here. Let's list pros/cons:

The current algorithm:

pros:

1. Very conservative. Catch most of 4 branches in 16byte windows.

Cons:

1. It works on 16byte window, not 16byte page.
2. When it gets wrong, it increases code sizes by adding unnecessary
nops.

My proposal:

Pros:

1. Work on 16byte page.
2. Even if it gets wrong, it doesn't increase code size.

Cons:

1. Rely on inaccurate instruction length data.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (39 preceding siblings ...)
  2009-05-15 14:35 ` hjl dot tools at gmail dot com
@ 2009-05-15 16:25 ` jakub at gcc dot gnu dot org
  2009-05-15 18:18 ` jakub at gcc dot gnu dot org
                   ` (10 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-15 16:25 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #41 from jakub at gcc dot gnu dot org  2009-05-15 16:24 -------
The 34 resp. 51 4 branches in 16 byte page with the 3 patches together made me
look at one of the cases which was wrong and the problem is that cmp $0x1d, %al
has too large get_attr_lenght (insn) returned, 3 instead of 2, because GCC
thinks it has modrm byte when it has not.
Testing:
--- gcc/config/i386/i386.md.jj2009-05-13 08:42:51.000000000 +0200
+++ gcc/config/i386/i386.md2009-05-15 18:06:40.000000000 +0200
@@ -504,6 +504,9 @@
  (and (eq_attr "type" "callv")
       (match_operand 1 "constant_call_address_operand" ""))
      (const_int 0)
+ (and (eq_attr "type" "alu,alu1,icmp,test")
+      (match_operand 0 "ax_reg_operand" ""))
+     (symbol_ref "(get_attr_length_immediate (insn) > (get_attr_mode (insn) !=
MODE_QI))")
  ]
  (const_int 1)))

now on top of the 3 patches (without the 4th) to see what it does to code size
and number of 4+ branches in 16 byte page.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (40 preceding siblings ...)
  2009-05-15 16:25 ` jakub at gcc dot gnu dot org
@ 2009-05-15 18:18 ` jakub at gcc dot gnu dot org
  2009-05-15 18:23 ` jakub at gcc dot gnu dot org
                   ` (9 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-15 18:18 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #42 from jakub at gcc dot gnu dot org  2009-05-15 18:18 -------
Sizes with the #c41 patch together with the 3 patches mentioned in #c31 are:
0x890038 (64-bit) and 0x8ce08c (32-bit), 44 bad 16-byte pages in 64-bit, 35 in
32-bit.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (41 preceding siblings ...)
  2009-05-15 18:18 ` jakub at gcc dot gnu dot org
@ 2009-05-15 18:23 ` jakub at gcc dot gnu dot org
  2009-05-15 23:06 ` hjl dot tools at gmail dot com
                   ` (8 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-15 18:23 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #43 from jakub at gcc dot gnu dot org  2009-05-15 18:23 -------
Some code size growth is from enlarged get_attr_modrm though, 292 bytes for
64-bit, 1338 bytes for 32-bit.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (42 preceding siblings ...)
  2009-05-15 18:23 ` jakub at gcc dot gnu dot org
@ 2009-05-15 23:06 ` hjl dot tools at gmail dot com
  2009-05-16  6:38 ` jakub at gcc dot gnu dot org
                   ` (7 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: hjl dot tools at gmail dot com @ 2009-05-15 23:06 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #44 from hjl dot tools at gmail dot com  2009-05-15 23:05 -------
(In reply to comment #41)
> The 34 resp. 51 4 branches in 16 byte page with the 3 patches together made me
> look at one of the cases which was wrong and the problem is that cmp $0x1d, %al
> has too large get_attr_lenght (insn) returned, 3 instead of 2, because GCC
> thinks it has modrm byte when it has not.
> Testing:
> --- gcc/config/i386/i386.md.jj2009-05-13 08:42:51.000000000 +0200
> +++ gcc/config/i386/i386.md2009-05-15 18:06:40.000000000 +0200
> @@ -504,6 +504,9 @@
>   (and (eq_attr "type" "callv")
>        (match_operand 1 "constant_call_address_operand" ""))
>       (const_int 0)
> + (and (eq_attr "type" "alu,alu1,icmp,test")
> +      (match_operand 0 "ax_reg_operand" ""))
> +     (symbol_ref "(get_attr_length_immediate (insn) > (get_attr_mode (insn) !=
> MODE_QI))")
>   ]
>   (const_int 1)))
> 

"cmp imm,%al/%ax/%eax/%rax" doesn't have the modrm byte. I think
this patch works better:

--- i386.md.branch      2009-05-15 11:30:42.000000000 -0700
+++ i386.md     2009-05-15 14:44:11.000000000 -0700
@@ -504,6 +504,10 @@
         (and (eq_attr "type" "callv")
              (match_operand 1 "constant_call_address_operand" ""))
             (const_int 0)
+        (and (eq_attr "type" "alu,alu1,icmp,test")
+             (match_operand 0 "ax_reg_operand" "")
+             (match_operand 1 "immediate_operand" ""))
+            (const_int 0)
         ]
         (const_int 1)))



-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (43 preceding siblings ...)
  2009-05-15 23:06 ` hjl dot tools at gmail dot com
@ 2009-05-16  6:38 ` jakub at gcc dot gnu dot org
  2009-05-16  7:10 ` jakub at gcc dot gnu dot org
                   ` (6 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-16  6:38 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #45 from jakub at gcc dot gnu dot org  2009-05-16 06:37 -------
cmpl $1, %eax does have the modrm byte:
83 f8 01             cmp    $0x1,%eax
compared to cmpl $0xdeadbeef, $eax which doesn't have it:
3d ef be ad de       cmp    $0xdeadbeef,%eax
So I think what I wrote is more precise.  modrm byte is there if the insn has
ax_reg_operand destination and immediate source which hasn't been shortened.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (44 preceding siblings ...)
  2009-05-16  6:38 ` jakub at gcc dot gnu dot org
@ 2009-05-16  7:10 ` jakub at gcc dot gnu dot org
  2009-05-16  7:12 ` jakub at gcc dot gnu dot org
                   ` (5 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-16  7:10 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #46 from jakub at gcc dot gnu dot org  2009-05-16 07:10 -------
Subject: Bug 39942

Author: jakub
Date: Sat May 16 07:09:52 2009
New Revision: 147606

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147606
Log:
        PR target/39942
        * config/i386/x86-64.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Don't emit second
        .p2align 3 if MAX_SKIP is smaller than 7.
        * config/i386/linux.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Likewise.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/linux.h
    trunk/gcc/config/i386/x86-64.h


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (45 preceding siblings ...)
  2009-05-16  7:10 ` jakub at gcc dot gnu dot org
@ 2009-05-16  7:12 ` jakub at gcc dot gnu dot org
  2009-05-18 17:21 ` hjl at gcc dot gnu dot org
                   ` (4 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-16  7:12 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #47 from jakub at gcc dot gnu dot org  2009-05-16 07:12 -------
Subject: Bug 39942

Author: jakub
Date: Sat May 16 07:12:02 2009
New Revision: 147607

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147607
Log:
        PR target/39942
        * final.c (label_to_max_skip): New function.
        (label_to_alignment): Only use LABEL_TO_ALIGNMENT if
        CODE_LABEL_NUMBER <= max_labelno.
        * output.h (label_to_max_skip): New prototype.
        * config/i386/i386.c (ix86_avoid_jump_misspredicts): Renamed to...
        (ix86_avoid_jump_mispredicts): ... this.  Don't define if
        ASM_OUTPUT_MAX_SKIP_ALIGN isn't defined.  Update comment.
        Handle CODE_LABELs with >= 16 byte alignment or with
        max_skip == (1 << align) - 1.
        (ix86_reorg): Don't call ix86_avoid_jump_mispredicts if
        ASM_OUTPUT_MAX_SKIP_ALIGN isn't defined.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/final.c
    trunk/gcc/output.h


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (46 preceding siblings ...)
  2009-05-16  7:12 ` jakub at gcc dot gnu dot org
@ 2009-05-18 17:21 ` hjl at gcc dot gnu dot org
  2009-05-20 21:38 ` vvv at ru dot ru
                   ` (3 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: hjl at gcc dot gnu dot org @ 2009-05-18 17:21 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #48 from hjl at gcc dot gnu dot org  2009-05-18 17:21 -------
Subject: Bug 39942

Author: hjl
Date: Mon May 18 17:21:13 2009
New Revision: 147671

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147671
Log:
2009-05-18  H.J. Lu  <hongjiu.lu@intel.com>

        PR target/39942
        * config/i386/i386.c (ix86_avoid_jump_misspredicts): Replace
        gen_align with gen_pad.
        (ix86_reorg): Check ASM_OUTPUT_MAX_SKIP_PAD instead of
        #ifdef ASM_OUTPUT_MAX_SKIP_ALIGN.

        * config/i386/i386.h (ASM_OUTPUT_MAX_SKIP_PAD): New.
        * config/i386/x86-64.h (ASM_OUTPUT_MAX_SKIP_PAD): Likewise.

        * config/i386/i386.md (align): Renamed to ...
        (pad): This.  Replace ASM_OUTPUT_MAX_SKIP_ALIGN with
        ASM_OUTPUT_MAX_SKIP_PAD.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/i386.h
    trunk/gcc/config/i386/i386.md
    trunk/gcc/config/i386/x86-64.h


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (47 preceding siblings ...)
  2009-05-18 17:21 ` hjl at gcc dot gnu dot org
@ 2009-05-20 21:38 ` vvv at ru dot ru
  2009-05-20 22:09 ` jakub at gcc dot gnu dot org
                   ` (2 subsequent siblings)
  51 siblings, 0 replies; 54+ messages in thread
From: vvv at ru dot ru @ 2009-05-20 21:38 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #49 from vvv at ru dot ru  2009-05-20 21:38 -------
(In reply to comment #48)
How this patches work? Is it required some special options?

# /media/disk-1/B/bin/gcc --version
gcc (GCC) 4.5.0 20090520 (experimental)
# cat test.c
void f(int i)
{
        if (i == 1) F(1);
        if (i == 2) F(2);
        if (i == 3) F(3);
        if (i == 4) F(4);
        if (i == 5) F(5);
}
extern int F(int m);
void func(int x)
{
        int u = F(x);
        while (u)
                u = F(u)*3+1;
}
# /media/disk-1/B/bin/gcc -o t test.c -O2 -c -mtune=k8
# objdump -d t
0000000000000000 <f>:
   0:   83 ff 01                cmp    $0x1,%edi
   3:   74 1b                   je     20 <f+0x20>
   5:   83 ff 02                cmp    $0x2,%edi
   8:   74 16                   je     20 <f+0x20>
   a:   83 ff 03                cmp    $0x3,%edi
   d:   74 11                   je     20 <f+0x20>
   f:   83 ff 04                cmp    $0x4,%edi
  12:   74 0c                   je     20 <f+0x20>
  14:   83 ff 05                cmp    $0x5,%edi
  17:   74 07                   je     20 <f+0x20>
  19:   f3 c3                   repz retq 
  1b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  20:   31 c0                   xor    %eax,%eax
  22:   e9 00 00 00 00          jmpq   27 <f+0x27>
  27:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  2e:   00 00 

0000000000000030 <func>:
  30:   48 83 ec 08             sub    $0x8,%rsp
  34:   e8 00 00 00 00          callq  39 <func+0x9>
  39:   85 c0                   test   %eax,%eax
  3b:   89 c7                   mov    %eax,%edi
  3d:   74 0e                   je     4d <func+0x1d>
  3f:   90                      nop    
  40:   e8 00 00 00 00          callq  45 <func+0x15>
  45:   8d 7c 40 01             lea    0x1(%rax,%rax,2),%edi
  49:   85 ff                   test   %edi,%edi
  4b:   75 f3                   jne    40 <func+0x10>
  4d:   48 83 c4 08             add    $0x8,%rsp
  51:   c3                      retq   

I can't see any padding in function f :(

PS. In file config/i386/i386.c (ix86_avoid_jump_mispredicts)

  /* Look for all minimal intervals of instructions containing 4 jumps.
...

Not jumps, but _branches_ (CALL, JMP, conditional branches, or returns) 


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (48 preceding siblings ...)
  2009-05-20 21:38 ` vvv at ru dot ru
@ 2009-05-20 22:09 ` jakub at gcc dot gnu dot org
  2009-05-21 13:22 ` jakub at gcc dot gnu dot org
  2009-05-21 13:26 ` jakub at gcc dot gnu dot org
  51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-20 22:09 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #50 from jakub at gcc dot gnu dot org  2009-05-20 22:09 -------
nopl   0x0(%rax,%rax,1) and nopw   0x0(%rax,%rax,1) aren't padding (though, it
has been added in this case for label alignment or function entry alignment,
not to avoid 4+ jumps in one 16byte page)?
Anyway, you want to look at both -S output and objdump -d of -c output, there
you'll see needed .p2align added.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (49 preceding siblings ...)
  2009-05-20 22:09 ` jakub at gcc dot gnu dot org
@ 2009-05-21 13:22 ` jakub at gcc dot gnu dot org
  2009-05-21 13:26 ` jakub at gcc dot gnu dot org
  51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-21 13:22 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #51 from jakub at gcc dot gnu dot org  2009-05-21 13:21 -------
Subject: Bug 39942

Author: jakub
Date: Thu May 21 13:21:30 2009
New Revision: 147765

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147765
Log:
        PR target/39942
        * config/i386/x86-64.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Don't emit second
        .p2align 3 if MAX_SKIP is smaller than 7.
        * config/i386/linux.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Likewise.

Modified:
    branches/gcc-4_4-branch/gcc/ChangeLog
    branches/gcc-4_4-branch/gcc/config/i386/linux.h
    branches/gcc-4_4-branch/gcc/config/i386/x86-64.h


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
  2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
                   ` (50 preceding siblings ...)
  2009-05-21 13:22 ` jakub at gcc dot gnu dot org
@ 2009-05-21 13:26 ` jakub at gcc dot gnu dot org
  51 siblings, 0 replies; 54+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-21 13:26 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #52 from jakub at gcc dot gnu dot org  2009-05-21 13:26 -------
Subject: Bug 39942

Author: jakub
Date: Thu May 21 13:26:13 2009
New Revision: 147766

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147766
Log:
        PR target/39942
        * config/i386/x86-64.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Don't emit second
        .p2align 3 if MAX_SKIP is smaller than 7.
        * config/i386/linux.h (ASM_OUTPUT_MAX_SKIP_ALIGN): Likewise.

Modified:
    branches/gcc-4_3-branch/gcc/ChangeLog
    branches/gcc-4_3-branch/gcc/config/i386/linux.h
    branches/gcc-4_3-branch/gcc/config/i386/x86-64.h


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942


^ permalink raw reply	[flat|nested] 54+ messages in thread

[parent not found: <bug-39942-4@http.gcc.gnu.org/bugzilla/>]

* [Bug target/39942] Nonoptimal code - leaveq; xchg   %ax,%ax; retq
       [not found] <bug-39942-4@http.gcc.gnu.org/bugzilla/>
@ 2020-04-14 21:20 ` peter at cordes dot ca
  0 siblings, 0 replies; 54+ messages in thread
From: peter at cordes dot ca @ 2020-04-14 21:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942

Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #53 from Peter Cordes <peter at cordes dot ca> ---
I think we can close this as fixed at some point.  The last activity on this
bug was some patches that sound like they were supposed to fix, and the MCVEs
from comments I tested no longer has a problem.

GCC9.3 -O3 -march=core2 -fomit-frame-pointer only uses a `.p2align` to align
the top of the loop, not between leave and ret or between cmp/jcc.

void wait_for_enter()
{
    volatile int foo = 0;  // to get a LEAVE instruction emitted at all
        int u = getchar();
        while (!u)
                u = getchar()-13;
}

https://godbolt.org/z/RvxzZv

(Note that Godbolt normally filters .p2align so you have to either compile to
binary or not filter directives in the asm source.  Otherwise you'll never see
NOPs except in the unusual case where GCC actually emits a nop mnemonic.)

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2020-04-14 21:20 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-28 12:20 [Bug c/39942] New: Nonoptimal code - leaveq; xchg %ax,%ax; retq vvv at ru dot ru
2009-04-28 13:42 ` [Bug target/39942] " pinskia at gcc dot gnu dot org
2009-04-28 17:05 ` vvv at ru dot ru
2009-04-28 17:10 ` vvv at ru dot ru
2009-04-28 17:15 ` vvv at ru dot ru
2009-04-28 17:37 ` ubizjak at gmail dot com
2009-04-28 21:19 ` vvv at ru dot ru
2009-04-28 21:23 ` pinskia at gcc dot gnu dot org
2009-04-28 21:47 ` ubizjak at gmail dot com
2009-04-28 21:53 ` pinskia at gcc dot gnu dot org
2009-04-28 21:54 ` ubizjak at gmail dot com
2009-04-29  7:46 ` vvv at ru dot ru
2009-04-29  7:55 ` vvv at ru dot ru
2009-04-29  9:32 ` jakub at gcc dot gnu dot org
2009-04-29 10:13 ` jakub at gcc dot gnu dot org
2009-04-29 19:17 ` vvv at ru dot ru
2009-04-30  9:07 ` jakub at gcc dot gnu dot org
2009-05-12 16:41 ` vvv at ru dot ru
2009-05-13  8:31 ` jakub at gcc dot gnu dot org
2009-05-13 11:43 ` vvv at ru dot ru
2009-05-13 13:32 ` rguenth at gcc dot gnu dot org
2009-05-13 17:13 ` vvv at ru dot ru
2009-05-13 18:22 ` ubizjak at gmail dot com
2009-05-13 18:34 ` rguenth at gcc dot gnu dot org
2009-05-13 18:45 ` hjl dot tools at gmail dot com
2009-05-13 18:57 ` vvv at ru dot ru
2009-05-13 19:06 ` vvv at ru dot ru
2009-05-13 19:09 ` jakub at gcc dot gnu dot org
2009-05-13 19:19 ` vvv at ru dot ru
2009-05-13 21:44 ` hjl dot tools at gmail dot com
2009-05-14  9:01 ` vvv at ru dot ru
2009-05-14 15:16 ` jakub at gcc dot gnu dot org
2009-05-14 15:58 ` hjl dot tools at gmail dot com
2009-05-14 18:37 ` hjl dot tools at gmail dot com
2009-05-14 19:44 ` vvv at ru dot ru
2009-05-15  2:23 ` hjl dot tools at gmail dot com
2009-05-15  4:32 ` hjl dot tools at gmail dot com
2009-05-15  7:56 ` jakub at gcc dot gnu dot org
2009-05-15 12:11 ` jakub at gcc dot gnu dot org
2009-05-15 12:12 ` jakub at gcc dot gnu dot org
2009-05-15 14:35 ` hjl dot tools at gmail dot com
2009-05-15 16:25 ` jakub at gcc dot gnu dot org
2009-05-15 18:18 ` jakub at gcc dot gnu dot org
2009-05-15 18:23 ` jakub at gcc dot gnu dot org
2009-05-15 23:06 ` hjl dot tools at gmail dot com
2009-05-16  6:38 ` jakub at gcc dot gnu dot org
2009-05-16  7:10 ` jakub at gcc dot gnu dot org
2009-05-16  7:12 ` jakub at gcc dot gnu dot org
2009-05-18 17:21 ` hjl at gcc dot gnu dot org
2009-05-20 21:38 ` vvv at ru dot ru
2009-05-20 22:09 ` jakub at gcc dot gnu dot org
2009-05-21 13:22 ` jakub at gcc dot gnu dot org
2009-05-21 13:26 ` jakub at gcc dot gnu dot org
     [not found] <bug-39942-4@http.gcc.gnu.org/bugzilla/>
2020-04-14 21:20 ` peter at cordes dot ca

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).