public inbox for gcc-patches@gcc.gnu.org
* [PATCH 0/9] RFC: Add optimization -foutline-msabi-xlogues (for Wine 64)
@ 2016-11-15 20:00 Daniel Santos
From: Daniel Santos @ 2016-11-15 20:00 UTC
  To: gcc-patches

Due to differences between the 64-bit Microsoft and System V ABIs, any 
msabi function that calls a sysv function must consider RSI, RDI and 
XMM6-15 as clobbered. The result is that such functions are bloated with 
SSE saves/restores costing as much as 106 bytes each (up to 200-ish 
bytes per function). This patch set targets 64-bit Wine and aims to 
mitigate some of those costs.
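
To make the situation concrete, here is a minimal sketch of the kind of call this patch set targets. The function names are hypothetical; `ms_abi` and `sysv_abi` are the existing GCC function attributes on x86-64, and the macros fall back to nothing on other targets so the snippet still builds there. In the Microsoft ABI, RSI, RDI and XMM6-15 are callee-saved, while the System V callee may clobber them freely, so the msabi caller has to preserve them itself.

```c
/* Sketch only: names are illustrative and not taken from Wine or the patches. */
#if defined(__x86_64__) && defined(__GNUC__)
# define MS_ABI   __attribute__((ms_abi))
# define SYSV_ABI __attribute__((sysv_abi))
#else
# define MS_ABI   /* attributes only exist on x86-64 GCC-compatible compilers */
# define SYSV_ABI
#endif

#if defined(__GNUC__)
# define NOINLINE __attribute__((noinline))
#else
# define NOINLINE
#endif

/* System V callee: allowed to clobber RSI, RDI and XMM6-XMM15. */
SYSV_ABI NOINLINE double sysv_scale(double x, double k)
{
    return x * k;
}

/* Microsoft ABI caller: its own caller expects RSI, RDI and XMM6-XMM15
 * to be preserved, so the compiler must save/restore them around the
 * call to the System V function. */
MS_ABI NOINLINE double ms_entry(double x)
{
    return sysv_scale(x, 2.0) + 1.0;
}
```

Compiling something like this with `-O2 -S` on x86-64 and inspecting the prologue and epilogue of `ms_entry` should show the register saves in question, although the exact output depends on the compiler version and on how much it can prove about the callee.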

A few save & restore stubs are added to the static portion of libgcc and
the pro/epilogues of such functions use these stubs instead, thus
reducing .text size. While we're already tinkering with stubs, it also
manages the save/restore of up to 6 additional registers. Analysis of
building Wine 64 demonstrates a reduction of .text by around 20%. While
I haven't produced performance data yet, this is my first attempt to
modify gcc, so I would rather ask for comments earlier in this process.

The basic theory is that a reduction of I-cache misses will offset the 
extra instructions required for implementation. In addition, since there 
are only a handful of stubs that will be in memory, I'm using the larger 
mov instructions instead of push/pop to facilitate better parallelization.

Here is a sample of what these prologues/epilogues look like:

Prologue (in this case, SP adjustment was properly combined with later 
stack allocation):
     7b833800:   48 8d 44 24 88          lea    -0x78(%rsp),%rax
     7b833805:   48 81 ec 58 01 00 00    sub    $0x158,%rsp
     7b83380c:   e8 95 6f 05 00          callq  7b88a7a6 <__savms64_17>

Epilogue (r10 stores the value to restore the stack pointer to):
     7b83386c:   48 8d b4 24 e0 00 00    lea    0xe0(%rsp),%rsi
     7b833873:   00
     7b833874:   4c 8d 56 78             lea    0x78(%rsi),%r10
     7b833878:   e9 c9 6f 05 00          jmpq   7b88a846 <__resms64x_17>

Prologue, stack realignment case (this shows the uncombined SP 
modifications, described below):
     7b833800:   55                      push   %rbp
     7b833801:   48 8d 44 24 90          lea    -0x70(%rsp),%rax
     7b833806:   48 89 e5                mov    %rsp,%rbp
     7b833809:   48 83 e0 f0             and    $0xfffffffffffffff0,%rax
     7b83380d:   48 8d 60 90             lea    -0x70(%rax),%rsp
     7b833811:   e8 cc 79 05 00          callq  7b88b1e2 <__savms64r_17>
     7b833816:   48 89 cb                mov    %rcx,%rbx   # reordered insn from body
     7b833819:   48 83 ec 70             sub    $0x70,%rsp

Epilogue, stack realignment case:
     7b833875:   48 8d b4 24 e0 00 00    lea    0xe0(%rsp),%rsi
     7b83387c:   00
     7b83387d:   e9 ac 79 05 00          jmpq   7b88b22e <__resms64rx_17>
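
For a rough sense of scale, the encoded sizes of the four sequences above can be tallied directly from the listings. The sketch below is only bookkeeping: the instruction lengths are read off the byte columns, the 106-byte figure is the worst-case inline cost quoted earlier in this message, and the helper names are made up for illustration. Note that the non-realigning prologue total includes the `sub` that allocates the frame, which would be needed anyway, and the realigning prologue is counted only up to and including the `callq`.

```c
#include <stddef.h>

/* Instruction lengths in bytes, read off the listings in this message. */
const unsigned char prologue[]         = {5, 7, 5};          /* lea, sub, callq       */
const unsigned char epilogue[]         = {8, 4, 5};          /* lea, lea, jmpq        */
const unsigned char prologue_realign[] = {1, 5, 3, 4, 4, 5}; /* push ... callq        */
const unsigned char epilogue_realign[] = {8, 5};             /* lea, jmpq             */

/* Worst-case inline save or restore size quoted in the text above. */
enum { INLINE_WORST_CASE_BYTES = 106 };

/* Sum the encoded lengths of one instruction sequence. */
int total_bytes(const unsigned char *lens, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += lens[i];
    return total;
}

/* Upper bound on bytes saved per sequence, relative to the quoted worst case. */
int bytes_saved_vs_inline(int stub_sequence_bytes)
{
    return INLINE_WORST_CASE_BYTES - stub_sequence_bytes;
}
```

With these numbers both non-realigning sequences come to 17 bytes, the realigning prologue to 22 and the realigning epilogue to 13. Actual savings per function depend on how many registers really need saving, so the 106-byte comparison is an upper bound rather than a typical case.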


Questions and (known) outstanding issues:

 1. I have added the new -f optimization to common.opt, but given that
    it only impacts x86_64, should this be a machine-specific -m option
    instead?
 2. In the prologues that realign the stack, stack pointer modifications
    aren't combining, presumably since I'm using a lea after realigning
    using rax.
 3. My x86 assembly expertise is limited, so I would appreciate any
    feedback on my stubs & emitted code.
 4. Documentation is still missing.
 5. A ChangeLog entry is still missing.
 6. This is my first major work on a GNU project and I have not yet
    fully reviewed all of the relevant GNU coding conventions, so I
    might still have some non-compliant code.
 7. Regression tests have only been run on my old Phenom. I have not yet
    tested on an AVX CPU (which should use vmovaps instead of movaps).
 8. My test program is inadequate (and is not included in this patch
    set).  During development it failed to expose many of the
    optimization errors that I hit when building Wine.  I've been
    building 64-bit Wine and running Wine's tests in the meantime.
 9. I need to devise a meaningful benchmarking strategy.
10. I have not yet examined how this may or may not affect -flto or
    where additional optimization opportunities in the lto driver may exist.
11. There are a few more optimization opportunities that I haven't
    attempted to exploit yet and prefer to leave for later projects.
      * In the case of stack realignment and all 17 registers being
        clobbered, I can combine the majority of the prologue
        (alignment, saving frame pointer, etc.) in the stub.
      * With these stubs being in the static portion of libgcc, each
        Wine "dll" gets a separate copy. The average number of dlls a
        Windows program loads seems to be at least 15, so a mechanism
        that lets them be linked dynamically from libwine.so could save
        a little bit more .text and icache.
      * Ultimately, good static analysis of local sysv functions can
        completely eliminate the need to save SSE registers in some cases.
12. Use of hard frame pointers disables the optimization unless we're
    also realigning the stack. I've implemented this in another (local)
    branch, but haven't tested it yet.
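
Regarding item 2, the stack pointer arithmetic of the realigning prologue shown earlier can be modelled in a few lines of C. This is a sketch of the arithmetic only: the function names are hypothetical, the stub call is treated as leaving RSP unchanged on return, and the "combined" variant is an assumption about what a merged adjustment could look like, not something the patches emit. Whether such a merge is valid with the stub call sitting between the two adjustments is exactly the open question.

```c
#include <stdint.h>

/* RSP after the realigning prologue, following the emitted instructions. */
uint64_t realigned_sp_as_emitted(uint64_t rsp_at_entry)
{
    uint64_t rsp = rsp_at_entry - 8;     /* push %rbp                         */
    uint64_t rax = rsp - 0x70;           /* lea  -0x70(%rsp),%rax             */
                                         /* mov  %rsp,%rbp (RSP unchanged)    */
    rax &= ~(uint64_t)0xf;               /* and  $0xfffffffffffffff0,%rax     */
    rsp = rax - 0x70;                    /* lea  -0x70(%rax),%rsp             */
                                         /* callq stub: net zero on RSP       */
    rsp -= 0x70;                         /* sub  $0x70,%rsp                   */
    return rsp;
}

/* Hypothetical merged form: a single lea -0xe0(%rax),%rsp after the and. */
uint64_t realigned_sp_combined(uint64_t rsp_at_entry)
{
    uint64_t rax = (rsp_at_entry - 8 - 0x70) & ~(uint64_t)0xf;
    return rax - 0xe0;
}
```

Both functions return the same 16-byte-aligned value for any entry RSP, which is all this model shows; it says nothing about whether the stub's own stack use permits moving the second adjustment ahead of the call.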


  gcc/common.opt                 |   7 +
  gcc/config/i386/i386.c         | 729 ++++++++++++++++++++++++++++++++++++++---
  gcc/config/i386/i386.h         |  22 +-
  gcc/config/i386/predicates.md  | 148 +++++++++
  gcc/config/i386/sse.md         |  56 ++++
  libgcc/config.host             |   2 +-
  libgcc/config/i386/i386-asm.h  |  82 +++++
  libgcc/config/i386/resms64.S   |  63 ++++
  libgcc/config/i386/resms64f.S  |  59 ++++
  libgcc/config/i386/resms64fx.S |  61 ++++
  libgcc/config/i386/resms64x.S  |  65 ++++
  libgcc/config/i386/savms64.S   |  63 ++++
  libgcc/config/i386/savms64f.S  |  64 ++++
  libgcc/config/i386/t-msabi     |   7 +
  14 files changed, 1379 insertions(+), 49 deletions(-)

Feedback and comments would be most appreciated!

Thanks,
Daniel


Thread overview: 11+ messages
2016-11-15 20:00 [PATCH 0/9] RFC: Add optimization -foutline-msabi-xlogues (for Wine 64) Daniel Santos
2016-11-15 20:03 ` [PATCH 2/9] Minor refactor in ix86_compute_frame_layout Daniel Santos
2016-11-15 20:03 ` [PATCH 7/9] Modify ix86_save_reg to optionally omit stub-managed registers Daniel Santos
2016-11-15 20:03 ` [PATCH 4/9] Add struct fields and option for foutline-msabi-xlogues Daniel Santos
2016-11-15 20:03 ` [PATCH 1/9] Change type of x86_64_ms_sysv_extra_clobbered_registers Daniel Santos
2016-11-15 20:03 ` [PATCH 3/9] Add msabi pro/epilogue stubs to libgcc Daniel Santos
2016-11-15 20:03 ` [PATCH 5/9] Add patterns and predicates foutline-msabi-xlogues Daniel Santos
2016-11-15 21:06   ` Daniel Santos
2016-11-15 20:04 ` [PATCH 6/9] Adds class xlouge_layout to i386.c Daniel Santos
2016-11-15 20:04 ` [PATCH 9/9] Add remainder of foutline-msabi-xlogues implementation Daniel Santos
2016-11-15 20:04 ` [PATCH 8/9] Modify ix86_compute_frame_layout for foutline-msabi-xlogues Daniel Santos
