public inbox for gcc-patches@gcc.gnu.org
* [PATCH 0/9] RFC: Add optimization -foutline-msabi-xlougues (for Wine 64)
@ 2016-11-15 20:00 Daniel Santos
  2016-11-15 20:03 ` [PATCH 1/9] Change type of x86_64_ms_sysv_extra_clobbered_registers Daniel Santos
                   ` (8 more replies)
  0 siblings, 9 replies; 12+ messages in thread
From: Daniel Santos @ 2016-11-15 20:00 UTC (permalink / raw)
  To: gcc-patches

Due to differences between the 64-bit Microsoft and System V ABIs, any 
msabi function that calls a sysv function must consider RSI, RDI and 
XMM6-15 as clobbered. The result is that such functions are bloated with 
SSE saves/restores costing as much as 106 bytes each (up to 200-ish 
bytes per function). This patch set targets 64-bit Wine and aims to 
mitigate some of those costs.
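
For readers who want to reproduce the situation described above, here is a minimal sketch. It is not taken from the patch set: the names sysv_scale and ms_entry are hypothetical, and it assumes GCC (or a compatible compiler) targeting x86_64, where the ms_abi and sysv_abi function attributes exist. On other targets the attributes are compiled out and the code is just two ordinary functions. Compiling with "gcc -O2 -S" and reading the assembly for ms_entry should show the extra saves and restores around the call.

```c
/* Assumption: GCC-style attributes on x86_64; elsewhere they expand to nothing. */
#if defined(__GNUC__) && defined(__x86_64__)
#define MS_ABI   __attribute__((ms_abi))
#define SYSV_ABI __attribute__((sysv_abi))
#else
#define MS_ABI
#define SYSV_ABI
#endif

#if defined(__GNUC__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

/* Hypothetical System V callee. It is kept out of line so that the
   call below really crosses the ABI boundary. */
SYSV_ABI NOINLINE double sysv_scale(double x, int n)
{
    return x * (double)n;
}

/* Hypothetical Microsoft-ABI function. Under the MS ABI, RSI, RDI and
   XMM6-15 are callee-saved, but the sysv callee is free to clobber them,
   so this function has to preserve them itself. */
MS_ABI NOINLINE double ms_entry(double x, int n)
{
    return sysv_scale(x, n) + 1.0;
}
```

Calling ms_entry(2.0, 3) from ordinary code works as usual; only the generated prologue and epilogue of ms_entry differ.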

A few save & restore stubs are added to the static portion of libgcc and
the pro/epilogues of such functions use these stubs instead, thus
reducing .text size. While we're already tinkering with stubs, they also
manage the save/restore of up to 6 additional registers. Analysis of
building Wine 64 demonstrates a reduction of .text by around 20%. While
I haven't produced performance data yet, this is my first attempt to
modify gcc so I would rather ask for comments earlier in this process.

The basic theory is that a reduction of I-cache misses will offset the 
extra instructions required for implementation. In addition, since there 
are only a handful of stubs that will be in memory, I'm using the larger 
mov instructions instead of push/pop to facilitate better parallelization.

Here is a sample of what these prologues/epilogues look like:

Prologue (in this case, SP adjustment was properly combined with later 
stack allocation):
     7b833800:   48 8d 44 24 88          lea -0x78(%rsp),%rax
     7b833805:   48 81 ec 58 01 00 00    sub    $0x158,%rsp
     7b83380c:   e8 95 6f 05 00          callq  7b88a7a6 <__savms64_17>

Epilogue (r10 stores the value to restore the stack pointer to):
     7b83386c:   48 8d b4 24 e0 00 00    lea 0xe0(%rsp),%rsi
     7b833873:   00
     7b833874:   4c 8d 56 78             lea 0x78(%rsi),%r10
     7b833878:   e9 c9 6f 05 00          jmpq   7b88a846 <__resms64x_17>
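
The offsets in the two snippets above can be cross-checked with a little arithmetic. The following C model is only a sketch built from the constants in this particular dump. The struct and function names are hypothetical, and the model ignores the return address that callq pushes while the stub runs.

```c
#include <stdint.h>

/* Register values of interest after the prologue (hypothetical model). */
typedef struct {
    uint64_t rax;  /* base address handed to the save stub */
    uint64_t rsp;  /* stack pointer used by the function body */
} prologue_state;

/* Register values set up by the epilogue before jumping to the stub. */
typedef struct {
    uint64_t rsi;  /* base address handed to the restore stub */
    uint64_t r10;  /* value the stack pointer is restored to */
} epilogue_state;

/* Models: lea -0x78(%rsp),%rax ; sub $0x158,%rsp */
prologue_state model_prologue(uint64_t entry_rsp)
{
    prologue_state s;
    s.rax = entry_rsp - 0x78;
    s.rsp = entry_rsp - 0x158;
    return s;
}

/* Models: lea 0xe0(%rsp),%rsi ; lea 0x78(%rsi),%r10 */
epilogue_state model_epilogue(uint64_t body_rsp)
{
    epilogue_state s;
    s.rsi = body_rsp + 0xe0;
    s.r10 = s.rsi + 0x78;
    return s;
}
```

Since 0xe0 + 0x78 = 0x158, the model gives the same base address for the restore stub (%rsi) as for the save stub (%rax), and %r10 ends up equal to the stack pointer at function entry.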

Prologue, stack realignment case (this shows the uncombined SP 
modifications, described below):
     7b833800:   55                      push   %rbp
     7b833801:   48 8d 44 24 90          lea -0x70(%rsp),%rax
     7b833806:   48 89 e5                mov    %rsp,%rbp
     7b833809:   48 83 e0 f0             and $0xfffffffffffffff0,%rax
     7b83380d:   48 8d 60 90             lea -0x70(%rax),%rsp
     7b833811:   e8 cc 79 05 00          callq  7b88b1e2 <__savms64r_17>
     7b833816:   48 89 cb                mov    %rcx,%rbx    # reordered insn from body
     7b833819:   48 83 ec 70             sub    $0x70,%rsp

Epilogue, stack realignment case:
     7b833875:   48 8d b4 24 e0 00 00    lea 0xe0(%rsp),%rsi
     7b83387c:   00
     7b83387d:   e9 ac 79 05 00          jmpq   7b88b22e <__resms64rx_17>


Questions and (known) outstanding issues:

 1. I have added the new -f optimization to common.opt, but being that
    it only impacts x86_64, should this be a machine-specific -m option
    instead?
 2. In the prologues that realign the stack, stack pointer modifications
    aren't combining, presumably since I'm using a lea after realigning
    using rax.
 3. My x86 assembly expertise is limited, so I would appreciate any
    feedback on my stubs & emitted code.
 4. Documentation is still missing.
 5. A Changelog entry is still missing.
 6. This is my first major work on a GNU project and I have not yet
    fully reviewed all of the relevant GNU coding conventions, so I
    might still have some non-compliant code.
 7. Regression tests have only been run on my old Phenom. I have not yet
    tested on an AVX CPU (which should use vmovaps instead of movaps).
 8. My test program is inadequate (and is not included in this patch
    set).  During development it failed to produce many optimization
    errors that I got when building Wine.  I've been building 64-bit
    Wine and running Wine's tests in the meantime.
 9. I need to devise a meaningful benchmarking strategy.
10. I have not yet examined how this may or may not affect -flto or
    where additional optimization opportunities in the lto driver may exist.
11. There are a few more optimization opportunities that I haven't
    attempted to exploit yet and prefer to leave for later projects.
      * In the case of stack realignment and all 17 registers being
        clobbered, I can combine the majority of the prologue
        (alignment, saving frame pointer, etc.) in the stub.
      * With these stubs being in the static portion of libgcc, each
        Wine "dll" gets a separate copy. The average number of dlls a
        Windows program loads seems to be at least 15, so allowing
        them to be linked dynamically from libwine.so could save a
        little bit more .text and icache.
      * Ultimately, good static analysis of local sysv functions can
        completely eliminate the need to save SSE registers in some cases.
12. Use of hard frame pointers disables the optimization unless we're
    also realigning the stack. I've implemented this in another (local)
    branch, but haven't tested it yet.


  gcc/common.opt                 |   7 +
  gcc/config/i386/i386.c         | 729 ++++++++++++++++++++++++++++++++++++++---
  gcc/config/i386/i386.h         |  22 +-
  gcc/config/i386/predicates.md  | 148 +++++++++
  gcc/config/i386/sse.md         |  56 ++++
  libgcc/config.host             |   2 +-
  libgcc/config/i386/i386-asm.h  |  82 +++++
  libgcc/config/i386/resms64.S   |  63 ++++
  libgcc/config/i386/resms64f.S  |  59 ++++
  libgcc/config/i386/resms64fx.S |  61 ++++
  libgcc/config/i386/resms64x.S  |  65 ++++
  libgcc/config/i386/savms64.S   |  63 ++++
  libgcc/config/i386/savms64f.S  |  64 ++++
  libgcc/config/i386/t-msabi     |   7 +
  14 files changed, 1379 insertions(+), 49 deletions(-)

Feedback and comments would be most appreciated!

Thanks,
Daniel

* [PATCH v2 0/9] Add optimization -moutline-msabi-xlougues (for Wine 64)
@ 2016-11-23  5:11 Daniel Santos
  2016-11-23  5:16 ` [PATCH 1/9] Change type of x86_64_ms_sysv_extra_clobbered_registers Daniel Santos
  0 siblings, 1 reply; 12+ messages in thread
From: Daniel Santos @ 2016-11-23  5:11 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jan Hubicka, Uros Bizjak, Ian Lance Taylor

[-- Attachment #1: Type: text/plain, Size: 4950 bytes --]

Due to ABI differences, when a 64-bit Microsoft function calls a
System V function, it must consider RSI, RDI and XMM6-15 as clobbered.
Saving these registers can cost as much as 109 bytes and a similar
amount for restoring. This patch set targets 64-bit Wine and aims to
mitigate some of these costs by adding ms-->sysv save & restore stubs to
libgcc, which are called from pro/epilogues rather than emitting the
code inline.  And since we're already tinkering with stubs, they
also manage the save/restore of up to 6 additional registers. Analysis
of building Wine 64 demonstrates a reduction of .text by around 20%.

The basic theory is that a reduction of I-cache misses will offset the 
extra instructions required for implementation. And since there are only 
a handful of stubs that will be in memory, I'm using the larger mov 
instructions instead of push/pop to facilitate better parallelization. I 
have not yet produced actual performance data.

Here is a sample of some generated code:

Prologue:
    23c20:       48 8d 44 24 88          lea -0x78(%rsp),%rax
    23c25:       48 81 ec 08 01 00 00    sub    $0x108,%rsp
    23c2c:       e8 1a 4b 03 00          callq  5874b <__savms64_15>

Epilogue (r10 stores the value to restore the stack pointer to):
    23c7c:       48 8d b4 24 90 00 00    lea 0x90(%rsp),%rsi
    23c83:       00
    23c84:       4c 8d 56 78             lea 0x78(%rsi),%r10
    23c88:       e9 5e 4b 03 00          jmpq   587eb <__resms64x_15>
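
As a rough size comparison, the call-site cost of the outlined version can be added up from the opcode bytes in the dump above and set against the worst-case inline figure quoted earlier. This is only a back-of-the-envelope sketch in C. It assumes the restore costs the same 109 bytes as the save ("a similar amount"), and it leaves out the sub instruction on the assumption that the stack allocation would be needed anyway. The names are hypothetical.

```c
#include <stddef.h>

/* Worst-case inline cost quoted in the text. Treating the restore as
   equal to the save is an assumption made for this estimate. */
enum { INLINE_SAVE_MAX = 109, INLINE_RESTORE_MAX = 109 };

/* Instruction lengths read off the dump: lea, callq. */
static const size_t prologue_insn_bytes[] = { 5, 5 };
/* Instruction lengths read off the dump: lea, lea, jmpq. */
static const size_t epilogue_insn_bytes[] = { 8, 4, 5 };

static size_t sum_bytes(const size_t *v, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Bytes emitted per function when the stubs are used. */
size_t outlined_call_site_bytes(void)
{
    return sum_bytes(prologue_insn_bytes,
                     sizeof prologue_insn_bytes / sizeof prologue_insn_bytes[0])
         + sum_bytes(epilogue_insn_bytes,
                     sizeof epilogue_insn_bytes / sizeof epilogue_insn_bytes[0]);
}

/* Upper bound on bytes emitted per function when saving inline. */
size_t worst_case_inline_bytes(void)
{
    return (size_t)INLINE_SAVE_MAX + (size_t)INLINE_RESTORE_MAX;
}
```

Under these assumptions a worst-case function goes from about 218 bytes of save/restore code to 27 bytes at the call site. Functions that clobber fewer registers would save less.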

It would appear that forced stack realignment has become the new normal 
for Wine 64, since there are many Windows programs that violate the 
16-byte alignment requirement, but just so *happen* to not crash on 
Windows (and therefore claim that Wine should work as Windows happens to 
behave given the UB).

Prologue, stack realignment case:
    23c20:       55                      push   %rbp
    23c21:       48 89 e5                mov    %rsp,%rbp
    23c24:       48 83 e4 f0             and $0xfffffffffffffff0,%rsp
    23c28:       48 8d 44 24 90          lea -0x70(%rsp),%rax
    23c2d:       48 81 ec 00 01 00 00    sub    $0x100,%rsp
    23c34:       e8 8e 43 03 00          callq  57fc7 <__savms64f_15>

Epilogue, stack realignment case:
    23c86:       48 8d b4 24 90 00 00    lea 0x90(%rsp),%rsi
    23c8d:       00
    23c8e:       e9 80 43 03 00          jmpq   58013 <__resms64fx_15>
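
The realignment sequence above can be modelled the same way. This is a sketch based only on the constants in this dump. The names are hypothetical, and the model does not cover what the stubs themselves do.

```c
#include <stdint.h>

/* Register values of interest after the realigning prologue
   (hypothetical model). */
typedef struct {
    uint64_t rbp;  /* frame pointer, saved before realignment */
    uint64_t rax;  /* base address handed to the save stub */
    uint64_t rsp;  /* stack pointer used by the function body */
} realign_state;

/* Models: push %rbp ; mov %rsp,%rbp ; and $-16,%rsp ;
           lea -0x70(%rsp),%rax ; sub $0x100,%rsp */
realign_state model_realign_prologue(uint64_t entry_rsp)
{
    realign_state s;
    uint64_t rsp = entry_rsp - 8;       /* push %rbp */
    s.rbp = rsp;                        /* mov %rsp,%rbp */
    rsp &= ~(uint64_t)0xf;              /* round down to a 16-byte boundary */
    s.rax = rsp - 0x70;                 /* lea -0x70(%rsp),%rax */
    s.rsp = rsp - 0x100;                /* sub $0x100,%rsp */
    return s;
}

/* Models: lea 0x90(%rsp),%rsi */
uint64_t model_realign_epilogue_rsi(uint64_t body_rsp)
{
    return body_rsp + 0x90;
}
```

Even when the caller passes in a misaligned stack pointer, the body runs with a 16-byte aligned %rsp, and the epilogue's %rsi equals the prologue's %rax because 0x100 - 0x90 = 0x70. No %r10 is set up in this case, which is consistent with the original stack pointer being recoverable from %rbp.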

No additional regression tests fail with this patch set. I have tested
about 12 builds of Wine (with varying optimizations & options) and no
additional tests fail for that either. (Actually, there appears to be
some type of regression prior to this patch set because it magically
fixes about 30 failed Wine tests that don't fail when building
Wine with gcc-5.4.0.)

Outstanding issues:

 1. My x86 assembly expertise is limited, so I would appreciate
    examination of my stubs & emitted code!
 2. Regression tests have only been run on my old Phenom. I have not yet
    tested on an AVX CPU (which should use vmovaps instead of movaps).
 3. My test program is inadequate (and is not included in this patch
    set) and needs a lot of cleanup.  During development it failed to
    produce many optimization errors that I got when building Wine.
    I've been building 64-bit Wine and running Wine's tests in the
    meantime.
 4. It would help to write a benchmarking program/script.
 5. I haven't yet figured out how to get Wine building with -flto and I
    thus haven't tested how these changes affect it yet.
 6. I'm not 100% certain yet, but the stubs __resms64f* (restore with
    hard frame pointer, but return to the function) don't appear to
    ever be used because enabling hard frame pointers disables sibling
    calls, which is what they're intended to facilitate.


  gcc/config/i386/i386.c         | 704 ++++++++++++++++++++++++++++++++++++++---
  gcc/config/i386/i386.h         |  22 +-
  gcc/config/i386/i386.opt       |   5 +
  gcc/config/i386/predicates.md  | 155 +++++++++
  gcc/config/i386/sse.md         |  46 +++
  gcc/doc/invoke.texi            |  11 +-
  libgcc/config.host             |   2 +-
  libgcc/config/i386/i386-asm.h  |  82 +++++
  libgcc/config/i386/resms64.S   |  63 ++++
  libgcc/config/i386/resms64f.S  |  59 ++++
  libgcc/config/i386/resms64fx.S |  61 ++++
  libgcc/config/i386/resms64x.S  |  65 ++++
  libgcc/config/i386/savms64.S   |  63 ++++
  libgcc/config/i386/savms64f.S  |  64 ++++
  libgcc/config/i386/t-msabi     |   7 +
  15 files changed, 1358 insertions(+), 51 deletions(-)


Changes in Version 2:

  * Added ChangeLogs (attached).
  * Changed option from -f to -m and moved from gcc/common.opt to
    gcc/config/i386/i386.opt.
  * Solved problem with uncombined SP modifications.
  * Optimization now works when hard frame pointers are used and stack
    realignment is not needed.
  * Added documentation to gcc/doc/invoke.texi

Feedback and comments would be most appreciated!

Thanks,
Daniel






[-- Attachment #2: ChangeLog-moutline-msabi-xlogues.gcc --]
[-- Type: text/plain, Size: 2066 bytes --]

	* config/i386/i386.opt: Add option -moutline-msabi-xlogues.

	* config/i386/i386.h
	(x86_64_ms_sysv_extra_clobbered_registers): Change type to unsigned.
	(NUM_X86_64_MS_CLOBBERED_REGS): New macro.
	(struct machine_function): Add new members outline_ms_sysv,
	outline_ms_sysv_pad_in, outline_ms_sysv_pad_out and
	outline_ms_sysv_extra_regs.

	* config/i386/i386.c
	(enum xlogue_stub): New enum.
	(enum xlogue_stub_sets): New enum.
	(class xlogue_layout): New class.
	(struct ix86_frame): Add outlined_save_offset member, modify comments
	to detail stack layout when using out-of-line stubs.
	(ix86_target_string): Add -moutline-msabi-xlogues option.

	(stub_managed_regs): New static variable.
	(ix86_save_reg): Add new parameter ignore_outlined to optionally omit
	registers managed by out-of-line stub.
	(ix86_nsaved_regs): Modify to accommodate changes to ix86_save_reg.
	(ix86_nsaved_sseregs): Likewise.
	(ix86_emit_save_regs): Likewise.
	(ix86_emit_save_regs_using_mov): Likewise.
	(ix86_emit_save_sse_regs_using_mov): Likewise.
	(get_scratch_register_on_entry): Likewise.
	(ix86_compute_frame_layout): Modify to disable m->outline_ms_sysv when
	appropriate and compute frame layout for out-of-line stubs.
	(gen_frame_set): New function.
	(gen_frame_load): Likewise.
	(gen_frame_store): Likewise.
	(emit_msabi_outlined_save): Likewise.
	(ix86_expand_prologue): Modify to call emit_msabi_outlined_save when
	appropriate.
	(ix86_emit_leave): Add parameter rtx_insn *insn, allowing it to be used
	to only generate the notes.
	(emit_msabi_outlined_restore): New function.
	(ix86_expand_epilogue): Modify to call emit_msabi_outlined_restore when
	appropriate.
	(ix86_expand_call): Modify to enable m->outline_ms_sysv when
	appropriate.

	* config/i386/predicates.md
	(save_multiple): New predicate.
	(restore_multiple): Likewise.
	* config/i386/sse.md
	(save_multiple<mode>): New pattern.
	(save_multiple_realign<mode>): Likewise.
	(restore_multiple<mode>): Likewise.
	(restore_multiple_and_return<mode>): Likewise.
	(restore_multiple_leave_return<mode>): Likewise.

[-- Attachment #3: ChangeLog-moutline-msabi-xlogues.libgcc --]
[-- Type: text/plain, Size: 352 bytes --]

	* config.host: Add i386/t-msabi to i386/t-linux file list.
	* config/i386/i386-asm.h: New file.
	* config/i386/resms64.S: New file.
	* config/i386/resms64f.S: New file.
	* config/i386/resms64fx.S: New file.
	* config/i386/resms64x.S: New file.
	* config/i386/savms64.S: New file.
	* config/i386/savms64f.S: New file.
	* config/i386/t-msabi: New file.


end of thread, other threads:[~2016-11-23  5:16 UTC | newest]

Thread overview: 12+ messages
2016-11-15 20:00 [PATCH 0/9] RFC: Add optimization -foutline-msabi-xlougues (for Wine 64) Daniel Santos
2016-11-15 20:03 ` [PATCH 1/9] Change type of x86_64_ms_sysv_extra_clobbered_registers Daniel Santos
2016-11-15 20:03 ` [PATCH 3/9] Add msabi pro/epilogue stubs to libgcc Daniel Santos
2016-11-15 20:03 ` [PATCH 5/9] Add patterns and predicates foutline-msabi-xlouges Daniel Santos
2016-11-15 21:06   ` Daniel Santos
2016-11-15 20:03 ` [PATCH 4/9] Add struct fields and option for foutline-msabi-xlouges Daniel Santos
2016-11-15 20:03 ` [PATCH 7/9] Modify ix86_save_reg to optionally omit stub-managed registers Daniel Santos
2016-11-15 20:03 ` [PATCH 2/9] Minor refactor in ix86_compute_frame_layout Daniel Santos
2016-11-15 20:04 ` [PATCH 9/9] Add remainder of foutline-msabi-xlogues implementation Daniel Santos
2016-11-15 20:04 ` [PATCH 8/9] Modify ix86_compute_frame_layout for foutline-msabi-xlogues Daniel Santos
2016-11-15 20:04 ` [PATCH 6/9] Adds class xlouge_layout to i386.c Daniel Santos
2016-11-23  5:11 [PATCH v2 0/9] Add optimization -moutline-msabi-xlougues (for Wine 64) Daniel Santos
2016-11-23  5:16 ` [PATCH 1/9] Change type of x86_64_ms_sysv_extra_clobbered_registers Daniel Santos
