* Help with implementing Wine optimization experiment
From: Daniel Santos @ 2016-08-14 6:20 UTC
To: gcc
I'm experimenting with ways to optimize wine (x86 target only) and I
believe I can shrink wine's total text size by around 7% by outlining
the lengthy pro- and epilogues required for ms_abi functions making
sysv_abi calls. Theoretically, fewer instruction cache misses will
offset the extra 4 instructions per function and result in a net
performance gain. However, I'm new to the gcc project and a novice x86
assembly programmer as well (I have been wanting to work on gcc for a
while now!). In short, I want to:
1. Replace the prologue that pushes rdi, rsi and xmm6-15 with a single
call to a global "ms_abi_push_regs" routine
2. Replace the epilogue that pops these regs with a jmp to a global
"ms_abi_pop_regs" routine
3. Add the two routines somewhere so that they are linked into the output.
I have this working in a small-scale experiment (writing the ms_abi
function in assembly), but I'm not certain how I would add these
routines. Should I make them built-ins?
I have found the code that adds the clobber RTL instructions in
ix86_expand_call() (gcc/config/i386/i386.c:25832), and I see that
thread_prologue_and_epilogue_insns() (gcc/function.c) is where these
clobbers are expanded into the prologue and epilogue, but I'm not sure
what the cleanest way to convert this is. My thought was to replace the
clobber_reg() calls with one that would add an insn_call, or would it be
better to do this in thread_prologue_and_epilogue_insns() where prologue
and epilogue generation belongs? But that function is for all targets.
Any pointers greatly appreciated!
For reference, this is my 64-bit test case:
outline_test.h:
extern void my_sysv_func(void);
extern int __attribute__((ms_abi)) my_ms_abi_func(void);
outline_test_asm.s:
.global ms_abi_push_regs
.global ms_abi_pop_regs
.global my_ms_abi_func
ms_abi_push_regs:
pop %rax
push %rdi
push %rsi
sub $0xa8,%rsp
movaps %xmm6,(%rsp)
movaps %xmm7,0x10(%rsp)
movaps %xmm8,0x20(%rsp)
movaps %xmm9,0x30(%rsp)
movaps %xmm10,0x40(%rsp)
movaps %xmm11,0x50(%rsp)
movaps %xmm12,0x60(%rsp)
movaps %xmm13,0x70(%rsp)
movaps %xmm14,0x80(%rsp)
movaps %xmm15,0x90(%rsp)
jmp *%rax
ms_abi_pop_regs:
movaps (%rsp),%xmm6
movaps 0x10(%rsp),%xmm7
movaps 0x20(%rsp),%xmm8
movaps 0x30(%rsp),%xmm9
movaps 0x40(%rsp),%xmm10
movaps 0x50(%rsp),%xmm11
movaps 0x60(%rsp),%xmm12
movaps 0x70(%rsp),%xmm13
movaps 0x80(%rsp),%xmm14
movaps 0x90(%rsp),%xmm15
add $0xa8,%rsp
pop %rsi
pop %rdi
retq
my_ms_abi_func:
callq ms_abi_push_regs
callq my_sysv_func
xor %eax, %eax
jmp ms_abi_pop_regs
Thanks!
Daniel
* Re: Help with implementing Wine optimization experiment
From: Daniel Santos @ 2016-08-14 7:46 UTC
To: gcc
Just an update: I discovered how the pass functions call back into the
target-specific code. gen_prologue() is generated from
gcc/config/i386/i386.md, so thread_prologue_and_epilogue_insns() -->
gen_prologue() --> ix86_expand_prologue(), which is implemented in
i386.c. That problem is solved, but I'm still working on the rest.
Currently, I'm guessing that
I'll look through the clobber list to see if all of the regs in the
array x86_64_ms_sysv_extra_clobbered_registers are present and if so,
replace those with the ms_abi_push_regs, figure out how to let everybody
know the new state of the stack and then allow any other clobbered regs
to get pushed/moved after that.
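The clobber-list check described above can be sketched in plain C. This is only an illustration under assumptions: the enum, the bitmask representation and all function names are hypothetical stand-ins, not GCC's data structures. Only the register set itself (rsi, rdi, xmm6-15) is taken from the test case in this thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical register numbering, for this sketch only. */
enum reg {
  R_AX, R_BX, R_SI, R_DI,
  R_XMM6, R_XMM7, R_XMM8, R_XMM9, R_XMM10,
  R_XMM11, R_XMM12, R_XMM13, R_XMM14, R_XMM15
};

/* Registers an ms_abi function must preserve but a sysv_abi callee may
   clobber: the set saved by the out-of-line routine.  */
static const enum reg extra_clobbered[] = {
  R_SI, R_DI,
  R_XMM6, R_XMM7, R_XMM8, R_XMM9, R_XMM10,
  R_XMM11, R_XMM12, R_XMM13, R_XMM14, R_XMM15
};
#define N_EXTRA (sizeof extra_clobbered / sizeof extra_clobbered[0])

/* Bitmask containing every register of the set above.  */
static unsigned extra_clobbered_mask (void)
{
  unsigned mask = 0;
  for (size_t i = 0; i < N_EXTRA; i++)
    mask |= 1u << extra_clobbered[i];
  return mask;
}

/* True if the whole set is clobbered, so that one call to the save
   routine can replace the individual saves.  */
static bool can_use_save_stub (unsigned clobbers)
{
  unsigned need = extra_clobbered_mask ();
  return (clobbers & need) == need;
}

/* Registers that still have to be saved inline after the stub ran.
   If the stub cannot be used, nothing is removed from the list.  */
static unsigned remaining_clobbers (unsigned clobbers)
{
  return can_use_save_stub (clobbers)
         ? clobbers & ~extra_clobbered_mask ()
         : clobbers;
}
```

With the full set plus R_BX clobbered, can_use_save_stub() is true and remaining_clobbers() leaves only the R_BX bit; with any register of the set missing, the list is returned unchanged.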
Daniel
* Re: Help with implementing Wine optimization experiment
From: Trevor Saunders @ 2016-08-14 7:49 UTC
To: Daniel Santos; +Cc: gcc
On Sun, Aug 14, 2016 at 01:23:16AM -0500, Daniel Santos wrote:
> I'm experimenting with ways to optimize wine (x86 target only) and I believe
> I can shrink wine's total text size by around 7% by outlining the lengthy
> pro- and epilogues required for ms_abi functions making sysv_abi calls.
> Theoretically, fewer instruction cache misses will offset the extra 4
> instructions per function and result in a net performance gain. However, I'm
> new to the gcc project and a novice x86 assembly programmer as well (have
> been wanting to work on gcc for a while now!) In short, I want to:
>
> 1. Replace the prologue that pushes rdi, rsi and xmm6-15 with a single call to
> a global "ms_abi_push_regs" routine
> 2. Replace the epilogue that pops these regs with a jmp to a global
> "ms_abi_pop_regs" routine
> 3. Add the two routines somewhere so that they are linked into the output.
I think you want to put those into libgcc then.
>
> I have this working in a small-scale experiment (writing the ms_abi function
> in assembly), but I'm not certain how I would add these routines. Should I
> make them built-ins?
>
> I have found the code that adds the clobber RTL instructions in
> ix86_expand_call() (gcc/config/i386/i386.c:25832), and I see that
> thread_prologue_and_epilogue_insns() (gcc/function.c) is where these
> clobbers are expanded into the prologue and epilogue, but I'm not sure what
> the cleanest way to convert this is. My thought was to replace the
> clobber_reg() calls with one that would add an insn_call, or would it be
> better to do this in thread_prologue_and_epilogue_insns() where prologue and
> epilogue generation belongs? But that function is for all targets. Any
> pointers greatly appreciated!
I think you probably want to look at ix86_expand_prologue.
Hope that helps, but I'm no expert, so take it with some salt.
Trev
* Re: Help with implementing Wine optimization experiment
From: Jeff Law @ 2016-08-15 0:16 UTC
To: Trevor Saunders, Daniel Santos; +Cc: gcc
On 08/14/2016 01:57 AM, Trevor Saunders wrote:
> On Sun, Aug 14, 2016 at 01:23:16AM -0500, Daniel Santos wrote:
>> I'm experimenting with ways to optimize wine (x86 target only) and I believe
>> I can shrink wine's total text size by around 7% by outlining the lengthy
>> pro- and epilogues required for ms_abi functions making sysv_abi calls.
>> Theoretically, fewer instruction cache misses will offset the extra 4
>> instructions per function and result in a net performance gain. However, I'm
>> new to the gcc project and a novice x86 assembly programmer as well (have
>> been wanting to work on gcc for a while now!) In short, I want to:
>>
>> 1. Replace the prologue that pushes rdi, rsi and xmm6-15 with a single call to
>> a global "ms_abi_push_regs" routine
>> 2. Replace the epilogue that pops these regs with a jmp to a global
>> "ms_abi_pop_regs" routine
>> 3. Add the two routines somewhere so that they are linked into the output.
>
> I think you want to put those into libgcc then.
Right. That's what I've done with out-of-line prologues/epilogues in
the past.
Jeff
* Re: Help with implementing Wine optimization experiment
From: Richard Biener @ 2016-08-15 10:56 UTC
To: Jeff Law; +Cc: Trevor Saunders, Daniel Santos, gcc
On Mon, Aug 15, 2016 at 2:16 AM, Jeff Law <law@redhat.com> wrote:
> On 08/14/2016 01:57 AM, Trevor Saunders wrote:
>>
>> On Sun, Aug 14, 2016 at 01:23:16AM -0500, Daniel Santos wrote:
>>>
>>> I'm experimenting with ways to optimize wine (x86 target only) and I
>>> believe
>>> I can shrink wine's total text size by around 7% by outlining the lengthy
>>> pro- and epilogues required for ms_abi functions making sysv_abi calls.
>>> Theoretically, fewer instruction cache misses will offset the extra 4
>>> instructions per function and result in a net performance gain. However,
>>> I'm
>>> new to the gcc project and a novice x86 assembly programmer as well (have
>>> been wanting to work on gcc for a while now!) In short, I want to:
>>>
>>> 1. Replace the prologue that pushes rdi, rsi and xmm6-15 with a single call
>>> to
>>> a global "ms_abi_push_regs" routine
>>> 2. Replace the epilogue that pops these regs with a jmp to a global
>>> "ms_abi_pop_regs" routine
>>> 3. Add the two routines somewhere so that they are linked into the
>>> output.
>>
>>
>> I think you want to put those into libgcc then.
>
> Right. That's what I've done with out-of-line prologues/epilogues in the
> past.
In the static part, of course. Not sure if we always have/link that
on x86_64/i?86.
Richard.
> Jeff
* Re: Help with implementing Wine optimization experiment
From: Daniel Santos @ 2016-08-16 4:21 UTC
To: Richard Biener, Jeff Law; +Cc: Trevor Saunders, gcc
On 08/15/2016 05:56 AM, Richard Biener wrote:
> On Mon, Aug 15, 2016 at 2:16 AM, Jeff Law <law@redhat.com> wrote:
>> On 08/14/2016 01:57 AM, Trevor Saunders wrote:
>>> On Sun, Aug 14, 2016 at 01:23:16AM -0500, Daniel Santos wrote:
>>>> I'm experimenting with ways to optimize wine (x86 target only) and I
>>>> believe
>>>> I can shrink wine's total text size by around 7% by outlining the lengthy
>>>> pro- and epilogues required for ms_abi functions making sysv_abi calls.
>>>> Theoretically, fewer instruction cache misses will offset the extra 4
>>>> instructions per function and result in a net performance gain. However,
>>>> I'm
>>>> new to the gcc project and a novice x86 assembly programmer as well (have
>>>> been wanting to work on gcc for a while now!) In short, I want to:
>>>>
>>>> 1. Replace the prologue that pushes rdi, rsi and xmm6-15 with a single call
>>>> to
>>>> a global "ms_abi_push_regs" routine
>>>> 2. Replace the epilogue that pops these regs with a jmp to a global
>>>> "ms_abi_pop_regs" routine
>>>> 3. Add the two routines somewhere so that they are linked into the
>>>> output.
>>>
>>> I think you want to put those into libgcc then.
>> Right. That's what I've done with out-of-line prologues/epilogues in the
>> past.
> In the static part, of course. Not sure if we always have/link that
> on x86_64/i?86.
>
> Richard.
Thanks all! Well, Wine's libs certainly do not appear to be dynamically
linked with libgcc, and I didn't know that it had such a static portion,
so thank you for this! Getting this in will solve half of the problem.
Also, I should have mentioned that I see this as a stop-gap to actually
performing static analysis and completely disabling floating point in
Wine's libs wherever possible, but that's a much larger project. So
I'm hoping to be able to show some improvements from this mechanism.
Daniel
* Re: Help with implementing Wine optimization experiment
From: Florian Weimer @ 2016-08-15 10:47 UTC
To: Daniel Santos, gcc
On 08/14/2016 08:23 AM, Daniel Santos wrote:
>
> ms_abi_push_regs:
> pop %rax
> push %rdi
> push %rsi
> sub $0xa8,%rsp
> movaps %xmm6,(%rsp)
> movaps %xmm7,0x10(%rsp)
> movaps %xmm8,0x20(%rsp)
> movaps %xmm9,0x30(%rsp)
> movaps %xmm10,0x40(%rsp)
> movaps %xmm11,0x50(%rsp)
> movaps %xmm12,0x60(%rsp)
> movaps %xmm13,0x70(%rsp)
> movaps %xmm14,0x80(%rsp)
> movaps %xmm15,0x90(%rsp)
> jmp *%rax
I think this will be quite slow because it breaks the return stack
optimization in the CPU. I think you should push the return address and
use RET.
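The effect on return prediction can be illustrated with a toy model in C. This is a sketch under strong assumptions, not a description of any particular CPU: the predictor is modelled as a plain stack that CALL pushes to and RET pops from, the addresses are made-up constants, and all names are hypothetical. The cost of the indirect jmp itself is not modelled.

```c
#include <stddef.h>

#define RSB_DEPTH 16

/* Toy return-address predictor: a small stack plus a miss counter.  */
struct rsb {
  unsigned long slot[RSB_DEPTH];
  size_t top;
  unsigned mispredicts;
};

/* CALL: the predictor records the return address.  */
static void sim_call (struct rsb *p, unsigned long ret_addr)
{
  if (p->top < RSB_DEPTH)
    p->slot[p->top++] = ret_addr;
}

/* RET: the predictor pops its guess and compares it with the real
   return target.  */
static void sim_ret (struct rsb *p, unsigned long actual)
{
  unsigned long guess = p->top ? p->slot[--p->top] : 0;
  if (guess != actual)
    p->mispredicts++;
}

/* Made-up return addresses for the three call sites involved.  */
enum { CALLER = 0x1000, AFTER_SAVE = 0x2000, AFTER_SYSV = 0x3000 };

/* Save routine leaves via "pop %rax ... jmp": no RET is executed, so
   the entry pushed by its CALL stays in the predictor.  */
static unsigned run_pop_jmp_variant (void)
{
  struct rsb p = { {0}, 0, 0 };
  sim_call (&p, CALLER);      /* somebody calls the ms_abi function */
  sim_call (&p, AFTER_SAVE);  /* callq save routine */
  /* pop + indirect jmp back: predictor stack is left untouched */
  sim_call (&p, AFTER_SYSV);  /* callq my_sysv_func */
  sim_ret (&p, AFTER_SYSV);   /* my_sysv_func returns */
  sim_ret (&p, CALLER);       /* retq at the end of the restore routine */
  return p.mispredicts;
}

/* Save routine pushes the return address back and leaves via RET.  */
static unsigned run_push_ret_variant (void)
{
  struct rsb p = { {0}, 0, 0 };
  sim_call (&p, CALLER);
  sim_call (&p, AFTER_SAVE);
  sim_ret (&p, AFTER_SAVE);   /* push %rax; ret */
  sim_call (&p, AFTER_SYSV);
  sim_ret (&p, AFTER_SYSV);
  sim_ret (&p, CALLER);
  return p.mispredicts;
}
```

In this model the pop/jmp variant leaves a stale entry behind, so the final return to the caller is predicted wrongly once, while the push/ret variant keeps calls and returns paired.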
Florian
* Re: Help with implementing Wine optimization experiment
From: Daniel Santos @ 2016-08-17 22:56 UTC
To: Florian Weimer, gcc
On 08/15/2016 05:46 AM, Florian Weimer wrote:
> On 08/14/2016 08:23 AM, Daniel Santos wrote:
>>
>> ms_abi_push_regs:
>> pop %rax
>> push %rdi
>> push %rsi
>> sub $0xa8,%rsp
>> movaps %xmm6,(%rsp)
>> movaps %xmm7,0x10(%rsp)
>> movaps %xmm8,0x20(%rsp)
>> movaps %xmm9,0x30(%rsp)
>> movaps %xmm10,0x40(%rsp)
>> movaps %xmm11,0x50(%rsp)
>> movaps %xmm12,0x60(%rsp)
>> movaps %xmm13,0x70(%rsp)
>> movaps %xmm14,0x80(%rsp)
>> movaps %xmm15,0x90(%rsp)
>> jmp *%rax
>
> I think this will be quite slow because it breaks the return stack
> optimization in the CPU. I think you should push the return address
> and use RET.
>
> Florian
>
Looks like I forgot to reply-all on my last reply, but thanks again for
the advice here. Would there be any performance hit to reshuffling the
push/pops to save the 8 byte alignment padding? My assumption is that
the stack will always be 16-byte aligned with the 8-byte return address
of the last call on it, so offset by 8 bytes. (Also, not sure that I
need the .type directive, was copying other code in libgcc :)
.text
.global __msabi_save
.hidden __msabi_save
#ifdef __ELF__
.type __msabi_save,@function
#endif
/* TODO: implement vmovaps when supported?*/
__msabi_save:
#ifdef __x86_64__
pop %rax
push %rdi
sub $0xa0,%rsp
movaps %xmm6,(%rsp)
movaps %xmm7,0x10(%rsp)
movaps %xmm8,0x20(%rsp)
movaps %xmm9,0x30(%rsp)
movaps %xmm10,0x40(%rsp)
movaps %xmm11,0x50(%rsp)
movaps %xmm12,0x60(%rsp)
movaps %xmm13,0x70(%rsp)
movaps %xmm14,0x80(%rsp)
movaps %xmm15,0x90(%rsp)
push %rsi
push %rax
#endif /* __x86_64__ */
ret
.text
.global __msabi_restore
.hidden __msabi_restore
#ifdef __ELF__
.type __msabi_restore,@function
#endif
__msabi_restore:
#ifdef __x86_64__
pop %rsi
movaps (%rsp),%xmm6
movaps 0x10(%rsp),%xmm7
movaps 0x20(%rsp),%xmm8
movaps 0x30(%rsp),%xmm9
movaps 0x40(%rsp),%xmm10
movaps 0x50(%rsp),%xmm11
movaps 0x60(%rsp),%xmm12
movaps 0x70(%rsp),%xmm13
movaps 0x80(%rsp),%xmm14
movaps 0x90(%rsp),%xmm15
add $0xa0,%rsp
pop %rdi
#endif /* __x86_64__ */
ret
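The alignment assumption can be checked with a little arithmetic, sketched here in C. It assumes that rsp is 8 modulo 16 on entry to the ms_abi function (16-byte aligned before the call, plus the return address) and that the original save routine returns with an indirect jmp through %rax; the helper and struct names are hypothetical.

```c
/* rsp modulo 16 after moving the stack pointer by delta bytes
   (negative delta = push/sub/call, positive = pop/add/ret).  */
static int move_sp (int rsp_mod16, int delta)
{
  return ((rsp_mod16 + delta) % 16 + 16) % 16;
}

struct save_result {
  int at_movaps;    /* rsp mod 16 while the xmm registers are stored */
  int at_return;    /* rsp mod 16 back in the calling function */
  int frame_bytes;  /* bytes the routine leaves on the stack */
};

/* Original layout: push rdi, push rsi, sub $0xa8 (0xa0 + 8 padding).  */
static struct save_result original_layout (void)
{
  struct save_result r;
  int sp = 8;                /* function entry */
  sp = move_sp (sp, -8);     /* callq save routine */
  sp = move_sp (sp, +8);     /* pop %rax */
  sp = move_sp (sp, -8);     /* push %rdi */
  sp = move_sp (sp, -8);     /* push %rsi */
  sp = move_sp (sp, -0xa8);  /* sub $0xa8,%rsp */
  r.at_movaps = sp;
  r.at_return = sp;          /* the jmp back leaves rsp unchanged */
  r.frame_bytes = 8 + 8 + 0xa8;
  return r;
}

/* Reshuffled layout: push rdi, sub $0xa0, push rsi, push rax, ret.  */
static struct save_result reshuffled_layout (void)
{
  struct save_result r;
  int sp = 8;                /* function entry */
  sp = move_sp (sp, -8);     /* callq __msabi_save */
  sp = move_sp (sp, +8);     /* pop %rax */
  sp = move_sp (sp, -8);     /* push %rdi */
  sp = move_sp (sp, -0xa0);  /* sub $0xa0,%rsp */
  r.at_movaps = sp;
  sp = move_sp (sp, -8);     /* push %rsi */
  sp = move_sp (sp, -8);     /* push %rax */
  sp = move_sp (sp, +8);     /* ret */
  r.at_return = sp;
  r.frame_bytes = 8 + 0xa0 + 8;
  return r;
}
```

Under these assumptions both layouts store the xmm registers at 16-byte aligned addresses, and the reshuffled one uses 176 instead of 184 bytes. It does return with rsp at 8 modulo 16, though, so the function body would still need another 8 bytes of adjustment (or an odd number of 8-byte slots of its own) before calling a sysv_abi function.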
Thanks!
Daniel
* Re: Help with implementing Wine optimization experiment
From: André Hentschel @ 2016-08-15 11:36 UTC
To: gcc
Am 14.08.2016 um 08:23 schrieb Daniel Santos:
> I'm experimenting with ways to optimize wine (x86 target only) and I believe I can shrink wine's total text size by around 7% by outlining the lengthy pro- and epilogues required for ms_abi functions making sysv_abi calls. Theoretically, fewer instruction cache misses will offset the extra 4 instructions per function and result in a net performance gain. However, I'm new to the gcc project and a novice x86 assembly programmer as well (have been wanting to work on gcc for a while now!) In short, I want to:
>
> 1. Replace the prologue that pushes rdi, rsi and xmm6-15 with a single call to a global "ms_abi_push_regs" routine
> 2. Replace the epilogue that pops these regs with a jmp to a global "ms_abi_pop_regs" routine
> 3. Add the two routines somewhere so that they are linked into the output.
>
> I have this working in a small-scale experiment (writing the ms_abi function in assembly), but I'm not certain how I would add these routines. Should I make them built-ins?
>
> I have found the code that adds the clobber RTL instructions in ix86_expand_call() (gcc/config/i386/i386.c:25832), and I see that thread_prologue_and_epilogue_insns() (gcc/function.c) is where these clobbers are expanded into the prologue and epilogue, but I'm not sure what the cleanest way to convert this is. My thought was to replace the clobber_reg() calls with one that would add an insn_call, or would it be better to do this in thread_prologue_and_epilogue_insns() where prologue and epilogue generation belongs? But that function is for all targets. Any pointers greatly appreciated!
>
Hi,
Thanks for working on this, but I haven't seen any discussion on wine-devel recently.
I'm also not an expert in that area, but doesn't this risk breaking copy protection and hotpatching?
Just wanted to remind you about those two things, so the implementation will be useful.
* Re: Help with implementing Wine optimization experiment
From: Daniel Santos @ 2016-08-16 3:13 UTC
To: André Hentschel, gcc
On 08/15/2016 06:35 AM, André Hentschel wrote:
> Hi,
> Thanks for working on this, but I haven't seen any discussion on wine-devel recently.
> I'm also not an expert in that area, but doesn't this risk breaking copy protection and hotpatching?
> Just wanted to remind you about those two things, so the implementation will be useful.
Thanks for your response! I've run into the hot-patching code a lot
while working on this. Not breaking these should be easy since these
functions are explicitly marked with the ms_hook_prologue attribute, so
I can just skip altering these functions for now. This attribute is
assigned in Wine via expansion of the macro DECLSPEC_HOTPATCH and I'm
currently only counting 171 such functions.
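For readers unfamiliar with the attribute, here is a minimal sketch of what such a marking looks like in C. The macro and function names are hypothetical and are not Wine's actual definitions; the macros expand to nothing on compilers or targets where the attributes are not assumed to be available.

```c
/* Hypothetical macros for this sketch only.  */
#if defined(__GNUC__) && defined(__x86_64__)
# define SKETCH_WINAPI   __attribute__((ms_abi))
# define SKETCH_HOTPATCH __attribute__((ms_hook_prologue))
#else
# define SKETCH_WINAPI
# define SKETCH_HOTPATCH
#endif

/* Marked as hot-patchable: its prologue must keep the expected shape,
   so an outlining optimization would leave this function alone.  */
SKETCH_HOTPATCH SKETCH_WINAPI int patchable_entry (int x)
{
  return x + 1;
}

/* Ordinary ms_abi function: a candidate for the out-of-line
   save/restore routines.  */
SKETCH_WINAPI int ordinary_entry (int x)
{
  return x + 2;
}
```

Both functions behave the same from the caller's point of view; the attribute only constrains the code generated at the start of the marked function.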
I'm not sure about breaking copy protections however and I don't really
know what the issues are pertaining to this. I'm mostly doing this as an
experiment for now, and admittedly as an excuse to finally start hacking
away at gcc, which I've been wanting to do for several years now. Until
I know more about what the various copy protection mechanisms look for,
I'm going to ignore it and address it later. Presuming that this
experiment turns out to be useful, it might be implemented as a function
attribute so that functions that need to appear a certain way to copy
protection software can omit the optimization, similar to ms_hook_prologue.
Daniel
* Re: Help with implementing Wine optimization experiment
From: Daniel Santos @ 2016-08-17 23:04 UTC
To: gcc
I'm stuck on generating a jmp to the epilogue as I can't find any
examples of this. This is the summarized version of what I'm doing:
rtx msabi_restore_fn, jump_insn;
msabi_restore_fn = gen_rtx_SYMBOL_REF (Pmode, "__msabi_restore");
SYMBOL_REF_FLAGS (msabi_restore_fn) |= SYMBOL_FLAG_LOCAL;
jump_insn = gen_rtx_SET (VOIDmode, pc_rtx, gen_rtx_MEM (QImode,
msabi_restore_fn));
emit_insn (jump_insn);
Unfortunately, it dies with:
../a.c: In function ‘my_ms_sysv’:
../a.c:7:1: error: unrecognizable insn:
}
^
(insn 15 14 8 2 (set/f (pc)
(mem:QI (symbol_ref:DI ("__msabi_restore") [flags 0x2]) [0 S1 A8])) ../a.c:7 -1
(nil))
../a.c:7:1: internal compiler error: in extract_insn, at recog.c:2343
0xc195e1 _fatal_insn(char const*, rtx_def const*, char const*, int, char const*)
../../gcc/rtl-error.c:110
0xc19622 _fatal_insn_not_found(rtx_def const*, char const*, int, char const*)
../../gcc/rtl-error.c:118
0xbcd683 extract_insn(rtx_insn*)
../../gcc/recog.c:2343
0xbcd37c extract_constrain_insn(rtx_insn*)
../../gcc/recog.c:2244
0xbdc310 copyprop_hardreg_forward_1
../../gcc/regcprop.c:793
0xbdd97d execute
../../gcc/regcprop.c:1289
How should I generate this jmp? All of the various helper functions for
generating a jump appear to be tailored for using a label and I'm using
a symbol.
I haven't attached all of the various notes to the insn yet. My call
to the prologue routine is working great though!
Thanks,
Daniel